The Illustrated Network

The Morgan Kaufmann Series in Networking
Series Editor, David Clark, M.I.T.
The Illustrated Network
Walter Goralski
P2P Networking and Applications
John Buford, Heather Yu, and Eng Lua
Broadband Cable Access Networks: The HFC
Plant
David Large and James Farmer
Technical, Commercial, and Regulatory
Challenges of QoS: An Internet Service Model
Perspective
XiPeng Xiao
MPLS: Next Steps
Bruce S. Davie and Adrian Farrel
Wireless Networking
Anurag Kumar, D. Manjunath, and Joy Kuri
Bluetooth Application Programming with the
Java APIs, Essentials Edition
Timothy J. Thompson, Paul J. Kline, and C Bala
Kumar
Internet Multimedia Communications Using
SIP
Rogelio Martinez Perea
Information Assurance: Dependability and
Security in Networked Systems
Yi Qian, James Joshi, David Tipper, and Prashant
Krishnamurthy
Network Simulation Experiments Manual,
Second Edition
Emad Aboelela
Network Analysis, Architecture, and Design,
Third Edition
James D. McCabe
Wireless Communications & Networking: An
Introduction
Vijay K. Garg
Ethernet Networking for the Small Office and
Professional Home Office
Jan L. Harrington
IPv6 Advanced Protocols Implementation
Qing Li, Tatuya Jinmei, and Keiichi Shima
Computer Networks: A Systems Approach,
Fourth Edition
Larry L. Peterson and Bruce S. Davie
Network Routing: Algorithms, Protocols, and
Architectures
Deepankar Medhi and Karthikeyan Ramasamy
Deploying IP and MPLS QoS for Multiservice
Networks: Theory and Practice
John Evans and Clarence Filsfils
Traffic Engineering and QoS Optimization of
Integrated Voice & Data Networks
Gerald R. Ash
IPv6 Core Protocols Implementation
Qing Li, Tatuya Jinmei, and Keiichi Shima
Smart Phone and Next-Generation Mobile
Computing
Pei Zheng and Lionel Ni
GMPLS: Architecture and Applications
Adrian Farrel and Igor Bryskin
Network Security: A Practical Approach
Jan L. Harrington
Content Networking: Architecture, Protocols,
and Practice
Markus Hofmann and Leland R. Beaumont
Network Algorithmics: An Interdisciplinary
Approach to Designing Fast Networked Devices
George Varghese
Network Recovery: Protection and Restoration
of Optical, SONET-SDH, IP, and MPLS
Jean-Philippe Vasseur, Mario Pickavet, and Piet
Demeester
Routing, Flow, and Capacity Design in
Communication and Computer Networks
Michał Pióro and Deepankar Medhi
Wireless Sensor Networks: An Information
Processing Approach
Feng Zhao and Leonidas Guibas
Communication Networking: An Analytical
Approach
Anurag Kumar, D. Manjunath, and Joy Kuri
The Internet and Its Protocols: A Comparative
Approach
Adrian Farrel
Multicast Communication: Protocols,
Programming, and Applications
Ralph Wittmann and Martina Zitterbart
Modern Cable Television Technology: Video,
Voice, and Data Communications, 2e
Walter Ciciora, James Farmer, David Large, and
Michael Adams
MPLS: Technology and Applications
Bruce Davie and Yakov Rekhter
Bluetooth Application Programming with the
Java APIs
C Bala Kumar, Paul J. Kline, and Timothy
J. Thompson
Policy-Based Network Management: Solutions
for the Next Generation
John Strassner
MPLS Network Management: MIBs, Tools, and
Techniques
Thomas D. Nadeau
Developing IP-Based Services: Solutions for
Service Providers and Vendors
Monique Morrow and Kateel Vijayananda
Telecommunications Law in the Internet Age
Sharon K. Black
Optical Networks: A Practical Perspective,
Second Edition
Rajiv Ramaswami and Kumar N. Sivarajan
Internet QoS: Architectures and Mechanisms
Zheng Wang
TCP/IP Sockets in Java: Practical Guide for
Programmers
Michael J. Donahoo and Kenneth L. Calvert
TCP/IP Sockets in C: Practical Guide for
Programmers
Kenneth L. Calvert and Michael J. Donahoo
High-Performance Communication Networks,
Second Edition
Jean Walrand and Pravin Varaiya
Internetworking Multimedia
Jon Crowcroft, Mark Handley, and Ian Wakeman
Understanding Networked Applications: A First
Course
David G. Messerschmitt
Integrated Management of Networked Systems:
Concepts, Architectures, and Their Operational
Application
Heinz-Gerd Hegering, Sebastian Abeck, and
Bernhard Neumair
Virtual Private Networks: Making the Right
Connection
Dennis Fowler
Networked Applications: A Guide to the New
Computing Infrastructure
David G. Messerschmitt
Wide Area Network Design: Concepts and Tools
for Optimization
Robert S. Cahn
For further information on these books and for a
list of forthcoming titles, please visit our Web site
at http://www.mkp.com.
The Illustrated Network
How TCP/IP Works in a
Modern Network
Walter Goralski
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Morgan Kaufmann Publishers is an imprint of Elsevier.
30 Corporate Drive, Suite 400
Burlington, MA 01803
This book is printed on acid-free paper.
Copyright © 2009 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as
trademarks or registered trademarks. In all instances in which Morgan Kaufmann
Publishers is aware of a claim, the product names appear in initial capital or all capital
letters. Readers, however, should contact the appropriate companies for more complete
information regarding trademarks and registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means, electronic, mechanical, photocopying, scanning, or otherwise,
without prior written permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights
Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333,
e-mail: [email protected]. You may also complete your request on-line via the
Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then
“Copyright and Permission” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data
Goralski, Walter.
The illustrated network : how TCP/IP works in a modern network / Walter Goralski.
p. cm.—(The Morgan Kaufmann series in networking)
Includes bibliographical references and index.
ISBN 978-0-12-374541-5 (alk. paper)
1. TCP/IP (Computer network protocol) 2. Computer networks. I. Title.
TK5105.585.G664 2008
004.6’2--dc22
2008046728
For information on all Morgan Kaufmann publications,
visit our Website at www.mkp.com or www.books.elsevier.com
Printed in the United States
08 09 10 11 12 10 9 8 7 6 5 4 3 2 1
Working together to grow
libraries in developing countries
www.elsevier.com | www.bookaid.org | www.sabre.org
Contents
Foreword ........................................................................................ xxi
Preface ............................................................................................ xxiii
About the Author ............................................................................ xxx
Part I Networking Basics
CHAPTER 1
Protocols and Layers ......................................................
The Illustrated Network .......................................................
Remote Access to Network Devices ................................
File Transfer to a Router ...................................................
CLI and GUI......................................................................
Ethereal and Packet Capture ............................................
First Explorations in Networking .....................................
Protocols ..............................................................................
Standards and Organizations ............................................
Request for Comment and the Internet Engineering
Task Force ......................................................................
Internet Administration .......................................................
Layers ...................................................................................
Simple Networking ..........................................................
Protocol Layers.................................................................
The TCP/IP Protocol Suite ...................................................
The TCP/IP Layers.............................................................
Protocols and Interfaces...................................................
Encapsulation ...................................................................
The Layers of TCP/IP ............................................................
The Physical Layer ............................................................
The Data Link Layer ..........................................................
The Network Layer ...........................................................
The Transport Layer ..........................................................
The Application Layer .......................................................
Session Support ................................................................
Internal Representation Conversion ................................
Applications in TCP/IP......................................................
The TCP/IP Protocol Suite ...................................................
Questions for Readers .........................................................
CHAPTER 2
TCP/IP Protocols and Devices ......................................
Protocol Stacks on the Illustrated Network .........................
Layers, Protocols, Ports, and Sockets ....................................
The TCP/IP Protocol Stack ...................................................
The Client–Server Model .....................................................
TCP/IP Layers and Client–Server .........................................
The IP Layer .........................................................................
The Transport Layer .............................................................
Transmission Control Protocol .........................................
User Datagram Protocol ...................................................
The Application Layer ..........................................................
Bridges, Routers, and Switches.............................................
Segmenting LANs .............................................................
Bridges .............................................................................
Routers .............................................................................
LAN Switches ...................................................................
Virtual LANs......................................................................
VLAN Frame Tagging.........................................................
Questions for Readers ..........................................................
CHAPTER 3
Network Link Technologies ........................................... 71
Illustrated Network Connections.........................................
Displaying Ethernet Traffic ...............................................
Displaying SONET Links...................................................
Displaying DSL Links ........................................................
Displaying Wireless Links .................................................
Frames and the Link Layer................................................
The Data Link Layer .............................................................
The Evolution of Ethernet....................................................
Ethernet II and IEEE 802.3 Frames ...................................
MAC Addresses .................................................................
The Evolution of DSL ...........................................................
PPP and DSL .....................................................................
PPP Framing for Packets ...................................................
DSL Encapsulation............................................................
Forms of DSL ....................................................................
The Evolution of SONET ......................................................
A Note about Network Errors ..........................................
Packet over SONET/SDH ..................................................
Wireless LANs and IEEE 802.11............................
Wi-Fi..................................................................................
IEEE 802.11 MAC Layer Protocol ..................................... 100
The IEEE 802.11 Frame..................................................... 102
Questions for Readers .......................................................... 105
Part II Core Protocols
CHAPTER 4
IPv4 and IPv6 Addressing .............................................. 109
IP Addressing........................................................................
The Network/Host Boundary ..............................................
The IPv4 Address ..................................................................
Private IPv4 Addresses......................................................
Understanding IPv4 Addresses .........................................
The IPv6 Address..................................................................
Features of IPv6 Addressing .............................................
IPv6 Address Types and Notation .....................................
IPv6 Address Prefixes .......................................................
Subnetting and Supernetting ...............................................
Subnetting in IPv4 ............................................................
Subnetting Basics .............................................................
CIDR and VLSM ................................................................
IPv6 Addressing Details ........................................................
IP Address Assignment......................................................
Questions for Readers ..........................................................
CHAPTER 5
Address Resolution Protocol.........................................
ARP and LANs ......................................................................
ARP Packets .........................................................................
Example ARP Operation.......................................................
ARP Variations ......................................................................
Proxy ARP.........................................................................
Reverse ARP .....................................................................
ARPs on WANs ..................................................................
ARP and IPv6 .......................................................................
Neighbor Discovery Protocol ..........................................
ND Address Resolution.....................................................
Questions for Readers ..........................................................
CHAPTER 6
IPv4 and IPv6 Headers .................................................... 165
Packet Headers and Addresses .............................................
The IPv4 Packet Header .......................................................
Fragmentation and IPv4 .......................................................
Fragmentation and MTU ..................................................
Fragmentation and Reassembly ........................................
Path MTU Determination .................................................
A Fragmentation Example ....................................................
Limitations of IPv4 ...........................................................
The IPv6 Header Structure ...............................................
IPv4 and IPv6 Headers Compared .......................................
IPv6 Header Changes .......................................................
IPv6 and Fragmentation .......................................................
Questions for Readers ..........................................................
CHAPTER 7
Internet Control Message Protocol ............................... 189
ICMP and Ping .....................................................................
The ICMP Message Format...................................................
ICMP Message Fields ........................................................
ICMP Types and Codes .....................................................
Sending ICMP Messages .......................................................
When ICMP Must Be Sent.................................................
When ICMP Must Not Be Sent ..........................................
Ping ......................................................................................
Traceroute ............................................................................
Path MTU .............................................................................
ICMPv6.................................................................................
Basic ICMPv6 Messages ....................................................
Neighbor Discovery and Autoconfiguration.....................
Routers and Neighbor Discovery .....................................
Interface Addresses ..........................................................
Neighbor Solicitation and Advertisement ........................
Questions for Readers ..........................................................
CHAPTER 8
Routing ................................................................................ 217
Routers and Routing Tables..................................................
Hosts and Routing Tables .....................................................
Direct and Indirect Delivery ................................................
Routing .............................................................................
Direct Delivery without Routing......................................
Indirect Delivery and the Router .....................................
Questions for Readers ..........................................................
CHAPTER 9
Forwarding IP Packets..................................................... 237
Router Architectures ............................................................ 242
Basic Router Architectures ............................................... 243
Another Router Architecture ............................................ 246
Router Access .......................................................................
The Console Port ..............................................................
The Auxiliary Port .............................................................
The Network ....................................................................
Forwarding Table Lookups ...................................................
Dual Stacks, Tunneling, and IPv6 ..........................................
Dual Protocol Stacks ........................................................
Tunneling ..........................................................................
Tunneling Mechanisms ........................................................
Transition Considerations ....................................................
Questions for Readers ..........................................................
CHAPTER 10 User Datagram Protocol .................................................. 259
UDP Ports and Sockets .........................................................
What UDP Is For ..................................................................
The UDP Header ..................................................................
IPv4 and IPv6 Notes .............................................................
Port Numbers.......................................................................
Well-Known Ports .............................................................
The Socket ........................................................................
UDP Operation ....................................................................
UDP Overflows ....................................................................
Questions for Readers .........................................................
CHAPTER 11 Transmission Control Protocol....................................... 279
TCP and Connections ..........................................................
The TCP Header ...................................................................
TCP Mechanisms ..................................................................
Connections and the Three-Way Handshake........................
Connection Establishment ...............................................
Data Transfer.....................................................................
Closing the Connection ...................................................
Flow Control ........................................................................
TCP Windows ...................................................................
Flow Control and Congestion Control .............................
Performance Algorithms ......................................................
TCP and FTP ........................................................................
Questions for Readers ..........................................................
CHAPTER 12 Multiplexing and Sockets ............................................... 301
Layers and Applications .........................................................301
The Socket Interface ..............................................................304
Socket Libraries ..................................................................305
TCP Stream Service Calls ....................................................306
The Socket Interface: Good or Bad? .......................................307
The “Threat” of Raw Sockets ...............................................308
Socket Libraries ..................................................................309
The Windows Socket Interface ..............................................309
TCP/IP and Windows ..........................................................310
Sockets for Windows ..........................................................310
Sockets on Linux ....................................................................311
Questions for Readers ............................................................317
Part III Routing and Routing Protocols
CHAPTER 13 Routing and Peering ......................................................... 321
Network Layer Routing and Switching ................................
Connection-Oriented and Connectionless Networks ..........
Quality of Service .............................................................
Host Routing Tables .............................................................
Routing Tables and FreeBSD .............................................
Routing Tables and RedHat Linux ....................................
Routing and Windows XP.................................................
The Internet and the Autonomous System...........................
The Internet Today ...............................................................
The Role of Routing Policies ................................................
Peering .................................................................................
Picking a Peer.......................................................................
Questions for Readers ..........................................................
CHAPTER 14 IGPs: RIP, OSPF, and IS–IS .............................................. 345
Interior Routing Protocols ...................................................
The Three Major IGPs ..........................................................
Routing Information Protocol ..............................................
Distance-Vector Routing...................................................
Broken Links ....................................................................
Distance-Vector Consequences ........................................
RIPv1 ................................................................................
RIPv2 ................................................................................
RIPng for IPv6 ..................................................................
A Note on IGRP and EIGRP..................................................
Open Shortest Path First ..................................................
Link States and Shortest Paths ..........................................
What OSPF Can Do...........................................................
OSPF Router Types and Areas ...........................................
OSPF Designated Router and Backup
Designated Router .........................................................
OSPF Packets ....................................................................
OSPFv3 for IPv6 ...............................................................
Intermediate System–Intermediate System..........................
The IS–IS Attraction ..........................................................
IS–IS and OSPF .................................................................
Similarities of OSPF and IS–IS ..........................................
Differences between OSPF and IS–IS ...............................
IS–IS for IPv6 ....................................................................
Questions for Readers ..........................................................
CHAPTER 15 Border Gateway Protocol ................................................ 379
BGP as a Routing Protocol ...................................................
Configuring BGP ..............................................................
The Power of Routing Policy ............................................
BGP and the Internet ...........................................................
EGP and the Early Internet ...............................................
The Birth of BGP ..............................................................
BGP as a Path-Vector Protocol .............................................
IBGP and EBGP ....................................................................
IGP Next Hops and BGP Next Hops ................................
BGP and the IGP ..............................................................
Other Types of BGP..............................................................
BGP Attributes ......................................................................
BGP and Routing Policy .......................................................
BGP Scaling ......................................................................
BGP Message Types ..............................................................
BGP Message Formats ..........................................................
The Open Message ...........................................................
The Update Message .........................................................
The Notification Message .................................................
Questions for Readers ..........................................................
CHAPTER 16 Multicast ............................................................................. 403
A First Look at IPv4 Multicast .............................................. 406
Multicast Terminology .......................................................... 408
Dense and Sparse Multicast .................................................
Dense-Mode Multicast ......................................................
Sparse-Mode Multicast......................................................
Multicast Notation................................................................
Multicast Concepts ..............................................................
Reverse-Path Forwarding..................................................
The RPF Table ...................................................................
Populating the RPF Table..................................................
Shortest-Path Tree .............................................................
Rendezvous Point and Rendezvous-Point Shared Trees....
Protocols for Multicast .........................................................
Multicast Hosts and Routers.............................................
Multicast Group Membership Protocols ..........................
Multicast Routing Protocols .............................................
Any-Source Multicast and SSM ..........................................
Multicast Source Discovery Protocol ...............................
Frames and Multicast........................................................
IPv4 Multicast Addressing ................................................
IPv6 Multicast Addressing ................................................
PIM-SM .............................................................................
The Resource Reservation Protocol and PGM..................
Multicast Routing Protocols .............................................
IPv6 Multicast ...................................................................
Questions for Readers ..........................................................
CHAPTER 17 MPLS and IP Switching ................................................... 431
Converging What? ................................................................
Fast Packet Switching .......................................................
Frame Relay ......................................................................
Asynchronous Transfer Mode ..........................................
Why Converge on TCP/IP?................................................
MPLS ....................................................................................
MPLS Terminology ............................................................
Signaling and MPLS ..........................................................
Label Stacking ..................................................................
MPLS and VPNs.................................................................
MPLS Tables ......................................................................
Configuring MPLS Using Static LSPs ....................................
The Ingress Router ...........................................................
The Transit Routers ...........................................................
The Egress Router .............................................................
Traceroute and LSPs ......................................................... 452
Questions for Readers .......................................................... 455
Part IV Application Level
CHAPTER 18
Dynamic Host Configuration Protocol ......................... 459
DHCP and Addressing ..........................................................
DHCP Server Configuration .............................................
Router Relay Agent Configuration ....................................
Getting Addresses on LAN2 ..............................................
Using DHCP on a Network ..............................................
BOOTP .................................................................................
BOOTP Implementation...................................................
BOOTP Messages..............................................................
BOOTP Relay Agents ........................................................
BOOTP “Vendor-Specific Area” Options ...........................
Trivial File Transfer Protocol ...............................................
TFTP Messages..................................................................
TFTP Download ................................................................
DHCP ...............................................................................
DHCP Operation ..............................................................
DHCP Message Type Options ...........................................
DHCP and Routers ...............................................................
DHCPv6............................................................................
DHCPv6 and Router Advertisements................................
DHCPv6 Operation ..........................................................
Questions for Readers ..........................................................
CHAPTER 19
The Domain Name System ............................................. 483
DNS Basics ...........................................................................
The DNS Hierarchy ...........................................................
Root Name Servers ...........................................................
Root Server Operation .....................................................
Root Server Details ...........................................................
DNS in Theory: Name Server, Database, and Resolver ..........
Adding a New Host...........................................................
Recursive and Iterative Queries .......................................
Delegation and Referral....................................................
Glue Records ....................................................................
DNS in Practice: Resource Records and
Message Formats ...............................................................
DNS Message Header .......................................................
DNSSec .............................................................................
DNS Tools: nslookup, dig, and host ...................................
DNS in Action .......................................................................
Questions for Readers ..........................................................
CHAPTER 20
File Transfer Protocol ..................................................... 509
Overview .............................................................................
PORT and PASV ................................................................
FTP and GUIs .......................................................................
FTP Basics ........................................................................
FTP Commands and Reply Codes ....................................
FTP Data Transfers ............................................................
Passive and Port................................................................
File Transfer Types ............................................................
When Things Go Wrong ...................................................
FTP Commands ....................................................................
Variations on a Theme ......................................................
A Note on NFS ..................................................................
Questions for Readers ..........................................................
CHAPTER 21
SMTP and Email ............................................................... 535
Architectures for Email ........................................................
Sending Email Today .........................................................
The Evolution of Email in Brief.........................................
SMTP Authentication ........................................................
Simple Mail Transfer Protocol...........................................
Multipurpose Internet Mail Extensions ...............................
MIME Media Types ...........................................................
MIME Encoding ................................................................
An Example of a MIME Message .......................................
Using POP3 to Access Email.................................................
Headers and Email ...............................................................
Home Office Email ...............................................................
Questions for Readers ..........................................................
CHAPTER 22
Hypertext Transfer Protocol .......................................... 559
HTTP in Action.....................................................................
Uniform Resources ...........................................................
URIs ..................................................................................
URLs .................................................................................
URNs ................................................................................
HTTP ....................................................................................
The Evolution of HTTP .....................................................
HTTP Model .....................................................................
HTTP Messages ................................................................
Trailers and Dynamic Web Pages..........................................
HTTP Requests and Responses ........................................
HTTP Methods .................................................................
HTTP Status Codes ...........................................................
HTTP Headers ..................................................................
General Headers ...............................................................
Request Headers ..............................................................
Response Headers ............................................................
Entity Headers ..................................................................
Cookies ............................................................................
Questions for Readers ..........................................................
CHAPTER 23
Securing Sockets with SSL ........................................... 585
SSL and Web Sites .................................................................
The Lock ...........................................................................
Secure Socket Layer .........................................................
Privacy, Integrity, and Authentication ...................................
Privacy ..............................................................................
Integrity............................................................................
Authentication ..................................................................
Public Key Encryption .........................................................
Pocket Calculator Encryption at the Client......................
Example ...........................................................................
Pocket Calculator Decryption at the Server.....................
Public Keys and Symmetrical Encryption ............................
SSL as a Protocol ..................................................................
SSL Protocol Stack ............................................................
SSL Session Establishment ................................................
SSL Data Transfer ..............................................................
SSL Implementation .........................................................
SSL Issues and Problems...................................................
A Note on TLS 1.1 .............................................................
SSL and Certificates ..........................................................
Questions for Readers ..........................................................
Part V Network Management
CHAPTER 24
Simple Network Management Protocol ...................... 609
SNMP Capabilities ................................................................
The SNMP Model .................................................................
The MIB and SMI ..............................................................
The SMI.............................................................................
The MIB ............................................................................
RMON ..............................................................................
The Private MIB ................................................................
SNMP Operation ..................................................................
SNMPv2 Enhancements ...................................................
SNMPv3 ............................................................................
Questions for Readers ..........................................................
Part VI Security
CHAPTER 25 Secure Shell (Remote Access) ...................................... 633
Using SSH .............................................................................
SSH Basics ........................................................................
SSH Features .....................................................................
SSH Architecture...............................................................
SSH Keys...........................................................................
SSH Protocol Operation ...................................................
Transport Layer Protocol ..................................................
Authentication Protocol ...................................................
The Connection Protocol .................................................
The File Transfer Protocol .................................................
SSH in Action ........................................................................
Questions for Readers ..........................................................
CHAPTER 26 MPLS-Based Virtual Private Networks......................... 659
PPTP for Privacy ..................................................................
Types of VPNs ...................................................................
Security and VPNs ............................................................
VPNs and Protocols ..........................................................
PPTP .................................................................................
L2TP .................................................................................
PPTP and L2TP Compared ...............................................
Types of MPLS-Based VPNs ..................................................
Layer 3 VPNs.....................................................................
Layer 2 VPNs.....................................................................
VPLS: An MPLS-Based L2VPN ...............................................
Router-by-Router VPLS Configuration ..............................
P Router (P9)....................................................................
CE6 Router .......................................................................
Does It Really Work?.............................................................
Questions for Readers ..........................................................
CHAPTER 27 Network Address Translation ......................................... 681
Using NAT ............................................................................
Advantages and Disadvantages of NAT .............................
Four Types of NAT ............................................................
NAT in Action .......................................................................
Questions for Readers ..........................................................
CHAPTER 28 Firewalls.............................................................................. 697
What Firewalls Do ................................................................
A Router Packet Filter .......................................................
Stateful Inspection on a Router........................................
Types of Firewalls ................................................................
Packet Filters ....................................................................
Application Proxy .............................................................
Stateful Inspection ...........................................................
DMZ .................................................................................
Questions for Readers ..........................................................
CHAPTER 29 IP Security .......................................................................... 713
IPSec in Action .....................................................................
CE0 ...................................................................................
CE6 ...................................................................................
Introduction to IPSec ...........................................................
IPSec RFCs........................................................................
IPSec Implementation ......................................................
IPSec Transport and Tunnel Mode ....................................
Security Associations and More............................................
Security Policies ...............................................................
Authentication Header .....................................................
Encapsulating Security Payload ........................................
Internet Key Exchange .....................................................
Questions for Readers ..........................................................
Part VII Media
CHAPTER 30 Voice over Internet Protocol .......................................... 735
VoIP in Action ......................................................................
The Attraction of VoIP .......................................................
What Is “Voice”? ................................................................
The Problem of Delay .......................................................
Packetized Voice ...............................................................
Protocols for VoIP ................................................................
RTP for VoIP Transport .....................................................
Signaling ...........................................................................
H.323, the International Standard ....................................
SIP, the Internet Standard .................................................
MGCP and Megaco/H.248 ................................................
Putting It All Together ..........................................................
Questions for Readers ..........................................................
List of Acronyms
Bibliography
Index
Foreword
Network consolidation has been an industry trend since the turn of the century.
Reducing capital investment by converging data, voice, video, virtual private
networks (VPNs), and other services onto a single shared infrastructure is financially attractive; but the larger benefit is in not having to maintain and operate
multiple, service-specific infrastructures. Fundamental to network consolidation—
supporting a diverse set of services with a single infrastructure—is a common
encapsulating protocol that accommodates different service transport requirements. The Internet Protocol (IP) is that protocol.
Everything over IP
Things move fast in the networking industry; technologies can go from cutting
edge to obsolete in a decade or less (think ATM, frame relay, token ring, and FDDI
among others). It is therefore amazing that TCP/IP is 35 years old and evolved from
ideas originating in the early 1960s.
Yet while the protocol invented by Vint Cerf and Bob Kahn in 1973 has
undergone—and continues to undergo—hundreds of enhancements and one version upgrade, its core functions are essentially the same as they were in the
mid-1980s. TCP/IP’s antiquity, in an industry that unceremoniously discards technologies when something better comes along, is a testament to the protocol’s elegance
and flexibility.
And there is no sign that IP is coming to the end of its useful life. To the contrary,
so many new IP-capable applications, devices, and services are being added to networks every day that a newer version, IPv6, has become necessary to provide sufficient IP addresses into the foreseeable future. As this foreword is written, IPv6 is
in the very early stages of deployment; readers will still be learning from this book
when IPv6 is the only version most people know.
The story of how TCP/IP came to dominate the networking industry is well
known. Cerf, Kahn, Jon Postel, and many others who contributed to the early
development of TCP/IP did so as a part of their involvement in creating ARPANET,
the predecessor of the modern Internet. The protocol stack became further
embedded in the infant industry when it was integrated into Unix, making it popular with developers.
But its acceptance was far from assured in those early years. Organizations such
as national governments and telcos were uncomfortable with the informal “give
it a try and see what works” process of the Working Groups—primarily made up
of enthusiastic graduate students—that eventually became the Internet Engineering Task Force (IETF). Those cautious organizations wanted a networking protocol
developed under a rigorous standardization process. The International Organization
for Standardization (ISO) was tapped to develop a “mature” networking protocol
suite, which was eventually to become the Open Systems Interconnection (OSI).
The ISO’s modus operandi of establishing dense, thorough standards and
releasing them only in complete, production-ready form took time. Even strong OSI
advocates began using TCP/IP as a temporary but working solution while waiting
for the ISO standards committees to finish their work. By the time OSI was ready,
TCP/IP was so widely deployed, proven, and understood that few network operators could justify undertaking a migration to something different.
OSI survives today mainly in a few artifacts such as IS–IS and the ubiquitous OSI
reference model. TCP/IP, in the meantime, is becoming an almost universal communications transport protocol.
The Illustrated Network
I am a visual person. I admire the capability of my more verbally oriented colleagues
to easily discuss, in detail, a networking scenario, but I need to draw pictures to
keep up.
When the first volume of the late W. Richard Stevens’s TCP/IP Illustrated was
released in 1994, it immediately became one of my favorite books, and continues to
be at the top of my list of recommended books both for the student and for the reference shelf. Stevens’s use of diagrams, configurations, and data captures to teach
the TCP/IP protocol suite makes the book not just a textbook but a comprehensive
set of case studies. It’s about as visual as you can get without sitting in front of a
protocol analyzer and watching packets fly back and forth.
But while the Stevens book has always been excellent for illustrating the behavior of individual TCP/IP components, it does not step back from that narrow focus
to show you how these components interact at a large scale in a real network.
This is where Walt Goralski steps up. The book you are holding takes the same
bottom-up approach (Stevens’ words) to teaching the protocol suite: Each chapter
builds on the previous, and each chapter gives you an intimate look at the protocol in action. But through an unprecedented collaboration with Juniper Networks,
Goralski shows you not just interactions between a few devices in a lab but a
production-scale view of a modern working network. The result is a practical, real-life, highly visual exploration of TCP/IP in its natural state.
The Illustrated Network: How TCP/IP Works in a Modern Network is destined
to become one of the classics on practical IP networking and a cornerstone of the
required reading lists of students and professionals alike.
Jeff Doyle
Westminster, Colorado
Preface
This is not a book on how to use the Internet. It is a book about how the Internet
is made useful for you. The Internet is a public global network that runs on
TCP/IP, which is frequently called the Internet Protocol Suite. A networking protocol
is a set of rules that must be followed to accomplish something, and TCP/IP is
actually a synthesis of the first two protocols that launched the Internet in its
infancy, the Transmission Control Protocol (TCP) and the Internet Protocol (IP),
which, of course, allowed the transmission of information across the then youthful
Internet. TCP/IP is the heart and soul of modern networks, and this book illustrates
how that is accomplished. By using TCP/IP, we can observe how modern networks
operate by following the transmission of modern data across all sorts of Internet
connections.
Audience
This book is intended as a technical introduction into networking in general and
the Internet in particular. I will not pretend that someone who has had no previous
experience with either can easily plow through the entire book. But anyone who
is experienced enough to check their email online, browse a Web site, download a
movie or song, or chat with people around the world should have no trouble tackling the content of this book.
There are questions at the end of each chapter, but this is not a textbook per
se. It can be used as a textbook for a first course in computer networking at the
high school or undergraduate level. It will fit in with the computer science and
electrical engineering departments. It is also explicitly intended for those entering the telecommunications industry or working for a company where the Internet is an essential part of the business plan (of which there are more and more
each day).
Only one chapter uses C language code, and that only to provide information for
the reader. Mathematical concepts that are not taught in high school are not used.
There are no calculus, probability theory, or stochastic process concepts used in
any chapter. The “pocket calculator” examples of public key encryption and Diffie-Hellman key distribution were carefully designed to illustrate the concepts, and yet
make the mathematics as simple as possible.
What Is Unique about This Book?
What’s in this book that you won’t find in a half-dozen other books about TCP/IP?
The list is not short.
1. This book uses the same network topology and addresses for every example
and chapter.
2. This book treats IPv4 and IPv6 as equals.
3. This book covers the routing protocols as well as TCP/IP applications.
4. This book discusses ISPs as well as corporate LANs.
5. This book covers services provided as well as the protocols that provide them.
6. This book covers topics (MPLS, IPSec, etc.) not normally covered in other
books on TCP/IP.
Why was the book written this way? Even in the Internet-conscious world we live
in today, few study the entire network, the routers, TCP/IP, the Internet, and a host
of related topics as part of their general education. What they do learn might seem
like a lot, but when considered in relation to the enormous complexity of each of
these topics, what is covered in general computer “literacy” or basic programming
courses is really only a drop in the bucket.
As I was writing this book, and printing it out at my workplace, a silicon chip engineer-designer found a few chapters on top of the printer bin and began reading them. When I came to retrieve the printout, he was fascinated by the sample chapters. He wanted the book then and there. And as we talked, he made me realize that thousands of people are entering the networking industry every day, many from other occupations and disciplines. As the Internet grows, and society's dependence on the digital communication structure continues, more and more people need this overview of how modern networks operate.
The intellectually curious will not be satisfied with this smattering and condensation of networking knowledge in a single volume. I'm hoping they
will seek ways to increase their knowledge in specific areas of interest. This
book covers hundreds of networking topics, and volumes have been written
devoted to the intricacies of each one. For example, there are 20 to 30 solid
books written on MPLS complexities and evolution, while the chapter here runs
at about the same number of pages. My hope is that this book and this method
of “illustrating” how a modern network works will contribute to more people
seeking out those 20 to 30 books now that they know how the overall thing
looks and works.
Like everyone else, I learned about networks, including routers and TCP/IP,
mostly from books and from listening to others tell me what they knew. The missing piece, however, was being able to play with the network. The books were great,
the discussions led to illumination of how this or that operated, but often I never
“saw” it working. This book is a bit of a synthesis of the written and the seen. It
attempts to give the reader the opportunity to see common tasks in a real, working, hands-on environment of the proper size and scale, and follow what happens
behind the scenes. It’s one thing to read about what happens when a Web site is
accessed, but another to see it in action.
The purpose of this book is to allow you to see what is happening on a modern
network when you access a Web site, write an email, download a song, or talk on
the phone over the Internet. From that observation you will learn how a modern
network works.
What You Won’t Find in This Book
It might seem odd to list things that the book does not cover. But rather than have
readers slog through and then find they didn’t find what they were after, here’s
what you will not find in this edition of the book.
You will find no mention of the exciting new peer-to-peer protocols that distribute the server function around the network. There is no mention of the protocols
used by chat rooms or services. The book does not explore music or movie download services. In other words, you won’t find YouTube, IRC, iTunes, or even eBay
mentioned in this book.
These topics are, of course, interesting and/or important. But the limitations of
time and page count forced me to focus on essential topics. The other topics could
easily form the foundation for The Illustrated Network, Volume II: Beyond the
Basics.
The Illustrated Network
Many people frustrated with simple lab setups and restricted “live” networks have
wished for a more complex and realistic yet secure environment where they can
feel free to explore the TCP/IP protocols, layers, and applications without worrying
that what they are seeing is limited to a quiet lab, or what they do might bring the
whole network to its knees.
The days are long gone when an interested party could take over the whole
network, from clients to servers to routers, and play with them at night or over the
weekend. Networks are run on a normal business-hour schedule, especially now
that the Web makes “prime time” on one side of the world when the other half is
trying to get some sleep.
Many times I have encountered a new feature or procedure and said to myself, "I wish I could play with this and see what happens." But only after nearly 40 years of networking experience (I hooked up my first modem, about the size of a microwave oven, in 1966) have I finally arrived at the point where I could say, "I want to do this . . . ," and someone didn't tell me it could not be done.
Juniper Networks Inc., my employer, was in a unique position to help me with
my plans to not merely talk about TCP/IP, or show contrived examples of the protocols in action, but to “illustrate” each piece with a series of clients, servers, routers,
and connections (including the public Internet). They had the routers and links,
and employed all the Unix and Windows-based hosts that I could possibly need.
(In retrospect, there was probably some overkill in the network, as most chapters
used only a couple of routers.) We decided not to upgrade the XP hosts to Vista,
which was relatively new at the time, and I kept Internet Explorer 6 active, more
or less out of convenience.
In any case, with the blessings of Juniper Networks, I set about creating the
kind of network I needed for this book. It took a while, but in the end it was well
worth it. We assembled a collection of five routers connected with SONET links,
two Ethernet LANs, two pairs of Windows XP clients and servers (Home and Pro editions), one pair of Red Hat Linux hosts (running the RH 9 kernel 2.4.20-8), and a pair of FreeBSD (release 4.10) hosts.
FIGURE P.1
The Illustrated Network. (Two-page diagram: LAN1 in the Los Angeles office and LAN2 in the New York office, customer edge routers CE0 and CE6, provider edge routers PE1 and PE5, provider routers P2, P4, P7, and P9 in the Ace ISP (AS 65459) and Best ISP (AS 65127) cores, the DSL link to a home wireless LAN, and the Global Public Internet. Solid rules = SONET/SDH; dashed rules = Gigabit Ethernet. All links use 10.0.x.y addressing; only the last two octets are shown.)
Figure P.1 shows the network that we built and that is used in every chapter of
this book to illustrate the networking concepts discussed.
Using This Book
This book is designed to be read from start to finish, chapter by chapter,
sequentially. It seems funny to say this, because a lot of technical books these
days are not meant to be “read” in the same way as a novel or a biography. Readers
tend to look things up in books like this, and then browse from the spot they land
on, which you can certainly do with this book, but probably more on a chapter-by-chapter level.
But I hope that the story in this book is as coherent as a mystery, if not as exciting as an adventure tale. From the first chapter, which offers readers a unique look
at layered protocols, to the last, this book presents a story that proceeds in a logical fashion from the bottom of the Internet protocol suite to the top (and beyond,
in some cases). So if you can, read from start to finish, as the chapters depend on
previous ones. If you are new to networking concepts, or just beginning, I recommend this consecutive approach. For those more experienced, bobbing in and out
is just fine, but remember that all emphasis is equal in The Illustrated Network,
and sometimes you may question a topic's coverage when the item in question is covered in an earlier chapter.
As you’re reading, you’ll discover that generally, each chapter has the same
structure. The beginning chapters, however, diverge from this format more than
the later chapters do, as they require general exploration of the protocol, application, or concept. After the first few chapters, I begin the tasks of illustrating how it
all works. In some cases, this involves not only the network built for this book, but
the global Internet as well. Note that network configuration specifics, especially
those involving the routers, vary somewhat, but these changes are completely
detailed as they occur.
The companion Web site for this book is www.elsevierdirect.com/companions/
9780123745415. There you will find many of the capture files to explore some of
the protocols on your own.
Source Code
Chapter 3 on network technologies uses examples from wireless network captures
supplied by Aeropeek. Chapter 12 on sockets uses listings from utility programs
written by Michael J. Donahoo and Kenneth L. Calvert for their excellent book,
TCP/IP Sockets in C (Morgan Kaufmann, 2001). Thanks to both groups for letting
me use their material in this book.
ACKNOWLEDGMENTS
I would like to thank various leaders in their respective fields who have given
me their time and read and reviewed selected chapters of this work. Their comments have made this a much better book than it would have been without their
involvement. Any errors that remain are mine.
I would like to thank colleagues at Juniper Networks, Inc., who gave their time
and effort to create this network. In many cases, they also helped with the book. It
starts at the top with Scott Kriens, who has created an environment where creativity and exploration are encouraged. Thanks, Scott!
The list goes on to include June Loy, Aviva Garrett, Michael Tallon, Patrick Ames,
Jason Lloyd, Mark Whittiker, Kent Ketell, and Jeremy Pruitt.
Finally I would like to thank my lead technical reviewers, Joel Jaeggli and Robin
Pimentel, for the careful scrutiny they gave the book and the many fine corrections
and comments they provided.
Lead Technical Reviewers
Joel Jaeggli works in the security and mobile connectivity group within Nokia.
His time is divided between the operation of the nokia.net (AS 14277) research
network and supporting the strategic planning needs of Nokia’s security business.
Projects with former employer, the University of Oregon, included the Network
Startup Resource Center, Oregon Route views project, the Beyond BGP Project, and
the Oregon Videolab. He is an active participant in several industry-related groups
including the IETF (working group chair) and NANOG (two terms on the program
committee). Joel frequently participates as an instructor or presenter at regional and
international network meetings on Internet services and security-related topics.
Robin Pimentel is currently a network engineer at Facebook, where he helps
the production network sustain growth alongside Facebook’s user and application
growth. Previously, Robin worked on the production network teams at Google and
Yahoo. Robin also spent 6 years at Teradyne where he performed many networking, security, and Unix infrastructure engineering roles. Prior to his career in computer networks, Robin worked at Cadence Design Systems and Intel Corporation.
While working in the chip sector, Robin specialized in silicon place and route,
VHDL-based behavioral logic validation, and gate-level logic validation for on-chip
memories.
About the Author
Walter Goralski has worked in the telecommunications and networking industry
since 1970. He spent 14 years in the Bell System. After that he worked with minicomputers and LANs at Wang Laboratories and with the Internet at Pace University, where he was a graduate professor for 15 years. He joined Juniper Networks
as a senior staff engineer in 2000 after 8 years as a technical trainer. Goralski is
the author of 10 books about networking, including the bestselling SONET/SDH
(now in its third edition). He has a master’s degree in computer science from Pace
University.
PART I
Networking Basics
All networks, from the smallest LAN to the global Internet, consist of similar
components. Layered protocols are the rule, and this part of the book examines
protocol suites, network devices, and the frames used on links that connect the
devices.
■ Chapter 1—Protocols and Layers
■ Chapter 2—TCP/IP Protocols and Devices
■ Chapter 3—Network Link Technologies
CHAPTER 1
Protocols and Layers
What You Will Learn
In this chapter, you will learn about the protocol stack used on the global public
Internet and how these protocols have been evolving in today’s world. We’ll
review some key basic definitions and see the network used to illustrate all of the
examples in this book, as well as the packet content, the role that hosts and routers play on the network, and how graphical user interfaces and command-line interfaces (GUI and CLI, respectively) are both used to interact with devices.
You will learn about standards organizations and the development of TCP/IP
RFCs. We’ll cover encapsulation and how TCP/IP layers interact on a network.
This book is about what actually happens on a real network running the protocols and
applications used on the Internet today. We’ll be looking at the entire network—everything from the application level down to where the bits emerge from the local device
and race across the Internet. A great deal of the discussion will revolve around the
TCP/IP protocol suite, the protocols on which the Internet is built. The network that
will run these protocols is shown in Figure 1.1.
Like most authors, I’ll use TCP/IP as shorthand for the entire Internet protocol stack,
but you should always be aware that the suite consists of many protocols, not just
TCP and IP. The protocols in use are constantly growing and evolving as the Internet
adapts to new challenges and applications. In the past few years, four trends have
become clear in the protocol evolution:
Increased use of multimedia—The original Internet was not designed with
proper quality of service assurances to support digital voice and video. However, the Internet now carries this as well as bulk and interactive data. (In this
book, “data” means non-voice and non-video applications.) In the future, all
forms of information should be able to use the Internet as an interactive distribution medium without major quality concerns.
Increasing bandwidth and mobility—The trend is toward higher bandwidth
(capacity), even for mobile users. New wireless technologies seem to promise
the "Internet everywhere." Users are no longer as restricted to analog telephone network modem bit rates, and new end-electronics, last-mile technologies, and improved wiring and backbones are the reason.
FIGURE 1.1
The Illustrated Network, showing the routers, links, and hosts on the network. Many of the layer addresses used in this book appear in the figure as well. (Two-page diagram: LAN1 in the Los Angeles office and LAN2 in the New York office, customer edge routers CE0 and CE6, provider edge routers PE1 and PE5, provider routers P2, P4, P7, and P9 in the Ace ISP (AS 65459) and Best ISP (AS 65127) cores, and the Global Public Internet. Solid rules = SONET/SDH; dashed rules = Gigabit Ethernet. All links use 10.0.x.y addressing; only the last two octets are shown.)
Security—Attacks have become much more sophisticated as well. The use of privacy tools such as encryption and digital signatures is no longer an option, but a necessity. E-commerce is a bigger and bigger business every year, and
on-line banking, stock transactions, and other financial manipulations make
strong security technologies essential. Identity verification is another place
where new applications employ strong encryption for security purposes.
New protocols—Even the protocols that make up the TCP/IP protocol suite
change and evolve. Protocols age and become obsolete, and make way for
newer ways of doing things. IPv6, the eventual successor for IPv4, is showing
up on networks around the world, especially in applications where the supply
of IPv4 addresses is inadequate (such as cell phones). In every case, each
chapter attempts to be as up-to-date and forward-looking as possible in its
particular area.
We will talk about these trends and more in later chapters in this book. For now, let’s
take a good look at the network that will be illustrated in the rest of this book.
Key Definitions
Any book about computers and networking uses terminology with few firm definitions and rules of usage. So here are some key terms that are used over and over
throughout this book. Keep in mind that these terms may have varying interpretations, but are defined according to the conventions used in this book.
■ Host: For the purposes of this book, a host is any endpoint or end system device that runs TCP/IP. In most cases, these devices are ordinary desktop and laptop computers. However, in some cases hosts can be cell phones, handheld personal digital assistants (PDAs), and so on. In the past, TCP/IP has been made to run on toasters, coffee machines, and other exotic devices, mainly to prove a point.
■ Intermediate system: Hosts that do not communicate directly pass information through one or more intermediate systems. Intermediate systems are often generically called "network nodes" or just "nodes." Specific devices are labeled "routers," "bridges," or "switches," depending on their precise roles in the network. The intermediate nodes on the Illustrated Network are routers with some switching capabilities.
■ System: This is just shorthand for saying the device can be a host, router, switch, node, or almost anything else on a network. Where clarity is important, we'll always specify "end system" or "intermediate system."
THE ILLUSTRATED NETWORK
Each chapter in this book will begin with a look at how the protocol or chapter contents
function on a real network. The Illustrated Network, built in the Tech Pubs department
of Juniper Networks, Inc., in Sunnyvale, California, is shown in Figure 1.1.
The network consists of systems running three different operating systems (Windows
XP, Linux, and FreeBSD Unix) connected to Ethernet local area networks (LANs). These
systems are deployed in pairs, as either clients (for now, defined as "systems with users doing work in front of them") or servers (for now, defined as "systems with administrators, and usually intended only for remote use"). When we define the client and
server terms more precisely, we’ll see that the host’s role at the protocol level depends
on which host initiates the connection or interaction. The hosts can be considered to
be part of a corporate network with offices in New York and Los Angeles.
Addressing information is shown for each host, router, and link between devices. We'll talk about all of these addresses in detail later, and why the hosts in particular have several addresses in varying formats. (For example, the hosts have only link-local IPv6 addresses, and not global ones.)
The LANs are attached to Juniper Networks’ routers (also called intermediate nodes,
although some are technically gateways), which in turn are connected in our network
to other routers by point-to-point synchronous optical network (SONET) links, a type
of wide area network (WAN) link. Other types of links, such as asynchronous transfer
mode (ATM) or Ethernet, can be used to connect widely separated routers, but SONET
links are very common in a telecommunications context. There is a link to the global
Internet and to a home-based wireless LAN as well. The home office link uses digital subscriber line (DSL), a form of dedicated broadband Internet access, and not dial-up modem connectivity.
Major Parts of the Illustrated Network
The Illustrated Network is composed of four major components. At the top are two Ethernet LANs with the hosts of our fictional organization, one in New York and one in Los Angeles. The offices have different ISPs (a common enough situation), and the site routers link to Ace ISP on the West Coast and Best ISP on the East Coast with Gigabit Ethernet links (more on links in the next chapter). The two ISPs link to each other directly and also link to the "global public Internet." Just what this is will be discussed once we start looking at the routers themselves.
One employee of this organization (the author) is shown linking a home wireless network to the West Coast ISP with a high-speed ("broadband") digital subscriber line (DSL) link. The rest of the links are high-speed WAN links and two Gigabit Ethernet (GE) links. (It's becoming more common to use GE links across longer distances, but this network employs other WAN technologies.)
The Illustrated Network is representative of many LANs, ISPs, and users around the world.
This network will be used throughout this book to illustrate how the different
TCP/IP protocols running on hosts and routed networks combine to form the Internet.
Some protocols will be examined from the perspective of the hosts and LAN (on the
local “user edge”) and others will be explored from the perspective of the service
provider (on the global “network edge”). Taken together, these viewpoints will allow
us to see exactly how the network works, inside and out.
Let’s explore the Illustrated Network a little, from the user edge, just to demonstrate
the conventions that will be used at the beginning of each chapter in this book.
Remote Access to Network Devices
We can use a host (client or server system running TCP/IP) to remotely access another
device on the local network. In the context of this book, a host is a client or server
system. We can loosely (some would say very loosely) define clients as typically the
PCs on which users are doing work, and that’s how we’ll use the term for now. On the
other hand, servers (again loosely) are devices that usually have administrators tending
them. Servers are often gathered in special equipment racks in rooms with restricted
access (the "server room"), although print servers are usually not. We'll be more precise about the differences between clients and servers in terms of the "initiating protocol" later in this book.
Let’s use host lnxclient to remotely access the host bsdserver on one of the LANs.
We’ll use the secure shell application, ssh, for remote access and log in (the –l option)
as remote-user. There are other remote access applications, but in this book we’ll use
ssh. We’ll use the command-line interface (CLI) on the Linux host to do so.
[root@lnxclient admin]# ssh -l remote-user 10.10.12.77
Password:
Last login: Sun Mar 17 16:12:54 2008 from securepptp086.s
Copyright (c) 1980, 1983, 1986, 1988, 1990, 1991, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 4.10-RELEASE (GENERIC) #0: Tue May 25 22:47:12 GMT 2004
Welcome to FreeBSD!...
We can also use a host to access a router on the network. As mentioned earlier, a
router is a type of intermediate system (or network node) that forwards IP data units
along until they reach their destination. A router that connects a LAN to an Internet
link is technically a gateway. We’ll be more precise about these terms and functions in
later chapters dealing with routers and routing specifically.
Let’s use host bsdclient to remotely access the router on the network that is directly
attached to the LAN, router CE0 ("Customer Edge router #0"). Usually, we'd do this to
configure the router using the CLI. As before, we’ll use the secure shell application, ssh,
for remote access and log in as remote-user. We’ll again use the CLI on the Unix host
to do so.
CHAPTER 1 Protocols and Layers
9
bsdclient> ssh -l remote-user 10.10.11.1
remote-user@10.10.11.1's password:
--- JUNOS 8.4R1.3 built 2007-08-06 06:58:15 UTC
remote-user@CE0>
These examples show the conventions that will appear in this book when command-line procedures are shown. All prompts, output, and code listings appear like
this. Whenever a user types a command to produce some output, the command typed
will appear like this. We’ll see CLI examples from Windows hosts as well.
Illustrated Network Router Roles
The intermediate systems or network nodes used on the Illustrated Network are
routers. Not all of the routers play the same role in the network, and some have
switching capabilities. The router’s role depends on its position in the network.
Generally, smaller routers populate the edge of the network near the LANs and
hosts, while larger routers populate the ISP’s network core. The routers on our
network have one of three network-centric designations; we have LAN switches
also, but these are not routers.
■ Customer edge (CE): These two routers belong to us, in our role as the customer who owns and operates the hosts and LANs. These CE routers are smaller than the other routers in terms of size, number of ports, and capabilities. Technically, on this network, they perform a gateway role.
■ Provider edge (PE): These two routers gather the traffic from customers (typically there are many CE routers, of course). They are not usually accessible by customers.
■ Provider (P): These four routers are arranged in what is often called a "quad." The two service providers on the Illustrated Network each manage two provider routers in their network core. Quads make sure traffic flows smoothly even if any one router or one link fails on the provider's core networks.
■ Ethernet LAN switches: The network also contains two Ethernet LAN switches. We'll spend a lot of time exploring switches later. For now, consider that switches operate on Layer 2 frames and routers operate on Layer 3 packets.
Now, what is this second example telling us? First of all, it tells us that routers,
just like ordinary hosts, will allow a remote user to log in if they have the correct
user ID and password. It would appear that routers aren’t all that much different from
hosts. However, this can be a little misleading. Hosts generally have different roles in a
network than routers. For now, we’ll just note that for security reasons, you don’t want
it to be easy for people to remotely access routers, because intruders can cause a lot
of damage after compromising just a single router. In practice, a lot more security than
just passwords is employed to restrict router access.
Secure remote access to a router is usually necessary, so not running the process or
entity that allows remote access isn’t an option. An organization with a large network
could have routers in hundreds of locations scattered all over the country (or even the
world). These devices need management, which includes tasks such as changing the configuration of the routers. Router configuration often includes details about the protocols’
operation and other capabilities of the router, which can change as the network evolves.
Software upgrades need to be distributed as well. Troubleshooting procedures often
require direct entry of commands to be executed on the router. In short, remote access
and file transfer can be very helpful for router and network management purposes.
File Transfer to a Router
Let’s look at the transfer of a new router configuration file, for convenience called
routerconfig.txt, from a client host (wincli2) to router CE6. This time we'll use a GUI
for the file transfer protocol (FTP) application, which will be shown as a figure, as in
Figure 1.2. First, we have to remotely access the router.
The main window section in the figure shows remote access and the file listing of
the default directory on the router, which is /var/home/remote (the router uses the
Unix file system). The listing in the lower right section is the contents of the default
FIGURE 1.2
Remote access for FTP using a GUI. Note how the different panes give different types of
information, yet bring it all together.
FIGURE 1.3
File transfer with a GUI. There are commands (user mouse clicks that trigger messages), responses
(the server’s replies), and status lines (reports on the state of the interaction).
directory, not part of the command/response dialog between host and router. The
lower left section shows the file system on the host, which is a Windows system. Note
that the file transfer is not encrypted or secured in any way.
Most “traditional” Unix-derived TCP/IP applications have both CLI and GUI interfaces
available, and which one is used is usually a matter of choice. Older Unix systems, the
kind most often used on the early Internet, didn’t typically have GUI interfaces, and
a lot of users prefer the CLI versions, especially for book illustrations. GUI applications work just as well, and don’t require users to know the individual commands
well. When using the GUI version of FTP, all the user has to do is “drag and drop” the
local routerconfig.txt file from the lower left pane to the lower right pane of the
window to trigger the commands (which the application produces “automatically”) for
the transfer to occur. This is shown in Figure 1.3.
With the GUI, the user does not have to issue any FTP commands directly.
CLI and GUI
We’ll use both the CLI and GUI forms of TCP/IP applications in this book. In a nod to
tradition, we’ll use the CLI on the Unix systems and the GUI versions when Windows
systems are used in the examples. (CLI commands often capture details that are not
easily seen in GUI-based applications.) Keep in mind that you can use GUI applications
on Unix and the CLI on Windows (you have to run cmd first to access the Windows
CLI). This listing shows the router configuration file transfer of newrouterconfig.txt
from the Windows XP system to router CE6, but with the Windows CLI and using the
IP address of the router.
C:\Documents and Settings\Owner> ftp 10.10.12.1
Connected to 10.10.12.1.
220 R6 FTP server (version 6.00LS) ready.
User (10.10.12.1:(none)): walterg
331 Password required for walterg.
Password: ********
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for '/bin/ls'.
total 128
drwxr-xr-x 2 remote staff 512 Nov 20 2004 .ssh
-rw-r--r-- 1 remote staff 4316 Mar 25 2006 R6-base
-rw-r--r-- 1 remote staff 4469 May 11 20:08 R6-cspf
-rw-r--r-- 1 remote staff 4316 Jun 3 18:46 R6-rsvp
-rw-r--r-- 1 remote staff 4242 Jun 16 14:44 R6-rsvp-message
-rw-r----- 1 remote staff 559 Feb 3 2005 juniper.conf
-rw-r--r-- 1 remote staff 4081 Dec 2 2005 merisha-base
-rw-r--r-- 1 remote staff 2320 Dec 3 2005 richard-ASP-manual-SA
-rw-r--r-- 1 remote staff 2358 Dec 2 2005 richard-base
-rw-r--r-- 1 remote staff 7344 Sep 30 11:28 routerconfig.txt
-rw-r--r-- 1 remote staff 4830 Jul 13 17:04 snmp-forwarding
-rw-r--r-- 1 remote staff 3190 Jan 7 2006 tp6
-rw-r--r-- 1 remote staff 4315 May 5 12:49 wjg-ORA-base-TP6
-rw-r--r-- 1 remote staff 4500 May 6 09:47 wjg-tp6-with-ipv6
-rw-r--r-- 1 remote staff 4956 May 8 13:42 wjg-with-ipv6
226 transfer complete
ftp: 923 bytes received in 0.00Seconds 923000.00Kbytes/sec.
ftp> bin
200 Type set to I
ftp> put newrouterconfig.txt
200 PORT command successful.
150 Opening ASCII mode data connection for "newrouterconfig.txt".
226 Transfer complete.
ftp: 7723 bytes sent in 0.00Seconds 7344000.00Kbytes/sec.
ftp>_
In some cases, we’ll list CLI examples line by line, as here, and in other cases we will
show them in a figure.
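The numeric replies in the listing above (220, 331, 200, 150, 226) are the FTP server's half of a command/response dialog that any program can carry on over a TCP connection to port 21. As a small, hedged sketch of what that looks like below the CLI, the following C program simply connects to the router address used above (10.10.12.1) and prints whatever greeting the server sends; it is an illustration written for this discussion, not a listing captured from the network itself.

/* ftp_banner.c - hedged sketch: open a TCP connection to an FTP control
 * port and print the server's greeting line (a "220 ..." reply).
 * The address 10.10.12.1 is taken from the example above; compile on a
 * Unix-like host with: cc ftp_banner.c -o ftp_banner
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    char banner[512];
    ssize_t n;
    struct sockaddr_in server;
    int s = socket(AF_INET, SOCK_STREAM, 0);      /* a TCP socket     */

    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(21);                  /* FTP control port */
    inet_pton(AF_INET, "10.10.12.1", &server.sin_addr);

    if (s < 0 || connect(s, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect");
        return 1;
    }
    n = read(s, banner, sizeof(banner) - 1);      /* the 220 greeting */
    if (n > 0) {
        banner[n] = '\0';
        printf("Server says: %s", banner);
    }
    close(s);
    return 0;
}

The GUI and CLI clients shown earlier are just friendlier front ends to exactly this kind of exchange; Chapter 12 looks at socket programming in much more detail.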
Ethereal and Packet Capture
Of course, showing a GUI or command line FTP session doesn’t reveal much about
how the network functions. We need to look at the bits that are flowing through the
network. Also, we need to look at applications, such as the file transfer protocol, from
the network perspective.
To do so, we’ll use a packet capture utility. This book will use the Ethereal packet
capture program in fact and by name throughout, although shortly after the project
began, Ethereal became Wireshark. The software is the same, but all development will
now be done through the Wireshark organization. Wireshark (Ethereal) is available free
of charge at www.wireshark.org. It is notable that Wireshark, unlike a lot of similar
applications, is available for Windows as well as most Unix/Linux variations.
Ethereal is a network protocol analyzer program that keeps a copy of every packet
of information that emerges from or enters the system on a particular interface. Ethereal also parses the packet and shows not only the bit patterns, but what those bit
groupings mean. Ethereal has a summary screen, a pane for more detailed information, and a pane that shows the raw bits that Ethereal captured. The nicest feature of
Ethereal is that the packet capture stream can be saved in a standard libpcap format
file (usually with a .cap or .pcap extension), which is common among most protocol
analyzers. These files can be read and parsed and replayed by tcpdump and other applications or Ethereal on other systems.
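Because the capture files are saved in the standard libpcap format, they can also be read by programs you write yourself. The short C sketch below, offered only as an illustration, uses the libpcap library to open a saved capture file and count the packets in it; the filename capture.pcap is a placeholder, not one of the capture files from this book's companion site.

/* readcap.c - hedged sketch: read a saved libpcap capture file and count
 * the packets it contains. Link against libpcap, for example:
 * cc readcap.c -o readcap -lpcap
 */
#include <stdio.h>
#include <pcap.h>

int main(int argc, char *argv[])
{
    char errbuf[PCAP_ERRBUF_SIZE];
    struct pcap_pkthdr *hdr;
    const u_char *data;
    int count = 0;
    /* "capture.pcap" is just a placeholder name for a saved capture */
    const char *file = (argc > 1) ? argv[1] : "capture.pcap";
    pcap_t *cap = pcap_open_offline(file, errbuf);

    if (cap == NULL) {
        fprintf(stderr, "pcap_open_offline: %s\n", errbuf);
        return 1;
    }
    /* pcap_next_ex() returns 1 for each packet and -2 at end of file */
    while (pcap_next_ex(cap, &hdr, &data) == 1) {
        count++;
        printf("packet %d: %u bytes captured\n", count, hdr->caplen);
    }
    pcap_close(cap);
    printf("%d packets in %s\n", count, file);
    return 0;
}

This is the same format that tcpdump and Wireshark read, which is what makes the saved captures so portable between tools and systems.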
Figure 1.4 shows the same router configuration file transfer as in Figures 1.2 and 1.3,
and at the same time. However, this time the capture is not at the user level, but at the
network level.
FIGURE 1.4
Ethereal FTP capture of the file transfer shown earlier from the user perspective.
Each packet captured is numbered sequentially and given a time stamp, and its source and destination addresses are listed. The protocol is in the next column, followed
by the interpretation of the packet’s meaning and function. The packet to request the
router to STOR routerconfig.txt is packet number 26 in the sequence.
Already we’ve learned something important: that with TCP/IP, the number of
packets exchanged to accomplish even something basic and simple can be surprisingly large. For this reason, in some cases, we'll show only a section of the panes of the full Ethereal screen, just to cut down on screen clutter. The captured files are always
there to consult later.
With these tools—CLI listings, GUI figures, and Ethereal captures—we are prepared
to explore all aspects of modern network operation using TCP/IP.
First Explorations in Networking
We’ve already seen that an authorized user can access a router from a host. We’ve
seen that routers can run the ssh and ftp server applications sshd and ftpd, and the
suspicion is that they might be able to run even more (they can just as easily be ssh
and ftp clients). However, the router application suite is fairly restrictive. You usually
don’t, for example, send email to a router, or log in to a router and then browse Web
sites. There is a fundamental difference in the roles that hosts and routers play in a
network. A router doesn't have all of the application software you would expect to find on a client or server, and the applications a router does have are used mainly for management purposes.
However, it does have all the layers of the protocol suite.
TCP/IP networks are a mix of hosts and routers. Hosts often talk to other devices
on the network, or expose their applications to the network, but their basic function
is to run programs. However, network systems like routers exist to keep the network
running, which is their primary task. Router-based applications support this task,
although in theory, routers only require a subset of the TCP/IP protocol suite layers to
perform their operational role. You also have to manage routers, and that requires some
additional software in practice. However, don’t expect to find chat or other common
client applications on a router.
What is it about protocols and layers that is so important? That’s what the rest of
this chapter is about. Let’s start with what protocols are and where they come from.
PROTOCOLS
Computers are systems or devices capable of running a number of processes. These
processes are sometimes referred to as entities, but we’ll use the term processes.
Computer networks enable communication between processes on two different
devices that are capable of sending and receiving information in the form of bits
(0s and 1s). What pattern should the exchange of bits follow? Processes that exchange
bit streams must agree on a protocol. A protocol is a set of rules that determines all
aspects of data communication.
A protocol is a standard or convention that enables and controls the connection, communication, and transfer of information between two communications
endpoints, or hosts. A protocol defines the rules governing the syntax (what can
be communicated), semantics (how it can be communicated), and synchronization (when and at what speed it can be communicated) of the communications
procedure. Protocols can be implemented on hardware, software, or a combination
of both.
Protocols are not the same as standards: some standards have never been implemented as workable protocols, while some of the most useful protocols are only
loosely defined (this sometimes makes interconnection an adventure). The protocols
discussed in this book vary greatly in degree of sophistication and purpose. However,
most of the protocols specify one or more of the following:
Physical connection—The host typically uses different hardware depending on whether
the connection is wired or wireless, and some other parameters might require manual configuration. However, protocols are used to supply details about the network
connection (speed is part of this determination). The host can usually detect the
presence (or absence) of the other endpoint devices as well.
Handshaking—A protocol can define the rules for the initial exchange of information across the network.
Negotiation of parameters—A protocol can define a series of actions to establish
the rules and limits used for communicating across the network.
Message delimiters—A protocol can define what will constitute the start and end
of a message on the network.
Message format—A protocol can define how the content of a message is structured, usually at the "field" level (a small sketch of such a format appears after this list).
Error detection—A protocol can define how the receiver can detect corrupt messages, unexpected loss of connectivity, and what to do next. A protocol can
simply fail or try to correct the error.
Error correction—A protocol can define what to do about these error situations.
Note that error recovery usually consists of both error-detection and errorcorrection protocols.
Termination of communications—A protocol can define the rules for gracefully
stopping communicating endpoints.
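To make the message delimiter, message format, and error detection items a little more concrete, here is a hedged C sketch of a completely made-up protocol. None of the field names or values below come from a real Internet protocol; the one-byte XOR checksum and the 0x7E start byte are inventions chosen only to show how a specification pins a message down at the "field" level.

/* toyproto.c - hedged sketch of an invented protocol message layout:
 * a start delimiter, a small fixed-format header, a payload, and a
 * one-byte XOR checksum for simple error detection.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define TOY_START 0x7E              /* message delimiter: start of message */

struct toy_header {                 /* message format: fields and sizes    */
    uint8_t  start;                 /* always TOY_START                    */
    uint8_t  type;                  /* what kind of message this is        */
    uint16_t length;                /* number of payload bytes that follow */
};

/* error detection: XOR every byte; the receiver recomputes and compares */
static uint8_t toy_checksum(const uint8_t *bytes, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum ^= bytes[i];
    return sum;
}

int main(void)
{
    const char *payload = "test 1";
    struct toy_header hdr = { TOY_START, 1, (uint16_t)strlen(payload) };
    uint8_t frame[64];
    size_t pos = 0;
    uint8_t sum;

    /* build the message: header, then payload, then checksum */
    memcpy(frame + pos, &hdr, sizeof(hdr));
    pos += sizeof(hdr);
    memcpy(frame + pos, payload, hdr.length);
    pos += hdr.length;
    sum = toy_checksum(frame, pos);       /* covers header and payload */
    frame[pos] = sum;
    pos += 1;

    printf("built a %zu-byte message, checksum 0x%02x\n", pos, sum);
    return 0;
}

A real specification would also pin down the byte order of the length field and say what the receiver does when the checksum fails (discard the message, ask for a retransmission, or attempt a correction), which is where the error correction and handshaking items come in.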
Protocols at various layers provided the abstraction necessary for Internet success. Application developers did not have to concern themselves overly with the
physical properties of the network. The expanded use of communications protocols
has been a major contributor to the Internet’s success, acceptance, flexibility, and
power.
Standards and Organizations
Anyone can define a protocol. Simply devise a set of rules for any or all of the phases
of communication and convince others to make hardware or software that implements the new method. Of course, an implementer could try to be the only source
of a given protocol, a purely proprietary situation, and this was once a popular way
to develop protocols. After all, who knew better how to network IBM computers
than IBM? Today, most closed protocols have given way to open protocols based on
published standards, especially since the Internet strives for connectivity between
all types of computers and related devices and is not limited to equipment from
a certain vendor. Anyone who implements an open protocol correctly from public
documents should in most cases be able to interoperate with other versions of the
same protocol.
Standards promote and maintain an open and competitive market for network
hardware and software. The overwhelming need for interoperability today, both
nationally and internationally, has increased the set of choices in terms of vendor and
capability for each aspect of data communications. However, proprietary protocols
intended for a limited architecture or physical network are still around, of course. Proprietary protocols might have some very good application-specific protocols, but could
probably not support things like the Web as we know it. Making something a standard
does not guarantee market acceptance, but it is very difficult for a protocol to succeed
without a standard for everyone to follow. Standards provide essential guidelines to
manufacturers, vendors, service providers, consultants, government agencies, and users
to make sure the interconnectivity needed today is there.
Data communication standards fall into two major categories: de jure (“by rule or
regulation”) and de facto (“by fact or convention”).
De jure—These standards have been approved by an officially recognized body
whose job is to standardize protocols and other aspects of networking. De jure
standards often have the force of law, even if they are called recommendations (for these basic standards, it is recommended that nations use their own
enforcement methods, such as fines, to make sure they are followed).
De facto—Standards that have not been formally approved but are widely followed
fall into this category. If someone wants to do something different, such as
a manufacturer of network equipment, this method can be used to quickly
establish a new product or technology. These types of standards can always be
proposed for de jure approval.
When it comes to the Internet protocols, things are a bit more complicated. There
are very few official standards, and there are no real penalties involved for not following them (other than the application not working as promised). On the Internet, a
“de facto standard” forms a reference implementation in this case. De facto standards
are also often subportions or implementation details for formal standards, usually when
the formal standard falls short of providing all the information needed to create a working program. Internet standard proposals in many cases require running code at some
stages of the process: at least the de facto code will cover the areas that the standard
missed.
The standards for the TCP/IP protocol suite now come from the Internet Engineering Task Force (IETF), working in conjunction with other Internet organizations. The
IETF is neither strictly a de facto nor de jure standards organization: There is no force
of law behind Internet standards; they just don’t work the way they should if not done
correctly. We’ll look at the IETF in detail shortly. The Internet uses more than protocol
standards developed by the IETF. The following organizations are the main ones that
are the sources of these other standards.
Institute of Electrical and Electronics Engineers
This international organization is the largest society of professional engineers in the
world. One of its jobs is to oversee the development and adaptation of international
standards, often in the local area network (LAN) arena. Examples of IEEE standards are
all aspects of wireless LANs (IEEE 802.11).
American National Standards Institute
Although ANSI is actually a private nonprofit organization, and has no affiliation with the
federal government, its goals include serving as the national institution for coordinating
voluntary standardization in the United States as a way of advancing the U.S. economy
and protecting the public interest. ANSI’s members are consumer groups, government
and regulatory bodies, industry associations, and professional societies. Other countries
have similar organizations that closely track ANSI’s actions. The indispensable American
Standard Code for Information Interchange (ASCII) that determines what bits mean is
an example of an ANSI standard.
Electronic Industries Association
This is a nonprofit organization aligned with ANSI to promote electronic manufacturing concerns. The EIA has contributed to networking by defining physical connection
interfaces and specifying electrical signaling methods. The popular Registered Jack #45 (RJ-45) connector for twisted-pair LANs is an example of an EIA standard.
ISO, or International Standards Organization
Technically, this is the International Organization for Standardization in English, one of
its official languages, but is always called the ISO. "ISO" is not an acronym or initialism
for the organization’s full name in either English or French (its two official languages).
Rather, the organization adopted ISO based on the Greek word isos, meaning equal.
Recognizing that the organization’s initials would vary according to language, its founders chose ISO as the universal short form of its name. This, in itself, reflects the aim of
the organization: to equalize and standardize across cultures. This multinational body’s
members are drawn from the standards committees of various governments. They are
a voluntary organization dedicated to agreement on worldwide standards. The ISO’s
major contribution in the field of networking is with the creation of a model of data
networking, the Open Systems Interconnection Reference Model (ISO-RM), which also
forms the basis for a working set of protocols. The United States is represented by ANSI
in the ISO.
International Telecommunications Union–Telecommunication Standards Sector
A global economy needs international standards not only for data networks, but for
the global public switched telephone network (PSTN). The United Nations formed a
committee under the International Telecommunications Union (ITU), known as the
Consultative Committee for International Telegraphy and Telephony (CCITT), that was
eventually reabsorbed into the parent body as the ITU-T in 1993. All communications
that cross national boundaries must follow ITU-T “recommendations,” which have
the force of law. However, inside a nation, local standards can apply (and usually do).
A network architecture called asynchronous transfer mode (ATM) is an example of an
ITU-T standard.
In addition to these standards organizations, networking relies on various forums to
promote new technologies while the standardization process proceeds at the national
and international levels. Forum members essentially pledge to follow the specifications of the forum when it comes to products, services, and so forth, although there
is seldom any penalty for failing to do so. The Metro Ethernet Forum (MEF) is a good
example of the modern forum in action.
The role of regulatory agencies cannot be ignored in standard discussions. It makes
no sense to develop a new service for wireless networking in the United States, for
example, if the Federal Communications Commission (FCC) has forbidden the use of
the frequencies used by the new service for that purpose. Regulated industries include
radio, television, and wireless and cable systems.
Request for Comment and the Internet Engineering Task Force
What about the Internet itself? The Internet Engineering Task Force (IETF) is the
organization directly responsible for the development of Internet standards. The
IETF has its own system for standardizing network components. In particular, Internet standards cover many of the protocols used by devices attached to the Internet,
especially those closer to the user (applications) than to the physical network.
Internet standards are formalized regulations followed and used by those who
work on the Internet. They are specifications that have been tested and must be
followed. There is a strict procedure that all Internet components follow to become
standards. A specification starts out as an Internet draft, a working document that
often is revised, has no official status, and has a 6-month life span. Developers often
work from these drafts, and much can be learned from the practical experience of
implementation of a draft. If recommended, the Internet authorities can publish the
draft as a request for comment (RFC). The term is historical, and does not imply that
feedback is required (most of the feedback is provided in the drafting process). Each
RFC is edited, assigned a number, and available to all. Not all RFCs are standards, even
those that define protocols.
This book will make heavy use of RFCs to explain all aspects of TCP/IP and the
Internet, so a few details are in order. RFCs have various maturity levels that they go
through in their lifetimes, according to their requirement levels. The RFC life-cycle
maturity levels are shown in Figure 1.5. Note that the timeline does not always apply,
or is not applied in a uniform fashion.
A specification can fall into one of six maturity levels, after which it passes to historical status and is useful only for tracking a protocol’s development. Following introduction as an Internet draft, the specification can be a:
Proposed standard—The specification is now well understood, stable, and
sufficiently interesting to the Internet community. The specification is now
usually tested and implemented by several groups, if this has not already
happened at the draft level.
Draft standard—After at least two successful and independent implementations,
the proposed standard is elevated to a draft standard. Without complications,
and with modifications if specific problems are uncovered, draft standards normally become Internet standards.
(Flowchart: an Internet Draft may become a Proposed Standard, then, after six months, a Draft Standard, and, after four more months, an Internet Standard; Experimental and Informational RFCs sit off the standards track, and obsolete documents become Historic RFCs.)
FIGURE 1.5
The RFC life cycle. Many experimental RFCs never make it to the standards track.
Internet standard—After demonstrations of successful implementation, a draft
standard becomes an Internet standard.
Experimental RFCs—Not all drafts are intended for the “standards track” (and
a huge number are not). Work related to an experimental situation that does not affect Internet operation comprises experimental RFCs. These RFCs should not
be implemented as part of any functional Internet service.
Informational RFCs—Some RFCs contain general, historical, or tutorial information rather than instructions.
RFCs are further classified into one of five requirement levels, as shown in Figure 1.6.
Required—These RFCs must be implemented by all Internet systems to ensure
minimum conformance. For example, IPv4 and ICMP, both discussed in detail in
this book, are required protocols. However, there are very few required RFCs.
Recommended—These RFCs are not required for minimum conformance, but are
very useful. For example, FTP is a recommended protocol.
Elective—RFCs in this category are not required and not recommended. However,
systems can use them for their benefit if they like, so they form a kind of
“option set” for Internet protocols.
Limited Use—These RFCs are only used in certain situations. Most experimental
RFCs are in this category.
RFC Requirement Levels
Required: All systems must implement
Recommended: All systems should implement
Elective: Not required nor recommended
Limited Use: Used in certain situations, such as experimental
Not Recommended: Systems should not implement
FIGURE 1.6
RFC requirement levels. There are very few RFCs that are required to implement an Internet
protocol suite.
Not Recommended—These RFCs are inappropriate for general use. Most historic
(obsolete) RFCs are in this category.
RFCs can be found at www.rfc-editor.org/rfc.html. Current Internet drafts can be found
at www.ietf.org/ID.html. Expired Internet drafts can be found at www.watersprings.
org/pub/id/index-all.html.
INTERNET ADMINISTRATION
As the Internet has evolved from an environment with a large student user population
to a more commercialized network with a broad user base, the groups that have guided
and coordinated Internet issues have evolved. Figure 1.7 shows the general structure
of the Internet administration entities.
Internet Society (ISOC)—This is an international nonprofit organization formed in
1992 to support the Internet standards process. ISOC maintains and supports
the other administrative bodies described in this section. ISOC also supports
research and scholarly activities relating to the Internet.
(Organization chart: the Internet Society oversees the Internet Architecture Board, under which sit the Internet Engineering Task Force, with its IESG, areas, and working groups, and the Internet Research Task Force, with its IRSG and research groups.)
FIGURE 1.7
Internet administration groups, showing the interactions between the major components.
Internet Architecture Board (IAB)—This group is the technical advisor to
ISOC. The IAB oversees the continued development of the Internet protocol
suite and plays a technical advisory role to members of the Internet community involved in research. The IAB does this primarily through the two organizations under it. In addition, the RFC editor derives authority from the IAB, and
the IAB represents the Internet to other standards organizations and forums.
Internet Engineering Task Force (IETF)—This a forum of working groups
managed by the Internet Engineering Steering Group (IESG). The IETF identifies operational problem areas and proposes solutions. They also develop and
review the specifications intended to become Internet standards. The working
groups are organized into areas devoted to a particular topic. Nine areas have
been defined, although this can change: applications, Internet protocols,
routing, operations, user services, network management, transport, IPv6, and
security. The IETF has taken on some of the roles that were invested in ISOC.
Internet Research Task Force (IRTF)—This is another forum of working groups,
organized directly under the Internet Research Steering Group (IRSG) for
management purposes. The IRTF is concerned with long-term research topics
related to Internet protocols, applications, architecture, and technology.
Two other groups are important for Internet administration, although they do not
appear in Figure 1.7.
Internet Corporation for Assigned Names and Numbers (ICANN)—This is a
private nonprofit corporation that is responsible for the management of all
Internet domain names (more on these later) and Internet addresses. Before
1998, this role was played by the Internet Assigned Numbers Authority (IANA),
which was supported by the U.S. government.
Internet Network Information Center (InterNIC)—The job of the InterNIC, run
by the U.S. Department of Commerce, is to collect and distribute information
about IP names and addresses. They are at http://www.internic.net.
LAYERS
When it comes to communications, all of these standard organizations have one
primary function: the creation of standards that can be combined with others to create
a working network. One concern is that these organizations be able to recommend
solutions that are both flexible and complete, even though no single standards entity
has complete control over the entire process from top to bottom. The way this is done
is to divide the communications process up into a number of functional layers.
Data communication networks rely on layered protocols. In brief, processes running on a system and the communication ports that send and receive network bits are
logically connected by a series of layers, each performing one major function of the
networking task.
The key concept is that each layer in the protocol stack has a distinct purpose and
function. There is a big difference between the application layer protocols we’ve seen,
such as FTP and SSH, and a lower-level protocol such as Ethernet on a LAN. Each protocol layer handles part of the overall task.
For example, Ethernet cards format the bits sent out on a LAN at one layer, and
FTP client software communicates with the FTP server at a higher layer. However, the
Ethernet card does not tell the FTP application which bits to send out the interface.
FTP addresses the higher-end part of the puzzle: sending commands and data to the
FTP server. Other layers take care of things like formatting, and can vary in capability
or form to address differences at every level. You don’t use different Web browsers
depending on the type of links used on a network. The whole point is that not all
networks are Ethernet (for example), so a layered protocol allows a “mix and match” of
whatever protocols are needed for the network at each layer.
Simple Networking
Most programming languages include statements that allow the programmer to send
bits out of a physical connector. For example, suppose a programming language allowed
you to program a statement like write(port 20$, "test 1"). Sure enough, when compiled, linked, and run, the program would spit the bits representing the string “test 1”
out the communications port of the computer. A similar statement like read(port 20$,
STUFF) would, when compiled, linked, and run, wait until something appeared in the
buffer of the serial port and store the bits in the variable called STUFF.
A simple network using this technique is shown in Figure 1.8. (There is still some
software in use that does networking this way.)
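For illustration, a rough Python equivalent of this non-layered style might look like the sketch below; the device path /dev/ttyS0 stands in for the hypothetical port 20$ and is an assumption, not anything defined by the text:

    import os

    # Open a communications device directly and push raw bytes out of it.
    # There is no framing, no addressing, and no error checking here.
    fd = os.open("/dev/ttyS0", os.O_RDWR)
    os.write(fd, b"test 1")        # "spit the bits out the port"

    # On the other system, a matching read simply waits for whatever
    # bytes show up and stores them in STUFF.
    stuff = os.read(fd, 1024)
    os.close(fd)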
However, there are some things to consider. Is there anything attached to the port at
all? Or are the bits just falling into the “bit bucket”? If there was a link attached, what if
someone disconnected it while the bits are in flight? What about other types of errors?
How would we know that the bits arrived safely?
Even assuming that the bits got there, and some listening process received them,
does the content make sense? Some computers store bits differently than others, and
“test 1” could be garbled on the other system. How many bits are sent to represent the number 1? How do we know that a “short integer” used by the sender is the same as the “short integer” used by another? (In fairness, TCP/IP does little to address this issue directly.)

FIGURE 1.8
An extremely simple network with a distinctly non-layered approach to networking. (Diagram: System A, the sender, executes write(port 20$, “test 1”) and the bits flow over the link to System B, the receiver, which executes read(port 20$, STUFF).)
We see that the networking task is not as simple as it seems. Now, each and every
networked application program could conceivably include every line of code that is
needed to solve all of these issues (and there are even others), but that introduces
another factor into the networking equation. Most hosts attached to a network have
only one communications port active at any one time (the “network interface”). If an
“all-in-one” network application is using it, perhaps to download a music file, how can
another application use the same port for email? It can’t.
Besides the need to multiplex in various ways, another factor influencing layers
is that modern operating systems do not allow direct access to hardware. The need to
go through the operating system and multiplex the network interface leads to a centralization of the networking tasks in the end system.
Protocol layers make all of these issues easier to deal with, but they cannot be added
haphazardly. (You can still create a huge and ugly “layer” that implements everything
from hardware to transport to data representation, but it would work.) As important
as the layers are, the tasks and responsibilities assigned to those layers are even more
important.
Protocol Layers
Each layer has a separate function in the overall task of moving bits between
processes. These processes could be applications on separate systems, but on modern
systems a lot of process-to-process communication is not host-to-host. For example, a
lot of printer management software runs as a Web browser using a special loopback
TCP/IP address to interface with the process that gathered status information from the
printer.
As long as the boundary functions between adjacent layers are respected, layers
can be changed or even completely rewritten without having to change the whole
application. Layers can be combined for efficiency, “mixed-and-matched” from different
vendors, or customized for different circumstances, all without having to rework the
entire stack from top to bottom.
Nearly every layer has some type of multiplexing field to allow the receiver to
determine the type of payload, or content of the data unit at a particular layer. A key
point in networking is that the payload and control information at one layer is just a
“transparent” (meaningless) payload to the layer below. Transparent bits, as the name
implies, are passed unchanged to the next layer.
How can protocol layers work together? Introducing a bunch of new interfaces and
protocols seems to have made the networking task harder, not easier. There is a simple method called encapsulation that makes the entire architecture workable. What
is encapsulation? Think of the layers of the protocol suite in terms of writing a letter
and the systems that are involved in letter delivery. The letter goes inside an envelope
which is gathered with others inside a mailbag which is transported with others inside
a truck or plane. It sounds like a very complicated way to deliver one message, but
this system makes the overall task of delivering many messages easier, not harder. For
example, there now can be facilities that only deal with mailbags and do not worry
about an individual letter’s language or the transportation details.
THE TCP/IP PROTOCOL SUITE
The protocol stack used on the Internet is the Internet Protocol Suite. It is usually
called TCP/IP after two of its most prominent protocols, but there are other protocols as well. The TCP/IP model is based on a five-layer model for networking. From
bottom (the link) to top (the user application), these are the physical, data link, network, transport, and application layers. Not all layers are completely defined by the
model, so these layers are “filled in” by external standards and protocols. The layers
have names but no numbers, and although sometimes people speak of “Layer 2” or
“Layer 3,” these are not TCP/IP terms. Terms like these are actually from the OSI Reference Model.
The TCP/IP stack is open, which means that there are no “secrets” as to how it
works. (There are “open systems” too, but with TCP/IP, the systems do not have to be
“open” and often are not.) Two compatible end-system applications can communicate
regardless of their underlying architectures, although the connections between layers
are not defined.
The OSI Reference Model
The TCP/IP or Internet model is not the only standard way to build a protocol suite
or stack. The Open Systems Interconnection (OSI) reference model is a seven-layer model that loosely maps onto the five layers of TCP/IP. Until the Web became
widely popular in the 1990s, the OSI reference model, with distinctive names and
numbers for its layers, was proposed as the standard model for all communication
networks. Today, the OSI reference model (OSI-RM) is often used as a learning tool
to introduce the functions of TCP/IP.
The TCP/IP stack is composed of modules. Each module provides a specific
function, but the modules are fairly independent. The TCP/IP layers contain relatively
independent protocols that can be used depending on the needs of the system to
provide whatever function is desired. In TCP/IP, each higher layer protocol is supported by lower layer protocols. The whole collection of protocols forms a type of
hourglass shape, with IP in the middle, and more and more protocols up or down
from there.
Suite, Stack, and Model
The term “protocol stack” is often used synonymously with “protocol suite” as an
implementation of a reference model. However, the term “protocol suite” properly
refers to a collection of all the protocols that can make up a layer in the reference
model. The Internet protocol suite is an example of the Internet or TCP/IP reference model protocols, and a TCP/IP protocol stack implements one or more of
these protocols at each layer.
The TCP/IP Layers
The TCP/IP protocol stack models a series of protocol layers for networks and systems
that allows communications between any types of devices. The model consists of five
separate but related layers, as shown in Figure 1.9. The Internet protocol suite is based
on these five layers. TCP/IP says most about the network and transport layers, and a
lot about the application layer. TCP/IP also defines how to interface the network layer
with the data link and physical layers, but is not directly concerned with these two
layers themselves.
The Internet protocol suite assumes that a layer is there and available, so TCP/IP
does not define the layers themselves. The stack consists of protocols, not implementations, so describing a layer or protocol says almost nothing about how these things
should actually be built.
Not all systems on a network need to implement all five layers of TCP/IP. Devices
using the TCP/IP protocol stack fall into two general categories: a host or end system
(ES) and an intermediate node (often a router) or an intermediate system (IS). The intermediate nodes usually only involve the first three layers of TCP/IP (although many of them still have all five layers for other reasons, as we have seen).

FIGURE 1.9
The five layers of TCP/IP. Older models often show only four layers, combining the physical and data link layers. (Diagram: user application programs sit on top of the application layer, which is stacked on the transport, network, data link, and physical layers, with the network link(s) below.)
In TCP/IP, as with most layered protocols, the most fundamental elements of the
process of sending and receiving data are collected into the groups that become the
layers. Each layer’s major functions are distinct from all the others, but layers can
be combined for performance reasons. Each implemented layer has an interface with
the layers above and below it (except for the application and physical layers, of course)
and provides its defined service to the layer above and obtains services from the layer
below. In other words, there is a service interface between each layer, but these are not
standardized and vary widely by operating system.
TCP/IP is designed to be comprehensive and flexible. It can be extended to meet
new requirements, and has been. Individual layers can be combined for implementation
purposes, as long as the service interfaces to the layers remain intact. Layers can even
be split when necessary, and new service interfaces defined. Services are provided to
the layer above after the higher layer provides the lower layer with the command, data,
and necessary parameters for the lower layer to carry out the task.
Layers on the same system provide and obtain services to and from adjacent layers.
However, a peer-to-peer protocol process allows the same layers on different systems to
communicate. The term peer means every implementation of some layer is essentially
equal to all others. There is no “master” system at the protocol level. Communications
between peer layers on different systems use the defined protocols appropriate to the
given layer.
In other words, services refer to communications between adjacent layers on the same system, and protocols refer to communications between peer layers on different systems. This can be confusing, so these points deserve a closer look.
Protocols and Interfaces
It is important to note that when the layers of TCP/IP are on different systems, they
are only connected at the physical layer. Direct peer-to-peer communication between
all other layers is impossible. This means that all data from an application have to flow
“down” through all five layers at the sender, and “up” all five layers at the receiver to
reach the correct process on the other system. These data are sometimes called a service data unit (SDU).
Each layer on the sending system adds information to the data it receives from the
layer above and passes it all to the layer below (except for the physical layer, which
has no lower layers to rely on in the model and actually has to send the bits in a form
appropriate for the communications link used).
Likewise, each layer on the receiving system unwraps the received message, often
called a protocol data unit (PDU), with each layer examining, using, and stripping off
the information it needs to complete its task, and passing the remainder up to the next
layer (except for the application layer, which passes what’s left off to the application
program itself). For example, the data link layer removes the wrapper meant for it, uses
it to decide what it should do with this data unit, and then passes the remainder up to
the network layer.
FIGURE 1.10
Protocols and interfaces, showing how devices are only physically connected at the lowest layer (Layer 1). Note that functionally, intermediate nodes only require the bottom three layers of the model. (Diagram: Device A and Device B each run all five layers, with peer-to-peer protocols at Layers 4 and 5 operating end to end between them and service interfaces (4–5, 3–4, 2–3, 1–2) between adjacent layers; the two intermediate nodes carry only the network, data link, and physical layers, and all of the devices are joined by physical communication links.)
The whole interface and protocol process is shown in Figure 1.10. Although TCP/IP
layers only have names, layer numbers are also used in the figure, but only for illustration. (The numbers come from the OSI-RM.)
As shown in the figure, there is a natural grouping of the five-layer protocol stack
at the network layer and the transport layer. The lower three layers of TCP/IP, sometimes called the network support layers, must be present and functional on all systems,
regardless of the end system or intermediate node role. The transport layer links the
upper and lower layers together. This layer can be used to make sure that what was
sent was received, and what was sent is useful to the receiver (and not, for example,
a stray PDU misdirected to the host or unreasonably delayed).
The process of encapsulation makes the whole architecture workable. Encapsulation of one layer’s information inside another layer is a key part of how TCP/IP
works.
Encapsulation
Each layer uses encapsulation to add the information its peer needs on the receiving
system. The network layer adds a header to the information it receives from the transport layer at the sender and passes the whole unit down to the data link layer. At the receiver,
the network layer looks at the control information, usually in a header, in the data it
receives from the data link layer and passes the remainder up to the transport layer for
further processing. This is called encapsulation because one layer has no idea what the
structure or meaning of the PDU is at other layers. The PDU has several more or less
official names for the structure at each layer.
The exception to this general rule is the data link layer, which adds both a header
and a trailer to the data it receives from the network layer. The general flow of encapsulation in TCP/IP is shown in Figure 1.11. Note that on the transmission media itself
(or communications link), there are only bits, and that some “extra” bits are added by
the communication link for its own purposes. Each PDU at the other layers is labeled
as data for its layer, and the headers are abbreviated by layer name. The exception is the
second layer, the data link layer, which shows a header and trailer added at that level
of encapsulation.
Although the intermediate nodes are not shown, these network devices will only
process the data (at most) through the first three layers. In other words, there is no
transport layer to which to pass network-layer PDUs on these systems for data communications (management is another issue).
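To make the idea concrete, here is a toy sketch in Python of how each layer wraps the PDU it receives from the layer above; the header and trailer contents (“TH,” “NH,” “Hdr,” “Trl”) are just the placeholder labels from Figure 1.11, not real protocol formats:

    # A toy illustration of encapsulation: each layer wraps the PDU it
    # receives from the layer above. Header contents are made up here;
    # real TCP/IP headers are covered in later chapters.
    app_data = b"GET /index.html"                   # application layer data

    transport_pdu = b"TH" + app_data                # segment = TH + data
    network_pdu   = b"NH" + transport_pdu           # packet  = NH + segment
    frame         = b"Hdr" + network_pdu + b"Trl"   # frame = Hdr + packet + Trl

    # To the data link layer, everything between Hdr and Trl is just a
    # transparent payload; it neither knows nor cares about TH or NH.
    print(frame)   # b'HdrNHTHGET /index.htmlTrl'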
FIGURE 1.11
TCP/IP encapsulation and headers. The unstructured stream of bits represents frames with distinct content. (Diagram: on Device A, application layer data is wrapped with a transport header (TH), then a network header (NH), then a data link header (Hdr) and trailer (Trl), and finally sent as a stream of bits on the transmission media; Device B strips each wrapper in reverse order and passes the data to the application.)
THE LAYERS OF TCP/IP
TCP/IP is mature and stable, and is the only protocol stack used on the Internet. This
book is all about networking with TCP/IP, but it is easy to get lost in the particulars of
TCP/IP if some discussion of the general tasks that TCP/IP is intended to accomplish is
not included. This section takes a closer look at the TCP/IP layers, but only as a general
guide to how the layers work.
TCP/IP Layers in Brief
■ Physical Layer: Contains all the functions needed to carry the bit stream over a physical medium to another system.
■ Data Link Layer: Organizes the bit stream into a data unit called a “frame” and delivers the frame to an adjacent system.
■ Network Layer: Delivers data in the form of a packet from source to destination, across as many links as necessary, to non-adjacent systems.
■ Transport Layer: Concerned with process-to-process delivery of information.
■ Application Layer: Concerned with differences in internal representation, user interfaces, and anything else that the user requires.
The Physical Layer
The physical layer contains all the functions needed to carry the bit stream over a
physical medium to another system. Figure 1.12 shows the position of the physical layer
to the data link layer and the transmission medium. The transmission medium forms a
pure “bit pipe” and should not change the bits sent in any way. Now, transmission “on
the wire” might send bits through an extremely complex transform, but the goal is to
enable the receiver to reconstruct the bit stream exactly as sent. Some information in
the form of transmission framing can be added to the data link layer data, but this is
only used by the physical layer and the transmission medium itself. In some cases, the
transmission medium sends a constant idle bit pattern until interrupted by data.
Physical layer specifications have four parts: mechanical, electrical or optical,
functional, and procedural. The mechanical part specifies the physical size and shape of
the connector itself so that components will plug into each other easily. The electrical/
optical specification determines what value of voltage or line condition determines
whether a pin is active or what exactly represents a 0 or 1 bit. The functional specification specifies the function of each pin or lead on the connector (first lead is send,
second is receive, and so on). The procedural specification details the sequence of
actions that must take place to send or receive bits on the interface. (For Ethernet, the
send pair is activated, then a “preamble” is sent, and so forth.) The Ethernet twisted-pair interfaces from the IEEE are common implementations of the physical layer that include all these elements.
FIGURE 1.12
The physical layer. The transmission framing bits are used for transmission media purposes only, such as low-level control. (Diagram: the data link layers on two systems exchange a bit stream through their physical layers, which add transmission framing and connect over the transmission media “bit pipe.”)
There are other things that the physical layer must determine, or be configured to
expect.
Data rate—This transmission rate is the number of bits per second that can be
sent. It also defines the duration of a symbol on the wire. Symbols usually
represent one or more bits, although there are schemes in which one bit is
represented by multiple symbols.
Bit synchronization—The sender and receiver must be synchronized at the symbol level so that the number of bits expected per unit time is the same. In other
words, the sender and receiver clocks must be synchronized (timing is in the
millisecond or microsecond range). On modern links, the timing information is
often “recovered” from the received data stream.
Configuration—So far we’ve assumed simple point-to-point links, but this is not
the only way that systems are connected. In a multipoint configuration, a link
connects more than two devices, and in a multisystem bus/broadcast topology such as a LAN, the number of systems can be very high.
Topology—The devices can be arranged in a number of ways. In a full mesh topology, all devices are directly connected and one hop away, but this requires a
staggering amount of links for even a modest network. Systems can also be
arranged as a star topology, with all systems reachable through a central system.
There is also the bus (all devices are on a common link) and the ring (devices
are chained together, and the last is linked to the first, forming a ring).
Mode—So far, we’ve only talked about one of the systems as the sender and the
other as the receiver. This is operation in simplex mode, where a device can
only send or receive, such as with weather sensors reporting to a remote
weather station. More realistic devices use duplex mode, where all systems
can send or receive with equal facility. This is often further distinguished as
half-duplex (the system can send and receive, but not at the same time) and
full-duplex (simultaneous sending and receiving).
The Data Link Layer
Bits are just bits. With only a physical layer, System A has no way to tell System B, “Get ready for some bits,” “Here are the bits,” and “Did you get those bits okay?” The data link
layer solves this problem by organizing the bit stream into a data unit called a frame.
It is important to note that frames are the data link layer PDUs, and these are not the
same as the physical layer transmission frames mentioned in the previous section. For
example, network engineers often speak about T1 frames or SONET frames, but these
are distinct from the data link layer frames that are carried inside the T1 or SONET
frames. Transmission frames have control information used to manage the physical link
itself and have little to do directly with process-to-process communications. This “double-frame” arrangement might sound redundant, but many transmission frames originated with voice because digitized voice has no framing at the “data link” layer.
The data link layer moves bits across the link and can add reliability to the raw communications link. The data link layer can be very simple, or make the link appear error-free to the layer above, the network layer. The data link layer usually adds both a header
and trailer to the data presented by the network layer. This is shown in Figure 1.13.
FIGURE 1.13
The data link layer, showing that data link layer frames have both header and trailer. (Diagram: data handed down from the network layer is wrapped in a frame header and frame trailer before going to the physical layer; the receiver strips both and passes the data back up to the network layer.)

The frame header typically contains a source and destination address (known as the “physical address” since it refers to the physical communication port) and some control information. The control information is data passed from one data link layer to the other data link layer, and not user data. The body of the frame contains the sequence of
bits being transferred across the network. The trailer usually contains information used
in detecting bit errors (such as cyclical redundancy check [CRC]). A maximum size is
associated with the frame that cannot be exceeded because all systems must allocate
memory space (buffers) for the data. In a networking context, a buffer is just special
memory allocated for communications.
The data link layer performs framing, physical addressing, and error detection
(error correction is another matter entirely, and can be handled in many ways, such
as by resending a copy of the frame that had the errors). However, when it comes to
frame error detection and correction in the real world, error detection bits are sometimes ignored and frames that defy processing due to errors are simply discarded. This
does not mean that error detection and correction are not part of the data link layer
standards: It means that in these cases, ignoring and discarding are the chosen methods of implementation. In discard cases, the chore of handling the error condition is
“pushed up the stack” to a higher layer protocol.
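As one concrete example of trailer-based error detection, Ethernet's frame check sequence is a CRC-32. The sketch below uses Python's binascii.crc32, which implements the same polynomial; treating it as a stand-in for a real FCS is an assumption made only for illustration, since an actual adapter also worries about bit ordering and exactly where the check value sits in the frame:

    import binascii

    frame_body = b"payload bits between header and trailer"

    # Sender computes a CRC-32 over the frame contents and appends it
    # as the trailer.
    fcs = binascii.crc32(frame_body)
    frame = frame_body + fcs.to_bytes(4, "little")

    # Receiver recomputes the CRC over the body and compares it with
    # the trailer. A mismatch means the frame is simply discarded.
    body, trailer = frame[:-4], frame[-4:]
    ok = binascii.crc32(body) == int.from_bytes(trailer, "little")
    print("frame accepted" if ok else "frame discarded")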
This layer also performs access control (this determines whose turn it is to send
over or control the link, an issue that becomes more and more interesting as the
number of devices sharing the link grows). In LANs, this media access control (MAC)
forms a sublayer of the data link layer and has its own addressing scheme known (not
surprisingly) as the MAC layer address or MAC address. We’ll look at MAC addresses
in the next chapter. For now, it is enough to note that LANs such as Ethernet do not
have “real” physical layer addresses and that the MAC address performs this addressing
function.
In addition, the data link layer can perform some type of flow control. Flow control
makes sure senders do not overwhelm receivers: a receiver must have adequate time
to process the data arriving in its buffers. At this layer, the flow control, if provided, is
link-by-link. (We’ll see shortly that end-to-end—host-to-host—flow control is provided
by the transport layer.) LANs do not usually provide flow control at the data link layer,
although they can.
Not all destination systems are directly reachable by the sender. This means that
when bits at the data link layer are sent from an originating system, the bits do not always arrive at the destination system as the “next hop” along the way. Directly reachable systems
are called adjacent systems, and adjacent systems are always “one hop away” from the
sender. When the destination system is not directly reachable by the sender, one or
more intermediate nodes are needed. Consider the network shown in Figure 1.14.
Now the sender (System A) is not directly connected to the receiver (System B).
Another system, System 3, receives the frame and must forward it toward the
destination. This system is usually called a switch or router (there are even other names),
depending on internal architecture and network role. On a WAN (but not on a LAN),
this second frame is a different frame because there is no guarantee that the second
link is identical to the first. Different links need different frames. Identical frames are
only delivered to systems that are directly reachable, or adjacent, to the sender, such as
by an Ethernet switch on a LAN.
FIGURE 1.14
A more complex network. Note that the frames are technically different even if the same medium is used on both links. (Diagram: System A, the sender, sends “STUFF” for System B in a frame to System 3, an intermediate switch/router, which forwards it to System B, the receiver, in a different frame.)
FIGURE 1.15
Hop-by-hop forwarding of frames. The intermediate systems also have a Layer 3, but this is not shown in the figure for clarity. (Diagram: End System A, End System B, and End System C are connected through Intermediate Systems 1, 2, and 3; data link frames are forwarded hop by hop over the physical bits on each link.)
Networking with intermediate systems is called hop-by-hop delivery. A “hop” is the
usual term used on the Internet or a router network to indicate the forwarding of a
packet from one router to another (or between a host and a router). Frames can “hop”
between Layer 2 switches, but the term is most commonly used for Layer 3 router hops
(which can consist of multiple switch-to-switch frame “hops”). There can be more than
one intermediate system between the source and destination end systems, of course,
as shown in Figure 1.15. Consider the case where End System A is sending a bit stream
to End System C.
Note that the intermediate systems (routers) have two distinct physical and data link
layers, reflecting the fact that the systems have two (and often more) communication
links, which can differ in many ways. (The figure shows a typical WAN configuration
with point-to-point links, but routers on LANs, and on some types of public data service
WANs, can be deployed in more complicated ways.)
However, there is something obviously missing from this figure. There is no connection between the data link layers on the intermediate systems! How does the
router know to which output port and link to forward the data in order to ultimately
reach the destination? (In the figure, note that Intermediate System 1 can send data to
either Intermediate System 2 or Intermediate System 3, but only through Intermediate
System 3, which forwards the data, is the destination reachable.)
These forwarding decisions are made at the TCP/IP network layer.
The Network Layer
The network layer delivers data in the form of a packet from source to destination,
across as many links as necessary. The biggest difference between the network layer
and the data link layer is that the data link layer is in charge of data delivery between
adjacent systems (directly connected systems one hop away), while the network layer
delivers data to systems that are not directly connected to the source. There can be
many different types of data link and physical layers on the network, depending on the
variety of the link types, but the network layer is essentially the same on all systems,
end systems, and intermediate systems alike.
Figure 1.16 shows the relationship between the network layer and the transport
layer above and the data link layer below. A packet header is put in place at the sender
and interpreted by the receiver. A router simply looks at the packet header and makes
a forwarding decision based on this information. The transport layer does not play a
role in the forwarding decision.
FIGURE 1.16
The network layer. These data units are packets with their own destination and source address formats. (Diagram: data from the transport layer gains a packet header (NH) to form a network layer packet, which is handed to the data link layer; the receiver strips the header and passes the data back up to the transport layer.)
How does the network layer know where the packet came from (so the sender can
reply)? The key concept at the network layer is the network address, which provides
this information. In TCP/IP, the network address is the IP address.
Every system in the network receives a network address, whether an end system
or intermediate system. Systems require at least one network address (and sometimes
many more). It is important to realize that this network address is different from, and
independent of, the physical address used by the frames that carry the packets between
adjacent systems.
Why should the systems need two addresses for the two layers? Why can’t they
just both use either the data link (“physical”) address or the network address at
both layers? There are actually several reasons. First, LAN addresses like those used
in Ethernet come from one group (the IEEE), while those used in TCP/IP come
from another group (ICANN). Also, the IP address is universally used on the Internet, while there are many types of physical addresses. Finally, there is no systematic
assignment of physical addresses (and many addresses on WANs can be duplicates
and so have “local significance only”). On the other hand, IP network addresses are
globally administered, unique, and have a portion under which many devices are
grouped. Therefore, many devices can be addressed concisely by this network portion of the IP address.
A key issue is how the network addresses “map” to physical addresses, a process
known generally as address resolution. In TCP/IP, a special family of address resolution
protocols takes care of this process.
The network address is a logical address. Network addresses should be organized so
that devices can be grouped under a part of that address. In other words, the network
address should be organized in a fashion similar to a telephone number, for example,
212-555-1212 in the North American public switched telephone network (PSTN). The
sender need only look at the area code or “network” portion of this address (212) to
determine if the destination is local (area codes are the same) or needs to be sent to
an intermediate system to reach the 212 area code (source and destination area codes
differ).
For this scheme to work effectively, however, all telephones that share the 212 area
code should be grouped together. The whole telephone number beginning with 212
therefore means “this telephone in the 212 area code.” In TCP/IP, the network address
is the beginning of the device’s complete IP address. A group of hosts is gathered under
the network portion of the IP address. IP network addresses, like area codes, are globally administered to prevent duplication, while the rest of the IP address, like the rest
of the telephone number, is locally administered, often independently.
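The “area code” comparison can be sketched with Python's ipaddress module; the network prefix and destination addresses below are examples only:

    import ipaddress

    # This host's network ("area code"), an example prefix.
    local_net = ipaddress.ip_network("10.10.11.0/24")

    for dest in ("10.10.11.66", "10.10.12.77"):
        if ipaddress.ip_address(dest) in local_net:
            print(dest, "is local: deliver directly on this link")
        else:
            print(dest, "is remote: send to a router (intermediate system)")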
In some cases, the packet that arrives at an intermediate system inside a frame is too
large to fit inside the frame that must be sent out. This is not uncommon: different link
and LAN types have different maximum frame sizes. The network layer must be able
to fragment a data unit across multiple frames and reassemble the fragments at the
destination. We’ll say more about fragmentation in a later chapter.
FIGURE 1.17
Source-to-destination delivery at the network layer. The intermediate systems now have all three required layers. (Diagram: the same topology as Figure 1.15, but each intermediate system now also has a network layer; packets are delivered end to end while data link frames are still forwarded hop by hop over the physical bits on each link.)
The network layer uses one or more routing tables to store information about
reachable systems. The routing tables must be created, maintained, and purged of old
information as the network changes due to failures, the addition or deletion of systems
and links, or other configuration changes. This whole process of building tables to pass
data from source to destination is called routing, and the use of these tables for packet
delivery is called forwarding. The forwarding of packets inside frames always takes
place hop by hop. This is shown in Figure 1.17, which adds the network layer to the
data link layers already present and distinguishes between hop-by-hop forwarding and
end-to-end delivery.
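To make the distinction concrete, forwarding can be sketched as a longest-prefix-match lookup in a table. The routing table and next-hop values below are invented for illustration, using Python's ipaddress module; real routers use far more specialized data structures:

    import ipaddress

    # A made-up routing table: destination prefix -> next hop.
    routing_table = {
        ipaddress.ip_network("10.10.11.0/24"): "direct",
        ipaddress.ip_network("10.10.0.0/16"):  "192.168.4.1",
        ipaddress.ip_network("0.0.0.0/0"):     "192.168.0.1",   # default route
    }

    def next_hop(dest):
        # Forwarding: pick the matching prefix with the longest mask.
        dest = ipaddress.ip_address(dest)
        matches = [net for net in routing_table if dest in net]
        best = max(matches, key=lambda net: net.prefixlen)
        return routing_table[best]

    print(next_hop("10.10.11.177"))   # direct
    print(next_hop("10.10.12.52"))    # 192.168.4.1
    print(next_hop("192.0.2.1"))      # 192.168.0.1 (default)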
On the Internet, the intermediate systems that act at the packet level (Layer 3)
are called routers. Devices that act on frames (Layer 2) are called switches, and some
older telephony-based WAN architectures use switches as intermediate network nodes.
Whether a node is called a switch or a router depends on how it functions internally.
In a very real sense, the network layer is at the very heart of any protocol stack, and
TCP/IP is no exception. The protocol at this layer is IP, either IPv4 or IPv6 (some think
that IPv6 is distinct enough to be known as TCPv6/IPv6).
The Transport Layer
Process-to-process delivery is the task of the transport layer. Getting a packet to the
destination system is not quite the same thing as determining which process should
receive the packet’s content. A system can be running file transfer, email, and other
network processes all at the same time, and all over a single physical interface. Naturally, the destination has to know which process on the sender originated the bits inside the packet in order to reply. Also, systems cannot simply transfer a huge
multimegabit file all in one packet. Many data units exceed the maximum allowable
size of a packet.
This process of dividing message content into packets is known as segmentation. The
network layer forwards each and every packet independently, and does not recognize
any relationship between the packets. (Is this a file transfer or email packet? The network layer does not care.) The transport layer, in contrast, can make sure the whole
message, often strung out in a sequence of packets, arrives in order (packets can be
delivered out of sequence) and intact (there are no errors in the entire message). This
function of the transport layer involves some method of flow control and error control (error detection and error correction) at the transport layer, functions which are
absent at the network layer. The transport-layer protocol that performs all of these
functions is TCP.
The transport-layer protocol does not have to do any of this, of course. In many
cases, the content of the packet forms a complete unit all by itself, called a datagram.
(The term “datagram” is often used to refer to the whole IP packet, but not in this book.)
Self-contained datagrams are not concerned with sequencing or flow control, and these
functions are absent in the User Datagram Protocol (UDP) at the transport layer.
So there are two very popular protocol packages at the transport layer:
■ TCP—This is a connection-oriented, “reliable” service that provides ordered delivery of packet contents.
■ UDP—This is a connectionless, “unreliable” service that does not provide ordered delivery of packet contents.
In addition to UDP and TCP, there are other transport-layer protocols that can be used
in TCP/IP, all of which differ in terms of how they handle transport-layer tasks. Developers are not limited to the standard choices for applications. If neither TCP nor UDP
nor any other defined transport-layer service is appropriate for your application, you
can write your own transport-layer protocol and get others to adopt it (or use your
application package exclusively).
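The difference shows up directly in the standard sockets interface. A minimal sketch in Python follows; the destination address and port numbers are arbitrary examples, not anything assigned in the text:

    import socket

    # TCP: connection-oriented. A connection is established first, and the
    # stack takes care of ordering, acknowledgment, and retransmission.
    tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tcp.connect(("10.10.12.77", 5001))
    tcp.sendall(b"part of an ordered byte stream")
    tcp.close()

    # UDP: connectionless. Each datagram is self-contained and sent with no
    # connection setup, no ordering, and no delivery guarantee.
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.sendto(b"a self-contained datagram", ("10.10.12.77", 5002))
    udp.close()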
FIGURE 1.18
The transport layer, showing how data are broken up if necessary and reassembled at the destination. (Diagram: a chunk of data from the application layer is split into numbered segments, each with a transport header (TH), handed to the network layer at the sender, and reassembled into the original chunk at the receiver.)
In TCP/IP, it is often said that the network layer (IP itself) offers an “unreliable” or
“best effort” service, while the transport layer adds “reliability” in the form of flow and
error control. Later in this book, we’ll see why these terms are unfortunate and what
they really mean.
The network layer gets a single packet to the right system, and the transport
layer gets the entire message to the right process. Figure 1.18 shows the transport
layer breaking up a message at the sender into three pieces (each labeled “TL data” for
transport-layer data and “TH” for transport-layer header). The figure then shows the
transport layer reassembling the message at the receiver from the various segments that
make up a message. In TCP/IP, there are also data units known as datagrams, which are
always handled as self-contained units. There are profound differences between how
the transport layer treats segments and datagrams, but this figure is just a general illustration of segment handling.
The functions that the transport layer, which in some protocols is called the end-to-end layer, might have to perform include the following:
Process addressing and multiplexing—Also known as “service-point addressing,”
the transport layer has to decide which process originated the message and to
which process the message must be delivered. These are also known as port
addresses in TCP/IP. Port addresses are an important portion of the application
socket in TCP/IP.
Segment handling—In cases where each message is divided into segments, each
segment has a sequence number used to put the message back together at the
destination. When datagrams are used, each data unit is handled independently
and sequencing is not necessary.
FIGURE 1.19
Reliable process-to-process delivery with the transport layer. (Diagram: a process on System A communicates with a process on System B across an internetwork such as the Internet; the network layer provides end-to-end delivery between the systems, while the transport layer provides process-to-process delivery.)
Connection control—The transport layer can be connectionless or connection-oriented (in fact, several layers can operate in either one of these ways).
Connectionless (CL) layers treat every data unit as a self-contained, independent
unit. Connection-oriented (CO) layers go through a three-phase process every
time there is data to send to a destination after an idle period (connection
durations can vary). First, some control messages establish the connection,
then the data are sent (and exchanged if replies are necessary), and finally the
connection is closed. Many times, a comparison is made between a telephone
conversation (“dial, talk, hang up”) with connections and an intercom (“push
and talk any time”) for connectionless communications, but this is not precise.
Generally, segments are connection-oriented data units, and datagrams are connectionless data units.
Flow control—Just as with the data link layer, the transport layer can include flow
control mechanisms to prevent senders from overwhelming receivers. In this
case, however, the flow control is end-to-end rather than link-by-link. Datagrams do not require this service.
Error control—This is another function that can be performed at the data link
layer, but again end-to-end at the transport layer rather than link-by-link. Communications links are not the only source of errors, which can occur inside a
system as well. Again, datagrams do not require this service.
Figure 1.19 shows the relationship between the network layer and transport layer
more clearly. The network layer operates from network interface to network interface,
while the transport layer is more specific and operates from process to process.
The Application Layer
It might seem that once data are transferred from end-system process to end-system
process, the networking task is pretty much complete. However, there is a lot that still needs
to be done at the application level itself. In models of protocol stacks, it is common
to place another layer between the transport layer and the user, the application layer.
However, the TCP/IP protocol stack really stops at the transport layer (where TCP and
UDP are). It is up to the application programmer to decide what should happen at the
client and server level at that point, although there are individual RFCs for guidance,
such as for FTP.
Although it is common to gather these TCP/IP applications into their own layer,
there really is no such thing in TCP/IP as an application layer to act as some kind of
“glue” between the application’s user and the network.
In nearly all TCP/IP stacks, the application layer is part of the application process.
In spite of the lack of a defined layer, a TCP/IP application might still have a lot to do,
and in some ways the application layer is the most complex “layer” of all.
There are two major tasks that the application often needs to accomplish: session
support and conversion of internal representation. Not all applications need both, of
course, and some applications might not need either, but this overview includes both
major functions.
Session Support
A session is a type of dialog controller between two processes that establishes, maintains, and synchronizes (controls) the interaction (dialog). A session decides if the communication can be half-duplex (both ends take turns sending) or full-duplex (both
ends can send whenever they want). It also keeps a kind of “history” of the interaction
between endpoints, so that when things go wrong or when the two communicate
again, some information does not have to be resent.
In practical terms, the session consists of all “state variables” necessary to construct
the history of the connection between the two devices. It is more difficult, but not
impossible, to implement sessions in a connectionless environment because there is
no easy way to associate the variables with a convenient label.
Internal Representation Conversion
The role of internal representation conversion is to make sure that the data exchange
over the network is useful to the receivers. If the internal representation of data differs on the two systems (integer size, bit order in memory, etc.), the application layer
translates between the formats so the application program does not have to. This layer
can also provide encryption and compression functions, although it is more common
to implement these last two functions separately from the network.
Standard protocol specifications can use the Abstract Syntax Notation 1 (ASN.1)
definitions for translation purposes. ASN.1 can be used in programming, network
management, and other places. ASN.1 defines various things such as which bit is “first
on the wire” regardless of how it is stored internally, how many bits are to be sent for
the numbers 0 through 255 (8), and so on. Everything can be translated into ASN.1, sent
across the network, and translated back to whatever internal format is required at the
destination.
The role of internal representation conversion is shown in Figure 1.20. The figure
shows four sequential memory locations, each storing the letter “a” followed by the
integer 259. Note that not only are there differences between the amount of memory
addressed at once, but also in the order of the bits for numerics.

FIGURE 1.20
Internal representation differences. Integers can have different bit lengths and can be stored differently in memory. (Diagram: Architecture A and Architecture B both hold the text “a” and the integer 259 in sequential memory locations, but the width of each location and the order of the bytes 00000001 and 00000011 differ between the two.)
In some protocol stacks, the application program can rely on the services of a fully
functional conversion for internal representation to perform these services. However,
in TCP/IP, every network application program must do these things for itself.
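One common do-it-yourself convention is to put multi-byte values into “network byte order” (big-endian) before sending them; Python's struct module can do this regardless of the local machine's internal representation. A minimal sketch, reusing the integer 259 from Figure 1.20:

    import struct

    # Pack a 16-bit integer in network byte order ("!" = big-endian),
    # no matter how this machine stores integers internally.
    wire = struct.pack("!H", 259)
    print(wire.hex())        # 0103 on every platform

    # The receiver unpacks it back into its own internal representation.
    (value,) = struct.unpack("!H", wire)
    print(value)             # 259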
Applications in TCP/IP
TCP/IP does not provide session or presentation services directly to an application.
Programmers are on their own, but this does not mean they have to create everything
from scratch. For example, applications can use a character-based presentation service called the Network Virtual Terminal (NVT), part of the Internet’s telnet remote
access specification. Other applications can use Sun’s External Data Representation
(XDR) or IBM’s (and Microsoft’s) NetBIOS programming libraries for presentation
services. In this respect, there are many presentation layer services that TCP/IP can
use, but there is no formal presentation service standard in TCP/IP that all applications must use.
Host TCP/IP implementations typically provide a range of applications that provide
users with access to the data handled by the transport-layer protocols. These applications use a number of protocols that are not part of TCP/IP proper, but are used
with TCP/IP. These protocols include the Hyper-Text Transfer Protocol (HTTP) used by
Web browsers, the Simple Mail Transfer Protocol (SMTP) used for email, and many
others.
FIGURE 1.21
TCP/IP applications, showing how multiple applications can all share the same network connection. (Diagram: application data from SMTP, HTTP, NVT, and other applications all become the content of segments or datagrams handed to the transport layer at the sender, and are delivered back to the proper application at the receiver.)
In TCP/IP, the application protocol, the application service, and the user application
itself often share the same name. The file transfer protocol in TCP/IP, called FTP, is at
once an application protocol, an application service, and an application run by a user.
It can sometimes be confusing as to just which aspect of FTP is under discussion.
The role of TCP/IP applications is shown in Figure 1.21. Note that this “layer” sits on
top of the TCP/IP protocol stack and interfaces with programs or users directly.
Some protocols provide separate layers for sessions, internal representation
conversion, and application services. In practice, these are seldom implemented
independently. It just makes more sense to bundle them together by major application,
as in TCP/IP.
THE TCP/IP PROTOCOL SUITE
To sum up, the five layers of TCP/IP are physical, data link, network, transport, and
application. The TCP/IP stack is a hierarchical model made up of interactive modules. Each module provides a specific function. In TCP/IP, the layers contain relatively independent protocols that can be “mixed and matched” depending on the
needs of the system to provide whatever function is desired. TCP/IP is hierarchical
in the sense that each higher layer protocol is supported by one or more lower layer
protocols.
Figure 1.22 maps some of the protocols used in TCP/IP to the various layers of TCP/IP.
Every protocol in the figure will be discussed in this book, most in chapters all their own.
FIGURE 1.22
TCP/IP protocols and layers. Note the position of some protocols between layers. (Diagram: the application layer includes FTP, SSH, SMTP, DNS, SNMP, HTTP, TFTP, DHCP, and others; the transport layer includes TCP and UDP; the network layer includes IPv4, IPv6, IPSec, and IP NAT, along with IP support protocols such as ICMPv4, ICMPv6, and Neighbor Discovery, routing protocols such as RIP, OSPF, and BGP, and ARP and RARP sitting between the network and data link layers; the data link and physical layers are determined by the underlying network and include SLIP and PPP.)
With few exceptions, the TCP/IP protocol suite does not really define any low-level
protocols below the network layer. TCP/IP usually specifies how to put IP packets into
frames and how to get them out again. Many RFCs define IP mapping into these lower-layer protocols. We’ll talk more about this mapping process in Chapter 2.
QUESTIONS FOR READERS
Refer to Figure 1.23 to help you answer the following questions.
FIGURE 1.23
Summary of layered communications. (Diagram: as in Figure 1.10, Device A and Device B run all five layers and are connected through two intermediate nodes that carry only the network, data link, and physical layers; the application layers address representation differences, the transport layers provide process-to-process communication, and the physical communication links support communication between the peer processes.)
1. What are the differences between network-layer delivery and transport-layer
delivery?
2. What are the main characteristics of a peer-to-peer process?
3. What are port addresses, logical addresses, and physical addresses?
4. What are the functions of the data link layer in the Internet model?
5. Which two major types of services can be provided at the application “layer”?
CHAPTER 2
TCP/IP Protocols and Devices
What You Will Learn
In this chapter, you will learn more about the TCP/IP protocol stack and the tools
used in this book to investigate the Illustrated Network. We’ll look at more details of
TCP/IP and explore how TCP/IP devices provide internetworking from LAN to LAN.
You will learn about the types of devices used to connect LANs (such as
bridges and routers) and conclude with the concept of VLANs and Metro Ethernet
services.
The LANs on the Illustrated Network, including the LAN in the home office, are
connected using routers as the network nodes. Each LAN forms a discrete network by
itself, with its own clients and servers. When previously separate LANs are connected,
or a previously complete LAN is segmented, the result is often called an internetwork.
Routers can be used to build an internetwork of LANs, but this is not the only way.
Routers operate at the packet layer (Layer 3 of the TCP/IP model), and LANs can be
linked or segmented at other layers of a protocol stack as well. Some routers can also
function at these other layers, as the routers on the Illustrated Network can (i.e., routers often include functions other than pure routing). However, in many cases, different
devices are used to link and segment LANs, devices that are not really routers at all.
This chapter will take a closer look at the Illustrated Network in several areas. First,
we’ll take a closer look at the individual layers and protocols that make up the TCP/IP
protocol stack. Then, we’ll investigate how devices handle internetworking from LAN
to LAN at each protocol layer. Finally, we’ll describe some other devices or methods
that can be used between LANs, ending with a concept known as a virtual LAN or
VLAN. VLANs are used by service providers to support a service known as Metropolitan Ethernet or Metro Ethernet.
Figure 2.1 shows the areas of the Illustrated Network we will be investigating in this
chapter. The protocol stacks and layers run mainly on the host clients and servers, so
the devices on the two LANs are shaded, along with the customer edge routers. We’ll
also mention the Gigabit Ethernet links and a Metro Ethernet, so those are highlighted
as well.
FIGURE 2.1
Internetworking on the Illustrated Network LAN. Note that there are two geographically separate LANs in New York and Los Angeles that must communicate. (Diagram: LAN1, the Los Angeles office, connects bsdclient, lnxserver, wincli1, and winsvr1 through an Ethernet LAN switch to customer edge router CE0 and on to Ace ISP, AS 65459; LAN2, the New York office, connects bsdserver, lnxclient, winsvr2, and wincli2 through an Ethernet LAN switch to customer edge router CE6 and on to Best ISP, AS 65127; the two ISPs reach each other across the global public Internet. Each host is labeled with its interface, IPv4 address, MAC address, and link-local IPv6 address. Solid rules = SONET/SDH, dashed rules = Gigabit Ethernet; all links use 10.0.x.y addressing, and only the last two octets are shown.)
Each host in Figure 2.1 has three types of addresses associated with the interface
connected to the LAN. The first is the IPv4 address. For example, the LAN interface on
host lnxserver is eth0 and the IPv4 address is 10.10.11.66. The next address is the
hardware address, or MAC address on a LAN: 00:d0:b7:1f:fe:e6. Finally, each host
lists the link-local IPv6 address based on this MAC address, or fe80::2d0:b7ff:fe1f:
fee6 for lnxserver. We’ll talk more about IPv4 and IPv6 addressing and packets in
Chapters 4 through 6.
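The link-local IPv6 address is not arbitrary: it is derived from the MAC address by the modified EUI-64 procedure (flip the universal/local bit of the first octet, insert ff:fe in the middle, and prefix fe80::/64). A sketch of that derivation, checked against lnxserver's addresses above:

    import ipaddress

    def link_local_from_mac(mac):
        # Modified EUI-64: flip bit 1 of the first octet, insert ff:fe
        # between the third and fourth octets, and prepend fe80::/64.
        octets = [int(x, 16) for x in mac.split(":")]
        octets[0] ^= 0x02
        eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
        suffix = int.from_bytes(bytes(eui64), "big")
        return ipaddress.IPv6Address((0xFE80 << 112) | suffix)

    print(link_local_from_mac("00:d0:b7:1f:fe:e6"))
    # fe80::2d0:b7ff:fe1f:fee6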
PROTOCOL STACKS ON THE ILLUSTRATED NETWORK
LANs on the Illustrated Network send and receive frames, mainly Ethernet II frames.
Inside the frames are the packets that flow from source to destination. The packets, and
the messages inside the packets, are formatted according to the individual protocols
that make up the TCP/IP protocol stack.
What major TCP/IP protocols are used on the Illustrated Network? Ethereal has a
convenient summary screen that displays whenever Ethereal is capturing packets. Let's run Ethereal on wincli2 and see what kind of protocols we capture when we remotely access router CE6 and find the IP address associated with winsvr1. The summary screen is shown in Figure 2.2.
FIGURE 2.2
Ethereal capture summary, showing the number of packets used by different protocols. Often a very few types predominate.
FIGURE 2.3
Ethereal protocol hierarchy statistics. We'll be working almost exclusively with Ethernet frames on the Illustrated Network, but not always.
Most of the packets we have captured contain TCP. There are a couple from the
User Datagram Protocol (UDP) and Address Resolution Protocol (ARP). The relationship between Ethernet II frames, IP packets, and these protocols is clearer when we
look at the Ethereal protocol hierarchy statistics screen, as shown in Figure 2.3.
It is easy to see in the figure that all of the frames are Ethernet (II) frames, and that
all but 3 of the 73 packets captured are IP packets. The 70 IP packets include 67 TCP
packets and 3 UDP packets. We’ll explore more about how all of these protocols fit
together in this chapter.
LAYERS, PROTOCOLS, PORTS, AND SOCKETS
We’ll take a closer look at frames in Chapter 3. For now, all we need to know is that
layered protocols like TCP/IP function in a specific way. Frames are sent on LANs and
inside the frame are packets. The packets carry the information from device to device.
This information can be application data, but there are also packets that perform control and administrative tasks as well as data transfer.
Layering is not a magical solution to network protocol implementation. There
is usually only one network interface on a host, so all applications must share this
common interface, which has the network (IP) address. But how are arriving packets
distributed to the proper application? The packets are all for this IP address, but which
application layer process gets the information inside the packet?
The transport-layer protocol that should process the information inside the packet
is indicated by the value in the protocol field of the IPv4 header. (We’ll talk about IPv4
now, and detail the fields in the IPv4 and IPv6 headers in a later chapter.)
Inside the transport layer data unit, the receiving application is indicated by the
port number in the transport layer header (again, we’ll discuss these header fields in
full in later chapters). By looking at the protocol and port fields, the TCP/IP stack at the
destination knows which application gets the information. If two applications try to
use the same port at the same time, this is an error condition.
Another important application layer concept in TCP/IP is the socket. A socket is the
combination of the IP address and port number. We’ve already seen that this combination will uniquely identify an application. The socket is also the way that programmers
often write networking applications, using the socket as a kind of entry point to the
other layers of the protocol stack. Often, sockets are built into the application programming interface (API).
An API is an important part of the application layer interface, but not all APIs are
socket-based. Sockets are not even tied to the protocols themselves. Sockets and ports
are important enough in TCP/IP to merit a detailed examination in a later chapter
of this book. For now, we’ll just look where the port number is carried and how the
socket identifier is determined.
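A tiny demonstration may help make the port idea concrete before we look at packets on the wire. The following Python sketch is not part of the Illustrated Network toolkit; the loopback address and the port numbers 55001 and 55002 are arbitrary choices for the example. It binds two UDP sockets on the same IP address and shows that each arriving datagram is handed only to the socket bound to its destination port.

# Hypothetical sketch: two "applications" sharing one IP address,
# told apart only by the destination port number.
import socket

app_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
app_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
app_a.bind(("127.0.0.1", 55001))   # "application A" listens on port 55001
app_b.bind(("127.0.0.1", 55002))   # "application B" listens on port 55002

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"for A", ("127.0.0.1", 55001))
sender.sendto(b"for B", ("127.0.0.1", 55002))

# Each datagram is delivered only to the socket bound to its destination port.
print(app_a.recvfrom(1024))   # (b'for A', ('127.0.0.1', <sender's port>))
print(app_b.recvfrom(1024))   # (b'for B', ('127.0.0.1', <sender's port>))
for s in (sender, app_a, app_b):
    s.close()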
How can we find the port and socket in an IP packet inside an Ethernet frame? Let’s
use Ethereal to find them.
First, we’ll use a little “echo” client and server utility on the Linux hosts to generate
the frames for this exercise. (Note: This “echo” utility is not the same as the /bin/echo
program on Linux systems.) We can invoke the server on the lnxserver host and use
the client to send a simple string to be echoed back by the server process. We’ll use
Tethereal (the text version of Ethereal) this time, just to show that the same information
is available in either the graphical or text-based version.
First, we’ll run the Echo server process, which normally runs on port 7, on port 55555.
This will help us easily locate the data we are looking for in the Ethereal capture.
[root@lnxserver admin]# ./echo 55555
We have to run Tethereal on each end as well, if we want to compare frames. The
command is the same on the client and server. We’ll use the verbose (–V) switch to see
the MAC layer information as packets arrive.
[root@lnxclient admin]# /usr/sbin/tethereal –V
Capturing on eth0
Now we can invoke the Echo client to bounce the string TESTING123 off the server process.
[root@lnxclient admin]# ./echo 10.10.11.66 TESTING123 55555
Received: TESTING123
[root@lnxclient admin]#
What did we get? Let’s look at the frames leaving the client. We only need to examine
the information pertaining to the port and socket. Only one of the frames captured is
shown.
[root@lnxclient admin]# /usr/sbin/tethereal –V
Capturing on eth0
. . .
Frame 4 (52 bytes on wire, 52 bytes captured)
Arrival Time: May 16, 2008 13:32:59.702046000
Time delta from previous packet: 57.243134000 seconds
Time relative to first packet: 62.239970000 seconds
Frame Number: 4
Packet Length: 52 bytes
Capture Length: 52 bytes
Ethernet II, Src: 00:b0:d0:45:34:64, Dst: 00:05:85:8b:bc:db
Destination: 00:05:85:8b:bc:db (Juniper__8b:bc:db)
Source: 00:b0:d0:45:34:64 (Dell_45:34:64)
Type: IP (0x0800)
Internet Protocol, Src Addr: 10.10.12.166 (10.10.12.166), Dst Addr: 10.10.11.66
(10.10.11.66)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
0000 00.. = Differentiated Services Codepoint: Default (0x00)
.... ..0. = ECN-Capable Transport (ECT): 0
.... ...0 = ECN-CE: 0
Total Length: 38
Identification: 0x0000
Flags: 0x04
.1.. = Don't fragment: Set
..0. = More fragments: Not set
Fragment offset: 0
Time to live: 64
Protocol: UDP (0x11)
Header checksum: 0x0ecc (correct)
Source: 10.10.12.166 (10.10.12.166)
Destination: 10.10.11.66 (10.10.11.66)
User Datagram Protocol, Src Port: 32825 (32825), Dst Port: 55555 (55555)
Source port: 32825 (32825)
Destination port: 55555 (55555)
Length: 18
Checksum: 0x1045 (correct)
Data (10 bytes)
0000 54 45 53 54 49 4e 47 31 32 33 TESTING123
Table 2.1 Port and Sockets

Value        lnxclient             lnxserver
IP address   10.10.12.166          10.10.11.66
Port         32825                 55555
Socket       10.10.12.166:32825    10.10.11.66:55555
Let’s look at the fields that are emphasized. First, we have captured an Ethernet II
frame with an IPv4 packet inside. The frame’s type field value of 0x800 determines this.
In the IP packet, the message from the client to the server, which starts on the next
line, the source address is 10.10.12.166 (lnxclient) and the destination address is
10.10.11.66 (lnxserver), as they should be.
We can ignore the rest of the IP header fields for now, and skip down to where the
source and destination port are highlighted. The source port chosen by the client is
32825 and the port on the server that will receive the data is 55555. We decided that
55555 would be the server port, and the client chose a port number to use based on
certain rules, which we will talk about in a later chapter.
Now that we know the IP addresses and ports used, we can determine the socket
at each host. This is shown in Table 2.1.
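The same exchange can be approximated in a few lines of Python with the standard socket module. This is only a sketch to show where the values in Table 2.1 come from; it is not the echo utility used above, and it uses the loopback address as a stand-in for the real host addresses (10.10.11.66 on lnxserver, and whatever ephemeral port the operating system picks on lnxclient).

# Minimal UDP echo sketch; run run_server_once() first, then run_client().
import socket

SERVER = ("127.0.0.1", 55555)        # stand-in for 10.10.11.66:55555

def run_server_once():
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind(SERVER)                          # the server socket: IP address plus port 55555
    data, client = srv.recvfrom(1024)         # client = (IP, ephemeral port), e.g., :32825
    srv.sendto(data, client)                  # echo the datagram back
    srv.close()

def run_client():
    cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    cli.sendto(b"TESTING123", SERVER)
    print("client socket:", cli.getsockname())  # port chosen by the operating system
    reply, _ = cli.recvfrom(1024)
    print("Received:", reply.decode())
    cli.close()

Run the two functions in separate terminals (the server first); on the real network the client would simply point at 10.10.11.66 instead of the loopback address.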
THE TCP/IP PROTOCOL STACK
The layering of TCP/IP is important if IP packets are to run on almost any type of
network. The IP packet layer is only one layer, and from the TCP/IP perspective, the
layer or layers below the IP layer are not as important as the overall flow of packets
from one host (end system) to another across the network.
Layering means that you only have to adapt one type of packet to an underlying network type to get the entire TCP/IP suite. Once the packet has been “framed,” you need
not worry about TCP/UDP, or any other protocol: they come along for the ride with the
layering. Only the IP layer has to deal with the underlying hardware.
All that really matters is that the device at the receiving end understands the type of
IP packet encapsulation used at the sending end. If only one form of packet encapsulation was used, the IP packets could remain inside the frame with a globally unique MAC
address from source to destination. Network nodes could forward the frame without
having to deal with the packet inside. We’ll talk more about the differences between
forwarding frames and forwarding packets later on in this book.
TCP/IP is considered to be a peer protocol stack, which means that every implementation of TCP/IP is considered to have the same capabilities as every other. There are
no “restricted” or “master” versions of TCP/IP that anyone need be concerned about. So,
for example, there is no special server stack needed.
However, this does not mean that all protocol stacks function in precisely the same
way. TCP/IP, like many other protocol stacks, is implemented according to a model
known as the client–server model.
THE CLIENT–SERVER MODEL
The hosts that run TCP/IP usually fall into one of two major categories: The host
could be a client or the host could be a server. However, this is mostly an application-layer model issue because most computers are fully multitasking-capable today. It
is possible that the same host could be running the client version of a program for
one application (e.g., the Web browser) and the server version of another program
(e.g., a file transfer server) at the same time. Dedicated servers are most common
on the Internet, but almost all client computers can act as servers for a variety of
applications. The details are not as important as the interplay among layers and
applications.
Peer-to-Peer Models
The client–server model is not the only way to implement a protocol stack. Many
applications implement a peer-to-peer model. Peer applications have exactly the
same capabilities whether used as a client or as a server. Distributed file-sharing
systems on the Internet typically function as both client (fetching files for the
user) and as a server (allowing user files to be shared by others).
The differences between client–server and peer-to-peer models are mainly application layer differences. A desktop computer that runs a Web browser and has file
sharing turned on is both client and server, but that alone does not make it peer-to-peer. As an aside,
in X-windows, which is not discussed in this book, the terms “client” and “server”
are actually reversed and users sit in front of “X-servers” and access “X-clients.”
TCP/IP LAYERS AND CLIENT–SERVER
TCP/IP has five layers. The bottom two layers are the physical layer and the underlying data link (network access) layer. The underlying network technologies at this layer are the topic of the next chapter. Above the data link layer is the IP layer itself. The IP layer forms and
routes the IP packet (also called a datagram in a lot of documentation) and IP is the
major protocol at this layer.
The transport layer of TCP/IP consists of two major protocols: the Transmission
Control Protocol (TCP) and the User Datagram Protocol (UDP). TCP is a reliable layer
added on top of the best-effort IP layer to make sure that even if packets are lost in
transit, the hosts will be able to detect and resend missing information. TCP data units
are called segments. UDP is as best-effort as IP itself, and UDP data units are called
datagrams. The messages that applications exchange are made up of strings of segments or datagrams. Segments and datagrams are used to chop up application content,
such as huge, multimegabyte files, into more easily handled pieces.
TCP is reliable in the sense that TCP always resends corrupt or lost segments. This
strategy has many implications for delay-sensitive applications such as voice or video.
TCP is a connection-oriented layer on top of the connectionless IP layer. This means
that before any TCP segment can be sent to another host, a TCP connection must be
established to that host. Connectionless IP has no concept of a connection, and simply
forwards packets without any understanding if the packets ever really got where they
were going.
In contrast to TCP, UDP is a connectionless transport layer on top of connectionless
IP. UDP datagrams are simply forwarded to a destination under the assumption that sooner or later a response will come back from the remote host. The response forms an implied or formal acknowledgment that the UDP datagram arrived.
At the top of the TCP/IP stack is the application, or application services, layer. This is
where the client–server concept comes into play. The applications themselves typically
come in client or server versions, which is not true at other layers of TCP/IP. While a
host computer might be able to run client processes and server processes at the same
time, in the simplest case, these processes are two different applications.
Client–server application implementation can be extremely simple. A server process
can start and basically sit and “listen” for clients to “talk” to the server. For example, a
Web server is brought up on a host successfully whether there is a browser client
pointed at it or not. The Web server process issues a passive open to TCP/IP and essentially remains idle on the network side until some client requests content. However,
the Web browser (the client) process issues an active open to TCP/IP and attempts to send packets to a Web site immediately. If the Web site is not reachable, that causes an error condition.
FIGURE 2.4
The TCP/IP protocol stack in detail. The many possible applications on top and many possible network links on the bottom all funnel through the IP "hourglass." (The figure shows client–server applications such as FTP, SMTP, SSH, and NFS running over connection-oriented, reliable TCP, and SNMP, DNS, some routing protocols, and other client–server applications running over connectionless, best-effort UDP, with ICMP and ARP alongside best-effort IP, all above the network access and physical layer, such as Ethernet LANs. In some instances, NFS and DNS use TCP.)
To sum up the simplest application cases: Clients talk and servers listen (and usually reply). It is very easy to program an application that either talks or listens, although
TCP/IP specifications allow for the transition of passive and active open from one state
to another. We’ll talk more about client and server application and passive and active
opens in the chapter on sockets.
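In socket terms, the passive and active opens map directly onto the listen/accept and connect calls. Here is a minimal TCP sketch in Python; the port number 8080 and the greeting string are invented for the example, and error handling is omitted.

# Passive open (server side): bind, listen, and wait for a client to arrive.
import socket

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("0.0.0.0", 8080))
    s.listen(5)                    # passive open: sit and "listen"
    conn, addr = s.accept()        # blocks until some client connects
    conn.sendall(b"hello from the server\n")
    conn.close()
    s.close()

# Active open (client side): try to connect immediately; an unreachable
# server shows up as an error (for example, ConnectionRefusedError).
def client():
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(("127.0.0.1", 8080))   # active open
    print(c.recv(1024).decode())
    c.close()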
A more detailed look at the TCP/IP protocol stack is shown in Figure 2.4. The
TCP/IP stack bridges the gap between interface connector on the network side (hardware) and the memory address space of the application on the host (software).
The names of the protocol data units used at each layer are worth reviewing. The
unit of the network access (data link) layer is the frame. Inside the frame is the data unit of the IP layer,
the packet. The unit of the transport layer is the segment in TCP and datagram in
UDP. The segment or datagram by definition is the content of the information-bearing packet. Finally, applications exchange messages. Segments and datagrams taken
together form the messages that the applications are sending to each other.
This is a good place to explore some of the operational aspects of the TCP/IP
protocol stack above the network access (or data link) layer.
THE IP LAYER
The connectionless IP layer routes the IP packets independently through the collection
of network nodes such as routers that make up the “internetwork” that connects the
LANs. Packets at the IP layer do not follow “paths” or “virtual circuits” or anything else
set up by signaled or manually defined connections for packet flow in other types of
network layers. However, this also means that the packets’ content might arrive out of
sequence, or even with gaps in the sequence due to lost packets, at the destination.
IP does not care to which application a packet belongs. IP delivers all packets without a sense of priority or sensitivity to loss. The whole point of IP is to get packets from
one network interface to another. IP itself is not concerned with the lack of guaranteed
quality of service (QoS) parameters such as bandwidth availability or minimal delay,
and this is characteristic of all connectionless, best-effort networks. Even the basics,
such as sequenced delivery of packet content, priorities, and guaranteed delivery in the
form of acknowledgments (if these are needed by the application), must be provided
by the higher layers of the TCP/IP protocol stack. These reliable transport functions are
not functions of the IP layer, and some are not even functions of TCP.
Two other major protocols run at the IP layer besides IPv4 or IPv6 (or both). The
routers that form the network nodes in a TCP/IP network must be able to send error
messages to the hosts if a router must discard a packet (e.g., due to lack of buffer
space because of congestion). This protocol is known as the Internet Control Message
Protocol (ICMP). ICMP messages are sent inside IP packets, but ICMP is still considered
a different protocol and not a separate layer.
The other major protocol placed at the IP layer has many different functions
depending on the type of network that IP is running on. This is the Address Resolution Protocol (ARP). The main function of ARP is to provide a method for IPv4, which
technically knows only about packets, to find out the proper network layer address to
place in the frame header destination field. On LANs, this is the MAC address. Without
this address, the network beneath the IP layer could not deliver the frame containing
the IP packet to the proper destination. (IPv6 does not use ARP: IPv6 uses multicast for
this purpose.)
On a LAN, ARP is a way for IPv4 to send a broadcast message onto the LAN asking,
in effect,“Who has IP address 192.168.13.84?” Each system, whether host or router, on
the LAN will examine the ARP message (all systems must pay attention to a broadcast)
and the system having the IP address in question will reply to the sender’s MAC address
found in the source field of the frame. This target system will also cache the IP address
information so that it knows the MAC address of the sender (this cuts down on ARP
traffic on the network). The MAC layer address needed by the sending system is found
in the source address field of the frame carrying the ARP reply packet.
ARP messages are broadcast to every host in what is called the network layer broadcast domain. The broadcast domain can be a single physical group (e.g., all hosts
attached to a single group of hubs) or a logical grouping of hosts forming a virtual LAN
(VLAN). More will be said about broadcast domains and VLANs later in this chapter.
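To make the frame layout concrete, here is a Python sketch that builds the ARP request just described ("Who has 192.168.13.84?") by hand and sends it with a Linux raw socket. The interface name, sender MAC address, and sender IP address are placeholders, root privileges are required, and this is an illustration of the message format rather than a replacement for the ARP the stack performs automatically.

# Build and broadcast an ARP request on a Linux host (sketch only).
import socket, struct

def send_arp_request(ifname, src_mac, src_ip, target_ip):
    bcast = b"\xff" * 6                                     # Ethernet broadcast address
    eth_hdr = bcast + src_mac + struct.pack("!H", 0x0806)   # frame type 0x0806 = ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)         # Ethernet/IPv4, opcode 1 = request
    arp += src_mac + socket.inet_aton(src_ip)               # sender MAC and IP
    arp += b"\x00" * 6 + socket.inet_aton(target_ip)        # target MAC unknown: "who has...?"
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(0x0806))
    s.bind((ifname, 0))
    s.send(eth_hdr + arp)
    s.close()

# Example (placeholder values): "Who has 192.168.13.84? Tell 192.168.13.10."
# send_arp_request("eth0", bytes.fromhex("00d0b71ffee6"), "192.168.13.10", "192.168.13.84")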
THE TRANSPORT LAYER
The two main protocols that run above the IP layer at the transport layer are TCP and
UDP. Lately, UDP has been assuming more and more prominence on the Internet, especially with applications such as voice and multicast traffic such as video. One reason is
that TCP, with its reliable resending, is not particularly well suited for real-time applications (real time just means that the network delays must be low and stable or else the
application will not function properly). For these applications, late-arriving data are
worse than data that do not arrive at all, especially if the late data cause all the data
“behind” it to also arrive late. (Of course, in spite of these limitations, TCP is still widely
used for audio streaming and similar applications.)
Transmission Control Protocol
TCP’s built-in reliability features include sequence numbering with resending, which
is used to detect and resend missing or out-of-sequence segments. TCP also includes
a complete flow control mechanism (called windowing) to prevent any sender from
overwhelming a receiver. Neither of these built-in TCP features is good for real-time
audio and video on the Internet. These applications cannot “pause” and wait for missing segments, nor should they slow down or speed up as traffic loads vary on the
Internet. (The fact that they do just points out the incomplete nature of TCP/IP when
it comes to quality of service for these applications and services.)
TCP contains all the functions and mechanisms needed to make up for the
best-effort connectionless delivery provided by the IP layer. Packets could arrive at a
host with errors, out of their correct sequence, duplicated, or with gaps in sequence
due to lost (or discarded) packets. TCP must guarantee that the data stream is delivered
to the destination application error-free, with all data in sequence and complete. Following the practice used in connection-oriented networks, TCP uses acknowledgments
that periodically flow from the destination to the source to assure the sender that all is
well with the data received to that point in time.
On the sending side, TCP passes segments to the IP layer for encapsulation in
packets, which the IP layer in hosts and routers route connectionlessly to the destination host. On the receiving side, TCP accepts the incoming segments from the IP layer
and delivers the data they represent to the proper application running above TCP in
the exact order in which the data were sent.
User Datagram Protocol
The TCP/IP transport layer has another major protocol. UDP is as connectionless as IP.
When applications use UDP instead of TCP, there is no need to establish, maintain, or
tear down a connection between a source and destination before sending data. Connection management adds overhead and some initial delay to the network. UDP is a way to
send data quickly and simply. However, UDP offers none of the reliability services that
TCP does. UDP applications cannot rely on TCP to ensure error-free, guaranteed (via
acknowledgments), in-sequence delivery of data to the destination.
For some simple applications, purely connectionless data delivery is good enough.
Single request–response message pairs between applications are sent more efficiently
with UDP because there is no need to exchange a flurry of initial TCP segments to
establish a connection. Many applications will not be satisfied with this mode of operation, however, because it puts the burden of reliability on the application itself.
UDP is often used for short transactions that fit into one datagram and packet.
Real-time applications often use UDP with another header inside called the real-time
transport protocol (RTP). RTP borrows what it needs from the TCP header, such as a
sequence number to detect (but not to resend) missing packets of audio and video, and
uses these desirable features in UDP.
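To show how a sequence number rides inside UDP for real-time traffic, here is a simplified sketch of an RTP-style packet. It follows the general 12-byte fixed header layout of RFC 3550 (version, payload type, sequence number, timestamp, SSRC) but leaves out most of the protocol, and the port number and payload are placeholders.

# Simplified RTP-style packets inside UDP datagrams (sketch only).
import socket, struct

def rtp_packet(seq, timestamp, ssrc, payload, payload_type=0):
    first_byte = 2 << 6                      # RTP version 2, no padding/extension/CSRC
    header = struct.pack("!BBHII", first_byte, payload_type,
                         seq & 0xFFFF, timestamp, ssrc)
    return header + payload

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
dest = ("127.0.0.1", 5004)                   # 5004 is a common (but not mandatory) RTP port
for seq in range(3):
    pkt = rtp_packet(seq, timestamp=seq * 160, ssrc=0x1234ABCD,
                     payload=b"\x00" * 160)  # e.g., 20 ms of 8-kHz audio samples
    sock.sendto(pkt, dest)                   # the receiver uses seq to spot gaps,
sock.close()                                 # but never asks for a resend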
THE APPLICATION LAYER
At the top of the TCP/IP protocol stack, at the application layer, are the basic applications and services of the TCP/IP architecture. Several basic applications are typically
bundled with the TCP/IP software distributed from various sources and, fortunately, are
generally interoperable.
The standard application services suite usually includes a file transfer method
(File Transfer Protocol: FTP), a remote terminal access method (Telnet, which is not
commonly used today, and others, which are), an electronic mail system (Simple Mail
Transfer Protocol: SMTP), and a Domain Name System (DNS) resolver for domain name
to IP address translation (and vice versa), and more. Many TCP/IP implementations also
include a way of accessing files remotely (rather than transferring the whole file to the
other host) known as the Network File System (NFS). There is also the Simple Network
Management Protocol (SNMP) for network operations. For the Web, the server and
browser applications are based on the Hypertext Transfer Protocol (HTTP). Some of
these applications are defined to run on TCP and others are defined to run on UDP, and
in many cases can run on either.
BRIDGES, ROUTERS, AND SWITCHES
The TCP/IP protocol stack establishes an architecture for internetworking. These
protocols can be used to connect LANs in the same building, on a campus, or around
the world. Not all internetworking devices are the same. Generally, network architects
seeking to extend the reach of a LAN can choose from one of four major interconnection devices: repeaters, bridges, routers, and switches.
Not long ago, the network configuration and the available devices determined
which type of internetworking device should be used. Today, network configurations
are growing more and more complex, and the devices available often combine the features of several of these devices. For example, the routers on the Illustrated Network
have all the features of traditional routers, plus some switching capabilities.
In their simplest forms, repeaters, bridges, and routers operate at different layers of
the TCP/IP protocol stack, as shown in Figure 2.5. Roughly, repeaters forward bits from
one LAN segment to another, bridges forward frames, and routers forward packets.
Host
Host
Layer 5
Application Layer
Application Layer
Layer 4
Transport Layer
Transport Layer
Layer 3
Network Layer
Layer 2
Data Link Layer
Layer 1
Physical Layer
Router
Bridge
Repeater
Network Layer
Data Link Layer
Physical Layer
FIGURE 2.5
Repeater, bridge, and router. A repeater “spits bits,” while a bridge deals with complete frames.
A router operates at the packet level and is the main mode of the Internet.
Switches are important enough to deserve a separate discussion at the end of this
section.
This section will explore the major characteristics of internetworking with bridges,
routers, and switches. It will show how the LAN collision and broadcast domains are
defined. This section will also show how the IP layer in particular and other protocols
in TCP/IP interact in a routing environment.
Segmenting LANs
Network administrators and designers are often faced with a need to increase the
amount of bandwidth available to users, increase the number of users supported, or
extend the coverage of a LAN. The good news is that this means that the network is
popular and useful, but the bad news is that there are lots of ways that these goals can
be accomplished, some better than others.
Sometimes the answer is relatively straightforward. If a 100-Mbps Fast Ethernet is
congested, moving everyone to Gigabit Ethernet will provide an instant increase in
bandwidth (close to the theoretical tenfold increase with lots of tuning). However, this
also usually means replacing adapter cards and replacing the “hubs” to support the new
bandwidth and frames. This type of wholesale upgrade can be very expensive.
Hub
We avoid the use of the term “hub” in this book. Repeaters were called hubs when
there were no other types of hubs. When bridges and switches and other LAN
devices came along, it was better to call a repeater a repeater. Today the term “hub”
can mean a repeater, bridge, switch, or a hybrid device like a multispeed repeater
(which is really many single-speed repeaters connected by a bridge). The term
“hub” never had a specific meaning.
Another way to give each user more bandwidth (and at the same time increase
users and coverage) is to segment the LAN. Segmenting does not require replacing all
of the user equipment. As the name implies, segmenting breaks the LAN into smaller
portions and then reconnects them with an internetworking device.
Another consequence of the different protocol layers at which the various internetworking devices function is the number of LAN collision and broadcast domains
created. Ethernet’s CSMA/CD access method can result in collisions when stations on
the LAN try to send at almost the same time. Collisions “waste” bandwidth because they
destroy the frames, and the colliding stations must wait and try to send again. (Actually,
unless they are oversubscribed, CSMA/CD systems offer better performance than token-passing or other methods.) Even when Ethernets do not generate collisions, broadcast frames must be examined by each receiver because the destination address cannot be used to determine interest in content. Bandwidth is wasted if broadcast frames are sent to systems that have no interest in the content of the broadcast message. (In TCP/IP, ARPs are the major type of broadcast frames that systems send and receive.)
It should be noted that although CSMA/CD is still formally part of Gigabit Ethernet, it is essentially unused there, and it is not present at all in 10-Gigabit Ethernet.
Extending a LAN by forwarding bits still creates a single collision and broadcast domain. The number of collision and broadcast domains created by all the internetworking devices discussed is shown in Table 2.2. We'll look at why this is true of each device in detail shortly.

Table 2.2 Collision and Broadcast Domains

Internetwork Device    Collision Domains    Broadcast Domains
Repeater               One                  One
Bridge                 Many                 One
Router                 Many                 Many
Switch                 Many                 Depends on VLAN configuration
The use of these devices is not mutually exclusive. In other words, a router can be
used to segment a LAN into two (or more) segments, and each resulting segment can
be divided further with bridges. In an extreme case, each individual user or system has
the full media bandwidth available. This is what switches can do.
Repeaters are a type of special case in that they do not segment a LAN at all. Repeaters do not furnish more bandwidth for users; they just extend the reach of the LAN.
Repeaters are included in the table as a “baseline” for comparison. Repeaters forward
bits from one segment to another and have no intelligence with regard to data format.
If the frame contains errors, violates rules about minimum or maximum frame sizes, or
anything else is wrong, the repeaters forward the frame anyway.
Note that wireless LAN devices connected to an access point share the same
properties as a repeater network. And repeaters, technically obsolete on wired networks, have renewed life on wireless networks, especially what are called “ad hoc”
wireless networks.
A 100BaseT Ethernet LAN consists of at least one multiport repeater (often called
a “hub”) with twisted-pair wires connected directly to each system. All systems see all
frames, for better or worse. There are strict limits to the size to which a network made
up of repeater-connected LAN segments can grow. The more systems there are that
can send, the less of the total shared bandwidth each system has. Ethernet limits the
number of systems that each LAN segment can have (the number varies by specific
Ethernet type). Finally, there are distance limits to the electrical signals that repeaters
propagate.
Bridges
Ethernet specifications limit the number of systems on a LAN segment and the overall
distance spanned. To add devices to a LAN that has reached the maximum in one or both
of these categories, a bridge can be used to connect LAN segments. Bridged networks
normally filter frames and do not forward all frames onto all segments connected to the
bridge. This is why bridges create more than one collision domain. However, the LAN
segments linked by the bridge still normally form one broadcast domain. Although the
word “bridge” is often applied to products, pure bridges are at least as obsolete as hubs.
The filtering process employed by a bridge differs according to specific LAN
technology. Ethernet uses transparent bridging to connect LAN segments. A transparent
bridge looks at the destination MAC address to decide if the frames should be:
■ Forwarded—The frame is sent only onto the LAN segment where the destination is located. The bridge examines the source MAC address fields to find specific device locations.
■ Filtered—The frame is dropped by the bridge. No message is sent back to the source.
■ Flooded—The frame is sent to every LAN segment attached to the bridge. This is done for broadcast and multicast traffic.
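The forward, filter, or flood decision is essentially a table lookup keyed by MAC address. The short Python sketch below models a learning bridge with a dictionary; the port numbers and shortened MAC addresses are invented for the example.

# Toy model of a transparent (learning) bridge's forwarding decision.
mac_table = {}   # learned MAC address -> bridge port

def handle_frame(in_port, src_mac, dst_mac, all_ports):
    mac_table[src_mac] = in_port                          # learn where the source lives
    if dst_mac == "ff:ff:ff:ff:ff:ff" or dst_mac not in mac_table:
        return [p for p in all_ports if p != in_port]     # flood
    out_port = mac_table[dst_mac]
    if out_port == in_port:
        return []                                         # filter: destination is local
    return [out_port]                                     # forward to one segment

ports = [1, 2, 3]
print(handle_frame(1, "aa:aa", "bb:bb", ports))   # unknown destination: flood to [2, 3]
print(handle_frame(2, "bb:bb", "aa:aa", ports))   # destination learned: forward to [1]
print(handle_frame(1, "cc:cc", "aa:aa", ports))   # destination on the same segment: filter, []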
When bridges are used to connect LAN segments, the media bandwidth is shared
only by the devices on each segment. Because the broadcast domain is preserved, the
bridged LANs still function as one big LAN. Bridges also discard frames with errors, as
well as frames that violate LAN protocol length rules, and thus protect the other LAN
segments when things go wrong.
Bridges are certainly an improvement over repeaters, but still have a number of
issues. The common ARPs used to associate IP addresses at Layer 3 with LAN MAC
addresses at Layer 2 pass through all bridges, but broadcasts due to protocols are not
usually the issue. However, multicast traffic is also flooded, and multimedia applications
such as videoconferences can easily overwhelm a bridged network. Some issues are
more mundane: printers, which generate very little traffic, sometimes remain invisible
in a bridged network.
Ethernet bridges must also be spanning tree bridges. These bridges can detect
loops in the interconnected topology of LAN segments and bridges. Loops are a problem in bridged networks because some frames are always flooded onto all segments.
Flooding multiplies the total number of frames on the network. Loops multiply frames
over and over until a saturation point is reached.
Routers
Bridges add functions to an interconnected LAN because they operate at a higher layer
of the protocol stack than repeaters. Bridges run at Layer 2, the frame layer, and can do
everything a repeater can do, and more, because bridges create more collision domains.
In the same way, routers add functionality to bridges and operate at Layer 3, the packet
layer. Routers not only create more collision domains, they create more LAN broadcast
domains as well.
In a LAN with repeaters or bridges, all of the systems belong to the same subnet
or subnetwork. Layer 3 addresses in their simplest form—and IP addresses are a good
example of this—consist of a network and system (host) portion of the address. LANs
connected by routers have multiple broadcast domains, and each LAN segment belongs
to a different subnetwork.
Because of the presence of multiple subnets, TCP/IP devices must behave differently
in the presence of a router. Bridges connecting TCP/IP hosts are transparent to the
systems, but routers connecting hosts are not. At the very least, the host must know
the address of at least one router, the default router, to send packets beyond the local
subnet. As we’ll soon see, use of the default router requires the use of a default route, a
route that matches all IPv4/IPv6 packets.
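The default route is simply the prefix that matches every destination, and a more specific route still wins when both match. The Python sketch below uses the standard ipaddress module; the two-entry routing table is invented for the example (CE0 stands in for the default router on LAN1).

# Longest-prefix match: 0.0.0.0/0 matches everything, but a more
# specific prefix is preferred when it also matches.
import ipaddress

routes = {
    ipaddress.ip_network("0.0.0.0/0"): "default router (CE0)",
    ipaddress.ip_network("10.10.11.0/24"): "local LAN1 interface",
}

def next_hop(dst):
    addr = ipaddress.ip_address(dst)
    best = max((net for net in routes if addr in net),
               key=lambda net: net.prefixlen)      # longest matching prefix wins
    return routes[best]

print(next_hop("10.10.11.177"))   # local LAN1 interface
print(next_hop("192.0.2.7"))      # default router (CE0)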
Bridges are sometimes called “protocol independent” devices, which really means
that bridges can be used to connect LAN segments regardless of whether TCP/IP is
used or not. However, routers must have Layer 3 software to handle whichever Layer 3
protocols are in use on the LAN. Many routers, especially routers that connect to the
Internet, can and do understand only the IP protocol. However, many routers can handle multiple Layer 3 protocols, including protocols that are not usually employed with
routed networks.
LAN Switches
The term “switch” in networking has threatened to become as overused as “hub.” When
applied to LANs, a switch is still a device with a number of common characteristics that
can be compared to bridges and routers.
The LAN switch is really a complex bridge with many interfaces. LAN switching
is the ultimate extension of multiport bridging. A LAN switch has every device on its
own segment, giving each system the entire media bandwidth all for itself. Multiple
systems can transmit simultaneously as long as there are no “port collisions” on the
LAN switch. Port collisions occur when multiple source ports try to send a frame to the
same output port at the same time.
All of the ports on the switch establish their own collision domain. However,
when broadcast frames containing ARPs or multicast traffic arrive, the switch floods
the frames to all other ports. Unfortunately, this makes LAN switching not much better
than a repeater or a bridge when it comes to dealing with broadcast and multicast
traffic (but there is an improvement because broadcast traffic cannot cause collisions
that would force retransmissions).
To overcome this problem, a LAN switch can allow multiple ports to be assigned to
a broadcast domain. The broadcast domains on a LAN switch are configurable and each
floods broadcast and multicast traffic only within its own domain. As a matter of fact,
it is not possible for any frames to cross the boundary of a broadcast domain: Another
external device, such as a router, is always required to internetwork the domains.
When LAN switches define multiple broadcast domains they are creating virtual
LANs (VLANs). Not all LAN switches can define VLANs, especially smaller ones, but
many can. A VLAN defines membership to a LAN logically, through configuration, not
physically by sharing media or devices.
On a WAN, the term “switch” means a class of network nodes that behave very differently than routers. We’ll look more closely at how “fast packet network” devices, such as
Frame Relay and ATM switches as network nodes, differ from routers in a later chapter.
Virtual LANs
A VLAN, according to the official IEEE definition, defines broadcast domains at Layer 2.
VLANs, as a Layer 2 entity, really have little to do with the TCP/IP protocol stack,
but VLANs make a huge difference in how switches and routers operate on a TCP/IP
network.
Routers do not propagate broadcasts as bridges do, so a router automatically defines
broadcast domains on each interface. Layer 2 LAN switches logically create broadcast
domains based on configuration of the switch. The configuration tells the LAN switch
what to do with a broadcast received on a port in terms of what other ports should
receive it (or if it should even be flooded to all other ports).
When LAN switches are used to connect LAN segments, the broadcast domains
cannot be determined just by looking at the network diagram. Systems can belong to
different, the same, or even multiple, broadcast domains. The configuration files in the LAN switches determine the boundaries of these domains as well as their members.
FIGURE 2.6
VLANs in a LAN switch. Broadcast domains are now logical entities connected by "virtual bridges" in the device. (The figure shows clients and servers on a single LAN switch assigned alternately to VLAN 1 and VLAN 2; broadcast messages from VLAN 1 devices are sent only to the VLAN 1 broadcast domain, and broadcast messages from VLAN 2 devices are sent only to the VLAN 2 broadcast domain.)
Each broadcast domain is a type of “virtual bridge” within the switch. This is shown in
Figure 2.6.
Each virtual bridge configured in the LAN switch establishes a distinct broadcast
domain, or VLAN. Frames from one VLAN cannot pass directly to another VLAN on the
LAN switch (or else you create one big VLAN or broadcast domain). Layer 3 internetworking devices such as routers must be used to connect the VLANs, allowing internetworking and at the same time keeping the VLAN broadcast domains distinct. All
devices that can communicate directly without a router (or other Layer 3 or higher
device) share the same broadcast domain.
VLAN Frame Tagging
VLAN devices can come in all shapes and sizes, and configuration of the broadcast
domains can be just as variable. Interoperability of LAN switches is compromised when
there are multiple ways for a device to recognize the boundaries of broadcast domains.
To promote interoperability, the IEEE established IEEE 802.1Q to standardize the creation of VLANs through the use of frame tagging.
Some care is needed with this aspect of VLANs. VLANs are not really a formal networking concept, but they are a nice feature that devices can support. One key VLAN
feature is the ability to place switch ports in virtual broadcast domains. The other key
feature is the ability to tag Ethernet frames with a VLAN identifier so that devices can
easily distinguish the boundaries of the broadcast domains. These devices and tags are
not codependent, but you have to use both features to establish a useful VLAN.
Multiple tags can be placed inside Ethernet frames. There is also a way to assign
priorities to the tagged frames, often called IEEE 802.1p, but officially known as
IEEE 802.1D-1998. Internetworking devices, not just LAN switches, can read the tags
and establish VLAN boundaries based on the tag information.
VLAN tags add 4 bytes of information between the Source Address and Type/Length
fields of Ethernet frames. The maximum size of the modified Ethernet frame is increased
from 1518 to 1522 bytes, so the frame check sequence must be recalculated when the
VLAN tag is added. VLAN identifiers can range from 0 to 4095.
The use of VLAN “q in q” tags increases the available VLAN space (ISPs often assign
each customer a VLAN identifier, and customers often have their own VLANs as well).
In this case, multiple tags are placed in an Ethernet frame. The format and position of
VLAN tags according to IEEE 802.3ac are shown in Figure 2.7.
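The tag itself is easy to build: a 16-bit TPID (0x8100 by default) followed by 16 bits of Tag Control Information holding the 3-bit priority, the CFI bit, and the 12-bit VLAN ID. The Python sketch below inserts such a tag after the source address of an untagged frame; the MAC addresses are borrowed from LAN1, the VLAN number is invented, and the frame check sequence is ignored.

# Insert an IEEE 802.1Q tag into an untagged Ethernet frame (sketch only).
import struct

def add_vlan_tag(frame, vlan_id, priority=0, cfi=0, tpid=0x8100):
    assert 0 <= vlan_id <= 4095
    tci = (priority << 13) | (cfi << 12) | vlan_id   # Tag Control Information
    tag = struct.pack("!HH", tpid, tci)              # 4 bytes in all
    return frame[:12] + tag + frame[12:]             # after the 6-byte DA and 6-byte SA

untagged = (bytes.fromhex("00058588ccdb") +          # destination MAC (CE0 router)
            bytes.fromhex("00d0b71ffee6") +          # source MAC (lnxserver)
            struct.pack("!H", 0x0800) +              # Type: IPv4
            b"\x00" * 46)                            # minimum-size payload
tagged = add_vlan_tag(untagged, vlan_id=100, priority=3)
print(len(untagged), "->", len(tagged))              # 60 -> 64 (FCS not included here)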
VLANs are built for a variety of reasons. Among them are:
Security—Frames on an Ethernet segment are delivered everywhere, and devices
only process (look inside) MAC frames that are addressed to them. Nothing
stops a device from monitoring everything that arrives on the interface (that’s
essentially how Ethereal works). Sensitive information, or departmental traffic,
can be isolated with virtual LANs.
FIGURE 2.7
VLAN tags and frames. Note that frames can contain more than one tag, and often do. (The figure shows the Ethernet frame structure with the 4-byte tag inserted between the source address and the type field: a 16-bit Tag Protocol ID of 0x8100 by default (0x9100 and 0x9200 are also used), a 3-bit 802.1p priority field with levels 0–7, a 1-bit Canonical Format Indicator (0 = canonical MAC, 1 = noncanonical MAC), and a 12-bit VLAN ID from 0 to 4095. It also shows an original untagged frame, an 802.1Q tagged frame, and a doubly tagged q-in-q frame.)
Cutting down on broadcasts—Some network protocols are much worse than
others when it comes to broadcasts. These broadcast frames can be an issue
because they rarely carry user data and each and every system on the segment
must process the content of a broadcast frame. VLANs can isolate protocol
broadcasts so that they arrive only at the systems that need to hear them. Also,
a number of hosts that might otherwise make up a very large logical network
(e.g., what we will call later a “/19-sized wireless subnet”) could use
VLANs because they can be just plain noisy.
Router delay—Older routers can be much slower than LAN switches. VLANs can
be used to establish logical boundaries that do not need to employ a router to
get traffic from one LAN segment to another. (In fairness, many routers today
route at “wire speed” and do not introduce much latency into a network.)
The Illustrated Network uses Gigabit Ethernet links to connect the customer-edge
routers to the ISP networks. Many ISPs would assign the frames arriving from LAN1 and
LAN2 a VLAN ID and tag the frames at the provider-edge routers. If the sites are close
enough, some form of Metro Ethernet could be configured using the tag information.
However, the sites are far enough apart that we would have to use some other method
to create a single LAN out of LAN1 and LAN2.
In a later chapter, we’ll use VLAN tagging, along with some other router switching
features, to create a “virtual private LAN” between LAN1 and LAN2 on the Illustrated
Network, mainly for security purposes.
QUESTIONS FOR READERS
Figure 2.8 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 2.8
Hubs, bridges, and routers can connect LAN segments to form an internetwork. (The figure shows clients and servers on hubs linked by a transparent bridge and a router, with ARP used on a LAN segment before sending a frame and UDP for connectionless, TCP for connection-oriented transport; a LAN switch carries VLAN 1 and VLAN 2, and broadcast messages are sent only to their own VLAN's broadcast domain and the router.)
1. What is the main function of the ARP message on a LAN?
2. What is the difference between TCP and UDP in terms of connection overhead and
reliability?
3. What is a transparent bridge?
4. What is the difference between a bridge and a router in terms of broadcast
domains?
5. What is the relationship between a broadcast domain and a VLAN?
CHAPTER 3
Network Link Technologies
What You Will Learn
In this chapter, you will learn more about the links used to connect the nodes of
the Illustrated Network. We’ll investigate the frame types used in various technologies and how they carry packets. We’ll take a long look at Ethernet, and mention
many other link types used primarily in private networks.
You will learn about SONET/SDH, DSL, and wireless technologies as well as
Ethernet. All four link types are used on the Illustrated Network.
This chapter explores the physical and data link layer technologies used in the Illustrated Network. We investigate the methods used to link hosts and intermediate nodes
together over shorter LAN distances and longer WAN distances to make a complete
network.
For most of the rest of the book, we’ll deal with packets and their contents. This is
our only chance to take a detailed look at the frames employed on our network, and
even peer inside them. Because the Illustrated Network is a real network, we’ll emphasize the link types used on the network and take a more cursory look at link types that
might be very important in the TCP/IP protocol suite, but are not used on our network.
We’ll look at Ethernet and the Synchronous Optical Network/Synchronous Digital Hierarchy (SONET/SDH) link technologies, and explore the variations on the access theme
that digital subscriber line (DSL) and wireless technologies represent.
We’ll look at public network services like frame relay and Asynchronous Transfer
Mode (ATM) in a later chapter. In this book, the term private network is used to characterize network links that are owned or directly leased by the user organization, while
a public network is characterized by shared user access to facilities controlled by a
service provider. The question of Who owns the intermediate nodes? is often used as
a rough distinguisher between private and public network elements.
Because of the way the TCP/IP protocol stack is specified, as seen in Chapter 1, we
won’t talk much about physical layer elements such as modems, network interface
cards (NICs), and connectors. As important as these aspects of networking are, they have little to do directly with how TCP/IP protocols or the Internet operates. For example, a full exploration of all the connector types used with fiber-optic cable would take many pages, and yet add little to anyone's understanding of TCP/IP or the Internet.
FIGURE 3.1
Connections used on the Illustrated Network. SONET/SDH links are indicated by heavy lines, Ethernet types by dashed lines, and DSL is shown as a dotted line. The home wireless network is not given a distinctive representation. (The figure repeats the LAN1 Los Angeles and LAN2 New York hosts and routers of Figure 2.1, adding the DSL link and wireless home network attached to the Ace ISP (AS 65459), with the Best ISP (AS 65127) connected to the global public Internet; all links use 10.0.x.y addressing, and only the last two octets are shown.)
Instead, we will concentrate on the structure of the frames sent on these link types,
which are often important to TCP/IP, and present some operational details as well.
ILLUSTRATED NETWORK CONNECTIONS
We will start by using Ethereal (Wireshark), the network protocol analyzer introduced
in the last chapter, to investigate the connections between systems on the Illustrated
Network. It runs on a variety of platforms, including all three used in the Illustrated
Network: FreeBSD Unix, Linux, and Windows XP. Ethereal can display real-time packet
interpretations and, if desired, also save traffic to files (with a variety of formats) for
later analysis or transfer to another system. Ethereal is most helpful when examining all
types of Ethernet links. The Ethernet links are shown as dashed lines in Figure 3.1.
The service provider networks’ SONET links are shown as heavy solid lines, and the
DSL link to the home office is shown as a dotted line. The wireless network inside the
home is not given a distinctive representation in the figure. Note that ISPs today typically employ more variety in WAN link types.
Displaying Ethernet Traffic
On the Illustrated Network, all of the clients and servers with detailed information
listed are attached to LANs. Let’s start our exploration of the links used on the Illustrated Network by using Ethereal both ways to see what kind of frames are used on
these LANs.
Here is a capture of a small frame to show what the output looks like using tethereal, the text-based version of Ethereal. The example uses the verbose mode (–V) to
force tethereal to display all packet and frame details. The example shows, highlighted
in bold, that Ethernet II frames are used on LAN1.
[root@lnxserver admin]# /usr/sbin/tethereal –V
Frame 2 (60 bytes on wire, 60 bytes captured)
Arrival Time: Mar 25, 2008 12:14:36.383610000
Time delta from previous packet: 0.000443000 seconds
Time relative to first packet: 0.000591000 seconds
Frame Number: 2
Packet Length: 60 bytes
Capture Length: 60 bytes
Ethernet II, Src: 00:05:85:88:cc:db, Dst: 00:d0:b7:1f:fe:e6
Destination: 00:d0:b7:1f:fe:e6 (Intel_1f:fe:e6)
Source: 00:05:85:88:cc:db (Juniper__88:cc:db)
Type: ARP (0x0806)
Trailer: 00000000000000000000000000000000...
Address Resolution Protocol (reply)
Hardware type: Ethernet (0x0001)
Protocol type: IP (0x0800)
Hardware size: 6
Protocol size: 4
Opcode: reply (0x0002)
Sender MAC address: 00:05:85:88:cc:db (Juniper__88:cc:db)
Sender IP address: 10.10.11.1 (10.10.11.1)
Target MAC address: 00:d0:b7:1f:fe:e6 (Intel_1f:fe:e6)
Target IP address: 10.10.11.66 (10.10.11.66)
Many details of the packet and frame structure and content will be discussed in
later chapters. However, we can see that the source and destination MAC addresses
are present in the frame. The source address is 00:05:85:88:cc:db (the router), and
the destination (the Linux server) is 00:d0:b7:1f:fe:e6. Ethereal even knows which
organizations have been assigned the first 24 bits of the 48-bit MAC address (Intel and
Juniper Networks). We’ll say more about MAC addresses later in this chapter.
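The first 14 bytes of any Ethernet II frame follow the same pattern Ethereal is decoding here: a 6-byte destination MAC address, a 6-byte source MAC address, and a 2-byte type field. The short Python sketch below parses those fields from the header bytes of the ARP reply captured above (only the 14-byte header is included).

# Parse the Ethernet II header fields that Ethereal displays.
import struct

# Destination, source, and type bytes from the ARP reply frame above.
raw = bytes.fromhex("00d0b71ffee6" "00058588ccdb" "0806")

dst, src, eth_type = struct.unpack("!6s6sH", raw)
as_text = lambda mac: ":".join(f"{b:02x}" for b in mac)
print("Destination:", as_text(dst))          # 00:d0:b7:1f:fe:e6 (Intel OUI 00:d0:b7)
print("Source:     ", as_text(src))          # 00:05:85:88:cc:db (Juniper OUI 00:05:85)
print("Type:        0x%04x" % eth_type)      # 0x0806 = ARP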
Figure 3.2 shows the same packet, and the same information, but in graphical format. Only a small section of the entire window is included. Note how the presence of
Ethernet II frames is indicated, parsed on the second line in the middle pane of the
window.
Why use text-based output when a graphical version is available? The graphical output shows the raw frame in hex, something the text-based version does not do, and the
interpretation of the frame’s fields is more concise.
However, the graphical output is not always clearer. In most cases, the graphical representation can be more cluttered, especially when groups of packets are involved. The
graphical output only parses one packet at a time on the screen, while a whole string
of packets can be parsed with tethereal (but printouts of graphical information can be
formatted like tethereal).
FIGURE 3.2
Graphical interface for Ethereal. There are three main panes. Top to bottom: (1) a digest of the
packets' headers and information, (2) parsed details about frame and packet contents, and (3) the
raw frame captured in hexadecimal notation and interpreted in ASCII.
In addition, many network administrators of Internet servers do not install or use
a graphical interface, and perform their tasks from a command prompt. If you’re not
sitting in front of the device, it’s more expedient to run the non-GUI version. Tethereal
is the only realistic option in these cases. We will use both types of Ethereal in the
examples in this book.
In our example network, what about LAN2? Is it also using Ethernet II frames? Let’s
capture some packets on bsdserver to find out.
bsdserver# tethereal –V
Capturing on em0
Frame 1 (98 bytes on wire, 98 bytes captured)
Arrival Time: Mar 25, 2008 13:05:00.263240000
Time delta from previous packet: 0.000000000 seconds
Time since reference or first frame: 0.000000000 seconds
Frame Number: 1
Packet Length: 98 bytes
Capture Length: 98 bytes
Ethernet II, Src: 00:0e:0c:3b:87:32, Dst: 00:05:85:8b:bc:db
Destination: 00:05:85:8b:bc:db (Juniper__8b:bc:db)
Source: 00:0e:0c:3b:87:32 (Intel_3b:87:32)
Type: IP (0x0800)
Internet Protocol, Src Addr: 10.10.12.77 (10.10.12.77), Dst Addr: 10.10.12.1
(10.10.12.1)
Version: 4
Header length: 20 bytes
....
Yes, an Ethernet II frame is in use here as well. Even though we’re running Ethereal
(tethereal) on a different operating system (FreeBSD) instead of on Linux, the output is
nearly identical (the differences are due to a slightly different version of Ethereal on the
servers). However, LANs are not the only type of connections used on the Illustrated
Network.
Displaying SONET Links
What about link types other than Ethernet? ISPs in the United States often use SONET
fiber links between routers separated by long distance. In most other parts of the world,
SDH is used. SONET was defined initially in the United States, and the specification was
adapted, with some changes, for international use by the ITU-T as SDH.
The Illustrated Network uses SONET, not SDH. There are small but important differences between SONET and SDH, but this book will only reference SONET. Line monitoring equipment that allows you to look directly at SONET/SDH frames is expensive
and exotic, and not available to most network administrators. So we’ll take a different
approach: We’ll show you the information that’s available on a router with a SONET
interface. This will show the considerable bandwidth available even in the slowest of
SONET links, which runs at 155 Mbps and is the same as the basic SDH speed.
SONET and SDH
The SONET fiber-optic link standard was developed in the United States and is
mainly used in places that follow the digital telephony system used in the United
States, such as Canada and the Philippines. SDH, on the other hand, is used in
places that follow the international standards developed for the digital telephony
system in the rest of the world. SDH must be used for all international links, even
those that link to SONET networks in the United States.
The differences between SONET and SDH transmission frame structures,
nomenclature, alarms, and other details are relatively minor. In most cases, equipment can handle SONET/SDH with equal facility.
We can log in to router CE0 and monitor a SONET interface for a minute or so and
see what’s going on.
Routers and Users
Usually, network administrators don’t let ordinary users casually log in to routers,
even edge routers, and poke around. Even if they were allowed to, the ISP’s core
routers would still remain off limits. But this is our network, and we can do as we
please, wherever we please.
Admin> ssh ce0
admin@CE6's password: *********
--- JUNOS 8.4R1.3 built 2007-08-06 06:58:15 UTC
admin@CE6> monitor interface so-0/0/1
R2                              Seconds: 59                  Time: 13:36:05
                                                             Delay: 2/0/3
Interface: so-0/0/1, Enabled, Link is Up
Encapsulation: PPP, Keepalives, Speed: OC3
Traffic statistics:                                          Current delta
  Input bytes:           166207481 (576 bps)                 [2498]
  Output bytes:          171979817 (48 bps)                  [2713]
  Input packets:           2868777 (0 pps)                   [39]
  Output packets:          2869671 (0 pps)                   [39]
Encapsulation statistics:
  Input keepalives:         477607                           [6]
  Output keepalives:        477717                           [7]
  LCP state: Opened
Error statistics:
  Input errors:                  0                           [0]
  Input drops:                   0                           [0]
  Input framing errors:          0                           [0]
  Input runts:                   0                           [0]
  Input giants:                  0                           [0]
  Policed discards:              0                           [0]
  L3 incompletes:                0                           [0]
  L2 channel errors:             0                           [0]
  L2 mismatch timeouts:          0                           [0]
  Carrier transitions:           1                           [0]
  Output errors:                 0                           [0]
  Output drops:                  0                           [0]
  Aged packets:                  0                           [0]
Active alarms : None
Active defects: None
SONET error counts/seconds:
  LOS count                      1                           [0]
  LOF count                      1                           [0]
  SEF count                      3                           [0]
  ES-S                           1                           [0]
  SES-S                          1                           [0]
SONET statistics:
  BIP-B1                         0                           [0]
  BIP-B2                         0                           [0]
  REI-L                          0                           [0]
  BIP-B3
Not much is happening yet on our network in terms of traffic, but the output is
still informative. The first column shows cumulative values and the second column
shows the change since the last monitor “snapshot” on the link. “Live” traffic during
these 59 seconds, in this case mostly a series of keepalive packets, is shown in parentheses, both in bits per second and in packets per second (the example rounds the 39 packets in 59 seconds, or 0.66 packets per second, down to 0 packets per second).
The frames carried on the link, listed as encapsulation, belong to a protocol called
Point-to-Point Protocol (PPP). Six PPP keepalives have been received in the 59-second
window, and seven have been sent (they are exchanged every 10 seconds), adding to the total of more than 477,000 since the link was initialized. The cumulative
errors also occurred as the link was initializing itself, and it is reassuring that there are
no new errors.
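As a quick sanity check on those numbers (plain arithmetic, not a router tool), the deltas can be turned back into per-second rates. The bps figures the router prints are its own instantaneous measurements, so they will not exactly match a simple average over the interval:

# Relate one 59-second snapshot's delta counters to per-second rates.
def rates(byte_delta, packet_delta, interval_seconds):
    """Convert one monitor interval's deltas into bits/s and packets/s."""
    return byte_delta * 8 / interval_seconds, packet_delta / interval_seconds

bps, pps = rates(byte_delta=2498, packet_delta=39, interval_seconds=59)
print(f"Average input rate over the interval: {bps:.0f} bps, {pps:.2f} pps")
print(f"Displayed as whole packets per second: {int(pps)} pps")   # rounds down to 0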
Displaying DSL Links
The Illustrated Network also has a broadband DSL link from an ISP that is used to allow
a home office to attach to the router network. This link is shown in red in Figure 3.1.
If the permissions are set up correctly, the home user will be able to access network
resources on LAN1 and LAN2. DSL links are much faster than ordinary dial-up lines and
are always available, just like a leased access line. The DSL link terminates at home in a
DSL router (more properly, a residential gateway), and the distribution of information
to devices in the home can be by wired or wireless LAN.
On the network end of the DSL link, the link terminates at a DSL access multiplexer
(DSLAM), typically using IP or ATM technology.
At the user end of the DSL link on the Illustrated Network, the office in the home
uses both a wired and a wireless network. This is a common arrangement today: People with laptops can wander, but desktop PCs usually stay put. The wireless network
encapsulates packets and sends them to a special device in the home (a wireless access
point, often built into a DSL router).
What kind of frames does the DSL link use? That’s hard to determine, because the
DSL modem is upstream of the DSL router in most cases (sometimes on the side of the
house, sometimes closer to the service provider). The wired LAN between DSL router
and computer uses the same type of Ethernet frames we saw on LAN1 and LAN2. On a
wired LAN, Ethereal will always capture Ethernet II frames, as shown in Figure 3.3.
What can we learn about DSL itself? Well, we can access the DSL router using a Web
browser and see what kinds of information are available. Figure 3.4 shows the basic
setup screen of the Linksys DSL router (although it’s really not doing any real routing,
just functioning as a simple gateway between ISP and home LAN).
Because this is a working LAN, I’ve restored the default names and addresses for
this example. The router itself is WRT54G (a product designation), and the ISP does not
expect only one host to use the DSL link, so no host or domain name is required. We’ll
talk about the maximum transmission unit (MTU) size later in this chapter. This is set
automatically on the link.
The DSL router itself uses IPv4 address 192.168.1.1. We’ll talk about what the subnet mask does in Chapter 4. The router hands out IP addresses as needed to devices on
the home network, starting with 192.168.1.100, and it uses the Dynamic Host Configuration Protocol (DHCP) to do this. We’ll talk about DHCP in Chapter 18.
FIGURE 3.3
Ethernet frames on a wired LAN at the end of a DSL link. Capturing raw DSL frame “on the wire”
is not frequently done, and is difficult without very expensive and specialized equipment.
FIGURE 3.4
Basic setup screen for a DSL link. We’ll talk about all of these configuration parameters and
protocols, such as subnet masks and DHCP, in later chapters.
What kinds of statistics are available on the DSL router? Not much on this model.
There are simple incoming and outgoing logs, but these capture only the most basic
information about addresses and ports. A small section of the outgoing log is shown in
Table 3.1.
These are all Web browser entries that were run with names, not IP addresses (Yahoo
is one of them). The table lists the addresses because the residential gateway does not
bother to look the names up. However, instead of presenting the port numbers, the log
interprets them as a service name (www is port 80 on most servers).
We’ll take a more detailed look at DSL later in this chapter. Now, let’s take a look at
the fourth and last link type used on the Illustrated Network: the four available wireless
links used to hook a laptop and printer up to the home office DSL router.
The wireless implementation is a fairly straightforward bridging exercise. A single
wireless interface is bridged in software with the Ethernets in the box. The wireless
network is a single broadcast/collision domain.
Table 3.1 Outgoing Log Table from DSL Router

LAN IP          Destination URL/IP    Service/Port Number
192.168.1.101   202.43.195.13         www
192.168.1.101   64.86.142.99          www
192.168.1.101   202.43.195.52         www
192.168.1.101   64.86.142.120         www
[Figure 3.5 artwork: the DSL router, with its DSL link to the ISP, connects PC 1, PC 2, and PC 3 through its 4 Ethernet ports, and the laptop and color laser printer through its 4 wireless ports.]
FIGURE 3.5
The home office network for the Illustrated Network. Devices must have either Ethernet ports or
wireless interfaces (some have both). Not all printers are network-capable or wireless.
Displaying Wireless Links
The physical arrangement of the home office equipment used on the Illustrated Network is shown in Figure 3.5. In addition to the three wired PCs (used for various
equipment configurations), there are two wireless links. One is used by the laptop for
mobility, and the other is used to share a color laser printer. The DSL router does not
have “ports” in the same sense as wired network devices, but it only supports up to four
wireless devices.
The wireless link from the laptop to the DSL router, which uses something called
IEEE 802.11g (sometimes called Wireless-G), is a distinct Layer 2 network technology
and should not use Ethernet II frames. Let’s make sure.
Capturing traffic at the wireless frame level requires special software and special
drivers for the wireless network adapter card. The examples in this chapter use information from a wireless packet sniffer called Airopeek NX from Wildpackets.
A sample capture of a data packet and frame from a wireless link is shown in
Figure 3.6.
Wireless LANs based on IEEE 802.11 use a distinct frame structure and a complex
data link layer protocol. We’ll talk about 802.11 shortly, but for now we should just note
that the Illustrated Network uses USB-attached wireless NICs, and few wireless sniffers
support these types of adapters.
The frame addressing and encapsulation on wireless LANs are much more complicated than on Ethernet. Note that the 802.11 MAC frame has three distinct MAC addresses,
labeled Destination, BSSID, and Source. The wireless LAN has to keep track of source,
destination, and wireless access point (Base Station System ID, or BSSID) addresses. Also
note that these are not really Ethernet II frames. The frames on the wireless link are
structured according to the IEEE 802.2 LLC header. These have “SNAP SAP,” indicated
by 0xAA, in the frame, in contrast to Ethernet II frames, which are indicated by 0x01.
FIGURE 3.6
Data frame and packet on a wireless link. Note that the IEEE 802.11 MAC header is different
from the Ethernet in many ways and uses the IEEE 802.2 LLC inside.
FIGURE 3.7
The next data frame in the sequence, showing how the contents of the address fields shift based
on direction and type of wireless frame.
The address fields in 802.11 also “shift” their meaning, as shown in Figure 3.7. The
fields are now BSSID, Source, and Destination. This is another capture from Airopeek
NX, showing the next data frame sent in the captured exchange. The address fields
have different meanings based on whether they are sent to the wireless router or are
received from the wireless router.
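For readers who want to see the three address slots side by side, here is a simplified Python sketch of an 802.11 data frame. The address ordering follows the two cases shown in Figures 3.6 and 3.7; the frame control flags and sequence numbers are placeholders, only two of the addresses below come from the earlier LAN capture, and real 802.11 headers carry more fields than shown:

import struct

def mac(text: str) -> bytes:
    return bytes(int(part, 16) for part in text.split(":"))

# LLC/SNAP header: SNAP SAP (0xAA) twice, Control 0x03, zero OUI, then Ethertype 0x0800.
LLC_SNAP_IPV4 = bytes([0xAA, 0xAA, 0x03, 0x00, 0x00, 0x00, 0x08, 0x00])

def data_frame(to_ap: bool, src: bytes, dst: bytes, bssid: bytes, payload: bytes) -> bytes:
    if to_ap:   # station -> AP: Address1 = BSSID, Address2 = Source, Address3 = Destination
        a1, a2, a3 = bssid, src, dst
    else:       # AP -> station: Address1 = Destination, Address2 = BSSID, Address3 = Source
        a1, a2, a3 = dst, bssid, src
    frame_control = struct.pack("<H", 0x0008)   # bare "data" type; direction flags omitted here
    duration = struct.pack("<H", 0)
    sequence = struct.pack("<H", 0)
    return frame_control + duration + a1 + a2 + a3 + sequence + LLC_SNAP_IPV4 + payload

laptop = mac("00:0e:0c:3b:87:32")        # illustrative addresses only
ap     = mac("00:05:85:8b:bc:db")
router = mac("00:11:22:33:44:55")        # hypothetical destination beyond the AP
print(data_frame(True, src=laptop, dst=router, bssid=ap, payload=b"IP packet").hex())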
Frames and the Link Layer
In summary, we have seen that the connections on the Illustrated Network consist of
several types of links. There are wired Ethernet LANs and Gigabit Ethernet links, SONET
links and DSL links, and even a wired LAN in the home network. We’ve looked at some
of the frame types that carry information back and forth on the network connections.
There are many more types of frames that can carry IP packets between systems
at the data link layer. The rest of this chapter will explore the data link layer in a little
more depth.
RFCs and Physical Layers
Internet RFCs usually describe not how the physical (or data link) layers in a
TCP/IP network should function, but how to place packets inside data link frames
and get them out again at the other end of the link to the adjacent system. It is
always good to remember that frames flow between adjacent (directly connected
or reachable) systems on a network.
THE DATA LINK LAYER
Putting the world of connectors, modems, and electrical digital signal levels of the
physical layer aside, let’s go right to the data link layer of the TCP/IP protocol stack. It’s
not that these things are not important to networking; it’s just that these things have
nothing directly to do with TCP/IP.
The data link layer of TCP/IP takes an IP packet at the source and puts it inside
whichever frame structure is used between systems (e.g., an Ethernet frame). The data
link layer then passes the frame to the physical layer, which sends the frame as a series
of bits over the link itself. At the receiver, the physical and data link layers recover the
frame from the arriving sequence of bits and extract the packet. The packet is then
passed to the receiving network (IP) layer.
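As a toy illustration of that round trip (nothing protocol-specific yet; the real frame formats follow below):

# Wrap an IP packet in a frame on the way out, recover it at the other end of the link.
def encapsulate(ip_packet: bytes, header: bytes, trailer: bytes) -> bytes:
    return header + ip_packet + trailer

def decapsulate(frame: bytes, header_len: int, trailer_len: int) -> bytes:
    return frame[header_len:len(frame) - trailer_len]

packet = b"\x45\x00" + b"...rest of an IP packet..."
frame = encapsulate(packet, header=b"FRAME-HDR", trailer=b"FCS.")
assert decapsulate(frame, header_len=9, trailer_len=4) == packet
print(f"{len(packet)} packet bytes carried in a {len(frame)}-byte frame")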
Interfaces for IP packets have been defined for all of the following network types,
for both LAN and WAN:
Ethernet—Originally from Digital Equipment Corporation, Intel, and Xerox (sometimes called DIX Ethernet).
IEEE (Institute of Electrical and Electronics Engineers) 802.3—Ethernet-based
LANs, including all its variations, such as Gigabit Ethernet.
Synchronous Optical Network, Synchronous Digital Hierarchy (SONET/SDH)—
A high-speed, optical WAN transport.
IEEE 802.11 Wireless LANs—Includes any technology, such as WiFi, based on variations of this.
Token Ring—LANs from IBM, the same as IEEE 802.5.
Point-to-Point Protocol (PPP)—This protocol is from the IP developers themselves, and is not limited to carrying IP packets.
X.25—An international standard, public, switched, connection-oriented network
protocol.
Frame Relay—An international standard, public, switched, connection-oriented
network protocol based on X.25.
Asynchronous Transfer Mode (ATM)—An international standard, public, switched,
connection-oriented network protocol based on cells instead of frames.
Fiber Distributed Data Interface (FDDI)—A LAN-like network ring running at
100 Mbps.
Switched Multimegabit Data Services (SMDS)—A high-speed, connectionless,
LAN-like, public network service.
Integrated Services Digital Network (ISDN)—A public switched network similar
to X.25.
Digital Subscriber Line (DSL)—Based on some older Integrated Services Digital Network (ISDN)–related technologies and used for high-speed Internet
access.
Serial Line Internet Protocol (SLIP) and Compressed SLIP (CSLIP)—An older
way of sending IP packets over a dial-up, asynchronous modem arrangement
(also from the IP developers).
Cable Modems (CMODEMs)—A method of sending IP packets over a cable TV
infrastructure.
IP over FireWire (IPoFW)—IEEE 1394 (FireWire), a popular PC interface for peripheral
devices.
There are other interfaces as well, such as ARCnet and IEEE 802.4
LANs, but the point is that TCP/IP is not tied to any specific type of network
at the lower layers. The TCP/IP protocol stack is very flexible and encompassing, much more so than almost anything else that could be used on a global
network.
In the future, this list will get even longer as newer transports for IP packets are
standardized and older ones remain (in spite of diminishing interest, standards like
these tend to stay in place because no one cares enough to move them to “historic”
RFCs). Some of the newer network types that might find their way onto many networks
in the future follow:
VDSL—VDSL is a “very-high-speed” form of DSL that uses fiber feeders reaching
to within less than a mile of the home (often called fiber to the neighborhood, or
FTTN). Most VDSL service offerings deliver television, telephone, and high-speed Internet access over a unified residential cabling system through a special residential gateway box. On the Illustrated Network, the home office DSL
link is actually VDSL, but this service is not as widely available as other forms
of DSL.
GE-PONS—These Gigabit Ethernet Passive Optical Network (GE-PONS) nodes are
part of a global push toward Fiber to the Home (FTTH), an approach that has
been—somewhat ironically—slowed by the popularity of DSL over copper
wires. Based on IEEE 802.3ah standards, this technology can support gigabit
speeds in both directions and might take advantage of the popularity of voice
over IP (VoIP).
BPL—In some places, high-speed Internet access is provided by the electric
utility as part of broadband power line (BPL) technology. Delivered over the
same socket as power, BPL services might form a nice adjunct to wireless services, which are hard to cost-justify in sparsely populated areas and over rough
terrain.
The advantage of not tying the network layer to any specific type of links at the
lower layers is flexibility (IP can run on anything). A new type of network interface can
be added without great effort. Also, it makes linking these various network types into
an internetwork that much easier.
All TCP/IP implementations must be able to support at least one of the defined
interface types. Most implementations of TCP/IP will do fine today with only a handful
of interface types, and, as we have seen, Ethernet frames are perhaps the most common
of all data-link frame formats for IP packets, especially at the endpoints of the network.
The rest of this chapter provides a closer look at the four link types used on the
Illustrated Network, as well as PPP, the major IETF data-link protocol that we saw used
on SONET. The coverage is not intended to be exhaustive, but will be enough to introduce the technologies.
Although all four link types are covered, the coverage is not equal. There is much
more information about Ethernet and wireless than SONET or DSL. The main reason
is that expensive and exotic line monitoring equipment is needed in order to burrow
deep enough in the lower layers of the protocol stacks used in SONET and DSL to show
the transmission frames. End users, and even many smaller ISPs, do just fine diagnosing
problems on SONET and DSL links with basic Ethernet and IP monitoring tools. Then
again, point-to-point links are a bit easier to diagnose than shared media networks. (Is
the line protocol up in both directions? Is the distance okay? Is the bit error rate acceptable? Okay, it’s not the link layer . . .)
SONET and DSL are distinguished from Ethernet and wireless LANs with regard to
addressing. SONET and DSL are point-to-point technologies and use much simpler link-level addressing schemes than LAN technologies. There are only two ends in a point-to-point connection, and you always know which end you are. Anything you send is
intended for the other end of the link, and anything you receive comes from the other
end as well.
THE EVOLUTION OF ETHERNET
The original Ethernet was developed at the Xerox Palo Alto Research Center (PARC)
in the mid-1970s to link the various mainframes and minicomputers that Xerox used
in their office park campus environment of close-proximity buildings. The use of WAN
protocols to link all of these buildings did not appeal to Xerox for two reasons. First, an
efficient WAN infrastructure would have demanded a mesh of leased telephone lines,
which would have been enormously expensive given the number of computers. Second, leased telephone lines did not have the bandwidth (usually these carried only up
to 9600 bps, and at most 56 Kbps, in the late 1970s) needed to link the computers.
Their solution was to invent the local area network, the LAN. However, Xerox was
not interested in actually building hardware and chipsets for their new invention,
which was named Ethernet. Instead, Bob Metcalf, the Ethernet inventor, left Xerox and
recruited two other companies, one to make chipsets for Ethernet and the other to
make the hardware components to employ these chipsets. The two companies were
chip-maker Intel and computer-maker Digital Equipment Corporation (DEC). Ethernet
v1.0 was rolled out in 1980, followed by Ethernet v2.0 in 1982, which fixed some
annoying problems in v1.0. This is why, in our examples, Ethereal keeps showing that
IP packets are inside Ethernet II frames when they leave and arrive at hosts.
DIX Ethernet, the proprietary version, ran over a single, thick coaxial cable “bus” that
snaked through a building or campus. Transmitting and receiving devices (transceivers)
were physically clamped to the coaxial cable (with “vampire taps”) at predetermined
intervals. Transceivers usually had multiple ports for attaching the transceiver cables
that led to the actual PC or minicomputer linked by the Ethernet LAN. The whole LAN
ran at an aggregate speed of 10 Mbps, an unbelievable rate for the time. But Ethernet
had to be fast, because up to 1024 computers could share this single coaxial cable bus
to communicate using a media access method known as carrier-sense multiple access
with collision detection (CSMA/CD). DIX Ethernet had to be distinguished from all
other forms of Ethernet, which were standardized by the IEEE starting in 1984.
The IEEE first standardized a slightly different arrangement for 10-Mbps CSMA/CD
LANs (IEEE 802.3) in 1984. Why the IEEE felt compelled to change the proprietary Ethernet technology during the standardization process is somewhat of a puzzle. Some said
the IEEE always did this, but around the same time the IEEE essentially rubberstamped
IBM’s proprietary Token Ring LAN specification as IEEE 802.5. The changes to the hardware of DIX Ethernet were minor. There was no v1.0 support at all (i.e., all IEEE 802.3
LANs were DIX Ethernet v2.0) and the terminology was changed slightly. The DIX
transceiver became the IEEE 802.3 “medium attachment unit” (MAU), and so on.
However, throughout the 1980s and into the 1990s, as research into network
capabilities matured, the IEEE added a number of variations to the original IEEE 802.3
CSMA/CD hardware specification. The original specification became 10Base5 (which
meant 10-Mbps transport, using baseband signaling, with a 500-meter LAN segment). This
was joined by a number of other variants designed to make LAN implementation more
flexible and—especially—less expensive. New IEEE 802.3 variations included 10Base2
(with 200-meter segments over thin coaxial cable), the wildly popular 10BaseT (with
hubs instead of segments linked to PCs by up to 100 meters of unshielded twisted-pair copper wire), and versions that ran on fiber-optic cable. Eventually, all of these
technologies except those on coaxial cable went first to 100 Mbps (100BaseT), then
1000 Mbps (Gigabit Ethernet), which run over twisted pair for short spans and can use
fiber for increasingly long hauls, now in the SONET/SDH ranges.
Today, IEEE 802.3ae 10GBase-ER (extended reach) LAN physical layer links can span
40 km. Another, “ZR,” is not standardized, but can stretch the span to 80 km. And interestingly, 10-Gbps Ethernet is back on coaxial cable as 10GBase-CX4.
Ethernet II and IEEE 802.3 Frames
Today, of course, the term “Ethernet” essentially means the same as “IEEE 802.3 LAN.” In
addition to changing the hardware component names and creating IEEE 802.3 10BaseT,
the IEEE also changed the Ethernet frame structure for reasons that remain obscure. It
was this development that had the most important implication for those implementing
the TCP/IP protocol stack on top of Ethernet LANs.
The DIX Ethernet II frame structure was extremely simple. There were fields in the
frame header for the source and destination MAC (the upper part of the data link layer,
used on LANs) address, a type field to define content (packet) structure, a variable-length data field, and an error-detecting trailer. The source and destination addresses
were required for the mutually adjacent systems on a LAN (a point-to-point-oriented
data link layer with just a “destination” address would not work on LANs: Who sent this
frame?). The type field was required so the recipient software would know the structure of the data inside the frame. That is, the destination NIC could examine the type
field and determine if the frame contents were an IP packet, some other type of packet,
a control frame, or almost anything else. The destination NIC card could then pass the
frame contents to the proper software module (the network layer) for further processing on the frame data contents. The type field value for IP packets was set as 0x0800,
the bit string 00001000 00000000.
However, the IEEE 802 committee changed the simple DIX Ethernet II frame structure to produce the IEEE 802.3 CSMA/CD frame structure. Gone was the DIX Ethernet II
type (often called “Ethertype”) field, and in its place was a same-sized length field. This
action somewhat puzzled observers of LAN technology. DIX Ethernet II frames worked
just fine without an explicit length field. The total frame length was determined by the
positions of the starting and ending frame delimiters. The data were always after the
header and before the trailer. Simple enough for software to figure out.
Now, with IEEE 802.3 it was even easier to figure out the length of a received frame
(the software just had to look at the length field value). However, it was now impossible for the receiving software to figure out just what the structure of the frame data
was by looking only at the frame header. Clearly, a place in the IEEE 802.3 CSMA/CD
frame had to be found to put the DIX Ethernet II type field, since receivers had to have
a way to figure out which software process understood the frame content’s data structure. Other protocols did not understand IP packet structures, and vice versa.
The IEEE 802.3 committee “robbed” some bytes from the payload area, bytes which
in DIX Ethernet were data bytes. Since the overall length of the frame was already fixed,
and this set the length of the frame data to 1500 bytes (the same as in DIX Ethernet),
the outcome was to reduce the allowed length of the data contents of an IEEE 802.3
frame. A simplified picture of the two frame types indicating the location of the 0x0800
type field and the length of the data field is shown in Figure 3.8.
DIX Ethernet Frame Structure:
  Destination Address (6 bytes) | Source Address (6 bytes) | Type (2 bytes) | Information (46–1500 bytes) | FCS (4 bytes)
  Type = 0x0800 for IP packets

IEEE 802.3 LAN Frame Structure:
  Destination Address (6 bytes) | Source Address (6 bytes) | Length (2 bytes) | Information (38–1492 bytes, preceded by 8 bytes of added overhead) | FCS (4 bytes)
  The 8 bytes of added overhead:
  Logical Link Control (LLC)
    Destination Service Access Point (DSAP) = 0xAA (“SNAP SAP”)
    Source Service Access Point (SSAP) = 0xAA
    Control = 0x03 (same as in PPP)
  Subnetwork Access Protocol (SNAP)
    Organizationally Unique ID = 0x000000 (usually)
    Type = 0x0800 for IPv4 packets, 0x86DD for IPv6, etc.
FIGURE 3.8
Types of Ethernet frames. The frames for Gigabit and 10 Gigabit Ethernet differ in detail, but
follow the same general structure.
MAC Addresses
The MAC addresses used in 802 LAN frames are all 48 bits (6 bytes) long. The first
24 bits (3 bytes) are assigned by the IEEE to the manufacturer of the NIC (manufacturers pay for them). This is the Organizationally Unique Identifier (OUI). The last 24 bits
(3 bytes) are the NIC manufacturer’s serial number for that NIC. Some protocol analyzers know the manufacturer’s ID (which is not public but seldom suppressed) and
display this along with the address. This is how Ethereal displays MAC addresses not
only in hex but starting with “Intel_” or “Juniper_.”
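As an aside, here is a small Python sketch (not part of Ethereal) of the same idea: split the MAC address into its OUI and serial number, and substitute a vendor name when the OUI is known. The two OUIs below are the ones that appear in the captures earlier in this chapter; a real analyzer ships a far larger vendor table.

# Label a MAC address the way a protocol analyzer does: vendor name plus serial number.
OUI_NAMES = {
    "00:0e:0c": "Intel",
    "00:05:85": "Juniper",
}

def pretty_mac(mac: str) -> str:
    oui, serial = mac[:8].lower(), mac[9:]     # first 3 bytes, last 3 bytes
    vendor = OUI_NAMES.get(oui)
    return f"{vendor}_{serial}" if vendor else mac

print(pretty_mac("00:0e:0c:3b:87:32"))   # Intel_3b:87:32
print(pretty_mac("00:05:85:8b:bc:db"))   # Juniper_8b:bc:db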
Note that both frame types use the same, familiar source and destination MAC
address, and use a 32-bit (4-byte) frame check sequence (FCS) for frame-level error
detection. The FCS used in both cases is a standard, 32-bit cyclical redundancy check
(CRC-32). The important difference is that the DIX Ethernet frame indicates information type (frame content) with a 2-byte type field (0x0800 means there is an IPv4 packet
inside and 0x86DD means there is an IPv6 packet inside) and the IEEE 802.3 CSMA/CD
frame places this Ethertype field at the end of an additional 8 bytes of overhead called
the Subnetwork Access Protocol (SNAP) header. Another 3 bytes are the OUI given to
the NIC vendor when they registered with the IEEE, but this field is not always used
for that purpose.
The 802.3 frame must subtract these 8 bytes from the IP packet length so that the
overall frame length is still the same as for DIX Ethernet II. This is because the maximum length of the frame is universal in almost all forms of Ethernet. The maximum
IEEE 802.3 frame data size is 1492 bytes due to the 8 extra bytes needed to carry the
type field. Any IP packet larger than this will not fit in a single frame, and must have its payload fragmented into more than one frame and reassembled at the
receiver.
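To make the difference concrete, here is a small Python sketch (an illustration, not production code) that builds both frame layouts around the same IP packet: the DIX version with a 2-byte Ethertype, and the IEEE 802.3 version with a length field plus the 8-byte LLC/SNAP header carrying the same Ethertype. Minimum-size padding and the exact bit order of the transmitted CRC are ignored here.

import struct
import zlib

DST = bytes.fromhex("0005858bbcdb")      # example addresses from the earlier capture
SRC = bytes.fromhex("000e0c3b8732")
IPV4_ETHERTYPE = 0x0800

def dix_frame(ip_packet: bytes) -> bytes:
    header = DST + SRC + struct.pack("!H", IPV4_ETHERTYPE)          # type field
    body = header + ip_packet
    return body + struct.pack("!I", zlib.crc32(body))               # 4-byte FCS (CRC-32)

def ieee8023_snap_frame(ip_packet: bytes) -> bytes:
    llc_snap = bytes([0xAA, 0xAA, 0x03, 0x00, 0x00, 0x00]) + struct.pack("!H", IPV4_ETHERTYPE)
    payload = llc_snap + ip_packet
    header = DST + SRC + struct.pack("!H", len(payload))             # length field, not type
    body = header + payload
    return body + struct.pack("!I", zlib.crc32(body))

packet = bytes(100)   # a dummy 100-byte IP packet
print(len(dix_frame(packet)), len(ieee8023_snap_frame(packet)))      # 118 vs 126 bytes on the wire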
That’s not all there is to it. LAN implementers and vendors quickly saw that the
IEEE 802.3 hardware arrangement was more flexible (and less expensive) than DIX
Ethernet. They also saw that the DIX Ethernet II frame structure was simpler and could
carry slightly more user data than the complex IEEE 802.3 frame structure. Being practical people, the vendors simply used the flexible IEEE 802.3 hardware with the simple
DIX Ethernet II frame structure, creating the mixture that is commonly seen today on
most LANs.
Today, just because the hardware is IEEE 802.3 compliant (e.g., 100BaseT), does not
mean that the frame structure used to carry IP packets is also IEEE 802.3 compliant. The
frame structure is most likely Ethernet II, as we have seen. (It’s worth pointing out that
Ethernet frame content other than IP usually uses the 802.3 frame format. However, the
Illustrated Network is basically an IP-only network.)
THE EVOLUTION OF DSL
IP packet interfaces have been defined for many LAN and WAN network technologies.
As soon as a new transport technology reaches the commercial-deployment stage, IP
is part of the scheme, if for no other reason than regardless of what is in the middle,
TCP/IP in Ethernet frames is at both ends. DSL technologies are a case in point. Originally designed for the “national networks” that would offer everything that the Internet
does today, but from the telephone company as part of the Integrated Services Digital
Network (ISDN) initiatives of the 1980s, DSL was adapted for “broadband” Internet
access when the grand visions of the telephone companies as content providers were
reduced to the reality of a restricted role as ISPs and little more. (Even the term “broadband” is a topic of much debate: A working definition is “speeds fast enough to allow
users to watch video without getting a headache or becoming disgusted,” speeds that
keep dropping as video coding and compression techniques become better.)
DSL once included a complete ATM architecture, with little or no TCP/IP. Practical
considerations forced service providers to adapt DSLs once again, this time for the real
consumer world of Ethernet LANs running TCP/IP. And a tortured adaptation it proved
to be. The problem was deeper than just taking an Ethernet frame and mapping it to a
DSL frame (even DSL bits are organized into a distinctive transport frame). Users had to
be assigned unique IP addresses (not necessary on an isolated LAN), and the issues of
bridging versus routing versus switching had to be addressed all over again. This was
because linking two LANs (the home user client LAN, even if it had but one PC, and the
server LAN) over a WAN link (DSL) was not a trivial task. The server LAN could be the
service provider’s “home server” or anyplace else the user chose to go on the Internet.
Also, ATM logical links (called permanent virtual circuits, or PVCs) are normally
provisioned between the usual local exchange carrier’s DSLAM and the Internet access
Networking Visions Today and Yesterday
Today, when anyone can start a Web site with a simple server and provide a service
to one and all over the Internet, it is good to remember that things were not always
supposed to be this way. Not so long ago, the control of services on a public global
network was supposed to be firmly under the control of the service provider.
Many of these “fast-packet” networking schemes were promoted by the national
telephone companies, from broadband ISDN to ATM to DSL. They all envisioned a
network much like the Internet is today, but one with all the servers “in the cloud”
owned and operated by the service providers. Anyone wanting to provide a service (such as a video Web site) would have to go to the service provider to make
arrangements, and average citizens would probably be unable to break into that
tightly controlled and expensive market.
This scheme avoided the risk of controversial Web site content (such as copyrighted material available for download), but with the addition of restrictions and
surveillance. Also, the economics for service providers are much different when
they control content from when they do not.
Today, ISPs most often provide transport and connectivity between Web
sites and servers owned and operated by almost anyone. ISP servers are usually
restricted to a small set of services directly related to the ISP, such as email or
account management.
provider’s aggregation router. This can be very costly because IP generally has much
better statistical multiplexing properties and there can be long hauls through the ATM
networks before the ATM link is terminated.
The solution was to scrap any useful role for ATM (and any non-TCP/IP infrastructure) except as a passive transport for IP packets. This left ATM without any rationale
for existence, because most of the work was done by running PPP over the DSL link
between a user LAN and a service provider LAN.
PPP and DSL
Why is PPP used with DSL (and SONET)? The core of the issue is that ISPs needed some
kind of tunneling protocol. Tunneling occurs when the normal message-packet-frame
encapsulation sequence of the layers of a networking protocol suite are violated. When
a message is placed inside a packet, then inside a frame, and this frame is placed inside
another type of frame, this is a tunneling situation. Although many tunneling methods
have been standardized at several different TCP/IP layers, tunneling works as long as
the tunnel endpoints understand the correct sequence of headers and content (which
can also be encrypted for secure tunnels).
In DSL, the tunneling protocol had to carry the point-to-point “circuits” from the
central networking location to the customer’s premises and across the shared media
LAN to the end user device (host). There are many ways to do this, such as using IP-in-IP tunneling, a virtual private network (VPN), or lower level tunneling. ISPs chose PPP
as the solution for this role in DSL.
Using PPP made perfect sense. For years, ISPs had used PPP to manage their WAN
dial-in users. PPP could easily assign and manage the ISP’s IP address space, compartmentalize users for billing purposes, and so on. As a LAN technology, Ethernet had none
of those features. PPP also allowed user authentication methods such as RADIUS to be
used, methods completely absent on most LAN technologies (if you’re on the LAN, it’s
assumed you belong there).
Of course, keeping PPP meant putting the PPP frame inside the Ethernet frame, a
scheme called Point-to-Point Protocol over Ethernet (PPPoE), described in RFC 2516.
Since tunneling is just another form of encapsulation, all was well.
PPP is not the only data link layer framing and negotiation procedure (PPP is not a
full data link layer specification) from the IETF. Before PPP became popular, the Serial
Line Internet Protocol (SLIP) and a closely related protocol using compression (CSLIP,
or Compressed SLIP) were used to link individual PCs and workstations not connected
by a LAN, but still running TCP/IP, to the Internet over a dial-up, asynchronous analog
telephone line with modems. SLIP/CSLIP was also once used to link routers on widely
separated TCP/IP networks over asynchronous analog leased telephone lines, again
using modems. SLIP/CSLIP is specified in RFC 1055/STD 47.
PPP Framing for Packets
PPP addresses many of the limitations of SLIP, and can run over both asynchronous
links (as does SLIP) and synchronous links. PPP provides for more than just a simple
frame structure for IP packets. The PPP standard defines management and testing functions for line quality, option negotiation, and so on. PPP is described in RFC 1661, is
protocol independent, and is not limited to IP packet transport.
The PPP control signals, known as the PPP Link Control Protocol (LCP), need not
be supported, but are strongly recommended to improve performance. Other control
information is included by means of a Network Control Protocol (NCP), which defines
management procedures for frame content protocols. The NCP even allows protocols
other than IP to use the serial link at the same time. The LCP and NCP subprotocols are
a distinguishing feature of PPP.
The use of LCP and NCP on a PPP link on a TCP/IP network follows:
■ The source PPP system (user) sends a series of LCP messages to configure and test the serial link.
■ Both ends exchange LCP messages to establish the link options to be used.
■ The source PPP system sends a series of NCP messages to establish the Network Layer protocol (e.g., IP, IPX, etc.).
■ IP packets and frames for any other configured protocols are sent across the link.
■ NCP and LCP messages are used to close the link down in a graceful and structured manner.
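The same sequence, compressed into a few lines of Python (a narrative aid rather than a real PPP state machine; the chapter does not name the IP-specific NCP, which is IPCP):

# The phases a PPP link passes through, from bring-up to graceful close.
PPP_PHASES = (
    ("LCP",  "Configure-Request/Configure-Ack exchanged to test the link and agree on options"),
    ("NCP",  "the network layer protocol (IPCP for IP) is brought up over the link"),
    ("data", "IP packets, and frames for any other configured protocols, cross the link"),
    ("NCP",  "the network layer protocol is closed"),
    ("LCP",  "Terminate-Request/Terminate-Ack close the link gracefully"),
)

for subprotocol, action in PPP_PHASES:
    print(f"{subprotocol:>4}: {action}")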
Flag (0x7E = 0111 1110) | Address (0xFF = 1111 1111) | Control (0x03 = 0000 0011) | Protocol (2 bytes) | Information (variable) | FCS (2 bytes) | Flag (0x7E = 0111 1110)

Protocol field values:
  0xC021 = Link Control Protocol (LCP)
  0x8021 = Network Control Protocol (NCP)
  0x0021 = IP packet inside

FIGURE 3.9
The PPP frame. The flag bytes (0x7E) essentially form an “idle pattern” on the link that is
“interrupted” by frames carrying information.
The benefit is a more efficient WAN transport for IP packets. The structure
of a PPP frame is shown in Figure 3.9.
The Flag field is 0x7E (0111 1110), as in many other data link layer protocols. The
Address field is set to 0xFF (1111 1111), which, by convention, is the “all-stations” or
broadcast address. Note that none of the other fields in the Point-to-Point Protocol header
have a source address for the frame. Point-to-point links only care about the destination,
which is always 0xFF in PPP and essentially means “any device at the other end of this
link that sees this frame.” This is one reason why serial interfaces on routers sometimes
do not have IP addresses (but many serial interfaces, especially to other routers, have
them anyway—this is the only way to make the serial links “visible” to the IP layer and
network operations).
The Control field is set to 0x03 (0000 0011), which is the Unnumbered Information
(UI) format, meaning that there is no sequence numbering in these frames. The UI format is used to indicate that the connectionless IP protocol is in use. The Protocol field
identifies the format and use of the content of the PPP frame itself. For LCP messages,
the Protocol field has the value 0xC021 (1100 0000 0010 0001), for NCP the field has the
value 0x8021 (1000 0000 0010 0001), and for IP packets the field has the value 0x0021
(0000 0000 0010 0001).
Following the header is a variable-length Information field (the IP packet), followed
by a PPP frame trailer with a 16-bit frame check sequence (FCS) for error control, and
finally an end-of-frame Flag field.
PPP frames may be compressed, field sizes reduced, and used for many specific
tasks, as long as the endpoints agree.
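A small Python sketch of the frame just described may help. It assembles the header, an IP protocol field, and the 16-bit FCS computed the way RFC 1662 specifies; byte stuffing of any 0x7E or 0x7D values inside the frame is left out to keep the sketch short, and nothing here comes from a real PPP implementation.

import struct

FLAG, ADDRESS, CONTROL = 0x7E, 0xFF, 0x03
PROTOCOL_IP = 0x0021          # 0xC021 would mean LCP, 0x8021 NCP

def ppp_fcs16(data: bytes) -> int:
    """16-bit HDLC/PPP FCS: reflected polynomial 0x8408, init 0xFFFF, complemented."""
    fcs = 0xFFFF
    for byte in data:
        fcs ^= byte
        for _ in range(8):
            fcs = (fcs >> 1) ^ 0x8408 if fcs & 1 else fcs >> 1
    return fcs ^ 0xFFFF

def ppp_frame(ip_packet: bytes) -> bytes:
    body = bytes([ADDRESS, CONTROL]) + struct.pack("!H", PROTOCOL_IP) + ip_packet
    fcs = ppp_fcs16(body)
    return bytes([FLAG]) + body + struct.pack("<H", fcs) + bytes([FLAG])   # FCS sent low byte first

print(ppp_frame(b"dummy IP packet").hex())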
DSL Encapsulation
How are IP packets encapsulated on DSL links? DSL specifications establish a basic DSL
frame as the physical level, but IP packets are not placed directly into these frames. IP
packets are placed inside PPP frames, and then the PPP frames are encapsulated inside
Ethernet frames (this is PPP over Ethernet, or PPPoE). Finally, the Ethernet frames are
placed inside the DSL frames and sent to the DSL access multiplexer (DSLAM) at the telephone switching office.
Once at the switching office, it might seem straightforward to extract the Ethernet
frame and send it on into the “router cloud.” But it turns out that almost all DSLAMs are
networked together by ATM, a technology once championed by the telephone companies. (Some very old DSLAMs use another telephone company technology known as
frame relay.) ATM uses cells instead of frames to carry information.
So the network/data-link/physical layer protocol stack used between DSLAMs and
service provider routers linked to the Internet usually looks like five layers instead of
the expected three:
■ an IP packet containing user data, which is inside
■ a PPP frame, which is inside
■ an Ethernet frame running to the DSL router (PPPoE), which is inside
■ a series of ATM cells, which are sent
■ over the physical medium as a series of bits.
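A schematic Python sketch of that nesting, with each layer reduced to a label standing in for its real header (the PPPoE session Ethertype, 0x8864, is an added detail, not something shown in this chapter):

# Show the order of encapsulation on a DSL link; the labels stand in for real headers.
def wrap(label: str, inner: bytes) -> bytes:
    return f"[{label} ".encode() + inner + b"]"

ip_packet = b"IP(user data)"
ppp = wrap("PPP proto=0x0021", ip_packet)                    # IP inside PPP
pppoe_ethernet = wrap("Ethernet type=0x8864 PPPoE", ppp)     # PPP inside an Ethernet/PPPoE frame
atm = wrap("ATM cells (AAL5)", pppoe_ethernet)               # carried between DSLAM and router over ATM
on_the_wire = wrap("DSL frame bits", atm)                    # finally, bits on the loop

print(on_the_wire.decode())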
We’ll take a closer look at frame relay and ATM in a later chapter on public network
technologies that can be used to link routers together.
Forms of DSL
Entire books are devoted to the variations of DSL and the DSL protocol stacks used by
service providers today. Instead of focusing on all the details of these variations, this
section will take a brief look at the variation of DSL that can be used when IP packets
make their way from a home PC onto the Internet.
DSL often appears as “xDSL” where the “x” can stand for many different letters. DSL
is a modern technology for providing broadband data services over the same twisted-pair (TP) copper telephone lines that provide voice service. DSL services are often
called “last-mile” (and sometimes “first-mile”) technologies because they are used only
for short connections between a telephone switching station and a home or office. DSL
is not used between switching stations (SONET is often used there).
DSL is an extension of the Integrated Services Digital Network (ISDN) technology
developed by the telephone companies for their own set of combined voice and data
services. They operate over short ranges (less than 18 kilofeet) of 24 American Wire
Gauge (AWG) voice wire to a telephone central office. DSLs offer much higher speeds
than traditional dial-up modems, up to 52 Mbps for traffic sent “downstream” to the user and usually from 32 kbps to 1 Mbps for traffic sent “upstream” to the central
office. The actual speed is distance limited, dropping off at longer distances.
At the line level, DSLs use one of several sophisticated modulation techniques running in premises DSL router chipsets and DSLAMs at the telephone switching office.
These include the following:
■ Carrierless Amplitude Modulation (CAP)
■ Discrete Multitone Technology (DMT)
■ Discrete Wavelet Multitone (DWM)
■ Simple Line Code (SLC)
■ Multiple Virtual Line (MVL)
Some forms of DSL operate in a duplex (symmetrical) fashion, offering the same speeds
upstream and downstream. Others, mainly targeted at residential Internet browsing
customers, offer higher downstream speeds to handle relatively large server replies to
upstream mouse clicks or keystrokes. However, standard VDSL and VDSL2 have much
less asymmetry than other methods. For example, 100-Mbps symmetric operation is
possible at 0.3 km, and 50 Mbps symmetric at 1 km.
The DSLAMs connect to a high-speed service provider backbone, and then the
Internet. DSLAMs aggregate traffic, typically for an ATM network, and then connect to a
router network. On the interface to the premises, the DSLAM demultiplexes traffic for
individual users and forwards it to the appropriate users.
In order to support traditional voice services, most DSL technologies require a signal filter or “splitter” to be installed on the customer premises to share the twisted-pair
wiring. The DSLAM splits the signal off at the central office. Splitterless DSL is very
popular, however, in the form of “DSL Lite” or several other names.
In Table 3.2, various types of DSL are compared. The speeds listed are typical, as
are the distance (there are many other factors that can limit DSL reach) and services
offered.
VDSL requires a fiber-optic feeder system to the immediate neighborhood, but VDSL
can provide a full suite of voice, video, and data services. These services include the
highest Internet access rates available for residential services, and integration between
voice and data services (voice mail alerts, caller ID history, and so on, all on the TV
Table 3.2 Types of DSL

IDSL (ISDN DSL)
  Typical data rate: 128 Kbps, duplex
  Distance: 18k ft on 24 AWG TP
  Applications: ISDN services (voice and data); Internet access

HDSL (High-speed DSL)
  Typical data rate: 1.544 to 2.048 Mbps, duplex
  Distance: 12k ft on 24 AWG TP
  Applications: T1/E1 service, feeder, WAN access, LAN connections, Internet access

SDSL (Symmetric DSL)
  Typical data rate: 1.544 to 2.048 Mbps, duplex
  Distance: 12k ft on 24 AWG TP
  Applications: Same as HDSL

ADSL (Asymmetric DSL)
  Typical data rate: 1.5 to 6 Mbps down, 16 to 640 kbps up
  Distance: 18k ft on 24 AWG TP
  Applications: Internet access, remote LAN access, some video applications

DSL Lite, or G.Lite (“Splitterless” ADSL)
  Typical data rate: 1.5 to 6 Mbps down, 16 to 640 kbps up
  Distance: 18k ft on 24 AWG TP
  Applications: Same as ADSL, but does not require a premises “splitter” for voice services

VDSL (Very-high-speed DSL)
  Typical data rate: 13 to 52 Mbps down, 1.5 to 2.3 Mbps up
  Distance: 1k to 4.5k ft, depending on speed
  Applications: Same as ADSL plus full voice and video services, including HDTV
screen). VDSL is used on the Illustrated Network to get packets from the home office’s
PCs to the ISP’s router network (the overall architecture is not very different from DSL
in general). From router to router over WAN distances, the Illustrated Network uses a
common form of transport for the Internet in the United States: SONET.
THE EVOLUTION OF SONET
SONET is the North American version of the international SDH standard and defines
a hierarchy of fast transports delivered on fiber-optic cable. One of the most exciting
aspects of SONET when it first appeared around 1990 was the ability to deploy SONET
links in self-healing rings, which nearly made outages a thing of the past. (The vast
majority of link failures today involve signal “backhoe fade,” a euphemism for accidental
cable dig-ups.)
Before networks composed almost entirely of fiber-optic cables came along, network errors were a high-priority problem. Protocols such as IP and TCP had extensive
error-detection and error-correction (the two are distinct) methods built into their
operation, methods that are now quietly considered almost a hindrance in modern
networks.
Now, SONET rings do not inherently protect against the common problem of a lack
of equipment or route diversity, but at least it’s possible. Not all SONET links are on
rings, of course. The links on the Illustrated Network are strictly point-to-point.
A Note about Network Errors
Before SONET, almost all WAN links used to link routers were supplied by a telephone
company that subscribed to the Bell System standards and practices, even if the phone
company was not part of the sprawling AT&T Bell System. In 1984, the Bell System
engineering manual named a bit error rate (BER) of 10⁻⁵ (one error in 100,000 bits
sent) as the target for dial-up connections, and put leased lines (because they could be
“tuned” through predictable equipment) at 10 times better, or 10⁻⁶ (one error in every
1,000,000 bits).
SONET/SDH fiber links typically have BERs of 1000 (10³) to 1 million (10⁶) times
better than those common in 1984. Since 1000 days is about 3 years, converting a copper link to fiber meant that all the errors seen yesterday are now spread out over the
next 3 years (a BER of 10⁻⁹) to 3000 years (a BER of 10⁻¹²). LAN error rates, always much lower
than those of WANs due to shorter spans and less environmental damage, are in about
the same range. Most errors today occur on the modest-length (a kilometer or mile)
access links between LAN and WAN to ISP points of presence, and most of those errors
are due to intermittently failing or faulty connectors.
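A back-of-the-envelope calculation makes the scale of these BERs easier to feel. The link speed below is an arbitrary T1-style assumption, not a figure from the text:

# How often does a fully loaded 1.544-Mbps link see a bit error at different BERs?
SECONDS_PER_DAY = 86_400
link_bps = 1_544_000

for ber in (1e-5, 1e-6, 1e-9, 1e-12):
    errored_bits_per_day = link_bps * SECONDS_PER_DAY * ber
    if errored_bits_per_day >= 1:
        print(f"BER {ber:.0e}: about {errored_bits_per_day:,.0f} errored bits per day")
    else:
        print(f"BER {ber:.0e}: roughly one errored bit every {1 / errored_bits_per_day:,.0f} days")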
The only real alternatives to SONET/SDH for high-speed WAN links are newer versions of Ethernet, especially in a metropolitan Ethernet context. The megabit-speed
T1 (1.544 Mbps) or E1 (2.048 Mbps) links are used for the local loop. However, even
those copper-based circuits are usually serviced by newer technologies and carried
over SONET/SDH fiber on the backbone.
How are IP packets carried inside SONET frames? The standard method is called
Packet over SONET/SDH (POS). The procedures used in POS are defined in three RFCs:
■ RFC 1619, PPP over SONET/SDH
■ RFC 1661, the PPP
■ RFC 1662, PPP in HDLC-like framing
Packet over SONET/SDH
SONET/SDH frames are not just a substitute for Ethernet or PPP frames. SONET/SDH
frames, like T1 and E1 frames, carry unstructured bit information, such as digitized
voice telephone calls, and are not usually suitable for direct packet encapsulation. In
the case of IP, the packets are placed inside a PPP frame (technically, a type of High-Level Data Link Control [“HDLC-like”] PPP frame in which some header fields that are allowed to vary in HDLC are fixed for IP packet payloads). The PPP frame, delimited by a stream of
special 0x7E interframe fill (or “idle” pattern) bits, is then placed into the payload area
of the SONET/SDH frame.
Figure 3.10 shows a series of PPP frames inside a SONET frame running at 51.84
Mbps. Although SONET (and SDH) frames are always shown as two-dimensional arrays
SONET frame payload areas (each frame preceded by SONET frame overhead):

SONET Frame 1:  7E 7E 7E ... 7E | PPP Hdr | IP packet | IP trailer | 7E 7E ... 7E
                7E 7E 7E | PPP Hdr | IP packet (continues) ...
SONET Frame 2:  ... (continued IP packet) | IP trailer | 7E 7E ... 7E
                7E 7E 7E | PPP Hdr | IP packet (continues) ...
SONET Frame 3:  ... (continued IP packet) | IP trailer | 7E 7E ... 7E
                7E 7E 7E | PPP Hdr | IP packet | IP trailer | 7E 7E ... 7E
                7E 7E 7E | PPP Hdr | IP packet (continues) ...
SONET Frame 4:  ... (continued IP packet) | IP trailer | 7E 7E ... 7E
                7E 7E 7E | PPP Hdr | IP packet ... | IP trailer | 7E 7E
FIGURE 3.10
Packet over SONET, showing how the idle pattern of 0x7E surrounds the PPP frames with IP
packets inside.
of bits, the figure is not very accurate. It doesn’t show any of the SONET framing bytes,
and IP packets are routinely set to around 1500 bytes long, so they would easily fill an
entire 774-byte, basic SONET transmission-frame payload area. Even the typical network
default maximum IP packet size of 576 bytes is quite large compared to the SONET
payload area. However, many packets are not that large, especially acknowledgments.
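For the curious, here is a rough Python sketch of the HDLC-like framing idea: PPP frames separated by runs of 0x7E idle bytes, with any 0x7E or 0x7D inside a frame escaped so it cannot be mistaken for a delimiter. SONET overhead and payload scrambling are ignored.

FLAG, ESCAPE = 0x7E, 0x7D

def stuff(frame: bytes) -> bytes:
    """Escape flag and escape bytes inside a frame (RFC 1662 byte stuffing)."""
    out = bytearray()
    for byte in frame:
        if byte in (FLAG, ESCAPE):
            out += bytes([ESCAPE, byte ^ 0x20])   # escape, then flip bit 5
        else:
            out.append(byte)
    return bytes(out)

def fill_payload(frames, idle_run=4):
    """Lay PPP frames into a byte stream with 0x7E idle fill between them."""
    idle = bytes([FLAG]) * idle_run
    stream = idle
    for frame in frames:
        stream += stuff(frame) + idle
    return stream

stream = fill_payload([b"\x7d\x7ePPP frame one", b"PPP frame two"])
print(stream.hex(" "))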
One other form of transport used on the Illustrated Network is common on IP networks today. Wireless links might some day be more common than anything else.
WIRELESS LANS AND IEEE 802.11
Wireless technologies are the fastest-growing form of link layer for IP packets, whether
for cell phones or home office LANs. Cell phone packets are a bit of a challenge, and
wireless LANs are evolving rapidly, but this section will focus on wireless LANs, if only
because wireless LANs are such a good fit with Ethernet. This section will be a little
longer than the others, only because the latest wireless LANs are newer than the previous methods discussed.
The basic components of the IEEE 802.11 wireless LAN architecture are the wireless stations, such as a laptop, and the access point (AP). The AP is not strictly necessary,
and a cluster of wireless stations can communicate directly with each other without
an AP. This is called an IEEE 802.11 independent, basic service set (IBSS) or ad hoc
network. One or more wireless stations form a basic service set (BSS), but if there is
only one wireless station in the BSS, an AP is necessary to allow the wireless station to
communicate. An AP has both wired and wireless connections, allowing it to be the
access “point” between the wireless station and the world. In a typical home wireless
network, one BSS supports up to four wireless devices (an arbitrarily low limit), and
the AP is bundled with the DSL router or cable modem with the high-speed link for
Internet access. (The DSL router or cable modem can have multiple wired connections
as well.) In practice, the number of systems you can connect to a given type of AP
depends on your performance needs and the traffic mix.
A wireless LAN can have multiple APs, and this arrangement is sometimes called
an infrastructure wireless LAN. This type of LAN has more than one BSS, because each
AP establishes its own BSS. This is called an extended service set (ESS), and the APs are
often wired together with an Ethernet LAN or an Ethernet hub or switch. The three
major types of IEEE 802.11 wireless LANs—ad hoc (IBSS), BSS, and ESS—are shown in
Figure 3.11.
Wi-Fi
An intended interoperable version of the IEEE 802.11 architecture is known as Wi-Fi,
a trademark and brand of the Wi-Fi Alliance. It allows users with properly equipped
wireless laptops to attach to APs maintained by a service provider in restaurants, bookstores, libraries, and other locations, usually to access the Internet. In some places, especially downtown urban areas, a wireless station can receive a strong signal from two or
[Figure 3.11 artwork: three panels showing (1) a BSS without an AP (an ad hoc network), in which the wireless laptop stations communicate directly; (2) a BSS with an AP connecting its wireless laptop stations to the Internet; and (3) an ESS, in which the access points for BSS 1 and BSS 2 are wired together and share an Internet connection.]
FIGURE 3.11
Wireless LAN architectures. Most home networks are built around an access point built into a
DSL router/gateway.
more APs. While a wireless station can belong to more than one BSS through its AP at
the same time, this is not helpful when the APs are offering different network addresses
(and perhaps prices for attachment). This collection of Wi-Fi networks is sometimes
called the “Wi-Fi jungle,” and will only become worse as wireless services turn up more
and more often in parks, apartment buildings, offices, and so on. How do APs and wireless stations sort themselves out in the Wi-Fi jungle?
If there are APs present, each wireless station in IEEE 802.11 needs to associate
with an AP before it can send or receive frames. For Internet access, the 802.11 frames
contain IP packets, of course. The network administrator for every AP assigns a Service
Set Identifier (SSID) to the AP, as well as the channels (frequency ranges) that are associated with the AP. The AP has a MAC layer address as well, often called the BSSID.
The AP is required to periodically send out beacon frames, each including the
AP’s SSID and MAC layer address (BSSID), on its wireless channels. These channels are
scanned by the wireless station. Some channels might overlap between multiple APs,
because the “jungle” has no central control, but (hopefully) there are other channels
that do not. In practice, interference between overlapping APs is not a huge problem
in the absence of a high volume of traffic. When you “view available networks” in
Windows XP, the display is a list of the SSIDs of all APs in range. To get Internet access,
you need to associate your wireless station with one of these APs.
After selecting an AP by SSID, the wireless host uses the 802.11 association protocol
to join the AP’s subnet. The wireless station then uses DHCP to get an IP address, and
becomes part of the Internet through the AP.
If the wireless Internet access is not free, or the wireless LAN is intended for
restricted use (e.g., tenants in a particular building), the wireless station might have
to authenticate itself to the AP. If the pool of users is small and known, the host’s MAC
address can be used for this purpose, and only certain MAC addresses will receive IP
addresses.
Once the user is on the wireless network, many hotels use the captive portal form of
authentication. The captive portal technique forces the user with a Web browser (HTTP client) to see a special Web page before being granted normal Internet access. The
captive portal intercepts all packets regardless of address or port, until the browser is
used as a form of authentication device. Once the acceptable use terms are viewed or
the payment rates are accepted and arranged, “normal” Internet access is granted for
a fixed period of time. It should be noted that captive portals can be used to control
wired access as well, and many places (hotel rooms, business centers) use them in this
fashion. In many cases, the normal device “firewall” capabilities must be turned off or
configured to allow the captive portal Web page to appear.
Another post-access approach employs usernames and passwords—these are popular at coffee shops and other retail establishments. In both cases, there is usually a central
authentication server used by many APs, and the wireless host communicates with this
server using either RADIUS (RFC 2138) or DIAMETER (RFC 3588). Once authenticated,
the users’ traffic is commonly encrypted to preserve privacy over the airwaves, where
signals can usually be picked up easily and without the knowledge of end users.
When accessing the office remotely, even if captive portal or some other method is
used, most organizations add something, from secure tunneling based on PPTP (Microsoft’s
Point-to-Point Tunneling Protocol) or PPPoE to proprietary VPN client software. We’ve
already mentioned PPPoE, and PPTP with VPNs will be explored later in this book.
IEEE 802.11 MAC Layer Protocol
IEEE 802.11 defines two MAC sublayers: the distributed coordination function (DCF)
and the point coordination function (PCF). The PCF MAC is optional and runs on top
of the DCF MAC, which is mandatory. PCF is used with APs and is very complex, while
DCF is simpler and uses a venerable access method known as carrier sense multiple
access with collision avoidance (CSMA/CA). Note that while Ethernet LANs detect
collisions between stations sending at the same time with CSMA/CD, wireless LANs
avoid collisions. Collision detection is not appropriate for wireless LANs for a number
of reasons, the most important being the hidden terminal problem.
To understand the hidden terminal problem, consider the two wireless laptops and
AP shown in Figure 3.12. (The problem does not only occur with an AP, but the figure shows this situation.)

FIGURE 3.12
Hidden terminals on wireless LANs: wireless laptops L1 and L2 are both in range of the access point, but not of each other. This can be a problem in larger home networks, and special “LAN extender” devices can be used to prevent the problem.

Both laptops are within range of the AP, but not of each other
(there are many reasons for this, from distance to signal fading). Obviously, if L1 is sending a frame to the AP, L2 could also start sending a frame, because the carrier sensing
shows the network as “clear.” However, a collision occurs at the AP and both frames
have errors, although both L1 and L2 think their frames were sent just fine.
Now, the AP clearly knows what’s going on. It just needs a way to tell the wireless
stations when it’s okay to send (or not). CSMA/CA can use an optional method known
as request to send (RTS) and clear to send (CTS) to avoid these types of undetected
collisions. When a sender wants to send a data frame, it must first reserve the channel
by sending a short RTS frame to the AP, telling the AP how long it will take to send the
data, and receive an acknowledgement frame (ACK) that all went well. If the sender
receives a short CTS control frame back, then it can send. Other stations hear the CTS
as well, and refrain from sending during this time period.
The way that RTS/CTS works for sending data to an access point is shown in
Figure 3.13.
There are two time notations in the figure: DIFS and SIFS. The distributed interframe space (DIFS) is the amount of time a wireless station waits to send after sensing
that the channel is clear. The station waits a bit “just in case” because wireless LANs,
unlike Ethernet, do not detect collisions and cease sending, so collisions are very debilitating and must be avoided at all costs. The short inter-frame spacing (SIFS) is also
used between frames for collision avoidance. There is also a duration timer in all 802.11
frames, measured in microseconds, that tells the other stations how long it will take to
send the frame and receive a reply. Stations avoid link access during this time period.
While RTS/CTS does reduce collisions, it also adds delay and reduces the available
bandwidth on a channel. In practice, each wireless station sets an RTS threshold so that
CTS/RTS is used only when the frame is longer than this value. Many wireless stations
set the threshold so high that the value is larger than the maximum frame length, and
the RTS/CTS is skipped for all data.
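As a rough illustration of this threshold logic (the function name is hypothetical, and the 2347-byte default is only a value commonly used in drivers to disable the exchange in practice), a station's decision might be sketched in Python like this:

def needs_rts(frame_length: int, rts_threshold: int = 2347) -> bool:
    """Reserve the channel with RTS/CTS only for frames longer than the threshold."""
    return frame_length > rts_threshold

print(needs_rts(1500))        # False: a typical data frame skips the RTS/CTS exchange
print(needs_rts(1500, 500))   # True: lowering the threshold forces the reservation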
FIGURE 3.13
RTS and CTS in wireless LANs showing how all other nodes must defer access to the medium: after a DIFS, the source sends an RTS; a SIFS later the access point (the destination) returns a CTS; after another SIFS the source sends its DATA, and after a final SIFS it receives the ACK. All other nodes defer access for the duration of the reservation. The CTS is heard by all other nodes, although this is not detailed in the figure.
FIGURE 3.14
IEEE 802.11 frame structure: Frame Control (2 bytes), Duration (2 bytes), Address 1 (6 bytes), Address 2 (6 bytes), Address 3 (6 bytes), Sequence Control (2 bytes), Address 4 (6 bytes), Payload (0–2312 bytes), and FCS (4 bytes). Note the potential number of address fields (four) in contrast to the two used in Ethernet II frames.
The IEEE 802.11 Frame
Although the IEEE 802.11 frame shares a lot with the Ethernet frame (which is one reason some packet sniffers can parse wireless frames as if they were Ethernet), there are
a number of unique fields in 802.11. There are nine main fields, and the frame control
(FC) field itself contains 10 subfields. The nine major fields of the IEEE 802.11 MAC frame are shown
in Figure 3.14. The only fields in the two FC bytes that we will talk about are the From
DS and To DS fields. (In some cases, the first three fields of the 802.11 MAC frame, the
version, type, and subtype, are presented separately from the frame control flags, which
are all bits.)
Frame control (FC)—This field is 2 bytes long and contains, among other things,
two important flag bits: To DS (distribution system) and From DS.
Duration—This 2-byte field gives the duration of the transmission in all frame types
except one. In one control frame, this field carries the ID of the frame instead.
Addresses—There are four possible address fields, each 6 bytes long and structured the same as Ethernet MAC addresses. The fourth field is only present
when multiple APs are in use in an ESS. The meaning of each address field
depends on the value of the DS flags in the FC field, discussed later.
Sequence control—This 2-byte field gives the sequence number of the frame and
is used in flow control.
Payload—This field can be from 0 to 2312 bytes long. Usually it is fewer than
1500 bytes and holds an IP packet, but there are other types of payloads. The
precise type and subtype of the content is determined by the content of the
FC field.
CRC—The frame cyclical redundancy check is a 4-byte CRC-32, used to determine
the nature of the acknowledgement sent.
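To make the field layout concrete, here is a minimal Python sketch (an illustrative helper, not part of any driver or library) that unpacks the fixed three-address form of the header described above; Address 4, the payload, and the FCS follow when present. Multi-byte 802.11 header fields are transmitted least significant byte first, hence the little-endian format string.

import struct

# FC (2), Duration (2), Address 1-3 (6 each), Sequence Control (2) = 24 bytes
MAC_HEADER = struct.Struct("<HH6s6s6sH")

def parse_basic_header(frame: bytes) -> dict:
    """Unpack the fixed part of a three-address 802.11 MAC header."""
    fc, duration, a1, a2, a3, seq = MAC_HEADER.unpack_from(frame)
    return {
        "frame_control": fc,
        "duration": duration,
        "address1": a1.hex(":"),
        "address2": a2.hex(":"),
        "address3": a3.hex(":"),
        "sequence_control": seq,
    }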
Why does the wireless frame need to define four address fields? Mainly because the
arrangements of wireless stations can be complicated. Is there an AP in the BSS? Is there
more than one AP? What type of frame is being sent? Data? Control? Management? The
number of address fields present, and what they represent, depend on the answers to
these questions.
How do receivers know exactly how many addresses are used and what they represent? That’s where the two DS flags in the FC field come in. The meaning of the address
fields (and possible presence of the Address 4 field) depends on the values of these two
bits. Actually, there are five types of MAC addresses used in wireless LANs:
BSSID—This is usually the MAC address of the AP, but it is generated randomly in
an IBSS or ad hoc network.
Transmitter Address (TA)—The TA is the MAC address of the individual station
that has just sent the frame.
Receiver Address (RA)—The RA is the MAC address of the immediate receiver of
the frame. This can be a group or broadcast address.
Source Address (SA)—The SA is the MAC address of the individual station that
originated the frame. Due to the possible role played by the AP, the SA is not
necessarily the same as the TA.
Destination Address (DA)—The DA is the MAC address of the final destination of
the frame, and can also be a group or broadcast as well as an individual station.
Again, due to the AP(s), this address might not match the RA.
Table 3.3 DS Bits and Wireless LAN Data Frame Address Fields

Type of Network     From DS  To DS  Address 1     Address 2  Address 3  Address 4
Ad hoc (IBSS)       0        0      DA (= RA)     SA         BSSID      N/A
To AP               0        1      RA (= BSSID)  SA         DA         N/A
From AP             1        0      DA (= RA)     BSSID      SA         N/A
ESS (multiple APs)  1        1      RA            TA         DA         SA
The interplay among these address types and the meaning of the two DS flags for
data frames is shown in Table 3.3.
A look back at Figures 3.6 and 3.7 will show that these address patterns are reflected
in the screen captures. The last two bits of the frame control flags are the DS bits,
which are 01 (To AP) and 10 (From AP), respectively. The Proxima AP is passing the
frame between the Cisco and Farallon wireless stations.
The Address 4 field appears only when there are multiple APs. Usually, data frames
in a simple BSS with AP use DS bit combinations 01 and 10 to make their way through
the AP from one wireless station to another.
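A small lookup keyed on the two DS bits captures the rule in Table 3.3; this is just an illustrative Python sketch, with the field meanings copied from the table.

# (To DS, From DS) -> meaning of Address 1..4 in a data frame (None = field absent)
ADDRESS_FIELDS = {
    (0, 0): ("DA (= RA)", "SA", "BSSID", None),      # ad hoc (IBSS)
    (1, 0): ("RA (= BSSID)", "SA", "DA", None),      # station to AP
    (0, 1): ("DA (= RA)", "BSSID", "SA", None),      # AP to station
    (1, 1): ("RA", "TA", "DA", "SA"),                # between APs in an ESS
}

print(ADDRESS_FIELDS[(1, 0)])   # a frame headed to the AP: Address 1 is the BSSID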
QUESTIONS FOR READERS
Figure 3.15 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 3.15
IP packets are carried in many different types of frames, and some of those frames are tucked inside lower level transmission frames. Clients and servers on LAN 1 and LAN 2 exchange Ethernet II frames carrying IP packets through hubs and routers; between routers, the IP packets travel as IP packets over SONET (POS) on SONET/SDH (with added frame overhead); fiber to the node (FTTN) carries IP packets inside DSL frames toward the home; and a home wireless AP carries IP packets inside 802.11 frames.
1. Both LAN1 and LAN2 use Ethernet II frames. What would happen if frame types
on the two LANs were different?
2. SONET/SDH still has its own overhead bytes when IP packets are carried inside
the SONET/SDH frames. Why is the SONET/SDH overhead still necessary?
3. What is the captive portal method of wireless access permission and how does
it work?
4. Ethernet LANs can extend to metropolitan area distances and perhaps beyond.
If Metro Ethernet evolved to remove all distance limits, what are the advantages
and disadvantages of always using Ethernet frames for IP packets?
5. Why are more than two addresses used in wireless frames in some cases? Which
cases require more than two addresses?
PART II
Core Protocols
All hosts attached to the Internet run certain core protocols to enable their
applications to function properly. This part of the book examines these
protocols and shows how the router forms the glue that holds the Internet
together.
■ Chapter 4—IPv4 and IPv6 Addressing
■ Chapter 5—Address Resolution Protocol
■ Chapter 6—IPv4 and IPv6 Headers
■ Chapter 7—Internet Control Message Protocol
■ Chapter 8—Routing
■ Chapter 9—Forwarding IP Packets
■ Chapter 10—User Datagram Protocol
■ Chapter 11—Transmission Control Protocol
■ Chapter 12—Multiplexing and Sockets
CHAPTER 4
IPv4 and IPv6 Addressing
What You Will Learn
In this chapter, you will learn about the addressing used in IPv4 and IPv6. We’ll
assign addresses of both types to various interfaces on the hosts and routers of the
Illustrated Network. We’ll mention older classful IPv4 addressing and the current
classless system. We will start to explore the differences between IPv4 and IPv6
addressing and why both exist.
You will learn about the important concept of subnetting and supernetting
and other aspects of IP addressing. We’ll detail the IP subnet mask as well.
In many ways, IPv4 and IPv6 are distinct protocols with important differences. Nevertheless, both IPv4 and IPv6 are valid IP layer addresses, some networks use both IPv4
and IPv6, and the packet data content is the same in both. Network engineers often
deal with both every day, and we will too. In the future, the importance of IPv6 will
only grow.
IPv4 addressing was fairly straightforward to understand before the Internet
exploded all over the world. Then the original (“classful”) rules for assigning networks
IPv4 addresses didn’t work as well, and routers were getting overwhelmed by the size
and resources needed to maintain routing and forwarding tables.
This chapter investigates both IPv4 and IPv6 addressing, and the host and router
interfaces on the Illustrated Network have both IPv4 and IPv6 addresses (see
Figure 4.1). We’ll assign these addresses manually in this chapter.
We’ll start the discussion by describing the classless interdomain routing (CIDR) rules
created so that we did not run out of IPv4 addresses in 1994, shortly after the Web
exploded onto the scene. Then we’ll describe the older classful system, and, finally,
we’ll talk about IPv6 addressing. This chapter also explores important aspects of IP
addressing: subnetting and supernetting.
FIGURE 4.1
The Illustrated Network IP addressing, showing the interfaces on the LANs and customer-edge routers that we will be working with. Note that in most cases, all of the network interfaces will have both IPv4 and IPv6 addresses. On LAN1 (Los Angeles office), bsdclient (em0: 10.10.11.177), lnxserver (eth0: 10.10.11.66), wincli1 (LAN2: 10.10.11.51), and winsvr1 (LAN2: 10.10.11.111) sit behind customer-edge router CE0 (fe-1/3/0: 10.10.11.1, lo0: 192.168.0.1), which connects to Ace ISP (routers PE5, P4, and P9 in AS 65459). On LAN2 (New York office), bsdserver (eth0: 10.10.12.77), lnxclient (eth0: 10.10.12.166), winsvr2 (LAN2: 10.10.12.52), and wincli2 (LAN2: 10.10.12.222) sit behind customer-edge router CE6 (fe-1/3/0: 10.10.12.1, lo0: 192.168.6.1), which connects to Best ISP (routers PE1, P2, and P7 in AS 65127) and the global public Internet. Each LAN interface also has a link-local IPv6 address derived from its MAC address. Solid rules = SONET/SDH; dashed rules = Gigabit Ethernet. All provider links use 10.0.x.y addressing; only the last two octets are shown in the figure.
IP ADDRESSING
In Chapter 2 we worked a lot with the Linux and Windows clients and servers. Let’s
start with our FreeBSD hosts and routers to look at IPv4 and IPv6 addresses on the
device’s interfaces.
Figure 4.1 shows through shading the portion of the network we’ll be working
with in this chapter. All of the ISP routers have IP addresses, of course, both IPv4 and
IPv6, but we’ll only look at the addressing of the customer routers. Although it can be
important, we won’t worry about the addressing used internally by service providers.
The things that can go wrong there are far beyond this introductory discussion.
When the Illustrated Network was first configured, we manually assigned an IPv4
address to the bsdserver Ethernet interface (em0) with ifconfig. The only tricky part
was translating the prefix length used on our network (/24) to a decimal network mask
for this host (this was done only to show this common method). We could have used
10.10.12.77/24 as well, or even hex (0xffffff00). We’ll talk about prefix lengths and
network masks later on in this chapter. The ifconfig command generates no output,
but we can look at the result using ifconfig without any parameters.
bsdserver# ifconfig em0 inet 10.10.12.77 netmask 255.255.255.0
bsdserver# ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=3<RXCSUM,TXCSUM>
inet6 fe80::20e:cff:fe3b:8732%em0 prefixlen 64 scopeid 0x1
inet 10.10.12.77 netmask 0xffffff00 broadcast 10.10.12.255
ether 00:0e:0c:3b:87:32
media: Ethernet autoselect (100baseTX <full-duplex>)
status: active
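As an aside, the three equivalent mask notations mentioned above (/24, 255.255.255.0, and 0xffffff00) can be checked with Python's standard ipaddress module; this is only a convenience sketch, not something the hosts themselves do.

import ipaddress

iface = ipaddress.ip_interface("10.10.12.77/24")   # the bsdserver address above

print(iface.netmask)                      # 255.255.255.0
print(hex(int(iface.netmask)))            # 0xffffff00, the form ifconfig displays
print(iface.network)                      # 10.10.12.0/24
print(iface.network.broadcast_address)    # 10.10.12.255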
Automatic IP Addressing
This chapter assigns IPv4 and IPv6 addresses manually on each device. This is still
done, but it is more common by far to assign IP addresses automatically with the
Dynamic Host Configuration Protocol, or DHCP. Routers can use DHCP as well.
We’ll look at DHCP in a later chapter.
The interface flags are interpreted on the first line of the output. Interface em0 is up
and running, and can send or receive, but not at the same time (simplex). It can send
and receive broadcasts and multicast, and has a Maximum Transmission Unit (MTU)
of 1500 bytes (a normal Ethernet frame). If a packet is queued for output and is too
large for this 1500-byte frame, then the packet content must be fragmented into multiple packets, each sent in its own frame. We’ll talk about fragmentation in detail in a later
chapter. The option line says that the frame check sequence is generated when transmitting and checked when receiving.
Note that we got an IPv6 address (the inet6 line) as well. This is called the link-local (fe80) IPv6 address. It is based on the MAC address and generated automatically, with a prefix length (prefixlen) of /64. Newer versions of FreeBSD function
this way, as long as the local router is properly configured to run IPv6. You can use
the ifconfig command with the inet6 option to assign a specific IPv6 address to the
interface. (There’s a lot more to IPv6 addressing, such as router-assigned prefixes, but
we’re keeping it very basic here.)
The next line lists the IPv4 address, netmask, and the address used as an IP broadcast address to send packets to every device on the network. The MAC address has a
line all its own, followed by the type of media: 100-Mbps, twisted-pair Ethernet, capable
of sending and receiving (full-duplex) at the same time (but em0 will not do that). The
interface is active as well as up, which means that it is sending and receiving bits.
Linux uses slightly different syntax to assign IPv4 addresses to interfaces. Let’s assign
an IPv4 address to the lnxclient Ethernet interface (eth0) using ifconfig. In this case,
the network mask format is easier to read. We’ll look at the interface before the address
is assigned, and then after, and find something very different from FreeBSD with regard
to the network broadcast address.
[root@lnxclient admin]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:B0:D0:45:34:64
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:43993 errors:0 dropped:0 overruns:1 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:7491082 (7.1 Mb)  TX bytes:0 (0.0 b)
          Interrupt:5 Base address:0xec00
[root@lnxclient admin]# ifconfig eth0 10.10.12.166 netmask 255.255.255.0
[root@lnxclient admin]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:B0:D0:45:34:64
          inet addr:10.10.12.166  Bcast:10.255.255.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:44000 errors:0 dropped:0 overruns:1 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:7492614 (7.1 Mb)  TX bytes:0 (0.0 b)
          Interrupt:5 Base address:0xec00
This output gives much the same information as FreeBSD, but provides more details
for traffic statistics and error conditions. The last line of output gives details about how
the interface card communicates with the operating system and has nothing directly
to do with the network. Note that no automatic IPv6 addresses are generated. All versions of the Linux kernel newer than 2.2, regardless of distribution, now support ways
to give an interface an IPv6 address, but we will not do that.
However, Linux has also done something very odd with the broadcast address. We’ll
talk more about broadcast address formats later in this chapter, but it is supposed to be
formed by setting all of the host bits that follow the network bits in the IP address to 1.
Now, we set a network mask of 24 bits (255.255.255.0), but Linux has built the broadcast address as if the address were classful, setting the last 24 bits of the IPv4 address to a string of 1 bits, or 10.255.255.255. As we saw with FreeBSD, the correct broadcast address for this network mask should be 10.10.12.255.
This means, as we’ll soon discover, that this older version of Linux expects classful
IPv4 addresses, and today we mostly use classless IPv4 addresses. (There was some
debate as to whether this was a “broken” version or install, but the behavior is consistent and all else seems well.)
To fix the broadcast address so that the network functions properly (yes, it matters), we’ll have to specify a broadcast address for lnxclient (and do the same for
lnxserver).
[root@lnxclient admin]# ifconfig eth0 broadcast 10.10.12.255
[root@lnxclient admin]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:B0:D0:45:34:64
          inet addr:10.10.12.166  Bcast:10.10.12.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:44000 errors:0 dropped:0 overruns:1 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:7492614 (7.1 Mb)  TX bytes:0 (0.0 b)
          Interrupt:5 Base address:0xec00
Let’s move on to the Windows devices. In Windows, IPv4 and IPv6 address assignment can be awkward. In Windows XP, you typically use the graphical interface to assign
IPv4 addresses, subnet masks, and default gateways. The method is well-documented
in many places and need not be detailed here. You can easily view the current IP
addresses by running the Windows ipconfig command. Here’s the result on wincli2.
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Documents and Settings\Owner>ipconfig

Windows IP Configuration

Ethernet adapter Local Area Connection:

        Connection-specific DNS Suffix  . :
        IP Address. . . . . . . . . . . . : 10.10.12.222
        Subnet Mask . . . . . . . . . . . : 255.255.255.0
        Default Gateway . . . . . . . . . : 10.10.12.1
Unlike the Unix-based output, Windows XP associates a default gateway with the
interface. This information is properly part of the host’s routing and forwarding table, and we’ll talk more about default gateways in a later chapter on routing.
How can we give the LAN interface an IPv6 address? In XP, the graphical version
depends on the service packs installed. The easiest way is to use the command prompt
to first install the IPv6 protocol stack as a dual stack on the host. XP can generate
a series of IPv6 addresses automatically as well (you can also set them manually). It
should be noted that in Vista, IPv6 is typically turned on by default.
C:\Documents and Settings\Owner>ipv6 install
Installing. . .
Succeeded.
C:\Documents and Settings\Owner>
Once IPv6 support is available, the output of the ipconfig command shows some very interesting things.

C:\Documents and Settings\Owner>ipconfig

Windows IP Configuration

Ethernet adapter Local Area Connection:

        Connection-specific DNS Suffix  . :
        IP Address. . . . . . . . . . . . : 10.10.12.222
        Subnet Mask . . . . . . . . . . . : 255.255.255.0
        IP Address. . . . . . . . . . . . : fe80::202:b3ff:fe27:fa8c%4
        Default Gateway . . . . . . . . . : 10.10.12.1

Tunnel adapter Automatic Tunneling Pseudo-Interface:

        Connection-specific DNS Suffix  . :
        IP Address. . . . . . . . . . . . : fe80::5efe:10.10.12.222%2
        Default Gateway . . . . . . . . . :
Not only has the IPv6 installation created an IPv6 address for the LAN interface, it is a
link-local address based on the MAC address of the interface (see Chapter 3). The “%”
number is just an index for the order in which certain types of IPv6 addresses were
generated by the IPv6 installation.
On working networks, more than just the automatic tunnel IPv6 address is usually
created. It is not unusual to see a Tunnel adapter Teredo Tunneling Pseudo-Interface.
Teredo is a Microsoft initiative, defined in RFC 4380, that allows devices to reach the
IPv6 Internet from behind a network address translation (NAT) device. There is often
a Tunnel adapter 6to4 Tunneling Pseudo-Interface as well, depending on how the
routers are configured. A full discussion of these Windows IPv6 interfaces is beyond the
scope of this book, but we’ll discuss IPv6 tunneling in more detail in Chapter 9.
The customer edge routers are Juniper Networks routers. The configuration files on
these routers look very different from those on a Cisco router. Juniper Networks router
configurations are more like C language programs and are organized with braces in
indented stanzas. However, Juniper Networks router configurations can be rendered
in “set” language that looks more like Cisco’s style. For example, on router CE0, the
addressing on interface fe-1/3/0 is more complex than on a host:
admin@CE0> show interface fe-1/3/0
unit 0 {
    family inet {
        address 10.10.11.1/24;
    }
    family inet6 {
        address FC00:ffb3:d5:b:205:85ff:fe88:ccdb/64;
    }
}
admin@CE0>
In this format, all statements configured under another statement (indented) apply
to that higher level statement. Thus, both family inet and family inet6 apply to
unit 0, but only the address 10.10.11.1/24 applies to family inet. The form is used
often in this book, and becomes more familiar with repetition.
This form can also be shown in the following more compact format, which is the
style we will use in this book:
admin@CE0> set interface fe-1/3/0 unit 0 family inet address 10.10.11.1/24;
admin@CE0> set interface fe-1/3/0 unit 0 family inet6 address FC00:ffb3:d5:b:205:85ff:fe88:ccdb/64;
This output is for logical unit 0, the simplest case. Juniper Networks router interfaces
can have logical units numbered from 0 to 65535, and each can have more than one
IPv4 or IPv6 address. The LAN interface on CE6 looks very much the same, except for
the address specifics.
We’ll talk about the specifics of the IPv4 and IPv6 address formats, network masks,
and prefix lengths, and other topics, in the rest of this chapter. At the end, we’ll see just
what the complex IPv6 address format is telling us about the Illustrated Network.
One type of address we won’t be exploring in this chapter is the anycast address.
To understand anycast addresses, consider that there are three major types of IP
addresses.
Unicast—This type of IP address is used to identify a single network interface.
It establishes a one-to-one relationship between the network address and
network endpoint (interface). So each unicast address uniquely identifies a
network source or destination.
Broadcast/Multicast—This type of IP address is used to identify a changeable
group of interfaces. Broadcast addresses are used to send a message to every
reachable interface, and broadcast domains are typically defined physically.
Multicast addresses are not limited to a single domain and multicast groups
are established logically. IPv6 relies on multicast addresses for many of the
discovery features of IPv6 and things that are done with broadcasts in IPv4.
In both multicast and broadcast, there is a one-to-many association between
network address and network endpoints. Consequently, one address identifies
a group of network endpoints, and information is replicated by routers to
reach them all.
Anycast—This type of IP address, formally defined in IPv6, is used to identify a
defined set of interfaces, usually on different devices. Anycast addresses are
used to deliver packets to the “nearest” interface, where nearness is defined
as a routing parameter. The same can be done in IPv4, but not as elegantly.
However, multicasts deliver to many interface destinations, while anycasts
deliver to only one, although many might be reachable. Anycasts are useful for
redundancy purposes, so servers can exist around the world, all with the same
address, but traffic is only sent to the one that is the “closest” to the source.
This book uses mainly unicast IP addresses. Multicast and anycast addresses will be
introduced and used as necessary.
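Unicast, multicast, and loopback addresses can be told apart from the address bits alone, and Python's ipaddress module exposes this; anycast addresses cannot be recognized this way, since they are drawn from the unicast space and differ only in how routing treats them. A short sketch:

import ipaddress

for text in ("10.10.11.177", "224.0.0.5", "127.0.0.1", "ff02::1"):
    addr = ipaddress.ip_address(text)
    # prints whether each address is multicast and whether it is loopback
    print(text, addr.is_multicast, addr.is_loopback)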
THE NETWORK/HOST BOUNDARY
We just saw that the mask determines where the boundary between the network
and host portions of the IP address lies. This boundary is important: If it is set too far
to the right, there are lots of networks, but none of them can have many hosts. If it
is set too far to the left, then there are plenty of hosts allowed, but fewer networks
overall.
In IP, the address boundary is moveable, and always has been. But in the past, right
through the big Internet explosion in the mid-1990s, the network/host boundary in
IPv4 could only be in one of three places. This produced lots of networks that were too
small in terms of hosts, and many that were far too large, capable of holding millions
of hosts. Not only that, but there were so many small networks, each of which needed
a separate routing table entry in each and every core Internet router, that the Internet
threatened to drown under its own weight.
In a nutshell, the inability to aggregate Class C blocks drove routing table pressure
and the unsustainable rate of allocation of Class A and Class B addresses. This would
have caused IPv4 exhaustion by 1994 to 1995, as projected in 1990.
So the rules were changed to allow the network/host boundary in IPv4 and IPv6
addresses to be set almost anywhere (there are still some basic rules). When applied
to the former, fixed, IPv4 octet boundaries, if you moved the “natural” boundary
of the mask to the right of its normal position, this was called subnetting and
the address space got smaller. (Actually, even the older “natural” IPv4 addresses
could always be subnetted.) And if you moved the “natural” boundary of the mask
to the left of its normal position, this was called supernetting and the address space
became larger.
In this chapter, we will talk about subnetting and supernetting in detail. Supernetting is more commonly called “aggregation” today, but we’ll call it supernetting in this
chapter just to make the contrast with subnetting explicit. We will also talk about the
current system of rules for hosts and routers concerning the positioning of the boundary between the network and host portion of the IP address, variable-length subnet
masking (VLSM), and classless interdomain routing (CIDR). But first, let’s look at the
IPv4 address in detail.
THE IPV4 ADDRESS
The IPv4 address is a network layer concept and has nothing to do with the addresses
that the data link layer uses, often called the hardware address on LANs. IPv4 addresses
must be mapped to LAN hardware addresses and WAN serial link addresses. However,
there is no real relationship between LAN media access control (MAC) or WAN serial
link addresses in the frame header and the IPv4 addresses used in the packet header,
with the special exception of multicast addresses.
The original IPv4 addressing scheme established in RFC 791 is known as classful
addressing. The 32 bits of the IPv4 address fall into one of several classes based on
the value of the initial bits in the IPv4 address. The major classes used for addresses
were A, B, and C. Class D was (and is) used for IPv4 multicast traffic, and Class E was
“reserved” for experimental purposes. Each class differs in the number of IPv4 address
bits assigned to the network and the host portion of the IP address. This scheme is
shown in Figure 4.2.
Note that with Class A, B, and C, we are referring to the size of the blocks being allocated as well as the region from which they were allocated by IANA. However, Classes
D and E refer to the whole respective region. Multicast addresses, when they were
assigned for applications, for example, were assigned one at a time like (for instance)
port numbers. (We’ll talk about port numbers in a later chapter.) In the rest of this
chapter, references to Classes A, B, and C are concerned with address space sizes and
not locations.
The 4 billion (actually 4,294,967,296) possible IPv4 addresses are split up into five classes. The five classes are not equal in size, and Class A covers a full half of the whole IPv4 address space.
Class     32-bit Address Starts with:   Number of Addresses:     % of Address Space
Class A   0 (0–127)                     2^31 = 2,147,483,648     50
Class B   10 (128–191)                  2^30 = 1,073,741,824     25
Class C   110 (192–223)                 2^29 = 536,870,912       12.5
Class D   1110 (224–239)                2^28 = 268,435,456       6.25
Class E   1111 (240–255)                2^28 = 268,435,456       6.25

FIGURE 4.2
Classful IPv4 addressing, showing the number of addresses possible and percentage of the total
address space for each class. Class D is still the valid IPv4 address range used for multicasting.
Class E addresses are “experimental” and some of them have been
used for that purpose, but they are seldom seen today.
In practice, only the Class D addresses are still used on the Internet in a classful manner. Class D addresses are the IPv4 multicast addresses (224.0.0.0 to 239.255.255.255),
and we’ll talk about those as needed. We will nonetheless talk about classful IPv4
addressing in this book, especially later on in this chapter when subnetting is considered and when mentioning the routing protocol RIPv1. However, the significance of
classful IPv4 addressing is strictly historical. Classful addressing comes up occasionally,
and at least some introduction is necessary.
This chapter, and this book, emphasizes classless IP addresses, the current way of
interpreting the 32-bit IPv4 address space. This scheme assumes that no classes exist
and is how routers on the Internet interpret IPv4 addresses. In classless addressing,
the IPv4 network mask or prefix determines the boundary between the network and
host portion of the IP address instead of the initial IP address bits. On a host, it is still
often called a network mask, because hosts don’t care about classful or classless, but it
is called a prefix on a router.
Hosts really don’t deal with the differences between classful and classless IP
addresses. Routers, on the other hand, must. Because this book deals with networks
as a whole, including routers, some understanding of both classful and classless IPv4
addressing is beneficial.
Dotted Decimal
IPv4 addresses are most often written in dotted decimal notation. In this format,
each 8-bit byte in the 32-bit IPv4 address is converted from binary or hexadecimal to a decimal number between 0 (0000 0000 or 0x00) and 255 (1111 1111 or
0xFF). The numbers are then written as four decimal numbers with dots between
them: W.X.Y.Z.
For example, 1010 1100 0001 0000 1100 1000 0000 0010 (0xAC 10 C8 02)
becomes 172.16.200.2. And 1011 1111 1111 1111 0000 1110 0010 1100 (0xBF FF
0E 2C) becomes 191.255.14.44, and so on.
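The same conversions can be done with Python's ipaddress module, treating the 32-bit value as an integer; a small sketch repeating the two examples above:

import ipaddress

print(ipaddress.ip_address(0xAC10C802))                # 172.16.200.2
print(ipaddress.ip_address(0xBFFF0E2C))                # 191.255.14.44
print(hex(int(ipaddress.ip_address("172.16.200.2"))))  # 0xac10c802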
Hosts on the same network (essentially a LAN) must have the same prefix (network portion) in their IP addresses (IPv4 or IPv6). This is how routers route packets
between networks that form the Internet: by the network portion of the IP address.
The whole IP address specifies the host on the network, and the network portion
identifies the LAN. The boundary between network and host IP address bits is moveable for either classful or classless IP addresses. An IP address can be expressed in
dotted decimal, binary, octal, or hexadecimal. While all are correct and mean the same
thing, it’s most common to use dotted decimal notation for IPv4 and hexadecimal
(hex) for IPv6. (In fact, some RFCs, such as those for HTTP [covered in Chapter 22],
require dotted decimal for IPv4 addresses.)
The basic concepts of classful IPv4 addressing are shown in Figure 4.3 for the three
most common classes—A, B, and C. The figure shows the Internet name assigned to the
IPv4 address, the default network mask and prefix length for each of the three common classes, and the IPv4 address in dotted decimal.
Note that when no network mask is given, the class of the address is determined by
the value of the initial bits of the address, as already described. The network mask can
move this boundary, but in practice only to the right in classful addressing.
Classless IPv4 addressing, on the other hand, as used on routers, does not derive a
default subnet mask or prefix length. The prefix length for classless IPv4 addressing
must be given (by the netmask) to properly place the boundary between NetID and
HostID portions of the IPv4 address.
IP addresses, both IPv4 and IPv6, can be public or private. Public network address
spaces are assigned by a central authority and should be unique. Private network
addresses are very useful, but are not guaranteed to be unique. Therefore, the use of
private network address spaces has to be carefully managed, because routers on the
Internet would not work properly if a LAN showed up in two places at the same time.
Nevertheless, the use of private address spaces in IP is popular for perceived security
reasons. The security aspects are often overemphasized: The expansion of the locally
available address space is the key reason for private address use. (If you have one
IP address and three hosts, you have a problem without private addressing.) But private
address spaces must be translated to public addresses whenever a packet makes its way
onto the global public Internet.
Class A: 8 bits for NetID, 24 bits for HostID
Class B: 16 bits for NetID, 16 bits for HostID
Class C: 24 bits for NetID, 8 bits for HostID

FIGURE 4.3
The classful IPv4 address for classes A, B, and C. Note how the boundary between network
identifier and host identifier moves to the right, allowing more networks and fewer hosts in each
class.
Moreover, private IP addresses are not routable outside a local network, so a router
is not allowed to advertise a route to a private address space onto the public Internet. Note that private addresses are just as routable as public ones within your own
network (as on the Illustrated Network), or by mutual consent with another party. They
are not generally routable on the global public Internet due to their lack of uniqueness
and usual practices.
Almost all networks today rely on private network addresses to prevent public IPv4
address exhaustion, so these addresses are not just to test networks and labs any longer.
Customer-edge routers often translate between a large pool of private (internal) and a
smaller pool of public (external) addresses and insulate the local LAN from the outside
world. We’ll talk more about private IPv4 address in the next section of this chapter.
When obtaining a public IP address, a user or organization receives an address
space that should be globally unique on the Internet. (Sadly, you can find your packets “blackholed” to nowhere when you ask an ISP to route them, because someone else used your address space internally for some private network without permission!) This
first piece is the network portion (prefix) of an IP address space, such as 191.255.0.0.
This example uses a so-called “Martian” IPv4 address, which is a valid IP address, but not
used on the Internet. Technically, the address space beginning with 191.255 is reserved,
but could be assigned in the future. The 0.0 ending means an IP network is referenced,
and not a host (in this case, but hosts sometimes have IPv4 addresses that end with
0). Some TCP/IP protocol stacks struggle with IPv4 addresses ending in 0 or 255, so it
is best to avoid them. The host portion of the IPv4 address is assigned locally, usually
by the LAN network administrator. For example, a host could be assigned IPv4 address
191.255.14.44.
The examples in this chapter use the manual, static IP address assignment method.
When this method is used with public IP addresses, the organization still either obtains
the IP network address range on its own, or uses the range of IP addresses assigned to
the organization by its ISP. The Dynamic Host Configuration Protocol (DHCP) makes it
possible to assign IP addresses to devices in a dynamic fashion. DHCP is the method
many organizations use either for security reasons (to make it harder to find device IP
addresses) or to assign a unique IP address to a device only when it actually needs to
access the Internet. There are many more uses for dynamic IP address allocations on
the Internet, and much more to discuss, and DHCP will be explored in a later chapter.
When the topic is routers, IP addresses are often written in the <netid, hostid/
prefix> form to determine the netid/hostid boundary. To completely identify a particular host on a particular network, the whole address is needed. When all 32 bits
of the IPv4 address are given, and the prefix is not, this is called a host address on a
router. In classless routing, there is no fixed separation point between the network and
host portion of the IP address: It is completely determined by the prefix, which must
be known. In dotted decimal notation, the full range of possible IP addresses can run
from 0.0.0.0 to 255.255.255.255. Prefixes can run from /0 (a special, but useful, case)
to /31. Until recently, the /31 prefix was often useless to routers, as we will see in a later
chapter, and the /32 prefix is the same as the host address.
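Given a prefix length, the split into NetID and HostID can be computed mechanically; here is a sketch using the 191.255.14.44 address from earlier with an assumed /16 prefix:

import ipaddress

iface = ipaddress.ip_interface("191.255.14.44/16")

print(iface.network)    # 191.255.0.0/16, the NetID (prefix) part
print(ipaddress.ip_address(int(iface.ip) & int(iface.hostmask)))   # 0.0.14.44, the HostID bits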
Private IPv4 Addresses
RFC 1918 established private address spaces for Classes A, B, and C to be used on private IP networks, and these are still respected in classless IP addressing. Books such as
this one, where it is not desirable to use public IP addresses for examples, use RFC 1918
addresses throughout, much like using “555” telephone numbers in movies and on TV.
The private IP address ranges follow:
■ Class A: 10.0.0.0 through 10.255.255.255 (10.0.0.0/8, or just 10/8)
■ Class B: 172.16.0.0 through 172.31.255.255 (172.16.0.0/12, or just 172.16/12)
■ Class C: 192.168.0.0 through 192.168.255.255 (192.168.0.0/16, or just 192.168/16)
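A quick membership test against these three RFC 1918 blocks can be written with Python's ipaddress module (the helper name here is only illustrative):

import ipaddress

RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_rfc1918(text: str) -> bool:
    """Return True if the address falls inside one of the RFC 1918 private blocks."""
    addr = ipaddress.ip_address(text)
    return any(addr in net for net in RFC1918)

print(is_rfc1918("10.10.11.177"))   # True: the Illustrated Network LANs use private space
print(is_rfc1918("172.32.0.1"))     # False: just past the end of the 172.16/12 block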
There are three very important points that should always be kept in mind regarding
private addresses. First, these addresses should never be announced by a routing protocol on a local router to the public Internet. However, these addresses are frequently
assigned and used when they are isolated or translated. We’ll look at network address
translation (NAT) in a later chapter. In summary,
■
■
■
Private IP addresses are not routable outside the local network (they cannot be
advertised to the public Internet).
They are widely used on almost all networks today (even our small home
network with DSL uses private IP addresses).
Private addresses are usually translated with NAT at an edge router to map the
private addresses used on a LAN to the public address space used by the ISP.
Understanding IPv4 Addresses
IP addresses and their prefixes are read in a certain way and have special meanings
depending on how they are written and used. For example, the classful IPv4 address
192.168.19.48 is read as “host 48 on IP network 192.168.19.0.” In a classless environment, as on a router, the prefix length, in this case /24, must be known. Routers
often drop trailing zeros: 192.168.19.0/24 is the same as 192.168.19/24. All IP network
addresses must have the bits in the host address field set to 0 and this address cannot
be assigned to any host. (Typically, nothing on a host prevents this address assignment.
It just won’t work properly.) Note that while the table is describing a particular /24
address in the examples, it’s not the address itself but its location in the field specified
by the mask that is critical.
Table 4.1 lists some specific forms of IPv4 addresses, what they look like, and whether
they can be used as a source or destination address or have some other special use.
IPv4 addresses in example formats such as 0.0.0.46 and 192.168.14.0 are never
actually seen as packet header addresses. Loopback addresses are used on hosts and
routers for testing and aren’t even numbered on the interface. All systems “know” that
packets sent to the loopback addresses (any IPv4 address starting with 127) are not
sent out the network interface.
Table 4.1 Special Forms of IPv4 Addresses, Showing How Some Are Limited in Application to Source or Destination

Special Address                NetID   HostID          Example          Use
Network itself                 Non-0   All zeros (0s)  192.168.14.0     Used by routers; on a host, means “some host,” but it is not used.
Directed broadcast             Non-0   All ones (1s)   192.168.14.255   Destination only: used by routers to send to all hosts on this network.
Limited broadcast              All 1s  All 1s          255.255.255.255  Destination only: directed broadcast when the NetID is not known.
This host on this network      All 0s  All 0s          0.0.0.0          Source only: used when a host does not know its IPv4 address.
Specific host on this network  All 0s  Non-0           0.0.0.46         Destination only: defined, but not used.
Loopback                       127     Any             127.0.0.0        Destination only: packet is not sent out onto the network.
When these forms are not used in their defined roles (e.g., when something like 172.16.255.255 is used as a packet source address instead of a destination), the result is usually an error.
THE IPv6 ADDRESS
In addition to IPv4 (often written as just IP), there is IP version 6 (IPv6). IPv6 was developed as IPng (“IP: The Next Generation,” because the developers were supposedly fans
of the TV show “Star Trek: The Next Generation”). (IPv5 existed and is defined in RFC
1819 as the Streams 2 [ST2] protocol.)
This section is not intended to be an exhaustive investigation of IPv6. The emphasis here is on the IPv6 header and address, and how IPv6 will affect router operation.
IPv6 has been around since about 1995, but pressure to transition from IPv4 to IPv6
is mostly recent. (The exhaustion of the IPv4 address space has been delayed mainly
through the use of NAT and DHCP.) Today, the pressure for transition from IPv4 to IPv6
comes mainly from network service providers and operators and other groups with
large internal networks, such as cellular telephone network operators.
In some deployments, IPv6 addresses are confined to the core of large IP
networks, and customers and users still see only IPv4 addresses. Nevertheless, there is
nothing to fear about learning IPv6, and some familiarity with IPv6 will probably be
expected in the future.
Features of IPv6 Addressing
The major features of IPv6, such as IPSec, have nearly all been back-ported into IPv4.
However, the major design features of IPv6 follow:
■ An increase in the size of the IP address from 4 bytes (32 bits) to 16 bytes (128 bits).
■ An increase in the size of the IP header from 20 bytes (160 bits) to 40 bytes (320 bits). (Although aside from the address fields, the header is actually smaller than in IPv4.)
■ Enhanced security capabilities using IPSec (if needed).
■ Provision of special “mobile” and autoconfiguration features.
■ Provision for support of flows between routers and hosts for interactive multimedia.
■ Inclusion of header compression and extension techniques.
The IPv6 address increases the size of the IP address from 4 bytes (32 bits) to 16
bytes (128 bits). For backward compatibility, all currently assigned public IP addresses
are supported as a subset of the IPv6 address space. The IPv6 address size increases
the overall IP packet header size (and total TCP/IP overhead) from the current 20 bytes
(160 bits) to 40 bytes (320 bits). However, the IPv6 header is much simpler than the
IPv4 header.
IPv6 includes autoconfigured address and special support for mobile (not always
wireless) users. A new mobile feature called chained headers might allow the faster
forwarding of IPv6 packets through routers, and forbids intermediate fragmentation of
IPv6 packets in routers. The path MTU size must always be respected in IPv6 routers.
IPv6 features support for what are called “flows.” Flows were included in IPv6
because forwarding packets at wirespeed was originally considered impossible. Flow
caching (the association of IPv6 packets into flows with similar TCP/IP header fields)
was thought to be the workaround. However, flow caching is now widely discredited
in the IPv4 world and flows are now established and applied to stateful firewall filters
(Chapter 28). The flow field in IPv6 is normally set to all 0s.
IPv6 is a good fit for a dynamic environment. There are many address discovery
options bundled with IPv6, including support for autoconfiguration, finding the maximum path MTU size (to avoid the need for fragmentation, which IPv6 routers will not
do), finding other hosts’ MAC addresses without ARP broadcasts, and finding routers
other than the default.
The last major feature in IPv6 is a standard for header compression and extension.
At first, these two features may seem contradictory, but they are actually complementary. Header compression addresses situations where the 40 bytes of the IPv6 header
consists mostly of “empty” or repeated fields (like all-0 bit fields). In IPv6, there is a
standard way of compressing the 40 bytes of the header down to 20 or so. There is also
a way to extend these IPv6 header fields for future new features (IPv4 also has header
extension options).
Most networks with a choice will be content to sit and wait before making a
transition to IPv6. Naturally, networks concerned with IPv4 address exhaustion (such
as huge, IP-based cell telephone networks) will convert to IPv6 right away, as large networks in China have. For the vast majority of TCP/IP users, IPv6 is a long way off, and
IPv4 will be around for many years.
IPv6 Address Types and Notation
There are no broadcast addresses at all in IPv6, even directed broadcasts (these were
favorites of IPv4 hackers). In IPv6, multicast addresses serve the same purpose as broadcasts do in IPv4. The difference between IPv6 anycast and multicast is that packets sent
to an anycast IPv6 address are delivered to one of several interfaces, while packets sent
to a multicast IPv6 address are delivered to all of many interfaces.
There is no such thing as dotted decimal notation for IPv6. All IPv6 addresses are
expressed in hexadecimal. They could be expressed in binary as well, but 128 0s and
1s are tedious to write down. IPv6 addresses are written in 8 groups of 16 bits each,
or 8 groups of 4 hexadecimal numbers, separated by colons. Some examples of IPv6
addresses (which appear over and over) follow:
FEDC:BA98:7654:3210:FEDC:BA98:7654:3210
1080:0000:0000:0000:0008:0800:200C:417A
Because this is still a lot to write or type, there are several ways to abbreviate IPv6
addresses. For example, any group can leave out leading 0s, and all-0 groups can be
expressed as just a single 0. A long string of leading 0s can simply be replaced by a
double colon (::). In fact, as long as there is no ambiguity, groups of 0s anywhere in the
IPv6 address can be expressed as ::. The double colon can only be used once in an IPv6
address.
Even with these conventions, the first IPv6 address given earlier cannot be compressed at all. The second address can be expressed as
1080::8:800:200C:417A
This is better than writing out all 128 bits, even as hexadecimal. Because only one set
of double colons can ever be used inside an IPv6 address,
1080:0000:0000:9865:0000:0000:0000:4321
could be written as
1080:0:0:9865::4321
or
1080::9865:0:0:0:4321
but never as
1080::9865::4321
(How big are the missing groups of 0s to the left or right of 9865?)
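Python's ipaddress module applies the same compression rules, always collapsing the longest run of zero groups; a sketch with the address used above:

import ipaddress

addr = ipaddress.ip_address("1080:0000:0000:9865:0000:0000:0000:4321")
print(addr.compressed)   # 1080:0:0:9865::4321, the longest run of zero groups becomes ::
print(addr.exploded)     # 1080:0000:0000:9865:0000:0000:0000:4321

print(ipaddress.ip_address("1080::8:800:200C:417A").exploded)
# 1080:0000:0000:0000:0008:0800:200c:417a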
A special case in IPv6 is made for using IPv4 addresses as IPv6 addresses. For example, the IPv4 address 10.0.0.1 could be written in IPv6 as
0:0:0:0:0:0:A00:1
or even
::A00:1
IPv4 addresses in IPv6 can still be written in dotted decimal as
::10.0.0.1
The double colon at the start is the sign that this is an IPv6 address even though it looks
just like an IPv4 address. Many routers and other devices allow this convention.
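The equivalence between the hexadecimal and dotted decimal forms is easy to verify: 10.0.0.1 is 0x0A000001, which fills the last 32 bits of the IPv6 address. A one-line check in Python:

import ipaddress

print(ipaddress.ip_address("::a00:1") == ipaddress.ip_address("::10.0.0.1"))   # True
print(ipaddress.ip_address("::10.0.0.1").exploded)   # 0000:0000:0000:0000:0000:0000:0a00:0001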
IPv6 Address Prefixes
The first few bits of an IPv6 address do reveal something about the IPv6 address,
although IPv6 addressing is in no way classful. IPv6 addresses have an address type, and
the type is determined by the format prefix of the IPv6 address. There are reserved
addresses in IPv6 as well, for things like loopback (::1), multicast (starting with FF),
and so on. There is also an unspecified address consisting of all 0s (0:0:0:0:0:0:0:0,
compressed as just ::) that can be used as a source address by an IPv6 device that
has not yet been assigned an IPv6 address. IPv6 address space is also reserved for OSI-RM Network Service Access Point (NSAP) addresses, and IPX addresses used with
Novell NetWare.
All of these format prefixes are supposed to be given in hexadecimal, not binary. An
IPv6 address that begins with 1101 therefore starts with the bits 0001 0001 0000 0001, not the bits 1101. An IPv6 multicast address begins with FF, which means its first 8 bits are 1111 1111.
There are several basic forms of IPv6 address. Like many IPv4 addresses, IPv6
address spaces are often handed out by ISPs to their customers, usually starting with
200x. There are also ways to assign variable-length fields for the registry identifier (the
authority that assigned this IPv6 address space to the ISP), provider identifier (the ISP),
subscriber identifier (the customer), subnet identifier (a group of physical links), and
the interface identifier (such as the MAC address). However, most ISPs will assign IPv6
addresses just as they do IPv4 addresses (i.e., as a network address space and prefix
length). Provider independent IPv6 addresses are not handed out by ISPs.
There used to be two types of local IPv6 addresses: site-local and link-local. Local
IPv6 addresses are addresses without global significance, and they can be used over and
over again as long as they do not cause confusion to hosts or routers. Local addresses
start with the same 7 bits: 1111 111 or FE in hexadecimal (overall, the first 10 bits are
important). Site-local addresses are now deprecated (the Internet word for “more than
obsolete”). Link-local addresses can be used between two devices that are part of the
same broadcast domain or on a point-to-point link.
Private IPv6 addresses usually begin with FC00 (the full form is FC00::/7) and are
called unique local-unicast addresses (ULA or ULA local or even ULA-L). Usually, link-local IPv6 addresses end with a 64-bit representation (called EUI-64 by the IEEE) of
the 48-bit MAC address. The EUI-64 is a concatenation of the 24-bit OUI used in the
MAC address with the 40-bit extension formed by prepending the 16 bits 0xFFFE to the
lower 24 bits of the MAC address.
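The construction is easy to reproduce. The Python sketch below (an illustrative helper, not a system API) also inverts the universal/local bit of the OUI, which the IPv6 standard adds on top of the concatenation just described; that inversion is why the MAC address 00:0e:0c:3b:87:32 on bsdserver showed up earlier as fe80::20e:cff:fe3b:8732 rather than fe80::e:cff:fe3b:8732.

import ipaddress

def mac_to_link_local(mac: str) -> str:
    """Build an IPv6 link-local address from a 48-bit MAC address via EUI-64."""
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02                                  # invert the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]     # OUI + FFFE + lower 24 bits
    groups = [(eui64[i] << 8) | eui64[i + 1] for i in range(0, 8, 2)]
    return ipaddress.ip_address("fe80::" + ":".join(f"{g:x}" for g in groups)).compressed

print(mac_to_link_local("00:0e:0c:3b:87:32"))   # fe80::20e:cff:fe3b:8732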
SUBNETTING AND SUPERNETTING
Let’s take a look at all aspects of finding and moving the boundary between network
and host bits in the IP address. The moveable boundary is an important one, because
routers performing indirect delivery generally only need to look at the NetID or prefix
of the entire IP address to determine the next hop and then find the output interface
to send the packet on its way. Of course, direct delivery requires both prefix and host
addressing examination, which is why the location of the NetID/HostID boundary is
so important.
How do routers and hosts know precisely where the boundary between prefix and
host address is in the IP address? Only when this prefix/host boundary is known will
the device know if the next hop is a router. And that, as we’ll see in a later chapter,
makes all the difference.
In the following discussions, the examples used are chosen for their simplicity, not
for completeness.
Subnetting in IPv4
The IP address space was originally classful. (Of course, they didn’t know it was classful
back then—it was just the IP address space). As such, it contained a number of special
purpose and private addresses. These characteristics of the first three classes, which
have already been discussed, are summarized in Table 4.2.
Even before the Web exploded and everyone needed an IP network address for
their PCs and Web sites, it was obvious that Class A and B addresses would quickly
become exhausted, leaving only Class C addresses for most networks. However, these
addresses only allow 254 hosts per IP network (0 and 255 were for the network and
broadcast addresses). Many networks quickly exceeded this limit.
Also, Internet core routers must have a separate routing table entry for every reachable IP network. If most IP networks are Class C networks, then all Internet core routers
would potentially have to hold in memory (and maintain!) a list of more than 2 million entries. Even with inexpensive memory, routing and forwarding tables of this size
Table 4.2 Classful IPv4 Addresses and Default Masks

Class   Initial Bits   Range        Default Mask
A       0              0 to 127     255.0.0.0
B       10             128 to 191   255.255.0.0
C       110            192 to 223   255.255.255.0

Note: The value of the initial bits automatically limits the range of addresses possible in each class.
pose challenges. For example, in 1993 there were fewer than 10,000 routes on most
backbone routers, and this did not grow to 100,000 until about 2001. Now, it is not
uncommon to add 2000 routes per week.
Subnetting Basics
IP address subnetting applies to any IP address. The original application of subnetting
was so that point-to-point links between routers did not require a full /24 address for
each link. Subnetting also allowed a single Class C IP address to be used on small LANs
having fewer than 254 hosts connected by routers instead of bridges. Bridges would
simply shuttle frames among all of the ports on the bridge, but routers, as packet layer
devices, determine the output interface for a packet based on the network portion of
the IP address. If only one address is assigned to the entire site, but two LANs on the
site are connected through a router, then the address must be subnetted so that the
router functions properly. Basically, you need to create two distinct address spaces, and
the IP host addresses assigned on each LAN segment must be correct as well. The LAN
segments now become subnets of the main IP address space.
Subnetting is done using an IP address mask. The mask is a string of bits as long as
the IP address (32 bits in the case of IPv4). If the mask bit is a 1 bit, the corresponding bit in the IP address is part of the network portion of the IP address. If the address
bit is part of the host portion, the corresponding mask bit is set to a 0 bit. A mask
of 255.255.0.0 means that the first 16 bits of the IP address are part of the network
address and the last 16 bits are part of the host portion of the address.
All subnet masks must end in 0, 128, 192, 224, 240, 248, 252, 254, or 255—the values
of each bit position as they are “turned on” left to right in any octet. Strangely, subnet
masks were once allowed to turn on bits that were “noncontiguous” (not starting at
the left of the address without gaps). This is no longer true, and the effect is to restrict
masks to the ending values listed. Note that 255.224.0.0 is a valid subnet mask, as is
255.255.248.0 and 255.255.255.252. Once the 1 bits stop, the rest of the subnet mask
must be set to all 0 bits.
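
To make the mask arithmetic concrete, here is a short illustrative Python sketch (ours, not from the book; the function names are invented) that ANDs a dotted decimal mask with an address to extract the network portion and checks that the mask's 1 bits are contiguous:

import ipaddress

def is_contiguous_mask(mask: str) -> bool:
    # A valid subnet mask is a run of 1 bits followed only by 0 bits.
    bits = int(ipaddress.IPv4Address(mask))
    inverted = bits ^ 0xFFFFFFFF
    # The inverted mask plus one is a power of two only when the 1 bits were contiguous.
    return (inverted + 1) & inverted == 0

def network_portion(address: str, mask: str) -> str:
    # AND the address with the mask to keep only the network bits.
    net = int(ipaddress.IPv4Address(address)) & int(ipaddress.IPv4Address(mask))
    return str(ipaddress.IPv4Address(net))

print(is_contiguous_mask("255.255.248.0"))               # True
print(is_contiguous_mask("255.224.255.0"))               # False (noncontiguous 1 bits)
print(network_portion("172.17.44.200", "255.255.0.0"))   # 172.17.0.0
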
Subnet masks can be written in as many forms as there are for IP addresses: dotted
decimal notation, bit string, octal, or hexadecimal. Subnet masks are most commonly seen in dotted decimal or hexadecimal notation, or in the newer prefix "slash" notation, also known as CIDR notation.
Table 4.3 Use of Default or "Natural" Subnet Masks*

Original Class   Default Mask    Network/Host Bits    Example Interpretation
A                255.0.0.0       8/24 (/8 prefix)     10.24.215.86 is host 0.24.215.86 on network 10.0.0.0
B                255.255.0.0     16/16 (/16 prefix)   172.17.44.200 is host 0.0.44.200 on network 172.17.0.0
C                255.255.255.0   24/8 (/24 prefix)    192.168.27.3 is host 0.0.0.3 on network 192.168.27.0

*The more network bits in the mask, the more network identifiers are possible; the fewer host bits that remain, the fewer host identifiers are possible.
Sometimes the default mask for an IP address
class is called the “natural mask” for that type of address. In all cases it is possible to
change the default mask to move the boundary between the network and host portions of the IP address to wherever the device needs to see it. All devices, whether
hosts or routers, which need to route the packets within the subnetted network, must
have identical masks. All routing protocols in wide use today exchange subnet mask
information together with routing information.
The use of the default masks for the original classful IP address space is shown in
Table 4.3. The more network bits in the mask, the more network identifiers are possible, and the fewer host bits that remain, the fewer host identifiers are possible.
Subnetting moves the boundary between the network and host for a particular
classful IP address to the right of the position where the boundary is normally found.
We will see later that supernetting moves the boundary between network and host for
a particular classful IP address to the left of this position. CIDR (which uses VLSM) can
move the boundary anywhere.
It is important to realize that subnetting does not change anything with respect to
the outside world. Internet routers still deliver the packets as before. It is the customer
or site router that applies the subnet mask and delivers packets to the subnets. Instead
of the usual two parts of the IP address, network and host, we now have network, subnet, and host. However, even at the beginning of the classful era, Class A blocks were
subnetted into /16s and /24s internally as appropriate.
Look at a simple LAN (192.168.15.0) before and after subnetting, as shown in
Figure 4.4. The subnet creates two equal-sized subnets, but the Internet routers deliver
packets as before. The subnet adds one “extra” bit to the default Class C mask. If this bit
is 0, the first subnet is intended, and if the bit is 1, then the second subnet is intended.
The hosts must be numbered according to the subnet, naturally, and all have the same
subnet mask so they can determine which addresses are still on their subnet (same
NetID) and which are not (different NetID).
Many implementations will not allow the assignment of the first subnet address (the
network) or the last (broadcast). A LAN with 254 hosts subnetted into two subnets
only yields 126 host addresses per subnet, not 127.
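
As an aside (our own sketch, not part of the original text), Python's standard ipaddress module reproduces the arithmetic behind Figure 4.4, splitting the Class C network into the two subnets and counting the usable hosts:

import ipaddress

lan = ipaddress.ip_network("192.168.15.0/24")
for subnet in lan.subnets(prefixlen_diff=1):      # the two /25 subnets
    hosts = list(subnet.hosts())                  # excludes network and broadcast
    print(subnet, subnet.netmask, subnet.broadcast_address, len(hosts))

# 192.168.15.0/25 255.255.255.128 192.168.15.127 126
# 192.168.15.128/25 255.255.255.128 192.168.15.255 126
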
FIGURE 4.4
Subnetting a LAN, showing how the value of the initial bits determines the subnet. Host addresses, if assigned manually, must follow the subnet mask convention. Before subnetting, the LAN is the single network 192.168.15.0 with mask 255.255.255.0, broadcast address 192.168.15.255, and hosts such as 192.168.15.1, 192.168.15.2, 192.168.15.129, 192.168.15.253, and 192.168.15.254 all reaching the Internet through one router. After subnetting with mask 255.255.255.128, the same router serves subnet 192.168.15.0 (hosts 192.168.15.1 through 192.168.15.126, broadcast 192.168.15.127) and subnet 192.168.15.128 (hosts 192.168.15.129 through 192.168.15.254, broadcast 192.168.15.255), while the Internet routers deliver packets as before.
A sometimes tricky subnet issue is determining exactly what the subnet address (all
0 bits after the mask) and broadcast address (all 1 bits after the mask) are for a given IP
address and subnet mask. This can be difficult because subnet masks do not always fall
on byte boundaries as do classful addresses. An IP address like 172.31.0.128 might not
look like the address of the network itself, but it might be. A network address, in some
implementations of TCP/IP, cannot be assigned to a host. (172.31.0.128 with a subnet
mask of 255.255.255.128 is a network address.)
Consider the address 172.18.0.126 with a subnet mask of 255.255.255.192. What
is the subnet and broadcast address for this subnet? What range of host addresses can
be assigned to this subnet? These questions come up all the time, and there are utilities
available on the Internet that do this quickly. But here’s one way to do it by hand.
The first thing to do is to mask out the network portion of the IP address with the
subnet mask by writing down the mask bits. Then the subnet portion of the address
can be easily marked off by “turning on” the masked bits. Next, it is easy to form the subnet and broadcast address for the subnet by setting the rest of the bits in the address
(the host bits) first to all 0 bits (network) and then to all 1 bits (broadcast). The resulting address range forms the limits of the subnet.
FIGURE 4.5
Finding the subnet host address range, showing the addresses available for host assignment. Many routers allow the use of subnet and broadcast addresses as if they were host addresses. For IP address 172.18.0.126 (10101010 00010010 00000000 01111110) with subnet mask 255.255.255.192 (11111111 11111111 11111111 11000000), the prefix covers the natural Class B mask plus the subnet bits. Setting the host bits to all 0s gives the subnet address 172.18.0.64 (10101010 00010010 00000000 01000000), and setting them to all 1s gives the broadcast address 172.18.0.127 (10101010 00010010 00000000 01111111). The valid host address range on subnet 172.18.0.64 is therefore 172.18.0.65 through 172.18.0.126 (62 hosts). Many TCP/IP implementations allow assignment of 172.18.0.64 and 172.18.0.127, but not all!
Let’s look at an example. Figure 4.5 shows how to derive the network and broadcast
address answers for IP address 172.18.0.126 with the subnet mask 255.255.255.192.
These answers are important when subnetting the IP address space because care is
needed to assign host addresses to the proper subnets (and router interfaces). Having
a “discontiguous” classful major network that has been subnetted so that part of the
space is reached through one interface of the router (“10.24.0.0 over here...”), and
the other part of the subnetted major network is reached through another interface
(“10.25.0.0 over there . . .”) can be a problem unless care is taken with the subnets and
the masks that establish them.
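
The by-hand procedure can be double-checked with a short sketch of ours (not the book's) that reproduces the Figure 4.5 numbers directly from the address and mask:

import ipaddress

# strict=False accepts a host address and masks off the host bits for us.
subnet = ipaddress.ip_network("172.18.0.126/255.255.255.192", strict=False)
hosts = list(subnet.hosts())
print(subnet.network_address)            # 172.18.0.64  (host bits all 0s)
print(subnet.broadcast_address)          # 172.18.0.127 (host bits all 1s)
print(hosts[0], hosts[-1], len(hosts))   # 172.18.0.65 172.18.0.126 62
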
CIDR and VLSM
Today, the standard methods for moving the network/host address boundary are
variable-length subnet masking (VLSM) for host addressing and routing inside a routing domain, and classless interdomain routing (CIDR) for routing between routing
domains. (We’ll talk more about routing domains later in this book. For now, think of
a routing domain as an ISP’s collection of routers.) And although treated separately
here for introductory reasons, it is important to realize that VLSM is the fundamental
mechanism of CIDR.
CIDR (defined in RFC 1519) and VLSM (defined in RFC 1860) address more general
issues than simple subnetting. We’ve been looking at addresses from the host perspective in this chapter so far. Let’s discuss CIDR from the router perspective.
CIDR was an immediate answer to two problems: first, the impending exhaustion of
the Class A and Class B address space, and second, the rapid increase in Internet core
routing table sizes to handle the many Class C addresses required to handle new users.
In CIDR, a block of contiguous IP addresses from the former classful address space
is assigned as a group, such as a group of Class C addresses. This allows a service
provider or large customer to configure IP networks from a few hosts up to 16,384
hosts. The number of contiguous addresses needed is determined by a simple count
of the number of host addresses required. The original CIDR plan, applied to Class C
addresses, is shown in Table 4.4. Contiguous address numbers flow seamlessly between
former class boundaries, allowing assignment of address “chunks” for larger networks.
Table 4.4 Address Grouping under CIDR*

Number of Hosts Needing Addresses       Class C Addresses Given by Registry
Fewer than 256                          1 Class C network
Fewer than 512 but more than 256        2 contiguous Class C networks
Fewer than 1024 but more than 512       4 contiguous Class C networks
Fewer than 2048 but more than 1024      8 contiguous Class C networks
Fewer than 4096 but more than 2048      16 contiguous Class C networks
Fewer than 8192 but more than 4096      32 contiguous Class C networks
Fewer than 16,384 but more than 8192    64 contiguous Class C networks

*Contiguous address numbers flow seamlessly between former class boundaries, allowing assignment of address "chunks" for larger networks.

The CIDR RFC does not "subtract" two host addresses for the network itself (final bits all 0s) and a broadcast address (final bits all 1s). CIDR applies mainly to router operation, and routers do not assume any structure of the IP addresses in the packets they route. The limitation on assigning the high and low IP addresses to a host interface is a function of the host TCP/IP implementation (and some, like routers, do not enforce any limitations at all).
CIDR changed the terminology that applied to IP addresses. Routes to IP networks are now represented by prefixes. A prefix consists of an IP network address, followed by a slash (/), followed by an indication of how many of the leftmost contiguous bits in the address are part of the network mask applied for routing purposes. For example, before CIDR, the Class C address 192.168.64.0 would ordinarily have a mask of 255.255.255.0. Subnetting could add bits to this major network mask, but only in the fixed patterns and values outlined in the previous section. CIDR enabled a "CIDR-ized" network address to be represented as 192.168.64.0/18, and that was all the information needed. Sometimes this is abbreviated even further to just 192.168.64/18, but the two forms are equivalent. The notation just means that a "subnet mask 18 bits long should be applied to 192.168.64.0." This is the same as writing "192.168.64.0 with mask 255.255.192.0" but in more compact form.
Table 4.5 shows all possible prefix lengths, their netmasks in dotted decimal, and
the number of classful networks the prefix represents. It also shows the number of
usable IPv4 addresses that can be assigned to hosts once the network address itself and
the directed broadcast address are subtracted. We’ll talk about the special 0/0 address
and prefix length in Chapter 8. All possible mask lengths are shown for /1 to /32. The
/0 mask matches the whole Internet and is discussed in the routing chapters.
Even when CIDR was used, all bits after the IP network address had to be zero, an
aspect of IP addressing that did not change. For example, 192.168.64.0/18 was a valid
IP network address, but 192.168.64.0/17 was not (because the "1" bit that represents the "64" falls in the 18th bit position, just beyond a 17-bit prefix and therefore in the host portion). This aspect of CIDR is shown in Figure 4.6. The IP
network 192.168.64.0/18 is a CIDR “supernet” because the mask contained fewer bits
than the natural mask in classful IP addressing.
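
The rule that every bit after the prefix must be zero can be checked mechanically. A minimal sketch (ours), relying on the fact that Python's ipaddress module rejects a network with host bits set:

import ipaddress

def is_valid_network(addr: str, prefix: int) -> bool:
    # True only if every bit beyond the prefix length is zero.
    try:
        ipaddress.ip_network(f"{addr}/{prefix}")   # strict checking is the default
        return True
    except ValueError:
        return False

print(is_valid_network("192.168.64.0", 18))   # True: the "64" bit lies inside the /18 prefix
print(is_valid_network("192.168.64.0", 17))   # False: the "64" bit falls in the host portion
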
Table 4.5 CIDR Prefixes and Addressing*

Prefix Length   Dotted Decimal Netmask   Number of Classful Networks   Number of Usable IPv4 Addresses
/1              128.0.0.0                128 Class A's                 2,147,483,646
/2              192.0.0.0                64 Class A's                  1,073,741,822
/3              224.0.0.0                32 Class A's                  536,870,910
/4              240.0.0.0                16 Class A's                  268,435,454
/5              248.0.0.0                8 Class A's                   134,217,726
/6              252.0.0.0                4 Class A's                   67,108,862
/7              254.0.0.0                2 Class A's                   33,554,430
/8              255.0.0.0                1 Class A or 256 Class B's    16,777,214
/9              255.128.0.0              128 Class B's                 8,388,606
/10             255.192.0.0              64 Class B's                  4,194,302
/11             255.224.0.0              32 Class B's                  2,097,150
/12             255.240.0.0              16 Class B's                  1,048,574
/13             255.248.0.0              8 Class B's                   524,286
/14             255.252.0.0              4 Class B's                   262,142
/15             255.254.0.0              2 Class B's                   131,070
/16             255.255.0.0              1 Class B or 256 Class C's    65,534
/17             255.255.128.0            128 Class C's                 32,766
/18             255.255.192.0            64 Class C's                  16,382
/19             255.255.224.0            32 Class C's                  8,190
/20             255.255.240.0            16 Class C's                  4,094
/21             255.255.248.0            8 Class C's                   2,046
/22             255.255.252.0            4 Class C's                   1,022
/23             255.255.254.0            2 Class C's                   510
/24             255.255.255.0            1 Class C                     254
/25             255.255.255.128          1/2 Class C                   126
/26             255.255.255.192          1/4 Class C                   62
/27             255.255.255.224          1/8 Class C                   30
/28             255.255.255.240          1/16 Class C                  14
/29             255.255.255.248          1/32 Class C                  6
/30             255.255.255.252          1/64 Class C                  2
/31             255.255.255.254          1/128 Class C                 0
/32             255.255.255.255          1/256 Class C (1 host)        - (1 host route)

*All possible mask lengths are shown, for /1 to /32. The /0 mask matches the whole Internet and will be discussed in the routing chapters.
The /31 Prefix
In many cases, a /31 prefix that allows only two IPv4 addresses on a subnet is useless. Hosts are not normally assigned addresses that indicate the network itself (the
lowest address on the subnet) or the directed broadcast (the highest address on
the subnet). Because a /31 prefix only allows the final bit to be 0 or 1, this prefix is
not useful for a subnet with hosts. Most subnets normally use a /30 prefix at most,
which yields two useful host addresses in addition to the low and high addresses.
However, many router networks employ the /31 prefix to address the endpoints of a point-to-point link such as SONET/SDH. There are no hosts to worry
about, and only the router network need worry about the use of internal address
spaces. With /31 prefixes, a single Class C address space can be used to provide
addresses for 128 (256 divided by 2) point-to-point inter-router links, not just 64
(256 divided by 4).
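
A quick sketch of the arithmetic (ours, using a hypothetical 192.168.40.0/24 link-numbering block) shows how many inter-router links one former Class C block covers with /31 versus /30 prefixes:

import ipaddress

block = ipaddress.ip_network("192.168.40.0/24")    # hypothetical block for link numbering

links_31 = list(block.subnets(new_prefix=31))      # two addresses per link, both usable
links_30 = list(block.subnets(new_prefix=30))      # two usable plus network and broadcast
print(len(links_31), len(links_30))                # 128 64

print(list(links_31[0]))   # [IPv4Address('192.168.40.0'), IPv4Address('192.168.40.1')]
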
FIGURE 4.6
CIDR in operation. Basically, supernetting moves the natural mask to the left while subnetting moves it to the right. The IP address 192.168.64.0 is 11000000 10101000 01000000 00000000 in binary. With the natural Class C mask (255.255.255.0), the network is written 192.168.64/24; with the CIDR mask 255.255.192.0 (/18), it is written 192.168.64/18. The supernet portion of the address allows 64 Class C networks to be gathered into one routing table entry: 192.168.64/18.
CIDR allowed the creation of a network such as 192.168.64.0/18 with 16,384 hosts
(14 bits remain for the host portion of the 192.168.64.0 network) instead of requiring
64 separate IP network addresses to be assigned and configured. CIDR did more than
allow the grouping of contiguous Class C addresses into bigger networks than possible
before. Once the principle was established, CIDR allowed the aggregation of all possible IP addresses under the specified prefix into this one compact notation. This kept
routing table sizes under control in the late 1990s.
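
The aggregation itself can be illustrated with the standard library (our sketch): the 64 contiguous former Class C networks from 192.168.64.0 through 192.168.127.0 collapse into the single /18 route.

import ipaddress

class_cs = [ipaddress.ip_network(f"192.168.{third}.0/24") for third in range(64, 128)]
print(list(ipaddress.collapse_addresses(class_cs)))   # [IPv4Network('192.168.64.0/18')]
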
Where does VLSM fit in? As mentioned, VLSM applied more to hosts and a single
routing domain. Basically, in the days of classful IP addressing, all subnets of the same
address had to have the same mask length. So you could, for example, subnet 10.0.0.0/8
into 10.0.0.0/16 subnets, but every device on every subnet had to have the same /16
mask. This could be okay if all the subnetted LANs had roughly the same number of
hosts, but what about point-to-point links between routers on the subnet? They could
get by with a /31 or /30 mask because there were only two endpoints, but they had to
have room for the same thousands of hosts as the rest of the /16.
Note that the Illustrated Network is an offender: The links between our routers use
/24 masks for point-to-point links. We would not do this in the real world, but it will help
our understanding of simple examples when we turn to routing later in this book.
IPV6 ADDRESSING DETAILS
Let’s take a quick look at some of the differences between IPv4 and IPv6 addressing.
The use of the IPv6 address space is determined by the value of the first few bits of an
IPv6 address. Routing in IPv6 is similar to IPv4 with CIDR and VLSM, but there are a few
points to be made to clarify this.
IPv6 addresses can be provider based, provider independent, or for local use. All
provider-based IPv6 addresses for “aggregatable” global unicast packets begin with
either 0010 (2) or 0011 (3) in the first four bit positions of the 128-bit IPv6 address.
Typical IPv6 address prefixes would look like:
2001:0400::/23
2001:05FF::/29
2001:0408::/35
and so on.
The 64 bits that make up the low-order bits of the IPv6 address must be in a
format known as the EUI-64 (64-bit Extended Unique Identifier). Normally, the 48-bit
MAC address consists of 3 bytes (24 bits) assigned to the manufacturer and 3 bytes
(24 bits) for the serial number of the NIC itself. A typical MAC address would look like
0000:900F:C27E. The next to the last bit in the first byte of this address is the global/
local bit, and is usually set to a 0 bit (global). This means that the MAC address is globally assigned and is using the native address assigned by the manufacturer. In EUI-64 format, this bit is flipped and usually ends up being set to a 1 bit (the meaning is flipped
too, so in IPv6, 1 here means global). This would make the first byte 02 instead of 00.
For example, 0000:900F:C27E becomes 0200:900F:C27E (not always, but this is just a
simple example).
To convert a MAC address to a 64-bit address that can be used on an interface for
the host portion of an IPv6 address, we insert the string FFFE between the manufacturer and the serial number fields of the MAC address (between the first and the last
3 bytes). The MAC address becomes 0200:90FF:FE0F:C27E. This is more easily shown
as follows:
■ MAC address: 0200:900F:C27E
■ Split in half: 0200:90   0F:C27E
■ Insert FFFE: 0200:90  FFFE  0F:C27E
■ Form EUI-64: 0200:90FF:FE0F:C27E
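
The conversion is mechanical, so a small sketch of ours may help; it flips the universal/local bit and splices in FFFE, reproducing the example above:

def mac_to_modified_eui64(mac: str) -> str:
    # Convert a 48-bit MAC such as "00:00:90:0F:C2:7E" into a modified EUI-64 string.
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    octets[0] ^= 0x02                                    # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]       # insert FFFE between OUI and serial number
    return ":".join("%02X%02X" % (eui64[i], eui64[i + 1]) for i in range(0, 8, 2))

print(mac_to_modified_eui64("00:00:90:0F:C2:7E"))        # 0200:90FF:FE0F:C27E
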
Link-local IPv6 addresses begin with the 10 bits 1111 1110 10 (FE80 in hexadecimal, making the first two bytes FE80 when all of the trailing 6 bits in the second byte are 0 bits).
ULA local addresses are in the form FC00::/7. In IPv6, interfaces are expected to have
multiple addresses, a shift from IPv4. It’s common to find three IPv6 addresses on an
interface: global, link local, and site local. It is also common to use multiple link-local
addresses, one based on the MAC and the other based on random numbers.
Both forms usually end with the 48-bit IEEE MAC address, but again with the added
FFFE bits to form the EUI-64 identifier. The FC00 ULA address forms are used as the
private addresses in IPv6 (just as 10.0.0.0 and the others in IPv4), and that’s how they
are used in this book.
IPv6 addresses appear in sources and outputs about equally with capitals (FE80) or
lower case (fe80), and we’ll see both. (In the RFCs, however, these are universally capitalized.) The major formats of the IPv6 address are shown in Figure 4.7.
FIGURE 4.7
Major IPv6 address formats, showing how the value of the initial bits determines the format. The FC00 address format is often used as a private IPv6 address. All three formats are 128 bits long. The global unicast address format begins with 001 and consists of a 48-bit global routing prefix (provider and site), a 16-bit subnet ID, and a 64-bit interface ID. The private ULA unicast address format (FC00::/7) begins with the prefix bits 1111 110, followed by a 16-bit subnet ID and a 64-bit interface ID. The link-local unicast address format (FE80::/10) begins with the 10 prefix bits 1111 1110 10, followed by 54 zero bits and a 64-bit interface ID.
Two routers connected by a small LAN can use the link-local IPv6 address of
FE80::<EUI-64 formatted MAC address> on their interfaces. This type of address is
never advertised by an IPv6 router attached to the Internet, and it cannot be used
across subnets. On point-to-point links, a distinguishing identifier of the interface card
other than the MAC address can be used at the end of the link-local address.
ULA-L addresses can include a 16-bit subnet field, so these forms of private IPv6
addresses can be used across subnets (through routers), but these addresses are not
usually advertised onto the Internet. Using link-local and ULA-local IPv6 addresses, an
organization can build an entire global network, but usually only if none of the traffic
tries to travel across the Internet. If it does, IPv6 provider–based addresses are needed.
This is similar to building a complete corporate network in IPv4 using the 10.0.0.0
private address space, but using Network Address Translation (NAT) for traffic that must
travel across the Internet. However, in IPv6, hosts are assigned multiple addresses, some
global and some local. In this case, the lower order bits (80 bits) of the site-local address
(subnet and interface) are just pasted onto the higher fields (48 bits) of the provider-based forms of the IPv6 address.
What about prefix masks and routing in IPv6? As shown above, prefix masks in IPv6 have the same general form as prefix masks in IPv4. Here is a sample IPv6 link-local host address (this time in lowercase hex notation) and one possible network prefix for it:

fe80::90:69ff:fea0:8000/128
fe80::/64

In keeping with all of the addresses used in this book, this IPv6 address is a private address. The /64 mask tells the router that the first 64 bits of the address are to be used for routing purposes.
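
A two-line sketch (ours) applies that /64 prefix to the sample link-local address:

import ipaddress

host = ipaddress.ip_interface("fe80::90:69ff:fea0:8000/64")
print(host.ip)        # fe80::90:69ff:fea0:8000
print(host.network)   # fe80::/64 (the first 64 bits, used for routing)
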
IP Address Assignment
Most people get IP addresses from their ISP. But where do ISPs get their IP addresses?
Large organizations can still apply for their own IP addresses independent from any ISP.
To whom do they apply?
IP addresses (and the Internet domain names associated with them) were initially
handed out by the Internet Assigned Number Authority (IANA). Today the Internet Corporation for Assigned Names and Numbers (ICANN), an international nonprofit organization, oversees the process of assigning IP addresses.
Actual IP addresses are handed out by the following Regional Internet Registries
(RIRs):
■ ARIN (American Registry for Internet Numbers) at www.arin.net—ARIN has handed out IP addresses for North and South America, the Caribbean, and Africa below the Sahara since 1997.
■ RIPE NCC (Reseaux IP European Network Coordination Center) at www.ripe.net—RIPE assigns IP addresses in Europe and surrounding areas.
■ APNIC (Asian Pacific Network Information Center) at www.apnic.net—APNIC assigns IP addresses in 62 countries and regions in Central Asia, Southeast Asia, Indochina, and Oceania.
■ LACNIC (Latin American and Caribbean Network Information Center) at www.lacnic.net—LACNIC assigns IP addresses from ARIN in 38 countries, including Mexico.
■ AfriNIC (African Network Information Center) at www.afrinic.net—AfriNIC took over assignment of African IP addresses from ARIN.
All of these Internet Registries databases (who has what IP address space?) combined
are known as the Internet Routing Registry (IRR). Internet domain names comprise a
related activity, but (like IP addresses) names must be globally unique and (unlike IP
addresses) can be almost anything.
For the latest information on IP address assignment, which is always subject to
change, see www.icann.org.
When it comes to IPv6, in particular, IANA still hands out addresses to the registries,
which pass them along to IPv6 ISPs, who allocate IPv6 addresses to their customers.
The current policy is given at www.arin.net/policy. An older policy is used in this
chapter (see www.arin.net/policy/ipv6_policy.html) and uses these prefixes at each
step of the process:
■ 2001::/16 is reserved for IANA.
■ IANA hands out a /23 prefix to each registry.
■ Registry hands out a /32 or shorter prefix to an IPv6 ISP.
■ ISP allocates a /48 prefix for each customer site.
■ Local administrators add 16 bits for each LAN on their network, for a /64 prefix.
This scheme is shown in Figure 4.8. When the LAN is included, most IPv6 addresses
have /64 network masks. This is the prefix length used on the Illustrated Network. IPv6
routers can perform the following tasks:
■ Route traffic to a particular ISP based on the first 32 bits of the IPv6 destination address.
■ Route traffic to a particular site based on the first 48 bits of the IPv6 destination address.
■ Route traffic to a particular LAN based on the first 64 bits of the IPv6 destination address.
In practice, IPv6 core routers can look at (and build forwarding tables based on) /32 or shorter prefixes, routers inside a particular AS (routing domain) can look at /48 prefixes, and site routers on the customer edge can look at /64 prefixes to get traffic right to the destination LAN.
FIGURE 4.8
One IPv6 address allocation policy: IPv6 address allocation, showing how various bits should be assigned by different entities. The 128-bit address begins with 2001, the registry receives a /23, the ISP prefix is a /32, the site prefix is a /48, and the LAN prefix is a /64, leaving a 64-bit interface ID. In some places, mobile phone providers are heavy users of IPv6 addresses.
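
The same allocation chain can be walked in code; this sketch is ours and uses the 2001:db8::/32 documentation prefix as a stand-in for a real ISP prefix:

import ipaddress
from itertools import islice

isp = ipaddress.ip_network("2001:db8::/32")                  # ISP prefix from the registry
site = next(isp.subnets(new_prefix=48))                      # a /48 for one customer site
lans = list(islice(site.subnets(new_prefix=64), 3))          # a /64 for each LAN at the site

print(site)                  # 2001:db8::/48
for lan in lans:
    print(lan)               # 2001:db8::/64, 2001:db8:0:1::/64, 2001:db8:0:2::/64
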
Now we can better understand the IPv6 address assigned to CE0 that we saw at the
beginning of the chapter:
FC00:ffb3:d5:b:205:85ff:fe88:ccdb
or
FC00:FFB3:00D5:000B:0205:85FF:FE88:CCDB
Let’s break it down one element at a time and see where it all comes from:
■ Registry—We use FC00 instead of 2001 to indicate a private ULA-local IPv6 address.
■ ISP—We add Best ISP's AS number of 65459 (0xFFB3) for LAN 1 or Ace ISP's AS number 65127 (0xFE67) for LAN2.
■ Site—We add telephony area code 213 (0x00D5) for the Los Angeles or 212 (0x00D4) for New York sites. (We could always use more of the phone number, but this is enough.)
■ LAN—We add 11 (0x000B) for LAN1 or 12 (0x000C) for LAN 2. These are borrowed from the IPv4 addresses.
■ EUI-64—We add 0x0205 85FF FE88 CCDB for the hardware MAC address.
The mask is /64, naturally. Keep in mind that in the real world, none of this complex
coding would be done.
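
For illustration only (our sketch, reusing the values listed above), the pieces can be pasted together to confirm they compress to the address shown for CE0:

import ipaddress

registry = 0xFC00                  # private ULA marker used instead of 2001
isp      = 0xFFB3                  # AS number 65459
site     = 0x00D5                  # telephone area code 213 (Los Angeles)
lan      = 0x000B                  # LAN1
eui64    = 0x020585FFFE88CCDB      # EUI-64 built from the router's MAC address

value = (registry << 112) | (isp << 96) | (site << 80) | (lan << 64) | eui64
address = ipaddress.IPv6Address(value)
print(address)                                             # fc00:ffb3:d5:b:205:85ff:fe88:ccdb
print(ipaddress.ip_interface(f"{address}/64").network)     # fc00:ffb3:d5:b::/64
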
QUESTIONS FOR READERS
Figure 4.9 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 4.9
Some major IPv4 and IPv6 address formats, showing the classes in IPv4 and the FE80 and FC00 IPv6 addresses. In IPv4, a Class A address uses 8 bits for the NetID and 24 bits for the HostID, a Class B address uses 16 bits for the NetID and 16 bits for the HostID, and a Class C address uses 24 bits for the NetID and 8 bits for the HostID. In IPv6 (128 bits), the global unicast address format begins with 001 and has a 48-bit global routing prefix, a 16-bit subnet ID, and a 64-bit interface ID; the private ULA unicast address format begins with the FC00::/7 prefix and ends with a 16-bit subnet ID and a 64-bit interface ID; and the link-local unicast address format begins with the FE80::/10 prefix, followed by 54 zero bits and a 64-bit interface ID.
1. How many bits make up IPv4 and IPv6 addresses?
2. Which special address formats make up the IPv4 network itself and directed
broadcast (all hosts on the subnet) addresses?
3. How many hosts can be configured with an IPv4 network mask of
255.255.255.240?
4. What are the differences in format and use between IPv6 link-local and private
ULA-local addresses?
5. How many “double colons” (::) can appear in an IPv6 address?
CHAPTER 5
Address Resolution Protocol
What You Will Learn
In this chapter, you will learn about the hardware addressing used in the data link
layer frame and how it is found by the sender. We’ll talk a lot about the hardware
addresses used on LANs, the MAC addresses.
You will learn about the ARP protocol, which is how IP stacks on LANs identify
the hardware address that the destination field of the frame should use.
The Internet, or any internetwork, is made up of a combination of physical networks
such as LANs and internetworking devices such as routers. A packet sent by a host
might pass through several different physical networks before finally reaching its
destination.
The hosts and routers at the network layer are identified by their network addresses
(also called logical addresses). In TCP/IP, the network or logical address is the IP address,
as we saw in the last chapter. These addresses are usually implemented in software,
and must be globally unique on the Internet. At the data link layer, the interface that
sends and receives frames is identified by the physical or hardware address. An example of a hardware address is the 48-bit MAC address we have been seeing at the frame
level. (See Figure 5.1.)
The hardware address and the network address are two different identifiers with
different sizes, but we need both of them. Layered protocol stacks can use different
types of packets (such as IPv4 and IPv6) on the same Ethernet. Also, IPv4 packets can
be sent over an Ethernet link and then over a point-to-point link with a very different
frame structure.
However, we need some way to map back and forth between addresses at the network and hardware levels. In TCP/IP, this mapping is provided by the address resolution
protocols (the technical term is bindings). ARP results are stored in an ARP cache on
a host so that the entire process does not have to be constantly repeated.
FIGURE 5.1
ARP on the Illustrated Network, showing that devices on the LANs employ ARP to determine hardware (MAC) addresses. LAN1 (Los Angeles office): bsdclient (10.10.11.177, MAC 00:0e:0c:3b:8f:94), lnxserver (10.10.11.66, MAC 00:d0:b7:1f:fe:e6), wincli1 (10.10.11.51, MAC 00:0e:0c:3b:88:3c), and winsvr1 (10.10.11.111, MAC 00:0e:0c:3b:87:36) share an Ethernet LAN switch with router CE0 (fe-1/3/0: 10.10.11.1, MAC 00:05:85:88:cc:db). LAN2 (New York office): bsdserver (10.10.12.77, MAC 00:0e:0c:3b:87:32), lnxclient (10.10.12.166, MAC 00:b0:d0:45:34:64), winsvr2 (10.10.12.52, MAC 00:0e:0c:3b:88:56), and wincli2 (10.10.12.222, MAC 00:02:b3:27:fa:8c) share an Ethernet LAN switch with router CE6 (fe-1/3/0: 10.10.12.1, MAC 00:05:85:8b:bc:db). The customer-edge routers connect through the service provider routers of Ace ISP and Best ISP (AS 65459 and AS 65127) across the global public Internet.
What Layer Is ARP?
Although often shown at the same layer as IP because the messages ride inside
frames, as in this book, the ARPs are really in a class all by themselves. Some authors
describe them as a “high” data link layer function, but they are more of a boundary
function between the logical network and its physical hardware. Also, ARPs are
not really protocols, but rather mapping methods (bindings).
The main address resolution protocol is the Address Resolution Protocol (ARP) itself,
but there are also Reverse ARP (RARP), proxy ARP, Inverse ARP (InARP), and ARP for
ATM networks (ATMARP). Other ARPs have been proposed as well (such as a generic
“WARP” for ARPs on a wide area network). In many ways, the various ARP flavors are
not really separate protocols. For that reason, only the main ARP will be described in
detail in this chapter. The purposes of the other members of the ARP family will be
mentioned, but they are not used very often, and not at all on the Illustrated Network.
Most implementations allow the static entry of ARP IP-address-to-physical-address
information as permanent entries into the ARP cache. However, this poses an administrative nightmare (many organizations have a hard enough time keeping track of IP
addresses alone) and is seldom done today. Most ARP tables today are built and maintained dynamically.
ARP AND LANs
Let’s see how the Illustrated Network uses ARP to map IPv4 addresses to physical
addresses. We can look at some ARPs sent by FreeBSD, Linux, and Windows XP, and see
what they look like. Then we can examine the ARP caches and see what information is
kept and how it is stored.
Figure 5.1 shows the devices on the Illustrated Network that we’ll be working with
in this chapter. This time we’ll be using the hosts on each LAN and a pair of routers.
We’ll use these hosts and routers to look at four different cases where ARP is used,
as shown in Figure 5.2.
Host to host—The ARP sender is a host and wants to send a packet to another host
on the same LAN. In this case, the IP address of the destination is known and
the MAC address of the destination must be found.
Host to router—The ARP sender is a host and wants to send a packet to another
host on a different LAN. A forwarding (routing) table is used to find the IP
address of the router. In this case, the IP address of the router is known and the
MAC address of the router must be found.
FIGURE 5.2
Four ARP scenarios. Note that routers employ ARP just as hosts do, and that an ARP stays on the same subnet as the sender. Case 1: find the address of a host on the same subnet as the source (sending host bsdclient, receiving host lnxserver). Case 2: find the address of a router on the same subnet as the source (sending host wincli1, receiving router CE0). Case 3: find the address of a router on the same subnet as the source router (sending router CE0, receiving router PE5). Case 4: find the address of a host on the same subnet as the source router (sending router CE6, receiving host bsdserver).
Router to router—The ARP sender is a router and wants to forward a packet to
another router on the same LAN. A forwarding (routing) table is used to find
the IP address of the router. In this case, the IP address of the router is known
and the MAC address of the destination router must be found.
Router to host—The ARP sender is a router and wants to forward a packet to a
host on the same LAN. In this case, the IP address of the host is known (from
the IP destination address on the packet) and the MAC address of the host
must be found.
Let’s look at Case 1 in detail because the others are more or less variations on this
basic theme. In Case 1, ARP is used when a host wants to send to another host on the
same IP subnet and the MAC address of the destination is not already known. We’ll
start the LAN2 host lnxclient sending a short message to winsrv2 (it doesn’t really
matter what the message is). Because this is the first time that these devices have
communicated in a long time, an ARP request is broadcast on LAN2 and the sender
waits for a reply.
Now let’s capture the ARP request and response pair on the lnxclient host at IPv4
address 10.10.12.166. We’ll set a filter to only capture and display ARP packets.
[root@lnxclient admin]# /usr/sbin/tethereal -V arp
Capturing on eth0
Frame 1 (42 bytes on wire, 42 bytes captured)
Arrival Time: May 5, 2008 22:13:40.148457000
Time delta from previous packet: 0.000000000 seconds
Time relative to first packet: 0.000000000 seconds
Frame Number: 1
Packet Length: 42 bytes
Capture Length: 42 bytes
Ethernet II, Src: 00:b0:d0:45:34:64, Dst: ff:ff:ff:ff:ff:ff
Destination: ff:ff:ff:ff:ff:ff (Broadcast)
Source: 00:b0:d0:45:34:64 (Dell_45:34:64)
Type: ARP (0x0806)
Address Resolution Protocol (request)
Hardware type: Ethernet (0x0001)
Protocol type: IP (0x0800)
Hardware size: 6
Protocol size: 4
Opcode: request (0x0001)
Sender MAC address: 00:b0:d0:45:34:64 (Dell_45:34:64)
Sender IP address: 10.10.12.166 (10.10.12.166)
Target MAC address: 00:00:00:00:00:00 (00:00:00_00:00:00)
Target IP address: 10.10.12.52 (10.10.12.52)
Frame 2 (106 bytes on wire, 106 bytes captured)
Arrival Time: May 5, 2008 22:13:40.148642000
Time delta from previous packet: 0.000185000 seconds
Time relative to first packet: 0.000185000 seconds
Frame Number: 2
Packet Length: 106 bytes
Capture Length: 106 bytes
Ethernet II, Src: 00:0e:0c:3b:88:56, Dst: 00:b0:d0:45:34:64
Destination: 00:b0:d0:45:34:64 (Dell_45:34:64)
Source: 00:0e:0c:3b:88:56 (00:0e:0c:3b:88:56)
Type: ARP (0x0806)
Trailer: 00000000000000000000000000000000...
Address Resolution Protocol (reply)
Hardware type: Ethernet (0x0001)
Protocol type: IP (0x0800)
Hardware size: 6
Protocol size: 4
Opcode: reply (0x0002)
Sender MAC address: 00:0e:0c:3b:88:56 (00:0e:0c:3b:88:56)
Sender IP address: 10.10.12.52 (10.10.12.52)
Target MAC address: 00:b0:d0:45:34:64 (Dell_45:34:64)
Target IP address: 10.10.12.166 (10.10.12.166)
We’ll look at the fields of an ARP in detail later. For now, note that the ARP request,
indicated by a 0x0806 in the Ethertype field goes out as a broadcast frame with an
all-zero MAC address field. It’s looking for the MAC address that goes with IP address
10.10.12.52 (winsrv2), the target IP address. The ARP reply frame returns the reply
with the correct MAC address plugged into the all-zero field (and with the MAC address
as the source address in the frame).
The result of an ARP exchange between the bsdclient host (10.10.11.177) and the lnxserver host (10.10.11.66) is almost the same, but not quite. The frame sent in reply to the ARP is smaller than before.
bsdclient# tethereal -V arp
Capturing on em0
Frame 1 (42 bytes on wire, 42 bytes captured)
Arrival Time: May 5, 2008 22:24:04.518213000
Time delta from previous packet: 0.000000000 seconds
Time since reference or first frame: 0.000000000 seconds
Frame Number: 1
Packet Length: 42 bytes
Capture Length: 42 bytes
Ethernet II, Src: 00:0e:0c:3b:8f:94, Dst: ff:ff:ff:ff:ff:ff
Destination: ff:ff:ff:ff:ff:ff (Broadcast)
Source: 00:0e:0c:3b:8f:94 (10.10.11.177)
Type: ARP (0x0806)
Address Resolution Protocol (request)
Hardware type: Ethernet (0x0001)
Protocol type: IP (0x0800)
Hardware size: 6
Protocol size: 4
Opcode: request (0x0001)
Sender MAC address: 00:0e:0c:3b:8f:94 (10.10.11.177)
Sender IP address: 10.10.11.177 (10.10.11.177)
Target MAC address: 00:00:00:00:00:00 (00:00:00_00:00:00)
Target IP address: 10.10.11.66 (10.10.11.66)
Frame 2 (60 bytes on wire, 60 bytes captured)
Arrival Time: May 5, 2008 22:24:04.518421000
Time delta from previous packet: 0.000208000 seconds
Time since reference or first frame: 0.000208000 seconds
Frame Number: 2
Packet Length: 60 bytes
Capture Length: 60 bytes
Ethernet II, Src: 00:d0:b7:1f:fe:e6, Dst: 00:0e:0c:3b:8f:94
Destination: 00:0e:0c:3b:8f:94 (10.10.11.177)
Source: 00:d0:b7:1f:fe:e6 (10.10.11.66)
Type: ARP (0x0806)
Trailer: 000000000000000000000000000000000000
Address Resolution Protocol (reply)
Hardware type: Ethernet (0x0001)
Protocol type: IP (0x0800)
Hardware size: 6
Protocol size: 4
Opcode: reply (0x0002)
Sender MAC address: 00:d0:b7:1f:fe:e6 (10.10.11.66)
Sender IP address: 10.10.11.66 (10.10.11.66)
Target MAC address: 00:0e:0c:3b:8f:94 (10.10.11.177)
Target IP address: 10.10.11.177 (10.10.11.177)
The reply from the Linux system is only 60 bytes, 46 bytes less than the response
from the Windows XP server in the first example. That’s interesting; let’s take a closer
look at what Windows XP is doing. Figure 5.3 shows a graphical capture of the reply
from winsrv2 (10.10.12.52) to an ARP request from wincli2 (10.10.12.222).
The reply is indeed 106 bytes long, but the extra bits are all zeros. The only difference in the replies is the number of trailing zeroes in the frame. And we can also see
that the ARP software can deal with these easily.
We’ve already mentioned that ARP results are cached. The devices that send the
ARP requests cache the results, and the device that receives the ARP usually also caches
the MAC address in the arriving ARP request. The idea is that if one device in a pair
FIGURE 5.3
Windows XP ARP reply capture. The ARP message, in this case an ARP reply, is encapsulated
directly inside the Ethernet frame.
sends in one direction, the other device in the pair will probably send in the opposite
direction as well.
Let’s look at the ARP cache on the bsdserver host (10.10.12.77) using the -a (all) option.
bsdserver# arp -a
? (10.10.12.1) at 00:05:85:8b:bc:db on em0 [ethernet]
? (10.10.12.52) at 00:0e:0c:3b:88:56 on em0 [ethernet]
? (10.10.12.166) at 00:b0:d0:45:34:64 on em0 [ethernet]
? (10.10.12.222) at 00:02:b3:27:fa:8c on em0 [ethernet]
All four other devices on LAN2 are represented. The question marks are there
because we have no DNS running at the moment. Let’s see if we can add to the cache
by sending a ping to the Windows XP server (winsrv1) on LAN1.
bsdserver# ping 10.10.11.111
PING 10.10.11.111 (10.10.11.111): 56 data bytes
64 bytes from 10.10.11.111: icmp_seq=0 ttl=126 time=0.403 ms
64 bytes from 10.10.11.111: icmp_seq=1 ttl=126 time=0.413 ms
64 bytes from 10.10.11.111: icmp_seq=2 ttl=126 time=0.376 ms
^C
--- 10.10.11.111 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.376/0.397/0.413/0.016 ms
bsdserver# arp -a
? (10.10.12.1) at 00:05:85:8b:bc:db on em0 [ethernet]
? (10.10.12.52) at 00:0e:0c:3b:88:56 on em0 [ethernet]
? (10.10.12.166) at 00:b0:d0:45:34:64 on em0 [ethernet]
? (10.10.12.222) at 00:02:b3:27:fa:8c on em0 [ethernet]
Nothing was added to the ARP cache on the FreeBSD server. Why should it be? The
other host is only reachable through a router, and the router’s ARP entry is already there
(10.10.12.1). These types of ARPs, the most common, are only used when the destination is on the same LAN subnet as the source.
Entries in the ARP cache are usually deleted when no communication occurs with another device for a set period, typically after 300 seconds (5 minutes) of silence between the devices. We can force the ARP cache to empty by using the -d (delete) option.
bsdserver# arp -d -a
10.10.12.1 (10.10.12.1) deleted
10.10.12.52 (10.10.12.52) deleted
10.10.12.166 (10.10.12.166) deleted
10.10.12.222 (10.10.12.222) deleted
In Linux, the command to display the ARP cache is the same (arp), but the -e option displays the result in the “default” Linux format (using no option gives the same result). The “C” means that the entry is “complete.”
[root@lnxserver admin]# /sbin/arp
Address          HWtype  HWaddress          Flags Mask   Iface
10.10.11.1       ether   00:05:85:88:CC:DB  C            eth0
10.10.11.111     ether   00:0E:0C:3B:88:3C  C            eth0
10.10.11.177     ether   00:0E:0C:3B:8F:94  C            eth0
10.10.11.51      ether   00:0E:0C:3B:87:36  C            eth0
[root@lnxserver admin]# /sbin/arp -e
Address          HWtype  HWaddress          Flags Mask   Iface
10.10.11.1       ether   00:05:85:88:CC:DB  C            eth0
10.10.11.111     ether   00:0E:0C:3B:88:3C  C            eth0
10.10.11.177     ether   00:0E:0C:3B:8F:94  C            eth0
10.10.11.51      ether   00:0E:0C:3B:87:36  C            eth0
In Linux, use of the -a option displays the results in “BSD” style. The output is still slightly different, however.
[root@lnxserver admin]# /sbin/arp -a
? (10.10.11.1) at 00:05:85:88:CC:DB [ether] on eth0
? (10.10.11.111) at 00:0E:0C:3B:88:3C [ether] on eth0
? (10.10.11.177) at 00:0E:0C:3B:8F:94 [ether] on eth0
? (10.10.11.51) at 00:0E:0C:3B:87:36 [ether] on eth0
Windows XP displays the ARP cache with arp -a as well. This output is from winsrv2 on LAN2.
C:\Documents and Settings\Owner>arp -a

Interface: 10.10.12.52 --- 0x1003
  Internet Address      Physical Address      Type
  10.10.12.1            00-05-85-8b-bc-db     dynamic
  10.10.12.77           00-0e-0c-3b-87-32     dynamic
  10.10.12.166          00-b0-d0-45-34-64     dynamic
  10.10.12.222          00-02-b3-27-fa-8c     dynamic
The term dynamic distinguishes these entries from statically defined entries.
There is no separate ARP for IPv6. MAC addresses can be embedded in the IPv6
addresses, but this does not solve the problem of a source host knowing the physical
address of a destination host or router. When a host uses IPv4-derived IPv6 addresses,
such as ::10.10.11.111, IPv4 ARP information can be used to supply the MAC addresses
for IPv6.
The address resolution process in IPv6 uses ICMPv6 messages and is part of the
Neighbor Discovery (ND) process. Generally, a multicast Neighbor Solicitation message
is sent and a unicast Neighbor Advertisement message is received in reply. We’ll talk
more about this process in the chapter on ICMPv6. For now, let’s just verify that IPv6
address resolution uses ICMPv6 messages.
Ethereal can capture and display IPv6 traffic as well as IPv4. Let’s send a test message
using the link-local IPv6 addresses from winsrv1 to wincli1, and capture the address
resolution in action. We’ll capture everything but only display ICMPv6 messages. The
result is shown in Figure 5.4.
FIGURE 5.4
IPv6 address resolution with ICMPv6, showing that the Neighbor Solicitation frame is sent to the
special IPv6 Neighbor Discovery address.
Figure 5.4 shows the details of the Neighbor Solicitation message. The frame destination address is highlighted in the figure, showing that a special multicast frame
address is used instead of the ARP broadcast frame address. The major differences
between this procedure and the ARP process in IPv4 are that ICMPv6 is used in IPv6,
and the solicitation message is sent to the IPv6 multicast group address associated with
the target address.
ARP PACKETS
ARP uses packets, but these are not IP packets. ARP messages ride inside Ethernet
frames, or any LAN frame, in exactly the same way as IP packets. There is no need to
use an IP address here anyway: ARP frames are valid only for a particular LAN segment
and never leave the local LAN (i.e., ARP messages cannot be routed). The structure of
an ARP message is shown in Figure 5.5.
FIGURE 5.5
The ARP message’s fields. The message is placed directly inside a frame, such as an Ethernet frame. The fields, drawn 4 bytes to a row, are Type of Hardware, Type of Protocol, Hardware Size, Protocol Size, Operation, Sender’s Ethernet Address, Sender’s IP Address, Target’s Ethernet Address, Target’s IP Address, and any trailing 0s.
The figure is drawn this way because the 28-byte ARP message includes fields 1, 2, 4, and 6 bytes in length, and does not readily lend itself to a “normal” 32-bit representation. The first five
fields form a type of message header. The next four fields are the sender’s and target’s
IP and MAC addresses. Usually, it’s the target’s MAC address that needs to be found with
the ARP process. And as we have already seen, the ARP message can end with a variable
number of trailing zeros.
On an Ethernet LAN, ARP messages have their own Ethertype value (0x0806). However, some ARP implementations used the “regular” Ethertype for IP packets (0x0800)
because the IP implementation itself can easily decide if the information inside the
frame is IPv4 (packet starts with 0x04) or an ARP message (packet starts with 0x0001
for Ethernet).
The main fields are present in both ARP request and ARP reply messages:
Type of Hardware—This 2-byte field is used to identify the style of hardware
address. (The Ethernet-style MAC address, with value = 1, is the most common,
of course.)
Type of Protocol—This 2-byte field identifies the type of Layer 3, or network layer,
protocol that is being queried. (ARP messages, because they are not IP packets,
can be used for more than IP addresses.) This uses the same set of values as the
Ethertype field, so IP is 0x0800.
Hardware Size—This byte identifies the size, in bytes, of the hardware address.
The Ethernet MAC address is 6 bytes long.
Protocol Size—This byte identifies the size, in bytes, of the Layer 3 protocol address. IPv4 addresses are 4 bytes long.
Operation—This 2-byte field identifies the ARP message’s intent. For example, an
ARP request (“Who has this IPv4 address?”) has the operation value of 1 and
a reply value of 2.
The rest of the fields do not have a fixed size. Their size is determined by the value
in the Hardware Size and Protocol Size fields. On our Ethernet LANs, the hardware
address size is 6 bytes (MAC) and the protocol address size is 4 bytes (IPv4). In that
case, the sizes and functions of these fields are as follows.
Sender’s Ethernet Address—This 6-byte field holds the sender’s Ethernet address.
It should be the same as the source address in the Ethernet frame.
Sender’s IP Address—This 4-byte field holds the sender’s IPv4 address. (This is how targets fill in their own ARP caches without requiring more ARPs.)
Target’s Ethernet Address—This 6-byte field holds the target’s Ethernet address.
This field is set to all 0 bits in a request. The reply will have this field filled in
and the operation changed to “reply.”
Target’s IP Address—This 4-byte field holds the target’s IPv4 address.
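
To tie the fields together, here is a sketch of ours (not a tool from the book) that packs the 28-byte ARP request from the lnxclient capture using Python’s struct module; the field order and sizes follow the list above:

import socket
import struct

def build_arp_request(sender_mac: str, sender_ip: str, target_ip: str) -> bytes:
    # Pack a 28-byte Ethernet/IPv4 ARP request; the target MAC is all zeros.
    return struct.pack(
        "!HHBBH6s4s6s4s",
        0x0001,                                        # Type of Hardware: Ethernet
        0x0800,                                        # Type of Protocol: IPv4
        6,                                             # Hardware Size
        4,                                             # Protocol Size
        0x0001,                                        # Operation: request
        bytes.fromhex(sender_mac.replace(":", "")),    # Sender's Ethernet Address
        socket.inet_aton(sender_ip),                   # Sender's IP Address
        b"\x00" * 6,                                   # Target's Ethernet Address (unknown)
        socket.inet_aton(target_ip),                   # Target's IP Address
    )

arp = build_arp_request("00:b0:d0:45:34:64", "10.10.12.166", "10.10.12.52")
print(len(arp), arp.hex())   # 28 bytes; compare with the capture and Figure 5.7
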
EXAMPLE ARP OPERATION
What the ARP process adds to TCP/IP is a mechanism for a source device to ask,“Who
has IP address 10.10.12.52 (this was our first example from the Illustrated Network)
and what is the physical (hardware) address associated with it?”
ARP messages are broadcast frames sent to all stations. The proper destination IP
layer realizes that the destination IP address in the packet matches its own and replies
directly to the sender. The target device replies by simply reversing the source and
destination IP address in the ARP packet. The target also uses its own hardware address
as the source address in the frame and message.
The ARP process is shown in Figure 5.6. The steps are numbered and taken from
the example earlier in this chapter, where lnxclient ARPs to find the MAC address of
winsvr2.
1. The system lnxclient (10.10.12.166) assembles an ARP request and sends it as a
broadcast frame on the LAN. Because it is unknown, the requested MAC address field
in the ARP message uses all zeros (0s), which are placeholders.
2. All devices attached to the LAN receive and process the broadcast, even the router CE6. But only the device with the target’s IP address in the ARP message (winsvr2 at 10.10.12.52) replies to the ARP. The target also caches the MAC address associated with 10.10.12.166 (the source address in the broadcast frame).
3. The target system winsvr2 sends a unicast ARP reply message back to lnxclient. The reply has the MAC address requested both in the frame (as a source address) and in the ARP message field sent as 0s.

FIGURE 5.6
The ARP request and reply process. The message asks for the MAC address associated with the destination and carries the sender’s addresses so the reply knows where to go; other devices that hear the exchange can cache the information. In the example, lnxclient (10.10.12.166) broadcasts the ARP request (“What’s the MAC address of 10.10.12.52? Tell 10.10.12.166, okay?”) on the Ethernet LAN. bsdserver (10.10.12.77), wincli2 (10.10.12.222), and router CE6 (10.10.12.1) ignore the request (though they can cache the sender’s MAC and IP addresses), while winsvr2 (10.10.12.52) recognizes its own address and sends the reply (“Here’s my MAC address...”) as a unicast.
The originating source system and the target system will cache the hardware
address of the destination and proceed to send “live” IP packets with the information,
at the same time supplying the proper frame address as a parameter to the network
access layer software.
Figure 5.7 shows how the ARP request and reply messages shown at the beginning of this chapter look “on the wire.” The field values can be compared to the ARP message format shown in Figure 5.5. Again, the lnxclient to winsrv2 ARP pair is used as the example. Trailing zeros are not shown.
ARP operation is completely transparent to the user. ARP operation is usually
triggered when a user runs some TCP/IP application, such as FTP, and the frame’s destination MAC address is not in the ARP cache.
FIGURE 5.7
ARP exchange example, showing how the requested information is provided by the destination’s reply. The ARP request from lnxclient (10.10.12.166, MAC 00:b0:d0:45:34:64) is a broadcast frame (destination 0xFFFFFFFFFFFF, Ethertype 0x0806) whose 28-byte ARP message carries hardware type 0x0001, protocol type 0x0800, sizes 0x06 and 0x04, opcode 0x0001 (request), the sender’s MAC and IP addresses (0x00B0D0453464 and 0x0A0A0CA6), an all-zero target MAC address, and the target IP address 0x0A0A0C34 (10.10.12.52). The ARP reply from winsvr2 (10.10.12.52, MAC 00:0e:0c:3b:88:56) is a unicast frame back to 0x00B0D0453464 with opcode 0x0002 (reply), the sender and target fields reversed, and the requested MAC address filled in.
ARP VARIATIONS
ARP is a fairly straightforward procedure to determine the LAN hardware address that
goes with a given IP address. However, there are more network types than LANs and
there are more “addresses” that need to be associated with IP addresses than “hardware” addresses. Consequently, there are a few other types of ARPs that have evolved to
deal with other IP network situations.
Proxy ARP
Proxy ARP is an older technique (it was called the “ARP Hack”) that was used in early
routers, and is still supported in some routers today. LANs connected by bridges had
hosts that did not (and could not) use different IP network addresses. The same IP
network address is used on both sides of a bridge, so there is one broadcast domain, and
ARPs are shuttled back and forth. This practice wasted bandwidth on the LANs (and on
any WAN link between the bridges). Proxy ARP allowed the router that replaced the
bridge to respond to ARP requests directly with its own MAC address, without having
to propagate the ARP packets onto the other LAN segment. Hosts then sent frames
to the router, but acted as if they were sending the frames directly to the destination
host. Proxy ARP makes sure that the router received the frame, just as with indirect
delivery.
Routers normally require that the same IP subnet address not be configured on
more than one router port. Proxy ARP was a method of assigning a single Class A, B, or
C address to both sides of a router without using subnet masking, allowing the router to
function as a bridge. Proxy ARP was useful as networking transitioned from bridges to
routers.
Proxy ARP is still used today in Mobile IP networks, which often bridge between devices.
Reverse ARP
Reverse ARP (RARP) is used in cases where a device on a TCP/IP network knows its
physical (hardware) address but must determine the IP address associated with it.
A RARP request (“I have MAC address X . . . What’s my IP address?”) is sent to a device
running the RARP server process. The RARP server replies with the IP address of the
device. The RARP server should be located on the local LAN segment, but it does not
have to be.
RARP messages use the same message format as ARP, but the Ethertype is 0x8035, and the operation field is 3 for a RARP request and 4 for a RARP reply. Of course, the information to be supplied is the IP address. As with ARP, the request is broadcast and the reply is unicast. RARP is defined in RFC 903.
RARP was frequently used for diskless network devices on TCP/IP networks such
as workstations, X-terminals, routers, and hubs. These devices needed to obtain variable configuration information, such as the IP address, from an external source whenever they were rebooted or powered on. However, the amount of configuration information that could be obtained through RARP was very limited. Today, with almost every device
having flash memory to store configuration information during reboot when power is
off, the need for RARP is greatly diminished.
Even in cases where configuration information or IP addresses need to be assigned
dynamically, there are better ways to achieve the same result than with RARP, such as
BOOTP and DHCP. Both will be discussed in Chapter 18 of this book.
ARPs on WANs
On most WANs, ARP is still used, but as a limited multicast rather than a broadcast. ARP
has a couple of variations used to address WAN environments such as frame relay
and ATM networks. These public network technologies use virtual circuits (a type
[Figure: Router 1 sends InARP message 1 ("Which IP address is at the end of DLCI 18?") and InARP message 2 ("Which IP address is at the end of DLCI 19?") into the frame relay network; Router 2 and Router 3 each reply, "My IP address is in the ARP reply ... use this in the routing table."]
FIGURE 5.8
Inverse ARP (InARP) exchange over a frame relay network. In this case, the hardware address
(DLCI) is known and the sender needs to determine the IP address.
of logical connection) at the frame (frame relay) or cell (ATM) level instead of MAC
addresses. The issue in frame relay and ATM (both called non-broadcast multiaccess [NBMA] networks) is to find the virtual circuit number, such as the Data Link Connection Identifier (DLCI) in frame relay, associated with a particular IP address.
InARP (Inverse ARP) was developed for use on frame relay networks. Instead of using ARP to determine MAC-layer LAN addresses, TCP/IP networks linked by frame relay use InARP to determine the IP address at the other end of a frame relay DLCI to use when sending IP packets. InARP is used as soon as frame relay DLCIs are
created. The replies are used to build the routing table in the frame relay access device
(router). The InARP process is shown in Figure 5.8. InARP is essentially an adaptation of
the reverse ARP (RARP) process used on LANs.
ATMARP is a similar method used to find the ATM virtual path identifier (VPI) and/
or virtual channel identifier (VCI) over an ATM network.
ARP AND IPv6
IPv6 really has no need for a separate ARP function. Instead, the Neighbor Discovery
protocol (ND, sometimes NDP) described in RFC 2461 performs the functions of the
IPv4 ARP in IPv6.
ND combines and extends the functions of IPv4's ARP, ICMP Redirect, and ICMP Router Discovery. This section will discuss some of the features of NDP, but most of this material will be covered in the chapter on ICMP.
Neighbor Discovery Protocol
The Neighbor Discovery protocol is the way that IPv6 hosts and routers find things
out about their immediate neighborhood, typically the LAN segment. A lot of effort
was expended in IPv4 to find out configuration necessities such as default routers,
any alternate routers, MAC addresses of adjacent hosts, and so on. In some cases, these
addresses could not be found automatically with IPv4 and had to be entered manually
(the default router). IPv6 was designed to be almost automatic in this regard.
When an IPv6 host comes up for the first time, the host advertises its MAC layer
address and asks for neighbor and router information. Because these messages are in
the form of ICMPv6 messages, only the basics will be presented here.
Why Neighbor and Router Discovery?
Why does IPv6 have separate neighbor and router discovery messages? After all,
IPv4 did fine using a single broadcast frame structure for host–host and router–
host address discovery.
IPv6 is more sophisticated than IPv4 when it comes to devices and networks.
In IPv6, devices located on the local multiple-access link (LAN) are considered on-link, and all others are off-link. Generally, there are a lot more hosts on a network
than routers. IPv6 directs messages that discover host addresses only to the local
hosts, while messages to discover one or more default routers are processed only
by the routers.
Instead of a single mass broadcast, neighbor discovery in IPv6 is done with multicast groups. We'll talk about multicast in more detail in a later chapter.
Many routers today forward packets in hardware, but broadcasts have to be
processed by software. IPv6 routers can ignore the numerous messages sent from
host to host on a LAN. This makes IPv6's use of network resources more efficient.
The ARP function in IPv6 is performed by four messages in ND. The Router
Solicitation/Router Advertisement mechanism is noteworthy in that it provides the key
for host IPv6 address configuration, default route selection, and potentially even bootstrap configuration information.
Neighbor Solicitation—This message is sent by a host to find out the MAC layer
address of another host. It is also used for Duplicate Address Detection (Does another host have the same IPv6 address?) and for Neighbor Unreachability
Detection (Is the other host still there?). The receiving host must reply with a
Neighbor Advertisement.
Neighbor Advertisement—This message contains the MAC layer address of the
host and is sent in reply to a Neighbor Solicitation message. Hosts also send
unsolicited Neighbor Advertisements when they first start up or if any of the advertised information changes.
Router Solicitation—This message is sent by a host to find routers. The receiving
router must reply with a Router Advertisement.
Router Advertisement—This message contains the MAC layer address of the
router and is sent in reply to a Router Solicitation message. Routers also send unsolicited Router Advertisements when they first start up or if any of the advertised information changes.
ND Address Resolution
ND functions are performed only for local IPv6 addresses (these messages carry a hop limit of 255, and any ND message arriving with a lower hop limit is discarded, which guarantees it has not been forwarded by a router). ND messages, unlike ARP, are not broadcast ("Everyone
pay attention to this”) but rather multicast (“Only those interested pay attention
to this”).
When an IPv6 host or router starts up, it joins several multicast groups. The IPv6 node must join the all-nodes group. It must also join a solicited-node group for each IPv6 address configured on each interface running IPv6. Joining these groups allows
the device to receive packets without having all the details of its address established.
This is a much more sophisticated arrangement than the ARP method used in IPv4. The
IPv6 device must keep these multicast groups active until all of its addressing details
have been resolved.
When an IPv6 device needs to resolve the MAC layer address of another host on the
LAN, a Neighbor Solicitation message is sent to the solicited-node multicast address.
The IPv6 solicited-node multicast address is formed by taking the low-order 24 bits of the IPv6 address and appending them to the 104-bit prefix ff02:0:0:0:0:1:ff00::/104. Thus, for the link-local IPv6 address fe80::20e:cff:fe3b:883c, the IPv6 multicast group address used is ff02::1:ff3b:883c.
But what multicast address should the message use in the Ethernet frame? That multicast address is formed by prepending 33:33 to the low-order 32 bits of the IPv6 multicast group address. Each device with an IPv6 address registers this form with the local NIC and expects to receive ND messages this way initially. For the IPv6 multicast group address ff02::1:ff3b:883c, the multicast address used in the Ethernet destination field is 33:33:ff:3b:88:3c.
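The derivation is easy to check in a few lines of Python (a sketch of the standard rule, not something run on the lab hosts):

import ipaddress

def solicited_node(addr):
    # Solicited-node group: the ff02::1:ff00:0/104 prefix plus the low-order 24 bits of the address
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    group = ipaddress.IPv6Address(int(ipaddress.IPv6Address("ff02::1:ff00:0")) | low24)
    # Ethernet mapping: 33:33 plus the low-order 32 bits of the multicast group address
    low32 = int(group) & 0xFFFFFFFF
    mac = "33:33:" + ":".join(f"{(low32 >> s) & 0xFF:02x}" for s in (24, 16, 8, 0))
    return str(group), mac

print(solicited_node("fe80::20e:cff:fe3b:883c"))
# ('ff02::1:ff3b:883c', '33:33:ff:3b:88:3c')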
An example of the address resolution pair captured earlier in this chapter is shown in Figure 5.9. Note the use of multicast IPv6 and frame addresses in the Neighbor Solicitation request and the way the information is supplied in the unicast Neighbor Advertisement reply.
[Figure: wincli1 (fe80::20e:cff:fe3b:883c) sends a multicast Neighbor Solicitation for the target address fe80::20e:cff:fe3b:8736; winsvr1 replies with a unicast Neighbor Advertisement giving the target's MAC address, 00:0e:0c:3b:87:36.]
FIGURE 5.9
IPv6 neighbor discovery and address resolution, showing how the request uses multicast frame
and packet addresses.
If no response is received, the sender can generate the Neighbor Solicitation
message several times. When a Neighbor Advertisement message is received by the
sender, the content is used to update the IPv6 Neighbor cache (the equivalent of the
IPv4 ARP cache).
More details on ND message formats and operation are discussed in the ICMP
chapter.
QUESTIONS FOR READERS
Figure 5.10 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
[Figure: two Ethernet LANs joined by a bridge form one broadcast domain; a router connects it to another broadcast domain. Each host has a 32-bit address at the IP layer and a 48-bit address at the MAC layer. Nontarget destinations parse, but ignore, broadcast ARP messages.]
FIGURE 5.10
ARP messages are used to coordinate IP addresses with lower layer addressing.
1. Why can’t the same address structure and value be used for network layer and
hardware addresses?
2. Why do ARPs have to pass through bridges, but should not pass through
routers?
3. Why does a receiver place the sender’s MAC address in its own ARP cache?
4. What is Proxy ARP used for?
5. What is the advantage of using multicast groups instead of broadcasts for address
resolution?
CHAPTER 6
IPv4 and IPv6 Headers
What You Will Learn
In this chapter, you will learn about the IP layer. We’ll start with the fields in the
IPv4 and IPv6 packet headers. We’ll discuss most of the fields in detail and show
how many of them relate to each other.
You will learn about fragmentation, and how large content is broken up, spread
across a sequence of many packets, and reassembled at the destination. We’ll also
talk about some of the perceived hazards of this fragmentation process.
Thus far, we’ve created a network of hosts and routers, linked them with a variety of
architectures and link types (LANs and WANs), and discussed the frame formats and
methods used to distribute packets among the nodes. We’ve considered the IPv4 and
IPv6 address formats, and the ways that they map to lower, link layer addresses. Now
it’s time to concentrate on the IP layer itself.
Even casual users of the TCP/IP protocol suite are familiar with the basic IP packet,
or, as it was initially called (and still often is) the datagram. An IP datagram or packet
is the connectionless IP network-layer protocol data unit (PDU). When TCP/IP came
along, packets were often associated with connection-oriented data networks such
as X.25, the international packet data network standard. To emphasize the connectionless nature of IP, then a radical approach to network layer operation, the TCP/IP
developers decided to invent a new term for the IP packet. Through analogy with the
telegram (a terse message sent hop by hop through a network of point-to-point links),
they came up with the term “datagram.”
The IP layer of the whole TCP/IP protocol stack is the very heart of TCP/IP. The
frames that are sent and delivered across the network from host to router and router
to host contain IP packets. However, like almost all statements about nearly any network protocol, there are exceptions to the general “frames contain IP packets” rule. As
shown in the last chapter, an important class of IP layer protocols known as the Address
Resolution Protocols (ARPs) does not technically use IP packets, but ARP messages
are very close in structure to IP packets. Also, the Internet Control Message Protocol
(ICMP) uses IP packets and is included in the IP layer. We’ll look at ICMP in the next
chapter.
[Figure: the Illustrated Network. LAN1 (Los Angeles office) hosts bsdclient, lnxserver, wincli1, and winsvr1 connect through an Ethernet LAN switch to customer-edge router CE0 and the Ace ISP core (AS 65459); LAN2 (New York office) hosts bsdserver, lnxclient, winsvr2, and wincli2 connect through their switch to customer-edge router CE6 and the Best ISP core (AS 65127), with the two ISPs joined across the global public Internet.]
FIGURE 6.1
The LANs on the Illustrated Network use both IPv4 and IPv6 packets. We'll be looking at the headers generated by the hosts on the LANs.
Both IPv4 and IPv6 packet structures will be detailed in this chapter. However, for
the sake of simplicity, whenever the term “IP” is used without qualification, “IPv4” is
implied.
PACKET HEADERS AND ADDRESSES
Let’s take a close look at the packets used on the Illustrated Network. We’ll look at the
IPv4 header and addresses first. We worked with the Windows clients and servers a
lot in the last few chapters, and we’ll work with them again in this chapter. But we’ll
also work with the Unix devices and tethereal captures in this chapter, especially for
fragmentation and IPv6. And, as we’ll soon see, one of the biggest differences between
IPv4 and IPv6 is how fragmentation is handled.
Fragmentation
People talk loosely about the pros and cons of “IP packet fragmentation,” but this
terminology is not correct. It is not the IP packet itself that is fragmented, but
the packet content. If the payload is too large to fit inside a single IP packet (as
determined by the IP layer implementation), the content is spread across several
packets, each with its own IP header.
In some cases, as we will see in this chapter, the content of an IP packet must
be further broken up to traverse the next link on the network. However, it’s not
really the IP packet that is fragmented. The original packet is discarded, and a
string of IP packets is created that preserves the packet content and overall header
fields, but changes specifics. When we say that “the packet is the data unit that
flows end-to-end through the network,” it is not the packet that is unchanged, but
the content.
Naturally, if packet content is kept small enough, no fragmentation is necessary.
Figure 6.1 shows the parts of the Illustrated Network that we’ll be using for our
investigation of IP headers and fragmentation. The LAN clients and servers are highlighted, as are the local customer-edge routers.
Let’s start with IPv4. We can just start a flow of IPv4 packets between a client and
server and capture them. Then we can parse the packets until we find something of
interest.
Let’s take a good look at all the fields in an IPv4 packet header. We’ve already captured
plenty of them. This example is from the FTP transfer from host (wincli2, with address
10.10.12.222) to router (CE6, with address 10.10.12.1) that we first saw in Chapter 2.
Figure 6.2 shows a frame from the actual data transfer itself, frame 35, in fact.
The Ethernet frame is of type 0x0800 to show it carries an IPv4 packet. All of the lines
from "Internet Protocol" to the line before "Transmission Control Protocol" interpret fields in the IPv4 header.
FIGURE 6.2
Capture of IPv4 header fields. The frame is broken out to show the content and meaning of every field in the IPv4 header. Note that the DF (Don't Fragment) bit is set on the packet.
The source and destination addresses are listed first. Although
we’ll see that they are not the first fields in the header, they are definitely the fields that
are most frequently of interest.
Ethereal interprets the IPv4 Type of Service (TOS) field according to something called Differentiated Services (DiffServ). DiffServ is only one way to interpret this field. The figure shows that there are three things indicated by
the 8 bits in the TOS field:
Differentiated Services Code Point (DSCP)—The default is zero, which means this
packet does not require special handling by any router or host other than IP’s
normal best-effort service.
Explicit-Congestion-Notification Capable Transport (ECT)—This bit is set by
devices when the transport is able to provide an indication of network congestion to network-attached devices. The value of zero shows that this sender is not using an ECN-capable transport, so the packet cannot be marked to tell devices when the network is congested.
ECN Congestion Experienced (ECN-CE)—On transports that can report congestion, this bit is set when some predefined criterion for network congestion is
met. This is often a percentage of output buffer fullness. On Ethernet this bit
is always zero.
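The split of the old TOS byte into these subfields is easy to express in code. Here is a small Python sketch (not part of the capture) that decodes the byte the way Ethereal displays it in Figure 6.2:

def decode_tos(tos_byte):
    # Upper 6 bits: Differentiated Services Code Point; lower 2 bits: the ECN bits
    dscp = tos_byte >> 2
    ect = (tos_byte >> 1) & 0x01    # ECN-Capable Transport bit
    ce = tos_byte & 0x01            # Congestion Experienced bit
    return dscp, ect, ce

print(decode_tos(0x00))   # (0, 0, 0): best-effort, no ECN, as in Figure 6.2
print(decode_tos(0xB8))   # (46, 0, 0): the common Expedited Forwarding code point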
We’ll say a little more about DSCP and quality of service (QOS) in a later chapter.
However, the incomplete support for and variations in QOS implementations rule out
QOS or DSCP as a topic for an entire chapter.
There are also four flag bits shown in the figure. The two most important are the
bits that indicate this packet content is not to be fragmented (the DF bit is set to 1)
and that there are no more frames carrying pieces of this packet’s payload (the More
Fragments bit is set to 0).
In the following sections, we explore all of the fields in the IPv4 header and then look at IPv4 fragmentation in more detail.
THE IPv4 PACKET HEADER
The general structure of the IPv4 packet is shown in Figure 6.3. The minimum header
(using no options, the most common situation) has a length of 20 bytes (always shown
in a 4-bytes-per-line format), and a maximum length (very rarely seen) of 60 bytes. Some
of the fields are fairly self-explanatory, such as the fields for the 4-byte (32-bit) IPv4
source and destination address, but others have specialized purposes.
[Figure: IPv4 packet layout, 32 bits per row. The header holds Version, Header Length, Type of Service, Total Packet Length, Identification, Flags, Fragment Offset, Time to Live, Protocol, Header Checksum, the 32-bit IPv4 Source Address, the 32-bit IPv4 Destination Address, and Options (padded if needed), followed by the DATA.]
FIGURE 6.3
IPv4 Packet and Header
Version—Currently set to 0x04 for IPv4.
Header Length—Technically, this is the Internet header length (IHL). It is the
length of the IP header in 4-byte (32-bit) units known as “words,” and includes
any option fields present and padding needed to align the header on a 32-bit
boundary. In Figure 6.2, this is 20 bytes, which is most common.
Type of Service (TOS)—Contains parameters that affect how the packet is handled
by routers and other equipment. Never widely used, it was redefined as Differentiated Services (DiffServ or DS) code points and is still hampered because
of a lack of widespread implementation, especially from one routing domain
to another. The meaning of these bits, which are all set to 0 in Figure 6.2, was
detailed earlier in this chapter.
The next four fields, shown in italics in Figure 6.3, figure directly in the fragmentation process. Fragmentation, introduced in Chapter 4, occurs when a packet is forwarded onto a data link and the packet content will not fit inside a single frame. In
these cases, the packet content must be fragmented and spread across several frames,
then reassembled at the destination host. Fragmentation will be discussed in detail in
the next section of this chapter.
Total Packet Length—This is the length of the whole packet in bytes. The maximum value for this 2-byte field is 65,535 bytes, a length approached by no common TCP/IP implementation or network MTU size. The packet in
Figure 6.2 is 1500 bytes long, the most common length due to the prevalence
of Ethernet LANs.
Identification—A 16-bit number set for each packet to help the destination host
reassemble like-numbered fragments. Even intact, single packets could be fragmented by routers (sometimes repeatedly) on their way to a destination, so
this field must be filled in. This field is set to 0x78be (30910) in Figure 6.2.
Flags—Only the first 3 bits of this field are defined. Bit 1 is reserved and must
be set to 0. Bit 2 (DF) is set to 0 if fragmentation is allowed or 1 if fragmentation is not allowed. Bit 3 (MF) is set to 0 if the packet is the last fragment,
or 1 if there are more fragments to come. Note that the MF field does not
imply any sequencing of the arriving fragments, nor does it guarantee that
the set is complete. Other fields are examined to determine sequencing and
completeness. The packet in Figure 6.2 will generate an error when it encounters a device that wants to fragment the packet content.
Fragment Offset—When a packet is fragmented, the fragments must fall on an 8-byte
boundary. That is, an 800-byte packet can be fragmented into two packets of 400 bytes
each, but not as eight packets of 100 bytes each, since 100 is not evenly divisible by
8. This field contains the number of 8-byte units, or blocks, in the packet fragment. The
offset is 0 in Figure 6.2.
The rest of the IP header fields do not deal with fragmentation.
Time to Live (TTL)—This 8-bit field value is supposed to be the number of seconds,
up to 255 maximum, that a packet can take to reach the destination. Each
router is supposed to decrement this field by a preconfigured amount which
must be greater than 0. If a packet arriving at a router has this field set to 0, it
is discarded and never routed. Unfortunately, there is no standard way to track
time across a group of routers, so most TCP/IP networks interpret this field as
a simple hop count between routers and simply decrement this field by 1. The
TTL in Figure 6.2 is 128, a fairly typical value.
Protocol—This 8-bit field contains the number of the transport-layer protocol that
is to receive and process the data content of the packet. The protocol number
for TCP is 6 and UDP is 17, but almost 200 have been defined. The packet in
Figure 6.2 carries TCP.
Header Checksum—An error-detection field for the IP header only, not the packet
data fields. If the computed checksum does not match at the receiver, the
header is damaged and not routed. Figure 6.2 not only shows the header
checksum of 0x4f6b, but Ethereal tells us that it is correct.
Source and Destination Addresses—The 32-bit IPv4 addresses of the source
and destination hosts. The packet in Figure 6.2 is sent from 10.10.12.222 to
10.10.12.1.
Options—The IPv4 options are seldom used today for data transfer and will not
be described further, nor do they appear in Figure 6.2.
Padding—When options are used, the padding field makes sure the header ends
on a 32-bit boundary. That is, the header must be an integer number of 4-byte
"words." The header in Figure 6.2 is not padded, and few are, since option use is unusual.
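Pulling these fields out of a raw header is a matter of a few struct operations. The following Python sketch (an illustration, not a tool used on the Illustrated Network) unpacks the fixed 20 bytes and verifies the header checksum:

import struct

def parse_ipv4_header(raw):
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,            # IHL counted in 4-byte words
        "tos": tos,
        "total_length": total_len,
        "identification": ident,
        "df": bool(flags_frag & 0x4000),
        "mf": bool(flags_frag & 0x2000),
        "fragment_offset": (flags_frag & 0x1FFF) * 8,  # stored as 8-byte blocks
        "ttl": ttl,
        "protocol": proto,                             # 6 = TCP, 17 = UDP
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }

def header_checksum_ok(raw):
    # One's-complement sum of the ten 16-bit header words (checksum included)
    # must come out to 0xFFFF for an undamaged header.
    total = sum(struct.unpack("!10H", raw[:20]))
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF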
FRAGMENTATION AND IPv4
Let’s look at IPv4 fragmentation on the Illustrated Network. We can determine how the
MTU size and fragmentation affect IPv4 data transfer rates.
It’s not all that important (and not all that interesting) to show the fragmentation
process with a capture. Moreover, it is difficult to convey a sense of what’s going on
with a series of snapshots, even when Ethereal parses the fragmentation fields. Appreciating the effects of a small MTU size on data transfers is more important.
Let’s use the bsdclient on LAN1 and bsdserver on LAN2 to show what fragmentation does to data throughput. We’ll use FTP to transfer a small file (about 30,000 bytes)
called test.stuff from the server to the client. Why so small a file? Just to show that
if fragmentation plays a role in small transfers, the effects will be magnified with larger
files. First, we’ll use the default MTU sizes.
bsdclient# ftp 10.10.12.77
Connected to 10.10.12.77.
220 bsdserver FTP server (Version 6.00LS) ready.
Name (10.10.12.77:admin): admin
331 Password required for admin.
Password:
230 User admin logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> get test.stuff
local: test.stuff remote: test.stuff
150 Opening BINARY mode data connection for 'test.stuff' (29752 bytes).
100%
|***************************************************************************
***********************| 29752 00:00 ETA
226 Transfer complete.
29752 bytes received in 0.01 seconds (4.55 MB/s)
This is about 4.5 MBps (or about 36 Mbps) and a transfer time of about 1/100th of a second. Not too bad. (Keep in mind that 1/100th of a second is about the smallest interval that can be reported without special hardware.) This is good throughput,
but remember there are only two routers involved, connected by a SONET link at
155 Mbps and the LAN runs at 100 Mbps. There is also no other traffic on the network,
so the transfer rate is totally dependent on the ability of the host to fill the pipe from
server to client.
Now let's change the Maximum Transmission Unit size at the server connected to LAN2 (the server LAN) from the default of 1500 to 256 bytes. How much of a difference will this make?
ftp> get test.stuff
local: test.stuff remote: test.stuff
150 Opening BINARY mode data connection for 'test.stuff' (29752 bytes).
100%
|***************************************************************************
***********************| 29752 00:00 ETA
226 Transfer complete.
29752 bytes received in 1.30 seconds (22.29 KB/s)
ftp>
The transfer time is up to 1.3 seconds, about 130 times longer than before! And the transfer rate fell from about 36 Mbps to about 184 KILOBITS per second, more than two orders of magnitude less than before. This is the "performance penalty" of fragmentation. (It
should be pointed out that these numbers are not precise, and there are many other
reasons that file transfers speed up or slow down. However, the point is entirely
valid.)
We can view a lot of packet statistics, including fragment statistics, using the
netstat utility. With netstat, we can monitor an interface in real time, display the
host routing table, observe running network processes, and so on. We’ll do more with
netstat later. For now, we’ll just see how many fragments our 30,000-byte file transfer
has generated.
To do this, we’ll look at the IP statistics on the client before and after the file transfer
has been run with the small MTU size. We’ll set the counters to zero first.
bsdclient# netstat -sp ip
ip:
0 total packets received
0 bad header checksums
0 with size smaller than minimum
0 with data size < data length
0 with ip length > max ip packet size
0 with header length < data size
0 with data length < header length
0 with bad options
0 with incorrect version number
0 fragments received
0 fragments dropped (dup or out of space)
0 fragments dropped after timeout
0 packets reassembled ok
[many more lines deleted for clarity...]
Now we’ll reset the counters, run the transfer again, and check the IP statistics.
bsdclient# netstat -sp ip
ip:
57 total packets received
0 bad header checksums
0 with size smaller than minimum
0 with data size < data length
0 with ip length > max ip packet size
0 with header length < data size
0 with data length < header length
0 with bad options
0 with incorrect version number
171 fragments received
0 fragments dropped (dup or out of space)
0 fragments dropped after timeout
57 packets reassembled ok
[many more lines deleted for clarity...]
The file was transferred as 171 fragments that were reassembled into 57 packets. Let's take a closer look at fragmentation and the MTU size in IPv4.
Fragmentation and MTU
If an IP packet is too large to fit into the frame for the outgoing link, the packet content
must be fragmented to fit into multiple “transmission units.” The Maximum Transmission Unit (MTU) size is a key concept in all TCP/IP networks, often complicated by the
fact that different types of links (LAN or WAN) have very different MTU sizes. Many of
these are shown in Table 6.1. The link protocols shown in italics have “tunable” (configurable) MTU sizes instead of defined defaults, but almost all interfaces allow you to
lower the MTU size. The figures shown are the usual maximums. The 9000-byte packet
size is not standard in Gigabit Ethernet, but common.
Hosts reassemble any arriving fragmented packets to avoid routers pasting together
and then tearing apart packets repeatedly as they are forwarded from link to link. Fragments themselves can even be fragmented further as a packet makes its way from, for
example, Gigabit Ethernet to frame relay to Ethernet.
Fragmentation is something that all network administrators used to try to avoid. As
a famous paper circulated in 1987 asserted bluntly, “Fragmentation [is] considered
harmful." As recently as 2004, an Internet draft (http://ietfreport.isoc.org/all-ids/draft-mathis-frag-harmful-00.txt) took this one step further with the title, "Fragmentation Considered Very Harmful." The paper asserts that most of the harm occurs when a fragment of packet content, especially the first, is lost on the network. And a number of
older network attacks involved sending long sequences of fragments to targets, never
finishing the sequence, until the host or router ran out of buffer space and crashed. Also, because of the widespread use of tunnels (see Chapter 26), there are link layers that really need an MTU larger than 1500 to support encapsulation, and you can't fragment the packets inside a tunnel.

Table 6.1 Typical MTU Sizes*

Link Protocol              Typical MTU Limit    Maximum IP Packet
Ethernet                   1518                 1500
IEEE 802.3                 1518                 1492
Gigabit Ethernet           9018                 9000
IEEE 802.4                 8191                 8166
IEEE 802.5 (Token Ring)    4508                 4464
FDDI                       4500                 4352
SMDS/ATM                   9196                 9180
Frame relay                4096                 4091
SDLC                       2048                 2046

*Frame overhead accounts for the differences between the theoretic limit and maximum IP packet size.
There are several reasons for the quest to determine the smallest of the MTU sizes
on the links between source and destination. This “minimum” MTU size can be used
between a source and destination in order to avoid fragmentation. The main reasons
today follow:
■ Fragmentation is processor intensive. Early routers were hard pressed to both route and fragment. Even today, high link speeds force routers to concentrate on routing and minimize "housekeeping" tasks.
■ Many hosts struggle to reassemble fragments. Fragmentation puts the reassembly burden on the receiving host, which can be a cell phone, watch, or something else. This requires processing power and delays the processing of the packet.
■ Fragmentation fields are favorite targets for hacking. TCP/IP implementation behaviors are not spelled out in detail for many situations where the fragmentation fields are set to inconsistent or contradictory values. Many a host and router have been hung by exploiting this variable behavior.
■ Fragments can be lost, out-of-sequence, or errored. The more pieces there are, the more things that can go wrong. The worst occurs when the first fragment is lost on the network.
■ Early IP implementations avoided fragmentation by setting the default IP packet size very low, to only 576 bytes. All link protocols then in common use could handle this small packet size, and many IP implementations to this day still use this default packet size. Naturally, the smaller the MTU size, the greater the number of packets sent for a given message, and the greater the chances something can go wrong.
Fragmentation behavior changes in IPv6. In IPv6, routers do not perform fragmentation.
Fragmentation and Reassembly
The point has already been made that fragmentation is a processor-intensive
operation. Naturally, if all hosts sending packets were aware of the minimum MTU size
on a path from source to destination before sending an IP packet, the problem would
be solved. There are ways to determine the path MTU size.
Path MTU Determination
The commonly used method to determine this path MTU is slow, but it works. The
method involves “testing” the path to the destination before sending “live” packets to
a destination system where the path MTU is not known. The source system sends out
an echo packet. (The echo service just bounces back the content of the packet to the
sender.) The echo packet is usually the MTU size of the source system’s own TCP/IP
network, which could be 1500 bytes for Ethernet, 4500 for Token Ring, and so on. This
packet has the DF bit set in the Flags field in the IPv4 header. If the echo packet comes
back successfully, then the MTU size is fine and can be used for “live” data.
However, if the current path through the routers includes a smaller MTU size on a
link or network that the packet must traverse as the packet makes its way to the destination, the router attached to this smaller MTU size network must discard the packet,
since the DF bit is set. The router sends an ICMP error message back to the source
indicating the error condition, which is that the packet was discarded because the DF
bit was set. The source can then adjust the packet size downward and try again. This
process can be repeated several times, trying to find the optimal path MTU.
This path MTU determination method works, but it is awkward and slow. The live
data basically wait until the path MTU size is determined for a destination. And because
each packet is independently routed, if there are multiple paths through the router
network (and there usually are, this being the whole point of using routers), the MTU
size may change with every possible path that an IP packet can take from the source to
the destination. However, this method is better than nothing.
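The logic of this probe-and-shrink procedure looks something like the following Python sketch. The send_df_probe() routine is a stand-in (an assumption of this sketch, not a real utility): it would send an echo of the given size with DF set and return either "ok" or the smaller MTU reported in the router's ICMP error.

COMMON_MTUS = [9000, 4352, 1500, 1492, 1280, 576]   # plateaus often tried in practice

def find_path_mtu(send_df_probe, start_size=1500):
    size = start_size
    while True:
        result = send_df_probe(size)          # "ok", or the next-hop MTU from the ICMP error
        if result == "ok" or size <= 576:
            return size                       # 576 is the classic "always safe" floor
        if isinstance(result, int) and result < size:
            size = result                     # the router told us the smaller MTU
        else:
            size = max(m for m in COMMON_MTUS if m < size)   # fall back to a smaller plateau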
A FRAGMENTATION EXAMPLE
Figure 6.4 shows a router on a TCP/IP network. The arriving IP packet is coming from a
WAN link with a configured MTU size of 4500 bytes. The destination system is attached
to the router by means of an Ethernet LAN, which has an MTU size of 1500 bytes.
[Figure: the packet arrives from the WAN link (4500-byte MTU) with Total Packet Length 4488, Identification 0x03E4, Flags LAST, Fragment Offset 0. Forwarded onto the Ethernet LAN (1500-byte MTU), it becomes three fragments, all with Identification 0x03E4: fragment 1 (Flags MORE, offset 0), fragment 2 (MORE, offset 187), and fragment 3 (LAST, offset 374), the offsets counted in 8-byte blocks from the start (187 blocks = 1496 bytes).]
FIGURE 6.4
An IPv4 fragmentation example, showing the various header field values for each of the three
fragments loaded into the frames.
Obviously, the 4488-byte packet must be fragmented across three Ethernet frames to reach the destination host.
Figure 6.4 shows the portions of the IP packet data and the values of the fragmentation fields for each fragment. The figure also shows how the destination system
interprets the fragmentation fields to reassemble the entire packet at the destination.
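The arithmetic behind the figure can be reproduced with a short Python sketch (not from the book's lab), using the figure's fragment size of 187 eight-byte blocks (1496 data bytes) per fragment:

def fragment_offsets(data_len, frag_data_len):
    # Return (offset_in_8_byte_blocks, data_bytes, more_fragments) for each fragment.
    assert frag_data_len % 8 == 0, "every fragment except the last must carry a multiple of 8 bytes"
    frags, offset = [], 0
    while offset < data_len:
        chunk = min(frag_data_len, data_len - offset)
        frags.append((offset // 8, chunk, offset + chunk < data_len))
        offset += chunk
    return frags

# Figure 6.4: a 4488-byte packet has 4468 data bytes after its 20-byte header
for blocks, size, more in fragment_offsets(4488 - 20, 187 * 8):
    print(blocks, size, "MORE" if more else "LAST")
# 0 1496 MORE, 187 1496 MORE, 374 1476 LAST -- the offsets shown in the figure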
We’ve already looked at the problems with fragmentations from the router and
network perspective. From the perspective of the receiving host, there are two main
reasons that fragmentation should be avoided. One is the need to wait for undelivered
fragments, and the other is the lack of knowledge on the part of a destination of the
reassembled datagram size. Let’s look at the destination host reassembly process to
explore the “performance penalty” that fragmentation involves.
A fragmented packet is always reassembled at the destination host and never by
routers. (Why put together packets that might require fragmentation all over again?)
However, because all packets are independently routed, the pieces of a packet can
arrive out of sequence. When the first fragment arrives, local buffer memory is allocated for the reassembly process. The Fragment Offset of the arriving packet indicates
exactly where in the sequence the newly arrived fragment should be placed.
At a busy destination, such as a Web server, many different packets from several
sources can arrive in fragments. All of these pieces can be subjected to the reassembly
process at the same time. The destination host IP layer software will associate packets
having matching Identification, Source, Destination, and Protocol fields as belonging to
the same packet.
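A toy version of that bookkeeping, keyed exactly as just described, might look like the following Python sketch (an illustration only; offsets are given in bytes here):

reassembly = {}   # keyed by (source, destination, identification, protocol)

def add_fragment(src, dst, ident, proto, offset, data, more_fragments):
    entry = reassembly.setdefault((src, dst, ident, proto), {"pieces": {}, "total": None})
    entry["pieces"][offset] = data
    if not more_fragments:                    # only the last fragment reveals the full size
        entry["total"] = offset + len(data)
    received = sum(len(d) for d in entry["pieces"].values())
    if entry["total"] is not None and received == entry["total"]:
        whole = b"".join(entry["pieces"][o] for o in sorted(entry["pieces"]))
        del reassembly[(src, dst, ident, proto)]
        return whole                          # complete datagram handed to the transport layer
    return None                               # still waiting (a real stack would also run a timer)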
However, the Total Length field in a packet fragment’s header only indicates the
length of that particular fragment, not the entire packet before fragmentation. It is only
when the destination system receives the last fragment that the total length of the
original packet can be determined.
If a packet is partially reassembled and the final piece to complete the set has not
arrived, IP includes a tunable reassembly time-out parameter. If the reassembly timer
expires, the remaining packet fragments are discarded. If the final piece of the packet
arrives after the time-out, this packet fragment must be discarded as well.
This description of the reassembly process shows the twin problems of memory allocation woes from packet size uncertainties and delays due to the reassembly time-out.
Arriving IP packets have no way to inform the destination system that “I am the first
of 10 fragments.” If so, it would be easy for the destination system to allocate memory
for reassembly that was the best-fit for remaining contiguous buffer space. But all packet
fragments can indicate is “I am the first of many,” “I am the second of many,” and so
on, until one finally says, “I am the last of many.” This uncertainty of reassembled size
makes many TCP/IP implementations allocate as large a block of memory as available
for reassembly. Obviously, a fragmented packet may have been quite large to begin with,
because it was fragmented in the first place. But the net result is that local buffers
become quite fragmented. And if smaller blocks of memory are allocated, the resulting
non-contiguous pieces must be moved to an adequately sized memory buffer before the
transport layer can process the reassembled datagram.
The reassembly time-out must have a value low enough to keep the recovery delay at the transport layer reasonable. The transport layer contains session
(connection) information that will detect a missing packet in a sequence of segments
(the contents of the packets), and TCP always requests missing information to be
resent. Too long a value for the reassembly timer makes this retransmission process
very inefficient. Too short a value leads to needlessly discarded packets. In most TCP/
IP implementations, the reassembly timer is set by the software vendor and cannot be
changed. This is yet another reason to avoid fragmentation.
Reassembly “deadlock” used to be a problem as well. When memory was a scarce
commodity in hosts, all available local buffer memory could end up holding partially
assembled fragments. An arriving fragment could not be accepted even if it completed
a set and the system eventually hung. However, in these days of cheap and plentiful
memory, this rarely happens.
Limitations of IPv4
The limitations of IPv4 are often cast solely in terms of address space. As important as
that is, it is only part of the story. Address space is not the only IPv4 limitation. Some
others follow:
■ The fragmentation fields are present in every IPv4 packet.
■ Fragmentation is always done with a performance penalty and is best avoided. Yet the fields involved—all 6 bytes worth and more than 25% of the basic 20-byte IPv4 header—must be present in each and every packet.
■ IPv4 Options were seldom used and limited in scope.
■ The IPv4 Type of Service field was never used as intended.
■ The IPv4 Time To Live field was also never used as intended.
■ The 8-bit IPv4 Protocol field limited IPv4 packet content to 256 possibilities.
All of these factors contributed to the structure of the IPv6 packet header.
The IPv6 Header Structure
Let's go back to our Unix devices and capture some IPv6 packets. Then we can examine those headers and compare them to IPv4 headers.
bsdserver# ping6 fc00:fe67:d4:b:205:85ff:fe8b:bcdb
PING6(56=40+8+8 bytes) fc00:fe67:d4:b:20e:cff:fe3b:8732 -->
fc00:fe67:d4:b:205:85ff:fe8b:bcdb
16 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=0 hlim=64
time=16.027 ms
16 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=1 hlim=64
time=0.538 ms
16 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=2 hlim=64
time=0.655 ms
16 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=3 hlim=64
time=0.622 ms
^C
--- fc00:fe67:d4:b:205:85ff:fe8b:bcdb ping6 statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/std-dev = 0.538/4.461/16.027/6.678 ms
Here is the first packet we captured:
bsdserver# tethereal -V
Capturing on em0
Frame 1 (70 bytes on wire, 70 bytes captured)
Arrival Time: May 23, 2008 18:39:58.914560000
Time delta from previous packet: 0.000000000 seconds
Time since reference or first frame: 0.000000000 seconds
Frame Number: 1
Packet Length: 70 bytes
Capture Length: 70 bytes
Ethernet II, Src: 00:0e:0c:3b:87:32, Dst: 00:05:85:8b:bc:db
Destination: 00:05:85:8b:bc:db (JuniperN_8b:bc:db)
Source: 00:0e:0c:3b:87:32 (Intel_3b:87:32)
Type: IPv6 (0x86dd)
Internet Protocol Version 6
Version: 6
Traffic class: 0x00
Flowlabel: 0x00000
Payload length: 16
Next header: ICMPv6 (0x3a)
Hop limit: 64
Source address: fc00:fe67:d4:b:20e:cff:fe3b:8732 (fc00:fe67:d4:b:20e:
cff:fe3b:8732)
Destination address: fc00:fe67:d4:b:205:85ff:fe8b:bcdb (fc00:fe67:d4:
b:205:85ff:fe8b:bcdb)
Internet Control Message Protocol v6
Type: 128 (Echo request)
Code: 0
Checksum: 0x7366 (correct)
ID: 0x0565
Sequence: 0x0000
Data (8 bytes)
0000 6e b9 73 44 43 f4 0d 00
n.sDC...
In contrast to the IPv4 header, there are only eight lines (and eight fields) in the IPv6
header. Since the packet is simple enough, let’s look at the header fields in detail as we
examine the meaning and values in this IPv6 packet.
The IPv6 header is shown in Figure 6.5. Besides the new expanded, 16-byte IP source
and destination addresses, there are only six other fields in the entire IPv6 header. This
simpler header structure makes for faster packet processing in most cases.
[Figure: IPv6 packet header layout, 32 bits per row. The fixed 40-byte header holds Version, Traffic Class, Flow Label, Payload Length, Next Header, Hop Limit, the 128-bit IPv6 Source Address, and the 128-bit IPv6 Destination Address.]
FIGURE 6.5
The IPv6 header fields. Note the reduction in the number of fields and how the address fields occupy most of the header.
IPv6 packets have their own frame Ethertype value, 0x86dd, making it easy for
receivers that must handle both IPv4 and IPv6 on the same interface to distinguish the
frame content.
Version—A 4-bit field for the IP version number (0x06).
Traffic Class—An 8-bit field that identifies the major class of the packet content
(e.g., voice or video packets). Our capture shows this field as the default at 0,
meaning that it is ordinary bulk data (as FTP should carry) and requires no
special handling at devices.
Flow Label—A 20-bit field used to label packets belonging to the same flow
(those with the same values in several TCP/IP header parameters). The flow
label here is 0, but this is common.
Payload Length—A 16-bit field giving the length of the packet in bytes, excluding the
IPv6 header. The payload of this packet, an ICMP message, is 16 bytes long.
Next Header—An 8-bit field giving the type of header immediately following the IPv6 header (this serves the same function as the Protocol field in IPv4). This packet carries an ICMPv6 message, so the value is 0x3a.
Hop Limit—An 8-bit field set by the source host and decremented by 1 at each
router. Packets are discarded if the hop limit is decremented to zero (this
replaces the IPv4 Time To Live field). The hop limit here is 64, half of the FTP
value in our IPv4 example. Generally, implementers choose the default to use,
but values such as 64 or 128 are common.
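Unpacking the fixed 40-byte header is even simpler than the IPv4 case. Here is a minimal Python sketch (an illustration, not something run on the lab network):

import struct, ipaddress

def parse_ipv6_header(raw):
    ver_tc_flow, payload_len, next_header, hop_limit = struct.unpack("!IHBB", raw[:8])
    return {
        "version": ver_tc_flow >> 28,
        "traffic_class": (ver_tc_flow >> 20) & 0xFF,
        "flow_label": ver_tc_flow & 0xFFFFF,
        "payload_length": payload_len,        # excludes the 40-byte header itself
        "next_header": next_header,           # 0x3a = ICMPv6, 0x2c = Fragment header
        "hop_limit": hop_limit,
        "src": str(ipaddress.IPv6Address(raw[8:24])),
        "dst": str(ipaddress.IPv6Address(raw[24:40])),
    }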
IPv4 AND IPv6 HEADERS COMPARED
Figure 6.6 shows the fields in the IPv4 packet header compared to the fields in the
IPv6 header.
[Figure: the IPv4 header fields (Version, Header Length, Type of Service, Total Packet Length, Identification, Flags, Fragment Offset, Time to Live, Protocol, Header Checksum, 32-bit Source and Destination Addresses, Options/padding) shown side by side with the IPv6 header fields (Version, Traffic Class, Flow Label, Payload Length, Next Header, Hop Limit, 128-bit Source and Destination Addresses). A key marks field names kept from IPv4 to IPv6, fields not kept in IPv6, fields whose name and position changed in IPv6, and new fields in IPv6.]
FIGURE 6.6
IPv4 and IPv6 headers compared, showing how the old fields and new fields relate to each
other.
IPv6 Header Changes
In summary, the following are some of the most important changes to the IP header in
IPv6.
■ Longer addresses (32 bits to 128 bits).
■ No fragmentation fields.
■ No header checksum field.
■ No header length field (there is a fixed-length header).
■ Payload length given in bytes, not in the 4-byte "words" used for the IPv4 header length.
■ Time to Live (TTL) field becomes Hop Limit.
■ Protocol field becomes Next Header (determines content format).
■ 64-bit alignment of the packet, not 32-bit alignment.
■ A Flow Label field has been added.
■ No Type of Service bits (which were seldom respected anyway).
Many of the IPv4 fields vanish completely, especially the fields used for packet fragmentation. IPv6 addresses fragmentation performance penalties and problems by forbidding it altogether in routers. Source hosts can still fragment, however, if the source host wants to send packets larger than the Path MTU size to a destination. In IPv6, as in IPv4, fragmentation issues can be avoided altogether by making all packets 1280 bytes long—the minimum established by RFC 2460—but this results in many "extra" packets.
The IPv4 header Checksum field is absent because destination host error checking
is the preferred method of error detection in today’s more reliable networks, and
almost all transmission frames provide better error detection than the IP layer. There
is no header length field because all IPv6 headers are the same length. The Payload Length field excludes the IPv6 header itself and is measured in bytes, rather than the awkward 4-byte "words" used for the IPv4 header length.
The TTL field, never interpreted as time anyway, is gone as well. In its place is the
Hop Limit field, a simple indication of the number of routers that a packet can pass
through before it should reach the destination host. The Protocol field of IPv4 has
become the Next Header field in IPv6. The term “next header” is more accurate
because the information inside the IPv6 packet is not necessarily a higher layer protocol (e.g.,TCP segment) in IPv6. There are many other possibilities.
The entire packet must be an integer number of 64-bit (8-byte) units. The 32-bit
unit for IPv4 was established when many high-performance computers were 32-bit
machines, meaning memory access and internal bus operations moved 32-bit units
(called a “word”) around. Today high-performance computers often support 64-bit
words. It only made sense to align the new IPv6 header for ease and speed of processing on the newer architecture computers.
Finally, in place of the ToS field in IPv4, the IPv6 header defines a Flow Label field. Flows
are used by routers to pick out IPv6 packets containing delay-sensitive data such as
voice, video, and multimedia. The Type of Service field was usually ignored by routers in IPv4, and other uses were not standardized.
■ The IPv6 specification includes a concept known as Extension Headers. Extension Headers essentially take the place of the Options in the IPv4 packet header. IPv6 Extension Headers are only present when necessary and are designed to be extensible (new functions may be defined in the future), but the term "extensible Extension Headers" is awkward.
■ The current Extension Headers include a Hop-by-Hop Options header, examined by every router handling the IPv6 packet, and an Authentication Header
for enhanced security on TCP/IP networks (these are used in IPv4 as part of
IPSec). There is also a Fragmentation header for the use of the source host when
there is no way to prevent the source from sending packets larger than the path
MTU size (IPv6 routers cannot fragment, but hosts can). Also, there used to be
a Routing Header specifying the IP addresses of the routers on the path from
source to destination (similar to "source routing" in token ring LANs), but this Type 0 Routing Header was deprecated by RFC 5095. There are several others, but these show the kinds of
capabilities included in the IPv6 Extension Headers.
IPv6 AND FRAGMENTATION
What would happen if we put IPv6 into a situation where it has to fragment packet
content to make it fit into a frame? Let’s use the Illustrated Network to find out. Two
useful ping parameters are the size of the packet to bounce off a remote device and
the count of packets sent. We'll capture the packets sent when bsdserver sends a 2000-byte packet (too large for an Ethernet frame) to the router.
bsdserver# ping6 -s 2000 -c 1 fc00:fe67:d4:b:205:85ff:fe8b:bcdb
PING6(2048=40+8+2000 bytes) fc00:fe67:d4:b:20e:cff:fe3b:8732 -->
fc00:fe67:d4:b:205:85ff:fe8b:bcdb
2008 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=0 hlim=64
time=2.035 ms
--- fc00:fe67:d4:b:205:85ff:fe8b:bcdb ping6 statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/std-dev = 2.035/2.035/2.035/0.000 ms
bsdserver#
This makes 2008 bytes with the IPv6 header. Here’s what we have (the data fields,
which contain test strings, have been omitted):
bsdserver# tethereal -V
Capturing on em0
Frame 1 (1510 bytes on wire, 1510 bytes captured)
Arrival Time: May 25, 2008 08:39:21.231993000
Time delta from previous packet: 0.000000000 seconds
Time since reference or first frame: 0.000000000 seconds
Frame Number: 1
Packet Length: 1510 bytes
Capture Length: 1510 bytes
Ethernet II, Src: 00:0e:0c:3b:87:32, Dst: 00:05:85:8b:bc:db
Destination: 00:05:85:8b:bc:db (JuniperN_8b:bc:db)
Source: 00:0e:0c:3b:87:32 (Intel_3b:87:32)
Type: IPv6 (0x86dd)
Internet Protocol Version 6
Version: 6
Traffic class: 0x00
Flowlabel: 0x00000
Payload length: 1456
Next header: IPv6 fragment (0x2c)
Hop limit: 64
Source address: fc00:fe67:d4:b:20e:cff:fe3b:8732 (fc00:fe67:d4:b:20e:
cff:fe3b:8732)
Destination address: fc00:fe67:d4:b:205:85ff:fe8b:bcdb (fc00:fe67:d4:
b:205:85ff:fe8b:bcdb)
Fragmentation Header
Next header: ICMPv6 (0x3a)
Offset: 0
More fragments: Yes
Identification: 0x000000e5
Internet Control Message Protocol v6
Type: 128 (Echo request)
Code: 0
Checksum: 0x74df
ID: 0x0e60
Sequence: 0x0000
Data (1440 bytes) (OMITTED)
Frame 2 (622 bytes on wire, 622 bytes captured)
Arrival Time: May 25, 2008 08:39:21.232007000
Time delta from previous packet: 0.000014000 seconds
Time since reference or first frame: 0.000014000 seconds
Frame Number: 2
Packet Length: 622 bytes
Capture Length: 622 bytes
Ethernet II, Src: 00:0e:0c:3b:87:32, Dst: 00:05:85:8b:bc:db
Destination: 00:05:85:8b:bc:db (JuniperN_8b:bc:db)
Source: 00:0e:0c:3b:87:32 (Intel_3b:87:32)
Type: IPv6 (0x86dd)
Internet Protocol Version 6
Version: 6
Traffic class: 0x00
Flowlabel: 0x00000
Payload length: 568
Next header: IPv6 fragment (0x2c)
Hop limit: 64
Source address: fc00:fe67:d4:b:20e:cff:fe3b:8732 (fc00:fe67:d4:
b:20e:cff:fe3b:8732)
Destination address: fc00:fe67:d4:b:205:85ff:fe8b:bcdb (fc00:fe67:
d4:b:205:85ff:fe8b:bcdb)
Fragmentation Header
Next header: ICMPv6 (0x3a)
Offset: 1448
More fragments: No
Identification: 0x000000e5
Data (560 bytes) (OMITTED)
(Frames 3 and 4, the echoed frames sent back in response, are mirror
images of Frames 1 and 2 and have been omitted for brevity.)
bsdserver#
Because the host cannot pack 2000 bytes into an Ethernet frame, the IPv6 host does
the fragmenting before it sends the bits onto the LAN. There are no fragmentation fields
in the IPv6 header, however, so IPv6 includes a second header (next header) that carries
the information needed for the destination to reassemble the fragments (in this case,
two of them). The important fields are highlighted in bold in the preceding code.
The first frame (the capture says “packet”) is 1510 bytes long, including 1456 bytes
of payload (given in the Payload Length field). The Next Header value of 0x2c indicates
that the next header is an IPv6 fragment header. The Fragmentation Header fields are
listed in full:
■ Next Header (0x3a)—The header following the Fragmentation Header is an ICMPv6 message header.
■ Offset (0)—This is the first fragment of a series.
■ More Fragments (Yes)—There are more fragments to come.
■ Identification (0x000000e5)—Only reassemble fragments that share this identifier value.
The data field in the ICMPv6 message is 1440 bytes long. The rest of the 1510 bytes are
from the various headers pasted onto these bytes.
Frame 2 holds the rest of the 2000 bytes in the ping. This frame is 622 bytes long
and carries 568 bytes of payload. The Next Header is still an IPv6 fragment (0x2c). The
Fragmentation Header fields follow:
■ Next Header (0x3a)—The header following the Fragmentation Header is an ICMPv6 message header.
■ Offset (1448)—These bytes start 1448 bytes after the content of the first frame. (The "extra" 8 bytes are for the ICMPv6 header.)
■ More Fragments (No)—The contents of this packet complete the series.
■ Identification (0x000000e5)—This fragment goes with the previous one with this identifier value.
The data field in the ICMPv6 message is 560 bytes long. Along with the 1440 bytes
in the first fragment, these add up to the 2000 bytes sent.
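The 8-byte Fragment extension header itself is easy to decode. Here is a small Python sketch (not from the capture session) that pulls out the fields listed above:

import struct

def parse_fragment_header(raw):
    next_header, _reserved, off_flags, ident = struct.unpack("!BBHI", raw[:8])
    return {
        "next_header": next_header,             # 0x3a = ICMPv6 in this capture
        "offset_bytes": (off_flags >> 3) * 8,   # the field stores 8-byte blocks
        "more_fragments": bool(off_flags & 0x1),
        "identification": ident,
    }

# Values matching the second fragment above: offset 1448 bytes, no more fragments, ID 0xe5
print(parse_fragment_header(struct.pack("!BBHI", 0x3a, 0, (1448 // 8) << 3, 0xE5)))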
QUESTIONS FOR READERS
Figure 6.7 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
[Figure: the IPv4 header fields (Version, Header Length, Type of Service, Total Packet Length, Identification, Flags, Fragment Offset, Time to Live, Protocol, Header Checksum, 32-bit Source and Destination Addresses, Options/padding) shown beside the IPv6 header fields (Version, Traffic Class, Flow Label, Payload Length, Next Header, Hop Limit, 128-bit Source and Destination Addresses).]
FIGURE 6.7
The IPv4 and IPv6 packet header fields. IPv6 can employ most IPv4 options as “next header”
fields following the basic header.
1. Why are diagnostics like ping messages routinely given high hop-count values
such as 64 or 128?
2. Without any IPv4 options in use, what value should be seen in the Header Length
field most of the time?
3. How does an IP receiver detect missing fragments?
4. Is there any way for an IP receiver to determine how many fragments are
supposed to arrive?
5. Since almost all the IPv4 header fields are options in IPv6, is it correct to say that
the IPv6 header is “simplified”?
CHAPTER 7
Internet Control Message Protocol
What You Will Learn
In this chapter, you will learn about ICMP messages, their types, and (in many
cases) the codes used in each type. We’ll look at which ICMP messages are routinely
blocked at firewalls and which are essential for proper device operation.
You will learn about the common ping utility for determining device accessibility
(“reachability”) on an IP network. We’ll discuss the mechanics of both ping and
traceroute, and use several ping examples to illustrate ICMP on the network.
The only function of the IP layer is to provide addressing for and route the IP packet.
That’s all. Once an IP packet has been dealt with, the IP layer just looks for the next
packet. But IP is a connectionless, “best effort,” or “unreliable” method of packet
delivery. The terms “best effort” and “unreliable” often make it sound like IP is casual
about the delivery of packets, which is why they are in quotes so that no one takes
them too literally. IP’s best effort is usually just fine, given the low error rates on modern
transports, and it is mostly unreliable with regard to a lack of guarantees, as has been
pointed out. Besides, there is nothing wrong with letting other layers, such as the TCP
segments or the Ethernet frames, have the major responsibility for error detection and
correction.
This is not to say that IP should be oblivious to errors. The network layer, in its ubiquitous and key position at the heart of the protocol stack, should know about packet
errors and is in a good position to let layers above know what’s going on (although IP
lets the upper layers decide what to do about the condition).
And there’s plenty that can still go wrong, and not just with regard to bit errors.
A packet might wander the router cloud until the TTL field hits zero. A destination
server might be down. A destination server might no longer exist. The “do not fragment” bit might forbid fragmentation when it is needed to send a packet, stopping the
routing process cold. In all of these situations, the sender should be informed of the
condition.
[Figure 7.1 diagram: the Illustrated Network: LAN1 in the Los Angeles office (bsdclient, lnxserver, wincli1, winsvr1, and customer-edge router CE0 at fe-1/3/0: 10.10.11.1) and LAN2 in the New York office (bsdserver, lnxclient, winsvr2, wincli2, and customer-edge router CE6 at fe-1/3/0: 10.10.12.1), connected through Ace ISP (AS 65459) and Best ISP (AS 65127) across the global public Internet.]
FIGURE 7.1
ICMP is used on all devices on the Illustrated Network, routers, and hosts. In this chapter, we'll work with the hosts on the LANs.
Without error condition feedback from the network, the natural response to an
unexpected result (in this case, the lack of a reply) is to simply repeat the original message. Sometimes this might work, especially if the condition is transient, but semipermanent or
permanent error conditions must be reported to the source. Otherwise, repetitive
sending might result in an endless error loop, and certainly adds unnecessary traffic
loads to the network.
This chapter explores aspects of IP’s built-in error reporting protocol, the Internet
Control Message Protocol (ICMP). Note that ICMP does not deal with “error messages,”
but “control messages,” a better term to cover all of the roles that have evolved for
ICMP. We’ll start by looking at one indispensable utility used on all TCP/IP network:
ping. We’ll be using the same LAN-based hosts as in the previous chapter, as shown in
Figure 7.1.
ICMP AND PING
The easiest way to look at ICMP on the Illustrated Network is with ping and traceroute.
Both utilities have been used before in this book, but because traceroute will be used
again in the chapters on routing, this chapter will use ICMP and ping.
The ping utility is just a way to “bounce” packets off a target device and see if it is
there—that is, it has the IP address that was provided, is powered on, and alive. The
device might still not function in the correct way (i.e., the router might not be routing properly), but at least the device is present and accounted for. It is routine to ping
a newly installed device, host, router, or anything else, just to see if it responds. If it
doesn’t, network administrators have a place to start troubleshooting.
Let’s use ping from the lnxclient to the bsdserver, both on LAN2 to start exploring
ICMP. Windows XP only sends four pings by default, but Unix systems will just keep
going until stopped with ^C (which is what was done here).
[[email protected] admin]# ping 10.10.12.77
PING 10.10.12.77 (10.10.12.77) 56(84) bytes of data.
64 bytes from 10.10.12.77: icmp_seq=1 ttl=64 time=0.549 ms
64 bytes from 10.10.12.77: icmp_seq=2 ttl=64 time=0.169 ms
64 bytes from 10.10.12.77: icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from 10.10.12.77: icmp_seq=4 ttl=64 time=0.187 ms
64 bytes from 10.10.12.77: icmp_seq=5 ttl=64 time=0.216 ms
^C
--- 10.10.12.77 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3996ms
rtt min/avg/max/mdev = 0.169/0.258/0.549/0.146 ms
[[email protected] admin]#
The output shows the ICMP sequence numbers and round-trip time (rtt) for the
group in terms of minimum, average, maximum, and even the maximum deviation from
the mean. We do not have DNS on the network, so we have to use IP addresses. Most
ping implementations will accept host names, and some (such as Cisco routers) will
even do a reverse DNS lookup when given an IP address and report the host name
in the result. This can be very helpful when an IP address is entered incorrectly or
assigned to a different device than anticipated.
We can look at the ICMP packets used with ping in more detail. Let’s use both LANs
this time, and ping from wincl1 (10.10.11.51) on LAN1 to wincli2 (10.10.12.222) on
LAN2. With XP, we won’t have to worry about stopping the sequence.
C:\Documents and Settings\Owner> ping 10.10.12.222
Pinging 10.10.12.222 with 32 bytes of data:
Reply from 10.10.12.222: bytes=32 time<1ms TTL=126
Reply from 10.10.12.222: bytes=32 time<1ms TTL=126
Reply from 10.10.12.222: bytes=32 time<1ms TTL=126
Reply from 10.10.12.222: bytes=32 time<1ms TTL=126
Ping statistics for 10.10.12.222:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round-trip times in milliseconds:
    Minimum = 0ms, Maximum = 0ms, Average = 0ms
Due to the way the Windows operating systems handle timing, it’s not unusual to have
RTTs of 0.
What does this group of packets look like at the target? Figure 7.2 shows us.
We can see that the four pings are accomplished with eight packets sent over the
network. Look at the last column in the upper part of the figure. Ping employs messages
in request–reply pairs using the ICMP protocol. An Echo request is sent out which
basically tells the receiver to “send an ICMP Echo message back to me, okay?” Once
the reply is received, the next request is sent, statistics compiled as the procedure goes
along, and so on.
The details of Frame 1 show that the ICMP message is carried directly inside an IP
packet (and then Ethernet II frame). But ICMP is not often shown as a transport layer
protocol. That would make ICMP function at the same level as things like TCP and
UDP, and this is simply not true. ICMP, as we will find, is concerned with network layer
problems, so portraying ICMP as a type of special protocol associated with IP is not
really a mistake.
So technically, because IPv4 packets carry ICMP messages as protocol number 1,
ICMP is as valid a layer above IP as TCP or UDP or any other of the 200 or so defined
IP protocols that can be carried inside IP packets. But because every IP implementation must include ICMP (and IPv6 has ICMPv6), it makes sense to bundle ICMP and IP
together. This also implies that ICMP messages do not report their own errors.
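Because ICMP rides inside IP as protocol number 1, a ping-style probe can be built with nothing more than a raw socket. The following is a minimal sketch, not how any particular ping implementation is coded, and it needs root or administrator privileges to open the raw socket; the target address is bsdserver's 10.10.12.77 from the examples, and the checksum routine is the usual one's-complement sum.

import os
import socket
import struct

def checksum(data: bytes) -> int:
    # Standard Internet checksum: one's-complement sum of 16-bit words
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    total = (total >> 16) + (total & 0xFFFF)
    total += total >> 16
    return ~total & 0xFFFF

def echo_request(ident: int, seq: int, payload: bytes = 56 * b"Q") -> bytes:
    # Type = 8 (Echo request), Code = 0, checksum computed over the whole message
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    cksum = checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, cksum, ident, seq) + payload

# Opening a raw socket for IP protocol 1 (ICMP) normally requires root privileges.
sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
sock.sendto(echo_request(os.getpid() & 0xFFFF, 1), ("10.10.12.77", 0))  # bsdserver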
What if no reply is received by the source of a ping? The source then times out
and another ICMP Echo request message is sent. Naturally, no statistics can be generated, and we get a “host unreachable” message in most cases. We can force a timeout
simply by trying to ping a nonexistent address (this could also be the result of a
simple typo).
FIGURE 7.2
Ping ICMP requests and replies showing details of the ping echo request in the middle pane. Note
that the content of the packet is the ICMP message, not TCP or UDP.
[[email protected] admin]# ping 10.10.12.55
PING 10.10.12.55 (10.10.12.55) 56(84) bytes of data.
From 10.10.12.166 icmp_seq=1 Destination Host Unreachable
From 10.10.12.166 icmp_seq=2 Destination Host Unreachable
From 10.10.12.166 icmp_seq=3 Destination Host Unreachable
--- 10.10.12.55 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5022ms, pipe 3
[[email protected] admin]#
Many ping implementations report either “unreachable” or “unknown” errors. The
unreachable report implies that the target was once known to the source and reachable, but isn't "reachable" at the moment. The unknown report implies that the source
has never heard of the target address or port. However, unreachable reports are often
returned by a host source pinging a new device, which obviously should be unknown.
Most network people treat both error condition reports the same way: Something is
just plain wrong.
Ping remains the first choice for checking connectivity on the Internet, between
hosts, and between host and router. On LANs, the first troubleshooting step is “can
you ping it?" If you cannot, there's no sense in going further. If you can, and things like
applications still do not function as expected, at least the troubleshooting process can
continue productively.
Firewalls sometimes screen out ICMP messages in the name of security. In these
cases, even a failed ping does not prove that a device is not working properly. However, diagnostics become more complex, although not impossible. Of course, screening
out all ICMP messages from a site usually also eliminates correct error reporting and
proper operation of the device. After we list the ICMP message types, we’ll discuss
which ICMP messages are essential.
Ping works with IPv6, too. On most Unix hosts, it's called ping6. When used with the
special IPv6 all-nodes multicast address ff02::1 (the %em0 suffix names the interface to use),
ping6 probes for the IPv6 address of every interface on the LAN, a form of forced neighbor
discovery in IPv6. Here's what it looks like on LAN2 when run from the bsdserver.
bsdserver# ping6 ff02::1%em0
PING6(56=40+8+8 bytes) fe80::20e:cff:fe3b:8732%em0 —> ff02::1%em0
16 bytes from fe80::20e:cff:fe3b:8732%em0, icmp_seq=0 hlim=64 time=0.154 ms
16 bytes from fe80::202:b3ff:fe27:fa8c%em0, icmp_seq=0 hlim=128 time=0.575
ms(DUP!)
16 bytes from fe80::5:85ff:fe8b:bcdb%em0, icmp_seq=0 hlim=64 time=1.192
ms(DUP!)
16 bytes from fe80::20e:cff:fe3b:8856%em0, icmp_seq=0 hlim=64 time=0.097
ms(DUP!)
^C
--- ff02::1%em0 ping6 statistics ---
1 packets transmitted, 1 packets received, +3 duplicates, 0% packet loss
round-trip min/avg/max/std-dev = 0.071/2.520/39.406/8.950 ms
bsdserver#
All the systems on LAN2 are listed except for lnxclient, which does not have an
IPv6 address. Hosts winsvr2 (fe80::20e:cff:fe3b:8856), wincli2 (fe80::202:b3ff:
fe27:fa8c), router CE6 (fe80::5:85ff:fe8b:bcdb), and even bsdserver (fe80::20e:
cff:fe3b:8732) itself have all replied. Oddly, the Windows XP client replies with a hop
limit of 128.
IPv6 traffic (and ICMPv6) is also visible to Ethereal, so we can explore the format of these packets a little further. Figure 7.3 shows what the exchange of the ping6
ff02::1%em0 packets looks like from wincli2 when the command is run from bsdserver. Note that this
only captures the exchange of packets that wincli2 processes.
FIGURE 7.3
ICMPv6 capture showing the ICMPv6 echo reply message from wincli2. The header details are
shown in the middle pane.
IPv6 uses its own version of ICMP, called (not surprisingly) ICMPv6. The ICMPv6
Echo reply message, sent in response to the ping to multicast group ff02::1, is highlighted in the figure. From the source address, we can tell this is from wincli2. We
looked at the details of the IPv6 header in the last chapter. Note that the hop limit is
128 in the reply, and that the protocol number for ICMP is 0x3a (58 decimal).
THE ICMP MESSAGE FORMAT
ICMP is usually considered to be part of the IP layer itself, and that is how ICMP is presented here. Hosts are supposed to set the IPv4 packet header TOS field to 0 if the packet
carries an ICMP message, and routers are supposed to set the precedence field to 6 or 7.
Figure 7.4 shows the format of two ICMP messages. All ICMP messages start with
the same three fields: an 8-bit Type and Code, followed by a 16-bit Checksum. Then,
depending on the value of the Type, the details of what follows varies. So to be more
informative, a second ICMP message is shown. The second message displays the format
used for a very common network condition, Destination Unreachable, which we saw
earlier.
[Figure 7.4 diagram: (a) the general ICMP message begins with a 1-byte Type, a 1-byte Code, and a 2-byte Checksum, followed by ICMP data whose content and format depend on the Type. (b) The Destination Unreachable message (Type = 3) follows the Code and Checksum with a 4-byte Unused field (all 0 bits), then the original IP header (20 bytes) and the first 8 bytes of the original packet data (usually the TCP/UDP header).]
FIGURE 7.4
ICMP message format, showing how a specific message such as Destination Unreachable uses
the fields following the initial three. (a) General format of ICMP message. (b) Format of Destination
Unreachable ICMP message.
Destinations on a TCP/IP network can be unreachable for a number of reasons. The
host could be down, or have a new IP address that is not yet known to all systems. The
destination’s Internet name could have been typed incorrectly (but still maps to an
existing IP address), the only link to the site could have failed, and so on.
ICMP Message Fields
The fields that appear in all ICMP messages follow:
Type—This 8-bit field defines the major purpose of the ICMP message. Most
indicate error conditions, but two of the most common type values, 8 and 0,
mean Echo Request and Echo Reply, respectively. A Type value of 3 means Destination Unreachable. All Types determine the format of the rest of the ICMP
message beyond the first three fields.
Code—This 8-bit field gives additional information about the condition in the
Type field. This is often not necessary, and many Types have only a Code = 0
defined. Other Types have many Code values defined to allow the source to
focus on the real problem. For example, Destination Unreachable (Type = 3)
has 16 codes (0–15) defined.
Checksum—This is the same type of checksum as used for the IP packet header.
This points out that ICMP, although considered part of IP itself, is really just as
much a separate layer as anything else in TCP/IP and so must provide for its
own error checking.
ICMP Types and Codes
There are about 40 defined ICMP message types, and message types 41 through 255 are
reserved for future use. Only a handful of the types have more than a Code value of 0
defined, but these are the more important ICMP message types.
There are two major categories of ICMP messages: error messages (reports that do
not expect a response) and queries (messages sent with the expectation of a matching response). Some others do not fall neatly into either category. The structure of
the fields following the checksum depends on the type of ICMP message. These two
formats are shown in Figure 7.5.
Note that the Destination Unreachable format shown in Figure 7.4 is an ICMP error
message and does not generate a reply. The fields that appear following the initial three
in the ICMP Destination Unreachable message are very common.
Unused—This 32-bit field must be set to all 0 bits for Destination Unreachable,
but in other ICMP messages it is often used as a sequence number to allow
requests and responses to be coordinated by senders and receivers.
IP Header and More—The last 28 bytes of the ICMP Destination Unreachable
message consist of the original IP header (usually 20 bytes, but can be up to
60 bytes) and the first 8 bytes of the segment inside the packet. Usually, this
includes the ports used by the TCP or UDP segment. This practice allows senders to realize exactly what field value is objectionable. It’s one thing to say
“Port unreachable,” but better to say “Hey! The port in the UDP segment you
sent, which is port 6735, can’t be reached here right now...”
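Since the Destination Unreachable body hauls back the original IP header plus the first 8 bytes of the offending segment, the sender can recover exactly which packet and which port were rejected. A small decoding sketch, assuming the embedded header is IPv4 and has not been truncated:

import struct

def parse_dest_unreachable(icmp: bytes):
    # Type (1 byte), Code (1), Checksum (2), Unused (4), then the original packet's start
    icmp_type, code, _cksum = struct.unpack("!BBH", icmp[:4])
    assert icmp_type == 3, "not a Destination Unreachable message"
    inner = icmp[8:]                               # embedded IPv4 header + 8 data bytes
    ihl = (inner[0] & 0x0F) * 4                    # length of the embedded IP header
    protocol = inner[9]                            # 6 = TCP, 17 = UDP
    dest = ".".join(str(b) for b in inner[16:20])  # original destination address
    src_port, dst_port = struct.unpack("!HH", inner[ihl:ihl + 4])
    return code, protocol, dest, src_port, dst_port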
Usually, the error messages have the all-zero Unused field followed by the 28 bytes of original
header and packet data, but not always. Identifiers track Query message request/
response pairs, and the sequence numbers help sort out queries sent by the same
process (the process identifier, the PID, is often the ICMP Query identifier in Unix
systems).
The full suite of about 40 ICMP message types can be implemented by hosts or routers. Some of the types are mandatory, some are optional, some are for experimental
use, and some are obsolete. In some cases, the specifications explicitly state that hosts
or routers must be able to transmit and receive (process) particular ICMP messages, but not in all
cases.
[Figure 7.5 diagram: (a) an ICMP error message carries Type, Code, and Checksum, then a 4-byte field whose content depends on the Type/Code (usually all 0 and unused, except Type 3/Code 4: destination unreachable, fragmentation needed, where the field is 2 unused bytes and a 2-byte link MTU size; Type 3/Code 5: destination unreachable, redirect, where the field is a router IP address; and Type 12/Code 0: parameter problem, where the field is a 4-bit pointer to the parameter with the rest all 0), followed by the original IP header (20 bytes) and the first 8 bytes of the original packet data (usually the TCP/UDP header). (b) An ICMP query message carries Type, Code, and Checksum, then an Identifier for request/response pairs (usually the PID in Unix), a Sequence Number (set to 0 initially and incremented), and content that depends on the query type.]
FIGURE 7.5
ICMP error and query messages. Note that error messages include the IP header that generated
the error. (a) ICMP error message. (b) ICMP query message.
Let’s take a look at what the specifications say about ICMP messages. First, we’ll look
at error messages, and then query messages, and then all the rest.
ICMP Error Messages
ICMP Error messages report semipermanent network conditions. The five ICMP error
messages are displayed in Table 7.1, which shows how routers and hosts should handle
each type.
Time-exceeded errors result from TTL expiration (Code = 0) or from fragment reassembly
that cannot be completed quickly enough at a receiver (Code = 1). Parameter problems
are usually sent in regard to IP options. The codes are for a bad IP header (0), missing a
required option field (1), or a bad length (2).
Which of these message types are essential to device operation and should not be
blocked? Generally, the Destination Unreachable message is essential: it is used by traceroute
and in path MTU calculations. Of the others, the Redirect message is most often
Table 7.1 ICMP Error Messages

Type  Meaning                  Codes  Data              Router Sends  Router Receives  Host Sends  Host Receives
3     Destination Unreachable  0–15   IP hdr + 8 bytes  M             M                M           M
4     Source Quench            0      IP hdr + 8 bytes  Obs           Obs              Obs         Obs
5     Redirect                 0–3    IP hdr + 8 bytes  M             M                Opt         Opt
11    Time Exceeded            0–1    IP hdr + 8 bytes  M             M                Opt         Opt
12    Parameter Problem        0–2    IP hdr + 8 bytes  M             M                M           M

Obs, obsolete; Opt, optional; M, mandatory.
Table 7.2 ICMP Destination Unreachable Codes

Code  Meaning
0     Network is unreachable (the router's links to it might have failed).
1     Host is unreachable (the router can't reach the host; it might be turned off).
2     Requested protocol is unreachable (the process might not be running on the host).
3     Port is unreachable (the remote application might not be running on the host).
4     Fragmentation needed at router but DF flag is set (used for path MTU determination).
5     Source route has failed (source route path might go through down link or router).
6     Destination network is unknown (different than Code = 0; router can't find it).
7     Destination host is unknown (different than Code = 1; router can't find host).
8     Source host is isolated (source host is not allowed to send onto the network).
9     Communication with this network is administratively forbidden (due to firewall).
10    Communication with this host is administratively forbidden (due to firewall).
11    Network is unreachable with specified Type of Service (router can't forward).
12    Host is unreachable with specified Type of Service (router can't forward).
13    Communication administratively prohibited (by route filtering).
14    Host precedence violation (the first-hop router does not support this precedence).
15    Precedence cut-off in effect (requested precedence too low for router network).
blocked, because it does just as it says, that is, it tells another device to send packets
somewhere else.
Many ICMP errors are Destination Unreachable errors. The 16 codes for this error
type and their meanings are shown in Table 7.2, which includes a likely cause for the
condition.
The precedence bits are in the TOS field of the IPv4 packet header, and are distinct
from the TOS bits themselves (and are almost universally ignored anyway).
ICMP Query Messages
ICMP Query messages are used to question conditions on the network. These messages
are used in pairs, and each request anticipates a response. The 10 ICMP Query messages
are listed in Table 7.3, which shows how routers and hosts should handle each type.
These ICMP messages in Table 7.3 allow routers and hosts to query for timestamp,
address mask, and domain name information. Echo requests and replies have special
uses described in the section of this chapter on ping.
Table 7.3 ICMP Query Messages

Type  Meaning              Codes  Data      Router Sends  Router Receives  Host Sends  Host Receives
0     Echo reply           0      Varies    M             M                M           M
8     Echo request         0      Varies    M             M                M           M
13    Timestamp request    0      12 bytes  Opt           Opt              Opt         Opt
14    Timestamp reply      0      12 bytes  Opt           Opt              Opt         Opt
15    Information request  0      0 bytes   Obs           Obs              Obs         Obs
16    Information reply    0      0 bytes   Obs           Obs              Obs         Obs
17    Mask request         0      4 bytes   M             M                Opt         Opt
18    Mask reply           0      4 bytes   M             M                Opt         Opt
37    Domain name request  0      0 bytes   M             M                M           M
38    Domain name reply    0      0 bytes   M             M                M           M

Obs, obsolete; Opt, optional; M, mandatory.
Which of these should be allowed to pass through firewalls? Sites most often allow
Echo messages (used by ping), although some allow only incoming Echo replies but
not Echo requests (which allows my devices to ping yours, but not the other way
around). The timestamp reply is also used by traceroute, and if these messages are
blocked, asterisks (*) appear instead of times in the traceroute report (we’ll look at
traceroute operation in detail in Chapter 9).
Table 7.4 Other ICMP Query Messages

Type   Meaning                      Codes  Data       Router Sends  Router Receives  Host Sends  Host Receives
1      Unassigned                   NA     NA         NA            NA               NA          NA
2      Unassigned                   NA     NA         NA            NA               NA          NA
6      Alternate host address       0      (4 bytes)  (Prohibited)  (Prohibited)     Opt         Opt
9      Router advertisement         0      Varies     M             Opt              Prohibited  Opt
10     Router solicitation          0      0 bytes    M             M                Opt         Opt
19     Reserved–security            NA     NA         NA            NA               NA          NA
20–29  Reserved–robustness          NA     NA         NA            NA               NA          NA
30     Traceroute                   0–1    Varies     Opt           Opt              M           M
31     Datagram conversion error    0–11   Varies     ?             ?                ?           ?
32     Mobile host redirect         0      Varies     Opt           Opt              Opt         Opt
33     IPv6 where-are-you           0      ?          Opt           Opt              Opt         Opt
34     IPv6 I-am-here               0      ?          Opt           Opt              Opt         Opt
35     Mobile registration request  0, 16  Varies     Opt           Opt              Opt         Opt
36     Mobile registration reply    0, 16  Varies     Opt           Opt              Opt         Opt
39     SKIP                         0      Varies     Opt           Opt              Opt         Opt
40     Photuris                     0–3    Varies     Exp           Exp              Exp         Exp

Exp, expired; Obs, obsolete; Opt, optional; M, mandatory; NA, not applicable.
Other ICMP Messages
Some ICMP messages do not fall neatly into either the error or query category.
These messages are typically used in specialized circumstances. The other 25 ICMP
messages are listed in Table 7.4, again showing how routers and hosts should handle
each type.
The messages displayed in Table 7.4 are less intuitive than others. Many of the other
messages are relatively new, apply to special circumstances, and not much has been
published about their use.
Very little has been written on the use of the alternate host address message and
the table is filled in with more suggestions than anything else. Router advertisement
and solicitation messages are defined in RFC 1256 as part of “neighbor discovery”
for IPv4 and a way around network administrators needing to know local router
addresses.
The traceroute message was introduced in RFC 1393 and was supposed to be
a more formal way to perform a traceroute, but never really caught on. RFC 1393
describes an alternate traceroute method that uses a single packet with an IP header
Traceroute option field and uses the answering ICMP Type = 30 messages from
routers to gather the same information while using far fewer messages. However, support for this method is not mandatory on routers, making this form of traceroute
problematic.
Datagram conversion errors are part of the “Next Generation Internet” protocol
using 64-bit addresses described in RFC 1475 and occurring when packets cannot be
converted to the new format. The mobile-related messages (32, 35, and 36) are part of
Mobile IP (or "IP Mobility"). SKIP is the Simple Key Management for Internet Protocols
and is used for Internet security. So is Photuris, an experimental aspect of IPSec that
has four codes: one reserved (0), one for an unknown IPSec Security Parameter Index
(SPI, 1), one for failed authentication (2), and one for failed decryption (3).
SENDING ICMP MESSAGES
Few TCP/IP protocols have been the subject of as much tinkering and add-on
functionality as ICMP. The original specification of ICMP was in RFC 792 and refined
in RFC 1122 (Host Network Requirements) and RFC 1812 (Router Requirements).
RFC 1191 added path MTU discovery functions to ICMP, RFC 1256 added router discovery, and RFC 1393 extended traceroute functions with a special message type not
often used.
But at heart, ICMP is a collection of predefined messages to indicate very specific
conditions. If the sender of a packet receives an ICMP message that involves ICMP itself
(the query messages), then ICMP deals with it directly. Otherwise, other protocols are
notified. (Unreachable ports are reported to UDP, which lacks the segment tracking
that TCP has, and so forth.) The precise response of an application to an ICMP message
can vary, but usually the error is reported to the user so that corrective action (even if
it’s just “Stop doing that!”) can be taken.
When ICMP Must Be Sent
Systems that detect a packet error and discard the packet may or may not send an
ICMP message back to the originating host. Usually it depends on whether the error is
transient or semipermanent.
Things like invalid checksums are ignored in TCP/IP, because these are considered to
be transient failures that should not persist. The philosophy is that if the data are important, the sender will simply resend. Transient errors are unlikely to repeatedly manifest
themselves in a chain of packets, and thus do not indicate a network-wide problem.
However, semipermanent errors such as invalid IP addresses need to be reported
to the originator. These are fundamental problems with the network or in the way that
the application is trying to use the network. The sender must either stop or change the
content of the packets.
It is important to realize that the presence of many ICMP messages on a network
does not mean that things are not working well, nor does the lack of ICMP messages
mean that the network is working fine.
Most users see only a handful of ICMP message types, especially those used for ping
and traceroute, such as the Time Exceeded, Timestamp Reply, Destination Unreachable, and Echo messages.
When ICMP Must Not Be Sent
ICMP also establishes situations when ICMP messages must not be sent. Transients like
checksum errors or intermittent link-level failures are clear examples, but ICMP goes
further than this. Generally, error messages should not be sent if they will generate
more network traffic and add little new information to what is obvious to the sender.
For example, RFC 1122 says that ICMP error message should never be sent if a
receiver gets the following:
■ ICMP error message (e.g., errors in ICMP checksums should not be reported as errors)
■ Internet Group Management Protocol (IGMP) message (IGMP is for multicast, and multicast traffic tends to multiply exponentially on the network, and one error could trigger many error messages)
■ Packet with a broadcast or multicast destination address (another traffic-oriented rule)
■ Link-layer frame with broadcast or multicast address
■ Packet with a special source address (all zeros, loopback, and so on)
■ Any fragment other than the first fragment of a fragmented packet
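Condensed into code, the RFC 1122 suppression rules look something like the predicate below. This is only a sketch; the pkt attribute names are hypothetical stand-ins for whatever fields a real IP stack would actually consult.

def should_send_icmp_error(pkt) -> bool:
    # 'pkt' is a hypothetical object; the attribute names are illustrative only.
    if pkt.is_icmp_error:                           # never report errors about error messages
        return False
    if pkt.is_igmp:                                 # multicast management traffic
        return False
    if pkt.dest_is_multicast_or_broadcast:          # IP-level broadcast or multicast
        return False
    if pkt.frame_dest_is_multicast_or_broadcast:    # link-level broadcast or multicast
        return False
    if pkt.source_is_special:                       # all zeros, loopback, and so on
        return False
    if pkt.fragment_offset != 0:                    # only the first fragment may trigger an error
        return False
    return True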
PING
Most people who know little about how TCP/IP works usually know of the ICMP-based
application known as ping. The original metaphor was the “ping” of a naval sonar unit.
Ping is a simple Echo query-and-response ICMP message that is used to see if another
device is up and reachable over the network. A successful ping means that network
administrators looking at problems can relax a great deal: The network routers on the
path and at least two hosts are running just fine.
Ping implementations and the parameters supported vary greatly among operating
systems and routers (most routers support ping). Some only send four packets and quit,
unless told to send more. Others send constantly until told to stop. The parameters can
usually set many of the IPv4 packet header fields such as TTL,TOS, and so on to specific
values.
Usually, Unix versions use the PID as the Identifier field in the ping message, but
Linux increments this based on application calls. Unix ping messages are usually
56 bytes long, but Windows implementations use only 32 bytes. The payload of the
ping message echoed back to the sender typically consists of an 8-byte timestamp and
a fill pattern. The timestamp can be used to roughly calculate round-trip delays through
the network (in milliseconds).
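The timestamp-in-the-payload trick is simple: pack the sending time into the first 8 bytes of the data, and subtract it from the arrival time when the echoed copy comes back. A minimal sketch follows; the exact payload layout varies from implementation to implementation.

import struct
import time

def make_payload(fill: bytes = bytes(48)) -> bytes:
    # 8-byte sending timestamp followed by a fill pattern, 56 bytes in all
    return struct.pack("!d", time.time()) + fill

def rtt_ms(echoed_payload: bytes) -> float:
    (sent,) = struct.unpack("!d", echoed_payload[:8])
    return (time.time() - sent) * 1000.0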
Ping has some quirks that users should be aware of. First, small pings (maybe 56 or
64 bytes in the packet) often work fine, while larger pings with more realistic payload
sizes might not go through reliably. That's what users care about—the network is struggling with real data packets. Seeing a small ping getting through reliably is not always
helpful.
Also, the round-trip times are not often vital information. You expect round-trip
times to go up as packet sizes increase, and that’s typically what is observed. The same
is true if the network is heavily loaded. But this is a relative, not absolute, observation.
Only when round-trip times are longer than expected, or if they vary by huge amounts,
is there an indication that something is wrong.
Part of the reason that round-trip times are not reliable is that routers (in particular)
and even hosts might process ICMP Echo requests at a lower priority than other traffic.
In fact, in many router architectures, ICMP message processing requires a trip to the control-plane processor, while transit traffic is forwarded in the forwarding-plane hardware.
We’ll be using ping extensively in many chapters in this book.
TRACEROUTE
Traceroute is not an ICMP-based network utility in the same sense that ping is. However, because traceroute uses ICMP messages to perform its functions, and for many
people the next step after ping is traceroute, this is the place to discuss this utility. We’ll
use traceroute heavily in Chapter 9 and throughout the rest of the book.
After ping has been used to verify that an IP address is reachable over the network,
the next logical step is to determine how the packets make their way to the destination
and back. In other words, we would like to trace the route from source to destination
(the reverse path is normally the same). Yes, IP networks route around failures and
routing tables can change, but paths are usually stable on the order of hours if not days
when things are not going completely haywire. Of course, paths might also simply be
asymmetric, yet stable, so it is not only path changes that are challenging for traceroute
interpretation.
Traceroute implementations vary even more than those for ping. Some have graphical displays and use other Internet utilities to display location and administrative
information about the routers and networks uncovered. This in turn has made many
network administrators so nervous that they routinely block traceroute ICMP messages
with firewalls or route filters to hide topology details. In fairness, the Internet is no
longer a teaching tool or good place to explore the limits of knowledge, and there are
so many disruptive or even malicious people on the Internet, that a certain amount of
anxiety is completely understandable (which is why a network such as the one used
for this book makes so much sense).
On Unix-based systems, traceroute sends a sequence of UDP packets (a typical
default is three at a time) to an invalid port on another host (the port number starts at 33434).
The utility can also use ICMP Echo requests, which is what the Windows version does.
Some versions even use TCP (a utility called tcptraceroute).
Whatever the type of packet, the TTL field is initially set to 1 in the three packet set,
so the first router along the path should generate an ICMP Time Exceeded message to
the sender. The round-trip delay in the timestamp field and IP address of the router is
recorded by the sender and another set of packets is sent, this time with the TTL set
to 2. These packets are discarded by the second router, and another ICMP message is
sent back. The process is repeated until the destination host is reached and the host
returns a Destination Port Unreachable message, or until a firewall is encountered that
blocks the ICMP messages or unsolicited UDP traffic. (These messages mimic port
scans and are sometimes blocked, as mentioned earlier in this chapter.)
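A bare-bones version of that loop can be sketched with two sockets: a UDP socket whose TTL is raised one hop at a time, and a raw ICMP socket to catch the Time Exceeded and Port Unreachable replies. This is a hedged illustration, not a production traceroute: it needs root privileges for the raw socket, expects dest_ip to be an address string rather than a name, and skips the three-probes-per-hop timing that real implementations do.

import socket

def traceroute(dest_ip: str, max_hops: int = 30, base_port: int = 33434):
    recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    recv.settimeout(2.0)                               # a Cisco-style 2-second timeout
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for ttl in range(1, max_hops + 1):
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        send.sendto(b"", (dest_ip, base_port + ttl))   # probe to an unlikely UDP port
        try:
            reply, (hop, _) = recv.recvfrom(512)
        except socket.timeout:
            print(f"{ttl:2}  *")                       # silent router (or filtered reply)
            continue
        icmp_type = reply[20]                          # ICMP type after a 20-byte IP header
        print(f"{ttl:2}  {hop}")
        if hop == dest_ip or icmp_type == 3:           # Port Unreachable: we reached the host
            break
    recv.close()
    send.close()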
The end result should be a list of the routers on the path from source to destination
(or the firewall) that also records round-trip delays. In some cases (sometimes many
cases), some routers will not respond to the TTL “timeout” with an ICMP message, but
simply silently discard the offending packet. If the packet does not return within the
timeout window (Cisco routers use a default timeout of 2 seconds), most traceroute
implementations indicate this with an asterisk (*) or some other placeholder and just
keep going, trying to reach the next router. (The appearance of the asterisk does not
necessarily mean that the packet was lost.)
One nagging traceroute issue is the number of messages exchanged over the
network needed to reveal fairly basic information. RFC 1393 describes an alternate
traceroute method that uses a single packet with an IP header Traceroute option field
and uses the answering ICMP Type = 30 messages from routers to gather the same
information while using far fewer messages. However, support for this method is not
mandatory on routers.
We’ll use traceroute a lot in many of the chapters of this book too.
PATH MTU
ICMP messages also play a role in path MTU discovery. We’ve already mentioned the
MTU as a critical link parameter determined by the maximum frame size. Packets,
including all headers, that fit inside the smallest frame size on the path from source to
destination do not have to be fragmented and do not incur any of the penalties that
fragmentation involves.
But tuning the packet size to the path MTU has another network benefit: This practice maximizes throughput and minimizes the overhead required to move large messages from system to system. Overhead bytes are those that do no useful work in terms
of data transfer, but are necessary for the data transfer to take place at all.
Consider a data transfer using 68-byte MTUs, once the smallest size possible. If usual
IP and TCP headers are used, which are 20 bytes each, they will take up 40 bytes of
the packet, leaving only 28 bytes for data. So a whopping 59% (40/68) of the packet
is made of overhead. And a minimum of 35,715 packets need to be sent, routed, and
processed to transfer every megabyte of data. Bumping this MTU size up to 576 bytes
(a typical default value and the functional minimum for IPv4) cuts the overhead down
to about 7% (40/576) and requires only 1866 packets per megabyte of data, about 5%
of the previous number of packets.
Using the typical Internet frame size of 1500, the overhead shrinks to about 2.7% (40/1500)
and the number of packets required for a megabyte of data becomes a respectable
685. Larger MTUs have proportional benefits. (It is sometimes pointed out that bigger
packets are not always more efficient; they can add delay for smaller units of traffic,
a phenomenon often called “serial delay,” and on high bit error links, larger packets
almost guarantee that a bit error requiring a resend will occur during frame transmission. On older, more error-prone networks, throughput shrank to zero as packet size
grew.)
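The arithmetic is easy to reproduce. The following is a quick check of the figures in the preceding paragraphs, assuming 20-byte IP and TCP headers:

for mtu in (68, 576, 1500):
    data_bytes = mtu - 40                          # 20-byte IP header + 20-byte TCP header
    overhead = 40 / mtu * 100
    packets_per_mb = -(-1_000_000 // data_bytes)   # ceiling division
    print(f"MTU {mtu:5}: overhead {overhead:4.1f}%, {packets_per_mb} packets per megabyte")

# MTU    68: overhead 58.8%, 35715 packets per megabyte
# MTU   576: overhead  6.9%, 1866 packets per megabyte
# MTU  1500: overhead  2.7%, 685 packets per megabyte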
The 576-byte MTU size was selected as a compromise between latency ("delay")
and throughput for modems and low-speed serial SLIP implementations. This is directly
related to the serialization ("serial") delay mentioned earlier. And use of an MTU size smaller than
512 precludes the use of the Dynamic Host Configuration Protocol (DHCP).
Now, TCP can adjust this message size, no matter what the default, but UDP traffic,
which is growing, cannot. Of course, every link from host to router to router to host
can have a different MTU size. That is what path MTU discovery is all about. It works
via the following:
■ Setting the DF flag in the IP header to 1 (don't fragment)
■ Sending a large packet to the destination to which the path MTU is being determined
■ Seeing if any router responds with an ICMP Destination Unreachable message with Code = 4 (fragmentation required but don't fragment bit is set)
■ Repeating the first three steps with a smaller packet size
The process stops when a message is received from the destination host, showing
that a path MTU of this size works. Again, paths are fluid on TCP/IP router networks,
but they are remarkably stable considering all that can go wrong. By the way, it is
assumed that the path MTU for outbound packets is the same as the path MTU size for
inbound packets, but this is not true just often enough to make the process unnecessarily haphazard.
Table 7.5 Path MTU Plateaus for Various Network Link Types

Plateau Size in Bytes  Description
65535                  Maximum MTU and packet size
32000                  A value established "just in case"
17914                  16-Mbps IBM token ring LANs
8166                   IEEE 802.4 token bus LANs
4352                   FDDI (100 Mbps fiber rings)
2314                   Wireless IEEE 802.11b native frame (often "adjusted" to 1492)
2002                   4-Mbps IEEE 802.5 token ring (recommended value)
1492                   IEEE 802.3 LANs (also used in 802.2)
1006                   SLIP
508                    Arcnet (proprietary LAN from Datapoint)
296                    Some point-to-point links use this value
68                     Minimum MTU size
The path MTU “seed” or probe size and adjustment steps are not randomly chosen.
A series of “plateaus” representing common link MTU limits has been established. Some
of these are shown in Table 7.5.
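A sketch of the discovery loop using the plateau values from Table 7.5 follows. The probe() callback is a hypothetical placeholder for "send a DF-set packet of this size and report whether a fragmentation-needed ICMP message came back"; it is not a real API.

PLATEAUS = [65535, 32000, 17914, 8166, 4352, 2314, 2002, 1492, 1006, 508, 296, 68]

def path_mtu(probe, start: int = 65535) -> int:
    # probe(size) should return True if a DF-set packet of 'size' bytes reaches the
    # destination, and False if a "fragmentation needed" ICMP error comes back.
    for size in (p for p in PLATEAUS if p <= start):
        if probe(size):
            return size          # the destination answered: this size fits the whole path
    return 68                    # the minimum MTU is guaranteed to work

# Pretend path whose smallest link MTU is 1500 bytes:
print(path_mtu(lambda size: size <= 1500))   # prints 1492, the next plateau down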
In practice, as important as the path MTU size is, little is often done about the MTU
size except to change the default to 1500 bytes if the default value is less (it usually
is). This is because most of the networks that hold the source and destination hosts are
Ethernet LANs that do not support 9000-byte jumbo frames. Between routers, WAN
links typically support larger MTU sizes (around 4500 bytes or larger), but that does
no good if the end system can only handle 1500-byte frames. However, WAN links with
MTUs greater than 1500 bytes allow the use of tunnel encapsulation of 1500-byte MTU
packets without the need for fragmentation, so the larger MTU is not actually wasted.
ICMPV6
A funny thing happened to ICMP on its way to IPv6. It didn’t work. ICMP, now officially
called ICMPv4, is built around the IPv4 packet header and things that could go wrong
with it. Not only is the IPv6 packet header different, along with many fields and
address sizes, but many functions added to IPv4 that affected ICMPv4 were scattered across
separate RFCs, and implementations varied. These functions are systematized in ICMPv6.
ICMPv6 makes some major changes to ICMPv4:
■ New ICMPv6 messages and procedures replace ARPs.
■ There are ICMPv6 messages to help with automatic address configuration.
■ Path MTU discovery is automatic, and a new Packet Too Big message is sent to the source for over-large packets because IPv6 routers do not fragment.
■ There is no Source Quench in ICMPv6 (it is obsolete in ICMPv4, but still exists).
■ IGMP for multicast is included in ICMPv6.
■ ICMPv6 helps detect nonfunctioning routers and inactive partner hosts.
■ ICMPv6 is so different that it now has its own IP protocol number. IPv6 uses the next header value of 58 for ICMPv6 messages.
Basic ICMPv6 Messages
The general ICMPv6 message format is similar to ICMPv4, but somewhat simpler.
The structure of a generic ICMPv6 message and the common Destination Unreachable
message are shown in Figure 7.6. ICMPv6 error messages are in the range 0 to 127.
Some of the most common are shown in the figure as well.
[Figure 7.6 diagram: (a) a generic ICMPv6 message carries a 1-byte Type, a 1-byte Code, a 2-byte Checksum, and a message body. Basic ICMPv6 Type field values: 1 Destination Unreachable, 2 Packet Too Big, 3 Time Exceeded, 4 Parameter Problem, 128 Echo Request, 129 Echo Reply (Redirect is Type 137). (b) The Destination Unreachable message (Type = 1) follows the Checksum with a 4-byte Unused field and then as much of the original IPv6 packet as will fit in 576 bytes or less.]
FIGURE 7.6
ICMPv6 message formats, which can be compared to the IPv4 versions in Figure 7.4. (a) Generic
ICMPv6 message format. (b) ICMPv6 Destination Unreachable message.
Table 7.6 Destination Unreachable Codes for ICMPv6

Code  Meaning
0     No route to destination
1     Communication with destination administratively prohibited
2     Next destination in the IPv6 Routing header is not a neighbor, and this is a strict route (routing headers are not currently supported)
3     Address unreachable
4     Port unreachable
Destination Unreachable
In ICMPv6, the Destination Unreachable message type is Type = 1. There are only five
codes, listed in Table 7.6; they can be compared to the IPv4 codes in Table 7.2.
Packet Too Big
A router sends an ICMPv6 Packet Too Big message to the source when the packet is bigger than the MTU for the next-hop link. The next-hop link’s MTU size is reported in the
message. In ICMPv4, this type of information was supplied in the Destination Unreachable message. The format of the Packet Too Big message is shown in Figure 7.7.
Time Exceeded
An ICMPv6 Time Exceeded message is sent by a router when the Hop Limit field of the
IPv6 header reaches 0 (ICMPv6 Code = 0) or when the receiver’s fragment reassembly
timeout (senders can still fragment under IPv6) has expired (ICMPv6 Code = 1). The
format is the same as for the ICMPv6 Destination Unreachable message, except that
the Type is 3.
[Figure 7.7 diagram: the Packet Too Big message carries Type, Code, and Checksum, then a 4-byte field holding the MTU of the next-hop link, followed by as much of the original IPv6 packet as will fit in 576 bytes or less.]
FIGURE 7.7
ICMPv6 Packet Too Big format, showing details of the fields used.
Table 7.7 Parameter Problem Codes and Meanings

Code  Meaning
0     Erroneous header field encountered
1     Unrecognized next header type encountered
2     Unrecognized IPv6 option encountered
Parameter Problem
As in ICMPv4, an ICMPv6 Parameter Problem message is sent by a host or router that
cannot process a packet due to a header field problem. The codes are listed in Table 7.7.
Echo Request and Reply
Under IPv6, ping becomes “pingv6” (the name is not important) and uses ICMPv6 Echo
Request and Reply messages, but with Type = 128 used for requests and Type = 129
used for replies.
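On most Unix systems a raw ICMPv6 socket will compute the checksum (which covers an IPv6 pseudo-header) on the sender's behalf, so a minimal echo request sketch only has to fill in Type 128. This is illustrative only; it requires root privileges, and the target here is bsdserver's unique local address from the earlier capture.

import os
import socket
import struct

# Raw ICMPv6 sockets normally require root privileges; the kernel inserts the checksum.
sock = socket.socket(socket.AF_INET6, socket.SOCK_RAW, socket.IPPROTO_ICMPV6)
ident, seq = os.getpid() & 0xFFFF, 0
message = struct.pack("!BBHHH", 128, 0, 0, ident, seq) + b"ping6 data"  # Type 128, Code 0
sock.sendto(message, ("fc00:fe67:d4:b:20e:cff:fe3b:8732", 0, 0, 0))     # bsdserver's ULA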
Neighbor Discovery and Autoconfiguration
ICMPv6 provides a number of neighbor discovery functions that help with:
■ Location of routers
■ IPv6 parameter configuration
■ Location of local hosts
■ Neighbor unreachability detection
■ Automatic address configuration and duplicate detection
These ICMPv6 functions use the following message types:
Router Solicitation Type = 133 messages are sent by a host to ask neighbor routers
to make their presence known and provide link and Internet parameters, similar to
the ICMPv4 Router Solicitations. The message is sent to the all-routers link-local IPv6
multicast address.
Router Advertisement Type = 134 messages are sent periodically by every router
and in response to a host's Router Solicitation, similar to the ICMPv4 Router
Advertisements. The message is sent either to the all-nodes IPv6 multicast
address (unsolicited) or to the querying host (solicited).
Neighbor Solicitation Type = 135 messages are used, as ARP is in IPv4, to find the
link-layer address of a neighbor, verify that the neighbor is still reachable with the
cached entry, or check that no other node has this IPv6 address. These messages
also detect unresponsive neighbors.
Neighbor Advertisement Type = 136 messages are sent in response to Neighbor
Solicitation messages and resemble the ARP response. Nodes can also announce
changes in link-layer addresses by sending unsolicited Neighbor Advertisements.
Redirect Type = 137 messages perform the same role as the ICMPv4 redirect.
Routers and Neighbor Discovery
IPv6 routers provide their hosts with basic configuration and parameter information using Router Advertisement messages sent to the all-hosts link-local IPv6 multicast address. Hosts do not have to wait for these periodic router messages and can
send a Router Solicitation message at startup. The router's reply is sent to the host's link-local
address.
Each router will supply data that includes the following:
■ Link-layer router address
■ MTU for any links that have variable MTUs
■ List of all prefixes and lengths used on the LAN (the specification says "link")
■ Prefixes that a host can use to create its addresses
■ Default Hop Limit value to use on packets
■ Values for miscellaneous timers
■ Location of a DHCP server where the host should fetch more information
Note that the Router Advertisement (RA) will indicate the availability of a DHCP
server for stateless configuration (RA option O), or the requirement to perform stateful configuration (RA option M). The location of the DHCPv6 server is not specified,
merely that it’s available and what the requirements are for use.
Interface Addresses
Each IPv6 interface has a list of addresses and prefixes associated with it, including a
unique link-local address. In theory, this should allow LANs to easily migrate from one
ISP to another simply by changing prefixes and allowing the older prefix to age-out of
the host. In practice, migration between IPv6 service providers is not as simple. DNS
entries do not just “flop over,” and host and router configuration (and firewalls!) have
static configuration parameters. The point is that router advertisements assign a lifetime, which must be refreshed, to advertised prefixes. This also makes it easier to move
hosts from LAN to LAN.
Each host can use some of the prefixes and lengths advertised by the routers (if
they are flagged for this use) to construct host addresses. A private (ULA local) or
global address can be constructed by appending a unique interface identifier to the
advertised prefix; the result is added to the list of the host's IPv6 addresses.
Router advertisements can also direct a host to a DHCP server that can assign
addresses chosen by a network administrator.
Neighbor Solicitation and Advertisement
One of the problems with ARP in IPv4 was that it was essentially a frame-level protocol that did not fit in well with the IP layer at all. In IPv6, "ARPs" are ICMPv6 messages.
ICMPv6 packets can be handled easily at the IPv6 layer, and can be authenticated and
even encrypted with IPSec techniques.
In addition to finding neighbor link-layer addresses, the Neighbor Solicitation and
Advertisement messages are used to find “dead” routers and partner hosts, and detect
duplicate IPv6 addresses.
Neighbor Solicitation messages are sent to the solicited-node IPv6 multicast address,
which is formed by appending the last 3 bytes of an IPv6 link-local address to a multicast prefix. The use of the multicast address cuts down on the number of hosts that has
to pay attention to the “ARP” message (in fact, only the target system should process the
request). The sender also includes its own link-layer address with the message.
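The construction of the solicited-node address is mechanical enough to show in a few lines; the example input is wincli2's link-local address from the captures.

import ipaddress

def solicited_node(unicast: str) -> ipaddress.IPv6Address:
    # ff02::1:ffXX:XXXX, where XX:XXXX are the last 3 bytes of the unicast address
    last3 = ipaddress.IPv6Address(unicast).packed[-3:]
    prefix = bytes.fromhex("ff02" + "00" * 9 + "01ff")
    return ipaddress.IPv6Address(prefix + last3)

print(solicited_node("fe80::202:b3ff:fe27:fa8c"))   # ff02::1:ff27:fa8c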
Duplicate IP addresses are always a problem. Before a system can claim an IPv6
address or any other address not constructed by adding a link-local address to a prefix, the system sends a Neighbor Solicitation message asking whether any neighbor
already has that IPv6 address. This message uses the special IPv6 Unspecified Source
address as the source address, because you can’t ask about a source address by using
the source address! If the address is in use, the response is multicast to inform all
devices. Addresses that are manually assigned are tested in the same fashion.
Dead routers and hosts are detected by sending a unicast Router Solicitation or Neighbor
Solicitation message to the device in question.
QUESTIONS FOR READERS
Figure 7.8 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
[Figure 7.8 diagram: (a) an ICMP error message carries Type, Code, and Checksum, a 4-byte field whose content depends on the Type/Code (usually all 0 and unused), and then the original IP header (20 bytes) plus the first 8 bytes of the original packet data (usually the TCP/UDP header). (b) An ICMP query message carries Type, Code, and Checksum, an Identifier for request/response pairs (usually the PID in Unix), a Sequence Number (set to 0 initially and incremented), and content that depends on the query type.]
FIGURE 7.8
ICMP error and query messages in general. (a) Error message. (b) Query message.
1. How many types of error-reporting messages are there in ICMP? How many pairs
of query messages are there in ICMP?
2. Which pair of ICMP messages can be used to obtain the subnet mask?
3. Which kind of ICMP message notifies a host that there is a problem in the packet
header?
4. Which fields are used for the ICMP checksum calculation?
5. A ping sent to IP address 10.10.12.77 (the address assigned to bsdserver) on
LAN2 is successful. Later, it turns out that the bsdserver was powered off for
maintenance at the time. What could have happened?
CHAPTER 8
Routing
What You Will Learn
In this chapter, you will learn how routing works. We’ll look at both direct delivery
of packets to a destination without a router and indirect delivery through a router,
both of which happen all the time. Routers provide indirect delivery between
LANs while bridges essentially provide direct delivery only. Packet switching, on
the other hand, is a related form of indirect delivery that will be explored in a later
chapter.
You will learn about the role of routing tables and forwarding tables in the
routing process. Technically, routers use the information in the routing table to
create a forwarding table to forward packets to the next hop based on a metric,
but many people use the terms routing and forwarding loosely, often using one
term for both. We’ll try to use the terms as defined here consistently in this chapter, but there is no real formal definition of either term.
The Internet is the largest router-based network in the world. Router-based networks,
as we’ll see in this chapter, are characterized by certain features and methods of
operation. The most obvious feature of a router-based network is that the most essential network nodes are routers and not bridges or switches or more exotic devices. This
does not mean that there are no bridges, switches, and other types of network devices.
It just means that routing is the most important function in moving packets from source
to destination. This chapter is an introduction to routing as a process.
Figure 8.1 shows the areas of the Illustrated Network we will be investigating in this
chapter. The LANs and customer-edge routers are highlighted, but the other routers
play a large but unseen part in this chapter. We’ll look at the role of the service-provider
routers in the chapters on routing protocols. For now, we’ll focus on how sending
devices decide whether the destination is on their own network or whether the packets must be sent to a router for forwarding through a routing network.
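The sending host's decision comes down to a single prefix test: is the destination covered by one of my own interface networks? Here is a minimal sketch using a LAN1 host, bsdclient (em0: 10.10.11.177/24), deciding between direct delivery and handing the packet to CE0.

import ipaddress

lan1 = ipaddress.ip_interface("10.10.11.177/24")    # bsdclient's em0 address and mask

def next_hop(dest: str) -> str:
    if ipaddress.ip_address(dest) in lan1.network:
        return f"{dest}: on LAN1, deliver directly"
    return f"{dest}: off-LAN, forward to router CE0 (10.10.11.1)"

print(next_hop("10.10.11.111"))   # winsvr1, same /24: direct delivery
print(next_hop("10.10.12.77"))    # bsdserver on LAN2: indirect delivery through CE0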
We’ll talk about forwarding tables in later chapters that investigate routing and routers more deeply. For now, let’s take a look at the simple routing tables that are used on
the Illustrated Network’s hosts and routers.
[Figure 8.1 diagram: the Illustrated Network again: LAN1 (Los Angeles office) behind router CE0 and LAN2 (New York office) behind router CE6, joined by the Ace ISP (AS 65459) and Best ISP (AS 65127) backbones (routers PE5, P9, P4, P2, P7, and PE1) across the global public Internet. Solid rules = SONET/SDH links, dashed rules = Gigabit Ethernet; all links use 10.0.x.y addressing, and only the last two octets are shown.]
FIGURE 8.1
The Illustrated Network LAN internetworking, showing how the routers are connected and the links available to forward (route) packets through the network.
220
PART II Core Protocols
Routing Table and Forwarding Table
There are really two different types of network tables used in routers and hosts,
and we’ll distinguish them in this chapter. The routing table holds all of the information that a device knows about network addresses and interfaces, and is usually
held in a fairly user-friendly format such as a standard set of tables or even a database, often with metrics (costs) associated with each route.
A forwarding table, on the other hand, is usually a machine-coded internal one that
contains the routes actually used by the device to reach destinations. In most cases,
the routing one holds more information than is distilled in the forwarding table.
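On a Juniper Networks router such as CE0, the two tables can actually be inspected separately. As a minimal sketch (commands only; the admin@CE0 prompt matches the captures used in this chapter, and the destination address is just an example):

admin@CE0> show route 10.10.12.52
admin@CE0> show route forwarding-table destination 10.10.12.52

The first command shows every route the routing table knows that could cover the destination; the second shows only the distilled entry the forwarding table would actually use.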
ROUTERS AND ROUTING TABLES
The router that attaches LAN1 to the world is CE0, a Juniper Networks router. Let’s look
at the information in the routing table on CE0.
admin@CE0> show route

inet.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

0.0.0.0/0          *[Static/5] 3d 02:59:20
                    > via ge-0/0/3.0
10.0.50.0/24       *[Direct/0] 2d 14:25:52
                    > via ge-0/0/3.0
10.0.50.1/32       *[Local/0] 2d 14:25:52
                      Local via ge-0/0/3.0
10.10.11.0/24      *[Direct/0] 2d 14:25:52
                    > via fe-1/3/0.0
10.10.11.1/32      *[Local/0] 2d 14:25:52
                      Local via fe-1/3/0.0

inet6.0: 5 destinations, 6 routes (6 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

::/0               *[Static/5] 2d 13:50:23
                    > via ge-0/0/3.0
fe80::/64          *[Direct/0] 2d 14:25:53
                    > via fe-1/3/0.0
fe80::205:85ff:fe88:ccdb/128
                   *[Local/0] 2d 14:25:53
                      Local via fe-1/3/0.0
fc00:fe67::/32     *[Static/5] 2d 13:50:23
                    > via ge-0/0/3.0
fc00:ffb3:d4:b::/64
                   *[Direct/0] 2d 10:45:08
                    > via fe-1/3/0.0
fc00:ffb3:d4:b:205:85ff:fe88:ccdb/128
                   *[Local/0] 2d 10:45:08
                      Local via fe-1/3/0.0
Because both IPv4 and IPv6 addresses are configured, we have both IPv4 and IPv6
routing tables. There’s a lot of information here that we’ll detail in later chapters on
routing protocols, so let’s just look at the basics of CE0’s routing tables. Only physical
addresses are used for now, on the LAN1 interface fe-1/3/0 and the Gigabit Ethernet
link to the provider routers, ge-0/0/3. Later, we’ll also assign an address to the router’s
loopback interface, but not in this example.
In both tables, there are local, direct, and static entries. Local entries are the
full 32- or 128-bit addresses configured on the interfaces. Direct entries are for the
network portions of the interface address, so they have prefixes shorter than 32
or 128 bits. For example, the entry for the fe-1/3/0 interface has a local entry of
10.10.11.1/32 and a direct entry of 10.10.11.0/24. Both were derived from the configuration of the address string 10.10.11.1/24 to the interface (technically, a string
like 10.10.11.1/24 is neither a 32-bit host address nor a 24-bit network address, but a
concatenation of address and network mask).
Static entries are entries that are placed in the routing table by the network administrator, and they stay there no matter what else the router learns about the network. In
this case, the static entry is also the default route, a type of “router of last resort” that
is used if no other entry in the routing table seems to represent the correct place to
forward the packet. The default route matches the entire IPv4 address space, so nothing
escapes the default. Note that the highlighted default route for IPv4 is 0.0.0.0/0 (or 0/0)
and sends packets out via interface ge-0/0/3 onto the service provider router network.
The local and direct entries for the ge-0/0/3 interface make up the last two entries
in this simple five-entry routing table. The default entry basically says to the router, “If
you don’t know where else to forward the packet, send it out here.” This seems trivial,
but only because router CE0 has only two interfaces. Backbone routers can have very
complicated routing tables.
Each route in the table has a preference associated with the route. A lower value means
the route is somehow “better” than another route to the same place having a higher
value. The value of 0 associated with local/direct entries means that no other route can be
a better way of reaching the locally attached interface, which only makes sense.
Routing table entries often have a metric associated with them. Why do routes
need both preferences and metrics? Preference indicates how the router knows about
a route; the metric assigns a cost of using the route, no matter how it was learned. Both
preference and metric are considered in determining the active route to a destination.
Generally, only active routes are loaded into the forwarding table. We’ll look at this
process more closely in the later chapters of routing. An asterisk (*) marks routes that
are both currently active and have been active the last time the router recomputed its
routes to use in the forwarding table.
There are no metrics in the CE0 routing tables. Why? Because metrics are usually
assigned by routing protocols and we don’t have any routing protocols running yet on
CE0. Static routes can be configured with metrics, but they still work fine without them.
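For reference, a static default route like the one in CE0's table is entered under the routing-options hierarchy of the Junos configuration. The following is only a hedged sketch: the next-hop address 10.0.50.1 is an assumption (the provider end of CE0's ge-0/0/3 link), the route shown in the table above actually resolves to the ge-0/0/3.0 interface, and the metric statement is optional, which is why no metric shows up in the display:

[edit routing-options]
admin@CE0# set static route 0.0.0.0/0 next-hop 10.0.50.1
admin@CE0# set static route 0.0.0.0/0 metric 10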
The six entries in the IPv6 routing table mimic the five entries in the IPv4 table,
and the default ::/0 static route is highlighted. The only unassigned or “extra” entry is
the fe80::/64 direct route (which is generated automatically) for the link-local prefix
for LAN1.
HOSTS AND ROUTING TABLES
Routers are not the only network devices that have routing tables. Hosts have them
as well. It’s how they know whether to send a packet inside a frame directly to the
destination or to send the packet and frame to a router so it can be forwarded to its
destination.
The following code block shows what the routing table on bsdserver looks like. We
can display it with the netstat –r command (the r option displays network statistics
about the routing table). We’ll use netstat –nr in this chapter because the n option forces
the output to use IP addresses instead of DNS names. This is a good practice because
when trouble strikes the network, chances are that DNS will be down (or provides the
wrong information), so it’s best to get used to seeing IP addresses in these reports.
bsdserver# netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags    Refs    Use   Netif Expire
default            10.10.12.1         UGSc        0      0     em0
10.10.12/24        link#1             UC          0      0     em0
localhost          localhost          UH          0    144     lo0

Internet6:
Destination                       Gateway                    Flags   Netif Expire
localhost.booklab.                localhost.booklab.         UH      lo0
fe80::%em0                        link#1                     UC      em0
fe80::20e:cff:fe3b                00:0e:0c:3b:87:32          UHL     lo0
fe80::%lo0                        fe80::1%lo0                Uc      lo0
fe80::1%lo0                       link#4                     UHL     lo0
fc00::                            link#1                     UC      em0
fc00::20e:cff:fe3b                00:0e:0c:3b:87:32          UHL     lo0
fc00:fe67:d4:b::                  link#1                     UC      em0
fc00:fe67:d4:b:205                00:05:85:8b:bc:db          UHLW    em0
fc00:fe67:d4:b:20e                00:0e:0c:3b:87:32          UHL     lo0
ff01::                            localhost.booklab.         U       lo0
ff02::%em0                        link#1                     UC      em0
ff02::%lo0                        localhost.booklab.         UC      lo0
The IPv4 routing table is even simpler than the CE0 router’s, which we might have
expected, because the host only has one interface (em0). The third entry (localhost)
is for the loopback interface (lo0), so there are really only two. The 10.10.12/24 entry
points to link#1, which is the em0 interface that attaches bsdserver to LAN2. It says
Gateway above the column, but it really means “what is the next hop for this packet?”
Why does it say “gateway” and not “router”? Because technically it is a gateway, not a
router. A gateway, as mentioned before, connects one or more LANs to the Internet (and
can route from LAN to LAN, not just onto or off of the Internet). A router, on the other
hand, can have nothing but other routers connected to it. People speak very loosely, of
course, and usually the terms “gateway” or “router” can be used without confusion.
So the default entry does point to a router, in this case CE6, which is the gateway
to the world on LAN2. The Refs and Use columns are usage indicators, and there is no
Expire value because this information, as on router CE0, was not learned via a routing
protocol and therefore will not get “stale” and need to be refreshed.
The flags commonly seen in FreeBSD follow:
■ U (Up)—The route is the active route.
■ H (Host)—The route destination is a single host.
■ G (Gateway)—Send packets for this destination here, and it will figure out where to forward it.
■ S (Static)—A manually configured route that was not generated by protocol or other means.
■ C (Clone)—Generates a new route based on this one for devices that we connect to. Normally used for the local network(s).
■ W (Was cloned)—A route that was autoconfigured based on a LAN clone route.
■ L (Link)—The route references hardware.
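An entry with the S flag was placed in the table by hand. A hedged example of how such a route might be added on a FreeBSD host like bsdserver (sending traffic for LAN1 through the LAN2 router at 10.10.12.1, an assumption made just for illustration):

bsdserver# route add -net 10.10.11.0/24 10.10.12.1
bsdserver# netstat -nr

The new entry should then appear in the IPv4 table with the U, G, and S flags set.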
Although listed as default, the actual entry value for the default route is 0.0.0.0/0 or
0/0. We can force numeric displays in netstat by using the n option, but we won’t use
that here (generally, the fewer options you have to remember to use, the better).
Where’s the Metric?
Note the netstat –nr on the host did not display any metric values, and show route
on the router didn’t either. In the case of CE0, that was explained by the fact
that we have no routing protocol running to provide metrics for routes (destination networks). But even if a routing protocol were running, netstat never shows
any metrics associated with routes. Does that mean hosts have no metrics or do
not bother to compute them? Not necessarily, as we’ll soon see in the case of
Windows XP.
Why is the Internet6 routing table so much larger than either the Internet (IPv4)
table on bsdserver or the tables on router CE0? It is larger because of the IPv6 neighbor
discovery feature that populates the table with all of the local IPv6 hosts on LAN2. An
easy way to spot them is by their MAC addresses in the Gateway column. There are also
a number of link-local (fe80) and private (fc00) entries absent in IPv4, as well as multicast
addresses beginning with ff.
Let’s look at the routing table on lnxclient for comparison. We don’t have IPv6
running, so the table includes the IPv4 address only. Most of the information is the same
as in FreeBSD, just arranged differently.
[root@lnxclient admin]# netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.10.12.0      *               255.255.255.0   U         0 0          0 eth0
127.0.0.0       *               255.0.0.0       U         0 0          0 lo
default         10.10.12.1      0.0.0.0         UG        0 0          0 eth0
[root@lnxclient admin]#
The Gateway column has asterisks because we don’t have DNS running and
the address is the same as the Destination. Only the default gateway entry
(10.10.12.1) is different than the entry (0.0.0.0/0). Instead of prefixes, lnxclient
uses netmask (Genmask) notation for the table entries, but either way, the network is
10.10.12.0/24.
The flags used in Linux follow (note the slightly different meanings compared to
FreeBSD):
■ G (Gateway)—The route uses a gateway.
■ U (Up)—The interface to be used is up.
■ H (Host)—Only a single host can be reached by the route.
■ D (Dynamic)—The route is not a static route, but a dynamic route learned by a routing protocol.
■ M (Modified)—This flag is set if the entry was changed by an ICMP redirect message.
■ ! (Exclamation)—The route will reject (drop) all packets sent to it.
Linux hosts also list maximum segment size (MSS), window size, and initial round-trip time (irtt) values with each route, but these are not IP parameters.
They’re most useful for TCP, and we’ll talk about them in the TCP chapter. And confusingly, a value of 0 in these columns does not mean that their values are zero (which
would make for an interesting network), but that the defaults are used. The Iface
column shows the interface used to reach the destination address space, with lo being
loopback.
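The same kind of manual entry can be made on a Linux host such as lnxclient with the classic route command; a sketch, assuming the LAN2 router 10.10.12.1 as the gateway:

[root@lnxclient admin]# route add -net 10.10.11.0 netmask 255.255.255.0 gw 10.10.12.1 dev eth0
[root@lnxclient admin]# netstat -nr

The new route should show the G flag, but not D or M, because it was neither learned dynamically nor modified by an ICMP redirect.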
Finally, Windows hosts have routing tables as well. You can display the routing table contents with the route print command or with the same netstat –nr command used in Unix-based systems. This output is from wincli1 and lists only the IPv4 routes.
C:\Documents and Settings\Owner>route print

Route Table
===========================================================================
Interface List
0x1 ........................... MS TCP Loopback interface
0x2 ...00 0e 0c 3b 88 3c ...... Intel(R) PRO/1000 MT Desktop Adapter -
                                Packet Scheduler Miniport
===========================================================================
===========================================================================
Active Routes:
Network Destination          Netmask          Gateway      Interface   Metric
          0.0.0.0          0.0.0.0       10.10.11.1     10.10.11.51       10
      10.10.11.51  255.255.255.255        127.0.0.1       127.0.0.1       10
   10.255.255.255  255.255.255.255      10.10.11.51     10.10.11.51        1
        127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1        1
        224.0.0.0        240.0.0.0      10.10.11.51     10.10.11.51       10
  255.255.255.255  255.255.255.255        127.0.0.1       127.0.0.1        1
Default Gateway:       10.10.11.1
===========================================================================
Persistent Routes:
  Network Address          Netmask  Gateway Address  Metric
       10.10.12.0    255.255.255.0       10.10.11.1       1
The table looks different, yet is still very familiar. There is an entry for the default
gateway (10.10.11.1), which is also listed separately for emphasis. One oddity is the
classful broadcast address entry (10.255.255.255), but this can be changed. There
are explicit loopback (127.0.0.0/8) and multicast (224.0.0.0/4) entries, and a
255.255.255.255/32 entry, as well as one for the host itself (10.10.11.51/32), which points to the loopback interface.
Instead of relying on a flag, Windows just shows you Active Routes. But there is also
a Persistent Route that is always in the table, no matter what. This was entered in the
table manually, like a static route, and makes sure that any packets sent to LAN2 go to
the router at 10.10.11.1. It would still work with only a default route, but this shows
how a static route shows up in Windows.
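Although the capture does not show the command that created it, a persistent route like this is normally added with the -p switch of the Windows route command; a hedged sketch of what was likely typed on wincli1:

C:\Documents and Settings\Owner>route -p add 10.10.12.0 mask 255.255.255.0 10.10.11.1 metric 1

Without -p the route would vanish at the next reboot; with it, Windows stores the route and reinstalls it every time the host starts.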
Note that even though no routing protocol is running in the host, wincli1 assigns
metrics to all the routes. These can be changed, but they are always there. But what
about when netstat –nr is used on the Windows host? We didn’t see any metrics on
the Unix-based systems. Take a look at what we get with netstat –nr.
This output is from wincli1 and lists only the IPv4 routes.
C:\Documents and Settings\Owner>netstat -nr

Route Table
===========================================================================
Interface List
0x1 ........................... MS TCP Loopback interface
0x2 ...00 0e 0c 3b 88 3c ...... Intel(R) PRO/1000 MT Desktop Adapter -
                                Packet Scheduler Miniport
===========================================================================
===========================================================================
Active Routes:
Network Destination          Netmask          Gateway      Interface   Metric
          0.0.0.0          0.0.0.0       10.10.11.1     10.10.11.51       10
      10.10.11.51  255.255.255.255        127.0.0.1       127.0.0.1       10
   10.255.255.255  255.255.255.255      10.10.11.51     10.10.11.51        1
        127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1        1
        224.0.0.0        240.0.0.0      10.10.11.51     10.10.11.51       10
  255.255.255.255  255.255.255.255        127.0.0.1       127.0.0.1        1
Default Gateway:       10.10.11.1
===========================================================================
Persistent Routes:
  Network Address          Netmask  Gateway Address  Metric
       10.10.12.0    255.255.255.0       10.10.11.1       1
That’s right—the output is identical, and does show the metrics. However, Windows
appears to be the only implementation that shows the metrics associated with routes
when netstat is used.
Let’s take a more detailed look at how routing tables are used to determine whether
packets should be sent to the destination directly or to a router for forwarding. We’ll
see how IP and MAC addresses are used in the packets and frames as well.
DIRECT AND INDIRECT DELIVERY
When routers are used to connect or segment Ethernet LANs, the Ethernet frame that
leaves a source may or may not be the same frame that arrives at the destination. If the
source and destination host are on the same LAN, then a method sometimes known
as direct delivery is used and the frame is delivered locally. This means that the source
and destination MAC addresses are the same in the frame that is sent from the source
and in the frame that arrives at the destination.
Let’s see if we can verify that frames are delivered locally, without a router, when
the IP address prefix is the same on the destination and on the source. In this case, the
MAC addresses on the frame that leave the source and the ones in the frame that arrive
at the destination should be the same.
We can also check and make sure that the frames use different MAC addresses
when the source and destination hosts are on different IP networks and the frames
pass through a router. We can even check and make sure that the frames came from
the router.
First, let’s use the Windows client and server (which are located in pairs on the two
LANs) to generate some packets to capture with Ethereal. We’ll use a little utility called
“ping” (discussed more fully in Chapter 7) to bounce some packets off the Windows
IPv4 addresses.
Ethereal is running on wincli2. When we send some pings to the client (10.10.
12.222) from the Windows server (10.10.12.52), what we see is shown in Figure 8.2.
The MAC address 00:02:b3:27:fa:8c is associated with IPv4 address 10.10.12.222,
and the MAC layer address 00:0e:0c:3b:88:56 is associated with IPv4 address
10.10.12.52. If we looked at the same stream of pings on the server, the MAC address
and IP address associations would be the same. The frame sent is the same as the one
that arrives.
FIGURE 8.2
MAC addresses and direct delivery. Note that the MAC layer addresses in the frame that is sent are the same as in the frame that will arrive at the destination.

What about a packet sent to other IP networks? We’ll use a little “echo” client and server utility on the Linux hosts to generate the frames for this exercise. We’ll say more
about where this little utility came from in the chapter on sockets (Chapter 12). For now,
just note that this is not the usual Linux echo utility bundled with most distributions. With
this utility, we can invoke the server on the lnxserver host and use the client to send a
simple string to be echoed back by the server process. We’ll use tethereal (the text version of Ethereal) this time, just to show that the same information is available in either
the graphical or text-based version.
First, we’ll run the Echo server process, which normally runs on port 7, on port
55555:
[root@lnxserver admin]# ./Echo 55555
We have to run tethereal on each end too, if we want to compare frames. The command is the same on the client and server. We’ll use the verbose (-V) switch to see the MAC layer information as packets arrive.
[root@lnxserver admin]# /usr/sbin/tethereal -V
Capturing on eth0
Now we can invoke the Echo client to bounce the string TESTING123 off the server
process.
[root@lnxclient admin]# ./Echo 10.10.11.66 TESTING123 55555
Received: TESTING123
[root@lnxclient admin]#
What did we get? Let’s look at the frames leaving the client. We only need to examine
the Layer 2 and IP address information.
[root@lnxclient admin]# /usr/sbin/tethereal -V
Capturing on eth0
Frame 1 (74 bytes on wire, 74 bytes captured)
Arrival Time: May 5, 2008 13:39:34.102363000
Time delta from previous packet: 0.000000000 seconds
Time relative to first packet: 0.000000000 seconds
Frame Number: 1
Packet Length: 74 bytes
Capture Length: 74 bytes
Ethernet II, Src: 00:b0:d0:45:34:64, Dst: 00:05:85:8b:bc:db
Destination: 00:05:85:8b:bc:db (Juniper__8b:bc:db)
Source: 00:b0:d0:45:34:64 (Dell_45:34:64)
Type: IP (0x0800)
Internet Protocol, Src Addr: 10.10.12.166 (10.10.12.166), Dst Addr: 10.10.11.66
(10.10.11.66)
Version: 4
Header length: 20 bytes... [much more information not shown]
We can see that the Ethernet frame leaving the Linux client has source MAC address 00:b0:d0:45:34:64 and destination MAC address 00:05:85:8b:bc:db. The packet inside the frame has the source IPv4 address 10.10.12.166 and destination address 10.10.11.66, as expected.
How do we know that the destination MAC address 00:05:85:8b:bc:db is not associated with the destination address 10.10.11.66? We can simply look at the frame that arrives at the Linux server.
[root@lnxserver admin]# /usr/sbin/tethereal -V
Capturing on eth0
Frame 1 (74 bytes on wire, 74 bytes captured)
Arrival Time: May 5, 2008 13:39:34.104401000
Time delta from previous packet: 0.000000000 seconds
Time relative to first packet: 0.000000000 seconds
Frame Number: 1
Packet Length: 74 bytes
Capture Length: 74 bytes
Ethernet II, Src: 00:05:85:88:cc:db, Dst: 00:d0:b7:1f:fe:e6
Destination: 00:d0:b7:1f:fe:e6 (Intel_1f:fe:e6)
Source: 00:05:85:88:cc:db (Juniper__88:cc:db)
Type: IP (0x0800)
Internet Protocol, Src Addr: 10.10.12.166 (10.10.12.166), Dst Addr: 10.10.11.66
(10.10.11.66)
Version: 4
Header length: 20 bytes...(much more information not shown)
Note that the frame arriving at 10.10.11.66 has the MAC address 00:d0:b7:1f:fe:e6,
which is not the one used as the destination MAC address in the frame leaving the
10.10.12.166 client (that address is 00:b0:d0:45:34:64).
Table 8.1 Frame IP and MAC Addresses

                      MAC Source          IP Source        MAC Destination     IP Destination
                      Address             Address          Address             Address
Frame leaving         00:b0:d0:45:34:64   10.10.12.166     00:05:85:8b:bc:db   10.10.11.66
client                (Linux client)      (Linux client)   (Juniper router)    (Linux server)
Frame arriving        00:05:85:88:cc:db   10.10.12.166     00:d0:b7:1f:fe:e6   10.10.11.66
at server             (Juniper router)    (Linux client)   (Linux server)      (Linux server)
Now, if the MAC address associated with the frame leaving the 10.10.12.166 client
is 00:b0:d0:45:34:64, then the MAC address associated with the same IP address on the
server LAN cannot magically change to 00:05:85:88:cc:db. As expected, the IP packet
is identical (except for the decremented TTL field), but the frame is different. This is
sometimes called indirect delivery of packets because the packet is sent through one
or more network nodes and not directly to the destination.
These relationships are displayed in Table 8.1, which shows how the MAC addresses
relate to the IP subnet addresses.
Tethereal not only gives the MAC addresses, but also parses the 24-bit OUI and helpfully lists Intel as the owner of 00:d0:b7 and Juniper as the owner of 00:05:85. We can
verify this on the Linux client or server. Let’s look at the client’s ARP cache.
[root@lnxclient admin]# /sbin/arp -a
? (10.10.12.1) at 00:05:85:8b:bc:db [ether] on eth0
[root@lnxclient admin]#
The question mark (?) just means that our routers do not have names in DNS.
The Illustrated Network uses two small LAN switches for LAN1 and LAN2, but the
nodes used for internetworking are routers. Let’s take a closer look at just what a router
does and how it delivers packets from LAN to LAN over an internetwork.
Routing
Routing is done entirely with IP addresses, of course. Many books make extensive use
of the concepts of direct routing and indirect routing of packets. This can be confusing, since direct “routing” of packets does not require a router. In this chapter, the terms
direct delivery and indirect delivery are used instead. A host can use direct delivery to
send packets directly to another host, perhaps using a VLAN, or use indirect delivery if
the destination host is reachable only through a router.
How does the source host know whether the destination host is reachable through
direct (local) delivery or indirect (remote) delivery through a router? The answer has
a lot to do with the way bridges and routers differ in their fundamental operation, and
how routers use the IP address to determine how to handle packets. Here’s an example
using the Illustrated Network’s actual MAC and IP addresses.
Direct Delivery without Routing
Let’s look at a packet sent from wincli1 on LAN1 to winsvr1. Both of these hosts are
on LAN1, so no routing is needed. The IPv4 addresses are 10.10.11.51 for wincli1 and
10.10.11.111 for winsvr1, and both use the same 255.255.255.0 mask. Therefore, both
addresses have the same network portion of the IPv4 address, 10.10.11.0/24.
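The comparison itself is just a bitwise AND of each address with the mask. As a quick worked example with the two hosts above:

10.10.11.51  AND 255.255.255.0 = 10.10.11.0
10.10.11.111 AND 255.255.255.0 = 10.10.11.0

The two results match, so the destination is on the local network.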
The host software knows that no router is needed to handle a packet sent from the
source host to the destination host because the IP addresses of the source and destination hosts have the same IP network portion (prefix) in both source and destination
IP addresses. This is a simple and effective way to let hosts know whether they are on
the same LAN. The packet can be placed in a frame and sent directly to the destination
using the local link. This is shown in Figure 8.3.
In Figure 8.3, a packet is followed from client to server when both are on the
same LAN segment and there is no router between client and server. All direct delivery
means is that the packet and frame do not have to pass through a router on the way
from source to destination.
The TCP/IP protocol stack on the client builds the TCP header and IP header. In Figure 8.3, the IP packet is placed inside an Ethernet MAC frame. The MAC source and destination addresses are shown as well. The client knows its own MAC address, and if the server’s MAC address is not cached, an ARP broadcast message that asks, “Who has IP address 10.10.11.111?,” is used to determine the MAC address of the server.

[Figure 8.3 shows the sender’s steps on wincli1: (1) Is the server on the same subnet? Yes. (2) ARP for the IP address of the server. (3) Use the ARP response to determine the MAC address for the frame. (4) Build the packet and frame and send. The frame is addressed to winsvr1’s MAC address (00:0e:0c:3b:87:36) and the packet to 10.10.11.111; the router (MAC 00:05:85:88:cc:db) ignores the frame because it is not addressed to it.]

FIGURE 8.3
Direct delivery of packets on a LAN. Note that the MAC address does not change from source to destination, and that the router ignores the frame.
The source host knew to ask for the MAC address of the destination host because
the destination host is on the same LAN as the source. Hosts with the same IP network
addresses must be on the same LAN segment. Destination hosts on the same LAN are
simply “asked” to provide their MAC addresses. The destination MAC address in the
frame is the MAC address that corresponds to the destination IP address in the IP
packet inside the MAC frame.
What would be different when the client and server are on different LANs and must
communicate through a router?
Indirect Delivery and the Router
It is one thing to say that the router is the network node of the Internet, but exactly
what does this mean? What is the role of the router on the Internet? Routers route IP
packets to perform indirect delivery (through the forwarding) of packets from source
to destination.
Unlike direct delivery, where the packets are sent between devices on the same LAN,
indirect delivery employs one or more routers to connect source and destination. The
source and destination could be near in terms of distance, perhaps on separate floors
of the same building. All that really matters is whether there is a router between source
and destination or not.
Figure 8.4 shows a simple network consisting of two LANs connected by routers. The
routers are connected by a serial link using PPP, but SONET would do just as well. Of
course, the Internet consists of thousands of LANs and routers, but all of the essentials
of routing can be illustrated with this simple network.
The routing network has been simplified to emphasize the architectural features
without worrying about the details. The routers are just Router 1 and Router 2, not CE0
and CE6. But the LANs are still LAN1 and LAN2, and we’ll trace a packet from wincli1
on LAN1 to winsvr2 on LAN2.
Both LAN segments in Figure 8.4 are implemented with Ethernet hubs and
unshielded twisted pair (UTP) wiring, but are shown as shared media cables, just to
make the adjacencies clearer. Each host in the figure has a network interface card (NIC)
installed. It is important to realize that it is the interface that has the IP address, not the
entire host, but in this example each host has only one interface. However, the routers
in the figure have more than one network interface and therefore more than one IP
network address. A router is a network device that belongs to two or more networks
at the same time, which is how they connect LANs. A typical router can have 2, 8, 16,
or more interfaces. Each interface usually gets an IP address and typically represents a
separate “network” as the term applies to IP, but there are exceptions.
Each NIC in a host or router has a MAC address, and these are given in Figure 8.4. The
routers are only shown with network layers and IP layers, because that’s all they need
for packet forwarding (most routers do have application layers, as we have seen).
Because the routers in this example are in different locations, they are connected by a serial link. The serial link is running PPP and packets are placed inside PPP frames on this link between the routers. There is no need for global uniqueness on serial ports, since they are point-to-point links in the example, so each is called “S1” (Serial1) at the network layer. They don’t even require IP addresses, but these are usually provided to make the link visible to network management and make routing and forwarding tables a lot simpler.

[Figure 8.4 shows the simplified internetwork: LAN1 (IP network 10.10.11/24) with wincli1 (10.10.11.51, MAC 00:0e:0c:3b:88:3c) and winsvr1 (10.10.11.111, MAC 00:0e:0c:3b:87:36) attaches to Router1 (10.10.11.1, MAC 00:05:85:88:cc:db); LAN2 (IP network 10.10.12/24) with wincli2 (10.10.12.222, MAC 00:02:b3:27:fa:8c) and winsvr2 (10.10.12.52, MAC 00:0e:0c:3b:88:56) attaches to Router2 (10.10.12.1, MAC 00:05:85:8b:bc:db); the two routers are joined by their S1 serial ports (10.0.99.1 and 10.0.99.2) over a PPP serial link.]

FIGURE 8.4
Indirect delivery using a router. Note the different MAC and link-level addresses in place between source and destination.
All of the pieces are now in place to follow a packet between client and server on
the “internetwork” in Figure 8.4 using indirect delivery of packets with routers. Let’s
see what happens when a client process running on wincli1 wants to send a packet to
a server process running on winsvr2. The application is unimportant. What is important is that the source host knows that the destination host (server) is not on the same
LAN. Once the IP address of the server is obtained, it is obvious to the source that the
destination IP network address (10.10.12.52) is different than the source IP network
address (10.10.11.51).
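Running the same mask-and-compare check as in the direct delivery case makes the decision mechanical:

10.10.11.51 AND 255.255.255.0 = 10.10.11.0
10.10.12.52 AND 255.255.255.0 = 10.10.12.0

The network portions differ, so the packet must be handed to a router.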
The source client software now knows that the packet going to 10.10.12.52 must
be sent through at least one router, and probably several routers, using indirect delivery. It is called indirect delivery (or indirect routing) because the packet destination
address is the destination IP address of winsvr2, but the initial frame destination address is the MAC address of Router1. The packet is sent indirectly to the destination host inside a frame sent to the router. The address fields of the frame and packet constructed and sent on the LAN by wincli1 are shown in Figure 8.5.

[Figure 8.5 shows the Ethernet frame (trailer not shown) carrying the packet and its data segment: destination MAC address 00:05:85:88:cc:db (Router1), source MAC address 00:0e:0c:3b:88:3c (wincli1), destination IP address 10.10.12.52 (winsvr2), and source IP address 10.10.11.51 (wincli1).]

FIGURE 8.5
Frame and packet sent to Router1, showing source and destination IP and MAC addresses.
Note that the frame is sent to Router1’s MAC address (00:05:85:88:cc:db), but the
packet is sent to 10.10.12.52 (winsvr2). This is how routing works. (Bridged traffic, or direct delivery even in a routed network, always uses frames in which the destination MAC address belongs to the device that holds the destination IP address.)
How did the source host, wincli1, know the MAC address of the correct router?
There could be several routers on a LAN, if for no other reason than redundancy. All that
wincli1 did was use the routing table to look up the IP address of the destination. But
there’s no specific entry for a network associated with 10.10.12.52. However, TCP/IP
configuration on a host often includes configuration of at least one default gateway
to be used when packets must leave the local LAN. The default gateway (a router in
this case) can be set statically, or dynamically using the Dynamic Host Configuration
Protocol (DHCP), or even other ways. In this example network, the default gateway IP
address has been entered statically when the host was configured for TCP/IP.
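On the Windows hosts of this example the default gateway normally comes from the TCP/IP properties dialog, but the equivalent static entry can also be made from the command line; a hedged sketch for wincli1:

C:\Documents and Settings\Owner>route add 0.0.0.0 mask 0.0.0.0 10.10.11.1

If DHCP were used instead, the same 10.10.11.1 value would simply arrive in the Router option of the DHCP lease.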
Since the default gateway is by definition on the same LAN as the source host (they
share the same IP address prefix), the source host can just send an ARP to get the MAC
address of the interface on the router attached to that LAN. Note that the IP address of
the router is used only to get the MAC address of the router, not so that the source host
wincli1 can send packets to the router (the packets are being forwarded to winsvr2).
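The result of that exchange ends up in the host’s ARP cache. On wincli1, arp -a should list an entry along these lines for the default gateway (format abbreviated; the MAC value is Router1’s LAN1 interface, the same as CE0’s in Figure 8.1):

C:\Documents and Settings\Owner>arp -a
  Internet Address      Physical Address      Type
  10.10.11.1            00-05-85-88-cc-db     dynamic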
When this packet is sent, the router pays attention to the frame when it arrives,
but winsvr1 ignores it (the frame is not for 00:0e:0c:3b:87:36). Router1 looks at the
packet inside the frame and knows that the destination host is not directly connected
to Router1. The next hop to the destination is another router. How does Router1
know? In much the same way as wincli1: Router1 compares the destination IP address
to the IP addresses assigned to its local interfaces. These are 10.10.11.0/24 and
10.0.99.0/24. The packet’s destination IP address of 10.10.12.0/24 does not belong
to either of the two networks local to Router1.
However, a router can have many interfaces, not just the two in this example. Which
output port should the router use to forward the packet? The network portion of the IP
address is looked up in the forwarding table according to certain rules to find out the IP address of the next-hop router and the output interface leading to this router. (In practice, Router1 might simply have a default route pointed at the serial WAN interface.) The rules used for these lookups will be discussed in more detail in a later chapter. For now, assume that Router1 finds out that the next hop for the packet to winsvr2 is Router2, and that Router2 is reached on serial port S1.

[Figure 8.6 shows the Ethernet frame (trailer not shown) carrying the packet on LAN2: destination MAC address 00:0e:0c:3b:88:56 (winsvr2), source MAC address 00:05:85:8b:bc:db (Router2), destination IP address 10.10.12.52, and source IP address 10.10.11.51.]

FIGURE 8.6
Frame sent by Router2 to winsvr2, showing source and destination IP and MAC addresses.
Router1 now encapsulates the packet from wincli1 to winsvr2 inside a PPP frame
for transport on the serial link. Another key feature distinguishing routers from bridges,
as we have seen, is an IPv4 router’s ability to fragment a packet for transport on an output link. Fragmentation depends on every router knowing the maximum transmission
unit (MTU) frame size for the link types on all of the router’s interfaces. Ethernet LANs,
for example, all have an MTU size of 1500 bytes (1518 bytes, including the LAN frame
header). Serial links usually have MTU sizes larger than that, so this example assumes that
Router1 does not have to fragment the content of the packet it received from the LAN.
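One way to see MTU limits from a host is to send pings that are not allowed to be fragmented. A hedged example from a Linux host such as lnxclient: 1472 bytes of ICMP payload plus the 8-byte ICMP header and the 20-byte IPv4 header exactly fill a 1500-byte Ethernet MTU, so the first command below should succeed while the second should fail with a “message too long” error:

[root@lnxclient admin]# ping -M do -s 1472 10.10.11.66
[root@lnxclient admin]# ping -M do -s 1473 10.10.11.66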
When the packet sent by wincli1 to winsvr2 arrives at Router2 on the serial link from
Router1, Router2 knows that the next hop for this packet is not another router. Router2
can deliver the packet directly to winsvr2 using direct delivery. How does it know?
Because the network portion of the IP address in the packet destination, 10.10.12.52/24,
is on the same network as the router on one of its interfaces, 10.10.12.1/24. In brief, it
has a route that covers the destination network on one of its interfaces.
The frame containing the packet is sent onto the LAN with the structure shown in
Figure 8.6. Note that in this case the MAC address of the source is Router2, and the MAC
address of the destination is the MAC address of winsvr2. Again, Router2 can always use
ARP to get the MAC address associated with IP address 10.10.12.52 if the MAC address
of the destination host is not in the local ARP cache on the router. The source and destination IP addresses on the packet do not change in this example, of course. Winsvr2
must be able to reply to the sender, wincli1 in this case. (We’ll talk about cases using
NAT, when the source and destination packet addresses do and must change, in the
chapter on NAT.)
It is assumed that there is no problem with MTU sizes in this example. However,
MTU sizes are often important, especially when the operational differences between
IPv4 and IPv6 routers, when it comes to fragmentation, are considered.
QUESTIONS FOR READERS
Figure 8.7 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 8.7
The routing table output from router CE0 (IPv4 only) and host bsdserver.
1. What is the difference between a routing table and a forwarding table?
2. In the IPv6 routing table for router CE0, what is the IPv6 address associated with
interface ge-0/0/3?
3. In the IPv6 routing table for router CE0, what is the precise IP address value of the
default route for IPv4 and IPv6?
4. Why are there so many entries in the IPv6 host routing table on bsdserver?
5. What is a “persistent” route? What is a “static” route?
CHAPTER 9
Forwarding IP Packets
What You Will Learn
In this chapter, you will learn how routers forward IP packets. We’ll start with
the logical steps a router follows to forward (“route”) a packet out the next-hop
interface. Then we’ll look at router architectures to see how specialized devices
(there are “software-only” routers) accomplish routing and forwarding.
Finally, you will learn about how IPv4 routers transition to handling IPv6 routing
and various methods to tunnel IPv6 packets through links connected by IPv4-only
routers. Tunnels were introduced in Chapters 3 and 4 and occur when the normal
encapsulation sequence of packet-inside-frame is violated in some fashion.
This chapter is really a continued investigation into many of the concepts introduced
in the previous chapter. Figure 9.1 highlights the network components we’ll be working with in this chapter.
The routers on our network are Juniper Networks routers. These routers have a
different “look and feel” compared to other routers, most of which use a more “Cisco-like” interface and display. For example, the routing tables seem very long and detailed
compared to Cisco routers’ default displays.
admin@CE6> show route 10.10/16

inet.0: 34 destinations, 35 routes (34 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

10.10.11.0/24      *[OSPF/10] 1w5d 18:25:05, metric 6
                    > via ge-0/0/3.0
10.10.12.0/24      *[Direct/0] 2w2d 00:15:44
                    > via fe-1/3/0.0
10.10.12.1/32      *[Local/0] 2w2d 00:15:44
                      Local via fe-1/3/0.0
We’ll talk about the routing table entry marked Open Shortest Path First (OSPF) in
Chapter 14. This route was learned by a routing protocol running between the routers
on our network, and we’ll see how OSPF is configured in a later chapter. Note that
the entry has a preference of 10 (which makes it more “costly” to use than direct/local interface routes [0] or static routes [5]). Traffic to destinations on LAN1 is sent to PE1 over the ge-0/0/3 interface. A preference is distinct from the metric or cost of a route itself; preference applies to routes learned in different ways.

[Figure 9.1 repeats the Illustrated Network topology from Figure 8.1: the LAN1 and LAN2 hosts, the customer-edge routers CE0 and CE6, and the service-provider routers of Ace ISP (AS 65459) and Best ISP (AS 65127), with solid rules for SONET/SDH links, dashed rules for Gigabit Ethernet, and 10.0.x.y link addressing showing only the last two octets.]

FIGURE 9.1
Forwarding packets across the network. Note that we’ll be using the customer-edge routers CE0 and CE6 in this chapter.
We can make the routing table display more Cisco-like by using the terse option:
admin@CE6> show route 10.10/16 terse

inet.0: 34 destinations, 35 routes (34 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

A Destination        P Prf   Metric 1   Metric 2  Next hop         AS path
* 10.10.11.0/24      O  10          6             >ge-0/0/3.0
* 10.10.12.0/24      D   0                        >fe-1/3/0.0
* 10.10.12.1/32      L   0                        Local
The asterisk (*) means the route is active (used for forwarding), and the P field is for
protocol. One metric is used (two are allowed), the next-hops are the same (thankfully!),
and we’ll talk about what an AS path is in the chapter on the BGP routing protocol.
Let’s use traceroute to see which routers CE6 uses to reach LAN1, attached to
router CE0 at interface 10.10.11.1.
admin@CE6> traceroute 10.10.11.1
traceroute to 10.10.11.1 (10.10.11.1), 30 hops max, 40 byte packets
 1  10.0.16.1 (10.0.16.1)    0.743 ms  0.681 ms  0.573 ms
 2  10.0.12.2 (10.0.12.2)    0.646 ms  0.647 ms  0.620 ms
 3  10.0.24.2 (10.0.24.2)    0.656 ms  0.664 ms  0.632 ms
 4  10.0.45.2 (10.0.45.2)    0.690 ms  0.677 ms  0.695 ms
 5  10.10.11.1 (10.10.11.1)  0.846 ms  0.819 ms  0.775 ms
Each router handles the three-packet set generated by the source (CE6) in one of
three ways:
1. If the packet is not for this router (the device does not have 10.10.11.1 configured
locally), and the TTL is 1 or 0, then the router creates an ICMP Time-Exceeded
message, sets the source address to the router’s receiving interface address, sets
the destination address to the source’s, and sends the ICMP packet out the interface listed as the route back to the source in the forwarding table. This does not
have to be the same as the receiving interface, but it usually is.
2. If the packet is not for this router and the TTL is not 1 or 0, then the router decrements the TTL field and forwards the packet out the interface leading to the
next hop on the way to the destination address.
3. If the packet is for this router or device, then it sends back an ICMP Port
Unreachable message.
Why a TTL of 1 or 0? Some routers decrement the TTL immediately and others only
as part of the forwarding process, right before output queuing. This way both types of
router handle the packet consistently.
When the source receives a Time-Exceeded message, it records the results of the
round-trip time for the three packets, checks to see if it has a DNS entry for the IP
address, and prints a line of output with a “hop” number and the rest of the statistics.
When it receives a Port Unreachable message, the traceroute utility prints the final
results and exits.
Because we don’t yet have DNS running, all the IPv4 addresses are repeated twice.
From the network diagram, we can see that the packets flowed from CE6 to PE1 (not
surprisingly) at 10.0.16.1 and then through P2 (10.0.12.2), P4 (10.0.24.2), PE5
(10.0.45.2) and on to CE0 (10.10.11.1, the local interface target, is used instead of
10.0.50.2). (We’ll see what happens when one of the P routers or links between them
fails in a later chapter.)
We have IPv6 running on the LANs and routers CE0 and CE6. Let’s see what happens
on CE6 when we ping the LAN1 interface address four times using the LAN2 interface
IPv6 source address. Recall that the private ULA IPv6 addresses on LAN1 start with
fc00:ffb3:d5:a.
admin@CE6> ping count 4 inet6 source fc00:fe67:d4:b:205:85ff:fe8b:bcdb fc00:ffb3:d5:a:205:85ff:fe88:ccdb
PING6(56=40+8+8 bytes) fc00:fe67:d4:b:205:85ff:fe8b:bcdb --> fc00:ffb3:d5:a:205:85ff:fe88:ccdb

--- fc00:ffb3:d5:a:205:85ff:fe88:ccdb ping6 statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
What happened? Well, for one thing, we have no routes to any IPv6 addresses on
LAN1 in the IPv6 routing table. And if they’re not in the routing table, they won’t be in
the forwarding table.
admin@CE6> show route table inet6 fc00:ffb3:d5:a::/64

admin@CE6>
What can we do about this? Well, we could add some static routes to the IPv6 tables
on each router, or we could run an IPv6 routing protocol between the routers to share
the routing information (we’ll do this in a later chapter). Or, we can configure an IPv6
over IPv4 tunnel between routers CE6 and CE0 (and back). We know we have connectivity with IPv4 between the edge routers, as shown with traceroute.
Here’s how to configure an IPv6-over-IPv4 tunnel on routers CE0 and CE6. It basically tells the router to take any traffic for LAN1 or LAN2 IPv6 addresses, put them
inside IPv4 packets with the LAN IPv4 interface addresses, and send them out as if they
were IPv4 packets. We’ll apply the tunnels on a logical interface known as the Generic
Routing Encapsulation (GRE) interface, abbreviated gr- on Juniper Networks routers.
Only the final configuration statements are shown.
[edit interfaces gr-1/0/0]
admin@CE6# set interfaces gr-1/0/0
admin@CE6# set unit 0 tunnel source 10.10.12.1;
           /* source address on LAN2 interface */
admin@CE6# set unit 0 tunnel destination 10.10.11.1;
           /* destination address on LAN1 interface */
admin@CE6# set unit 0 family inet6 address fc00:ffb3::/32
           /* LAN1 addresses */

[edit interfaces gr-1/0/0]
admin@CE0# set interfaces gr-1/0/0
admin@CE0# set unit 0 tunnel source 10.10.11.1;
           /* source address on LAN1 interface */
admin@CE0# set unit 0 tunnel destination 10.10.12.1;
           /* destination address on LAN2 interface */
admin@CE0# set unit 0 family inet6 address fc00:ffb3::/32
           /* LAN2 addresses */
Now we should be able to ping and traceroute an IPv6 address on LAN1 (in this
case, fc00:ffb3:d5:a:20e:cff:fe3b:8f95 for bsdclient) from the customer-edge
router on LAN2. And we can. Note that, because of the tunnel, the destination seems to
be only two hops away.
admin@CE6> ping inet6 count 4 source fc00:fe67:d4:b:205:85ff:fe8b:bcdb fc00:ffb3:d5:a:20e:cff:fe3b:8f95
PING6(56=40+8+8 bytes) fc00:fe67:d4:b:205:85ff:fe8b:bcdb --> fc00:ffb3:d5:a:20e:cff:fe3b:8f95
16 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=0 hlim=64 time=0.900 ms
16 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=1 hlim=64 time=0.728 ms
16 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=2 hlim=64 time=0.856 ms
16 bytes from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, icmp_seq=3 hlim=64 time=0.838 ms

admin@CE6> traceroute inet6 source fc00:fe67:d4:b:205:85ff:fe8b:bcdb fc00:ffb3:d5:a:20e:cff:fe3b:8f95
traceroute6 to fc00:ffb3:d5:a:20e:cff:fe3b:8f95 (fc00:ffb3:d5:a:205:85ff:fe88:ccdb)
 from fc00:fe67:d4:b:205:85ff:fe8b:bcdb, 30 hops max, 12 byte packets
 1  fc00:ffb3:d4:b:205:85ff:fe88:ccdb (fc00:ffb3:d4:b:205:85ff:fe88:ccdb)  1.059 ms  0.979 ms  0.819 ms
 2  fc00:ffb3:d5:a:20e:cff:fe3b:8f95 (fc00:ffb3:d5:a:20e:cff:fe3b:8f95)  0.832 ms  0.887 ms  0.823 ms
Let’s take a look at some basic types of router architectures that can be used to
implement these packet-forwarding strategies.
ROUTER ARCHITECTURES
There are three main steps that a router must follow to process and forward a packet to the next hop. Processing a packet means checking the incoming packet for errors and other parameters, looking up the destination address in a forwarding table to determine the proper output port for the packet, and then sending the packet out on that port.
But how are the input ports connected to the output ports? In smaller routers,
which can even be implemented on PC or laptop computers with two or more interfaces, software simply examines the packet headers and forwards the packets where
they need to go. Windows PCs can do this, and often do on home networks. In Linux,
there is a command to allow the “host” to forward packets without processing the content of the packet more fully.
[root@lnxserver admin]# echo "1" > /proc/sys/net/ipv4/ip_forward
Linux IP Forwarding
If you enter the ip_forward command from the shell command prompt, the setting
is not “remembered” after a reboot. If the host is to function as a gateway as well
as host, place the command in an initialization script.
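A common way to make the setting survive a reboot (a sketch; the file name is the conventional sysctl location on most Linux distributions) is to set it through sysctl instead of writing /proc directly:

[root@lnxserver admin]# echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
[root@lnxserver admin]# sysctl -p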
Small routers, such as those for DSL or small-edge LANs, can allow the incoming
packet to sit in a memory buffer somewhere and adjust header fields, perform tunnel
encapsulation, and so on, and then queue the packet for output. Larger routers, such
as those used by ISPs or on the Internet backbones, must route as fast as they can, usually at wire speeds (this means that the device processes data without reducing overall
transmission speed, so even if the packets arrive as fast as the input line allows, under
maximum load, there is minimal delay through the router).
Instead of software-based forwarding architectures, these larger routers use
hardware-based forwarding fabric architectures. The differences are important, so
we’ll take a look at them in more detail.
Basic Router Architectures
When it comes to architecture, routers look very much like a PC. This was one of the
reasons for the initial success of routers: Routers could be fabricated out of simple,
off-the-shelf parts and did not require extensive or customized chipsets or hardware. So
these routers have a CPU, memory, interfaces, peripheral ports—in short, usually everything but a hard drive. Small routers do not even have floppy drives or other forms of
external storage. This makes sense: Routers don’t need to store much of anything. A
forwarding table needs to be in memory at all times, because it’s much too slow to try
and fetch a piece of the table off a hard drive when needed. A lot of routers boot themselves from special servers, and have nonvolatile random access memory (NVRAM)
that keeps whatever information they need to remember whenever their power is cut
or turned off. Volatile memory like normal RAM is always erased when power is lost,
but NVRAM is like a disk.
The chief distinction is that at the heart of such routers is a general-purpose
computer. The architecture for large modern routers does not have a “center.”
Routers do not have to worry about adding cards for video, graphics, or other tasks
either. The slots in the chassis just handle various types of networking interfaces such
as Ethernet, ATM, SONET/SDH (Synchronous Optical Network/Synchronous Digital
Hierarchy), or other types of point-to-point WAN links. Most interface modules have
multiple ports, depending on the type of interface that they support. In a lot of high-end router models, the interface cards are complex devices all by themselves and
often called blades. Interfaces usually can be added as needed for the networking
environment—one or more LAN cards for the routers that handle customers and one
or more WAN cards for connection to other routers. Backbone routers often have only
WAN cards and no customers at all.
Another difference between a software-based router and a common PC is that PCs
almost always have only a single CPU. Because of the central role of these chips in
running all of the hardware and software on the computer, single-CPU architectures
require very powerful CPU chips.
Some routers use a variety of CPU chips, and because the tasks are shared among
the processors, these CPU chips do not have to be tremendously powerful either. Each
CPU set is chosen to fit the mission of the router. They have enough horsepower for
the home and small office, and these chips are stable, plentiful, and inexpensive.
Some routers use different types of memory. Figure 9.2 shows the general layout of
the motherboard of a generic software-based router. Many router motherboards have
four types of memory intended for specific purposes. Each type of memory and its location on the motherboard is shown in the figure. This architecture is also very similar
to the network processor engine (NPE) for larger Cisco router architectures. A lot of
architectures forgo packet memory because of the bandwidth available in their shared memory architecture or because the CPU itself contains a dedicated packet handling architecture.

[Figure 9.2 shows a generic motherboard layout with the CPU surrounded by the memory types discussed here: DRAM, shared DRAM (packet memory), NVRAM, flash memory, and ROM.]

FIGURE 9.2
Software-based architecture for small routers, showing the various types of memory used.
Every router ships with at least the factory default minimum of DRAM (dynamic
random access memory) and flash memory, but more can be added in the factory or
in the field. Generally, the DRAM can be doubled or increased fourfold, depending on
model, and flash memory can be doubled.
RAM/DRAM is sometimes called working storage because in the days before hard
drives and other types of external storage, memory was all that computers had for storing information outside of the immediate CPU. In a router, the RAM/DRAM performs
the same functions for the router’s CPU as the memory in a PC does for its CPU. So
when the router is up and running, the RAM/DRAM contains an image of the operating
system software, the running configuration (called running-config in routers using the
Cisco configuration conventions) file, the routing table and associated tables built after
startup, and the packet buffer. If this seems like a lot of work for one type of memory,
this just shows the flexibility of function in a general-purpose architecture router.
The RAM acronym often used by router vendors is somewhat misleading. Almost
all RAM in a router today is DRAM, since static memory—regular RAM—became obsolete some time ago. But people are used to the old RAM acronym, and it’s included in a
lot of literature just for familiarity.
In addition to the DRAM near the CPU, these types of routers include shared
DRAM or shared memory. Also known as packet memory, the shared DRAM handles
the packet buffers in the router. Splitting the packet buffers from the other DRAM
improves I/O performance, because the shared DRAM is physically closer to the interfaces that handle the packets.
Nonvolatile RAM (NVRAM) is memory that retains information even when power
is cut off to the router. Routers use NVRAM to store a copy of the router configuration file. Without NVRAM, the router would never be able to remember its proper
configuration when it was restarted. NVRAM is where the startup configuration (called
startup-config on routers using the Cisco configuration conventions) is stored.
Flash memory is another form of nonvolatile memory. But although flash memory is
different from NVRAM, flash memory can also be erased and reprogrammed as needed. In
many routers, flash memory is used to hold one or more copies of the router’s operating
system: In the case of Cisco, this is called the Internetwork Operating System, or IOS.
ROM is read-only memory and is therefore nonvolatile, but, as might be expected,
ROM cannot be changed. Routers use ROM to hold what is called the bootstrap program.
Normally, flash memory and NVRAM hold all of the information that the router needs
to come up again properly with the current configuration after a shutdown or other
power loss. But if there is a catastrophe, the bootstrap program in ROM can be used to
boot the router into a minimum configuration. ROM used for this purpose is also called
ROMMON (ROM monitor) and usually has a distinctive rommon>> prompt taken from
early Unix systems. ROMMON at least gets the router to the point where simple commands can be typed in through a system console terminal (monitor). In smaller routers,
ROM holds only a minimal subset of the router’s operating system software. In larger
routers, the ROM often holds a full copy of the router’s operating system software.
Another Router Architecture
In contrast to the basic router architecture just explored, no one would accuse a large
Internet backbone router of looking or acting like a PC. Routers based on a central
CPU just about run out of gas once link speeds move into the multigigabit ranges with
OC-48 (2.4 Gbps) and OC-192 (10 Gbps). And with 10 Gigabit Ethernet and OC-768
(40 Gbps), a change to the basic architecture of the router for the Internet backbone is
necessary. Many Internet backbone routers share the same basic architecture, whether
they come from Cisco or Juniper Networks or someone else. However, the terminology used for the components varies considerably from vendor to vendor. Because the
Illustrated Network uses Juniper Networks routers as its network nodes, we’ll use the
Juniper Networks architecture and terminology in this section, but only as an example,
not necessarily as an endorsement.
Larger network routers, oddly enough, do have hard drives. In fact, many Internet
backbone routers have a complete PC built right in (some even have two PCs). But wait
a minute. Isn’t the PC architecture much too slow for heavy duty, “wire-speed” routing?
And isn’t a hard drive useless when it comes to routing because the forwarding table
has to be in memory? Right on both counts. The PC in the backbone router, called the
routing engine (RE) in Juniper Networks routers, does not forward packets at all. Packets are routed and forwarded by the packet-forwarding engine (PFE), which is where
all the specialized ASICs are located. The RE controls the router, handles the routing
protocols, and performs all of the other tasks that can be handled more leisurely than
wire-speed packet transit traffic. Packets are forwarded from input to output port using
the forwarding table (FT) in the hardware fabric.
The fundamental principle in large router design is the idea that the functions of a
router can be split into two distinct parts: one portion for handling routing and control
operations and another for forwarding packets. By separating these two operations, the
router hardware can be designed and optimized to perform each function well.
This division of labor makes perfect sense. It has already been pointed out several
times that no one really sends traffic to a router. The vast majority of packets just pass
through the router. So transit packets never leave the hardware-based fabric linking input and output ports, while control packets, such as those for the routing protocols, only come along every few seconds or so and can be handled as required by the RE.
Just like other routers, large backbone routers can handle various types of networking interfaces. But these routers are normally intended mainly for customer traffic aggregation or for an ISP backbone, although many corporations are attracted to edge-oriented routers with this architecture as well. And anywhere in an enterprise where
there is a requirement for sustained 2-Gbps operation, routing is probably not being
done in software.
The overall concept of the division between routing engine (routing protocol
control and management) and packet-forwarding engine (line-rate routing transit traffic) with a hardware-based “switching” fabric is shown in Figure 9.3.
The section of the router that is designed to handle the general routing operations (and control-plane management tasks) is the RE. The RE is designed to handle all
the routing protocols, user interaction, system management, and OAM&P (operations,
administration, maintenance, and provisioning), and so on. The second section in Juniper Networks routers is the PFE, which is specifically designed to handle the forwarding of packets across the router from input to output interface. Transit packets never enter the routing engine at all.

FIGURE 9.3
A hardware-based router with a switching fabric architecture. The routing engine (reached through the console, AUX, and fxp0 Ethernet ports) connects over an internal 100-Mbps link (fxp1) to the packet-forwarding engine, whose FPC interface cards carry transit traffic from input to output ports. Note that the figure uses the architecture and terminology of Juniper Networks routers, which are used on the Illustrated Network.
The communications channel between the routing engine and the PFE is a standard 100-Mbps Fast Ethernet. This might seem somewhat surprising at first, because
the interfaces on a Juniper Networks router can be many gigabits per second. But
only control information needs to enter the routing engine. The vast majority of packets only transits the PFE at wire speeds. There are many advantages to using a standard
interface, even internally. A standard interface is easier to implement than creating a
new proprietary interface, and standard chipsets are readily available, inexpensive,
and so on.
The routing engine of a Juniper Networks router contains the router’s operating system, the JUNOS Internet software, the command line interface (CLI) for configuration
and control, and the routing table (RT) itself. The routing table in a Juniper Networks
router contains all of the routing information gathered from all routing protocols running on the router, as well as miscellaneous information such as interface addresses,
static routes, and so forth.
It might not seem that the RE would have to be very powerful, or have a large hard
drive, but it usually does. This is because of the increasing expense of converging a
growing routing table.
The PFE is where the forwarding table resides. The forwarding table contains all
the active route information that is actually used to determine the packet’s next hop
without needing to send the packet to the routing engine.
ROUTER ACCESS
Users don’t generally communicate directly with routers, but rather through routers.
The situation is different for network administrators and managers, however, who
must communicate directly with the individual routers in order to install, configure,
and manage the routers.
Routers are key devices on the Internet and almost any type of network. Many
backbone routers handle packets for hundreds or thousands of users, and some handle
packets for even more. So when a router goes down, or even slows down due to congestion or a problem, the users go wild and the network managers react immediately.
For this reason, network managers need multiple and foolproof ways to access the routers they are responsible for in order to manage them.
Larger routers, and many smaller ones, do not normally come with a keyboard,
mouse, and monitor. Nevertheless, there are usually three ways that a network administrator can communicate with a router.
The Console Port
This port is for a serial terminal that is at the same location as the router and attached
by a short cable from the serial port on the terminal to the console port on the
router. The terminal is usually a PC or Unix workstation running a terminal emulation
program. There are several physical connector types used for this port on Cisco routers. Network administrators sometimes have to carry around several different connector types so they can be sure to have the proper connector for the router they need to
manage. (Usually, after initial installation, the console ports are connected to a terminal
server on a management network so that access does not have to be right where the
router is.)
The Auxiliary Port
This port is for a serial terminal that is at a remote location. Connection is made
through a pair of modems, one connected to the router and the other connected to
the terminal. There is little difference, if any, between the auxiliary (AUX) and console ports in terms of characteristics. They are separate because routers might require
simultaneous local and remote access that would be impossible if there were only one
serial port on the router.
The Network
The router can always be managed over the same network on which it is routing
packets. This is often called “in-band management” in contrast to the console and
AUX ports, which are “out-of-band.” This just means that the network access method
shares the link to the router “in the same bandwidth” as user packets transiting the
router. There are often three ways to access a router over the network: through Telnet (called VTY lines on a Cisco router), with a more secure remote access program called secure shell (SSH), using a Web browser (HTTP is the protocol), or with SNMP (Simple Network Management Protocol), a protocol invented expressly for remote router management.

FIGURE 9.4
The three router access methods: the console port over a local cable, the AUX port over a pair of modems and the dial-up network, and in-band access over the network interface using Telnet, HTTP, or SNMP from a management terminal. Note that the console port requires local access to the router, while the others allow remote access.
These arrangements are shown in Figure 9.4. Small routers usually only have a console port. With the proper cables, these console ports can be hooked up to a modem
for remote access, but obviously cannot be used simultaneously for local access. On
some routers, the console ports are labeled “Admin” or “Management.” It is tempting to
try to access the console or AUX ports using the normal graphical interface provided by
Windows, a Mac, or Unix X-Windows. But the console and AUX ports only understand a
simple, character-based serial protocol. On Windows PCs, for example, only HyperTerminal (or another serial terminal emulation program) can communicate with a router
through the console or AUX ports.
FORWARDING TABLE LOOKUPS
In the connectionless, best-effort world of IP, every packet is forwarded independently,
hop by hop, toward the destination. Each router determines the next hop for the
destination address in the packet header based on information gathered into the routing table and distilled into the forwarding table. The essential operation of a router
is the looking up of the packet’s destination IP address in this table to determine the
next hop.
It’s unusual that a packet address is an exact match for a table entry. Otherwise,
routing and forwarding tables would need an entry for every host in the world—all
32 bits for IPv4 and 128 bits for IPv6! So in the current classless (prefix) world of IP
addressing, the next-hop destination is chosen by the longest match rule. Figure 9.5
shows how the next-hop address and interface information are used with the ARP process (cache or query) to forward the packet in a frame toward the destination.
Consider a packet sent to 10.10.11.77 (bsdclient) from LAN2. Remember, the network is 10.10.11.0/24. Suppose the Best ISP edge router, PE1, has the entries shown
in Table 9.1 about 10.10/16 networks in its tables; the longest match determines the
correct interface that should forward the packet.
Which interface is the “best” next hop toward the destination? It would be easy if
we had an entry like 10.10.11/24 to work with, but routers closer to the backbone
use aggregate addresses in their tables. In most cases, Internet backbone routers will
accept prefixes of /24 or shorter. (It would be nice to accept only /19 or shorter, but
not many could get away with that.)
So where should the router send a packet for network 10.10.11.0/24? Which next
hop should it use? All three table entries are “close” to the destination address, but
which one is “best”?
According to the longest-match rule, the router will send the packet for 10.10.11.77
to 10.10.17.2 on interface so-0/0/2. But how exactly does it work?
FIGURE 9.5
How the longest match rule applies to a forwarding table lookup: the forwarding module extracts the destination address from the packet, looks it up against the network address prefixes in the table, and hands the next-hop address and interface information to ARP. More specific (longer) routes are preferred to less specific (shorter) routes.
Table 9.1 Tables for Router PE1

Network (Network Bits in Bold)                   Prefix   Next-Hop Address   Interface
10.10.0 (00001010 00001010 0000xxxx xxxxxxxx)    /20      10.0.12.2          so-0/0/0
10.10.8 (00001010 00001010 00001xxx xxxxxxxx)    /21      10.0.19.2          so-0/0/1
10.10.8 (00001010 00001010 000010xx xxxxxxxx)    /22      10.0.17.2          so-0/0/2
Routers today can “mix and match” prefixes of differing lengths in a routing or forwarding table and still send packets to the correct next hop. In the table, 10.10.8/21
and 10.10.8/22 are different routes, as would be 10.10.8/23 and 10.10.8/24.
Now, the 32-bit destination address, 10.10.11.77, in bits is 00001010 00001010
00001011 01001101. There is, of course, no subnet mask associated with a host address.
Looking at the table, the first 20 bits are exactly the same in all three entries, as well as
the destination address. But which is the longest match? The router will keep comparing the addresses in the table to the destination address bit by bit until the table runs
out of entries. The last match is the longest match, no matter if it’s all 32 bits, or none
(the default 0/0 entry matches everything).
The 21st bit is a 1 bit in the table entry for 10.10.8/21, and so is the 21st bit in the
destination address. The 22nd bit is a 0 bit in the table entry for 10.10.8/22, and so is
the 22nd bit in the destination address. There is no entry with a longer prefix. This makes the /22
entry the longest match for the destination address, and the packet is forwarded to
10.10.17.2. The rest of the bits are used for local delivery of the packet on LAN2.
The longest match is also often called the best match or the more specific route for a
given destination IP address. But whatever it is called, the point is the same: The longest-match next hop is always used in favor of a potential, but shorter, match.
What if there were other entries such as 10.10.8/23 or 10.10.8/24? It doesn’t
matter. The 1 bit in the 23rd position will not match these entries, which all have 0s at
the end of the entry. The same longest match rules apply at each router.
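To make the longest-match rule concrete, here is a minimal sketch in Python (not from the book) that checks the destination 10.10.11.77 against the three prefixes from Table 9.1 using the standard ipaddress module; the table entries simply restate Table 9.1, and a real router would of course use specialized lookup hardware rather than a linear scan.

import ipaddress

# Forwarding table entries from Table 9.1: (prefix, next-hop address, interface)
TABLE = [
    (ipaddress.ip_network("10.10.0.0/20"), "10.0.12.2", "so-0/0/0"),
    (ipaddress.ip_network("10.10.8.0/21"), "10.0.19.2", "so-0/0/1"),
    (ipaddress.ip_network("10.10.8.0/22"), "10.0.17.2", "so-0/0/2"),
]

def longest_match(destination):
    """Return the matching entry with the most prefix bits, or None."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop, interface in TABLE:
        if dest in prefix:
            # Keep the most specific (longest) matching prefix seen so far.
            if best is None or prefix.prefixlen > best[0].prefixlen:
                best = (prefix, next_hop, interface)
    return best

print(longest_match("10.10.11.77"))
# (IPv4Network('10.10.8.0/22'), '10.0.17.2', 'so-0/0/2'), as worked out in the text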
DUAL STACKS, TUNNELING, AND IPV6
So far, we’ve seen how routers forward packets, what the routers look like internally,
and how the longest match determines the output port. Most of this chapter, however, dealt with IPv4. What about IPv6 packets? It's one thing to say that some routers can
handle both IPv4 and IPv6, but what about older or smaller routers and hosts that don’t
integrate IPv6 support and handle IPv4 only? This chapter ends with a consideration of
the role of the router in a world that is slowly making its way toward IPv6.
The transition to IPv6 will be a long one for most networks. There might be networks where it will be necessary to mix hosts and routers that run IPv4 only, IPv6 only,
and a combination of the two. Why would a host need to run both IPv4 and IPv6? Well,
a Web site that only ran IPv6 would be forever unreachable by IPv4 browsers. Routers,
of course, can be used to build separate IPv4 and IPv6 router networks. For example,
LAN1 and LAN2 could have two routers each—one for IPv4 and one for IPv6 traffic.
But a lot of newer routers should be able to handle both IPv4 and IPv6 packets, and
many do.
There are two main strategies that have emerged for dealing with mixed IPv4 and
IPv6 environments. These are dual protocol stacks and tunneling.
Dual Protocol Stacks
All of the hosts on the Illustrated Network, as we have seen, are capable of assigning
both an IPv6 and IPv4 address to their network interfaces. This is possible because they
all implement a sort of “split” IP network layer. For example, if the Ethernet Type field is
set to 0x0800 the packet is handed off to the IPv4 process, and if the Type field is set to
0x86DD, then the packet is handed off to the IPv6 process. This is shown conceptually
in Figure 9.6.
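As a rough illustration of that split network layer (a sketch, not from the book), the following Python dispatches a received Ethernet frame to an IPv4 or IPv6 handler purely on the value of the Type field; the frame layout is the standard one, but the handler functions are just placeholders.

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD

def handle_ipv4(payload):
    print("IPv4 process gets", len(payload), "bytes")

def handle_ipv6(payload):
    print("IPv6 process gets", len(payload), "bytes")

def dispatch(frame: bytes):
    # Destination MAC (6 bytes) and source MAC (6 bytes) precede the 2-byte Type field.
    ethertype = int.from_bytes(frame[12:14], "big")
    payload = frame[14:]
    if ethertype == ETHERTYPE_IPV4:
        handle_ipv4(payload)
    elif ethertype == ETHERTYPE_IPV6:
        handle_ipv6(payload)
    # anything else belongs to some other protocol stack (or is dropped)

dispatch(bytes(12) + (0x86DD).to_bytes(2, "big") + bytes(40))   # "IPv6 process gets 40 bytes"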
The dual protocol stack must provide error messages that are IPv6 “aware,” and routing protocols have to adapt to IPv6 addresses as well (as we’ll see). And in spite of the
figure, which is a very common representation, the TCP/UDP layer is also dual.
Dual protocol stacks are not new with IPv6. This method was frequently used
whenever two or more protocol stacks had to share a single host interface. In fact, very
complex arrangements were not unknown, with IBM’s (and Microsoft’s) NetBios sharing the network with Novell’s NetWare and IP itself (for Internet access).
Tunneling
Tunneling is a much misunderstood topic in general. This section talks about IPv6 tunnels, but networks also feature IPSec tunnels, VPN tunnels, and possibly even more. But
they all employ tunnels. Tunneling occurs whenever the normal sequence of encapsulation headers is violated. That’s all.
FIGURE 9.6
Dual protocol stacks for IPv4 and IPv6 sharing a single network connection: application services run over TCP/UDP, which runs over parallel IPv4 and IPv6 layers above a common network access layer (Ethernet, etc.) and the physical network. Technically, TCP and UDP have to be adjusted for an IPv6 environment.
Normally, a message is broken up into segments, which are put inside packets placed
inside frames that are sent as a sequence of bits to an adjacent system. The receiver
usually expects that the frame contains a packet, and so on, but what if it doesn’t? Then
the device is using tunneling.
We’ve already seen a form of tunneling in action. When we put PPP frames inside
Ethernet frames, we put a frame inside a frame and violated the normal OSI-RM
sequence of headers. That’s okay, as long as the receiver knows the sequence of headers the sender is generating.
Not all devices need to know the exact sequence of encapsulations used by the
sender and receiver. Only the endpoints (usually hosts, but not always) need to know
how to encapsulate the data at one end and process the headers correctly at the destination. In between, inside the tunnel, all other devices can treat the data units as
usual.
Tunneling in a mixed IPv4 and IPv6 network is used to transport IPv6 packets over
a series of IPv4 routers or to an IPv4 host. There is a lot of variation in tunnels to support IPv4/IPv6 operation. For example, a native IPv6 backbone might tunnel IPv4 to
reduce address consumption in the network core. For the sake of simplicity, let’s consider four types of tunnels and two major scenarios for their use:
1. Host to router—Hosts with dual-stack capabilities can tunnel IPv6 packets to a
dual-stack router that is only reachable over a series of IPv4-only devices.
2. Router to router—Routers with dual-stack capabilities can tunnel IPv6 packets
over an IPv4 infrastructure to other routers.
3. Router to host—Routers with dual-stack capabilities can tunnel IPv6 packets
over an IPv4 infrastructure to a dual-stack destination host.
4. Host to host—Hosts with dual-stack capabilities can tunnel IPv6 packets over an
IPv4 infrastructure to other dual-stack IP hosts without an intervening router.
The four types of tunnels are shown in Figure 9.7. When the IPv6 packet is sent to
a router (the first two tunneling methods), the endpoint of the tunnel is not the same
as the destination, so the destination address of the IPv6 packet does not indicate the
same device as the IPv4 tunnel endpoint address that carries the IPv6 packet. The
source host or router must have the tunnel endpoint’s IPv4 address configured. This is
called configured tunneling.
In contrast, the last two methods send the encapsulated IPv6 packet directly to the
destination host, so the IPv4 and IPv6 addresses used correspond to the same host. This
lets the IPv6 destinations use IPv4-compatible addresses that are derived automatically
by the devices. This is called automatic tunneling because it does not require explicit
configuration.
Automatic tunneling uses a special form of the IPv6 address. The 32-bit IPv4 address
is simply prepended with 96 zero bits in the form 0:0:0:0:0:0:<IPv4 address>. This
format is abbreviated as ::<IPv4 address>.
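A quick sketch (not from the book) of how this form can be built and taken apart with Python's ipaddress module; the 192.168.38.156 address is the one used in Figure 9.8.

import ipaddress

# Prepend 96 zero bits to the 32-bit IPv4 address: ::<IPv4 address>
v4 = ipaddress.IPv4Address("192.168.38.156")
compatible = ipaddress.IPv6Address(int(v4))          # 96 zero bits + 32 IPv4 bits

print(compatible)                                     # ::c0a8:269c, the same value as ::192.168.38.156
print(ipaddress.IPv6Address("::192.168.38.156") == compatible)   # True

# The tunnel endpoint recovers the embedded IPv4 address from the low 32 bits.
print(ipaddress.IPv4Address(int(compatible) & 0xFFFFFFFF))       # 192.168.38.156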
All dual-stack IP hosts recognize this format and encapsulate the IPv6 packet inside
an IPv4 packet using the embedded IPv4 address, creating an end-to-end tunnel. The receiver simply strips off the IPv4 header and processes the IPv6 header and packet inside.

FIGURE 9.7
The various types of IPv6 tunnels (host to router, router to router, router to host, and host to host), showing the host and router situations that can be used to connect across an IPv4 network.

FIGURE 9.8
The special IPv6 tunnel-addressing format for dual-stack routers: an IPv6 packet with destination address 0:0:0:0:0:0:192.168.38.156 (::192.168.38.156) is carried inside an IPv4 packet with destination address 192.168.38.156.
Hosts that only run IPv6 can use dual-stack routers to communicate using this special form of IPv6 address also. Dual-stack routers recognize the IPv6 traffic and use the
last 32 bits to create the IPv4 address for the IPv4 “wrapper.” Figure 9.8 shows how this
special addressing format works. Naturally, this requires IPv6-only hosts to have valid
and routable IPv4 addresses, which clearly marks the format as a transitional method.
If the IPv6 address is not in this special address form, then a configured tunnel must
be used, or, if every device on the path from source to destination uses dual protocol
stacks, or IPv6 only, well-formed IPv6 addresses can be used.
TUNNELING MECHANISMS
The theory of tunneling IPv6 packets through a collection of IPv4 routers is one thing.
Exactly how to do it is another. There are several tunnel mechanisms that embody the
concepts discussed previously.
Manually configured tunnels—These are defined in RFC 2893, and both endpoints of the tunnel must have both IPv4 and IPv6 addresses. These tunnels are
usually used between dual-stack edge routers.
Generic Routing Encapsulation (GRE) tunnels—GRE tunnels were designed to
transport non-IP protocols over an IP network. But GRE is also a good way to
carry IPv6 across the IPv4 routers. We used a GRE tunnel earlier in this chapter.
IPv4-compatible (6over4) tunnels—Also defined in RFC 2893, these are the
automatic tunnels based on IPv4-compatible IPv6 addresses using the ::<IPv4
address> form of IPv6 address.
6to4 tunnels—Another form of automatic tunnel, defined in RFC 3056. They use an
IPv4 address embedded in the IPv6 address to identify the tunnel endpoint.
Intra-site Automatic Tunnel Addressing Protocol (ISATAP) tunnels—ISATAP tunnels are a mechanism much like 6to4 tunneling, but for local site (campus)
networks. An ISATAP address uses a special prefix and the IPv4 address to
identify the endpoint.
The differences between the 6to4 tunnel and the ISATAP tunnel address are shown
in Figure 9.9.
FIGURE 9.9
The differences between 6to4 and ISATAP tunnel addressing, showing how the 128 bits of the IPv6 address are structured in each case. (a) 6to4 tunneling address format: a 16-bit 2002: prefix, the 32-bit IPv4 address, a 16-bit subnet ID, and a 64-bit interface ID. (b) ISATAP tunneling address format: a 64-bit subnet prefix, the 32-bit 0000:5EFE identifier, and the 32-bit IPv4 address.
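As a sketch (not from the book) of the two layouts in Figure 9.9, the following Python assembles a 6to4 address and an ISATAP address; the IPv4 endpoint, subnet ID, interface ID, and subnet prefix values are arbitrary examples.

import ipaddress

v4 = ipaddress.IPv4Address("192.0.2.99")      # example tunnel endpoint address

# 6to4: 16-bit 2002 prefix | 32-bit IPv4 address | 16-bit subnet ID | 64-bit interface ID
subnet_id = 0x0001
interface_id = 0x0000000000000001
six_to_four = ipaddress.IPv6Address(
    (0x2002 << 112) | (int(v4) << 80) | (subnet_id << 64) | interface_id)
print(six_to_four)        # 2002:c000:263:1::1

# ISATAP: 64-bit subnet prefix | 0000:5EFE | 32-bit IPv4 address
subnet_prefix = int(ipaddress.IPv6Address("2001:db8:0:1::")) >> 64
isatap = ipaddress.IPv6Address(
    (subnet_prefix << 64) | (0x00005EFE << 32) | int(v4))
print(isatap)             # 2001:db8:0:1:0:5efe:c000:263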
TRANSITION CONSIDERATIONS
Routers occupy a key position during the transition period between IPv4 and
IPv6. There are still a lot of routers, mostly older ones, that do not handle IPv6 or
understand only the ::<IPv4 address> form of IPv6 address. How will IPv4 and IPv6
routers and hosts interoperate?
A transition plan has been put in place and contains some distinct terminology that
is new. The IPv4 to IPv6 transition plan defines the following terms for nodes:
■ IPv4-only Node—A host or router that implements only IPv4.
■ IPv6/IPv4 (dual) Node—A host or router that implements both IPv4 and IPv6.
■ IPv6-only Node—A host or router that implements only IPv6.
■ IPv6 Node—A host or router that implements IPv6. Both IPv4/IPv6 dual nodes and IPv6-only nodes are included in this category.
■ IPv4 Node—A host or router that implements IPv4. Both IPv4/IPv6 dual nodes and IPv4-only nodes are included in this category.
In addition, the plan defines three types of addresses:
1. IPv4-compatible IPv6 address—An address assigned to an IPv6 node that can
be used in both IPv6 and IPv4 packets. The ::<IPv4 address> format is used for
this type of IP address. For example, an address such as ::10.10.11.66 is used
when there is no IPv6 router available.
2. IPv4-mapped IPv6 address—An address assigned to an IPv4-only node represented as an IPv6 address. These addresses always identify IPv4-only nodes,
never IPv4/IPv6 or IPv6-only nodes. These are provided when an IPv6 application requests the host name for a node with an IPv4 address only. For example,
::FFFF:10.10.12.166 is an IPv4-mapped IPv6 address.
3. IPv6-only address—An address globally assigned to any IPv4/IPv6 or IPv6-only
node. These addresses never identify IPv4-only nodes.
These terms can be somewhat confusing, but all they mean is that hosts and routers
can be classified either as IPv4 devices, IPv6 devices, or both IPv4 and IPv6 devices.
The IPv4/IPv6 devices are capable of understanding and using both IPv4 and IPv6.
However, the IPv6-only address (an address that has no relationship to an IPv4 address)
can be used in an IPv6/IPv4 device.
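The three address types can be told apart mechanically. Here is a minimal sketch (not from the book) that classifies an IPv6 address using Python's ipaddress module; the sample addresses echo the examples above.

import ipaddress

def classify(text):
    addr = ipaddress.IPv6Address(text)
    if addr.ipv4_mapped is not None:                 # ::FFFF:a.b.c.d
        return "IPv4-mapped (embedded %s)" % addr.ipv4_mapped
    if int(addr) >> 32 == 0 and int(addr) != 0:      # ::a.b.c.d (ignoring :: itself)
        return "IPv4-compatible (embedded %s)" % ipaddress.IPv4Address(int(addr))
    return "IPv6-only"

print(classify("::10.10.11.66"))         # IPv4-compatible (embedded 10.10.11.66)
print(classify("::FFFF:10.10.12.166"))   # IPv4-mapped (embedded 10.10.12.166)
print(classify("2001:db8::1"))           # IPv6-only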
QUESTIONS FOR READERS
Figure 9.10 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 9.10
A simple network of routers and hosts, showing architecture, a routing table, and tunnel support. One router is built with NVRAM and DRAM, the other with an RE and PFE; between them are Interfaces 1, 2, and 3, a host supporting 6to4 tunnels, and a host supporting both 6to4 and ISATAP tunnels. The routing table shown in the figure (show route output for inet.0) lists 10.10.0.0/16 via interface #1, 10.10.64.0/18 via interface #2, and 10.10.128.0/18 via interface #3.
1. Which router, based on the architecture in the figure, is probably a small site
router? Which is probably a large Internet backbone router?
2. Which output interface, based on the routing table shown in the figure, will
packets arriving from the directly attached host for IPv4 address 10.10.11.1 use
for forwarding? Assume longest match is used.
3. Which output interface will packets for 10.10.192.10 use? Assume the longest
match is used.
4. Which IPv6 tunneling protocol can be used between the two hosts? How many
bits will be used for the subnet identifier?
5. Do the routers require IPv6 support to deliver packets between the two hosts?
CHAPTER 10
User Datagram Protocol
What You Will Learn
In this chapter, you will learn about UDP, one of the major transport layer protocols
in the TCP/IP stack. We’ll talk about datagrams and the structure of the UDP
header.
You will learn about ports and sockets and how they are used at the transport
layer.
The User Datagram Protocol (UDP) is one of the major transport layer protocols that rides
on top of IPv4 or IPv6. Most explorations of the TCP/IP transport layer treat the other
major protocol, the connection-oriented Transmission Control Protocol (TCP) first and
present connectionless UDP later. But the complexities of TCP, and the reasons for these
often sophisticated procedures, are better understood after appreciating the basic connectionless service provided by UDP. In addition, certain concepts that are shared by
both UDP and TCP, such as ports, can be introduced in UDP and so reduce the number
of new ideas that must be covered during TCP discussions to a more manageable level.
The UDP acronym shows the effects of early Internet efforts to distinguish connectionless packet delivery (“It’s a datagram, not a packet!”) from more conventional
connection-oriented schemes in use at the time. The data unit of UDP is not a packet
anyway, but a datagram, the content of a connectionless packet (many authors call IP
packets datagrams as well, but we do not in this book). UDP datagrams have their own
headers, naturally, and the UDP header is about as simple as a header can get. That's only
to be expected, because UDP operation is also very simple, making UDP ideal for a first
look at end-to-end functions on a network.
In recent years, UDP’s popularity as a transport layer protocol for applications has
been growing. The simple and fast operation of UDP makes it ideal for delay-sensitive
traffic like voice samples (the digital representation of analog speech), multicast digital
video, and other types of “real-time” traffic that cannot be resent if lost on the network.
This use of UDP is not as originally intended, and there are other things that need
to be done before UDP is ready for voice and video, but in the true spirit of Internet
innovation, UDP was adapted for these new circumstances.
FIGURE 10.1
UDP ports and sockets on the Illustrated Network. Note that this chapter mainly uses the Unix-based hosts on the network to explore UDP. (The figure is the usual two-page network map: LAN1 at the Los Angeles office, with bsdclient, lnxserver, wincli1, and winsvr1 behind router CE0 and the Ace ISP backbone in AS 65459, and LAN2 at the New York office, with bsdserver, lnxclient, winsvr2, and wincli2 behind router CE6 and the Best ISP backbone in AS 65127, linked across the global public Internet. Solid rules are SONET/SDH, dashed rules are Gigabit Ethernet, and all links use 10.0.x.y addressing with only the last two octets shown.)
UDP is used by many common network applications, including DNS, IPTV streaming
media applications, voice over IP (VoIP), the Trivial File Transfer Protocol (TFTP), and
online games. UDP is required for multicast applications.
UDP PORTS AND SOCKETS
Figure 10.1 shows the hosts on the Illustrated Network that we’ll be using in this
chapter to explore UDP ports and sockets. We’ll primarily use the Unix-based hosts,
both FreeBSD and Linux.
Let’s look at a simple application of UDP between the lnxclient and lnxserver hosts.
The standard Unix “echo” utility (not the same “echo” program as the application used in
a previous chapter) sends a simple text string from a client to a server using UDP port
7. The server just bounces a UDP datagram back with the same content. But even with
this simple interaction, all of the major points about UDP discussed in this chapter can
be illustrated.
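As a rough Python stand-in for that exchange (a sketch, not the actual Unix echo utility), the client below sends "TEST" to UDP port 7 and waits briefly for the reply; the server address is lnxserver's, and an echo service must actually be listening there for a reply to come back.

import socket

SERVER = ("10.10.11.66", 7)            # lnxserver, UDP echo port

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # UDP (datagram) socket
client.settimeout(2.0)                 # UDP makes no delivery guarantees, so don't wait forever

client.sendto(b"TEST", SERVER)         # the OS picks an ephemeral source port
try:
    data, peer = client.recvfrom(1024)
    print("echoed back from %s:%d: %s" % (peer[0], peer[1], data.decode()))
except socket.timeout:
    print("no reply (request or reply lost, or no echo service running)")
finally:
    client.close()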
The capture is from lnxserver (10.10.11.66). The server is responding to the
lnxclient (10.10.12.166) request to echo the string “TEST.” The important sections
of the request and response packets relevant to UDP are highlighted.
[root@lnxserver admin]# /usr/sbin/tethereal -V port 7
Capturing on eth0
Frame 1 (60 bytes on wire, 60 bytes captured)
Arrival Time: May 6, 2008 16:31:30.947137000
Time delta from previous packet: 0.000000000 seconds
Time relative to first packet: 0.000000000 seconds
Frame Number: 1
Packet Length: 60 bytes
Capture Length: 60 bytes
Ethernet II, Src: 00:05:85:88:cc:db, Dst: 00:d0:b7:1f:fe:e6
Destination: 00:d0:b7:1f:fe:e6 (Intel_1f:fe:e6)
Source: 00:05:85:88:cc:db (Juniper__88:cc:db)
Type: IP (0x0800)
Trailer: 0000000000000000000000000000
Internet Protocol, Src Addr: 10.10.12.166 (10.10.12.166), Dst Addr:
10.10.11.66 (10.10.11.66)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
0000 00.. = Differentiated Services Codepoint: Default (0x00)
.... ..0. = ECN-Capable Transport (ECT): 0
.... ...0 = ECN-CE: 0
Total Length: 32
Identification: 0x0000
Flags: 0x04
.1.. = Don’t fragment: Set
..0. = More fragments: Not set
Fragment offset: 0
Time to live: 62
Protocol: UDP (0x11)
Header checksum: 0x10d2 (correct)
Source: 10.10.12.166 (10.10.12.166)
Destination: 10.10.11.66 (10.10.11.66)
User Datagram Protocol, Src Port: 32787 (32787), Dst Port: echo (7)
Source port: 32787 (32787)
Destination port: echo (7)
Length: 12
Checksum: 0xac26 (correct)
Data (4 bytes)
0000 54 45 53 54
TEST
Frame 2 (46 bytes on wire, 46 bytes captured)
Arrival Time: May 6, 2008 16:31:30.948312000
Time delta from previous packet: 0.001175000 seconds
Time relative to first packet: 0.001175000 seconds
Frame Number: 2
Packet Length: 46 bytes
Capture Length: 46 bytes
Ethernet II, Src: 00:d0:b7:1f:fe:e6, Dst: 00:05:85:88:cc:db
Destination: 00:05:85:88:cc:db (Juniper__88:cc:db)
Source: 00:d0:b7:1f:fe:e6 (Intel_1f:fe:e6)
Type: IP (0x0800)
Internet Protocol, Src Addr: 10.10.11.66 (10.10.11.66), Dst Addr:
10.10.12.166 (10.10.12.166)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
0000 00.. = Differentiated Services Codepoint: Default (0x00)
.... ..0. = ECN-Capable Transport (ECT): 0
.... ...0 = ECN-CE: 0
Total Length: 32
Identification: 0x0000
Flags: 0x04
.1.. = Don’t fragment: Set
..0. = More fragments: Not set
Fragment offset: 0
Time to live: 64
Protocol: UDP (0x11)
Header checksum: 0x0ed2 (correct)
Source: 10.10.11.66 (10.10.11.66)
Destination: 10.10.12.166 (10.10.12.166)
User Datagram Protocol, Src Port: echo (7), Dst Port: 32787 (32787)
Source port: echo (7)
Destination port: 32787 (32787)
Length: 12
Checksum: 0xac26 (correct)
Data (4 bytes)
0000 54 45 53 54
TEST
The DF bit in the packet is set, and the UDP checksum field is used. Technically,
the UDP checksum is optional, and the client decides whether to use it. The server
responds with a checksum because the client used a checksum in the request. In fact,
Windows XP and FreeBSD do the same.
The UDP checksum was made optional to cut processing on reliable networks like
small LAN segments to a bare minimum. Today, client and server on the same LAN
segment are not very common, and processing the checksum is not a burden for modern computing devices. Also, UDP checksum calculation can be offloaded to modern
Ethernet chipsets, so it’s less “expensive” than it used to be. Currently, use of the UDP
checksum is common, and most traditional texts say it “should” be used with IPv4. Use
of the UDP checksum is mandatory with IPv6.
Note that the program uses client UDP port 32787. This is in the range of ports
known as registered ports. We’ll talk about those, and the dynamic port range of
49152 to 65535, later in this chapter. The dynamic port range that a Unix system uses is a kernel-tunable parameter and can be changed using tweaks to the /etc/sysctl.conf file, but information on exactly how to do it is scarce and beyond the scope of
this book.
We can see the sockets in use on a Linux host by using the netstat -lp command
to display listening sockets. (Although the options imply these are listening ports, it
is the socket information that is displayed.)
[root@lnxserver admin]# netstat -lp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address   State    PID/Program name
tcp        0      0 *:32768                     *:*               LISTEN   1664/
tcp        0      0 localhost.localdo:32769     *:*               LISTEN   1783/xinetd
tcp        0      0 localhost.localdoma:783     *:*               LISTEN   1853/spamd -d -c -a
tcp        0      0 *:sunrpc                    *:*               LISTEN   1645/
tcp        0      0 *:x11                       *:*               LISTEN   2103/X
tcp        0      0 *:ssh                       *:*               LISTEN   1769/sshd
tcp        0      0 localhost.localdoma:ipp     *:*               LISTEN   6813/cupsd
tcp        0      0 localhost.localdom:smtp     *:*               LISTEN   1826/
udp        0      0 *:32768                     *:*                        1664/
udp        0      0 *:echo                      *:*                        1923/Echo
udp        0      0 *:sunrpc                    *:*                        1645/
udp        0      0 *:631                       *:*                        6813/cupsd
udp        0      0 localhost.localdoma:ntp     *:*                        1800/
udp        0      0 *:ntp                       *:*                        1800/
Active UNIX domain sockets (only servers)
Proto RefCnt Flags       Type       State         I-Node PID/Program name   Path
unix  2      [ ACC ]     STREAM     LISTENING     2663   1939/              /tmp/jd_sockV4
unix  2      [ ACC ]     STREAM     LISTENING     2839   2053/              /tmp/.gdm_socket
unix  2      [ ACC ]     STREAM     LISTENING     2714   2016/              /tmp/.font-unix/fs7100
unix  2      [ ACC ]     STREAM     LISTENING     2542   1872/              /tmp/.iroha_unix/IROHA
unix  2      [ ACC ]     STREAM     LISTENING     2849   2103/X             /tmp/.X11-unix/X0
unix  2      [ ACC ]     STREAM     LISTENING     2535   1862/gpm           /dev/gpmctl
The output is difficult to parse, but we can see our little echo utility (highlighted,
and the second line of the UDP section) patiently waiting for clients on port 7 (the
output identifies it as the standard “echo” port). UDP, being a stateless protocol, is not
technically in a “listening” state, but that’s what the server socket essentially does. The
asterisks (*:*) mean that communications will be accepted from another IP address
and port.
The command to reveal the same type of information on bsdserver is sockstat.
bsdserver# sockstat
USER     COMMAND    PID   FD  PROTO  LOCAL ADDRESS    FOREIGN ADDRESS
root     sendmail   88    4   tcp4   *:25             *:*
root     sendmail   88    6   tcp4   *:587            *:*
root     sshd       83    4   tcp4   *:22             *:*
root     inetd      79    4   tcp4   *:21             *:*
root     inetd      79    5   tcp4   *:23             *:*
root     syslogd    72    5   udp4   *:514            *:*
USER     COMMAND    PID   FD  PROTO  LOCAL ADDRESS    FOREIGN ADDRESS
root     sendmail   88    5   tcp46  *:25             *:*
root     sshd       83    3   tcp46  *:22             *:*
root     syslogd    72    4   udp6   *:514            *:*
USER     COMMAND    PID   FD  PROTO  ADDRESS
admin    sshd       48218 3   stream sshd[48216]:4
root     sshd       48216 4   stream sshd[48218]:3
smmsp    sendmail   91    3   dgram  syslogd[72]:3
root     sendmail   88    3   dgram  syslogd[72]:3
root     syslogd    72    3   dgram  /var/run/log
The little “echo” port is not listed because it is not running on this host. Note that
the syslogd process in FreeBSD listens on both the IPv4 and IPv6 UDP ports (in this case,
port 514) for clients.
What about Windows XP? The command here is netstat -a (all), but be prepared
to be surprised. Windows hosts listen to a larger number of sockets than Unix systems.
It depends on exactly what the system is doing, but even on our “quiet” test network,
winsvr2 has 25 TCP and 19 UDP processes waiting to spring into action. They range
from Netbios (an old IBM and Microsoft LAN protocol) to Microsoft-specific functions.
Heavily loaded systems have even higher numbers.
What about looking at UDP with IPv6? It’s not really necessary. We are now high
enough in the TCP/IP protocol stack not to worry about differences between IPv4 and
IPv6. (In practical terms, we still have to worry about DNS a bit, but we’ll talk about
that in Chapter 19.) With the exception of the checksum use and something called the
pseudo-header, UDP is the same in both.
WHAT UDP IS FOR
UDP was defined in RFC 768 and refined in RFC 1122. All implementations must
follow both RFCs to make interoperability reliable, and all do. UDP uses IP protocol
ID 17. Any IPv4 or IPv6 packet received with 17 in the protocol ID field is given to
the local UDP service.
UDP is defined as stateless (no session information is kept by hosts) and unreliable
(no guarantees of any QoS parameters, not even delivery). This does not mean that
UDP traffic is somehow lower priority on the network or through routers. It’s not as
if UDP traffic is routinely tossed by stressed-out routers. It just means that if the application using UDP needs to keep track of a session history (“How many datagrams did
you get before that link failed?”) or guaranteed delivery (“I’m not sending any more
until I know if you got the datagrams I sent.”), then the application itself must do it,
because UDP can’t and won’t.
Nevertheless, there is a whole class of applications that use UDP, some almost
exclusively. These are applications that are invoked to exchange quick, request–
response pairs of messages, such as DNS (“Quick! What IP address goes with www.
example.com?”). These applications could suffer while waiting for all the overhead
that TCP requires to set up a connection between hosts before sending a message.
Multicast allows one source to send a single packet stream to multiple destinations (TCP is strictly a one-source-to-one-destination protocol), so UDP must be used
for multicast data transfer as well. Multicast is not only used with video or audio, but
also in applications such as the Dynamic Host Configuration Protocol (DHCP).
In other words, UDP is a low-overhead transport for applications that do not need, or
cannot have, the “point-to-point” connections or guaranteed delivery that TCP provides.
Packets carrying UDP traffic in IPv4 sometimes have the DF (Don’t Fragment) bit
set in the IPv4 header. However, no one should be surprised or upset to find a UDP
datagram riding inside an IPv4 packet without the DF bit set.
THE UDP HEADER
Figure 10.2 shows the UDP header. There are only four fields, and the data inside the
datagram (the message) are optional.
The header is only 8 bytes (64 bits) long. First are the 2-byte Source Port field and
the 2-byte Destination Port field. These fields are the datagram counterparts of the
source and destination IP addresses at the packet level. But unlike IP addresses, there
is no structure to the port fields: All values between 0 and 65,535 are represented as
pure numerics. This does not mean that all port numbers, source and destination, are
the same, however. Port values can be divided into well-known, registered, and dynamic
port numbers.
The Length field gives the length in bytes of the UDP datagram, and includes the
header fields along with any data. The minimum length is 8 (the header alone), and the maximum value is 65,535. However, the achievable maximum UDP datagram lengths
are determined by the size of the send and receive buffers on the host end systems,
which are usually set to around 8000 bytes (although they can be changed).
As already mentioned, hosts are required to handle 576-byte IP packets at a minimum,
but many protocols (the most common being DNS and DHCP) limit the maximum size
of the UDP datagram that they use to 512 bytes or less.
The Checksum field is the most interesting field in the UDP header. This is because
the checksum is not a simple value calculated on the UDP header fields and data,
if present. The UDP checksum is computed on what is called the pseudo-header. The
pseudo-header fields for IPv4 are shown in Figure 10.3.
The all-zero byte is used to provide alignment of the pseudo-header, and the data
field must be padded to align it with a 16-bit boundary. The 12 bytes of the UDP
pseudo-header are prepended to the UDP datagram, and the checksum is computed on
the whole object. For this computation, the Checksum field itself is set to zero, and the
16-bit result placed in the field before transmission. If the checksum computes to zero,
an all-1s value is sent, and all-1s is not a computable checksum. The pseudo-header
fields are not sent with the datagram.
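As a sketch of the computation just described (not from the book), the following Python builds the IPv4 pseudo-header, prepends it to a UDP header and payload with the checksum field zeroed, and folds the 16-bit ones'-complement sum; the addresses, ports, and "TEST" payload are taken from the capture earlier in the chapter, and the result matches the 0xac26 shown there.

import struct, socket

def ones_complement_sum(data: bytes) -> int:
    """16-bit ones'-complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                              # pad to a 16-bit boundary
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)     # fold any carry back in
    return (~total) & 0xFFFF

def udp_checksum_ipv4(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> int:
    length = 8 + len(payload)                        # UDP header plus data
    pseudo = struct.pack("!4s4sBBH",
                         socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
                         0, 17, length)              # all-zero byte, protocol 17, UDP length
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)   # checksum zero while computing
    csum = ones_complement_sum(pseudo + header + payload)
    return csum if csum != 0 else 0xFFFF             # a computed zero is transmitted as all-1s

print(hex(udp_checksum_ipv4("10.10.12.166", "10.10.11.66", 32787, 7, b"TEST")))   # 0xac26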
FIGURE 10.2
The four UDP header fields: Source Port, Destination Port, Length (including the header), and Checksum, followed by the optional datagram data. Technically, use of the checksum is optional, but it is often used today.
FIGURE 10.3
The UDP IPv4 pseudo-header: the source and destination IPv4 addresses, an all-zero byte, the protocol value (17 for UDP), and the UDP length. These fields are used for checksum computation and include fields in the IP header.
At the receiver, the value of the Checksum is copied and the field again set to zero.
The checksum is again computed on the pseudo-header and compared to the received
value. If they match, the datagram is processed by the receiving application indicated
by the destination port number. If they do not match, the datagram is silently discarded
(i.e., no error message is sent to the source).
Naturally, using 32-bit IPv4 addresses to compute transport layer checksums
will not work in IPv6, although the procedure is the same. RFC 2460 establishes a
different set of pseudo-header fields for IPv6. The IPv6 pseudo-header is shown in
Figure 10.4.
The Next Header value is not always 17 for UDP, because other extension headers could be in use. Length is the length of the upper layer header and the data it
carries.
IPv4 AND IPv6 NOTES
The presence of the IP source and destination address in an upper layer checksum
computation strikes many as a violation of the concept of protocol layer independence.
(The same concern applies to NAT, discussed in Chapter 27.) In fact, a lot of TCP/IP
books mention that including packet level fields in the end-to-end checksum helps
assure (when the checksum is correct at the receiver) that the message has not only
made its way to right port, but to the correct system.
The presence of a pseudo-header also shows how late in the development process
that TCP and UDP were separated from IP. Not only that, but the transport layer and
network layer (or, to give them more intuitive names, the end-to-end layer and routing
layer) have always been tightly coupled in any working network.
The use of the UDP checksum is not required for IPv4, but highly recommended.
It is required in IPv6, of course. In IPv4, servers that receive client datagrams with the
checksum field set are supposed to reply using the checksum, but this is not always
enforced. If the IPv4 checksum field is not used, it is set to all 0 bits (recall that all 0
checksums are sent as all-1s).
FIGURE 10.4
The UDP IPv6 pseudo-header: the source and destination IPv6 addresses, the UDP (upper layer protocol) length, all-zero bytes, and the Next Header value. Use of the UDP checksum is not optional in IPv6.
PORT NUMBERS
Each application running above UDP (and TCP) and IP is indexed by its port number,
allowing for the multiplexing of the IP layer. Just as frames with different types of packets inside (on Ethernet, IPv4 is 0x0800 and IPv6 is 0x86DD) are multiplexed onto a single
LAN interface, the individual IPv4 or IPv6 packets are multiplexed and distributed by
the protocol number (UDP is IP protocol number 17, and TCP is 6).
The port numbers in turn multiplex and distribute datagrams from applications,
allowing them to share a single UDP or TCP process, which is usually integrated closely
with the operating system. This function of frame Ethertype, packet protocol, and datagram port is shown in Figure 10.5. The figure shows how IPv4 data for DNS makes its
way from frame through IPv4 through UDP to the DNS application listening on UDP
port 53.

FIGURE 10.5
UDP port multiplexing and distribution, showing how a single IP layer (IPv6 in this case) can be used by multiple transport protocols and applications. The frame header Ethertype (0x0800 for IPv4, 0x86DD for IPv6) selects the IP process, the packet header protocol field (6 for TCP, 17 for UDP) selects the transport process, and the destination port (53 for DNS, 7 for the echo service) selects the application.

Well-Known Ports
Port numbers can run from 0 to 65535. Port numbers from 0 to 1023 are reserved for common TCP/IP applications and are called well-known ports. The use of well-known ports allows client applications to easily locate the corresponding server application processes on other hosts. For example, a client process wanting to contact a DNS process running on a server must send the datagram to some destination port. The
well-known port number for DNS is 53, and that’s where the server process should
be listening for client requests. These ports are sometimes called “privileged” ports,
although a number of applications that formerly ran in “privileged” mode, such as HTTP
servers, do not run this way anymore except when binding to the port. It should be
noted that it is getting harder and harder to register new applications in the space
below 1023 (these often use registered ports in the range 1024 to 49151).
Ports used on servers are persistent in the sense that they last for a long time, or at
least as long as the application is running. Ports used on clients are ephemeral (“lasting
a short time,” although the term technically means “lasting a day”) in the sense that they
“come and go” as the user runs client applications.
Technically, UDP port numbers are independent from TCP port numbers. In
practice, most of the applications indexed by port numbers are the same in UDP or
TCP (although a few applications can use either protocol), excepting a handful that
are maintained for historical reasons. This does not imply that applications can use
TCP or UDP as they choose. It just means that it’s easier to maintain one list rather
than two. But no matter what port numbers are used, UDP port 1000 is a different
application than TCP port 1000, even though both applications might perform the
same function.
Some of the more common well-known port numbers are shown in Table 10.1. In
the table, the UDP and TCP port numbers are identical.
Port numbers above 1023 can be either registered or dynamic (also called private
or non-reserved). Registered ports are in the range 1024 to 49151. Dynamic ports are in
the range 49152 to 65535. As mentioned, most new port assignments are in the range
from 1024 to 49151.
Registered port numbers are non–well-known ports that are used by vendors for
their own server applications. After all, not every possible application capability will
be reflected in a well-known port, and software vendors should be free to innovate. Of
course, if another vendor chooses the same port number for a server process, and they
are run on the same system, there would be no way to distinguish between these two
seemingly identical applications.
■ Well-known ports—Ports in the range 0 to 1023 are assigned and controlled.
■ Registered ports—Ports in the range 1024 to 49151 are not assigned or controlled, but can be registered to prevent duplication.
■ Dynamic ports—Ports in the range 49152 to 65535 are not assigned, controlled, or registered. They are used for temporary or private ports. They are also known as private or non-reserved ports. Clients should choose ephemeral port numbers from this range, but many systems do not.
Table 10.1 Some Well-Known Ports Used by UDP and TCP Services and Functions

Port Number   Service       Meaning
7             Echo          Used to echo data back to the sender
9             Discard       Used to discard data at receiver
13            Daytime       Reports time information in user-friendly format
17            Quote         Returns a "quote of the day" (rarely used today)
19            Chargen       Character generator
53            DNS           Domain Name Service
67            DHCP server   Server port used to send configuration information
68            DHCP client   Client port used to receive configuration information
69            TFTP          Trivial file transfer
161           SNMP          Used to receive network management queries
162           SNMP traps    Used to receive network problem reports
1011–1023     Reserved      Reserved for future use
Vendors can register their application’s ports with ICANN. Other software vendors
are supposed to respect these registered values and register their own server application port numbers from the pool of unused values. Some registered UDP and TCP
port numbers are shown in Table 10.2.
The private, or dynamic, port numbers are used by clients and not servers. Datagrams sent from a client to a server are typically only sent to well-known or registered
ports (although there are exceptions). Server applications are usually long lived, while
client processes come and go as users run them. Client applications therefore are free
to choose almost any port number not used for some other purpose (hence the term
“dynamic”), and many use different source port numbers every time they are run. The
server has no trouble replying to the proper client because the server can just reverse
the source and destination port numbers to send a reply to the correct client (assuming
the IP address of the client is correct).
All TCP/IP implementations must know the range of well-known, registered, and
private ports when choosing a port number to use. Unix systems hold this information in the /etc/services file. Windows users can find it in the C:\%SystemRoot%\system32\drivers\etc\SERVICES file, where %SystemRoot% will be automatically referred to a
folder such as WinNT or WINDOWS. Most ports are the same for UDP or TCP, but some are
unique to one or the other. For example, FTP control uses TCP port 21.
Table 10.2 Selected Registered UDP and TCP Ports with Service and Brief Description of Meaning

Port Number   Service      Brief Description of Use
1024          Reserved     Reserved for future use
1025          Blackjack    Network version of blackjack
1026          CAP          Calendar access protocol
1027          Exosee       ExoSee
1029          Solidmux     Solid Mux Server
1102          Adobe 1      Adobe Server 1
1103          Adobe 2      Adobe Server 2
44553         Rbr-debug    REALBasic Remote Debug
46999         Mediabox     MediaBox Server
47557         Dbbrowse     Databeam Corporation
48620–49150   Unassigned   These ports have not been registered
49151         Reserved     Reserved for future use
Here is the beginning of the file from winsvr2:
# Copyright (c) 1993-1999 Microsoft Corp.
#
# This file contains port numbers for well-known services defined by IANA
#
# Format:
#
# <service name>  <port number>/<protocol>  [aliases...]  [#<comment>]
#
echo       7/tcp
echo       7/udp
discard    9/tcp    sink null
discard    9/udp    sink null
systat    11/tcp    users          #Active users
systat    11/tcp    users          #Active users
daytime   13/tcp
daytime   13/udp
qotd      17/tcp    quote          #Quote of the day
qotd      17/udp    quote          #Quote of the day
chargen   19/tcp    ttytst source  #Character generator
chargen   19/udp    ttytst source  #Character generator
ftp-data  20/tcp                   #FTP, data
ftp       21/tcp                   #FTP. control
telnet    23/tcp
[many more lines not shown...]
For the latest global list of well-known, registered, and private port numbers, see
www.iana.org/assignments/port-numbers. The port numbers are the same for IPv4
and IPv6.
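The same assignments can be queried from a program; here is a small sketch (not from the book) using Python's socket module, which simply consults the local services database described above, so the answers depend on that file.

import socket

print(socket.getservbyname("domain", "udp"))   # 53, the DNS port
print(socket.getservbyname("echo", "udp"))     # 7
print(socket.getservbyport(69, "udp"))         # 'tftp'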
The Socket
The combination of IPv4 or IPv6 address and port numbers forms an abstract concept
called a socket. We’ve mentioned the socket concept briefly before, and will do so
again and again in later chapters. The socket concept is important for many reasons,
and a later chapter will explore some of them more completely. For now, all that is
important to mention is that, for each client–server interaction, there is a socket on
each host at the endpoints of the network. The sockets at each end uniquely identify
that particular client–server interaction, although the same sockets can be used for
subsequent interactions.
Sockets are usually written in IPv4 and IPv6 by adding a colon (:) to the IP address,
although sometimes a dot (.) is used instead. In IPv6, it is also necessary to add brackets to avoid confusion with the :: notation, such as in [FC00:490:f100:1000::1]:80.
A UDP socket on lnxclient, for example, would be 10.10.12.166:17, while one on
bsdserver would be 10.10.12.77:17.
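As an illustration of the notation (the helper below is ours, not a standard call), a few lines of Python render sockets in the colon style, adding the brackets an IPv6 address needs:

# Sketch: printing sockets in address:port notation. format_socket() is an
# illustrative helper, not part of any standard library API.
import ipaddress

def format_socket(addr: str, port: int) -> str:
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        # Brackets keep the port from being confused with the :: notation.
        return f"[{ip.compressed}]:{port}"
    return f"{ip.compressed}:{port}"

print(format_socket("10.10.12.166", 17))           # 10.10.12.166:17
print(format_socket("FC00:490:f100:1000::1", 80))  # [fc00:490:f100:1000::1]:80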
Action                      Condition                                  Outcome
UDP request sent to server  Server available                           Sender gets UDP reply from server
UDP request sent to server  Port is closed on server                   Sender gets ICMP “Port unreachable” message
UDP request sent to server  Server host does not exist                 Sender gets ICMP “Host unreachable” message
UDP request sent to server  Port is blocked by firewall/router         Sender gets ICMP “Port unreachable — Administratively prohibited” message
UDP request sent to server  Port is blocked by silent firewall/router  (timeout)
UDP request sent to server  Reply is lost on way back                  (timeout)
FIGURE 10.6
UDP protocol actions, showing the request–reply outcomes.
UDP OPERATION
The delivery of UDP datagrams is by no means certain. The lack of an expected
response on the part of a server to a UDP client request is handled by a simple timeout.
Responses are not always expected, as might be the case with streaming audio and
video. The client might resend the datagram, but in many cases this might not be the
best strategy.
In some cases, lack of response is not a reliable indication that anything is wrong
with the network or remote host. Routers routinely filter out unwanted packets, and
many do so silently, while others send the appropriate ICMP “administratively prohibited” message.
In general, there are five major possible results when an application sends a UDP
request, shown in Figure 10.6. Note that any of the replies can be lost on the way back
to the sender, generating a timeout.
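From the sending application’s point of view, most of these outcomes collapse into “wait, then give up or retry.” A minimal sketch of a UDP client that behaves this way, assuming a UDP echo-style service at the address shown (the host and port are placeholders):

# Sketch: a UDP client handling the outcomes of Figure 10.6 with a timeout.
# The host and port below are placeholders, not a guaranteed live service.
import socket

udp_host, udp_port = "10.10.12.77", 7       # e.g., a UDP echo service

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2.0)                        # don't wait forever for a reply
sock.connect((udp_host, udp_port))          # lets some stacks report ICMP errors
try:
    sock.send(b"TESTstring")
    print("reply:", sock.recv(512))
except socket.timeout:
    print("no reply: silent filter, lost datagram, or lost reply")
except ConnectionRefusedError:
    print("ICMP Port unreachable reported by the local stack")
finally:
    sock.close()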
UDP OVERFLOWS
We’ve looked at UDP as a sort of quick-and-dirty request–response interaction between
hosts over a network. Delivery is not guaranteed, but neither is an important network
property called flow control. A lot of nonsense has been written about flow control, which is a very simple idea. It just means that no sender should ever be able to
overwhelm a receiver with traffic. In other words, receivers must have a way to tell
senders to slow down. UDP, of course, has no such mechanism.
The confusion over flow control often comes from treating flow control as a synonym for a related concept called congestion control. While flow control is strictly a
local property of individual senders and receivers, congestion control is a global property of the network. With congestion, no single sender overwhelms a receiver: There’s just too much traffic in
the router network for things to work properly.
Congestion control often uses flow control to accomplish its goals (source quench
was a not-too-sophisticated mechanism). There’s not much else a router can use other
than flow control to tell senders to shut up for a while. But that’s no excuse for treating
the two as one and the same.
What has this to do with UDP? Well, it is possible for UDP receivers’ buffers, which
are usually fixed, to overflow with unexpected UDP datagrams and be forced to discard
traffic. Most UDP implementations include a way to display “UDP socket overflows” or
discarded UDP datagrams.
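The application can at least inspect, and usually enlarge, the receive buffer that is doing the overflowing. A hedged sketch (the operating system may round, double, or cap the requested size):

# Sketch: checking and enlarging a UDP socket's receive buffer, the buffer
# that overflows when datagrams arrive faster than the application reads.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
print("default SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)  # request 4 MB
print("granted SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()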
But what if an application needs guaranteed delivery, sequencing, and flow control
to work properly, and we don’t want to add these to the application? Files cannot use
quick request–response messages to transfer themselves over a network. That’s the job
of TCP, which is the topic of the next chapter.
QUESTIONS FOR READERS
Figure 10.7 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
(a) UDP header fields: Source Port, Destination Port, Length (including header), Checksum, and Datagram Data (optional).
(b) Pseudo-header fields: Source IPv4 Address, Destination IPv4 Address, an all-0 byte, Protocol (= 17), and UDP Length.
FIGURE 10.7
The UDP header (a) and pseudo-header (b) fields for IPv4.
1. Which UDP header field does UDP use for demultiplexing?
2. What is UDP’s only attempt at error control?
3. A socket is comprised of which two TCP/IP components?
4. What is the registered port range? Is this assigned or controlled?
5. What is the dynamic or private port range? Are these assigned or controlled?
CHAPTER 11
Transmission Control Protocol
What You Will Learn
In this chapter, you will learn about the TCP transport layer protocol, which is
the connection-oriented, more reliable companion of UDP. We’ll talk about all
the fields in the TCP header (which are many) and how TCP’s distinctive three-way
handshake works.
You will learn how TCP operates during the data transfer and disconnect phase,
as well as some of the options that have been established to extend TCP’s use for
today’s networking conditions.
The Transmission Control Protocol (TCP) is as complex as UDP is simple. Some of the
same concepts apply to both because both TCP and UDP are end-to-end protocols.
Sockets and ports, well-known, dynamic, and private, apply to both. TCP is IP protocol
6, but the ports are usually the same as UDP and run from 0 to 65,535. The major difference between UDP and TCP is that TCP is connection oriented. And that makes all
the difference.
Internet specifications variously refer to connections as “virtual circuits,” “flows,”
or “packet-switched services,” depending on the context. These subtle variations are
unnecessary for this book, and we simply use the term “connection.” A connection is
a logical relationship between two endpoints (hosts) on a network. Connections can
be permanent (although the proper term is “semipermanent”) or on demand (often
called “switched”). Permanent connections are usually set up by manual configuration
of the network nodes. (On the Internet, this equates to a series of very specific static
routes.) On-demand connections require some type of signaling protocol to establish connections on the fly, node by node through the network from the source (the
“caller”) host to the destination (the “callee”) host.
Permanent connections are like intercoms: You can talk right away or at any time
and know the other end is there. However, you can only talk to that specific endpoint
on that connection. On-demand connections are like telephone calls: You have to wait
until the other end “answers” before you talk or send any information, but you connect
to (call) anyone in the world.
FIGURE 11.1
TCP client–server connections, showing that this chapter uses a client and server pair on the same LAN. (The figure shows the full Illustrated Network map: the LAN1 hosts bsdclient, lnxserver, wincli1, and winsvr1 in the Los Angeles office behind router CE0 on the Ace ISP network, AS 65459, and the LAN2 hosts bsdserver, lnxclient, winsvr2, and wincli2 in the New York office behind router CE6 on the Best ISP network, AS 65127, connected across the global public Internet.)
TCP AND CONNECTIONS
As much as router discussions become talks about IP packets and headers, host discussions tend to become talks about TCP. However, a lot of the demonstrations involving
TCP revolve around things that can go wrong. What happens if an acknowledgment
(ACK) is lost? What happens when two hosts send almost simultaneous connection
requests (SYN) to open a connection? With the emphasis on corner cases, many pages
written on TCP become exercises in exceptions. Yet there is much to be learned about
TCP just by watching it work in a normal, error-free environment.
Instead of watching to check whether TCP recovers from lost segments (it does),
we’ll just capture the sequence of TCP segments used on various combinations of
the three operating system platforms and see what’s going on. Later, we’ll use an FTP
data transfer between wincli2 and bsdserver (both on LAN2) to look at TCP in action.
In many ways it is an odd protocol, but we’ll only look at the basics and examine FTP
in detail in a later chapter. Figure 11.1 shows these hosts on the network.
As before, we’ll use Ethereal to look at frames and packets. There is also a utility
called tcpdump, which is bundled with almost every TCP/IP implementation. The major
exception, as might be expected, is Windows. The Windows version, windump, is not
much different than our familiar Ethereal, so we’ll just use Ethereal to capture our Windows TCP sessions. Because TCP operation is complicated, let’s look at some details of
TCP operation before looking at how TCP looks on the Illustrated Network.
THE TCP HEADER
The TCP header is the same for IPv4 and IPv6 and is shown in Figure 11.2. We’ve
already talked about the port fields in the previous chapter on UDP. Only the features
unique to TCP are described in detail.
Source and destination port—In some Unix implementations, source port numbers between 1024 and 4999 are called ephemeral ports. If an application
does not specify a source port to use, the operating systems will use a source
port number in this range. This range can be expanded and changed (but not
always), and 49,152 through 65,535 is more in line with current standards. Use
of ephemeral ports impacts firewall use and limits the number of connections
a host can have open at any one time.
Sequence number—Each new connection (re-tries of failed connections do not
count) uses a different initial sequence number (ISN) as the basis for tracking
segments. Windows uses a very simple time-based formula to compute that
ISN, while Unix ISNs are more elaborate (ISNs can be spoofed by hackers).
FIGURE 11.2
The TCP header fields. Note that some fields are a single bit wide, and others, like the options field, can be up to 40 bytes (320 bits) long. (The fields, in 32-bit words: Source Port, Destination Port, Sequence Number, Acknowledgment Number, Header Length, Reserved, two ECN bits, the URG/ACK/PSH/RST/SYN/FIN control bits, Window Size, TCP Checksum, Urgent Pointer, the Options Field (variable length, maximum 40 bytes, 0-padded to a 4-byte multiple), and the Data, or application message.)
Acknowledgment number—This number must be greater than or equal to zero (even a TCP SYN consumes one sequence number) except for the all 1’s ISN. All segments on an established connection must have the ACK bit set. If there is no actual data in the received segment, the acknowledgment number increments by 1. (Every byte in TCP is still counted, but that’s not all that contributes to the sequence number field.)
Header length—The TCP header length in 4-byte units.
Reserved—Four bits are reserved for future use.
ECN flags—The two explicit congestion notification (ECN) bits are used to tell
the host when the network is experiencing congestion and send windows
should be adjusted.
URG, ACK, PSH, RST, SYN, FIN—These six single-bit fields (Urgent, Acknowledgment, Push, Reset, Sync, and Final) give the receiver more information on how
to process the TCP segment. Table 11.1 shows their functions.
Window size—The size of receive window that the destination host has set. This
field is used in TCP flow control and congestion control. It should not be set
to zero in an initial SYN segment.
Checksum—An error-checking field on the entire TCP segment and header as
well as some fields from the IP datagram (the pseudo-header). The fields are
the same as in UDP. If the checksum computed does not match the received
value, the segment is silently discarded.
Urgent pointer—If the URG control bit is set, the start of the TCP segment contains important data that the source has placed before the “normal” contents
of the segment data field. Usually, this is a short piece of data (such as CTRL-C).
This field points to the first nonurgent data byte.
Options and padding—TCP options are padded to a 4-byte boundary and can be
a maximum of 40 bytes long. Generally, a 1-byte Type is followed by a 1-byte
Length field (including these initial 2 bytes), and then the actual options. The
options are listed in Table 11.2.
Table 11.1 TCP Control Bits by Abbreviation and Function

Bit   Function
URG   If set, the Urgent Pointer field value is valid (often resulting from an interrupt-like CTRL-C). Seldom used, but intended to raise the priority of the segment.
ACK   If set, the Acknowledgment Number field is valid.
PSH   If set, the receiver should not buffer the segment data, but pass them directly to the application. Interactive applications use this, but few others.
RST   If set, the connection should be aborted. A favorite target of hackers “hijacking” TCP connections, a series of rules now govern proper reactions to this bit.
SYN   If set, the hosts should synchronize sequence numbers and establish a connection.
FIN   If set, the sender has finished sending data and initiates a close of the connection.
Table 11.2 TCP Option Types, Showing Abbreviation (Meaning), Length, and RFC in Which Established

Type  Meaning      Total Length and Description                                                              RFC
0     EOL          1 byte, indicates end of option list (only used if end of options is not end of header)   793
1     NOP          1 byte, no option (used as padding to align header with Header-Length Field)              793
2     MSS          4 bytes, the last 2 of which indicate the maximum payload that one host will try to send another. Can only appear in SYN and does not change.   793, 879
3     WSCALE       3 bytes, the last establishing a multiplicative (scaling) factor. Supports bit-shifted window values above 65,535.   1072
4     SACKOK       2 bytes, indicating that selective ACKs are permitted.                                     2018
5     SACK         Of variable length, these are the selective ACKs.                                          1072
6     Echo         6 bytes, the last 4 of which are to be echoed.                                             1072
7     Echo reply   6 bytes, the last 4 of which echo the above.                                               1072
8     Timestamp    10 bytes, the last 8 of which are used to compute the retransmission timer through the RTT calculation. Makes sure that an old sequence number is not accepted by the current connection.   1323
9     POC perm     2 bytes, indicating that the partial order service is permitted.                           1693
10    POC profile  3 bytes, the last carrying 2-bit flags.                                                    1693
11    CC           6 bytes, the last 4 providing a segment connection count.                                  1644
12    CCNEW        6 bytes, the last 4 providing new connection count.                                        1644
13    CCECHO       6 bytes, the last 4 echoing previous connection count.                                     1644
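To tie the header fields and control bits together before looking at how they are used, here is a minimal sketch that unpacks the fixed 20-byte header described in Figure 11.2; the sample bytes are fabricated for illustration, not taken from a capture:

# Sketch: decoding the fixed 20-byte TCP header of Figure 11.2.
# The sample segment below is made up purely for illustration.
import struct

FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]

def parse_tcp_header(data: bytes) -> dict:
    (src, dst, seq, ack, off_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", data[:20])
    header_len = (off_flags >> 12) * 4            # Header Length is in 4-byte units
    bits = off_flags & 0x00FF                     # two ECN bits plus the six control bits
    flags = [name for i, name in enumerate(FLAG_NAMES) if bits & (1 << i)]
    return {"src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
            "header_len": header_len, "flags": flags, "window": window,
            "checksum": checksum, "urgent_pointer": urgent}

# A fabricated SYN: client port 2790 to server port 21, ISN 2000, window 65535.
sample = struct.pack("!HHIIHHHH", 2790, 21, 2000, 0, (5 << 12) | 0x0002, 65535, 0, 0)
print(parse_tcp_header(sample))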
TCP MECHANISMS
It might not be obvious why TCP connections should be such a complication. One of
the reasons is that TCP adds more to connectionless IP than connection capability. The
TCP service also provides aspects of what the ISO-RM defines as Session Layer services,
services that include the history (a popular term is “state variables”) of the connection
progress. Connections also provide a convenient structure with which to associate
QoS parameters, although every layer of any protocol stack always has some QoS duties
to perform, even if it is only error checking.
Officially, TCP is a virtual circuit service that adds reliability to the IP layer, reliability that is lacking in UDP. TCP also provides sequencing and flow control to the
host-to-host interaction, which in turn provides a congestion control mechanism to the
routing network as a whole (as long as TCP, normally an end-to-end concern, is aware
of the congested condition). The flow control mechanism in TCP is a sliding window
procedure that prevents senders from overwhelming receivers and applies in both
directions of a TCP connection.
TCP was initially defined in RFC 793, refined in RFCs 879, 1106, 1110, and 1323
(which obsoleted RFC 1072 and RFC 1185). RFCs 1644 and 1693 extended TCP to
support transactions, which can be loosely understood as “connection-oriented
request–response pairs that cannot use UDP.” RFC 3168 added explicit congestion notification (ECN) bits to the TCP header. These bits were “added” by redefining bits 6 and
7 in the TOS field of the packet header.
TCP and Transactions
It is important to note that TCP does not use the term “transaction” to describe
those peculiar interactions that require coordinated actions among multiple hosts
on the network. A familiar “transaction” is an accounting process that is not complete until both one account has been debited and another has been credited.
Database transactions are a completely different notion than what a transaction
means in TCP.
But this is not the purpose of transactions for TCP (T/TCP)! TCP “transactions”
are a way to sneak a quick burst of request–response data into an exchange of connection setup segments, similar to the way that UDP works.
TCP headers can be between 20 bytes (typical) and 60 bytes long when options are
used (not often). A segment, which is the content of a TCP data unit, is essentially a portion of the application’s send buffer. As bytes accumulate in the send buffer, they will
exceed the maximum segment size (MSS) established for the connection. These bytes
receive a TCP header and are sent inside an IP packet. There are also ways to “push” a
partially full send buffer onto the network.
At the receiver, the segment is added to a receive buffer until complete or until the
application has enough data to process. Naturally, the amount of data exchanged varies
greatly.
Let’s look at how TCP works and then examine the header fields that make it all
happen. It might seem strange to talk about major TCP features before the TCP header
has been presented, but the operation of many of the fields in the TCP header depend
on terminology and concepts used during TCP connection and other procedures.
CONNECTIONS AND THE THREE-WAY HANDSHAKE
TCP establishes end-to-end connections over the unreliable, best-effort IP packet service using a special sequence of three TCP segments sent from client to server and
back called a three-way handshake. Why three ways? Because packets containing the
TCP segment that ask a server to accept another connection and the server’s response
might be lost on the IP router network, leaving the hosts unsure of exactly what is
going on.
Once the three segments are exchanged, data transfer can take place from host
to host in either direction. Connections can be dropped by either host with a simple
exchange of segments (four in total), although the other host can delay the dropping
until final data are sent, a feature rarely used.
TCP uses unique terminology for the connection process. A single bit called the
SYN (synchronization) bit is used to indicate a connection request. This single bit is
still embedded in a complete 20-byte (usually) TCP header, and other information, such
as the initial sequence number (ISN) used to track segments, is sent to the other host.
Connections and data segments are acknowledged with the ACK bit, and a request to
terminate a connection is made with the FIN (final) bit.
The entire TCP connection procedure, from three-way handshake to data transfer
to disconnect, is shown in Figure 11.3. TCP also allows for the case where two hosts
perform an active open at the same time, but this is unlikely.
This example shows a small file transfer to a server (with the server sending 1000 bytes back to the client) using 1000-byte segments, but only to make the sequence numbers and acknowledgments easier to follow. The whole file is smaller than the server host’s receive window and nothing goes wrong (but things often go wrong in the real world).
FIGURE 11.3
Client–server interaction with TCP, showing the three connection phases of setup, data transfer, and release (disconnect). (In the figure, the client’s active OPEN sends SYN with ISN 2000, WIN 5840, and MSS 1460; the server’s passive OPEN answers with SYN, ISN 4000, WIN 8760, and MSS 1460; and the client’s ACK of 4001 completes the three-way handshake. The client then sends 1000-byte segments that the server ACKs, the server sends 1000 bytes back, and FIN/ACK exchanges in each direction, each followed by a wait, release the connection.)
Note that to send even one exchange of a request–response pair inside segments,
TCP has to generate seven additional packets. This is a lot of packet overhead, and the
whole process is just slow over high latency (delay) links. This is one reason that UDP
is becoming more popular as networks themselves become more reliable.
Connection Establishment
Let’s look at the normal TCP connection establishment’s three-way handshake in some
detail. The three messages establish three important pieces of information that both
sides of the connection need to know.
1. The ISNs to use for outgoing data (in order to deter hackers, these should not
be predictable).
2. The buffer space (window) available locally for data, in bytes.
3. The Maximum Segment Size (MSS) is a TCP Option and sets the largest segment
that the local host will accept. The MSS is usually the link MTU size minus the 40
bytes of the TCP and IP headers, but many implementations use segments of 512
or 536 bytes (it’s a maximum, not a demand).
A server issues a passive open and waits for a client’s active open SYN, which in
this case has an ISN of 2000, a window of 5840 bytes and an MSS of 1460 (common
because most hosts are on Ethernet LANs). The window is almost always a multiple
of the MSS (1460 × 4 = 5840 bytes). The server responds with a SYN and declares the connection open, setting its own ISN to 4000, and “acknowledging” sequence number 2001 (it really means “the next byte I get from you in a segment should be numbered 2001”). The server also establishes a window of 8760 bytes and an MSS of 1460 (1460 × 6 = 8760 bytes).
Finally, the client declares the connection open and returns an ACK (a segment with
the ACK bit set in the header) with the sequence number expected (2001) and the
acknowledgment field set to 4001 (which the server expects). TCP sequence numbers
count every byte on the data stream, and the 32-bit sequence field allows more than
4 billion bytes to be outstanding (nevertheless, high-speed transports such as Gigabit
Ethernet roll this field over too quickly for comfort, so special “scaling” mechanisms are
available for these link speeds).
TCP’s three-way handshake has two important functions. It makes sure that both
sides know that they are ready to transfer data and it also allows both sides to agree
on the initial sequence numbers, which are sent and acknowledged (so there is no
mistake about them) during the handshake. Why are the initial sequence numbers so
important? If the sequence numbers are not randomized and set properly, it is possible
for malicious users to hijack the TCP session (which can be reliable connections to a
bank, a store, or some other commercial entity).
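In the socket interface the two roles map directly onto library calls: the passive open is bind()/listen()/accept() and the active open is connect(), with the SYN, SYN-ACK, and ACK generated by the operating system’s TCP. A minimal local sketch (the address and port are arbitrary):

# Sketch: passive open (server) and active open (client) on one machine.
# The three-way handshake itself happens inside the OS TCP implementation.
import socket
import threading

HOST, PORT = "127.0.0.1", 5001      # arbitrary values for this sketch
ready = threading.Event()

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)               # passive open: wait for a SYN
        ready.set()
        conn, peer = srv.accept()   # returns once the handshake completes
        with conn:
            conn.sendall(conn.recv(1024))   # echo the data back

t = threading.Thread(target=server)
t.start()
ready.wait()

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))       # active open: SYN, SYN-ACK, ACK
    cli.sendall(b"TESTstring")
    print("echoed back:", cli.recv(1024))
t.join()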
Each device chooses a random initial sequence number to begin counting every
byte in the stream sent. How can the two devices agree on both sequence number values in about only three messages? Each segment contains a separate sequence number
field and acknowledgment field. In Figure 11.3, the client chooses an initial sequence
number (ISN) in the first SYN sent to the server. The server ACKs the ISN by adding one
to the proposed ISN (ACKs always inform the sender of the next byte expected) and
sending it in the SYN sent to the client to propose its own ISN. The client’s ISN could
be rejected, if, for example, the number is the same as used for the previous connection,
but that is not considered here. Usually, the ACK from the client acknowledges the ISN from the server (with the server’s ISN + 1 in the acknowledgment field), and the connection is established with both sides agreeing on the ISNs. Note that no information is sent
in the three-way handshake; it should be held until the connection is established.
This three-way handshake is the universal mechanism for opening a TCP connection. Oddly, the RFC does not insist that connections begin this way, especially with
regard to setting other control bits in the TCP header (there are three others in addition
to SYN and ACK and FIN). Because TCP really expects some control bits to be used during connection establishment and release, and others only during data transfer, hackers
can cause a lot of damage simply by messing around with wild combinations of the six
control bits, especially SYN/ACK/FIN, which asks for, uses, and releases a connection
all at the same time. For example, forging a SYN within the window of an existing SYN
would cause a reset. For this reason, developers have become more rigorous in their
interpretation of RFC 793.
Data Transfer
Sending data in the SYN segment is allowed in transaction TCP, but this is not typical.
Any data included are accepted, but are not processed until after the three-way handshake completes. SYN data are used for round-trip time measurement (an important
part of TCP flow control) and network intrusion detection (NID) evasion and insertion attacks (an important part of the hacker arsenal).
The simplest transfer scenario is one in which nothing goes wrong (which, fortunately, happens a lot of the time). Figure 11.4 shows how the interplay between TCP
sequence numbers (which allow TCP to properly sequence segments that pop out of
the network in the wrong order) and acknowledgments allow both sides to detect
missing segments.
The client does not need to receive an ACK for each segment. As long as the established receive window is not full, the sender can keep sending. A single ACK covers a
whole sequence of segments, as long as the ACK number is correct.
Ideally, an ACK for a full receive window’s worth of data will arrive at the sender
just as the window is filled, allowing the sender to continue to send at a steady rate.
This timing requires some knowledge of the round-trip time (RTT) to the partner host
and some adjustment of the segment-sending rate based on the RTT. Fortunately, both
of these mechanisms are available in TCP implementations.
What happens when a segment is “lost” on the underlying “best-effort” IP router network? There are two possible scenarios, both of which are shown in Figure 11.4.
In the first case, a 1000-byte data segment from the client to the server fails to arrive
at the server. Why? It could be that the network is congested, and packets are being
dropped by overstressed routers. Public data networks such as frame relay and ATM
(Asynchronous Transfer Mode) routinely discard their frames and cells under certain
conditions, leading to lost packets that form the payload of these data units.
If a segment is lost, the sender will not receive an ACK from the receiving host.
After a timeout period, which is adjusted periodically, the sender resends the last unacknowledged segment. The receiver then can send a single ACK for the entire sequence,
covering received segments beyond the missing one.
But what if the network is not congested and the lost packet resulted from a simple intermittent failure of a link between two routers? Today, most network errors are
caused by faulty connectors that exhibit specific intermittent failure patterns that
steadily worsen until they become permanent. Until then, the symptom is sporadic lost
packets on the link at random intervals. (Predictable intervals are the signature of some
outside agent at work.)
FIGURE 11.4
How TCP handles lost segments. The key here is that although the client might continue to send data, the server will not acknowledge all of it until the missing segment shows up. (The figure shows two cases: a lost segment that the client resends after a timeout when no ACK arrives, and a lost segment that the server signals by repeating its ACK for the missing data, prompting the client to resend without waiting for the timeout.)
Waiting is just a waste of time if the network is not congested and the lost packet
was the result of a brief network “hiccup.” So TCP hosts are allowed to perform a “fast
recovery” with duplicate ACKs, which is also shown in Figure 11.4.
The server cannot ACK the received segments 11,001 and subsequent ones because
the missing segment 10,001 prevents it. (An ACK says that all data bytes up to the ACK
have been received.) So every time a segment arrives beyond the lost segment, the
host only ACKs the missing segment. This basically tells the other host “I’m still waiting for the missing 10,001 segment.” After several of these are received (the usual number
is three), the other host figures out that the missing segment is lost and not merely
delayed and resends the missing segment. The host (the server in this case) will then
ACK all of the received data.
The sender will still slow down the segment sending rate temporarily, but only in
case the missing segment was the result of network congestion.
Closing the Connection
Either side can close the TCP connection, but it’s common for the server to decide just
when to stop. The server usually knows when the file transfer is complete, or when the
user has typed logout and takes it from there. Unless the client still has more data to
send (not a rare occurrence with applications using persistent connections), the hosts
exchange four more segments to release the connection.
In the example, the server sends a segment with the FIN (final) bit set, a sequence
number (whatever the incremented value should be), and acknowledges the last data
received at the server. The client responds with an ACK of the FIN and appropriate
sequence and acknowledgment numbers (no data were sent, so the sequence number
does not increment).
The client TCP then releases the connection and sends its own FIN to the server with the same sequence and acknowledgment numbers. The server sends an ACK to the FIN and increments the acknowledgment field but not the sequence number. The connection is down.
But not really. The “best-effort” nature of the IP network means that delayed duplicates could pop out of a router at any time and show up at either host. Routers don’t
do this just to be nasty, of course. Typically, a router that hangs or has a failed link rights
itself and finds packets in a buffer (which is just memory) and, trying to be helpful,
sends them out. Sometimes routing loops cause the same problem.
In any case, late duplicates must be detected and disposed of (which is one reason
the ISN space is 32 bits—about 4 billion—wide). The time to wait is supposed to be
twice as long as it could take a packet to have its TTL go to zero, but in practice this is
set to 4 minutes (making the packet transit time of the Internet 2 minutes, an incredibly high value today, even for Cisco routers, which are fond of sending packets with
the TTL set to 255).
The wait time can be as high as 30 minutes, depending on TCP/IP implementation,
and resets itself if a delayed FIN pops out of the network. Because a server cannot
accept other connections from this client until the wait timer has expired, this often
led to “server paralysis” at early Web sites.
Today, many TCP implementations use an abrupt close to escape the wait-time
requirement. The server usually sends a FIN to the client, which first ACKs and then
sends a RST (reset) segment to the server to release the connection immediately and
bypass the wait-time state.
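On many stacks an application can ask for an abrupt close itself through the SO_LINGER option with a zero linger time, which typically turns close() into an RST rather than the FIN/ACK exchange and wait state. A hedged sketch, since the exact behavior varies by operating system:

# Sketch: requesting an abrupt close. With linger enabled and a linger time
# of 0 seconds, many TCP stacks send RST on close() and skip the wait state.
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# struct linger: l_onoff = 1 (enabled), l_linger = 0 seconds
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
# ... connect() and transfer data here ...
sock.close()    # typically an RST rather than the normal FIN/ACK sequence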
FLOW CONTROL
Flow control prevents a sender from overwhelming a receiver with more data than it
can handle. With TCP, which resends all lost data, a receiver that is discarding data that
overflows the receive buffers is just digging itself a deeper and deeper hole.
Flow control can be performed by either the sender or the receiver. It sounds
strange to have senders performing flow control (how could they know when receivers are overwhelmed?), but that was the first form of flow control used in older
networks.
Many early network devices were printers (actually, teletype machines, but the
point is the same). They had a hard enough job running network protocols and printing the received data, and could not be expected to handle flow control as well. So
the senders (usually mainframes or minicomputers with a lot of horsepower for the
day) knew exactly what kind of printer they were sending to and their buffer sizes. If
a printer had a two-page buffer (it really depended on byte counts), the sender would
know enough to fire off two pages and then wait for an acknowledgment from the
printer before sending more. If the printer ran out of paper, the acknowledgment was
delayed for a long time, and the sender had to decide whether it was okay to continue
or not.
Once processors grew in power, flow control could be handled by the receiver, and
this became the accepted method. Senders could send as fast as they could, up to a
maximum window size. Then senders had to wait until they received an acknowledgment from the receiver. How is that flow control? Well, the receiver could delay the
acknowledgments, forcing the sender to slow down, and usually could also force the
sender to shrink its window. (Receivers might be receiving from many senders and
might be overwhelmed by the aggregate.)
Flow control can be implemented at any protocol level or even every protocol layer.
In practice, flow control is most often a function of the transport layer (end to end). Of
course, the application feeding TCP with data should be aware of the situation and also
slow down, but basic TCP could not do this.
TCP is a “byte-sequencing protocol” in which every byte is numbered. Although
each segment must be acknowledged, one acknowledgment can apply to multiple segments, as we have seen. Senders can keep sending until the data in all unacknowledged
segments equals the window size of the receiver. Then the sender must stop until an
acknowledgment is received from the receiving host.
This does not sound like much of a flow control mechanism, but it is. A receiver is
allowed to change the size of the receive window during a connection. If the receiver
finds that it cannot process the received window’s data fast enough, it can establish
a new (smaller) window size that must be respected by the sender. The receiver can
even “close” the window by shrinking it to zero. Nothing more can be sent until the
receiver has sent a special “window update ACK” (it’s not ACKing new data, so it’s not
a real ACK) with the new available window size.
The window size should be set to the network bandwidth multiplied by the round-trip time to the remote host, which can be established in several ways. For example, a 100-Mbps Ethernet with a 5-millisecond (ms) round-trip time (RTT) would establish a 64,000-byte window on each host (100 Mbps × 5 ms = 0.5 Mbits = 512 kbits = 64 kbytes). When the window size is “tuned” to the RTT this way, the sender should
receive an ACK for a window full of segments just in time to optimize the sending
process.
“Network” bandwidths vary, as do round-trip times. The windows can always shrink
or grow (up to the socket buffer maximum), but what should their initial value be?
The initial values used by various operating systems vary greatly, from a low of 4096
(which is not a good fit for Ethernet’s usual frame size) to a high of 65,535 bytes. FreeBSD defaults to 17,520 bytes, Linux to 32,120, and Windows XP to anywhere between
17,000 and 18,000 depending on details.
In Windows XP, the TCPWindowSize can be changed to any value less than 64,240. Most Unix-based systems allow changes to be made to the /etc/sysctl.conf file. When adjusting TCP transmit and receive windows, make sure that the buffer space is sufficient to prevent hanging of the network portion of the OS. In FreeBSD, this means that the value of nmbclusters and socket buffers must be greater than the maximum window size. Most Linux-based systems autotune this based on memory settings.
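The bandwidth-delay arithmetic above is easy to check; a few throwaway lines (the link speeds and RTTs are only examples):

# Sketch: bandwidth x round-trip time gives the window needed to keep the
# pipe full. The values are illustrative, not measurements.
def window_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    return bandwidth_bps * rtt_seconds / 8     # bits in flight, converted to bytes

print(window_bytes(100e6, 0.005))   # 100 Mbps, 5 ms RTT: 62,500 bytes (about 64 KB)
print(window_bytes(1e9, 0.005))     # 1 Gbps, same RTT: 625,000 bytes
print(window_bytes(45e6, 0.5))      # a 45 Mbps satellite "long fat pipe": ~2.8 MB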
TCP Windows
How do the windows work during a TCP connection? TCP forms its segments in memory sequentially, based on segment size, each needing only a set of headers to be added
for transmission inside a frame. A conceptual “window” (it’s all really done with pointers) overlays this set of data, and two moveable boundaries are established in this series
of segments to form three types of data. There are segments waiting to be transmitted,
segments sent and waiting for an acknowledgment, and segments that have been sent
and acknowledged (but have not been purged from the buffer).
As acknowledgments are received, the window “slides” along, which is why the
process is commonly called a “sliding window.”
Figure 11.5 shows how the sender’s sliding window is used for flow control. (There
is another at the receiver, of course.) Here the segments just have numbers, but each
integer represents a whole 512, 1460, or whatever size segment. In this example, segments 20 through 25 have been sent and acknowledged, 26 through 29 have been sent
but not acknowledged, and segments 30 through 35 are waiting to be sent. The send
buffer is therefore 15 segments wide, and new segments replace the oldest as the buffer wraps.
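A toy model of the sender’s bookkeeping (segments are numbered here the way Figure 11.5 numbers them; real TCP tracks byte sequence numbers) shows how little state the sliding window actually needs:

# Sketch: toy sender-side sliding window over numbered segments, mirroring
# Figure 11.5. Real TCP counts bytes, not whole segments.
class SendWindow:
    def __init__(self, first_unacked: int, next_to_send: int, window: int):
        self.first_unacked = first_unacked   # oldest segment sent but not yet ACKed
        self.next_to_send = next_to_send     # next segment waiting to be sent
        self.window = window                 # allowed number of unACKed segments

    def can_send(self) -> bool:
        return self.next_to_send - self.first_unacked < self.window

    def ack_up_to(self, acked: int) -> None:
        # One ACK can cover a whole run of segments; the window slides forward.
        self.first_unacked = max(self.first_unacked, acked + 1)

w = SendWindow(first_unacked=26, next_to_send=30, window=10)
print(w.can_send())      # True: only four segments are outstanding
w.ack_up_to(29)          # a single ACK covers segments 26 through 29
print(w.first_unacked)   # 30: the window has slid along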
Sliding Window
20 21 22 23 24 25 | 26 27 28 29 | 30 31 32 33 34 35
Data sent and acknowledged (20–25) | Data sent and waiting for acknowledgment (26–29) | Data to be sent (30–35)
(Each integer represents a segment of hundreds or thousands of bytes)
FIGURE 11.5
TCP sliding window.
Flow Control and Congestion Control
When flow control is used as a form of congestion control for the whole network, the
network nodes themselves are the “receivers” and try to limit the amount of data that
senders dump into the network.
But now there is a problem. How can routers tell the hosts using TCP (which is an
end-to-end protocol) that there is congestion on the network? Routers are not supposed to play around with the TCP headers in transit packets (routers have enough to
do), but they are allowed to play around with IP headers (and often have to).
Routers know when a network is congested (they are the first to know), so they can
easily flip some bits in the IPv4 and IPv6 headers of the packets they route. These bits
are in the TOS (IPv4) and Flow (IPv6) fields, and the hosts can read these bits and react
to them by adjusting windows when necessary.
RFC 3168 establishes support for these bits in the IP and TCP headers. However,
support for explicit congestion notification in TCP and IP routers is not mandatory,
and rare to nonexistent in routers today. Congestion in routers is usually indicated by
dropped packets.
PERFORMANCE ALGORITHMS
By now, it should be apparent that TCP is not an easy protocol to explore and understand.
This complexity of TCP is easy enough to understand: The underlying network should be
fast and simple, IP transport should be fast and simple as well, but unless every application builds in complex mechanisms to ensure smooth data flow across the network, the
complexity of networking must be added to TCP. This is just as well, as the data transfer
concern is end to end, and TCP is the host-to-host layer, the last bastion of the network
shielding the application from network operations.
To look at it another way, if physical networks and IP routers had to do all that the
TCP layer of the protocol stack does, the network would be overwhelmed. Routers
would be overwhelmed by the amount of state information that they would need to
carry, so we delegate carrying that state information to the hosts. Of course, applications are many, and each one shouldn’t have to do it all. So TCP does it. By the way,
this consistent evolution away from “dumb terminal on a smart network” like X.25 to
“smart host on a dumb network” like TCP/IP is characteristic of the biggest changes in
networking over the years.
This chapter has covered only the basics, and TCP has been enhanced over the
years with many algorithms to enhance the performance of TCP in particular and the
network in general. ECN is only one of them. Several others exist and will only be mentioned here and not investigated in depth.
Delayed ACK—TCP is allowed to wait before sending an ACK. This cuts down
on the number of “stand-alone” ACKs, and lets a host wait for outgoing data
to “piggyback” an acknowledgment onto. Most implementations use a 200-ms
wait time.
Slow Start—Regardless of the receive window, a host computes a second congestion window that starts off at one segment. After each ACK, this window
doubles in size until it matches the number of segments in the “regular”
window. This prevents senders from swamping receivers with data at the start
of a connection (although it’s not really very slow at all).
Defeating Silly Window Syndrome—Early TCP implementations processed receive buffer data slowly, but received segments with large chunks of data. Receivers then shrunk the window as if this “chunk” were normal. So windows often shrunk to next to nothing and remained there. Receivers can “lie” to prevent this, and senders can implement the Nagle algorithm to prevent the sending of small segments, even if PUSHed. (Applications that naturally generate small segments, such as a remote login, can turn this off; a brief sketch of doing so appears after this list.)
Scaling for Large Delay-Bandwidth Network Links—The TCP window-scale
option can be used to count more than 4 billion or so bytes before the sequence
number field wraps. A timestamp option sent in the SYN message helps also.
Scaling is sometimes needed because the Window field in the TCP header is
16 bits long, so the maximum window size is normally 64 kbytes. Larger
windows are needed for large-delay times, high-bandwidth product links
(such as the “long fat pipes” of satellite links). The scaling uses 3 bytes: 1 for type
(scaling), 1 for length (number of bytes), and 1 for a shift value called S. The
shift value provides a binary scaling factor to be applied to the usual value
in the Window field. Scaling shifts the window field value S bits to the left to
determine the actual window size to use.
Adjusting Resend Timeouts Based on Measured RTT—How long should a sender
wait for an ACK before resending a segment? If the resend timeout is too short,
resends might clutter up a network slow in relaying ACKs because it is teetering on the edge of congestion. If it is too long, it limits throughput and slows
recovery. And a value just right for a TCP connection over the local LAN might be much too short for connections around the globe over the Internet. TCP adjusts its value for changing network conditions and link speeds in a rational fashion based on the measured RTT and how fast the RTT has changed in the past.
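The Nagle behavior mentioned under silly window syndrome is exposed to applications on most systems as the TCP_NODELAY socket option, which disables the algorithm for one connection; a minimal sketch:

# Sketch: disabling the Nagle algorithm for an interactive connection.
# TCP_NODELAY makes the stack send small segments immediately instead of
# coalescing them while an earlier segment is still unacknowledged.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# ... connect() and use the socket as usual ...
sock.close()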
TCP AND FTP
First we’ll use a Windows FTP utility on wincli2 (10.10.12.222) to grab the 30,000-byte file test.stuff from the server bsdserver (10.10.12.77) and capture the TCP
(and FTP) packets with Ethereal. Both hosts are on the same LAN segment, so the process should be quick and error-free.
The session took a total of 91 packets, but most of those were for the FTP data
transfer itself. The Ethereal statistics of the sessions note that it took about 55 seconds
from first packet to last (much of which was “operator think time”), making the average
about 1.6 packets per second. A total of 36,000 bytes were sent back and forth, which
sounds like a lot of overhead, but it was a small file. The throughput on the 100 Mbps
LAN2 was about 5,200 bits per second, showing why networks with humans at the
controls have to be working very hard to fill up even a modestly fast LAN.
We’ve seen the Ethereal screen enough to just look at the data in the screen shots.
And Ethereal lets us expand all packets and create a PDF out of the capture file. This in
turn makes it easy to cut-and-paste exactly what needs to be shown in a single figure
instead of many.
For example, let’s look at the TCP three-way handshake that begins the session in
Figure 11.6.
FIGURE 11.6
Capture of three-way handshake. Note that Ethereal sets the “relative” sequence number to zero
instead of presenting the actual ISN value.
The first frame, from 10.10.12.222 to 10.10.12.77, is detailed in the figure. The
window size is 65,535, the MSS is 1460 bytes (as expected for Ethernet), and selective
acknowledgments (SACK) are permitted. The server’s receive window size is 57,344
bytes. Figure 11.7 shows the relevant TCP header values from the capture for the initial
connection setup (which is the FTP control connection).
Ethereal shows “relative” sequence and acknowledgment numbers, and these always
start at 0. But the figure shows the last bits of the actual hexadecimal values, showing
how the acknowledgment increments the value in sequence and acknowledgment
number (the number increments from 0x...E33A to 0x...E33B), even though no data
have been sent.
Note that Windows XP uses 2790 as a dynamic port number, which is really in the
registered port range and technically should not be used for this purpose.
This example is actually a good study in what can happen when “cross-platform”
TCP sessions occur, which is often. Several segments have bad TCP checksums. Since
we are on the same LAN segment, and the frame and packet passed error checks correctly, this is probably a quirk of TCP pseudo-header computation and no bits were
changed on the network. There is no ICMP message because TCP is above the IP layer.
Note that the application just sort of shrugs and keeps right on going (which happens
not once, but several times during the transfer). Things like this “non–error error” happen all the time in the real world of networking.
At the end of the session, there are really two “connections” between wincli2 and
bsdserver. The FTP session rides on top of the TCP connection. Usually, the FTP session
is ended by typing BYE or QUIT on the client. But the graphical package lets the user
just click a disconnect button, and takes the TCP connection down without ending the
FTP session first. The FTP server objects to this breach of protocol and the FTP server
process sends a message with the text, You could at least say goodbye, to the client.
(No one will see it, but presumably the server feels better.)
FIGURE 11.7
FTP three-way handshake, showing how the ISNs are incremented and acknowledged. (In the figure, wincli2 performs the active open from client port 2790, sending SYN with ISN ...72d1, WIN 65535, and MSS 1460; bsdserver, after its passive open, answers with SYN, ISN ...e33a, WIN 57344, and MSS 1460; wincli2 then sends ACK with SEQ ...72d2 and ACK ...e33b. That final ACK has a bad checksum, but the three-way handshake completes anyway.)
TCP sessions do not have to be complex. Some are extremely simple. For example, the common TCP/IP “echo” utility can use UDP or TCP. With UDP, an echo is a simple exchange of two segments, the request and reply. In TCP, the exchange is a 10-packet sequence.
This is shown in Figure 11.8, which captures the echo “TESTstring” from lnxclient to lnxserver. It includes the initial ARP request and response to find the server.
FIGURE 11.8
Echo using TCP, showing all packets of the ARP, three-way handshake, data transfer, and connection release phases.
Why so many packets? Here’s what happens during the sequence.
Handshake (packets 3 to 5)—The utility uses dynamic port 33,146, meaning
Linux is probably up-to-date on port assignments. The connection has a window of 5840 bytes, much smaller than the FreeBSD and Windows XP window
sizes. The MSS is 1460, and the exchange has a rich set of TCP options, including timestamps (TSV) and window scaling (not used, and not shown in the
figure).
Transfer (packets 6 to 9)—Note that each ECHO message, request and response, is
acknowledged. Ethereal shows relative acknowledgment numbers, so ACK=11
means that 10 bytes are being ACKed (the actual number is 0x0A8DA551, or
177,055,057 in decimal).
Disconnect (packets 10 to 12)—A typical three-way “sign-off” is used.
We’ll see later in the book that most of the common applications implemented on
the Internet use TCP for its sequencing and resending features.
QUESTIONS FOR READERS
Figure 11.9 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
(The figure repeats the TCP header fields: Source Port, Destination Port, Sequence Number, Acknowledgment Number, Header Length, Reserved, Control Bits, Window Size, TCP Checksum, Urgent Pointer, Options Field, and Data, together with the wincli2/bsdserver FTP three-way handshake from Figure 11.7: the client’s SYN with ISN ...72d1, WIN 65535, and MSS 1460; the server’s SYN with ISN ...e33a, WIN 57344, and MSS 1460; and the client’s completing ACK.)
FIGURE 11.9
The TCP header fields and three-way handshake example.
1. What are the three phases of connection-oriented communications?
2. Which fields are present in the TCP header but absent in UDP? Why are they not
needed in UDP?
3. What is the TCP flow control mechanism called?
4. What does it mean when the initial sequence and acknowledgment numbers are
“relative”?
5. What is the silly window syndrome? What is the Nagle algorithm?
CHAPTER 12
Multiplexing and Sockets
What You Will Learn
In this chapter, you will learn about how multiplexing (and demultiplexing) and
sockets are used in TCP/IP. We’ll see how multiplexing allows many applications to share a single TCP/IP stack process.
You will learn how layers and applications interact to make multiplexing and
the socket concept very helpful in networking. We’ll use a small utility program to
investigate sockets and illustrate the concepts in this chapter.
Now that we’ve looked at UDP and TCP in detail, this chapter explores two key
concepts that make understanding how UDP and TCP work much easier: multiplexing and sockets. Technically, the term should be “multiplexing and demultiplexing,” but
because mixing things together makes little sense unless you can get them back again,
most people just say “multiplexing” and let it go at that.
Why is multiplexing necessary? Most TCP/IP hosts have only one TCP/IP stack process running, meaning that every packet passing into or out of the host uses the same
software process. This is due to the fact that the hosts usually have only one network
connection, although there are exceptions. However, a host system typically runs many network applications at the same time (technically, if other systems can access them, the host system is a server). All these
applications share the single network interface through multiplexing.
LAYERS AND APPLICATIONS
Both the source and destination port numbers, each 16 bits long, are included as the
first fields of the TCP or UDP segment header. Well-known ports use numbers between
0 and 1023, which are reserved expressly for this purpose. In many TCP/IP implementations, there is a process (usually inetd or xinetd, the “Internet daemon”) that listens
for all TCP/IP activity on an interface. This process then launches the FTP or other application processes on request, using the well-known ports as appropriate.
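The daemon’s job can be sketched in a few lines: one process watches several listening sockets at once and demultiplexes incoming requests by which socket (and therefore which well-known port) they arrive on. This toy version uses made-up high ports and canned replies rather than launching real servers:

# Sketch: an inetd-like loop demultiplexing connections by listening port.
# The ports and replies are invented for illustration only.
import select
import socket

SERVICES = {13013: b"daytime-ish reply\r\n", 17017: b"quote-ish reply\r\n"}

listeners = {}
for port in SERVICES:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("0.0.0.0", port))
    s.listen(5)
    listeners[s] = port

while True:
    readable, _, _ = select.select(list(listeners), [], [])
    for s in readable:
        conn, peer = s.accept()
        # The listening socket that fired identifies the requested "service."
        conn.sendall(SERVICES[listeners[s]])
        conn.close()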
FIGURE 12.1
Sockets between Linux client and server, showing the devices used in this chapter to illustrate socket operation. (The figure repeats the Illustrated Network map, with lnxserver on LAN1 in the Los Angeles office and lnxclient on LAN2 in the New York office as the endpoints used in this chapter.)
However, the well-known server port numbers can be statically mapped to their
respective application on the TCP/IP server, and that’s how we will explore them in
this introduction to sockets. With static mapping, the DNS (port number 53) or FTP
(port number 21) server processes (for example) must be running on the server at all
times in order for the server TCP protocol to accept connections to these applications
from clients. Things are more complex when both IPv4 and IPv6 are running, but this
chapter considers the situation for IPv4 for simplicity.
This chapter will be a little different than the others. Instead of jumping right in and
capturing packets and then analyzing them, the socket packet capture is actually the
whole point of the chapter. So we’ll save that until last. In the meantime, we’ll develop
a socket-based application to work between the lnxclient (10.10.12.166 on LAN2)
and lnxserver (10.10.11.66 on LAN1), as shown in Figure 12.1.
THE SOCKET INTERFACE
Saying that applications share a single network connection through multiplexing is
not much of an explanation. How does the TCP/IP process determine the source and
destination application for the contents of an arriving segment? The answer is through
sockets. Sockets are the combination of IP address and TCP/UDP port number. Hosts
use sockets to identify TCP connections and sort out UDP request–response pairs, and
to make the coding of TCP/IP applications easier.
The server TCP/IP application processes that “listen” through passive opens for connection requests use well-known port numbers, as already mentioned. The client TCP/
IP application processes that “talk” through active opens and make connection requests
must choose port numbers that are not reserved for these well-known numbers. Servers listen on a socket for clients talking to that socket. There is nothing new here, but
sockets are more than just a useful concept. The socket interface is the most common
way that application programs interact with the network.
There are several reasons for the socket interface concept and construct. One reason has already been discussed. Suppose there are two FTP sessions in progress to
the same server, and both client processes are running over the same network connection on a host with IP address 192.168.10.70. It is up to the client to make sure
that the two processes use different client port numbers to control the sessions to
the server. This is easy enough to do. If the clients have chosen client port numbers
14972 and 14973, respectively, the FTP server process replies to the two client sockets
as 192.168.10.70:14972 and 192.168.10.70:14973. So the sockets allow simultaneous
file transfer sessions to the same client from the same FTP server. If the client sessions
were distinguished only by IP address or port number, the server would have no way
of uniquely identifying the client FTP process. And the FTP server’s socket address is
accessed by all of the FTP clients at the same time without confusion.
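To make the idea concrete, the four values that together identify one TCP connection can be pictured
as a simple structure (a sketch of ours for illustration only; no real stack defines exactly these names):

#include <stdint.h>

/* Illustrative only: the four values that together identify one TCP
   connection. The two FTP control sessions above differ only in the
   client port, so the server can always tell them apart. */
struct tcp_connection_id {
    uint32_t client_ip;    /* 192.168.10.70 for both sessions            */
    uint16_t client_port;  /* 14972 for one session, 14973 for the other */
    uint32_t server_ip;    /* the FTP server's address                   */
    uint16_t server_port;  /* 21, the well-known FTP control port        */
};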
Now consider the server shown in Figure 12.2. Here there is a server that has more
than one TCP/IP interface for network access, and thus more than one IP address. Yet
these servers may still have only one FTP (or any other TCP/IP application) server process
running.
FIGURE 12.2
The concept of sockets applied to FTP. Note how sockets allow a server with two different IP
addresses to access the FTP server process using the same port. (The figure shows a single FTP
process on a server reached through Socket 1, 172.16.24.17:22, and Socket 2, 172.16.43.11:22.)
With the socket concept, the FTP server process has no problem separating
client FTP sessions from different network interfaces because their socket identifiers
will differ on the server end. Since a TCP connection is always identified by both the client and server IP address and the client and server port numbers, there is no confusion.
This illustrates the sockets concept in more depth, but not the use of the socket
interface in a TCP/IP network. The socket interface forms the boundary between the
application program written by the programmer and the network processes that are
usually bundled with the operating system and quite uniform compared to the myriad
of applications that have been implemented with programs.
Socket Libraries
Developers of applications for TCP/IP networks will frequently make use of a sockets
library to implement applications. These applications are not the standard “bundled”
TCP/IP applications like FTP, but other applications for remote database queries and
the like that must run over a TCP/IP network. The sockets library is a set of programming tools used to simplify the writing of application programs for TCP or UDP. Since
these “custom” applications are not included in the regular application services layer
of TCP/IP, these applications must interface directly with the TCP/IP stack. Of course,
these applications must also exist in the same client–server, active–passive open environment as all other TCP/IP applications.
The socket is the programmer’s identifier for this TCP/IP layer interface. In Unix
environments, the socket is treated just like a file. That is, the socket is created, opened,
read from, written to, closed, and deleted, which are exactly the same operations that
a programmer would use to manipulate a file on a local disk drive. Through the use of
the socket interface, a developer can write TCP/IP networked client–server applications without thinking about managing TCP/IP connections on the network.
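As a small illustration of the file analogy (a sketch only, with no error handling, and not part of the
chapter's example code), the generic file calls work directly on a connected socket descriptor:

#include <unistd.h>    /* read(), write(), close() */

/* Once a socket descriptor is connected, the ordinary file calls can be
   used on it just as they would be on an open file. */
void use_socket_like_a_file(int connected_sock)
{
    char buffer[64];

    write(connected_sock, "hello\n", 6);           /* like writing to a file   */
    read(connected_sock, buffer, sizeof(buffer));  /* like reading from a file */
    close(connected_sock);                         /* like closing a file      */
}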
The programmer can use sockets to refer to any remote TCP/IP application layer
entity. Many developers use socket interfaces to provide “front-end” graphical interfaces
to common remote TCP/IP server processes such as FTP.
FIGURE 12.3
The three socket types. Note that the raw socket interface bypasses TCP and UDP. (The socket
program often builds its own TCP or UDP header.) The figure shows application programs reaching
TCP through the stream interface, UDP through the datagram interface, and the IP layer directly
through the raw socket interface.
Of course, the developers may
choose to write applications that implement both sides of the client–server model.
The socket can interface with either TCP (called a “stream” socket), UDP (called a
“datagram” socket), or even IP directly (called the “raw” socket). Figure 12.3 shows the
three major types of socket programming interfaces. There are even socket libraries
that allow interfaces with the frames of the network access layer below IP itself. More
details must come from the writers of the sockets libraries themselves, since socket
libraries vary widely in operational specifics.
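To show how little changes between the three, here is a minimal sketch (ours, not from any particular
sockets library) of the calls that request each socket type on a Unix-like system:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* The same socket() call creates all three types; only the type and
   protocol arguments change. The raw socket normally requires superuser
   privileges and returns -1 for an ordinary user. */
int main(void)
{
    int stream_sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); /* TCP "stream"   */
    int dgram_sock  = socket(AF_INET, SOCK_DGRAM,  IPPROTO_UDP); /* UDP "datagram" */
    int raw_sock    = socket(AF_INET, SOCK_RAW,    IPPROTO_ICMP);/* "raw" socket   */

    printf("stream=%d datagram=%d raw=%d\n", stream_sock, dgram_sock, raw_sock);
    return 0;
}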
TCP Stream Service Calls
When used in the stream mode, the socket interface supplies the TCP protocol with
the proper service calls from the application. These service calls are few in number,
but enough to completely activate, maintain, and terminate TCP connections on the
TCP/IP network. Their functions are summarized in the following:
OPEN—Either a passive or active open is defined to establish TCP connections.
SEND—Allows a client or server application process to pass a buffer of information to the TCP layer for transmission as a segment.
RECEIVE—Prepares a receive buffer for the use of the client or server application
to receive a segment from the TCP layer.
STATUS—Allows the application to locate information about the status of a TCP
connection.
CLOSE—Requests that the TCP connection be closed.
ABORT—Asks that the TCP connection discard all data in buffers and terminate
the TCP connection immediately.
These commands are invoked on the application’s behalf by the socket interface,
and therefore are not seen by the application programmer. But it is always good to
keep in mind that no matter how complicated a sockets library of routines might seem
to a programmer, at heart the socket interface is a relatively simple procedure.
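For readers who want to connect these abstract service calls to the C calls used later in this chapter,
the rough correspondence looks something like the following sketch (approximate and ours alone; the
exact mapping varies by sockets library):

#include <sys/socket.h>
#include <unistd.h>

/* Rough correspondence between the abstract TCP service calls and the
   Berkeley socket calls; no error handling, for illustration only. */
void tcp_service_call_sketch(int sock, struct sockaddr *peer, socklen_t len)
{
    char buf[128];

    connect(sock, peer, len);         /* OPEN (active); listen() and accept()
                                         provide the passive OPEN             */
    send(sock, "data", 4, 0);         /* SEND: pass a buffer to TCP           */
    recv(sock, buf, sizeof(buf), 0);  /* RECEIVE: supply a receive buffer     */
    getpeername(sock, peer, &len);    /* STATUS-style query on the connection */
    close(sock);                      /* CLOSE: orderly release               */
    /* ABORT: setting the SO_LINGER option to zero before close() discards
       buffered data and resets the connection immediately.                   */
}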
THE SOCKET INTERFACE: GOOD OR BAD?
However, the very simplicity of socket interfaces can be deceptive. The price of this
simplicity is isolating the network program developers from any of the details of how
the TCP/IP network actually operates. In many cases, the application programmers
interpret this “transparency” of the TCP/IP network (“treat it just like a file”) to mean
that the TCP/IP network really does not matter to the application program at all.
As many TCP/IP network administrators have learned the hard way, nothing could
be further from the truth. Every segment, datagram, frame, and byte that an application puts on a TCP/IP network affects the performance of the network for everyone.
Programmers and developers that treat sockets “just like a file” soon find out that the
TCP/IP network is not as fast as the hard drive on their local systems. And many applications have to be rewritten to be more efficient just because of the seductive transparency of the TCP/IP network using the socket interface.
For those who have been in the computer and network business almost from the
start, the socket interface controversy in this regard closely mirrors the controversy that
erupted when COBOL, the first “high-level” programming language, made it possible for
people who knew absolutely nothing about the inner workings of computers to be
trained to write application programs. Before COBOL, programmers wrote in a low-level
assembly language that was translated (assembled) into machine instructions. (Some
geniuses wrote directly in machine code without assemblers, a process known as “bare
metal programming.”)
Proponents then, as with sockets, pointed out the efficiencies to be enjoyed by
freeing programmers from reinventing the wheel with each program and writing
the same low-level routines over and over. There were gains in production as well—
programmers who wrote a single instruction in COBOL could rely on the compiler
to generate about 10 lines of underlying assembly language and machine code. Since
programmers all wrote about the same number of lines of code a day, a 10-fold gain in
productivity could be claimed.
The same claims regarding isolation are often made for the socket interface. Freed
from concerns about packet headers and segments, network programmers can concentrate instead on the real task of the program and benefit from similar productivity
gains. Today, no one seriously considers the socket interface to be an isolation liability, although similar claims of “isolation” are still heard when programmers can
generate code by pointing and clicking at a graphical display in Visual Basic or another
even higher level “language.”
The “Threat” of Raw Sockets
A more serious criticism of the socket interface is that it forms an almost perfect tool
for hackers, especially the raw socket interface. Many network security experts do not
look kindly on the kind of abuses that raw sockets made possible in the hands of
hackers.
Why all the uproar over raw sockets? With the stream (TCP) and datagram (UDP)
socket interfaces, the programmer is limited with regard to which fields in the TCP/UDP
or IP header they can manipulate. After all, the whole goal is to relieve the programmer of addressing and header field concerns. Raw sockets were originally intended as
a protocol research tool only, but they proved so popular among the same circle of
trusted Internet programmers at the time that use became common.
But raw sockets let the programmer pretty much control the entire content of the
packet, from header to flags to options. Want to generate a SYN attack to send a couple
of million TCP segments with the SYN bit set one after the other to the same Web
site, and from a phony IP address? This is difficult to do through the stream socket, but
much easier with a raw socket. Consequently, this is one reason why you can find and
download over the Internet hundreds of examples using TCP and UDP sockets, but raw
socket examples are few and far between. Not only could users generate TCP and UDP
packets, but even “fake” ICMP and traceroute packets were now within reach.
Microsoft unleashed a storm of controversy in 2001 when it announced support
for the “full Unix-style” raw socket interface in Windows XP. Limited support for raw
sockets in Windows had been available for years, and third-party device drivers could
always be added to Windows to support the full raw socket interface, but malicious
users seldom bestirred themselves to modify systems that were already in use. However, if a “tool” was available to these users, it would be exploited sooner or later.
Many saw the previous limited support for raw sockets in Windows as a blessing
in disguise. The TCP/UDP layers formed a kind of “insulation” to protect the Internet
from malicious application programs, a protective layer that was stripped away with
full raw socket support. They also pointed out that the success of Windows NT servers,
and then Windows 95/98/Me, all of which lacked full raw socket support, meant that
no one needed full raw sockets to do what needed doing on the Internet. But once full
raw sockets came to almost everyone’s desktop, these critics claimed, hackers would
have a high-volume, but poorly secured, operating system in the hands of consumers.
Without full raw sockets, Windows PCs could not spoof IP addresses, generate TCP
segment SYN attacks, or create fraudulent TCP connections. When taken over by email-delivered scripts in innocent-looking attachments, these machines could become “zombies” and be used by malicious hackers to launch attacks all over the Internet.
Microsoft pointed out that full raw sockets support was possible in previous editions of Windows, and that “everybody else had them.” Eventually, with the release of
Service Pack 2 for Windows XP, Microsoft restricted the traffic that could be sent over
the raw socket interface (receiving was unaffected) in two major ways: TCP data could
not be sent and the IP source address for UDP data must be a valid IP address. These
changes should do a lot to reduce the vulnerability on Windows XP in this regard.
Also, in traditional Unix-based operating systems, access to raw sockets is a privileged
activity. So, in a sense the issue is not to hamper raw sockets, but to prevent unauthorized access to privileged modes of operation. According to this position, all raw socket
restrictions do is hamper legitimate applications and form an impediment to effectiveness and portability. Restrictions have never prevented a subverted machine from
spoofing traffic before Windows XP or since.
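A quick way to see this privilege requirement in action is a tiny test program (ours, for illustration; it
sends nothing and only tries to create the socket):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* On traditional Unix-like systems, creating a raw socket is a privileged
   operation; run as an ordinary user, the call normally fails with
   "Operation not permitted." */
int main(void)
{
    int s = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
    if (s < 0)
        perror("socket(AF_INET, SOCK_RAW, IPPROTO_ICMP)");
    else
        printf("raw socket created (descriptor %d): running with privileges\n", s);
    return 0;
}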
Socket Libraries
Although there is no standard socket programming interface, there are some socket interfaces that have become very popular for a number of system types. The original socket
interface was developed for the 1982 version of the Berkeley Software Distribution of
Unix (BSD 4.1c). It was designed at the time to be used with a number of network protocol architectures, not just TCP/IP alone. But since TCP/IP was bundled with BSD Unix
versions, sockets and TCP/IP have been closely related. A number of improvements have
been made to the original BSD socket interface since 1982. Some people still call the
socket interfaces “Berkeley sockets” to honor the source of the concept.
In 1986, AT&T, the original developers of Unix, introduced the Transport Layer
Interface (TLI). The TLI interface was bundled with AT&T UNIX System V and also supported other network architectures besides TCP/IP. However, TLI is also almost always
used with the TCP/IP network interface. Today, TLI remains somewhat of a curiosity.
WinSock, as the socket programming interface for Windows is called, is a special
case and deserves a section of its own.
THE WINDOWS SOCKET INTERFACE
One of the most important socket interface implementations today, which is not for
the Unix environment at all, is the Windows Socket interface programming library, or
WinSock. WinSock is a dynamic link library (DLL) function that is linked to a Windows
TCP/IP application program when run. WinSock began with a 16-bit version for
Windows 3.1, and then a 32-bit version was introduced for Windows NT and Windows
95. All Microsoft DLLs have well-defined application program interface (API) calls, and
in WinSock these correspond to the sockets library functions in a Unix environment.
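The most visible difference from the Unix libraries is that the WinSock DLL has to be initialized and
released explicitly. A minimal sketch (ours, not from the chapter's examples) looks something like this:

#include <winsock2.h>    /* WinSock 2 header; link with ws2_32.lib */

/* Unlike Unix sockets, WinSock requires WSAStartup() before any socket
   call and WSACleanup() when the program is finished; sockets are closed
   with closesocket() rather than close(). */
int main(void)
{
    WSADATA wsaData;
    SOCKET s;

    if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0)
        return 1;    /* DLL initialization failed */

    s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (s != INVALID_SOCKET)
        closesocket(s);

    WSACleanup();
    return 0;
}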
It is somewhat surprising, given the popularity of the TCP/IP protocol architecture
for networks and the popularity of the Microsoft Windows operating system for PCs,
that it took so long for TCP/IP and Windows to be used together. For a while, Microsoft
(and the hardcover version of Bill Gates’s book) championed the virtues of multimedia CD-ROMs over the joys of surfing the Internet, but that quickly changed when the
softcover edition of the book appeared and Microsoft got on the Internet bandwagon
(much to the chagrin of Internet companies like Netscape). In fairness to Microsoft,
there were lots of established companies, such as Novell, that failed to foresee the rise
of the Internet and TCP/IP and their importance in networking. There were several
reasons for the late merging of Windows and TCP/IP.
TCP/IP and Windows
First, TCP/IP was always closely associated with the Unix world of academics and
research institutions. As such, Unix (and the TCP/IP that came with it) was valued as an
open standard that was easily and readily available, and in some cases even free. Windows, on the other hand, was a commercial product by Microsoft intended for corporate or private use of PCs. Windows came to be accepted as a proprietary, de facto
standard, easily and readily available, but never for free. Microsoft encouraged developers to write applications for Windows, but until the release of Windows for Workgroups
(WFW) these applications were almost exclusively “stand-alone” products intended to
run complete on a Windows PC. Even with the release of Windows for Workgroups, the
network interface bundled with WFW was not TCP/IP, but NetBIOS, a network interface for LANs jointly owned by IBM and Microsoft.
Second, in spite of Windows multitasking capabilities (the ability to run more
than one process at a time), Windows used a method of multitasking known as “non-preemptive multitasking.” In non-preemptive multitasking, a running process had
to “pause” during execution on its own, rather than the operating system taking
control and forcing the application to pause and give other processes a chance to
execute. Unix, in contrast, was a preemptive multitasking environment. With preemptive multitasking, the Unix operating system keeps track of all running processes, allocating computer and memory resources so that they all run in an efficient
manner. This system is characterized by more work for the operating system, but it is
better for all the applications in the long run. Windows was basically a multitasking
GUI built on top of a single-user operating system (DOS).
Sockets for Windows
The pressure that led to the development of the WinSock interface is simple to
relate. Users wanted to hook their Windows-based PCs into the Internet. The Internet
only understands one network protocol, TCP/IP. So WinSock was developed to satisfy
this user need. At first the WinSock interface was used almost exclusively to Internet-enable Windows PCs. That is, the applications developed in those pre-Web days to use
the WinSock interface were simple client process interfaces to enable Windows users
to Telnet to Internet sites, run FTP client process programs to attach to Internet FTP
servers, and so on. This might sound limited, but before WinSock, Windows users were
limited to dialing into ports that offered asynchronous terminal text interfaces and
performed TCP/IP conversion for Windows users.
There were performance concerns with those early Windows TCP/IP implementations. The basic problem was the performance of multitasked processes in the Microsoft Windows non-preemptive environment. Most TCP/IP processes, client or server,
do not worry about when to run or when to pause, as the Unix operating system
handles that. With Windows applications written for the WinSock DLL, all of the TCP/IP
processes worried about the decision of whether to run or pause, since the Windows operating system could not “suspend” or pause them on its own. This voluntary
giving up of execution time was a characteristic of Windows, but not of most TCP/IP
implementations.
Also, Unix workstations had more horsepower than PC architectures in those early
days, and the Unix operating system has had multitasking capabilities from the start. Originally, Unix required a whole minicomputer’s resources to run effectively. When PCs
came along in the early 1980s, they were just not capable of having enough memory
or being powerful enough to run Unix effectively (a real embarrassment for the makers of AT&T PCs for a while). By the early 1990s, when the Web came along, early Web
sites often relied on RISC processors and more memory than Windows PCs could even
address in those days.
It is worth pointing out that most of these limitations were first addressed with
Windows 95, the process continued with Windows NT, and finally Windows XP and
Vista. Today, no one would hesitate to run an Internet server on a Windows platform,
and many do.
SOCKETS ON LINUX
Any network, large or small, can use sockets. In this section, let’s look at some socket
basics on Linux systems.
We could write socket client and server applications from scratch, but the truth
is that programmers hate to write anything from scratch. Usually, they hunt around
for code that does something pretty close to what they want and modify it for the
occasion (at least for noncommercial purposes). There are plenty of socket examples available on the Internet, so we downloaded some code written by Michael
J. Donahoo and Kenneth L. Calvert. The code, which comes with no copyright and a
“use-at-your-own-risk” warning, is taken from their excellent book, TCP/IP Sockets in
C (Morgan Kaufmann, 2001).
We’ll use TCP because there should be more efficiency derived from a connection-oriented, three-way handshake protocol like TCP than from a simple request–response
protocol like UDP. This application sends a string to the server, where the server
socket program bounces it back. (If no port is provided by the user, the client looks
for well-known port 7, the TCP Echo function port.) First, we’ll list out and compile
my version of the client socket code (TCPsocketClient.c and DieWithError.c) on
lnxclient. (Ordinarily, we would put all this in its own directory.)
[root@lnxclient admin]# cat TCPsocketClient.c
#include <stdio.h>      /* for printf() and fprintf() */
#include <sys/socket.h> /* for socket(), connect(), send(), and recv() */
#include <arpa/inet.h>  /* for sockaddr_in and inet_addr() */
#include <stdlib.h>     /* for atoi() and exit() */
#include <string.h>     /* for memset() */
#include <unistd.h>     /* for close() */

#define RCVBUFSIZE 32   /* Size of receive buffer */

void DieWithError(char *errorMessage);  /* Error handling function */

int main(int argc, char *argv[])
{
    int sock;                        /* Socket descriptor */
    struct sockaddr_in echoServAddr; /* Echo server address */
    unsigned short echoServPort;     /* Echo server port */
    char *servIP;                    /* Server IP address (dotted quad) */
    char *echoString;                /* String to send to echo server */
    char echoBuffer[RCVBUFSIZE];     /* Buffer for echo string */
    unsigned int echoStringLen;      /* Length of string to echo */
    int bytesRcvd, totalBytesRcvd;   /* Bytes read in single recv()
                                        and total bytes read */

    if ((argc < 3) || (argc > 4))    /* Test for correct number of arguments */
    {
        fprintf(stderr, "Usage: %s <Server IP> <Echo Word> [<Echo Port>]\n",
                argv[0]);
        exit(1);
    }

    servIP = argv[1];        /* First arg: server IP address (dotted quad) */
    echoString = argv[2];    /* Second arg: string to echo */

    if (argc == 4)
        echoServPort = atoi(argv[3]);  /* Use given port, if any */
    else
        echoServPort = 7;  /* 7 is the well-known port for the echo service */

    /* Create a reliable, stream socket using TCP */
    if ((sock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0)
        DieWithError("socket() failed");

    /* Construct the server address structure */
    memset(&echoServAddr, 0, sizeof(echoServAddr));     /* Zero out structure */
    echoServAddr.sin_family      = AF_INET;             /* Internet address family */
    echoServAddr.sin_addr.s_addr = inet_addr(servIP);   /* Server IP address */
    echoServAddr.sin_port        = htons(echoServPort); /* Server port */

    /* Establish the connection to the echo server */
    if (connect(sock, (struct sockaddr *) &echoServAddr,
                sizeof(echoServAddr)) < 0)
        DieWithError("connect() failed");

    echoStringLen = strlen(echoString);    /* Determine input length */

    /* Send the string to the server */
    if (send(sock, echoString, echoStringLen, 0) != echoStringLen)
        DieWithError("send() sent a different number of bytes than expected");

    /* Receive the same string back from the server */
    totalBytesRcvd = 0;
    printf("Received: ");    /* Setup to print the echoed string */
    while (totalBytesRcvd < echoStringLen)
    {
        /* Receive up to the buffer size (minus 1 to leave space for
           a null terminator) bytes from the sender */
        if ((bytesRcvd = recv(sock, echoBuffer, RCVBUFSIZE - 1, 0)) <= 0)
            DieWithError("recv() failed or connection closed prematurely");
        totalBytesRcvd += bytesRcvd;    /* Keep tally of total bytes */
        echoBuffer[bytesRcvd] = '\0';   /* Terminate the string! */
        printf(echoBuffer);             /* Print the echo buffer */
    }

    printf("\n");    /* Print a final linefeed */

    close(sock);
    exit(0);
}
[root@lnxclient admin]# cat DieWithError.c
#include <stdio.h> /* for perror() */
#include <stdlib.h> /* for exit() */
void DieWithError(char *errorMessage)
{
perror(errorMessage);
exit(1);
}
[root@lnxclient admin]#
The steps in the program are fairly straightforward. First, we create a stream socket,
and then establish the connection to the server. We send the string to echo, wait for
the response, print it out, clean things up, and terminate. Now we can just compile the
code and get ready to run it.
[root@lnxclient admin]# gcc -o TCPsocketClient TCPsocketClient.c DieWithError.c
[root@lnxclient admin]#
Before we run the program with TCPsocketClient <ServerIPAddress> <StringtoEcho>
<ServerPort>, we need to compile the server portion of the code on lnxserver. The code
in these two files is more complex.
[root@lnxserver admin]# cat TCPsocketServer.c
#include <stdio.h>      /* for printf() and fprintf() */
#include <sys/socket.h> /* for socket(), bind(), and connect() */
#include <arpa/inet.h>  /* for sockaddr_in and inet_ntoa() */
#include <stdlib.h>     /* for atoi() and exit() */
#include <string.h>     /* for memset() */
#include <unistd.h>     /* for close() */

#define MAXPENDING 5    /* Maximum outstanding connection requests */

void DieWithError(char *errorMessage);  /* Error handling function */
void HandleTCPClient(int clntSocket);   /* TCP client handling function */

int main(int argc, char *argv[])
{
    int servSock;                    /* Socket descriptor for server */
    int clntSock;                    /* Socket descriptor for client */
    struct sockaddr_in echoServAddr; /* Local address */
    struct sockaddr_in echoClntAddr; /* Client address */
    unsigned short echoServPort;     /* Server port */
    unsigned int clntLen;            /* Length of client address data structure */

    if (argc != 2)    /* Test for correct number of arguments */
    {
        fprintf(stderr, "Usage: %s <Server Port>\n", argv[0]);
        exit(1);
    }

    echoServPort = atoi(argv[1]);    /* First arg: local port */

    /* Create socket for incoming connections */
    if ((servSock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0)
        DieWithError("socket() failed");

    /* Construct local address structure */
    memset(&echoServAddr, 0, sizeof(echoServAddr));   /* Zero out structure */
    echoServAddr.sin_family = AF_INET;                /* Internet address family */
    echoServAddr.sin_addr.s_addr = htonl(INADDR_ANY); /* Any incoming interface */
    echoServAddr.sin_port = htons(echoServPort);      /* Local port */

    /* Bind to the local address */
    if (bind(servSock, (struct sockaddr *) &echoServAddr,
             sizeof(echoServAddr)) < 0)
        DieWithError("bind() failed");

    /* Mark the socket so it will listen for incoming connections */
    if (listen(servSock, MAXPENDING) < 0)
        DieWithError("listen() failed");

    for (;;) /* Run forever */
    {
        /* Set the size of the in-out parameter */
        clntLen = sizeof(echoClntAddr);

        /* Wait for a client to connect */
        if ((clntSock = accept(servSock, (struct sockaddr *) &echoClntAddr,
                               &clntLen)) < 0)
            DieWithError("accept() failed");

        /* clntSock is connected to a client! */
        printf("Handling client %s\n", inet_ntoa(echoClntAddr.sin_addr));

        HandleTCPClient(clntSock);
    }
    /* NOT REACHED */
}
[root@lnxserver admin]# cat HandleTCPClient.c
#include <stdio.h>      /* for printf() and fprintf() */
#include <sys/socket.h> /* for recv() and send() */
#include <unistd.h>     /* for close() */

#define RCVBUFSIZE 32   /* Size of receive buffer */

void DieWithError(char *errorMessage);  /* Error handling function */

void HandleTCPClient(int clntSocket)
{
    char echoBuffer[RCVBUFSIZE];  /* Buffer for echo string */
    int recvMsgSize;              /* Size of received message */

    /* Receive message from client */
    if ((recvMsgSize = recv(clntSocket, echoBuffer, RCVBUFSIZE, 0)) < 0)
        DieWithError("recv() failed");

    /* Send received string and receive again until end of transmission */
    while (recvMsgSize > 0)    /* zero indicates end of transmission */
    {
        /* Echo message back to client */
        if (send(clntSocket, echoBuffer, recvMsgSize, 0) != recvMsgSize)
            DieWithError("send() failed");

        /* See if there is more data to receive */
        if ((recvMsgSize = recv(clntSocket, echoBuffer, RCVBUFSIZE, 0)) < 0)
            DieWithError("recv() failed");
    }

    close(clntSocket);    /* Close client socket */
}
[root@lnxserver admin]#
The server socket performs a passive open and waits (forever, if need be) for the
client to send a string for it to echo. It’s the HandleTCPClient.c code that does the bulk
of this work. We also need the DieWithError.c code, as before, so we have three files to
compile instead of only two, as on the client side.
[root@lnxserver admin]# gcc -o TCPsocketServer TCPsocketServer.c HandleTCPClient.c DieWithError.c
[root@lnxserver admin]#
Now we can start up the server on lnxserver using the syntax TCPsocketServer <ServerPort>.
(Always check to make sure the port you choose is not in use already!)
[root@lnxserver admin]# ./TCPsocketServer 2005
The server just waits until the client on lnxclient makes a connection and presents a
string for the server to echo. We’ll use the string TEST.
[root@lnxclient admin]# ./TCPsocketClient 10.10.11.66 TEST 2005
Received: TEST
[root@lnxclient admin]#
Not much to that. It’s very fast, and the server tells us that the connection with
lnxclient was made. We can cancel out of the server program.
Handling client 10.10.12.166
^C
[root@lnxserver admin]#
We’ve also used Ethereal to capture any TCP packets at the server while the socket
client and server were running. Figure 12.4 shows what we caught.
So that’s the attraction of sockets, especially for TCP. Ten packets (two ARPs are not
shown) made their way back and forth across the network just to echo “TEST” from
one system to another. Only two of the packets actually do this, as the rest are TCP
connection overhead.
But the real power of sockets is in the details, or lack of details. Not a single line
of C code mentioned creating a TCP or IP packet header, field values, or anything
else. The stream socket interface did it all, so the application programmer can concentrate on the task at hand and not be forced to worry about network details.
FIGURE 12.4
The socket client–server TCP stream captured. This is a completely normal TCP connection
accomplished with a minimum of coded effort.
QUESTIONS FOR READERS
Figure 12.5 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 12.5
A socket in an FTP server and the various types of socket programming interfaces. (The figure
shows an FTP server socket at 172.16.19.10:22 reached across the Internet by FTP Client 1 at
192.168.14.76, port 50001, and by FTP Client 2 at 192.168.243.17, port 50001, along with the
stream, datagram, and raw socket interfaces above TCP, UDP, and the IP layer.)
1. In the figure, two clients have picked the same ephemeral port for their FTP
connection to the server. What is it about the TCP connection that allows this to
happen all the time without harm?
2. What if the user at the same client PC ran two FTP sessions to the same server
process? What would have to be different to make sure that both TCP control
(and data) connections would not have problems?
3. What is the attraction of sockets as a programming tool?
4. Why can’t the same type of socket interface be used for both TCP and UDP?
5. Are fully supported raw sockets an overstated threat to the Internet and attached
hosts?
PART III
Routing and Routing Protocols
Internet service providers (ISPs) use routers and routing protocols to connect
pieces of the Internet together. This part explores IGPs such as RIP, OSPF, and
IS-IS, and also BGP. It includes a look at multicast routing protocols and MPLS, a
method of IP switching.
■ Chapter 13—Routing and Peering
■ Chapter 14—IGPs: RIP, OSPF, and IS-IS
■ Chapter 15—BGP
■ Chapter 16—Multicast
■ Chapter 17—IP Switching and Convergence
CHAPTER 13
Routing and Peering
What You Will Learn
In this chapter, you will learn about how routing differs from switching, the other
network layer technology. We’ll compare connectionless and connection-oriented
networking characteristics and see how quality of service (QoS) can be supported on both.
You will learn what a routing protocol is and what they do. We’ll investigate
the differences between interior and exterior routing protocols as the terms apply
to an ISP. We’ll also talk about routing policies and the role they play on the modern Internet.
In Chapter 9, we introduced the concept of forwarding packets hop by hop across a
network of interconnected routers and LANs. This process is loosely called “routing,”
and that chapter comprised a first look at routing tables (and the associated forwarding tables). In this chapter, we’ll discuss how ISPs manipulate their routing tables with
routing policies to influence the flow of traffic on the Internet. This chapter will focus
more closely on the routing tables on hosts. In Chapters 14 and 15, we discuss in more
detail the routing tables and routing policies on the network routers.
This chapter will look at the routing tables on the hosts on the LANs, as shown in
Figure 13.1. But we’ll also discuss, for the first time, how the two ISPs on the network
(called Ace ISP and Best ISP) relate to each other and how their routing tables ensure
that traffic flows most efficiently between LAN1 and LAN2. For example, it’s obviously
more effective to send LAN1–LAN2 traffic over the link between P4 and P2 instead of
shuttling onto the Internet from P4 and relying on routers beyond the control of either
Best or Ace ISP to route the packets back to P2. (Of course, traffic could flow from P4
to P7, or even end up at P9 to be forwarded to P7, but this is just an example.) But how
do the routers know how P2 and P4 are connected? More importantly, how do the
routers PE5 and PE1 know how the other routers are connected? What keeps router
PE5 from forwarding Internet-bound traffic to P9 instead of P4? And, because P9 is also
connected to P4, why should it be a big deal anyway?
FIGURE 13.1
The hosts on the LANs have routing tables as well as the routers. The ISPs on the Illustrated Network
have chosen to implement an ISP peering arrangement. (The figure repeats the network topology: the
LAN1 hosts and router CE0 in Los Angeles, the LAN2 hosts and router CE6 in New York, and the Ace
ISP and Best ISP backbone routers in AS 65459 and AS 65127, connected to the global public Internet.)
This chapter will begin to answer these questions, and the next two chapters will
complete the investigation. However, it should be mentioned right away that connectionless routers that route (forward) each packet independently through the network
are not the only way ISPs can connect LANs on the Internet. The network nodes can
be connection-oriented switches that forward packets along fixed paths set up through
the network nodes from source to destination.
We’ve already discussed connectionless and connection-oriented services at the
transport layer (UDP and TCP). Let’s see what the differences are between connectionless and connection-oriented services at the network layer.
NETWORK LAYER ROUTING AND SWITCHING
Are the differences between connection-oriented and connectionless networking at
the network layer really that important? Actually, yes. The difference between the way
connectionless router networks handle traffic (and link and node failures) is a major
reason that IP has basically taken over the entire world of networking.
A switch in modern networking is a network node that forwards packets toward a
destination depending on a locally significant connection identifier over a fixed path.
This fixed path is called a virtual circuit and is set up by a signaling protocol (a switched
virtual circuit, or SVC) or by manual configuration (a permanent virtual circuit, or
PVC). A connection is a logical association of two endpoints. Connections only need be
referenced, not identified by “to” and “from” information. A data unit sent on “connection
22” can only flow between the two endpoints where it is established—there is no need
to specify more. (We’ve seen this already at Layer 2 when we looked at the connectionoriented PPP frame.) As long as there is no confusion in the switch, connection identifiers can be reused, and therefore have what is called local significance only.
Packets on SVCs or PVCs are often checked for errors hop by hop and are resent
as necessary from node to node (the originator plays no role in the process). Packet
switching networks offer guaranteed delivery (at least as error-free as possible). The
network is also reliable in the sense that certain performance guarantees in terms of
bandwidth, delay, and so on can be enforced on the connection because packets always
follow the same path through the network. A good example of a switched network is
the public switched telephone network (PSTN). SVCs are normal voice calls and PVCs
are the leased lines used to link data devices, but frame relay and ATM are also switched
network technologies. We’ll talk about public switched network technologies such as
frame relay and ATM in a later chapter.
On the other hand, a router is a network node that independently forwards packets toward a destination based on a globally unique address (in IP, the IP address)
over a dynamic path that can change from packet to packet, but usually is fairly stable
over time. Packets on router networks are seldom checked for errors hop by hop and
are only resent (if necessary) from host to host (the originator plays a key role in
the process). Packet routing networks offer only “best-effort” delivery (but as errorfree as possible). The network is also considered “unreliable” in the sense that certain
performance guarantees in terms of bandwidth, delay, and so on cannot be enforced
from end to end because packets often follow different paths through the network.
A good example of a router-based network is the global, public Internet.
CONNECTION-ORIENTED AND CONNECTIONLESS NETWORKS
Many layers of a protocol stack, especially the lower layers, offer a choice of connectionoriented or connectionless protocols. These choices are often independent. We’ve seen
that connectionless IP can use connection-oriented PPP at Layer 2. But what is it that
makes a network connectionless? Not surprisingly, it’s the implementation of the network
layer. IP, the Internet protocol suite’s network layer protocol, is connectionless, so TCP/IP
networks are connectionless.
Connection-oriented networks are sometimes called switched networks, and connectionless networks are often called router-based networks. The signaling protocol
messages used on switched networks to set up SVCs are themselves routed between
switches in a connectionless manner using globally unique addresses (such as telephone numbers). These call setup messages must be routed, because obviously there
are no connection paths to follow yet. Every switched network that offers SVCs must
also be a connectionless, router-based network as well.
One of the major reasons to build a connectionless network like the Internet was
that it was inherently simpler than connection-oriented networks that must route signaling setup messages and forward traffic on connections. The Internet essentially
handles everything as if it were a signaling protocol message. The differences between
connection-oriented switched networks and connectionless router networks are
shown in Table 13.1.
Table 13.1 Switched and Connectionless Networks Compared by Major Characteristics

Characteristic       Switched Network                 Connectionless Network
Design philosophy    Connection oriented              Connectionless
Addressing unit      Circuit identifiers              Network and host address
Scope of address     Local significance               Globally unique
Network nodes        Switches                         Routers
Bandwidth use        As allowed by “circuit”          Varies with number and size of frames
Traffic processing   Signaling for path setup         Every packet routed independently
Examples             Frame relay, ATM, ISDN, PSTN,    IP, Ethernet, most other LANs
                     most other WANs
Note that every characteristic listed for a connectionless network applies to the
signaling network for a switched network. It would not be wrong to think of the Internet as a signaling network with packets that can carry data instead of connection (call)
setup information. The whole architecture is vastly simplified by using the connectionless network for everything.
The simplified router network, in contrast to the switched network, would automatically route around failed links and nodes. Connection-oriented networks, on the other hand,
lost every connection that was mapped to a particular link or switch. These had to be
re-established through signaling (SVCs) or manual configuration (PVCs), both of which
involved considerable additional traffic loads (SVCs) or delays (PVCs) for all affected
users. One of the original aims of the early “Internet” was explicitly to demonstrate that
packet networks were more robust when faced with failures. Therefore, connectionless
networks could be built more cheaply with relatively “unreliable” components and still be
resistant to failure. Today, “best-effort” and “unreliable” packet delivery over the Internet is
much better than the delivery offered by any connection-oriented public data network not so long ago.
Of course, an Internet router has to maintain a list of every possible reachable destination in the world (and so did signaling nodes in connection-oriented networks),
but processors have kept up with the burden imposed by the growth in the scale of
the routing tables. A switch only has to keep track of local associations of two endpoints (connections) currently established. We’ll talk about multiprotocol label switching (MPLS) in Chapter 17 as an attempt to introduce the efficiencies of switching into
router-based networking. (MPLS does not really relieve the main burdens of interdomain routing, but we will see that MPLS has traffic engineering capabilities that allow
ISPs to shift the paths that carry this burden.)
In only one respect is there even any discussion about the merits of connectionoriented networks versus the connectionless Internet. This is in the area of the ability
of connectionless router networks to deliver quality of service (QoS).
Quality of Service
It might seem odd to talk about QoS in a chapter on connectionless Internet routing
and forwarding. But the point is that in spite of the movement to converge all types
of information (voice and video as well as data) onto the Internet, no functional interdomain QoS mechanism exists. QoS is at heart a queue management mechanism, and
only by applying these strategies across an entire routing domain will QoS result in any
route optimization at all. Even then, no ISP can impose its own QoS methodology on
any other.
One of the biggest challenges in quality of service (QoS) discussions is that there
is no universally accepted agreement on just what network QoS actually means. Some
sources define QoS quite narrowly, and others define it more broadly. For the purposes
of this discussion, a broader definition is more desirable. We’ll use six parameters in
this book.
CoS or QoS?
Should the term for network support of performance parameters be “class of
service” (CoS) or “quality of service” (QoS)? Many people use the terms interchangeably, but in this book QoS is used to mean that parameters can take on
almost any value between maximum and minimum. CoS, on the other hand, establishes groups of parameters based on real world values (e.g., bandwidth at 10, 100,
or 1000 Mbps with associated delays), and is offered as a “class” to customers (e.g.,
bronze, silver, or gold service).
Our working definition of QoS in this book is the “ability of an application to
specify required values of certain parameters to the network, values without which the
application will not be able to function properly.” The network either agrees to provide
these parameters for the application’s data flow, or not. These parameters include things
like minimum bandwidth, maximum delay, and security. It makes no sense to put delay-sensitive voice traffic onto a network that cannot deliver delays less than 2 or 3 seconds
one way (voice suffers at delays far less than full seconds), or to put digital, wide-screen
video onto a network of low-bandwidth, dial-up analog connections.
Table 13.2 shows some typical example values that are used often. In some cases, an
array of values is offered to customers as a CoS.
Table 13.2 The Six QoS Parameters

QoS Parameter                      Example Values (Typical)
Bandwidth (minimum)                1.5 Mbps, 155 Mbps, 1 Gbps
Delay (maximum)                    50-millisecond (ms) round-trip delay, 150-ms delay
Jitter (delay variation)           10% of maximum delay, 5-ms variation
Information loss (error effects)   1 in 10,000 packets undelivered
Security                           All data streams encrypted and authenticated

Bandwidth is usually the first and foremost QoS parameter, for the simple reason that bandwidth was for a long time the only QoS parameter that could be delivered by networks with any degree of consistency. It has also been argued that, given enough bandwidth (just how much is part of the argument), every other QoS parameter becomes irrelevant.
Jitter is just delay variation, or how much the end-to-end network latency varies
from time to time due to effects such as network queuing and link failures, which cause
alternate routes to be used. Information loss is just the effect of network errors. Some
applications can recover from network errors by retransmission and related strategies.
Other applications, most notably voice and video, cannot realistically resend information and must deal with errors in other ways, such as the use of forward error correction codes. Either way, the application must be able to rely on the network to lose only
a limited amount of information, either to minimize resends (data) or to maximize the
quality of the service (voice/video).
Availability and reliability are related. Some interpret reliability as a local network
quality and availability as global quality. In other words, if my local link fails often,
I cannot rely on the network, but global availability to the whole pool of users might
be very good. There is another way that reliability is important in TCP/IP. IP is often
called an unreliable network layer service. This does not imply that the network fails
often, but that, at the IP layer, the network cannot be relied on to deliver any QoS
parameter values at all, not even minimum bandwidth. But keep in mind that a system
built of unreliable components can still be reliable, and QoS is often delivered in just
this fashion.
Security is the last QoS parameter to be added, and some would say that it is the
most important of all.
Many discussions of QoS focus on the first four items on the parameter list. But
reliability and security also belong with the others, for a number of reasons. Security
concerns play a large part in much of IPv6. And reliability can be maximized in IP
routing tables. There are several other areas where security and reliability impact QoS
parameters; the items discussed here are just a few examples.
Service providers seldom allow user applications to pick and choose values from
every QoS category. Instead, many service providers will gather the typical values of
the characteristics for voice, video, and several types of data applications (bulk transfer,
Web access, and so on), and bundle these as a class of service (CoS) appropriate for that
traffic flow. (On the other hand, some sources treat QoS and CoS as synonyms.) Usually,
the elements in a CoS suite that a service provider offers have distinctive names, either
by type (voice, video) or characteristic (“gold” level availability), or even in combination (“silver-level video service”).
The promise of widespread and consistent QoS has been constantly derailed by
the continuing drop in the cost (and availability) of network links of higher and higher
bandwidth. Bandwidth is a well-understood network resource (some would say the
only well-understood network resource), and those who control network budgets
would rather spend a dollar on bandwidth (known effects, low risk, etc.) than on other
QoS schemes such as DiffServ (spotty support, difficult to implement, etc.).
HOST ROUTING TABLES
Now that we’ve shown that the Illustrated Network is firmly based on connectionless
networking concepts, let’s look at the routing tables (not switching tables) on some
of the hosts. Host routing tables can be very short. When initially configured, many of
them have only four types of entries.
Loopback—Usually called lo0 on Unix-based systems (and routers), this is the
prefix 127/8 in IPv4 and ::1 in IPv6. Not only used for testing, the loopback
is a stable interface on a router (or host) that should not change even if the
interface addresses do.
The host itself—There will be one entry for every interface on the host with an IP
address. This is a /32 address in IPv4 and a /128 address in IPv6.
The network—Each host address has a network portion that gets its own routing
table entry.
The default gateway—This tells the host which router to use when the network
portion of the destination IP address does not match the network portion of
the source address.
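To make the default gateway entry concrete, here is a minimal sketch (the program is ours, with
addresses borrowed from the Illustrated Network; it is not part of any host's actual code) of the
decision a host makes with these entries:

#include <stdio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* If the destination shares the host's network prefix, deliver directly;
   otherwise hand the packet to the default gateway. */
int main(void)
{
    struct in_addr local, mask, gateway, dest;

    inet_pton(AF_INET, "10.10.11.66", &local);    /* the host's own address */
    inet_pton(AF_INET, "255.255.255.0", &mask);   /* the /24 network mask   */
    inet_pton(AF_INET, "10.10.11.1", &gateway);   /* the default gateway    */
    inet_pton(AF_INET, "10.10.12.166", &dest);    /* destination to reach   */

    if ((dest.s_addr & mask.s_addr) == (local.s_addr & mask.s_addr))
        printf("destination is on the local network: deliver it directly\n");
    else
        printf("destination is remote: send it to the default gateway %s\n",
               inet_ntoa(gateway));
    return 0;
}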
Gateway or Edge Router?
A lot of texts simply say that the term “router” is the new term for “gateway” on the
Internet, but that this old term still shows up in a number of acronyms (such as
IGP). Other sources use the term “gateway” as a kind of synonym for what we’ve
been calling the customer-edge router, meaning a router with only two types of
routing decisions, that is, local or Internet. A DSL “router” is really just a “gateway”
in this terminology, translating between local LAN protocols and service provider
protocols. On the other hand, a backbone router without customer LANs is definitely a router in any sense of the term.
In this book, we’ll use the terms “gateway” and “router” interchangeably, keeping in mind that the gateway terminology is still used for the entry or egress point
of a particular subnet.
Routing Tables and FreeBSD
FreeBSD systems keep this fundamental information in the /etc/default/rc.conf file.
But this information can be manipulated with the ifconfig command, which we’ve used
already. However, interface information does not automatically jump into the routing
table unless the changes are made to the rc.conf file. (If the network_interfaces variable is kept to the default of auto, the system finds its network interfaces at boot time.)
Let’s use the netstat -nr command to take a closer look at the routing table on
bsdserver.
bsdserver# netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags     Refs     Use  Netif Expire
default            10.10.12.1         UGSc         1      97    em0
10.10.12/24        link#1             UC           2       0    em0
10.10.12.1         00:05:85:8b:bc:db  UHLW         2       0    em0    335
10.10.12.52        00:0e:0c:3b:88:56  UHLW         0       4    em0   1016
127.0.0.1          127.0.0.1          UH           0    6306    lo0

Internet6:
Destination                    Gateway                    Flags  Netif Expire
::1                            ::1                        UH     lo0
fe80::%em0/64                  link#1                     UC     em0
fe80::20e:cff:fe3b:8732%em0    00:0e:0c:3b:87:32          UHL    lo0
fe80::%xl0/64                  link#2                     UC     xl0
fe80::2b0:d0ff:fec5:9073%xl0   00:b0:d0:c5:90:73          UHL    lo0
fe80::%lo0/64                  fe80::1%lo0                Uc     lo0
fe80::1%lo0                    link#4                     UHL    lo0
ff01::/32                      ::1                        U      lo0
ff02::%em0/32                  link#1                     UC     em0
ff02::%xl0/32                  link#2                     UC     xl0
ff02::%lo0/32                  ::1                        UC     lo0
FreeBSD merges the routing and ARP tables, which is why hardware addresses (and
their timeouts) appear in the output. The C and c flags are host routes, and the S is a
static entry.
To manually configure an Ethernet interface and add the route to the routing table,
we use the ifconfig and route commands.
bsdserver# ifconfig em0 inet 10.10.12.77/24
bsdserver# route add -net 10.10.12.77 10.10.12.1
Routing and Forwarding Tables
Remember, the routing tables we’re looking at here are tables of routing information and mainly for human inspection. Generally, everything the system learns
about the network from a routing protocol is put into the routing table. But not all
of the information is used for packet forwarding.
At the software level, the system creates a forwarding table in a much more
compact and machine-useable format.The forwarding table is used to determine
the output, the next-hop interface (if the system is not the destination). However, we’ll use the friendly routing tables to illustrate the routing process, as is
normally done.
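As a rough picture of what that next-hop lookup does (a hypothetical miniature table of ours, not the
format any real system uses), a forwarding decision is just a longest-prefix match:

#include <stdio.h>
#include <stdint.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Three example IPv4 routes and a longest-prefix-match lookup; the
   prefixes and next hops are illustrative only. */
struct fwd_entry {
    const char *prefix;   /* network prefix in dotted form */
    const char *mask;     /* network mask in dotted form   */
    const char *nexthop;  /* where a matching packet goes  */
};

static struct fwd_entry table[] = {
    { "10.10.12.0", "255.255.255.0", "deliver directly on em0"      },
    { "10.10.0.0",  "255.255.0.0",   "next hop 10.10.12.1 on em0"   },
    { "0.0.0.0",    "0.0.0.0",       "default: next hop 10.10.12.1" },
};

int main(void)
{
    struct in_addr dest, prefix, mask;
    const char *best = "no route";
    uint32_t bestmask = 0;
    size_t i;

    inet_pton(AF_INET, "10.10.12.52", &dest);    /* destination to look up */

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        inet_pton(AF_INET, table[i].prefix, &prefix);
        inet_pton(AF_INET, table[i].mask, &mask);
        if ((dest.s_addr & mask.s_addr) == prefix.s_addr &&
            ntohl(mask.s_addr) >= bestmask) {    /* longest match wins */
            best = table[i].nexthop;
            bestmask = ntohl(mask.s_addr);
        }
    }
    printf("10.10.12.52 -> %s\n", best);
    return 0;
}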
Routing Tables and RedHat Linux
RedHat Linux systems keep most network configuration information in the /etc/
sysconfig and /etc/sysconfig/network-scripts directories. The hostname, default gateway, and other information are kept in the /etc/sysconfig/network file. The Ethernet
interface-specific information, such as IP address and network mask for eth0, is in the
/etc/sysconfig/network-scripts/ifcfg-eth0 file (loopback is in ifcfg-lo0).
Let’s look at the lnxclient routing table with the netstat -nr command.
[root@lnxclient admin]# netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.10.12.0      0.0.0.0         255.255.255.0   U         0 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         10.10.12.1      0.0.0.0         UG        0 0          0 eth0
Oddly, the host address isn’t here. This system does not require a route for the
interface address bound to the interface. The loopback entries are slightly different
as well. Only network entries are in the Linux routing table. If we added a second
Ethernet interface (eth1) with IPv4 address 172.16.44.98 and a different default router
(172.16.44.1), we’d add that information with the ifconfig and route commands.
[root@lnxclient admin]# ifconfig eth1 172.16.44.98 netmask 255.255.255.0
[root@lnxclient admin]# route add default gw 172.16.44.1 eth1
We’re not running IPv6 on the Linux systems, so no IPv6 information is displayed.
Routing and Windows XP
Windows XP, of course, handles things a little differently. We’ve already used ipconfig
to assign addresses, and Windows XP uses the route print command to display routing
table information, such as on wincli2.
C:\Documents and Settings\Owner>route print
============================================================================
Interface List
0x1 ........................... MS TCP Loopback interface
0x2 ...00 02 b3 27 fa 8c ...... Intel(R) PRO/100 S Desktop Adapter - Packet
Scheduler Miniport
============================================================================
============================================================================
Active Routes:
Network Destination          Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0       10.10.12.1    10.10.12.222      20
       10.10.12.0    255.255.255.0     10.10.12.222    10.10.12.222      20
     10.10.12.222  255.255.255.255        127.0.0.1       127.0.0.1      20
   10.255.255.255  255.255.255.255     10.10.12.222    10.10.12.222      20
        127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1       1
        224.0.0.0        240.0.0.0     10.10.12.222    10.10.12.222      20
  255.255.255.255  255.255.255.255     10.10.12.222    10.10.12.222      20
Default Gateway:        10.10.12.1
============================================================================
Persistent Routes:
None
The table is an odd mix of loopbacks, multicast, and host and router information.
Persistent routes are static routes that are not purged from the table. We can delete
information, add to it, or change it. If no gateway is provided for a new route, the system
attempts to figure it out on its own.
The IPv6 routing table is not displayed with route print. To see that, we need to
use the ipv6 rt command. The table on wincli2 reveals only a single entry, for the link-local-derived IPv6 address of the default router.
C:\Documents and Settings\Owner>ipv6 rt
::/0 -> 5/fe80::205:85ff:fe8b:bcdb pref 256 life 25m52s <autoconf>
This won’t even let us ping the wincli1 system on LAN1, even though we know which router to send the IPv6 packets to.
C:\Documents and Settings\Owner>ping6 fe80::20c:cff:fe3b:883c
Pinging fe80::20c:cff:fe3b:883c with 32 bytes of data:
No route to destination.
Specify correct scope-id or use -s to specify source address.
No route to destination.
Specify correct scope-id or use -s to specify source address.
No route to destination.
Specify correct scope-id or use -s to specify source address.
No route to destination.
Specify correct scope-id or use -s to specify source address.
Ping statistics for fe80::20c:cff:fe3b:883c:
Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)
What’s wrong? Well, we’re using link-local addresses, for one thing. Also, we have
no way to get the routing information known about LAN2 and router CE6 to LAN1
and router CE0. That’s the job of the Interior Gateway Protocols (IGPs), the type of routing protocol that runs among the routers inside an ISP’s network. Why do we need them? Let’s look at the Internet first, and then we’ll use an IGP in the next chapter so that the IPv6 ping works.
THE INTERNET AND THE AUTONOMOUS SYSTEM
Before taking a more detailed look at the routing protocols that TCP/IP uses to ensure
that every router knows how to forward packets closer to their ultimate destination,
it’s a good idea to have a firm grasp of just what routing protocols are trying to accomplish on the modern Internet. The Internet today is composed of interlocking network
pieces, much like a jigsaw puzzle of global proportions. Each piece is called an autonomous system (AS), and it’s convenient to think of each ISP as an AS, although this is not
strictly true.
Routing Protocols and Routing Policies
A routing protocol is run on a router (and can be run on a host) to allow the router
to dynamically learn about its network neighborhood and pass this knowledge on
until every router has built a consistent view of the network “map” and the least
cost (“best”) place to forward traffic toward any reachable destination. Until the
protocol converges there is always the possibility that some routers do not have
the latest view of the network and might forward packets incorrectly. Actually, it’s
possible that some of the “maps” never converge and that some less-than-optimal
path might be taken. But that need not be a disaster, although the reasons are far
beyond this simple introduction.
A routing policy can be defined as “a rule implemented on the router to determine the handling of routing protocol information.” An example of an ISP’s routing policy rule is to “accept no routing protocol updates from hosts or routers not part of this ISP’s network.” This rule, intended to minimize the effects of malicious users, can be combined with others to create an overall routing policy for the whole ISP.
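As a rough illustration of what such a rule looks like when reduced to code, here is a minimal Python sketch of an import policy that ignores updates from routers outside the ISP’s own address block. The address block, the update format, and the function name are all invented for this example; real routers express this kind of policy in vendor configuration languages, not Python.

# A minimal sketch (not from the book) of a routing policy that accepts
# routing protocol updates only from routers belonging to this ISP.
# The address ranges and update format are invented for illustration.
import ipaddress

# Routers we trust: everything inside this ISP's own infrastructure block.
OUR_ROUTERS = ipaddress.ip_network("10.0.0.0/16")

def accept_update(sender, prefixes):
    """Apply the policy: ignore updates from routers outside our network."""
    if ipaddress.ip_address(sender) not in OUR_ROUTERS:
        return []                    # policy: drop the whole update
    return prefixes                  # policy: accept the advertised routes

print(accept_update("10.0.59.1", ["10.10.11.0/24"]))   # accepted
print(accept_update("192.0.2.77", ["10.10.11.0/24"]))  # rejected by policy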
The term should not be confused with policy routing. Policy routing is usually
defined as the forwarding of packets based not only on destination address, but
also on some other fields in the TCP/IP header, especially the IPv4 ToS bits. Confusingly, policy routing can be made more effective with routing policies, but this
book will not deal with policy routing or QoS issues.
Routing protocols do not and cannot blend all these ASs together into a seamless whole all on their own. Routing protocols allow routers or networks to share adjacency information with their neighbors. They establish the connectivity between routers, within an AS and beyond it, and the ASs in turn establish the global connectivity that characterizes the Internet. Routing policies change the behavior of the routing protocols so that AS connectivity is made into what the ISPs want (usually, ISPs add some phrase like “AS connectivity is made more effective and efficient,” but many times routing policy doesn’t do this, as we’ll see).
Routers are the network nodes of the global public Internet, and they pass IP address
information back and forth as needed. The result is that every router knows how to
reach every IP network (really, the IP prefix) anywhere in the world, or at least those
that advertise that they are willing to accept traffic for that prefix. They also know
when a link or router has failed, and thus other networks might then be (temporarily)
unreachable. Routers can dynamically route around failed links and routers, unless the
destination network is connected to the Internet by only one link or happens to be
right there on the local router.
There are no users on the router itself that originate or read email (as an example),
although routers routinely take on a client or a server role (or both) for configuration
and administrative purposes. Routers almost always just pass IP packet traffic through
from one interface to another, input port to output port, while trying to ensure that the
packets are making progress through the network and moving one step closer to their
destinations. It is said that routers route packets “hop by hop” through the Internet. In a
very real sense, routers don’t care if the packet ever reaches the destination or not: All
the router knows is that if the IP address prefix is X, that packet goes out port Y.
THE INTERNET TODAY
There is really no such thing as the Internet today. The concept of “the Internet” is a
valid one, and people still use the term all the time. But the Internet is no longer a
thing to be charted and understood and controlled and administered. What we have
is an interlocking grid of ISPs, an ISP “grid-net,” so to speak. Actually, the graph of the
Internet is a bit less organized than this, although ISPs closer to the core have a higher
level of interconnection than those at the edge. This is an interconnected mesh of
ISPs and related Internet-connected entities such as government bureaus and learning
institutions. Also, keep in mind that in addition to the “big-I internet,” there are other
internetworks that are not part of this global, public whole.
If we think of the Internet as a unity, and have no appreciation of actual ISP connectivity, then the role of routing protocols and routing policies on the Internet today
cannot be understood. Today, Internet talk is peppered with terms like peers, aggregates, summaries, Internet exchange points (IXPs), backbones, border routers, edge
routers, and points of presence (POPs). These terms don’t make much sense in the
context of the Internet as a unified network.
The Internet as the spaghetti bowl of connected ISPs is shown in Figure 13.2. There
are large national ISPs, smaller regional ISPs, and even tiny local ISPs. There are also
pieces of the Internet that act as exchange points for traffic, such as the Network
Access Points (NAPs) and IXPs. IXPs can be housed in POPs, formal places dedicated for
this purpose, and in various collocation facilities, where the organizations rent floor
space for a rack of equipment (“broom closet”) or larger floor space for more elaborate
arrangements, such as redundant links and power supplies. The IXPs are often run by
former telephone companies.
Each cloud, except the one at the top of the figure, basically represents an ISP’s AS.
Within these clouds, the routing protocol can be an IGP such as OSPF, because it is
presumed that each and every network device (such as the backbone routers) in the
cloud is controlled by the ISP. However, between the clouds, an EGP such as BGP must
be used, because no ISP can or should be able to directly control a router in another
ISP’s network.
The ISPs are all chained together by a complex series of links with only a few hard
and fast rules (although there are exceptions). As long as local rules are followed, as
determined by contract, the smallest ISP can link to another ISP and thus give their
users the ability to participate in the global public Internet. Increasingly, the nature of
the linking between these ISPs is governed by a series of agreements known as peering arrangements. Peers are equals, and national ISPs may be peers to each other, but
FIGURE 13.2
The haphazard way that ISPs are connected on today’s Internet, showing the heavily interconnected public peering points (IXPs, POPs, or collocation facilities) at the top, where the large national ISPs connect, with regional ISPs and small local ISPs below them, linked by high-, medium-, and low-speed links. Customers can be individuals, organizations, or other ISPs.
treat smaller ISPs as just another customer, although it’s not all that unusual for small
regional ISPs to peer with each other.
Peering arrangements detail the reciprocal way that traffic is handed off from one
ISP (and that means AS) to another. Peers might agree to deliver each other’s packets
for no charge, but bill non-peer ISPs for this privilege, because it is assumed that the
national ISP’s backbone will be shuttling a large number of the smaller ISPs’ packets.
But the national ISP won’t be using the small ISP much. A few examples of national
ISPs, peer ISPs, and customer ISPs are shown in the figure. This is just an example, and
very large ISPs often have plenty of very small customers and some of those will be
attached to more than one other ISP and employ high capacity links. There will also be
“stub AS” networks with no downstream customers.
Millions of PCs and Unix systems act as clients, servers, or both on the Internet.
These hosts are attached to LANs (typically) and linked by routers to the Internet. The
LANs and “site routers” are just “customers” to the ISPs. Now, a customer of even
moderate size could have a topology similar to that of an ISP with a distinct border,
core, and aggregation or services routers. Although all attached hosts conform to the
client–server architecture, many of them are strictly Web clients (browsers) or Web
servers (Web sites), but the Web is only one part of the Internet (although probably the
most important one). It is important to realize that the clients and servers are on LANs,
and that routers are the network nodes of the Internet. The number of client hosts
greatly exceeds the number of servers.
The link from the client user to the ISP is often a simple cable or DSL link. In contrast, the link from a server LAN’s router to the ISP could be a leased, private line, but
there are important exceptions to this (Metro Ethernet at speeds greater than 10 Mbps
is very popular). There are also a variety of Web servers within the ISP’s own network.
For example, the Web server for the ISP’s customers to create and maintain their own
Web pages is located inside the ISP cloud.
The smaller ISPs link to the backbones of the larger, national ISPs. Some small ISPs
link directly to national backbones, but others are forced for technical or financial reasons to link in a “daisy-chain” fashion to other ISPs, which link to other ISPs, and so on
until an ISP with direct access to an IXP is reached. Peering bypasses the need to use
the IXP structure to deliver traffic.
Many other countries obtain Internet connectivity by linking to an IXP in the United
States, although many countries have established their own IXPs. Large ISPs routinely
link to more than one IXP for redundancy, while truly small ones rarely link to more
than one other ISP for cost reasons. Peer ISPs often have multiple, redundant links
between their border routers. (Border routers are routers that have links to more than
one AS.) For a good listing of the world’s major IXPs, see http://en.wikipedia.org under
Internet Exchange Point.
Speeds vary greatly in different parts of the Internet. Client access by way of low-speed dial-up telephone lines is typically 33.6 to 56 kbps. Servers are connected by Metro Ethernet or by medium-speed private leased lines, typically 1.5 Mbps. The high-speed backbone links between national ISPs run at yet higher speeds, and between the IXPs themselves, speeds of 155 Mbps (known as OC-3c), 622 Mbps (OC-12c), 2.4 Gbps (OC-48c), and 10 Gbps (OC-192c) can be used, although “n × 10” Gbps Ethernet trunks are less expensive. Higher speeds are always needed, both to minimize large Web site
content-transfer latency times (like video and audio files) and because the backbones
concentrate and aggregate traffic from millions of clients and servers onto a single
network.
THE ROLE OF ROUTING POLICIES
Today, it is impossible for all routers to know all details of the Internet. The Internet
now consists of an increasing number of routing domains. Each routing domain has
its own internal and external routing policies. The sizes of routing domains vary greatly,
from only one IP address space to thousands, and each domain is an AS. Many ISPs
have only one AS, but national or global ISPs might have several AS numbers. A global
ISP might have one AS for North America, another for Europe, and one for the rest of
the world. Each AS has a uniquely assigned AS number, although there can be various,
logical “sub-ASs” called confederations or subconfederations (both terms are used)
inside a single AS.
We will not have a lot to say about routing policies, as this is a vast and complex
topic. But some basics are necessary when the operation of routers on the network is
considered in more detail.
An AS forms a group of IP networks sharing a unified routing policy framework.
A routing policy framework is a series of guidelines (or hard rules) used by the ISP to
formulate the actual routing policies that are configured on the routers. Among different ASs, which are often administered by different ISPs, things are more complex. Careful coordination of routing policies is needed to communicate complicated policies
among ASs.
Why? Because some router somewhere must know all the details of all the IPv4 or
IPv6 addresses used in the routing domain. These routes can be aggregated (or summarized) as shorter and shorter prefixes for advertisement to other routers, but some
routers must retain all the details.
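As a small illustration of aggregation, the Python sketch below summarizes four invented /26 prefixes into a single /24 before “advertising” them. The prefixes are not from the Illustrated Network; the point is only that adjacent, more-specific routes can be replaced by one shorter prefix at an AS boundary while the details stay inside the domain.

# A minimal sketch (not from the book) of route aggregation: four /26 routes
# inside a routing domain are summarized as a single /24 before being
# advertised to another AS. Prefixes are invented for illustration.
import ipaddress

internal_routes = [
    ipaddress.ip_network("172.16.44.0/26"),
    ipaddress.ip_network("172.16.44.64/26"),
    ipaddress.ip_network("172.16.44.128/26"),
    ipaddress.ip_network("172.16.44.192/26"),
]

# collapse_addresses() merges adjacent prefixes into the shortest covering set.
aggregate = list(ipaddress.collapse_addresses(internal_routes))
print(aggregate)   # [IPv4Network('172.16.44.0/24')]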
Routes, or prefixes, not only need to be advertised to another AS, but need to be
accepted. The decision on which routes to advertise and which routes to accept is determined by routing policy. The situation is summarized in the extremely simple exchange
of routing information between two peer ASs shown in Figure 13.3. (Note that the labels
“AS #1” and “AS #2” are not saying “this is AS1” or “this is AS2”—AS numbers are reserved
and assigned centrally.) The routing information is transferred by the routing protocol
running between the routers, usually the Border Gateway Protocol (BGP).
The exchange of routing information is typically bidirectional, but not always. In
some cases, the routing policy might completely suppress or ignore the flow of routing
information in one direction because of the routing policy of the sender (suppress the
advertising of a route or routes) or the receiver (ignore the routing information from
the sender). If routing information is not sent or accepted between ASs, then clients
or servers in one AS cannot reach other hosts on the networks represented by that
routing information in the other AS.
FIGURE 13.3
A simple example of a routing policy, showing how routes are announced (sent) and accepted (received). ISP A (AS 1) announces Net1 and Net2 to its peer and accepts Net3; ISP B (AS 2) announces Net3 to its peer and accepts Net1, but NOT Net2. ISP A and ISP B are peers.
Economic considerations often play a role in routing policies as well. In the old
days, there were always subsidies and grants available for continued support for the
research and educational network. Now the ISP grid-net has ISPs with their own customers, and they can also be customers of other ISPs as well. Who pays whom, and
how much?
PEERING
Telephony faced the same problem and solved it with a concept called settlements.
This is where one telephone company bills the call originator and shares a portion of
the billed amount with other telephone companies as an access charge. Access charges
compensate the other telephone companies, long distance and local, that carry the call
for the loss of the use of their own facilities (which could otherwise make money for
the company directly) for the duration of the call. Now, in the IP world the source and
destination share the cost of delivering packets, but the point is that telephony solved
a similar issue and the terminology has been borrowed by the ISPs, which are often
telephone companies as well.
The issue on the Internet becomes one of how one ISP should compensate another
ISP for delivering packets that originate on the other ISP (if at all). The issue is complicated because the “call” is now a stream of packets, and an ISP might just be a transit ISP
for packets that originate in one ISP’s AS and are destined for a third ISP’s AS.
ISP peers have tried three ways to translate this telephony “settlements” model to
the Internet. First, there are very popular bilateral (between two sides) settlements
based on the “call,” usually defined as some aspect of IP packet flows. In this settlement
arrangement, the first ISP, where the packet originates at a client, gets all of the revenue
from the customer. However, the first ISP shares some of this money with the other ISP
(where the server is located). Second, there is the idea of sender keeps all (SKA), where
the flow of packets from client to server one way is supposedly balanced by the flow
of packets from client to server the other way. So each ISP might as well just keep all
of the revenue from their customers. Finally, there are transit fees, which are just settlements between one ISP and another, usually paid by a smaller ISP to a larger (because
this traffic flow is seldom symmetrical).
Unfortunately, none of these methods have worked out well on the Internet. TCP/IP
is not telephony and routers are not telephone switches. There are often many more
than just two or three ISPs involved between client and server. There is no easy way to
track and account for the packets that should constitute a “call,” and even TCP sessions
leave a lot to be desired because a simple Web page load might involve many rapid
TCP connections between client and server. It is often hard to determine the “origin”
of a packet, and packets do not always follow stable network paths. Packets are
often dropped, and it seems unfair to bill the originating ISP for resent packets replacing
those that were not delivered by the billing ISP in the first place. Finally, dynamic routing might not be symmetric: So-called “hot potato” routing seeks to pass packets off to
another ISP as soon as possible. So the path from client to server often passes through
different ISPs rather than keeping requests and replies all on one ISP’s network. This
common practice has real consequences for QoS enforcement.
These drawbacks of the telephony settlements model resulted in a movement to
simpler arrangements among ISP peers, which usually means ISPs of roughly
equal size. These are often called peering arrangements or just peering. There is no
strict definition of what a peer is or is not, but it often describes two ISPs that are
directly connected and have instituted some routing policies between them. In addition, there is nearly endless variation in settlement arrangements. These are just some
of the broad categories. The key is that any traffic that a small network can offload onto
a peer costs less than traffic that stays on internal transit links.
Economically, there is often also a sender-keeps-all arrangement in place, and
no money changes hands. An ISP that is not a peer is just another customer of the
ISP, and customers pay for services rendered. An interesting and common situation
arises when three peers share a “transit peer” member. This situation is shown in
Figure 13.4. There are typically no financial arrangements for peer ISPs providing
transit services to the third peer, so peer ISPs will not provide transit to a third peer
ISP (unless, of course, the third peer ISP is willing to pay and become a customer of
one of the other ISPs).
FIGURE 13.4
ISPs do not provide free transit services, and generally are either peers or customers of other ISPs. Traffic with sources and destinations in ISP A and ISP B is okay, as is traffic between ISP C and ISP B, but traffic between ISP A and ISP C is blocked: no direct connection exists between ISP A and ISP C, and unless “arrangements” are made, ISP B will routinely block transit traffic between ISP A and ISP C.
All three of these ISPs are “peers” in the sense that they are roughly equal in terms
of network resources. They could all be small or regional or national ISPs. ISP A peers
with ISP B and ISP B peers with ISP C, but ISP A has no peering arrangement (or
direct link) with ISP C. So packet deliveries from hosts in ISP A to ISP B (and back)
are allowed, as are packet deliveries from hosts in ISP C to and from ISP B. But ISP B
has routing policies in place to prevent transit traffic from ISP A to and from ISP C
through ISP B. How would that be of any benefit to ISP B? Unless ISP A and ISP C are
willing to peer with each other, or ISP A or ISP C is willing to become a customer of
ISP B, there will be no routing information sent to ISP A or ISP C to allow these ISPs
to reach each other through ISP B. The routing policies enforced on the routers in
ISP B will make sure of this, telling ISP A (for example) “you can’t get to ISP C’s hosts
through me!”
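The routing-policy logic behind Figure 13.4 can be sketched in a few lines of Python: ISP B re-advertises only its own customer routes to its peers, never routes learned from another peer. The route names, the “customer”/“peer” tags, and the function are invented for this illustration.

# A minimal sketch (not from the book) of why ISP B blocks transit in
# Figure 13.4: routes learned from one peer are not re-advertised to other
# peers, so ISP A never learns a path to ISP C through ISP B.
routes_learned_by_B = {
    "NetB": "customer",   # ISP B's own customer routes
    "NetA": "peer",       # learned from peer ISP A
    "NetC": "peer",       # learned from peer ISP C
}

def export_to_peer(routes):
    """ISP B's export policy: announce customer routes only, never peer routes."""
    return [net for net, source in routes.items() if source == "customer"]

# Both ISP A and ISP C hear only about ISP B's own customers -- no free transit.
print(export_to_peer(routes_learned_by_B))   # ['NetB']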
The real world of the Internet, without a clearly defined hierarchy, complicates
peering drastically. Peering is often a political issue. The politics of peering began
in 1997, when a large ISP informed about 15 other ISPs that its current, easy-going
peering arrangements would be terminated. New agreements for transit traffic were
now required, the ISP said, and the former peers were effectively transformed into
customers. As the trend spread among the larger ISPs, direct connections were favored
over public peering points such as the IXPs.
This is one reason that Ace ISP and Best ISP in Figure 13.1 at the beginning of the
chapter maintain multiple links between the four routers in the “quad” between their
border routers. Suppose for a moment that routers P2 and P4 only have a single, direct
link between them to connect the two ISPs. What would happen if that link were
down? Well, at first glance, the situation doesn’t seem very drastic. Both have links
to “the Internet,” which we know now is just a collection of other ISPs just like Ace
and Best.
Can LAN1 reach LAN2 through “the Internet”? Maybe. It all depends on the arrangements between our two ISPs and the ISPs at the end of the “Internet” links. These ISPs
might not deliver transit traffic between Ace and Best, and may even demand payment
for these packets as “customers” of these other ISPs. The best thing for Ace and Best to
do—if they don’t have multiple backup links in their “quad”—is to make more peers
of other ISPs.
PICKING A PEER
Larger ISPs all want to be peers, ideally peers of the biggest ISPs around. (For many,
buying transit and becoming a customer of some other ISP is a much less expensive
and effective way to get access to the global public Internet if being a transit provider is
not your core business.) When it comes to peering, bigger is better, so a series of mergers and acquisitions (it is often claimed that there are no mergers, only acquisitions)
among the ISPs took place as each ISP sought to become a “bigger peer” than another.
This consolidation decreased the number of huge ISPs and also reduced the number of
potential peers considerably.
Potential partners for peering arrangements are usually closely examined in several
areas. ISPs being considered for potential peering must have high capacity backbones,
be of roughly the same size, cover key areas, have a good network operations center
(NOC), have about the same quality of service (QoS) in terms of delay and dropped
packets, and (most importantly), exchange traffic roughly symmetrically. Nobody wants
their routers, the workhorse of the ISP, to peer with an ISP that supplies 10,000 packets
for every 1000 packets it accepts. Servers, especially Web sites, tend to generate much
more traffic than they consume, so ISPs with “tight” networks with many server farms
or Web hosting sites often have a hard time peering with anyone. On the other hand,
ISPs with many casual, intermittent client users are courted by many peering suitors.
Even if the two ISPs are not quite the same size, if the traffic flows are symmetrical, peering
is still possible. The peering situation is often as shown in Figure 13.5. Keep in mind
that other types of networks (such as cable TV operators and DSL providers) have different peering goals than presented here.
Without peering arrangements in place, ISPs rely on public exchange and peering points like the IXPs for connectivity. The trend is toward more private peering
between pairs of peer ISPs.
Private peering can be accomplished by installing a WAN link between the AS border
routers of the two ISPs. Alternatively, peering can be done at a collocation site where the
two peers’ routers basically sit side by side. Both types of private peering are common.
FIGURE 13.5
Good and bad peering candidates. Note that the goal is to balance the traffic flow as much as possible. Generally, the more servers the ISP maintains, the harder it is to peer. (a) ISP A (medium infrastructure, a mix of clients and servers) will propose peering to ISP B (large infrastructure with many clients): traffic is balanced at 1000 packets per minute in each direction. (b) ISP A will not want to peer with ISP C (many Web servers on lots of server farms) but will take them on as a customer: ISP A sends ISP C 1000 packets per minute but receives 10,000 packets per minute back.
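The symmetry check implied by Figure 13.5 can be sketched in a few lines of Python. The 2:1 threshold below is an invented example (real peering policies publish their own ratio requirements), and the packet counts are simply those shown in the figure.

# A minimal sketch (not from the book) of a traffic-symmetry test for a
# potential peer. The 2:1 maximum ratio is an assumed example value.
def peering_candidate(pkts_out_per_min, pkts_in_per_min, max_ratio=2.0):
    """Accept a peer only if traffic in each direction is roughly balanced."""
    ratio = max(pkts_out_per_min, pkts_in_per_min) / min(pkts_out_per_min, pkts_in_per_min)
    return ratio <= max_ratio

print(peering_candidate(1000, 1000))    # True  -- ISP B in Figure 13.5
print(peering_candidate(1000, 10000))   # False -- ISP C looks like a customer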
The Internet today has more routes than there were computers attached to the
Internet in early 1989. Routing policies are necessary whether the peering relationship
is public or private (through an IXP or through a WAN link between border routers).
Routing information simply cannot be easily distributed everywhere all at once. Even
the routing protocols play a role. Some routing protocols send much more information
than others, although protocols can be “tuned” by adjusting parameters and with routing policies.
Routing policies help interior gateway protocols (IGPs) such as OSPF and IS–IS
distribute routing information within an AS more efficiently. The flow of routing information between routing domains must be controlled by routing policies to enforce the
public or private peering arrangements in place between ISPs.
In the next chapter, we’ll see how an IGP works within an AS or routing domain.
QUESTIONS FOR READERS
Figure 13.6 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 13.6
Even Better ISP, showing peering arrangements and routing domains. Even Better ISP (established when EveNet ISP bought Better ISP) has two ASs, the former EveNet ISP and the former Better ISP, under one unified routing policy and domain. A higher-speed link provides private peering with Ace ISP (large amounts of traffic exchanged), and a lower-speed link provides public peering with Best ISP at an IXP.
1. What is an Internet autonomous system (AS)?
2. Why might a single ISP like Even Better ISP have more than one routing domain?
3. What is the purpose of a routing policy?
4. What does “ISP peering” mean?
5. What is the difference between public and private peering? Are both necessary?
CHAPTER 14 IGPs: RIP, OSPF, and IS–IS
What You Will Learn
In this chapter, you will learn about the role of IGPs and how these routing protocols are used in a routing domain or autonomous system (AS). We’ll use OSPF and
RIP, but mention IS–IS as well.
You will learn how a routing policy can distribute the information gathered
from one routing protocol into another, where it can be used to build routing and
forwarding tables, or announced (sent) to other routers. We’ll create a routing
policy to announce our IPv6 routes to the other routers.
As is true of many chapters in this book, this chapter’s content is more than
enough for a whole book by itself. Only the basics of IGPs are covered here, but
they are enough to illustrate the function of an internal routing protocol on our
network.
In this chapter, we’ll configure an IGP to run on the Juniper Networks routers that
make up the Illustrated Network. In Chapter 9 we saw output that showed OSPF running on router CE6 as part of Best ISP’s AS. So first we’ll show how OSPF was configured on the routers in AS 65127 and AS 65459. We could configure IS–IS on the other
AS, but that would make an already long chapter even longer. Because we closed the
last chapter with IPv6 ping messages not working, let’s configure RIPng, the version of
RIP that is for IPv6. This is not an endorsement of RIPng, especially given other available choices. It’s just an example.
Why not add OSPFv3 (the version of OSPF used with IPv6) for IPv6 support? We
certainly could, but suppose the smaller site routers only supported RIP or RIPng? (RIP
is usually bundled with basic software, but other IGPs often have to be purchased.)
Then we would have no choice but to run RIPng to distribute the IPv6 addresses. If we
configure RIPng to run on the ASs between on-site routers CE0 and CE6, we can always
extend RIPng support right to the Unix hosts (the IPv6 hosts just need to point to CE0
or CE6 as their default routers).
In this chapter, we’ll use the routers heavily, as shown in Figure 14.1.
FIGURE 14.1
The routers on the Illustrated Network, showing routers on which OSPF and RIPng will be running. The IGPs will not be running between the two AS routing domains; instead, an EGP will run. (LAN1 and the Ace ISP routers of AS 65459 appear on the left page of the figure; LAN2 and the Best ISP routers of AS 65127 on the right. Solid rules are SONET/SDH links and dashed rules are Gigabit Ethernet; all links use 10.0.x.y addressing, and only the last two octets are shown.)
Unfortunately, when it comes to networks, a lot of things are interrelated, although
we’d like to learn them sequentially. For example, we’ve already shown in Chapter 9
that OSPF is configured on the routers, although we didn’t configure it. Also, although
both ASs will run the same IGP (RIPng) in this chapter, the ASs are not running RIPng
or any other IGP in between (e.g., on the links between routers P9 and P7). That’s the
job of the EGP, which we’ll explore in the next chapter. There is a lot going on in this
chapter, so let’s list the topics covered here (and in Chapter 15), so we don’t get lost.
1. We’ll talk about ASs and the role of IGP and EGPs on a network.
2. We’ll configure RIPng as the IGP in both ASs, starting with the IPv6 address on the
interfaces and show that the routing information about LAN1 and LAN2 ends up
everywhere. We will not talk about the role of the EGP in all this until Chapter 15.
3. We’ll compare three major IGPs: RIP, OSPF, and IS–IS. In the OSPF section, we’ll
show how OSPF was configured in the two ASs for Chapter 9.
Internal and External Links
In this chapter, we’ll add RIPng as an IGP on all but the links between AS 65459 and AS 65127. This affects routers P9 and P4 in AS 65459 and routers P7 and P2 in AS 65127. IGPs run on internal (intra-AS) links, and EGPs run on external (inter-AS) links.
In Chapter 15, we’ll configure BGP as the EGP on those links. This chapter
assumes that BGP is up and running properly on the external links between P9
and P4 in AS 65459 and P7 and P2 in AS 65127.
We’ll use our Windows XP clients for this exercise, just to show that the “home
version” of XP is completely comfortable with IPv6.
Autonomous System Numbers
Ace and Best ISP on the Illustrated Network use AS numbers (ASNs) in the private
range, just as our IP addresses are. IANA parcels them out to the various registries that
assign them as needed to those who apply. Before 2007, AS numbers were 2-byte
(16-bit) values with the following ranges of relevance:
■ 0: Reserved (can be used to identify nonrouted networks)
■ 1–43007: Allocated by ARIN, APNIC, AfriNIC, and RIPE NCC
■ 43008–48127: Held by IANA
■ 48128–64511: Reserved by IANA
■ 64512–65534: Designated by IANA for private use
■ 65535: Reserved
Since 2007, ASNs are allocated as 4-byte values. Because each field can run from 0 to 65535, the current way of designating ASNs is as two numbers in the form nnnnn.nnnnn. The full range of ASNs now is from 0.0 to 65535.65535 (0 to 4,294,967,295 in decimal).
For example, 0.65535 is how the former 2-byte ASN 65535 would be written today. In this book, we’ll drop the leading “0,” and just use the “legacy” 2-byte AS
format for Ace and Best ISP: 65459 and 65127.
Now, let’s see what it takes to get RIPng up and running on these routers. So far, the
link-local fe80 addresses have been fine for running ping and for neighbor discovery
from router to host, but these won’t be useful for LAN1 to LAN2 communications with
IPv6. For this, we’ll use routable fc00 private ULA IPv6 addresses. Once we get RIPng
up and running with routable addresses on our hosts and routers, we should be able to
successfully ping from LAN1 to LAN2 using only IPv6 addresses. While we’ll be configuring IGPs on both Ace and Best ISP’s AS routing domains, we won’t be running IGPs
between them. That’s the job of the EGP (Border Gateway Protocol, or BGP), and we’ll
add that in Chapter 15.
We need to create four routable IPv6 addresses and prefixes—two for the hosts
and two for the router’s LAN interfaces (both are fe-1/3/0). We’ve already done this in
Chapter 4. The site IPv6 addresses, and the IPv4 and MAC addresses used on the same
interfaces, are shown in Table 14.1. We don’t need to change the link-local addresses on
the link between the routers because, well, they are link-local.
We know from Chapter 13 that we have these IPv6 addresses configured on wincli1
and wincli2. We have to do three things to enable RIPng on the routers:
■ Configure routable addresses on interface fe-1/3/0.
■ Configure the RIPng protocol to run on the site (customer-edge) routers (CE0 and CE6), the provider-edge routers (PE5 and PE1), and the internal links on the provider-backbone routers (P9, P7, P4, and P2).
■ Create and apply a routing policy on CE0 and CE6 to advertise the fe-1/3/0 IPv6 addresses with RIPng.
Table 14.1 Routable IPv6 Addresses Used on the Network

System          IPv4 Network Address   MAC Address         IPv6 Address
wincli1         10.10.11/24            02:0e:0c:3b:88:3c   fc00:ffb3:d5:b:20e:cff:fe3b:883c
CE0 (fe-1/3/0)  10.10.11/24            00:05:85:88:cc:db   fc00:ffb3:d5:b:205:85ff:fe88:ccdb
CE6 (fe-1/3/0)  10.10.12/24            00:05:85:8b:bc:db   fc00:fe67:d4:c:205:85ff:fe8b:bcdb
wincli2         10.10.12/24            00:02:b3:27:fa:8c   fc00:fe67:d4:c:202:b3ff:fe27:fa8c
The configurations are completely symmetrical, so one of each type will do for
illustration purposes. Let’s use router CE0 as the customer-edge router. First, the
addresses for IPv4 (family inet) and IPv6 (family inet6) must be configured on LAN
interface fe-1/3/0.
set interfaces fe-1/3/0 unit 0 family inet address 10.10.11.1/24
set interfaces fe-1/3/0 unit 0 family inet6 address fe80::205:85ff:fe88:ccdb/64
set interfaces fe-1/3/0 unit 0 family inet6 address fc00:fe67:d4:c:205:85ff:fe88:ccdb/64
Note that the link-local address is fine as is. We usually have many addresses on
an interface in most IPv6 implementations, including multicast. We just added the
second address to it. Now we can configure RIPng itself on the link between CE0 and
PE5. We have to explicitly tell RIPng to announce (export) the routing information
specified in the send-ipv6 routing policy (which we’ll write shortly) and tell it the
RIPng “neighbor” (routing protocol partner) is found on interface ge-0/0/3 logical
unit 0.
set protocols ripng group ripv6group export send-ipv6
set protocols ripng group ripv6group neighbor ge-0/0/3.0
Because RIPv2 and RIPng use multicast addresses, we specify the router’s neighbor location with the physical interface (ge-0/0/3) instead of a unicast address. And because the Juniper Networks implementation of RIP always listens for routing information but never advertises or announces routes unless told to, we’ll have to
one interface needed in this case, fe-1/3/0.0 to LAN1. It seems odd to say from when
sending, but in a Juniper Networks routing policy, from really means “out of”—“Out of
all the interfaces, this applies to interface fe-1/3/0.”
set policy-options policy-statement send-ipv6 from interface fe-1/3/0.0
set policy-options policy-statement send-ipv6 from family inet6
set policy-options policy-statement send-ipv6 then accept
All this routing policy says is that “if the routing protocol (which is RIPng) running
on the LAN1 interface (fe-1/3/0) wants to advertise an IPv6 route (from family inet6),
let it (accept).”
We also have to configure RIPng on the other routers. We know that we can’t
run RIPng on the external links on the border routers (P7, P9, P2, and P4), but we
can show the full configurations on PE5 and PE1. These routers have to run RIPng
on three interfaces, not just one, so that RIPng routing information flows from site
router to backbone (and from backbone to site router). Let’s look at PE5 (PE1 is about
the same).
set interfaces fe-1/3/0 unit 0 family inet address 10.10.50.1/24
set interfaces fe-1/3/0 unit 0 family inet6 address fe80::205:85ff:fe85:aafe/64
set interfaces fe-1/3/0 unit 0 family inet6 address fc00:fe67:d4:c:205:85ff:fe85:aafe/64
We have IPv6 addresses on the SONET links to P9 and P4, so-0/0/0 and so-0/0/2,
but the details are not important. What is important is that we run RIPng on all three
interfaces.
set protocols ripng group ripv6group export send-ipv6
set protocols ripng group ripv6group neighbor ge-0/0/3.0
set protocols ripng group ripv6group neighbor so-0/0/0.0
set protocols ripng group ripv6group neighbor so-0/0/2.0
The routing policy now will export the interface IPv6 addresses we want into
RIPng. This policy has one term for each interface and is more complex than the one
for the site routers.
set policy-options policy-statement send-ipv6 term A from interface ge-0/0/3.0
set policy-options policy-statement send-ipv6 term A from family inet6
set policy-options policy-statement send-ipv6 term A then accept
set policy-options policy-statement send-ipv6 term B from interface so-0/0/0.0
set policy-options policy-statement send-ipv6 term B from family inet6
set policy-options policy-statement send-ipv6 term B then accept
set policy-options policy-statement send-ipv6 term C from interface so-0/0/2.0
set policy-options policy-statement send-ipv6 term C from family inet6
set policy-options policy-statement send-ipv6 term C then accept
The policy simply means this: “Out of all interfaces, look at ge-0/0/3, so-0/0/0, and so-0/0/2. If the routing protocol running on those links (which is RIPng) wants to advertise an IPv6 route (from family inet6), let it (accept).”
The backbone routers run RIPng on their internal interfaces, but the configurations
and policies are very similar to those on the provider-edge routers. We don’t need to
list those.
When all the configurations are committed and made active on the routers, we form
an adjacency and exchange IPv6 routing information with each neighbor according to
the policy. The IPv6 routing table on CE0 now shows the prefix of LAN2 (fc00:fe67:
d4:c::/64) learned from CE6 with RIPng.
admin@CE0# show route table inet6 fc00:fe67:d4:c::/64
inet6.0: 38 destinations, 38 routes (38 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
fc00:ffbe:d5:b::/64 *[RIPng/100] 01:15:19, metric 6, tag 0
to fc00:ffbe:d5:b::a00:3b01 via so-0/0/0.0
> to fc00:ffbe:d5:b::a00:2d01 via so-0/0/2.0
What does all this mean? We’ve learned this route with RIPng, and its preference is
100 (high compared to local interfaces, which are 0). When routes are learned in different ways from different protocols, the route with the lowest preference will be the
active route. The metric of 6 (hops) essentially shows that LAN2 is 6 routers away from
LAN1. If there are different paths with different metrics through a collection of routers,
the path with the lowest metric becomes the active route and determines the next hop. More advanced
routing protocols can compute metrics on the basis of much more than simply number
of routers (hops).
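The selection rule can be sketched in a few lines of Python: among candidate routes for the same prefix, the lowest preference wins, and the metric breaks ties within a protocol. The candidate routes below are invented, and the preference values simply follow the defaults mentioned in the text (0 for local interfaces, 100 for RIPng) plus an assumed typical static-route preference of 5.

# A minimal sketch (not from the book) of active-route selection:
# lowest preference first, then lowest metric.
candidate_routes = [
    {"prefix": "fc00:fe67:d4:c::/64", "protocol": "RIPng",  "preference": 100, "metric": 6},
    {"prefix": "fc00:fe67:d4:c::/64", "protocol": "RIPng",  "preference": 100, "metric": 8},
    {"prefix": "fc00:fe67:d4:c::/64", "protocol": "Static", "preference": 5,   "metric": 0},
]

active = min(candidate_routes, key=lambda r: (r["preference"], r["metric"]))
print(active["protocol"])   # Static wins: lowest preference beats any metric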
Note the right angle bracket (>) to the left of the so-0/0/2.0 link to router P9. Remember, there are two ways for PE5 to forward packets to LAN2: through router P4 at the end
of link so-0/0/0.0 and through router P9 at the end of link so-0/0/2.0. The > indicates
that packets are being forwarded to router P9. (Usually, all other things being equal, a
router chooses the link with the lower IP address.) However, the other link is available if
the active link or router fails. (If we want to forward packets out both links, we can turn
on load balancing and the links will be used in a round-robin fashion.)
But even with RIPng up and running among the routers, we still have to give non–
link-local addresses to the hosts. Right now, if we try to use ping6 on LAN2 to ping an
IPv6 address on LAN1, we’ll still get an error condition. Let’s try it from
wincli2 on LAN2 to wincli1 on LAN1.
C:\Documents and Settings\Owner>ping6 fe80::20c:cff:fe3b:883c
Pinging fe80::20c:cff:fe3b:883c with 32 bytes of data:
No route to destination.
Specify correct scope-id or use -s to specify source address.
No route to destination.
Specify correct scope-id or use -s to specify source address.
No route to destination.
Specify correct scope-id or use -s to specify source address.
No route to destination.
Specify correct scope-id or use -s to specify source address.
Ping statistics for fe80::20c:cff:fe3b:883c:
Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)
Like the routers, the Windows XP hosts need routable addresses. We assign a routable IPv6 address to an interface (identified by the index shown by ipconfig) with the ipv6 adu command. The new address then shows up in the ipconfig output.
C:\Documents and Settings\Owner>ipconfig
Ethernet adapter Local Area Connection:
Connection-specific DNS Suffix . :
IP Address . . . . . . . . . . . : 10.10.12.222
Subnet Mask . . . . . . . . . . : 255.255.255.0
IP Address . . . . . . . . . . . : fc00:fe67:d5:c:202:b3ff:fe27:fa8c
IP Address . . . . . . . . . . . : fe80::202:b3ff:fe27:fa8c%5
Default Gateway . . . . . . . . : 10.10.12.1
fe80::5:85ff:fe8b:bcdb%5
fc00:fe67:d5:c:205:85ff:fe8b:bcdb
How did the host know the default gateway to use for IPv6? We probed for neighbors
earlier, but even if we had not, IPv6 router advertisement (which was configured on the routers along with RIPng, and is the main reason we did it) takes care of that.
Now we should be able to ping end to end from wincli2 to wincli1 by IPv6 address.
C:\Documents and Settings\Owner>ping6 fc00:ffb3:d4:b:20e:cff:fe3b:883c
Pinging fc00:ffb3:d4:b:20e:cff:fe3b:883c
from fc00:fe67:d5:c:202:b3ff:fe27:fa8c with 32 bytes of data:
Reply from fc00:ffb3:d4:b:20e:cff:fe3b:883c: bytes=32 time<1ms
Reply from fc00:ffb3:d4:b:20e:cff:fe3b:883c: bytes=32 time<1ms
Reply from fc00:ffb3:d4:b:20e:cff:fe3b:883c: bytes=32 time<1ms
Reply from fc00:ffb3:d4:b:20e:cff:fe3b:883c: bytes=32 time<1ms
Ping statistics for fc00:ffb3:d4:b:20e:cff:fe3b:883c:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 4ms, Maximum = 5ms, Average = 4ms
The reverse also works as well. In the rest of this chapter, let’s take a closer look
at how the IGPs perform their task of distributing routing information within an AS.
Remember, how the IGP routing information gets from AS to AS with an EGP is the
topic of Chapter 15.
INTERIOR ROUTING PROTOCOLS
Routers initially know only about their immediate environments. They know the IP
addresses and prefixes configured on their local interfaces, and at most a little more
statically defined information. Yet all routers must know all the details about everything
in their routing domain to forward packets rationally, hop by hop, toward a given destination. So routers offer to and ask their neighbor routers (adjacent routers one hop
away) about the routing information they know. Little by little, each router then builds
up a detailed routing information database about the TCP/IP network.
How do routers exchange this routing information within a domain and between
routing domains? With routing protocols. Within a routing domain, several different
routing protocols can be used. Between routing domains on the Internet, another routing protocol is used. This chapter focuses on the routing protocols used within a routing domain and the next chapter covers the routing protocol used between routing
domains.
Interior routing protocols, or IGPs, run between the routers inside a single routing
domain, or autonomous system (AS). A large organization or ISP can have a single AS,
but many global networks divide their networks into multiple ASs. IGPs run within
these routing domains and do not share information learned across AS boundaries
except physical interface addresses if necessary.
Modern routing protocols require minimal configuration of static routes (routes
configured and maintained by hand). Today, dynamic routing protocols allow adjacent
(directly connected) routers to exchange routing table information periodically to build
up the topology of the router network as a whole by passing information received by
adjacent neighbors on to other routers.
IGPs essentially “bootstrap” themselves into existence, and then send information
about their IP addresses and interfaces to other routers directly attached to the source
router. These neighbor, or adjacent, routers distribute this information to their neighbors until the network has converged and all routers have the identical information
available.
When changes in the network as a result of failed links or routers cause the routing tables to become outdated, the routing tables differ from router to router and are
inconsistent. This is when routing loops and black holes happen. The faster a routing
protocol converges, the better the routing protocol is for large-scale deployment.
THE THREE MAJOR IGPs
There are three main IGPs for IPv4 routing: RIP, OSPF, and IS–IS. The Routing Information Protocol (RIP), often declared obsolete, is still used and remains a popular routing
protocol for small networks. The newer version of RIP, known as RIPv2, should always
be used for IPv4 routing today. Open Shortest Path First (OSPF) and Intermediate System–Intermediate System (IS–IS) are similar to each other and much more robust than RIP. All three are available for IPv6: OSPFv3 is the IPv6 version of OSPF, RIPng (sometimes seen as RIPv6) is the IPv6 version of RIP, and IS–IS works with either IPv4 or IPv6 today.
RIP is a distance-vector routing protocol, and OSPF and IS–IS are link-state routing
protocols. Distance-vector routing protocols are simple and make routing decisions
based on one thing: How many routers (hops) are there between here and the destination? To RIP, link speeds do not matter, nor does congestion near another router. To RIP,
the “best” route always has the fewest number of hops (routers).
Link-state protocols care more about the network than simply the number
of routers along the path to the destination. They are much more complex than
distance-vector routing protocols, and link-state protocols are much more suited
for networks with many different link speeds, which is almost always the case
today. However, link-state protocols require an elaborate database of information
about the network on each router. This database includes not only the local router
addressing and interfaces, but each and every router in the immediate area and
often the entire AS.
ROUTING INFORMATION PROTOCOL
The RIP is still used on all types of TCP/IP networks. The basics of RIP were spelled out
in RFC 1058 from 1988, but this is misleading. RIP was in use long before 1988, but no
one bothered to document RIP in detail. RIP is bundled with almost all implementations of TCP/IP, so networks often run only RIP. Why pay for something when RIP was
available for free?
RIP version 1 (RIPv1) in RFC 1058 has a number of annoying limitations, but RIP
is so popular that doing away with RIP is not a realistic consideration. RFC 1388 introduced RIP version 2 (RIPv2 or sometimes RIP-2) in 1993. RIPv2 addressed RIPv1 limitations, but could not turn a distance-vector protocol into a link-state routing protocol
such as OSPF and IS–IS.
RIPv2 is backward compatible with RIPv1, and most RIP implementations run RIPv2 by default and allow RIPv1 to be configured. In this chapter, the term “RIP” by itself means a version of RIP that runs RIPv2 by default but can also be configured as RIPv1 as required.
Router vendor Cisco was deeply dissatisfied with RIPv1 limitations and so created
its own vendor-specific (proprietary) version of an IGP routing protocol, which Cisco
called the Interior Gateway Routing Protocol (IGRP). IGRP improved upon RIPv1 in
several ways, but “pure” IGRP could only run between Cisco routers. As good as IGRP
was, IGRP was still basically implemented as a distance-vector protocol. As networks
grew more and more complex in terms of link speeds and router capacities, it was possible to switch to a link-state protocol such as OSPF or IS–IS, but many network administrators at the time felt these new protocols were not stable or mature enough for
production networks. Cisco then invented Enhanced IGRP (EIGRP) as a sort of “hybrid”
routing protocol that combined features of both distance-vector and link-state routing
protocols all in one (proprietary) package.
Due to the proprietary nature of IGRP and EIGRP, only the basics of these routing
protocols are covered in this chapter.
Distance-Vector Routing
RIP and related distance-vector routing protocols are classified as “Bellman–Ford” routing
protocols because they all choose the “best” path to a destination based on the shortest
path computation algorithm. It was first described by R. E. Bellman in 1957 and applied
to a distributed network of independent routers by L. R. Ford, Jr. and D. R. Fulkerson in
1962. Every version of Unix today bundles RIP with TCP/IP, usually as the routed (“route
management daemon”) process, but sometimes as the gated process.
All routing protocols use a metric (measure) representing the relative “cost” of sending a packet from the current router to the destination. The lowest relative cost is the
“best” way to send a packet. Distance-vector routing protocols have only one metric:
distance. The distance is usually expressed in terms of the number of routers between
the router with the packet and the router attached to the destination network. The
Table 14.2 Example RIP Routing Table

Network        Next Hop Interface   Cost
10.0.14.0      Ethernet 1 (E1)      2
172.16.15.0    Serial 1 (S1)        1
192.168.44.0   Ethernet 2 (E2)      3
192.168.66.0   Serial 2 (S2)        INF (15)
192.168.78.0   Locally attached     0
distance metric is carried between routers running the same distance-vector routing
protocol as a vector, a field in a routing protocol update packet.
A simple example of how distance-vector, or hop-count, routing works will illustrate
many of the principles that all routing protocols, simple and complex, must deal with. All
routing protocols must pass along network information received from adjacent routers to all other routers in a routing domain, a concept known as flooding. Flooding is
the easiest way to ensure consistency of routing tables, but convergence time might
be high as routers at one end of a chain of routers wait for information from routers at
the far end of the chain to make its way through the routers in between. Flooding also
tends to maximize the bandwidth consumed by the routing protocol itself, but there
are ways to reduce this.
RIP floods updates every 30 seconds. Note that routing information can take up to 30
seconds to reach the closest neighbor if that is the routing update interval used. Long
chains of routers can take quite a long time to converge (several minutes) when a network address is added or when a link fails.
When this network converges, each routing table will be consistent and each router
will be reachable from every other router over one of the interfaces. The network
topology has been “discovered” by the routing protocol. An example of the information
in one of these tables is shown in Table 14.2.
Routers can have alternatives other than those shown in the table. For example, the
cost to reach network 192.168.44.0 from this router could be the same (3) over E1 as
it is over E2. The E1 interface is most likely in the table because the update from the
neighbor router saying “send 192.168.44.0 packets here” arrived before the update
from another router saying the same thing, or the entry was already in the table. When
costs are equal, routing tables tend to keep what they know.
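To make the hop-count bookkeeping concrete, here is a minimal sketch (in Python, and not any router's actual implementation) of how a distance-vector router might fold a neighbor's advertised vector into its own table. The table layout, the value 16 as "infinity," and the keep-what-you-know tie behavior follow the description above; the function and variable names are made up for the illustration.

INFINITY = 16  # RIP's "unreachable" metric

def merge_vector(table, neighbor, cost_to_neighbor, advertised):
    # table: destination network -> (next_hop, cost)
    # advertised: destination network -> cost as seen by that neighbor
    for network, cost in advertised.items():
        new_cost = min(cost + cost_to_neighbor, INFINITY)
        current = table.get(network)
        # Install the route if it is new, strictly cheaper, or if it came from
        # the next hop already in use (that neighbor's latest word replaces ours).
        if current is None or new_cost < current[1] or current[0] == neighbor:
            table[network] = (neighbor, new_cost)
    return table

# A router one hop from neighbor S1, which advertises two networks.
table = {"192.168.78.0": ("local", 0)}
merge_vector(table, "S1", 1, {"172.16.15.0": 0, "10.0.14.0": 1})
print(table)  # 172.16.15.0 at cost 1 and 10.0.14.0 at cost 2, both via S1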
Broken Links
The distance-vector information has now been exchanged and the routers all have a
way to reach each other. Usually, the routing protocol will update an internal database
in the router just for that routing protocol and one or more entries based on the database are made in the routing table, which might contain information from other routing protocols as well. The routing table information is then used to compute the “best”
routes to be used in the forwarding table (sometimes called the switching table) of the
router. This chapter blurs the distinctions between routing protocol database, routing
table, and forwarding table for the sake of simplicity and clarity.
What will happen to the network if a link “breaks” and can no longer be used to
forward traffic? In a static routing world, this would be disastrous. But when using a
dynamic routing protocol, even one as simple as a distance-vector routing protocol, the
network should be able to converge around the new topology.
The routers at each end of the link, since they are locally connected to the interface
(direct), will notice the outage first because routers constantly monitor the state of
their interfaces at the physical level. Distance-vector protocols note this absent link by
noting that the link now has an “infinite” cost. All routers formerly reachable through
the link are now an infinite distance away.
Distance-Vector Consequences
In some cases, distance-vector updates are generated so closely in time by different
routers that a link failure can cause a routing loop to occur, and packets can easily
“bounce” back and forth between two adjacent routers until the packet TTL expires,
even though the destination is reachable over another link. The “bouncing effect” will
last until the network converges on the new topology.
However, this convergence can take some time, since routers not located at the end
of a failed link have to gradually increase their costs to infinity one “hop” at a time. This
is called “counting to infinity,” and can drag out convergence time considerably if the
value of “infinity” is set high enough. On the other hand, a low value of “infinity” will
limit the maximum number of routers that can form the longest path through the network from source to destination.
In order to minimize the effects of bouncing and counting to infinity, most implementations of distance-vector routing protocols such as RIP also implement split horizon and triggered updates.
Split Horizon
If Router A is sending packets to Router B to reach Router E, then it makes no sense at
all for Router B to try to reach Router E through Router A. All Router A will do is turn
around and send the packet right back to Router B. So Router A should never advertise
a way to reach Router E to Router B.
A more sophisticated form of split horizon is known as split horizon with poison
reverse. Split horizon with poison reverse eliminates a lot of counting to infinity problems due to single link failures. However, many multiple link failures will still cause
routing loops and counting to infinity problems even when split horizon with poison
reverse is in use.
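As an illustration only (reusing the toy table layout from the sketch above), split horizon with poison reverse amounts to filtering the update built for each neighbor:

INFINITY = 16

def build_update(table, for_neighbor, poison_reverse=True):
    # table: destination network -> (next_hop, cost)
    # Routes learned from this neighbor are either left out entirely (plain
    # split horizon) or advertised back with an infinite metric (poison reverse).
    update = {}
    for network, (next_hop, cost) in table.items():
        if next_hop == for_neighbor:
            if poison_reverse:
                update[network] = INFINITY  # "you cannot get there through me"
        else:
            update[network] = cost
    return update

table = {"10.0.14.0": ("E1", 2), "172.16.15.0": ("S1", 1)}
print(build_update(table, "S1"))  # 10.0.14.0 advertised at 2, 172.16.15.0 poisoned to 16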
Triggered Updates
With triggered updates, a router running a distance-vector protocol such as RIP can
remain silent if there are no changes to the information in the routing table. If a link
failure is detected, triggered updates will send the new information. Triggered updates,
like split horizon, will not eliminate all cases of routing loops and counting to infinity.
However, triggered updates always help the counting process to reach infinity much
faster.
RIPv1
A RIP packet must be 512 bytes or smaller, including the header. RIP packets have no
implied sequence, and each update packet is processed independently by the router
receiving the update. A router is only required to keep one entry associated with each
route. But in practice, routers might keep up to four or more routes (next hops) to the
same destination so that convergence time is lowered.
RIPv1 required routers running RIP to broadcast the entire contents of their routing tables at fixed intervals. On LANs, this meant that the RIPv1 packets were sent
inside broadcast MAC frames. But broadcast MAC frames tell not only every router on
the LAN, but every host on the LAN, “pay attention to this frame.” Inside the frame, the
host would find a RIPv1 update packet, and probably ignore the contents. But every
30 seconds, every host on the LAN had to interrupt its own application processing and
start throwing away RIPv1 packets.
Each host could keep the information inside the RIPv1 update packet. Some hosts
on LANs with RIPv1 routers have as elaborate a routing table as the routers themselves.
Hackers loved RIPv1: With a few simple coding changes, any host could impersonate
a RIPv1 router and start pumping out fake routing information, as many college and
university network administrators discovered in the late 1980s. (This is one reason you
don’t run RIP on host interfaces.)
Many people see RIP update intervals vary from 30 seconds and assume that the timers are off. In fact, table updates in RIP are initiated on each router at approximately 30-second
intervals. Strict synchronization is avoided because RIP traffic spikes can easily lead to
discarded RIP packets. The update timer usually adds or subtracts a small amount of
time to the 30-second interval to avoid RIP router synchronization.
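A sketch of that timer logic might look like the following; the ±5-second spread here is only a commonly used illustration, not a value taken from this chapter.

import random

def next_update_delay(base=30.0, jitter=5.0):
    # Offset the nominal 30-second interval by a small random amount so that
    # routers sharing a LAN do not all flood their updates at the same instant.
    return base + random.uniform(-jitter, jitter)

print(next_update_delay())  # somewhere between 25 and 35 seconds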
Network devices running RIP can operate in either active or passive (silent) mode. Active
RIP devices will listen for RIP update packets and also generate their own RIP update
packets. Passive RIP devices will only listen for RIP updates and never generate their
own update packets. Many hosts, for example, which must process the broadcast RIP
updates sent on a LAN, are purely passive RIP devices.
RIPv1 Limitations
RIPv1 has a number of limitations that make it difficult to use in large networks. The larger the routing domain, the more severe and annoying these limitations become.
Wasted Space—All of the RIPv1 packet fields are larger than they need to be,
sometimes many times larger. There are almost three times as many 0 bits as
information bits in a RIP packet.
Limited Metrics—As a network grows, the distance-vector might require a metric
greater than 15, which is unreachable (infinite).
No Link Speed Allowances—The simple hop count metric will always result in
packets being sent (as an example) over two hops using low-speed, 64-kbps
links rather than three hops using SONET/SDH links.
No Authentication—RIPv1 devices will accept RIPv1 updates from any other
device. Hackers love RIPv1 for this very reason, but even an innocently misconfigured router can disrupt an entire network using RIPv1.
Subnet Masks—RIPv1 requires the use of the same subnet mask throughout the network because RIPv1 updates do not carry any subnet mask information.
Slow Convergence—Convergence can be very slow with RIPv1, often 5 minutes
or more when links result in long chains of routers instead of neat meshes. And
“circles” of RIPv1 routers maximize the risk of counting to infinity.
RIPv2
RIPv2 first emerged as an update to RIPv1 in RFC 1388 issued in January 1993. This
initial RFC was superseded by RFC 1723 in November 1994. The only real difference
between RFC 1388 and RFC 1723 is that RFC 1723 deleted a 2-byte Domain field
from the RIPv2 packet format, designating this space as unused. No one was really
sure how to use the Domain field anyway. The current RIPv2 RFC is RFC 2453 from
November 1998.
RIPv2 was not intended as a replacement for RIPv1, but to extend the functions of
RIPv1 and make RIP more suitable for VLSM. The RIP message format was changed as
well to allow for authentication and multicasting.
In spite of the changes, RIPv2 is still RIP and suffers from many of the same limitations as RIPv1. Most router vendors support RIPv2 by default, but allow interfaces or
whole routers to be configured for backward compatibility with RIPv1. RIPv2 made
major improvements to RIPv1:
■ Authentication between RIP routers
■ Subnet masks to be sent along with routes
■ Next hop IP addresses to be sent along with routes
■ Multicasting of RIPv2 messages
The RIPv2 packet format is shown in Figure 14.2.
Command Field (1 byte)—This is the same as in RIPv1: A value of 1 is for a
Request and a value of 2 is for a Response.
Version Number (1 byte)—RIPv1 uses a value of 1 in this field, and RIPv2 uses a
value of 2.
FIGURE 14.2 RIPv2 packet format, showing how the subnet mask is included with the routing information advertised. (The packet is 32 bits wide: a 1-byte Command, a 1-byte Version, and 2 unused bytes set to all zeros, followed by route entries that repeat up to a maximum of 25 times, each carrying an Address Family Identifier, an Authentication or Route Tag, an IP Address, a Subnet Mask, a Next Hop, and a Metric.)
Unused (2 bytes)—Set to all zero bits. This was the Domain field in RFC 1388.
Now officially unused in RFC 1723, this field is ignored by routers running
RIPv2 (but this field must be set to all 0 bits for RIPv1 routers).
Address Family Identifier (AFI) (2 bytes)—This field is set to a value of 2 when
IP packet and routing information is exchanged. RIPv2 also defined a value of
1 to ask the receiver to send a copy of its entire routing table. When set to all
1s (0xFFFF), the AFI field is used to indicate that the 16 bits following the AFI
field, ordinarily set to 0 bits, now carry information about the type of authentication being used by RIPv2 routers.
Authentication or Route Tag (2 bytes)—When the AFI field is not 0xFFFF, this
is the Route Tag field. The Route Tag field identifies internal and external
routes in RIPv2. Internal routes are those learned by RIP itself, either locally
or through other RIP routers. External routes are routes learned from another
routing protocol such as OSPF or BGP.
IPv4 Address (4 bytes)—This field and the three that follow can be repeated up
to 25 times in the RIPv2 Response packet. This field is almost the same as in
RIPv1. This address can be a host route, a network address, or a default route.
A RIPv2 Request packet has the IP address of the originator in this field.
Subnet Mask (4 bytes)—This field, the biggest change in RIPv2, contains the subnet mask that goes with the IP address in the previous field. If the network
address does not use a subnet mask different from the natural classful major
network mask, then this field can be set to all zeroes, just as in RIPv1.
Next Hop (4 bytes)—This field contains the next hop IP address that traffic to this
IP address space should use. This was a vast improvement over the “implied”
next hop used in RIPv1.
Metric (4 bytes)—Unfortunately, the metric field is unchanged. The range is still 1
to 15, and a metric value of 16 is considered unreachable.
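One way to see how these fields fit into a 20-byte route entry is to pack one with Python's struct module. This is only an illustration of the layout just described (AFI = 2 for IP, everything in network byte order), not a working RIP speaker; the prefix and metric values are arbitrary.

import socket
import struct

def ripv2_route_entry(prefix, mask, next_hop, metric, route_tag=0, afi=2):
    # AFI, Route Tag, IP Address, Subnet Mask, Next Hop, Metric: 20 bytes total.
    return struct.pack(
        ">HH4s4s4sI",
        afi,
        route_tag,
        socket.inet_aton(prefix),
        socket.inet_aton(mask),
        socket.inet_aton(next_hop),
        metric,
    )

entry = ripv2_route_entry("10.10.11.0", "255.255.255.0", "0.0.0.0", 2)
print(len(entry), entry.hex())  # 20 bytes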
RIPv2 is still RIP. But RIPv2’s additions for authentication, subnet masks, next
hops, and the ability to multicast routing information increase the sophistication of RIP
and have extended RIP’s usefulness.
Authentication
Authentication was added in RIPv2. The Response messages contain the routing
update information, and authenticating the responder to a Request message is a good
way to minimize the risk of a routing table becoming corrupted either by accident or
through hacker activities. However, there were really only 16 bits available for authentication, hardly adequate for modern authentication techniques. So the authentication
actually takes the place of one routing table entry and authenticates the entire update
message. This gives 16 bytes (128 bits) for authentication, which is not state of the art,
but is better than nothing.
The really nice feature of RIPv2 authentication is that router vendors can add their
own Authentication Type values and schemes to the basics of RIPv2, and many do. For
example, Cisco and Juniper Networks routers can be configured to apply MD5 (Message Digest 5) authentication to RIPv2 messages. Thus, most routers can have
three forms of authentication on RIP interfaces: none, simple password, or MD5. Naturally, the MD5 authentication keys used must match up on the routers.
Subnet Masks
The biggest improvement from RIPv1 to RIPv2 was the ability to carry the subnet mask
along with the route itself. This allowed RIP to be used in classless IP environments
with VLSM.
Next Hop Identification
Consider a network where there are several site routers with only one or a few small
LANs. The small routers run RIPv2 between themselves and their ISP’s router, but might
run a higher speed link to one router and a lower speed link to another. The higher
speed link might be more hops away than the lower speed link.
The next hop field in RIPv2 is used to “override” the ordinary metric method of
deciding active routes in RIP. RIPv2 routers check the next hop field in the routing
update message. If the next hop field is set for a particular route, the RIP router will use
this as the next hop for the route, regardless of distance-vector considerations.
This RIPv2 next hop mechanism is called source routing in some documents. But true source routing information is always set by a host, not a router. This is
just RIPv2 next hop identification.
Multicasting
Multicasting is a kind of “halfway” distribution method between unicast (one source
to one destination) and broadcast (one source to all possible destinations). Unlike
broadcasts that are received by all nodes on the subnet, only devices that join the
RIPv2 multicast group will receive packets for RIPv2. (We’ll talk more about multicast in Chapter 16.) RIPv2 multicasting also offers a way to filter out RIPv2 messages
from a RIPv1 only router. This can be important, since RIPv2 messages look very much
like RIPv1 messages. But RIPv2 messages are all invalid by RIPv1 standards. RIPv1
devices would either discard RIPv2 messages because the mandatory all-zero fields are
not all zeroes, or accept the routes and ignore the additional RIPv2 information such
as the subnet mask. RIPv2 multicasting makes sure that only RIPv2 devices see the
RIPv2 information. So RIPv1 and RIPv2 routers can easily coexist on the same LAN, for
instance. The multicast group used for RIPv2 routers is 224.0.0.9.
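At the socket level, "joining the RIPv2 multicast group" looks roughly like the sketch below: a UDP socket bound to the RIP port (UDP 520) subscribes to 224.0.0.9 and simply waits. It does not parse or validate what it receives, and binding a port below 1024 usually requires administrator rights.

import socket
import struct

RIP_GROUP = "224.0.0.9"
RIP_PORT = 520  # RIP runs over UDP port 520

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", RIP_PORT))

# Ask the kernel to join the RIPv2 group on the default interface.
mreq = struct.pack("4s4s", socket.inet_aton(RIP_GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

data, sender = sock.recvfrom(512)  # RIP packets are at most 512 bytes
print(len(data), "bytes of RIP from", sender[0])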
RIPv2 is still limited in several ways. The maximum hop count of 15 is still there, as
well as counting to infinity to resolve routing loops. And RIPv2 does nothing to improve
on the fixed distance-vector values that are a feature of all versions of RIP.
RIPng for IPv6
The version of RIP used with IPv6 is called RIPng, where “ng” stands for “next generation.” (IPv6 itself was often called IPng in the mid-1990s.) RIPng uses exactly the same hop count metric as RIP as well as the same logic and timers. So RIPng is still a distance-vector RIP, with two important differences.
1. The packet formats have been extended to carry the longer IPv6 addresses.
2. IPv6 security mechanisms are used instead of RIPv2 authentication.
The overall format of the RIP packet is the same as the format of the RIPv2 packet
(but RIPng cannot be used by IPv4). There is a 32-bit header followed by a set of 20-byte
route entries. The header fields must be the same as those used in RIPv2: There is a
1-byte Command code field, followed by a 1-byte Version field (now 6), and then 2 unused
bytes of bits that must still be set to all 0 bits. However, the 20-byte route entry fields in RIPng are totally different from those in RIPv2.
IPv6 addresses are 16 bytes long, leaving only 4 bytes for any other information that
must be associated with the IPv6 route. First, there is a 2-byte Route Tag field with the
same use as in RIPv2: The Route Tag field identifies internal and external routes. Internal routes are those learned by RIP itself, either locally or through other RIP routers.
FIGURE 14.3 RIPng for IPv6 packet fields. Note the large address fields and different format than RIPv2 fields. (The packet is 32 bits wide: a 1-byte Command, a 1-byte Version, and 2 unused bytes set to all zeros, followed by route entries that repeat up to a maximum of 25 times, each carrying a 16-byte IPv6 Address, a Route Tag, a Prefix Length, and a Metric.)
External routes are routes learned from another routing protocol such as OSPF or
BGP. Then there is a 1-byte Prefix Length field that tells the receiver where the boundary between network and host is in the IPv6 address. Finally, there is a 1-byte Metric
field (this field was a full 32 bits in RIPv1 and RIPv2). Since infinity is still 16 in RIPng,
this is not a problem.
The fields of the RIPng packet are shown in Figure 14.3. The combination of IPv6 address and Prefix Length does away with the need for the Subnet Mask field in RIPv2 packets. The Address Family Identifier (AFI) field from RIPv2 is not needed in RIPng,
since only IPv6 routing information can be carried in RIPng.
But IPv6 still needs a Next Hop field. This RIPv2 field contained the next-hop IP
address that traffic to this IP address space should use, and was a vast improvement
over the “implied” next hop used in RIPv1. Now, IPv6 does not always need this Next
Hop information, but in many cases the next hop should be included in an IPv6 routing
information update. An IPv6 Next Hop needs another 128 bits (16 bytes). The creators
of RIPng decided to essentially reproduce the same route entry structure for the IPv6
Next Hop, but use a special value of the last field (the Metric) to indicate that the first
16 bytes in the route entry was an IPv6 Next Hop, not the route itself. The value chosen for the metric was 0xFF (255) because this was far beyond the legal hop count limit (15) for RIP.
FIGURE 14.4 The Next Hop in IPv6 with RIPng. Note the use of the special metric value. (The entry carries a 16-byte Next Hop IPv6 Address, then 3 bytes that must be all zeros, and a Metric of 0xFF.)
When the route entry used is an IPv6 Next Hop, the 3 bytes preceding the 0xFF
Metric must be set to all 0 bits. This is shown in Figure 14.4.
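The RIPng route entry and its special next hop entry can be sketched the same way the RIPv2 entry was earlier, again only to show the layout: a 16-byte IPv6 prefix (or next hop), a 2-byte Route Tag, a 1-byte Prefix Length, and a 1-byte Metric, with 0xFF marking a next hop entry. The addresses used here are documentation examples, not taken from this network.

import socket
import struct

NEXT_HOP_METRIC = 0xFF  # marks the entry as a next hop, not a route

def ripng_route_entry(prefix, prefix_len, metric, route_tag=0):
    # 16-byte IPv6 prefix, Route Tag, Prefix Length, Metric: 20 bytes total.
    return struct.pack(">16sHBB",
                       socket.inet_pton(socket.AF_INET6, prefix),
                       route_tag, prefix_len, metric)

def ripng_next_hop_entry(next_hop):
    # A next hop entry: the tag and prefix length bytes must all be zero.
    return struct.pack(">16sHBB",
                       socket.inet_pton(socket.AF_INET6, next_hop),
                       0, 0, NEXT_HOP_METRIC)

# The next hop entry qualifies the route entries that follow it.
body = ripng_next_hop_entry("fe80::1") + ripng_route_entry("2001:db8:11::", 64, 2)
print(len(body))  # 40 bytes: two 20-byte entries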
At first it might seem that the amount of the IPv6 routing information sent with
RIPng must instantly double in size, since now each 20-byte IPv6 route requires a
20-byte IPv6 Next Hop field. This certainly would make IPv6 very unattractive to current RIP users. But it was not necessary to include a Next Hop entry for each and every
IPv6 route because the creators of RIPng used a clever mechanism to optimize the use
of the Next Hop entry.
A Next Hop always qualifies any IPv6 routes that follow it in the string of route
entries until another Next Hop entry is reached or the packet stream ends. This keeps
the number of “extra” Next Hop entries needed in RIPng to an absolute minimum. And because the Next Hop has only a specialized use (just as in RIPv2), a lot of IPv6 routes need no Next Hop entry at all.
The decision to replace RIPv2 authentication with IPv6 security mechanisms was
based on the superior security used in IPv6. When used with RIPng updates, the IPv6
Authentication Header protects both the data inside the packet and the IP addresses of
the packet, but this is not the case with RIPv2 authentication no matter which method
is used. And IPv6 encryption can be used to add further protection.
A NOTE ON IGRP AND EIGRP
Cisco routers often use a proprietary IGP known as the Interior Gateway Routing
Protocol (IGRP) instead of RIP. Later, features were added to IGRP in the form of
Enhanced IGRP (EIGRP). In spite of the name, EIGRP was a complete redesign of
IGRP. This section will only give a brief outline of IGRP and EIGRP, since IGRP/EIGRP
interoperability with Juniper Networks routers is currently impossible.
IGRP and EIGRP might appear to be open standards, but this is only due to the wide-ranging deployment of Cisco routers. Cisco has never published the details of IGRP
internals (EIGRP is based on these), and is not likely to.
IGRP improves on RIP in several areas, but IGRP is still essentially a distance-vector
routing protocol. EIGRP, on the other hand, is advertised by Cisco as a “hybrid” routing protocol that includes aspects of link-state routing protocols such as OSPF and
IS–IS among the features of EIGRP. Today not many network administrators, even those with all-Cisco networks, would choose EIGRP over OSPF or IS–IS.
Open Shortest Path First
OSPF is not a distance-vector protocol like RIP, but a link-state protocol with a set of
metrics that can be used to reflect much more about a network than just the number
of routers encountered between source and destination. In OSPF, a router attempts to
route based on the “state of the links.”
OSPF can be equipped with metrics that can be used to compute the “shortest” path
through a group of routers based on link and router characteristics such as highest
throughput, lowest delay, lowest cost (money), link reliability, or even more. OSPF is still
used very cautiously, with default metrics based entirely on link bandwidth. Even with
this conservative use, OSPF link states are an improvement over simple hop counts.
Distance-vector routing protocols like RIP were fine for networks comprised of
equal speed links, but struggled when networks started to be built out of WAN links
with a wide variety of available speeds. When RIP first appeared, almost all WANs were
composed of low-speed analog links running at 9600 bps. Even digital links running at
56 or 64 kbps were mainly valued for their ability to carry five 9600-bps channels on
the same physical link. Commercial T1s at 1.544 Mbps were not widely available until
1984, and then only in major metropolitan areas. Today, the quickest way to send packets from one router to another is not always through the fewest number of routers.
The “open” in OSPF is based on the fact that the Shortest Path First (SPF) algorithm
was not owned by anyone and could be used by all. The SPF algorithm is often called
the Dijkstra algorithm after the computer and network pioneer who first worked it
out from graph theory. Dijkstra himself called the new method SPF, first described in
1959, because compared to a distance-vector protocol’s counting to infinity to produce
convergence, his algorithm always found the “shortest path first.”
OSPF version 1 (OSPFv1), described in RFC 1131, never matured beyond the experimental stage. The current version of OSPF, OSPFv2, which first appeared as RFC 1247
in 1991, and is now defined by RFC 2328 issued in 1998, became the recommended
replacement for RIP (although a strong argument could be made in favor of IS–IS, discussed later in this chapter).
Link States and Shortest Paths
Link-state protocols are all based on the idea of a distributed map of the network. All
of the routers that run a link-state protocol have the same copy of this network map,
which is built up by the routing protocol itself and not imposed on the network from
an outside source. The network map and all of the information about the routers and
links (and the routes) are kept in a link-state database on each router. The database
is not a “map” in the usual sense of the word: Records represent the topology of the
network as a series of links from one router to another. The database must be identical
on all of the routers in an area for OSPF to work.
Initially, each router only knows about a piece of the entire network. The local
router knows only about itself and the local interfaces. So link-state advertisements
(LSAs), the OSPF information sent to all other routers from the local router, always identify the local router as the source of the information.
The OSPF routing protocol “floods” this information to all of the other routers so
that a complete picture of the network is generated and stored in the link-state database. OSPF uses reliable flooding so that OSPF routers have ways to find out if the
information passed to another router was received or not.
The more routers and links that OSPF has to deal with, the larger the link-state database that has to be maintained. In large router networks, the routing information could
slow traffic. OSPFv2 introduced the idea of stub areas into an OSPF routing domain.
A stub area could function with a greatly reduced link-state database, and relied on a
special backbone area to reach the entire network.
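The "shortest path first" computation itself can be sketched in a few lines of Dijkstra's algorithm run over the link-state database, here boiled down to a simple cost map between routers. The router names echo the ones used on the Illustrated Network, but the links and costs are invented for the example, and this is not OSPF's actual data structure.

import heapq

def shortest_paths(lsdb, source):
    # lsdb: {router: {neighbor: link cost}}; returns least cost to every router.
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, router = heapq.heappop(queue)
        if cost > dist.get(router, float("inf")):
            continue  # stale queue entry
        for neighbor, link_cost in lsdb.get(router, {}).items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return dist

lsdb = {
    "P9": {"P4": 1, "PE5": 1},
    "P4": {"P9": 1, "PE5": 2},
    "PE5": {"P9": 1, "P4": 2},
}
print(shortest_paths(lsdb, "P9"))  # {'P9': 0, 'P4': 1, 'PE5': 1}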
What OSPF Can Do
By 1992, OSPF had matured enough to be the recommended IGP for the Internet and
had delivered on its major design goals.
Better Routing Metrics for Links
OSPF employs a configurable link metric with a range of valid values between 1 and
65,535. There is no limit on the total cost of a path between routers from source to
destination, as long as all the routers are in the same AS. Network administrators, for
example, could assign a metric of 10,000 to a low-bandwidth link and 10 to a very
high-bandwidth Metro Ethernet or SONET/SDH link. In theory, these values could be
manually assigned through a central authority. In practice, most implementations of
OSPF divide a reference bandwidth by the actual bandwidth on the link, which is
known through the router’s interface configuration. The default reference bandwidth
is usually 100 Mbps (Fast Ethernet). Since the metric cannot be less than 1, all links at 100 Mbps or faster use 1 as the link metric and thus revert to a simple hop count when computing lowest cost paths. The reference bandwidth is routinely raised to accommodate higher and higher bandwidths, but this requires a central authority to carry out
consistently.
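The usual cost computation is easy to sketch; the 100-Mbps reference bandwidth below is the common default mentioned above, and raising it is exactly the "central authority" chore just described.

def ospf_cost(link_bw_bps, reference_bw_bps=100_000_000):
    # Cost is the reference bandwidth divided by the link bandwidth,
    # never allowed to fall below 1.
    return max(1, reference_bw_bps // link_bw_bps)

for name, bw in [("64 kbps", 64_000), ("T1", 1_544_000),
                 ("Fast Ethernet", 100_000_000), ("Gigabit Ethernet", 1_000_000_000)]:
    print(name, ospf_cost(bw))
# 64 kbps -> 1562, T1 -> 64, Fast Ethernet -> 1, Gigabit Ethernet -> 1
# (the last result is why the reference bandwidth keeps getting raised)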
Equal-Cost Multipaths
There are usually multiple ways to reach the same destination network that the routing protocol will compute as having the same cost. When such equal-cost paths exist, OSPF routers can find and use all of them. This means that there can be multiple next
hops installed in a forwarding table with OSPF. OSPF does not specify how to use these
multipaths: Routers can use simple round-robin per packet, round-robin per flow, hashing, or other mechanisms.
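A per-flow hashing scheme like the one just mentioned could be sketched as follows. The 5-tuple fields and the hash function are illustrative only; real routers use their own (often hardware) hashes, and the interface names are just examples in the style of this book.

import hashlib

def pick_next_hop(next_hops, src_ip, dst_ip, proto, src_port, dst_port):
    # Hash the flow's 5-tuple so that every packet of the flow takes the same
    # equal-cost path and is not reordered.
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

paths = ["so-0/0/1", "so-0/0/3"]
print(pick_next_hop(paths, "10.10.11.177", "10.10.12.77", 6, 40000, 80))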
Router Hierarchies
OSPF made very large routing domains possible by introducing a two-level hierarchy
of areas. With OSPF, the concepts of an “edge” and “backbone” router became common
and well understood.
Internal and External Routes
It is necessary to distinguish between routing information that originated within the
AS (internal routing information) and routing information that came from another AS
(external routing information). Internal routing information is generally more trusted
than external routing information that might have passed from ISP to ISP across the
Internet.
Classless Addressing
OSPF was first designed in a classful Internet environment with Class A, B, and C
addresses. However, OSPF is comfortable with the arbitrary network/host boundaries
used by CIDR and VLSM.
Security
RIPv1 routers accepted updates from anyone, and even RIPv2 routers only officially
used simple plain-text passwords that could be discovered by anyone with access to
the link. OSPF allows not only for simple password authentication, but strong MD5 key
mechanisms on routing updates.
ToS Routing
The original OSPF was intended to support the bit patterns established for the Type of
Service (ToS) field in the IP packet header. Routers at the time had no way to enforce
ToS routing, but OSPF anticipated the use of the Internet for all types of traffic such
as voice and video and went ahead and built into OSPF ways to distribute multiple
metrics for links. So OSPF routing updates can include ToS routing information for
five IP ToS service classes, defined in RFC 1349. The service categories and OSPF ToS
values are normal service (ToS = 0), minimize monetary cost (2), maximize reliability
(4), maximize throughput (8), and minimize delay (16). Since all current implementations of OSPF support only a ToS value of 0, no more need be said about the other ToS
metrics.
By the way, here’s all we did on the customer- and provider-edge routers in each AS
to configure OSPF to run on every router interface. Now, in a real network, we wouldn’t
necessarily configure OSPF to run on all of the router’s internal or management interfaces, but it does no harm here.
set protocols ospf area 0.0.0.0 interface all
All OSPF routers do not have to be in the same area, and in most real router networks, they aren’t. But this is a simple network and only configures an OSPF backbone
area, 0.0.0.0. The provider routers in our ISP cores (P9, P7, P4 and P2), which are called
AS border routers, or ASBRs, run OSPF on the internal links within the AS, but not on
the external links to the other AS (this is where we’ll run the EGP).
The relationship between the OSPF use of a reference bandwidth and ToS routing
should be clarified. Use of the OSPF link reference bandwidth is different from and
independent of ToS support, which relies on the specific settings in the packet headers. OSPF routers were supposed to keep separate link-state databases for each type
of service, since the least-cost path in terms of bandwidth could be totally different
from the least-cost path computed based on delay or reliability. This was not feasible
in early OSPF implementations, which struggled to maintain the single, normal ToS = 0
database. And it turned out that the Internet users did not want lots of bandwidth or
low delay or high reliability when they sent packets. Internet users wanted lots of
bandwidth and low delay and high reliability when they sent packets. So the reference
bandwidth method is about all the link-state that OSPF can handle, but that is still better than nothing.
OSPF Router Types and Areas
OSPFv2 introduced areas as a way to cut down on the size of the link-state database, the
amount of information flooded, and the time it takes to run the SPF algorithm, at least
on areas other than the special backbone area.
An OSPF area is a logical grouping of routers sharing the same 32-bit Area ID. The
Area ID can be expressed in dotted decimal notation similar to an IP address, such as
192.168.17.33. The Area ID can also be expressed as a decimal equivalent, so Area 261
is the same as Area 0.0.1.5. When the Area ID is less than 256, usually only a single number is used, but Area 249 is still really Area 0.0.0.249.
There are five OSPF area types. The position of a router with respect to OSPF areas
is important as well. The area types are shown in Figure 14.5.
The OSPF Area 0 (0.0.0.0) is very special. This is the backbone area of an OSPF
routing domain. An OSPF routing domain (AS) can consist of a single area, but in that
case the single area must be Area 0. Only the backbone area can generate the summary
routing topology information that is used by the other areas. This is why all interarea
traffic must pass through the backbone area. (There are backdoor links that can be
configured on some routers to bypass the backbone area, but these violate the OSPF
specification.) In a sense, the backbone area knows everything. Not so long ago, only
powerful high-end routers could be used on an OSPF backbone. On the Illustrated Network, each AS consists of only an Area 0.
If an area is not the backbone area, it can be one of four other types of areas. All of
these areas connect to the backbone area through an Area Border Router (ABR). An
ABR by definition has links in two or more areas. In OSPF, routers always form the
boundaries between areas. A router with links outside the OSPF routing domain is
called an autonomous system boundary router (ASBR). Routing information about destination IP addresses not learned from OSPF is always advertised by an ASBR. Even
when static routes, or RIP routes, are redistributed by OSPF, that router technically
becomes an ASBR. ASBRs are the source of external routes that are outside of the
OSPF routing domain, and external routes are often very numerous in an OSPF routing domain attached to the global Internet. If a router is not an ABR or ASBR, it is either an internal router and has all of its interfaces within the same area, or a backbone router with at least one link to the backbone. However, these terms are not as critical to OSPF configurations as ABRs or ASBRs are. That is, not all backbone routers are ABRs or ASBRs; backbone routers can also be internal routers, and so on.

FIGURE 14.5 OSPF area types, showing the various ways that areas can be given numbers (decimal, IP address, or other). Note that ABRs connect areas and ASBRs have links outside the AS or to other routing protocols. (The figure shows a backbone Area 0 with an ASBR and an inter-AS link; Area 11, a non-backbone, non-stub area with its own ASBR; Area 1.17, a stub area with no ASBR allowed and default external routes; Area 24, a total stub area with no ASBR and only one default route; and Area 10.0.0.3, an NSSA where an ASBR is allowed but which is otherwise the same as a stub area, all connected to the backbone by ABRs.)
Non-backbone, Non-stub Areas
These areas are really smaller versions of the backbone area. There can be links to other
routing domains (ASBRs) and the only real restriction on a non-backbone, non-stub area
is that it cannot be Area 0. Area 11 in Figure 14.5 is a non-backbone, non-stub area.
Stub Area
Stub areas cannot have links outside the AS. So there can be no ASBRs in a stub area. This
minimizes the amount of external routing information that needs to be distributed into
the link-state databases of the stub area routers. Because an AS might be an ISP on the
Internet, the number of external routes required in an OSPF routing domain is usually
many times larger than the internal routes of the AS itself. Stub area routers only obtain
information on routes external to the AS from the ABR. Area 1.17 in Figure 14.5 is a
stub area.
Total Stub Area
This is also called a “totally stubby area.” Recall that stub areas cannot have ASBRs
within them, by definition. But stub area routers can only reach ASBRs in other areas, which have the
links leading to and from other ASs, through an ABR. So why include detailed external
route information in the stub area router’s link-state database? All that is really needed
is the proper default route as advertised by the ABR. Total stub areas only know how
to reach their ABR for a route that is not within their area. Area 24 in Figure 14.5 is a
total stub area.
Not-So-Stubby Area
Banning ASBRs from stub areas was very restrictive. Even the advertisement of static
routes into OSPF made a router an ASBR, as did the presence of a single LAN running
RIP, if the routes were advertised by OSPF. And as ISPs merged and grew by acquiring
smaller ISPs, it became difficult to “paste” the new OSPF area with its own ASBRs onto
the backbone area of the other ISP. The easiest thing to do was to make the new former
AS a stub area, but the presence of an ASBR prevented that solution. The answer was to
introduce the concept of a not-so-stubby area (NSSA) in RFC 1587. An NSSA can have
ASBRs, but the external routing information introduced by this ASBR into the NSSA is
either kept within the NSSA or translated by the ABR into a form useful on the backbone Area 0 and to other areas. Area 10.0.0.3 in Figure 14.5 is an NSSA.
OSPF Designated Router and Backup Designated Router
An OSPF router can also be a Designated Router (DR) and Backup Designated Router
(BDR). These have nothing to do with ABRs and ASBRs, and concern only the relationship between OSPF routers on links that deliver packets to more than one destination
at the same time (mainly LANs).
There are two major problems with LANs and public data networks like ATM and
frame relay (called non-broadcast multiple-access, or NBMA, networks). First is the fact
that the link-state database represents links and routers as a directed graph. A simple LAN with five OSPF routers would need N(N − 1)/2, or 5(4)/2 = 10, link-state advertisements just to represent the links between the routers, even though all five routers are mutually adjacent on the LAN and any frame sent by one is received by the other four.
Second, and just as bad, is the need for flooding. Flooding over a LAN with many OSPF
routers is chaotic, as link-state advertisements are flooded and “reflooded” on the LAN.
To address these issues, multiaccess networks such as LANs always elect a designated router for OSPF. The DR solves the two problems by representing the multiaccess network as a single “virtual router” or “pseudo-node” to the rest of the network
and managing the process of flooding link-state advertisements on the multiaccess
network. So each router on a LAN forms an OSPF adjacency only with the DR (and also
the Backup DR [BDR] as mentioned later). All link-state advertisements go only to the
DR (and BDR), and the DR forwards them on to the rest of the network and internetwork routers.
Each network that elects a DR also elects a BDR that will take over the functions of
the DR if and when the DR fails. The DR and BDR form OSPF adjacencies with all of the
other routers on the multiaccess network and the DR and BDR also form an adjacency
with each other.
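The payoff of the DR/BDR arrangement is easy to quantify. Without a DR, every pair of the N routers on the multiaccess network needs its own adjacency; with a DR and BDR, each of the other routers peers only with those two. The little comparison below is a standard illustration, not something OSPF computes at run time.

def full_mesh_adjacencies(n):
    # Every pair of routers on the multiaccess network.
    return n * (n - 1) // 2

def dr_bdr_adjacencies(n):
    # Each of the n - 2 other routers peers with the DR and the BDR,
    # plus the adjacency between the DR and BDR themselves.
    return 2 * (n - 2) + 1

for n in (5, 10, 50):
    print(n, full_mesh_adjacencies(n), dr_bdr_adjacencies(n))
# 5 routers: 10 versus 7; 10 routers: 45 versus 17; 50 routers: 1225 versus 97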
OSPF Packets
OSPF routers communicate using IP packets. OSPF messages ride directly inside of IP
packets as IP protocol number 89. Because OSPF does not use UDP or TCP, the OSPF
protocol is fairly elaborate and must reproduce many of the features of a transport protocol to move OSPF messages between routers.
There can be one of five OSPF packet types inside the IP packet, all of which
share a common OSPF header. The structure of the common OSPF header is shown in
Figure 14.6.
The version field is 2, for OSPFv2, and the type has one of the five values. The
packet length is the length of the OSPF packet in bytes. The Router ID is the IP address
selected as OSPF Router ID (usually the loopback interface address), and the Area ID is
the OSPF area of the router that originates the message. The checksum is the same as
the one used on IP packets and is computed on the whole OSPF packet.
FIGURE 14.6 OSPF packet header fields, showing how the structure can vary with type. (The header carries a 1-byte Version, a 1-byte Type, a 2-byte Packet Length, a 4-byte Router ID, a 4-byte Area ID, a 2-byte Checksum, a 2-byte Authentication Type, and 8 bytes of Authentication. When the authentication type = 2, the authentication field carries a Key ID, an Authentication Length, and a Cryptographic Sequence Number.)
The Authentication Type (or AuType) is either none (0), simple password authentication (1), or cryptographic authentication (2). The simple password is an eight-character plain-text password, but the use of AuType = 2 authentication gives the
authentication field the structure shown in the figure. In this case, the Key ID identifies
the secret key and authentication algorithm (MD5) used to create the message digest,
the Authentication Data Length specifies the length of the message digest appended
to the packet (which does not count as part of the packet length), and the Cryptographic Sequence Number always increases and prevents hacker “replay” attacks.
OSPFv3 for IPv6
The changes made to OSPF for IPv6 are minimal. It is easy to transition from OSPF
for IPv4 to OSPF for IPv6. There is a new version number, OSPF version 3 (OSPFv3),
and some necessary format changes, but less than might be expected. The basics are
described in RFC 2740.
OSPF for IPv6 (often called OSPFv6) will use link local IPv6 addresses and IPv6
multicast addresses. The IPv6 link-state database will be totally independent of the IPv4
link-state database, and both can operate on the same router.
Naturally, OSPFv6 must make some concessions to the larger IPv6 addresses and
next hops. But the common LSA header has few changes as well. The Link State Identifier field is still there, but is now a pure identifier and not an IPv4 address. There is
no longer an Options field, since this field also appears in the packets that need it,
and the LSA Header Type field is enlarged to 16 bits. Naturally, when LSAs carry the
details of IPv6 addresses, those fields are now large enough to handle the 128-bit IPv6
addresses.
INTERMEDIATE SYSTEM–INTERMEDIATE SYSTEM
OSPF is not the only link-state routing protocol that ISPs use within an AS. The other
common link-state routing protocol is IS–IS (Intermediate System–Intermediate
System). When IS–IS is used with IP, the term to use is Integrated IS–IS. IS–IS is not really
an IP routing protocol. IS–IS is an ISO protocol that has been adapted (“integrated”) for
IP in order to carry IP routing information inside non-IP packets.
IS–IS packets are not IP packets, but rather ConnectionLess Network Protocol
(CLNP) packets. CLNP packets have ISO addresses, not IP source and destination
addresses. CLNP packets are not normally used for the transfer of user traffic from
client to server, but for the transfer of link-state routing information between routers.
IS–IS does not have “routers” at all: Routers are called intermediate systems to distinguish them from the end systems (ES) that send and receive traffic.
The independence of IS–IS from IP has advantages and disadvantages. One advantage is that network problems can often be isolated to IP itself if IS–IS is up and running
between two routers. One disadvantage is that there are now sources and destinations
on the network (the ISO addresses) that are not even “ping-able.” So if a link between
two routers is configured with incorrect IP addresses (such as 10.0.37.1/24 on one
router and 10.0.38.2/24 on the other), IS–IS will still come up and exchange routing
information over the link, but IP will not work correctly, leaving the network administrators wondering why the routing protocol is working but the routes are broken.
Our network does not use IS–IS, so much of this section will be devoted to introducing IS–IS terminology, such as the link-state PDU (LSP) instead of OSPF's
link-state advertisement (LSA), and contrasting IS–IS behavior with OSPF.
The IS–IS Attraction
If IS–IS is used instead of OSPF as an IGP within an AS, there must be strong reasons
for doing so. Why introduce a new type of packet and addressing to the network?
And even the simple task of assigning ISO addresses to routers can be a complex task.
Yet many ISPs see IS–IS as being much more flexible than OSPF when it comes to the
structure of the AS.
IS–IS routers can form both Level 1 (L1) and Level 2 (L2) adjacencies. L1 links connect routers in the same IS–IS area, and L2 links connect routers in different areas. In
contrast to OSPF, IS–IS does not demand that traffic sent between areas use a special
backbone area (Area 0.0.0.0). IS–IS does not care if interarea traffic uses a special area
or not, as long as it gets there. The same is true when a larger ISP acquires a smaller one
and it is necessary to “paste” new areas onto existing areas. With IS–IS, an ISP can just
paste the new area wherever it makes sense and configure IS–IS L1/L2 routers in the
right places. IS–IS takes care of everything.
A backbone area in IS–IS is simply a contiguous collection of routers in different
areas capable of running L2 IS–IS. The fact that the routers must be directly connected
(contiguous) to form the backbone is not too much of a limitation (most core routers
on the backbone usually have multiple connections). Each and every IS–IS backbone
router can be in a different area. If an AS structure similar to centralized OSPF is desired,
this is accomplished in IS–IS by running certain (properly connected) routers as
L2-only routers in one selected area (the backbone), connecting areas adjacent to
the central area with L1/L2 routers, and making the routers in the other areas
L1-only routers. The IS–IS attraction is in this type of flexibility compared to OSPF.
IS–IS and OSPF
ISO’s idea of a network layer protocol was CLNP. To distribute the routing information,
ISO invented ES–IS to get routing information from routers to and from clients and
servers, and IS–IS to move this information between routers.
IS–IS came from DEC as part of the company’s effort to complete DECnet Phase
V. Standardized as ISO 10589 in 1992, it was once thought that IS–IS would be the
natural progression from RIP and OSPF to a better routing protocol. (OSPF was struggling at the time.) To ease the transition from IP to OSI-RM protocols, Integrated IS–IS
(or Dual IS–IS) was developed to carry routing information for both IP and ISO-RM
protocols.
OSPF rebounded, ironically by often borrowing what had been shown to work
in IS–IS. Today OSPF is the recommended IGP to run on the Internet, but IS–IS still
has adherents for reasons of flexibility. Of course, OSPF has much to recommend it
as well.
Similarities of OSPF and IS–IS
■ Both IS–IS and OSPF are link-state protocols that maintain a link-state database and run an SPF algorithm based on Dijkstra to compute a shortest path tree of routes.
■ Both use Hello packets to create and maintain adjacencies between neighboring routers.
■ Both use areas that can be arranged into a two-level hierarchy or into interarea and intraarea routes.
■ Both can summarize addresses advertised between their areas.
■ Both are classless protocols and handle VLSM.
■ Both will elect a designated router on broadcast networks, although IS–IS calls it a designated intermediate system (DIS).
■ Both can be configured with authentication mechanisms.
Differences between OSPF and IS–IS
Many of the differences between IS–IS and OSPF are matters of terminology. The use of the terms IS and ES has been mentioned. IS–IS has a subnetwork point of attachment (SNPA)
instead of an interface, protocol data units (PDUs) instead of packets, and other minor
differences. OSPF LSAs are IS–IS link-state PDUs (LSPs), and LSPs are packets all on their own; they do not use OSPF's encapsulation of an LSA inside an OSPF header inside an IP packet.
But not all IS–IS and OSPF differences are trivial. Here are the major ones.
Areas—In OSPF, ABRs sit on the borders of areas, with one or more interfaces
in one area and other interfaces in other areas. In IS–IS, a router (IS) is either
totally in one area or another, and it is the links between the routers that connect the areas.
Route Leaking—When L2 information is redistributed into L1 areas, it is called
route leaking. Route leaking is defined in RFC 2966. A bit called the Up/Down bit is used to distinguish routes that are local to the L1 area (Up/Down = 0) from those that have been leaked into the area from an L1/L2 router (Up/Down = 1). This is necessary to prevent potential routing loops. Route leaking is a way to make IS–IS areas with L1-only routers as “smart” as OSPF routers
in not-so-stubby-areas (NSSAs).
Network Addresses—CLNP does not use IP addresses in its packets. IS–IS packets
use a single ISO area address (Area ID) for the entire router because the
router must be within one area or another. Every IS–IS router can have up to
three different area ISO addresses, but this chapter uses one ISO address per
router. The ISO Area ID is combined with an ISO system address (System ID)
to give the ISO Network Entity Title, or NET. Every router must be given an ISO
NET as described in ISO 8348 (a short sketch of how a NET breaks into its parts follows this list of differences).
Network Types—OSPF has five different link or network types that OSPF can
be configured to run on: point-to-point, broadcast, non-broadcast multi-access
(NBMA), point-to-multipoint, and virtual links. In contrast, IS–IS defines only
two types of links or subnetworks: broadcast (LANs) and point-to-point (called
“general topology”). This only distinguishes links that can support multicasting (broadcast) and use a designated intermediate system (DIS) from links that do not support multicasting.
Designated Intermediate System (DIS)—Although IS–IS technically uses a DIS,
many still refer to these devices as a designated router (DR). The DIS or DR
represents the entire multiaccess network link (such as a LAN) as a single
pseudo-node. The pseudo-node (a “virtual node” in some documentation) does
not really exist, but there are LSPs that are issued for the entire multiaccess
network as if the pseudo-node were a real device. Unlike OSPF, all IS–IS routers on a pseudo-node (such as a LAN) are always fully adjacent to the pseudonode. This is due to the lack of a backup DIS, and new DIS elections must take
place quickly.
LSP Handling—IS–IS routers handle LSPs differently than OSPF routers handle
LSAs. While OSPF LSAs age from zero to a maximum (MaxAge) value of 3600 seconds (1 hour), IS–IS LSPs age downward from a MaxAge of 1200 seconds (20 minutes) to 0. The normal refresh interval is 15 minutes. Since IS–IS does not use IP
addresses, multicast addresses cannot be used in IS–IS for LSP distribution. Instead,
a MAC destination address of 0180.c200.0014 (AllL1ISs) is used to carry L1 LSPs to
L1 ISs (routers), and a MAC destination address of 0180.c200.0015 (AllL2ISs) is used
to carry L2 LSPs to L2 ISs (routers).
Metrics—Like OSPF, IS–IS can use one of four different metrics to calculate least-cost
paths (routes) from the link-state database. For IS–IS, these are default (all routers
must understand the default metric system), delay, expense, and error (reliability in
OSPF). Only the default metric system is discussed here, as with OSPF, and that is the
only system that most router vendors support. The original IS–IS specification used
a system of metric values that could only range from 0 to 63 on a link, and paths (the
sum of all link costs along the route) could have a maximum cost of 1023. Today,
IS–IS implementations allow for “wide metrics” to be used with IS–IS. This makes
the IS–IS metrics 32 bits wide.
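Since our network does not run IS–IS, the following is only an illustrative sketch of how a NET string breaks apart. The example NET is hypothetical, built from router P9's loopback address 192.168.9.1 using the common pad-the-octets convention: the last byte is the selector (always 00 on a router), the 6 bytes before it are the System ID, and whatever is left in front is the Area ID.

def parse_net(net):
    digits = net.replace(".", "")
    selector = digits[-2:]       # 1 byte, written as two hex digits
    system_id = digits[-14:-2]   # 6 bytes
    area_id = digits[:-14]       # whatever remains (1 to 13 bytes)
    return area_id, system_id, selector

print(parse_net("49.0001.1921.6800.9001.00"))
# ('490001', '192168009001', '00')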
IS–IS for IPv6
One advantage that IS–IS has over OSPF is that IS–IS is not an IP protocol and is not as
intimately tied up with IPv4 as OSPF. So IS–IS has fewer changes for IPv6: IPv4 is already
strange enough.
With IPv6, the basic mechanisms of RFC 1195 are still used, but two new Type-Length-Value (TLV) structures, which define how the information is represented, are defined for IPv6.
IPv6 Interface Address (type 232)—This TLV just modifies the interface address
field for the 16-byte IPv6 address space.
IPv6 Reachability (type 236)—This TLV starts with a 32-bit wide metric. Then
there is an Up/Down bit for route leaking, an I/E bit for external (other routing
protocol or AS) information, and a “sub-TLVs present?” bit. The last 5 bits of this
byte are reserved and must be set to 0. There is then 1 byte of Prefix Length
(VLSM) and from 0 to 16 bytes of the prefix itself, depending on the value of
the Prefix Length field. Zero to 248 bytes of sub-TLVs end the TLV.
Both types have defined sub-TLV fields, but none of these has yet been standardized.
QUESTIONS FOR READERS
Figure 14.7 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 14.7 Three IGPs and some of their major characteristics: a RIP distance-vector routing domain, an OSPF link-state routing domain with multiple areas and ABRs around a backbone Area 0.0.0.0, and an IS–IS link-state routing domain with an L2 router “chain” as the backbone. AS border routers link the three routing domains.
1. Why does RIP continue to be used in spite of its limitations?
2. What is the difference between distance-vector and link-state routing protocols?
3. It is often said that it is easier to configure a backbone area in IS–IS than in
OSPF. What is the basis for this statement?
4. What are the similarities between OSPF and IS–IS?
5. What are the major differences between OSPF and IS–IS?
CHAPTER 15
Border Gateway Protocol
What You Will Learn
In this chapter, you will learn about BGP and the essential role it plays on the
Internet. With BGP, routing information is circulated outside the AS and to all routing domains. We’ll see how a simple routing policy change can make a destination
unreachable.
You will learn about the differences between interior BGP (IBGP) and exterior BGP (EBGP), and why both are needed. We'll also look at
BGP attributes and message formats.
The EGP used on the Internet is the Border Gateway Protocol (BGP). IGPs run between
the routers inside a routing domain (single AS). BGP runs between different autonomous systems (ASs). BGP runs on links between the border routers of these routing
domains and shares information about the routes within the AS or learned by the AS
with the AS on the other side of the “border.”
BGP makes sure that every network and interface in any AS located anywhere on
the Internet is reachable from every other place. BGP does not generate any routing
information on its own, unlike the IGPs, which essentially “bootstrap” themselves into
existence. BGP relies on an underlying IGP (or static routes) as the source of the BGP-distributed information.
BGP runs on the border routers of Ace ISP’s AS 65459 (routers P9 and P4) and Best
ISP’s AS 65127 (routers P7 and P2). These are highlighted in Figure 15.1. An IGP such as
OSPF or IS–IS runs on the direct links between routers P9 and P4 and routers P7 and P2,
but these are interior links. BGP runs on the other links between the backbone routers.
BGP AS A ROUTING PROTOCOL
There are EGPs defined other than BGP. The Inter-Domain Routing Protocol (IDRP)
from ISO is the EGP that was to be used with IS–IS as an IGP. IDRP is also sometimes
promoted as the successor to BGP, or the best way to carry IPv6 routing information
between ISP ASs. However, when it comes to the Internet today, the only EGP worth considering is BGP.

FIGURE 15.1 BGP on the Illustrated Network, highlighting the border routers P9 and P4 in Ace ISP's AS 65459 and P7 and P2 in Best ISP's AS 65127, the customer-edge routers CE0 and CE6 that connect the Los Angeles (LAN1) and New York (LAN2) offices, and the links toward the global public Internet. (Solid rules = SONET/SDH; dashed rules = Gigabit Ethernet. All links use 10.0.x.y addressing, and only the last two octets are shown.)
In a very real sense, BGP is not a routing protocol at all. BGP does not really
carry routing information from AS to AS, but information about routes from AS to AS.
Generally, a route that passes through fewer ASs (ISPs) than another is considered more
attractive, although there are many other factors (BGP attributes) to consider. BGP is a
routing protocol without real routes or metrics, and both of those derive from the IGP.
BGP is not a link-state protocol, because the state of links in many AS clouds would be
difficult to convey and maintain across the entire network (and links would tend to
“average out” to a sort of least common denominator anyway). But it's not a distance-vector protocol either, because more attributes than just AS path length determine
active routes. BGP is called a “path-vector” protocol (a vector has a direction as well as
value), but mainly because a new term was needed to describe its operation.
BGP information is not even described as a “route.” BGP carries network layer
reachability information (NLRI). BGP “routes” do not have metrics, like IGP routes, but
attributes. Together, the BGP NLRI and their attributes allow other ASs to make decisions about the best way to reach a route (network) in another AS. Once a packet is
routed to the correct AS through BGP information, the packet is delivered locally using
the IGP information.
The differences between BGP and IGPs should always be remembered. Some people new
to BGP struggle with BGP terminology and concepts because they attempt to interpret
BGP features in terms of more familiar IGP features. BGP does not work like an IGP
because BGP is not an IGP and should not work like an IGP. When BGP passes information from one AS border router to another AS border router inside an AS, a form known
as interior BGP (IBGP) is used. When BGP passes information from one AS to another
AS, the form of BGP used is called exterior BGP (EBGP).
This chapter does not deal much with routing policies for BGP based on multiple
attributes, which determine how the routers use BGP to route packets. Complex routing policies are beyond the scope of this book.
Configuring BGP
It’s important to keep in mind exactly what is meant by a routing domain and routing
policy. For example, is CE0 part of AS 65459 or not? This is not as simple a question as
it sounds, because there might be a dozen routers behind CE0 that the Ace ISP knows
nothing about. But the interface to PE5 is firmly under the control of Ace, and generally
all customer site routers are considered part of the ISP’s routing domain in the sense
that a routing policy on PE5 can always control the routing behavior of CE0.
This does not mean something like preventing the users on LAN1 from running
Internet chat applications. This type of application-level detailing is not what a routing policy is for. Corporate policies of this type (application policing) are best handled by an appliance on site. ISP routing policies determine things like where the
CHAPTER 15 Border Gateway Protocol
383
10.10.11.0/24 route to LAN1 is advertised or held back, and which routes are accepted
from other sources.
Let’s see how easy it is to configure BGP on the border routers. Each of them is
essentially identical in basic configuration, so let’s use P9 as an example.
set protocols bgp group ebgp-to-as65127 type external;
set protocols bgp group ebgp-to-as65127 peer-as 65127;
set protocols bgp group ebgp-to-as65127 neighbor 10.0.79.1;
set protocols bgp group ebgp-to-as65127 neighbor 10.0.29.1;
set protocols bgp group ibgp-mesh type internal;
set protocols bgp group ibgp-mesh local-address 192.168.9.1;
set protocols bgp group ibgp-mesh neighbor 192.168.4.1;
set protocols bgp group ibgp-mesh neighbor 192.168.5.1;
BGP configurations are organized into groups that have user-defined names
(ebgp-to-as65127 and ibgp-mesh). Note that there are two types of BGP running on
the border routers: EBGP and IBGP. EBGP must know the other AS number and IBGP
must know the local address to use as a source address (routers typically have many
IP addresses). Note that EBGP uses link addresses and IBGP uses the router’s “loopback”
address, in this case the address assigned to the routing engine. We’ll see why this is
usually done when we discuss EBGP and IBGP later in this chapter.
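The chapter shows only P9, but it may help to picture the other side of the EBGP sessions. Here is a sketch of what the matching configuration on P7 (in AS 65127) might look like, assuming the interface and loopback addresses shown in Figure 15.1; the group name and the IBGP neighbor list are illustrative assumptions, not taken from the book.

set protocols bgp group ebgp-to-as65459 type external;
set protocols bgp group ebgp-to-as65459 peer-as 65459;
set protocols bgp group ebgp-to-as65459 neighbor 10.0.79.2;
set protocols bgp group ebgp-to-as65459 neighbor 10.0.47.1;
set protocols bgp group ibgp-mesh type internal;
set protocols bgp group ibgp-mesh local-address 192.168.7.1;
set protocols bgp group ibgp-mesh neighbor 192.168.2.1;
set protocols bgp group ibgp-mesh neighbor 192.168.1.1;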
We showed at the end of the previous chapter that we could ping IPv6 addresses
from the Windows XP client on LAN1 to the Windows XP client on LAN2. Let’s see
if the same works for the IPv4 addresses on the Unix hosts. All is well between
bsdclient and bsdserver.
bsdclient# ping 10.10.12.77
PING 10.10.12.77 (10.10.12.77): 56 data bytes
64 bytes from 10.10.12.77: icmp_seq=0 ttl=255 time=0.600 ms
64 bytes from 10.10.12.77: icmp_seq=1 ttl=255 time=0.477 ms
64 bytes from 10.10.12.77: icmp_seq=2 ttl=255 time=0.441 ms
64 bytes from 10.10.12.77: icmp_seq=3 ttl=255 time=0.409 ms
^C
--- 10.10.12.77 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.409/0.482/0.600/0.072 ms
The default behavior for BGP is to advertise all active routes that it learns by its
own operation, so no special advertising policies are needed on the backbone routers. Because there are direct links in place between the two ISPs to connect the Los
Angeles office (LAN1) with the New York office (LAN2), each ISP relies on the routing
protocol metrics to make sure traffic flowing between LAN1 (10.10.11/24) and LAN2
(10.10.12/24) is not forwarded onto the Internet. That is, the cost of forwarding a
LAN1-LAN2 packet between the provider backbone routers will always be less than
using the Internet at large.
However, one day the users on LAN1 and LAN2 discover a curious thing: no one can
reach servers on the other LAN. Pings to the local router work fine, but pings to remote
hosts on the other LAN produce no results at all.
bsdserver# ping 10.10.12.1
PING 10.10.12.1 (10.10.12.1): 56 data bytes
64 bytes from 10.10.12.1: icmp_seq=0 ttl=255 time=0.599 ms
64 bytes from 10.10.12.1: icmp_seq=1 ttl=255 time=0.476 ms
64 bytes from 10.10.12.1: icmp_seq=2 ttl=255 time=0.401 ms
64 bytes from 10.10.12.1: icmp_seq=3 ttl=255 time=0.443 ms
^C
--- 10.10.12.1 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.401/0.480/0.599/0.071 ms
bsdserver# ping 10.10.11.177
PING 10.10.11.177 (10.10.11.177): 56 data bytes
^C
--- 10.10.11.177 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss
The remote router cannot be pinged either (presumably, no security settings prevent them
from pinging another site router’s interface).
bsdserver# ping 10.10.11.1
PING 10.10.11.1 (10.10.11.1): 56 data bytes
^C
--- 10.10.11.1 ping statistics ---
7 packets transmitted, 0 packets received, 100% packet loss
The Power of Routing Policy
There are many things that could be wrong in this situation. In this case, the cause of
the problem is ultimately determined to be a feud between the Ace and Best ISPs
running the service provider routers. The issue (greatly exaggerated here) is a server
located on LAN2 in New York. This essential server provides full-motion video, huge
database files, and all types of other information to the clients in Los Angeles on LAN1.
Naturally, a lot more packets flow from Best ISP’s AS to Ace ISP’s AS than the other way
around. So, the Ace ISP (AS 65459) controlling border routers P9 and P4 decided that
Best ISP (AS 65127) should pay for all these “extra” packets they were delivering from
the New York server. Shortly before the LANs stopped communicating, they sent a bill
to Best ISP—turning AS 65127 from a peer into a customer.
Naturally, Best ISP was not happy about this new arrangement and refused to pay.
So, Best ISP decided to do a simple thing: they applied a routing policy and did not accept
any information about the LAN1 network (10.10.11/24) from AS 65459’s border routers
(P9 and P4). If Best ISP’s border routers (P7 and P2) don’t know how to send packets back to LAN1
from the servers on LAN2, Ace ISP will be getting what they paid Best ISP for—which
is nothing. (In the real world, the customer paying for LAN1 and LAN2 connectivity
would be asked to pay for the asymmetrical traffic load.)
Without the correct routing information available on the routers in both ASs, no
one on LAN2 can find a route to LAN1. Even if there were still some connectivity
between the sites through Ace and Best ISPs’ links to the Internet, the
symptom would show up as sharply increased network delay (and related application
timeouts), as packets now wander through many more hops than before. Something
would still clearly be wrong.
This large effect comes from a very simple cause. Let’s look at the routing tables and
policies on P2 and P7 (and P9 and P4) and see what has happened. Best ISP has applied
a very specific routing policy to their external BGP session with Ace ISP’s border routers. Here’s what it looks like on P7.
set policy-statement no-10-10-11 term1 from route-filter 10.10.11.0/24 exact;
set policy-statement no-10-10-11 term1 then reject;
This basically says, “Out of all the routing protocol information, find (filter) the information matching the network 10.10.11.0/24 exactly and nothing else; then discard
(reject) this information and do not use it in the routing or forwarding tables.”
This import policy on P7 and P2 (Best ISP’s routers) is applied on links from neighbor border routers P4 and P9 (Ace ISP’s routers). The effect is to block BGP in AS 65127
from learning anything at all about network 10.10.11/24 from P4 and P9. Normally, Best
ISP’s backbone routers would pass the information about the route to LAN1 through
P7 and P2 to all other routers in the AS, including CE6 (LAN2’s site router). Without this
information, no forwarding table can be built on CE6 to allow packets to reach LAN1.
Problem solved: no packets for LAN1 can flow through Best ISP’s router network.
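By itself, the policy does nothing until it is applied to the EBGP sessions as an import policy. That step is not shown here, but on P7 and P2 it would amount to a single statement along these lines, in the same abbreviated style as the configurations above (the group name ebgp-to-as65459 is an assumption):

set protocols bgp group ebgp-to-as65459 import no-10-10-11;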
Note that Best ISP (AS 65127) still advertises its own LAN2 network (10.10.12/24)
to Ace ISP, and Ace ISP’s routers accept and distribute the information. So, on LAN1 the
site router CE0 still knows about both LANs.
CE0# show route 10.10/16
inet.0: 38 destinations, 38 routes (38 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
10.10.11.0/24 *[Direct/0] 00:03:31
> via fe-1/3/0.0
10.10.11.1/32 *[Local/0] 00:03:31
Local via fe-1/3/0.0
10.10.12.0/24 *[BGP/170] 00:00:09
> via ge-0/0/3.0
But this makes no difference: Packets can get to LAN2 through CE6 (and from anywhere else in Best ISP’s AS), but the replies, sourced from 10.10.12.x and destined for
LAN1, have no way to get back. Let’s verify this on CE6.
CE6# show route 10.10/16
inet.0: 38 destinations, 38 routes (37 active, 0 holddown, 1 hidden)
+ = Active Route, - = Last Active, * = Both
10.10.12.0/24 *[Direct/0] 00:25:42
> via fe-1/3/0.0
10.10.12.1/32 *[Local/0] 00:25:42
Local via fe-1/3/0.0
How are packets to get back to 10.10.11/24? They can’t. (The former route to
LAN1 is now hidden because the network is no longer reachable.) This simple example shows the incredible power of BGP and routing policies on the Internet.
BGP AND THE INTERNET
BGP is the glue of the Internet. Generally, an ISP cannot link to another ISP unless both
run BGP. Contrary to some claims, customer networks (even large customer networks
with many routers and multiple ASs) do not have to run BGP between their own networks and to their ISP (or ISPs). Smaller customers especially can define a limited number of static routes provided by the ISP, and larger customers might be able to run an IGP
passively (no adjacency formed) on the border router’s ISP interface. It depends on the
complexity of the customer and ISP network. A customer with only one link to a single
ISP generally does not need BGP at all. But if a routing protocol is needed, it will be BGP.
When a customer network links to two ISPs and runs BGP, routing policies are
immediately needed to prevent the large ISPs from seeing the smaller network as a
transit AS to each other. This actually happened a number of times in the early days of
BGP, when small corporate networks new to BGP suddenly found themselves passing
traffic between two huge national ISPs whose links to each other had failed. Why pass
traffic through two or three other ISPs when “Small Company, Inc.” has a BGP path
a single AS long? BGP routing policies are immediately put in place to not advertise
routes learned from one national ISP to the other. As long as “you can’t get there from
here,” all will be fine at the little network in the middle.
BGP summarizes all that is known about the IP address space inside the local AS
and advertises this information to other ASs. The other ASs pass this information along,
until all ASs running BGP know exactly what is where on the Internet. Without BGP,
a single default route must handle all destinations outside the AS. This is okay when a
single router leads to the Internet, but inadequate for networks with numerous connections to other ASs and ISPs.
BGP was not the original EGP used on the Internet. The first exterior gateway protocol was Exterior Gateway Protocol (EGP). EGP is still around, but only on isolated
portions of the original Internet—such as for the U.S. military. An appreciation of EGP’s
limitations helps to understand why BGP works the way it does.
EGP and the Early Internet
In the early 1980s, the Internet had grown to include almost 1000 computers. Several observers
noted that distance-vector routing protocols such as the original Gateway-to-Gateway
Protocol (GGP), an IGP, would not scale to a large network environment. If every router
needed to know everything about every route, convergence times when links failed
would be very high. GGP routing changes had to happen globally and in a coordinated
fashion. But the Internet, even in the 1980s, was a huge network with many different
types of computers and routers run by many different organizations.
The answer divided the emerging Internet into independent but interconnected ASs.
As seen in Chapter 14, the AS is identified by a 4-byte (32-bit) number assigned by the
same authorities that assign IP addresses. We’ll use a shorthand such as 65127 instead
of the full (and proper) 0.65127 to indicate legacy 2-byte AS numbers. The AS range
64512 through 65535 is reserved for private AS numbers. Inside the AS, the network
was assumed to be under the control of a single network administrator. Within the AS,
local network matters (addressing, links, new routers, and so on) could be addressed
locally with GGP. But GGP ran only within the AS. Between ASs, some way had to be
found to communicate what networks were reachable within and through one AS to
the other AS.
EGP was the solution. EGP ran on the border routers (gateways), with links to other
ASs. EGP routers just sent a list of other routers and the classful major networks that
the router could reach. This cut down on the amount of information that needed to
be sent between ASs. Today, aggregation should be used as often as possible with BGP
instead of classful major network routes, but the intent and result are the same. So,
if a BGP router knows about networks 10.10.1.0/24 through 10.10.127.0/24, it can
aggregate them as 10.10.0.0/17 and advertise that one route (NLRI) instead of 127
separate routing updates. Even if a network such as 10.10.11.0/24 is not among the
routes being aggregated, a more specific advertisement of 10.10.11.0/24 (from wherever
that network is actually attached) and the longest-match rule will make sure traffic finds
its way to the right place—as long as the route is advertised properly. Nevertheless, there
are many reasons people do not aggregate as much
as they should, and many of their reasons are flawed. For example, trying to protect a
network against “prefix hijacking” is a bad reason not to aggregate.
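For a picture of what such aggregation looks like in practice, here is a minimal sketch in the abbreviated configuration style used earlier in this chapter, using the /17 from the text; the policy and group names are illustrative assumptions, not from the book’s configurations.

set routing-options aggregate route 10.10.0.0/17;
set policy-statement send-aggregate term1 from protocol aggregate;
set policy-statement send-aggregate term1 then accept;
set protocols bgp group ebgp-to-as65127 export send-aggregate;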
There is no need for an EGP to reproduce the features of an IGP. An IGP needs to
tell every router in the AS which router has which interfaces and what IP addresses are
attached to these interfaces or reachable through that router (such as static routes). All
that other ASs need to know is which IP addresses are reachable in a particular AS and
how to get to a border router on, or nearer to, the target AS.
The Birth of BGP
EGP suffered from a number of limitations, too technical to recount. After some initial attempts to upgrade EGP, it was decided to create a better EGP (as a class of
routing protocol, contrasted with IGPs) than EGP: BGP. BGP was defined in 1989
with RFC 1105 (BGP1 or BGP-1 or BGPv1), revised in 1990 as RFC 1163 (BGP2), and
revised again in 1991 as RFC 1267 (BGP3). The version of BGP used today on the
Internet, BGP4, emerged in 1994 as RFC 1654, which added support for classless interdomain routing (CIDR), and was revised in 1995 as RFC 1771. The baseline BGP specification today is RFC 4271. This
chapter describes BGP4.
BGP has been extended for new roles on the Internet. BGP extended communities
are used with virtual private networks (VPNs). Communities are simply labels that
can be used to associate NLRIs that do not share other traits. For example, one
community value can be assigned to small customers and another community value
used to identify a small customer with multiple sites. There are few limits on how these community “tags” can be used. And BGP routes are often the only ones that can use multiprotocol
label switching (MPLS) label-switched paths (LSPs). BGP is as easily extensible as IS–IS
and OSPF to support new functions and add routing information that needs to be circulated between ASs.
Many organizations find themselves suddenly forced to adopt BGP in a hurry, for
instance, when they have to multihome their networks. Also, when they deploy VPNs
or MPLS or any one of the many newer technologies used to potentially span ISPs and
ASs, BGP is needed. The problem with IGPs is that they cannot easily share information
across routing domain boundaries.
BGP AS A PATH-VECTOR PROTOCOL
One of the problems with EGP was that the metrics looked very much like RIP hop
counts. Simple distance vectors were not helpful at the AS level, because hop counts
did not distinguish the fast links that began appearing in major ISP network backbones.
Destinations that were “close” over two or three 56- or 64-kbps links actually took
much longer to reach than through four or five hops over 45-Mbps links, and distance
vectors had no protection against routing loops.
Link-state protocols could have dealt with the problem by implementing some of
the alternate TOS metrics described for OSPF and IS–IS. However, these would rely not
only on consistent implementation among all ISPs but also on the proper setting of bits in IP
packets. In the world of independent highly competitive ISPs, this consistency was
next to impossible. So, BGP was developed as a path-vector protocol. This means that
one of the most important attributes BGP uses to choose the active route is the length
of the AS path reported in the NLRI.
To create this AS list, BGP routing updates carry a complete list of transit networks
(ASs) that must be traversed between the AS receiving the update and the AS that can
deliver the packet using its IGP. A loop occurs when an AS path list contains the same
AS that is receiving the update, so this update is rejected and loops are prevented. If
the update is accepted, that AS will add its own AS to the list when advertising the
routing update to other ASs. This lets an AS apply routing policies to the updates and
avoid using routes that lead through an AS that is not the preferred way to reach a
destination.
Path vectors do not mean that all ASs are created equal. Numerous small ASs might
get traffic through faster than one huge AS. But more aspects of a route are described in
BGP than just the length of the AS path to the destination. The system allows each AS
to represent the route with a different metric that means something to the AS originating the route.
But more ASs generate more and longer path information. RFC 1774 in 1995
estimated that 100,000 routes generated by 3000 ASs would have paths about 20 ASs
long. There was a concern about router memory and processor requirements to store
and maintain all of this information, especially in smaller routers.
Several mechanisms are built into BGP to address this. ISPs would not usually accept
a BGP route advertisement with a mask more than 19 bits long (/19). This was called
the universally reachable address level. The price for compact routing tables and
maintenance was a loss of routing accuracy, and many ISPs relaxed this policy. Most
today accept /24 prefixes (although they can accept more specific addresses from their
own customers, of course). The other BGP mechanisms to cut down on routing table
size and maintenance complexity are route reflectors, confederations (also called subconfederations), and route damping (or dampening). All of these are beyond the scope
of this chapter, but should be mentioned.
IBGP AND EBGP
BGP is an EGP that runs between individual routing domains, or ASs. When BGP speakers (the term for routers configured to peer with BGP neighbors) are in different ASs,
the routers use an exterior BGP (EBGP) session to exchange information. When BGP
peers are within the same AS, the routers use interior BGP (IBGP). These terms often
appear as E-BGP/I-BGP or eBGP/iBGP.
IBGP is not some IGP version of BGP. It is used to allow BGP routers to exchange
BGP routing information inside the same AS. IBGP sessions are usually only required
when an AS is multihomed or has multiple links to other ASs. (However, we used them
on the Illustrated Network anyway, and that’s fine too.) An AS with only a single link to
one other AS need only run EBGP on the border router and rely on the IGP to distribute routes learned by EBGP to the other routers. In the case where there is only one
exit point for the entire AS, a single static default route to the border router can be used
effectively instead. The reason that IBGP is needed is shown in Figure 15.2.
Without IBGP, all routes learned by EBGP must be dumped into the IGP to make
sure all routes are known in the entire AS. This can easily overwhelm the IGP. For this
reason, it is usual to create an IBGP mesh between routers on the backbone (other routers can make do with a handful of default routes).
EBGP sessions typically peer to the physical interface address of the neighbor router.
These are often point-to-point WAN links, and are the only way to reach another AS. If
the link is down, the other AS is unreachable over that link. So, there is little point in
trying to keep a BGP session going to the peer.
On the other hand, IBGP sessions usually peer to the stable “loopback” interface
address of the peer router. An IBGP peer can typically be reached over more than one
physical interface within the AS, so even if an IBGP peer’s “closest” interface is down
the BGP sessions can stay up because BGP packets use the IGP routing table to find an
alternate route to the peer.
FIGURE 15.2
The need for IBGP. Note that if only EBGP is running, the AS in the middle must dump all BGP routes into the IGP to advertise them throughout the network. (The diagram shows a router in AS 65459 advertising “I can reach 10.10.11.0/24” and a router in AS 65127 advertising “I can reach 10.10.12/24” over EBGP to AS 64513 in the middle; Routers A and B inside AS 64513 only learn how to reach those networks if IBGP carries the information across the AS.)
Two BGP neighbors, EBGP or IBGP, first exchange their entire BGP routing tables—
subject to the policies on each router. After that, only incremental or partial table
information is exchanged when routing changes occur. BGP keepalives are exchanged
because in stable networks long periods of time might elapse before something interesting happens.
IGP Next Hops and BGP Next Hops
BGP uses NLRIs as the way one AS tells another, “I know how to reach IP address space
192.168.27.0/24 and 172.16.44.0/24 and…” The AS does not say that it is the AS that
has assigned that IP address space locally. Many of the addresses might be from other
ASs beyond the AS advertising the routes. The AS path allows an AS to figure out how
far away a destination is through the AS that has advertised the route, or NLRI.
With an IGP, the next hop associated with a route is usually the IP address of the
physical interface on the next hop router. But the BGP next hop (also sometimes called
the “protocol next hop”) is often the IP address of the router that is advertising the
BGP NLRI information. The BGP next hop is the address of the BGP peer, most often
the loopback interface address (the BGP Identifier) for IBGP and the physical interface
address in the other AS for EBGP. The BGP next hop is the way one BGP router tells
another, “If you have a packet for this IP address space, send it here.”
The IGP has to know how to reach the next hop, whether it’s a BGP next hop or
not. But the next hop for EBGP is often at the end of a link to the other AS and is not
running an IGP (it’s not an internal link). So, how is the IGP to know about it? Well, BGP
routes could be “dumped” into the IGP—but there are a lot more external routes than
internal, and the whole point is to keep the IGP and EGP separate to some extent. This
brings up an interesting point about the relationship of BGP and the IGP and a practice
known as next hop self.
BGP and the IGP
There is a well-known unreachable condition in BGP that must be solved with a
simple routing policy known as next hop self, or just NHS. An EBGP route (NLRI) normally arrives from another AS with the physical address of the remote interface as
the BGP next hop. If the EBGP route is readvertised through IBGP, it is likely that
the BGP next hop will be completely unknown to the IGP routing tables inside the
receiving AS. A router within an AS does not care how to reach a physical interface
IP address in another AS. Next hop self is just a way to have the router advertising the
route through IBGP use itself as the next hop for the EBGP route. The idea is not BGP
“next-hop-is-the-physical-interface-in-another-AS” but BGP “next-hop-is-me-in-this-AS”
or BGP “next-hop-self.”
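In configuration terms, next hop self usually amounts to a small export policy applied to the IBGP group. A sketch in the style of the earlier examples (the policy name is an assumption):

set policy-statement set-nhs term1 then next-hop self;
set protocols bgp group ibgp-mesh export set-nhs;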
BGP is not a routing protocol built directly on top of IP. BGP relies on TCP connections to reach its peers, and so resembles an IP application more than an IGP routing
protocol. Without the IGP to provide connectivity, TCP sessions for the BGP messages
cannot be established except on links to adjacent routers. BGP does not flood information with IBGP. So, what an IBGP router learns from its IBGP peers is never passed
along to another IBGP neighbor.
To fully distribute BGP information among the routers within an AS, a full mesh of
IBGP connections (adjacencies) is necessary. Every IBGP router must send complete
routing information to every other IBGP router in the AS. In a large AS with many external links to other ASs, this meshing requirement can add a lot of overhead traffic and
configuration maintenance to the network. This is where route reflectors and confederations come in (these concepts are far beyond the scope of this chapter and will not
be discussed further).
The main reasons BGP was built this way were to keep BGP as simple as possible
and to prevent routing loops inside the AS. The dependency on TCP and the lack of
flooding means that IBGP must communicate directly with every other router that
needs to know BGP routing information. This does not mean that every router must be
adjacent (connected by a direct link), because TCP can be routed through many routers
to reach its destination. What it does mean is that routers connected by IBGP inside an
AS must create a full mesh of IBGP peering sessions. This need to create a full mesh
and synchronize BGP with the IGP is shown in Figure 15.3.
In the figure, Ace ISP and Best ISP are no longer peers. Now they are both customers of National ISP. Naturally, everyone on LAN2 still has to know how to reach LAN1
at 10.10.11.0/24 (and vice versa, of course). EBGP advertises LAN1 to National ISP, and
IBGP from border router to border router makes sure that LAN2 on Best ISP can reach
10.10.11.0/24. But what about an internal router inside National ISP’s AS? There are
only two ways to allow everyone in National ISP’s service area to access LAN1 (presumably to buy something, although there are cases concerning LAN1 security where
the route might not be advertised everywhere). With a full mesh of IBGP sessions in
National ISP, there is no need to dump all external routes into the IGP (the IGP should
only handle routes within the AS).
FIGURE 15.3
The need for a full IBGP mesh. Note that the routers inside National ISP do not necessarily know how to reach 10.10.11.0/24 (LAN1). (The diagram shows Ace ISP advertising 10.10.11.0/24 to one of National ISP’s border routers over EBGP, IBGP running between National ISP’s two border routers, EBGP on to Best ISP, and internal routers inside National ISP asking, “How do I get to 10.10.11.0/24?”)
OTHER TYPES OF BGP
The major types of BGP are EBGP for external peers outside the AS and IBGP for internal peers within the same AS. These are usually the only types of BGP mentioned in
most sources. But there are other variations of BGP used in other situations.
One BGP variation that is becoming very important, especially where VPNs are concerned, is Multiprotocol BGP (often seen as MBGP or MP-BGP). Multiprotocol BGP
originally extended BGP to support IP multicast routes and routing information. But
MBGP is also used to support IP-based VPN information and to carry IPv6 routing information, such as from RIPng and OSPF for IPv6. MBGP work on IPv6 is just starting, so
no special consideration of using BGP for IPv6 appears in this chapter other than to
note that MBGP is used for this purpose. MBGP is currently defined in RFC 4760.
There is also Multihop BGP, sometimes seen as EBGP multihop. Multihop BGP is only
used with EBGP and allows an EBGP peer in another AS to be more than one hop away.
Usually, EBGP peers are directly connected by a point-to-point WAN link. But sometimes
it is necessary to peer with a router beyond the border router that actually terminates
the link. Normally, BGP packets have a TTL of 1 and thus never travel beyond the adjacent
router. Multihop BGP packets have a TTL greater than 1 and the peer is beyond the adjacent router. Multihop BGP is also used in load balancing situations when there is more
than one link between two border routers, and for “route-view”–style route collectors.
Finally, there is a slight change in behavior of the BGP that runs between confederations. In most cases, the version of BGP that runs between confederations is just
called EBGP. However, there are slight differences in the EBGP that runs between
ASs and the EBGP that runs between confederations—which are always inside the
same AS. Sometimes the variant of BGP that runs between confederations is known as
Confederation BGP, or CBGP, although use of this term is not common.
BGP ATTRIBUTES
The information that all forms of BGP carry is associated with a route (NLRI) as a series
of attributes. This is the major difference between BGP and IGPs. IGP routes carry the
route, next hop, metric, and maybe an optional tag (or two). BGP routes can carry a
considerable amount of information, all intended to allow an AS to choose the “best”
way to reach a destination.
Most implementations of BGP will understand 10 attributes, and some use and understand even more. Every BGP attribute is characterized by two major parameters. An attribute is either well known or optional. Well-known attributes must be understood and
processed by every implementation of BGP regardless of vendor. Optional attributes are
exactly that: there is no guarantee that a given BGP implementation will understand or
process that particular attribute. BGP implementations that do not support an optional
attribute simply pass that information on if that is what is called for, or ignore it.
In addition, a well-known BGP attribute is either mandatory or discretionary. Mandatory BGP attributes must be present in every BGP update message, whether the update is
exchanged by EBGP, IBGP, or some other variation of BGP. Discretionary BGP attributes appear only in some types of BGP update
messages, such as those exchanged by IBGP only.
Finally, optional BGP attributes are transitive or nontransitive. Transitive BGP optional
attributes are passed from peer to peer even if the router does not support that option.
Nontransitive BGP optional attributes can be ignored by the receiver BGP process if not
supported and not sent along to peers. The ten BGP attributes discussed in this chapter
are listed in Table 15.1 and their characteristics are described in the list that follows.
Table 15.1 BGP Attributes

  Attribute (Type Code)     Category
  ORIGIN (1)                Well-known mandatory
  AS_PATH (2)               Well-known mandatory
  NEXT_HOP (3)              Well-known mandatory
  LOCAL_PREF (4)            Well-known discretionary
  ATOMIC_AGGR (5)           Well-known discretionary
  AGGREGATOR (6)            Optional transitive
  COMMUNITY (7)             Optional transitive
  MED (8)                   Optional nontransitive
  ORIGINATOR_ID (9)         Optional nontransitive
  CLUSTER_LIST (10)         Optional nontransitive
ORIGIN—This attribute reflects where BGP obtained knowledge of the route in
the first place. This can be the IGP, EGP, or “incomplete.”
AS_PATH—This forms a sequence of AS numbers that leads to the originating AS
for the NLRI. The main use of the AS Path is for loop avoidance among ASs, but
it is common to artificially extend the AS Path attribute through a routing policy
so that a particular path through a certain router looks very unattractive. The
AS Path attribute can consist of an ordered list of AS numbers (AS_SEQUENCE)
or just a collection of AS numbers in no particular order (AS_SET).
NEXT_HOP—The BGP Next Hop (or “protocol next hop”) is quite distinct from
an IGP’s next hop. Outside an AS, the BGP Next Hop is most likely the border
router—not the actual router inside the other AS that has this network on a
local interface. Next Hop Self is the typical way to make sure that the BGP Next
Hop is reachable.
LOCAL_PREF—The Local Preference of the NLRI is relative to other routes learned
by IBGP within an AS and therefore is not used by EBGP. When routes are
advertised with IBGP, traffic will flow toward the AS exit point (border router)
that advertised the highest Local Preference for the route. It is used to establish a preferred exit link to another AS.
MULTI_EXIT_DISC (MED)—The Multi-Exit Discriminator (MED) attribute is the
way one AS tries to influence another when it goes to choosing among multiple exit points (border routers) that link to the AS. A MED is the closest thing
to a purely IGP metric that BGP has. Changing MEDs is one of the most common ways one ISP tries to make another ISP use the links it wants between
the ISPs, such as higher speed links (“use this address on this link to reach me,
unless it’s down, then use this one…”). MED values are totally arbitrary.
ATOMIC_AGGREGATE and AGGREGATOR—These two attributes work together.
Both are used when routing information is aggregated for BGP. A common
goal on the Internet today is to represent as many networks (routes) with
as few routing table entries as possible. So, as routing information makes
its way through the Internet each AS will often try to condense (aggregate)
the routing information as much as possible with as short a VLSM as can be
properly contrived.
COMMUNITY—The BGP Community attribute is sort of a “club for routes.”
Communities make it easier to apply policies to routes as a group. There might
be a community that applies to an ISP’s customers. In that case, it is not necessary to list every customer’s IP address in a policy to set Local Pref or MED
(for example) as long as they all are assigned to a unique “customer” community
value. Community values are often used today as a way for one ISP to inform a
peer ISP of the value of the Local Pref for the route inside the originating ISP’s
AS (Local Pref is not present in EBGP). The Community attribute was originally
Cisco specific, but was standardized in RFC 1997. Communities just make it easier
for a router to find all NLRIs associated with (for example) a particular VPN.
ORIGINATOR_ID and CLUSTER_LIST—These attributes are used by BGP route
reflectors. Both of these attributes are used to prevent routing loops when
route reflectors are in use. The Originator ID is a 32-bit value created by the
route reflector and is the originator of the route within the local AS. If the
originator router sees that its own ID is a received route, a loop has occurred
and the route is ignored. The Cluster List is a list of the route reflection cluster
IDs of the clusters through which the route has passed. If a route reflector
sees it own cluster ID in the Cluster List, a loop has occurred and the route is
ignored.
BGP AND ROUTING POLICY
BGP is a policy-driven protocol. What BGP does and how BGP does it can be almost
totally determined by routing policy. It is difficult to make BGP do exactly what an ISP
wants without the use of routing policies.
Want BGP to advertise customers on static routes or running OSPF, IS–IS, or RIP?
Redistribute statics, OSPF, IS–IS, and RIP into BGP. Want to artificially extend an AS path
to make an AS look very unattractive for transit traffic? Write a routing policy to prepend the AS multiple times. Want to change the community attribute to add or subtract
information? Use a routing policy. Concerned about the sheer number of routes advertised? Write a routing policy to aggregate the routes any way that makes sense. Want to
advertise a more specific route along with a more general aggregate (called “punching
a hole” in the advertised address space)? Write a routing policy. BGP depends on routing policy to behave the way it should.
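As one concrete example of the prepending just mentioned, lengthening the AS path is a one-line policy action on many routers. A sketch in the abbreviated style used earlier, assuming AS 65459 wants to look unattractive for transit (the policy name and the number of repetitions are arbitrary):

set policy-statement prepend-self term1 then as-path-prepend "65459 65459 65459";
set protocols bgp group ebgp-to-as65127 export prepend-self;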
BGP Scaling
A global corporation today might have 3000 routers large and small spread around the
world. Even with multiple ASs, there could be 1000 routers within an AS that might
all need IBGP information—no matter how the routes have been aggregated. To
fully mesh 1000 IBGP routers within an AS requires 499,500 IBGP sessions. A network 100 times larger than a 10-router network requires more than 10,000 times
more IBGP sessions. Adding one router adds 1000 additional IBGP sessions to the
network.
This problem with the quadratic growth of IBGP sessions (n fully meshed routers need
n(n - 1)/2 sessions) is the main BGP scaling issue. There are two ways to deal with this issue: the use of route reflectors (RR) and
confederations.
What is the difference between RRs and confederations? At the risk of offending
BGP purists, it can be loosely stated that RRs are a way of grouping BGP routers inside
an AS and running IBGP between the RR clusters. Confederations are a way of grouping BGP routers inside an AS and running EBGP between the confederation “sub-ASs.”
Because of the differences between RRs and confederations, it is even possible to have
both configured at the same time in the same AS. There is also BGP route damping,
which is not a way of dealing with BGP scaling directly but rather a way to deal with
the effects of BGP scaling in terms of the amount of routing information that needs to
be distributed to IBGP and EBGP peers when a router or link fails.
BGP MESSAGE TYPES
BGP message types are simpler than those used by OSPF and IS–IS because of the
presence of TCP. TCP handles all of the details of connection setup and maintenance,
and before a BGP peering session is established the router performs the usual TCP
three-way handshake using TCP port 179 on one router. The other router uses a port
that is not well known, and it is just a matter of whose TCP SYN message arrives first
that determines which BGP peer is technically the “server.” All BGP messages are then
unicast over the TCP connection. There are only four BGP message types.
Open—Used to exchange version numbers (usually four, but two routers can agree
on an earlier version), AS numbers (same for IBGP, different for EBGP), hold
time until a Keepalive or Update is received (the smaller value is used if they
differ), the BGP identifier (Router ID, usually the loopback interface address),
and options such as authentication method (if used).
Keepalive—Keepalive messages are used to maintain the TCP session when there
are no Updates to send. The default time is one-third of the hold time established in the Open message exchange.
Update—This advertises or withdraws routes. The Update has fields for the NLRI
(both prefix and VLSM length), path attributes, and withdrawn routes by prefix
and length.
Notification—These are for errors and always close a BGP connection. For example, a BGP version mismatch in the Open message closes the connection,
which must then be reopened when one router or the other adjusts its version
support.
The maximum size of a BGP message is 4096 bytes and the minimum (a bare header)
is 19 bytes. All BGP messages have a common header, as shown in Figure 15.4.
The Marker is a 16-byte field used for synchronizing BGP connections and in
authentication. If no authentication is used and the message is an Open, this field is
set to all 1s. The Length is a 16-bit field that contains the length of the message, including the header, in bytes. Finally, the Type is an 8-bit field set to 1 (Open), 2 (Update), 3
(Notification), or 4 (Keepalive).
FIGURE 15.4
The BGP message header carried inside a TCP segment. (The common header is 19 bytes: a 16-byte Marker, a 2-byte Length, and a 1-byte Type.)
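Because the common header is a fixed 19 bytes, it is easy to pull apart by hand. The following C sketch (illustrative only, not taken from any BGP implementation) extracts the three header fields from a received buffer:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The three fields of the 19-byte BGP common header (Figure 15.4). */
struct bgp_header {
    uint8_t  marker[16];
    uint16_t length;   /* total message length in bytes, including the header */
    uint8_t  type;     /* 1 = Open, 2 = Update, 3 = Notification, 4 = Keepalive */
};

static int parse_bgp_header(const uint8_t *buf, size_t len, struct bgp_header *h)
{
    if (len < 19)                      /* the minimum BGP message is the bare header */
        return -1;
    memcpy(h->marker, buf, 16);
    h->length = (uint16_t)((buf[16] << 8) | buf[17]);   /* network byte order */
    h->type   = buf[18];
    if (h->length < 19 || h->length > 4096)             /* BGP message size limits */
        return -1;
    return 0;
}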
BGP MESSAGE FORMATS
A data portion follows the header in all but the Keepalive messages. Keepalives consist
of only the BGP message headers and so need not be discussed further in this section.
The Open Message
Once a TCP connection has been established between two BGP speakers, Open messages are exchanged between the BGP peers. If the Open is acceptable to a router,
a Keepalive is sent to confirm the Open. Once Keepalives are exchanged, peers can
exchange Updates, Keepalives, and Notification messages. The format of the Open message is shown in Figure 15.5.
The Open message has an 8-bit Version field, a 2-byte My Autonomous System field,
a 2-byte Hold Time value (0 or at least 3 seconds), a 32-bit BGP Identifier (router ID),
an 8-bit Optional Parameters Length field (set to 0 if no options are present), and the
optional parameters themselves in the same TLV format used by IS–IS in the previous
chapter. BGP options are not discussed in this chapter.
FIGURE 15.5
The BGP Open message showing optional fields at the end. (Fields: 1-byte Version, 2-byte My Autonomous System, 2-byte Hold Time, 4-byte BGP Identifier, 1-byte Optional Parameters Length, and the Optional Parameters themselves.)

The Update Message
The Update message is used to advertise NLRIs (routes) to a BGP peer, to withdraw
multiple routes that are now unreachable (or unfeasible), or both. The format of the
Update message is shown in Figure 15.6. Because of the peculiar “skew” the 19-byte
BGP header puts on subsequent fields, this message is shown in a different format than
the others. There are two distinct sections to the Update message. They are used to
withdraw and advertise routes.
FIGURE 15.6
The BGP Update message. This is the main way routes are advertised with BGP. (Fields: 2-byte Unfeasible Routes Length, variable-length Withdrawn Routes, 2-byte Total Path Attribute Length, variable-length Path Attributes, and variable-length Network Layer Reachability Information.)
The Update message starts with a 2-byte field indicating the total length of the
Withdrawn Routes field in bytes. If there are no Withdrawn Routes, this field is set
to zero. If there are Withdrawn Routes, they follow in a variable-length field.
Each withdrawn route is a Length/Prefix pair: the length indicates the number of bits
that are significant in the prefix that follows.
The next field is a 2-byte Total Path Attribute Length field. This is the length in bytes
of the Path Attributes field that follows. A value of zero means that nothing follows.
The variable-length Path Attributes field lists the attributes associated with the
NLRIs that follow. Each Path attribute is a TLV of varying length, the first part of which
is the 2-byte Attribute Type. There is a structure to the Attribute Type field, as shown
in Figure 15.7. There are four flag bits, four unused bits, and then an 8-bit Attribute
Type code.

FIGURE 15.7
The BGP Attribute Type format. This is how NLRIs are grouped. (The first byte holds the flag bits O, T, P, and E plus four unused bits; the second byte is the Attribute Type Code. Flag bits: O, Optional: 1 = optional, 0 = well known. T, Transitive: 1 = transitive, 0 = nontransitive. P, Partial: 1 = optional transitive attribute is partial, 0 = complete. E, Extended Length: 1 = the attribute length field is 2 bytes, 0 = 1 byte.)
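As a companion to Figure 15.7, here is a short C sketch (illustrative only) of how a receiver might interpret the flags byte and type code at the start of a path attribute:

#include <stdint.h>
#include <stdio.h>

/* Decode the first two bytes of a BGP path attribute: the flags byte and
   the Attribute Type Code, following Figure 15.7. */
static void describe_attribute(uint8_t flags, uint8_t type_code)
{
    int optional   = (flags & 0x80) != 0;  /* O bit: 1 = optional, 0 = well known  */
    int transitive = (flags & 0x40) != 0;  /* T bit: 1 = transitive                */
    int partial    = (flags & 0x20) != 0;  /* P bit: 1 = partial, 0 = complete     */
    int ext_length = (flags & 0x10) != 0;  /* E bit: 1 = 2-byte length field       */

    printf("type code %d: %s, %s, %s, %d-byte length field\n",
           type_code,
           optional   ? "optional"   : "well known",
           transitive ? "transitive" : "nontransitive",
           partial    ? "partial"    : "complete",
           ext_length ? 2 : 1);
}

int main(void)
{
    /* Example: ORIGIN (type code 1) is well known and transitive, flags byte 0x40. */
    describe_attribute(0x40, 1);
    return 0;
}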
There are other attribute codes in use with BGP, but these are not discussed in this
chapter. One of the most important of these other attributes is the Extended Community attribute used in VPNs.
The Update message ends with a variable-length NLRI field. Each NLRI (route)
is a Length/Prefix pair. The length indicates the number of bits that are significant
in the following prefix. There is no length field for this list that ends the Update
message. The number of NLRIs present is derived from the known length of all of the
other fields.
So, instead of saying “here’s a route and these are its attributes…” for every NLRI
advertised the Update message basically says “here’s a group of path attributes and here
are the routes that these apply to…” This cuts down on the number of messages that
needs to be sent across the network. In this way, each Update message forms a unit of
its own and has no further fragmentation concerns.
The Notification Message
Error messages in BGP have an 8-bit Error Code, an 8-bit Subcode, and a variable-length
Data field determined by the Error Code and Subcode. The format of the BGP Notification message is shown in Figure 15.8.
FIGURE 15.8
The BGP Notification message format. BGP benefits from using TCP as a transport protocol. (Fields: 1-byte Error Code, 1-byte Error Subcode, and variable-length Data. Error codes: 1 = message header error, 2 = Open message error, 3 = Update message error, 4 = hold timer expired, 5 = Finite State Machine error, 6 = Cease.)
A full discussion of BGP Notification codes and subcodes is beyond the scope of
this chapter. The major Error Codes are Message Header Error (1), Open Message Error
(2), Update Message Error (3), Hold Timer Expired (4), Finite State Machine Error (5),
used when the BGP implementation gets hopelessly confused about what it should be
doing next, and Cease (6), used to end the session.
QUESTIONS FOR READERS
Figure 15.9 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 15.9
How Next Hop Self allows internal routers to forward packets for BGP routes. Border router 192.168.14.1 substitutes its own address for the “real” next hop. (An EBGP peer at 10.0.75.1 in AS 65127 advertises “I can reach 10.10.12/24. Use 10.0.75.1 as a next hop.” Without NHS, an internal router objects, “I don’t know 10.0.75.1! It’s not in this AS!” With NHS, it simply notes, “Oh! I know how to reach 192.168.14.1.”)
1. BGP distributes “reachability” information and not routes. Why doesn’t BGP
distribute route information?
2. What does it mean to say that the BGP is a “path-vector” protocol?
3. What is “next hop self” and why is it important in BGP?
4. Which two major BGP router configurations are employed to deal with BGP
scaling?
5. What are the ten major BGP attributes?
CHAPTER 16
Multicast
What You Will Learn
In this chapter, you will learn how multicast routing protocols allow multicast
traffic to make its way from a source to interested receivers through a router-based
network. We’ll look at both dense and sparse multicast routing protocols, as well as
some of the other protocols used with them (such as IGMP).
You will learn how the PIM rendezvous point (RP) has become the key
component in a multicast network. We’ll see how to configure an RP on the
network and use it to deliver a simple multicast traffic stream to hosts.
If the Internet and TCP/IP are going to be used for everything from the usual data
activities to voice and video, something must be done about the normal unicast packet
addressing reflecting one specific source and one specific destination. Almost everything described in this book so far has featured unicast, although multicast addresses
have been mentioned from time to time—especially when used by routing protocols.
The one-to-many operation of multicast is a technique between the one-to-one
packet delivery operation of unicast and the one-to-all operation of broadcast. Broadcasts tend to disrupt hosts’ normal processing because most broadcasts are not really
intended for every host, yet each receiving host must pay attention to the broadcast
packet’s content. Many protocols that routinely used broadcasts, such as RIPv1, were
replaced by versions that used multicast groups instead (RIPv2, OSPF). Even the protocols in IPv4 that still routinely use broadcast, such as ARPing to find the MAC address
that goes with an IP address, have been replaced in IPv6 with multicast-friendly versions
of the same procedure.
Multicast protocols are still not universally supported on much of the Internet. Then
how do large numbers of people all watch the same video feed from a Web server
(for example) at the same time? Today, this is normally accomplished with numerous
unicast links, each running from the server to every individual host. This works, but
it does not scale. Can a server handle 100, 1000, or 1,000,000 simultaneous users?
Many-to-many multicast applications, such as on-line gaming and gambling sites, use
multiple point-to-point meshes of links in most cases. Even if modern server clusters
could do this, could all the routers and links handle this traffic? Multicast uses the routers to replicate packets, not the servers.

FIGURE 16.1
Portion of the Illustrated Network used for the multicast examples. The RP will be router PE5, and the ISPs have merged into a single AS for this chapter. (Two-page network diagram: LAN1 in the Los Angeles office, LAN2 in the New York office, and the merged provider backbone shown as a single AS, with router PE5 marked as the rendezvous point (RP). Solid rules = SONET/SDH; dashed rules = Gigabit Ethernet. All links use 10.0.x.y addressing; only the last two octets are shown.)
However, interdomain (or even intersubnet) multicasting is a problem. IP multicast
is widely leveraged on localized subnets where it’s solely a question of host support.
Many-to-many applications have some fundamental scaling challenges and multicast
does not address these very well. For example, how does each host in a shared tree of
multicast traffic manage the receipt of perhaps 50 video streams from participants?
Today, multicast is a key component of local IPv6 and IPv4 resource discovery
mechanisms and is not confined to enterprise applications. However, multicast applications are used mainly on enterprise networks not intended for the general public.
In the future, multicast must move beyond a world where special routers (not all routers can handle multicast packets) use special parts of the Internet (most famously, the
MBONE, or multicast backbone) to link interested hosts to their sources. Multicast must
become an integral part of every piece of hardware and software on the Internet.
Let’s look at a few simple multicast packets and frames on the Illustrated Network.
We don’t have any video cameras or music servers on the network to pump out content, but we do have the ability to use simple socket programs to generate a stream of
packets to multicast group addresses as easily as to unicast destinations. We could look
at multicast as used by OSPF or IPv6 router announcements, but we’ll look at simple
applications instead.
We’ll look at IPv4 first, and then take a quick look at IPv6 multicasting. We’ll use the
devices shown in Figure 16.1 to illustrate multicast protocols, introducing the terms
used in multicast protocols as we go. We’ll explore all of the terms in detail later in the
chapter.
This chapter uses wincli2 and lnxclient on LAN2 and wincli1 on LAN1. The router
PE5 will serve as our PIM sparse-mode RP. To simplify the number of multicast protocols
used, we’ve merged the two ISPs into Best-Ace ISP for this chapter. This means we will
not need to configure the Multicast Source Discovery Protocol (MSDP), which allows
receivers in an AS to find RPs in another AS. A full investigation of MSDP is beyond the
scope of this chapter, but we will go over the basics.
A FIRST LOOK AT IPV4 MULTICAST
This section uses two small socket programs from the source cited in Chapter 12: the
excellent TCP/IP Sockets in C by Michael J. Donahoo and Kenneth L. Calvert. We’ll use
two programs run as MulticastReceiver and MulticastSender, and two free Windows
multicast utilities, wsend and wlisten.
Let’s start with two hosts on the same LAN. We’ll use lnxclient (10.10.12.166)
and wincli2 (10.10.12.222) for this exercise (both clients, but there’s no heavy multicasting going on). We’ll set the Linux client to multicast the text string HEY once
every 3 seconds onto the LAN using multicast group address 239.2.2.2 (multicasts
use special IP addresses for destinations) and UDP port 22222 (multicast applications
often use UDP, and cannot use TCP). Naturally, we’ll set the multicast receiver socket
program on the Windows XP client to receive traffic sent to that group.
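The Donahoo and Calvert programs themselves are not reproduced here, but a multicast sender needs very little code. A rough C sketch of the idea (the TTL value and the lack of error handling are simplifications of my own, not taken from their programs):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Send the string "HEY" to group 239.2.2.2, UDP port 22222, every 3 seconds. */
int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0) { perror("socket"); return 1; }

    struct sockaddr_in group;
    memset(&group, 0, sizeof(group));
    group.sin_family = AF_INET;
    group.sin_port   = htons(22222);
    inet_pton(AF_INET, "239.2.2.2", &group.sin_addr);

    unsigned char ttl = 16;   /* keep the stream local; the value is arbitrary */
    setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

    for (;;) {
        sendto(s, "HEY", 3, 0, (struct sockaddr *)&group, sizeof(group));
        sleep(3);
    }
}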
It should be noted that the multicast group addresses used here are administratively scoped addresses that should only reach a limited number of hosts and not be
used on the global public Internet, much like private IP addresses. However, we won’t
discuss how the traffic to these groups is limited. This is mainly because there are some
operational disagreements about how to apply administratively scoped boundaries. We
are using scoped addresses primarily as an analogy for private IP addresses. We could
also have used GLOP addresses (discussed in this chapter) or addresses from the
dynamic multicast address block.
The receiver socket program does not generate any special messages to say, “Send
me content addressed to group 239.2.2.2.” We know it’s going to be there. Later,
we’ll see that a protocol called the Internet Group Management Protocol (IGMP) sends
join or leave messages, and that a host knows what content is carried at this time by group
239.2.2.2 because of the Session Announcement Protocol and Session Description
Protocol (SAP/SDP) messages it receives. In reality, multicast is a suite of protocols—
and much more is required to create a complete multicast application. However,
this little send-and-receive exercise will still reveal a lot about multicast. Figure 16.2
shows a portion of the Ethereal capture of the packet stream, detailing the UDP content inside the IP packet.
FIGURE 16.2
Multicast packet capture, showing the MAC address format used and the port in the UDP
datagram. Some IGMPv3 messages appear also.
The Ethernet frame destination address is in a special form, starting with 01 and
ending in 02:02:02—which corresponds to the 239.2.2.2 multicast group address.
We’ll explore the rules for determining this frame address in material following. Note
that the packet is addressed to the entire group, not an individual host (as in unicast).
How does the network know where to send replicated packets? Two strategies (discussed later in the chapter) are to send content everywhere and then stop if no one
says they are listening (flood-and-prune, or dense mode), or to send content only to
hosts that have indicated a desire to receive the content (sparse mode).
The figure also shows that the Windows XP receiver (10.10.12.222) is generating
IGMPv3 membership reports sent to multicast group address 224.0.0.22 (the IGMP
multicast group). XP does this to keep the multicast content coming, even though
the socket sender program has no idea what it means. These messages from XP to the
IGMP group sometimes cause consternation with Windows network administrators,
who are not always familiar with multicast and wonder where the 224.0.0.22 “server”
could be.
Now let’s set our multicast group send program to span the router network from
LAN1 to LAN2. We’ll start the socket utility sending on wincli1 (10.10.11.51), using
multicast group 239.1.1.1 and UDP port 11111. The listener will still be wincli2
(10.10.12.222).
This is easy enough, and Ethereal on wincli1 shows a steady stream of multicast
traffic being dumped onto LAN1. However, the Ethereal capture on wincli2 (which had
no problem receiving a multicast stream only moments ago) now receives absolutely
nothing. What’s wrong?
The problem is that the routers between LAN1 and LAN2 are not running a multicast
routing protocol. The router on LAN1 at 10.10.11.1 adjacent to the source receives
every multicast packet sent by wincli1. But the destination address of 239.1.1.1 is
meaningless when considered as a unicast address. No entry exists in the unicast routing table, and there is yet no multicast “routing table” (more properly, table for multicast
interface state) on the router network.
Before we configure multicast for use on our router network and allow multicast traffic to travel from LAN1 to LAN2, there are many new terms and protocols to explain—a
few of which we’ve already mentioned (IGMP, SAP/SDP, how a multicast group maps to
a frame destination address, and so on.) Let’s start with the basics.
MULTICAST TERMINOLOGY
Multicast in TCP/IP has developed a reputation of being more difficult to understand
than unicast. Part of the problem is the special terminology used with multicast, and
the implication that if something is not universally supported, it must be complicated
and difficult to understand. But there is nothing in multicast that is more complex than
subnet masking, multicast sockets are nearly the same as unicast sockets (except that
they don't use TCP), and many things that routing protocols do with multicast
packets are now employed in unicast as well (the reverse-path forwarding, or RPF
check). Figure 16.3 shows a general view of some of the terms commonly used in an IP multicast network.

FIGURE 16.3
Examples of multicast terminology showing how multicast trees are “rooted” at the source. JOINs are also sent using IGMP from receivers to local routers. The figure shows multicast sources for Group A and Group B, the distribution trees built by the multicast routers, the upstream and downstream directions, PRUNE and JOIN messages, and the leaf subnets with their interested and uninterested hosts.
The key component of the multicast network is the multicast-capable router, which
replicates the packets. The routers in the IP multicast network, which has exactly the
same topology as the unicast network it is based on, use a multicast routing protocol
to build a distribution tree to connect receivers (this term is preferred to the multimedia implications of listeners, but the listener term is also used) to sources. The
distribution tree is rooted at the source. The interface on the router leading toward
the source is the upstream interface, although the less precise terms incoming or
inbound interface are also used. There should be only one upstream interface on the
router receiving multicast packets. The interface on the router leading toward the
receivers is the downstream interface, although the less precise terms outgoing or
outbound interface are used as well. There can be 0 to N – 1 downstream interfaces on
a router, where N is the number of logical interfaces on the router. To prevent looping,
the upstream interface should never receive copies of downstream multicast packets.
Routing loops are disastrous in multicast networks because of the repeated replication of packets. Modern multicast routing protocols need to avoid routing loops, packet
by packet, much more rigorously than in unicast routing protocols.
Each subnetwork with hosts on the router that has at least one interested receiver
is a leaf on the distribution tree. Routers can have multiple leafs or leaves (both terms
are used) on different interfaces and must send a copy of the IP multicast packet out
on each interface with a leaf. When a new leaf subnetwork is added to the tree (i.e.,
the interface to the host subnetwork previously received no copies of the multicast
packets), a new branch is built, the leaf is joined to the tree, and replicated packets are
now sent out on the interface.
When a branch contains no leaves because there are no interested hosts on the
router interface leading to that IP subnetwork, the branch is pruned from the distribution tree, and no multicast packets are sent out from that interface. Packets are replicated and sent out from multiple interfaces only where the distribution tree branches
at a router, and no link ever carries a duplicate flow of packets.
Collections of hosts all receiving the same stream of IP packets, usually from the
same multicast source, are called groups. In IP multicast networks, traffic is delivered to
multicast groups based on an IP multicast address or group address. The groups determine the location of the leaves, and the leaves determine the branches on the multicast
network. Some multicast routing protocols use a special RP router to allow receivers
to find sources efficiently.
DENSE AND SPARSE MULTICAST
Multicast addresses represent groups of receivers, and two strategies can be employed
to ensure that all receivers interested in a multicast group receive the traffic.
Dense-Mode Multicast
The assumption here is that almost all possible subnets have at least one receiver
wanting to receive the multicast traffic from a source, so the network is flooded with
traffic on all possible branches and then pruned back as branches do not express
an interest in receiving the packets—explicitly (by message) or implicitly (timeout
silence). This is the dense mode of multicast operation. LANs are appropriate environments for dense-mode operation. In practice, although PIM-DM is worth covering (and
we’ll even configure it!) there aren’t a lot of scenarios in which people would seriously
consider it. Periodic blasting of source content is neither a very scalable nor efficient
use of resources.
Sparse-Mode Multicast
The assumption here is that very few of the possible receivers want packets from this
source, so the network establishes and sends packets only on branches that have at
least one leaf indicating (by message) a desire for the traffic. This is the sparse mode
of multicast operation. WANs (like the Internet) are appropriate networks for sparse-mode operation. Sparse-mode multicast protocols use the special RP router to allow
receivers to find sources efficiently.
Specific networks can run whichever mode makes sense. A low-volume multicast
application can make effective use of dense mode, even on a WAN. A high-volume
application on a LAN might still use sparse mode for efficiency.
Some multicast routing protocols, especially older ones, support only dense-mode
operation—which makes them difficult to use efficiently on the public Internet. Others
allow sparse mode as well. If sparse-dense mode is supported, the multicast routing
protocol allows a few special groups (typically those used to discover the RPs) to operate
in dense mode, while all other groups operate in sparse mode.
MULTICAST NOTATION
To avoid multicast routing loops, every multicast router must always be aware of the
interface that leads to the source of that multicast group content by the shortest path.
This is the upstream (incoming) interface, and packets should never be forwarded back
toward a multicast source. All other interfaces are potential downstream (outgoing)
interfaces, depending on the number of branches on the distribution tree.
Routers closely monitor the status of the incoming and outgoing interfaces, a process
that determines the multicast forwarding state. A router with a multicast forwarding
state for a particular multicast group is essentially “turned on” for that group’s content.
Interfaces on the router’s outgoing interface list (OIL) send copies of the group’s packets received on the incoming interface list for that group. The incoming and outgoing
interface lists might be different for different multicast groups.
The multicast forwarding state in a router is usually written in (S,G) or (*,G)
notation. These are pronounced “S comma G” and “star comma G,” respectively. In (S,G),
the S refers to the unicast IP address of the source for the multicast traffic, and the
G refers to the particular multicast group IP address for which S is the source. All multicast packets sent from this source have S as the source address and G as the destination
address.
The asterisk (*) in the (*,G) notation is a wild card indicating that the source sending
to group G is unknown. Routers try to track down these sources when they have to in
order to operate more efficiently.
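One way to make the notation concrete is to picture the forwarding state as a small lookup structure. The C sketch below is our own illustration (the field names are invented), using a reserved wildcard value to stand in for the asterisk in (*,G):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <arpa/inet.h>

#define SOURCE_WILDCARD 0u            /* stands in for the "*" in (*,G) */

/* One multicast forwarding-state entry: an (S,G) entry names a specific
 * source, while a (*,G) entry uses the wildcard and matches any source
 * sending to group G. */
struct mcast_state {
    uint32_t source;                  /* S, or SOURCE_WILDCARD for (*,G)     */
    uint32_t group;                   /* G, the multicast group address      */
    int      upstream_ifindex;        /* the single RPF (incoming) interface */
    uint32_t oil_bitmap;              /* outgoing interface list as a bitmap */
};

static bool entry_matches(const struct mcast_state *e,
                          uint32_t source, uint32_t group)
{
    return e->group == group &&
           (e->source == SOURCE_WILDCARD || e->source == source);
}

int main(void)
{
    struct mcast_state star_g = { SOURCE_WILDCARD, inet_addr("239.1.1.1"), 2, 0x0C };

    /* A packet from any source sent to 239.1.1.1 matches the (*,G) entry. */
    printf("%d\n", entry_matches(&star_g, inet_addr("10.10.11.51"),
                                 inet_addr("239.1.1.1")));    /* prints 1 */
    return 0;
}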
MULTICAST CONCEPTS
The basic terminology of multicast is complicated by the use of several related concepts. Many of these apply to how the routers on a multicast-capable network handle
multicast packets and have little to do with hosts on LANs, but they are important
concepts nonetheless.
Reverse-Path Forwarding
Unicast forwarding decisions are typically based on the destination address of the
packet arriving at a router. The unicast routing table is organized by destination subnet
and mainly set up to forward the packet toward the destination.
In multicast, the router forwards the packet away from the source to make progress
along the distribution tree and prevent routing loops. The router’s multicast forwarding state runs more logically by organizing tables based on the reverse path, from the
receiver back to the root of the distribution tree. This process is known as reverse-path
forwarding (RPF).
The router adds a branch to a distribution tree depending on whether the request for
traffic from a multicast group passes the RPF check. Every multicast packet received must
pass an RPF check before it is eligible to be replicated or forwarded on any interface.
The RPF check is essential for every router’s multicast implementation. When a
multicast packet is received on an interface, the router interprets the source address in
the multicast IP packet as the destination address for a unicast IP packet. The source
multicast address is found in the unicast routing table, and the outgoing interface is
determined. If the outgoing interface found in the unicast routing table is the same
as the interface that the multicast packet was received on, the packet passes the RPF
check. Multicast packets that fail the RPF check are dropped because the incoming
interface is not on the shortest path back to the source.
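In code, the whole check is a single comparison against a unicast lookup. The following toy C sketch (the routing table and interface numbers are invented for illustration) treats the multicast source address as a unicast destination and passes the packet only if it arrived on the interface leading back to that source:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy RPF table: a prefix, a mask, and the interface used to reach it,
 * listed most-specific first.  A real router consults its unicast (or
 * dedicated multicast RPF) table instead. */
struct rpf_route { uint32_t prefix; uint32_t mask; int ifindex; };

static const struct rpf_route rpf_table[] = {
    { 0x0A0A0B00, 0xFFFFFF00, 2 },   /* 10.10.11.0/24 reached via interface 2 */
    { 0x00000000, 0x00000000, 1 },   /* default route via interface 1         */
};

static int rpf_lookup_interface(uint32_t addr)
{
    for (size_t i = 0; i < sizeof rpf_table / sizeof rpf_table[0]; i++)
        if ((addr & rpf_table[i].mask) == rpf_table[i].prefix)
            return rpf_table[i].ifindex;
    return -1;
}

/* The RPF check: look up the multicast packet's SOURCE address as if it
 * were a unicast destination; the packet passes only if it arrived on
 * the interface leading back toward that source. */
static bool rpf_check(uint32_t source, int arrival_ifindex)
{
    return rpf_lookup_interface(source) == arrival_ifindex;
}

int main(void)
{
    uint32_t wincli1 = 0x0A0A0B33;    /* 10.10.11.51, the source on LAN1 */
    printf("arrived on interface 2: %s\n", rpf_check(wincli1, 2) ? "pass" : "drop");
    printf("arrived on interface 3: %s\n", rpf_check(wincli1, 3) ? "pass" : "drop");
    return 0;
}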
Routers can build and maintain separate tables for RPF purposes. The router must
have some way to determine its RPF interface for the group, which is the interface
topologically closest to the root. The distribution tree should follow the shortest-path
tree topology for efficiency. The RPF check helps to construct this tree.
The RPF Table
The RPF table plays the key role in the multicast router. The RPF table is consulted for
every RPF check, which is performed at intervals on multicast packets entering the
multicast router. Distribution trees of all types rely on the RPF table to form properly,
and the multicast forwarding state also depends on the RPF table.
The routing table used for RPF checks can be the same routing table used to forward unicast IP packets, or it can be a separate routing table used only for multicast
RPF checks. In either case, the RPF table contains only unicast routes because the RPF
check is performed on the source address of the multicast packet (not the multicast
group destination address), and a multicast address is forbidden from appearing in the
source address field of an IP packet header. The unicast address can be used for RPF
checks because there is only one source host for a particular stream of IP multicast
content for a multicast group address, although the same content could be available
from multiple sources.
Populating the RPF Table
If the same routing table used to forward unicast packets is also used for the RPF
checks, the routing table is populated and maintained by the traditional unicast routing
protocols such as Border Gateway Protocol (BGP), Intermediate System-to-Intermediate
System (IS–IS), OSPF, and Routing Information Protocol (RIP). If a dedicated multicast
RPF table is used, this table must be populated by some other method. Some multicast
routing protocols, such as the Distance Vector Multicast Routing Protocol (DVMRP),
essentially duplicate the operation of a unicast routing protocol and populate a dedicated RPF table. Others, such as Protocol Independent Multicast (PIM), do not duplicate
routing protocol functions and must rely on some other routing protocol to set up this
table—which is why PIM is protocol independent.
Some traditional routing protocols (such as BGP and IS–IS) now have extensions
to differentiate between different sets of routing information sent between routers
for unicast and multicast. For example, there is multiprotocol BGP (MBGP) and multitopology routing in IS–IS (M-ISIS). Multicast Open Shortest Path First (MOSPF) also
extends OSPF for multicast use, but goes further than MBGP or M-ISIS and makes
MOSPF into a complete multicast routing protocol on its own. When these routing
protocols are used, routes can be tagged as multicast RPF routes and used by the
receiving router differently than the unicast routing information.
Using the main unicast routing table for RPF checks provides simplicity. A dedicated
routing table for RPF checks allows a network administrator to set up separate paths
and routing policies for unicast and multicast traffic, allowing the multicast network to
function more independently of the unicast network. The following section discusses
in further detail how PIM operates, although the concepts could be applied to other
multicast routing protocols.
Shortest-Path Tree
The distribution tree used for multicast is rooted at the source and is the shortest-path
tree (SPT) as well. Consider a set of multicast routers without any active multicast
traffic for a certain group (i.e., they have no multicast forwarding state for that group).
When a router learns that an interested receiver for that group is on one of its directly
connected subnets, the router attempts to join the tree for that group.
To join the distribution tree, the router determines the unicast IP address of the
source for that group. This address can be statically configured in the router, or learned
using more complex methods.
To build the SPT for that group, the router executes an RPF check on the source
address in its routing table. The RPF check produces the interface closest to the source,
which is where multicast packets from this source for this group should flow into the
router.
The router next sends a join message out on this interface using the proper multicast protocol to inform the upstream router that it wishes to join the distribution
tree for that group. This message is an (S,G) join message because both S and G are
known. The router receiving the (S,G) join message adds the interface on which the
message was received to its OIL for the group and performs an RPF check on the
source address. The upstream router then sends an (S,G) join message out the RPF
interface toward the source, informing the upstream router that it also wants to join
the group.
Each upstream router repeats this process, propagating joins out the RPF interface—
building the SPT as it goes. The process stops when the join message does the following:
■ Reaches the router directly connected to the host that is the source, or
■ Reaches a router that already has multicast forwarding state for this source-group pair.
In either case, the branch is created, each of the routers has multicast forwarding state
for the source-group pair, and packets can flow down the distribution tree from source
to receiver. The RPF check at each router ensures that the tree is an SPT.
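The hop-by-hop logic is easy to sketch. The toy C program below is our own invention (a simple chain of routers, with the OIL and real interfaces left out), but it shows how an (S,G) join propagates toward the source until it reaches the source's router or a router that already has forwarding state:

#include <stdbool.h>
#include <stdio.h>

/* Toy chain of routers: index 0 is the router directly connected to the
 * source, higher indexes are farther downstream.  Each router only
 * remembers whether it already has (S,G) forwarding state. */
#define NUM_ROUTERS 5
static bool has_sg_state[NUM_ROUTERS];

static void send_sg_join(int router)
{
    if (has_sg_state[router]) {       /* existing state for this (S,G): stop */
        printf("router %d: already on the tree, join stops here\n", router);
        return;
    }
    has_sg_state[router] = true;      /* install (S,G) forwarding state */
    printf("router %d: installed (S,G) state\n", router);

    if (router == 0) {                /* directly connected to the source */
        printf("router 0: reached the source's router, branch complete\n");
        return;
    }
    send_sg_join(router - 1);         /* keep going out the RPF interface */
}

int main(void)
{
    send_sg_join(4);   /* the first receiver appears behind router 4   */
    send_sg_join(2);   /* a later join stops as soon as it finds state */
    return 0;
}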
SPTs are always the shortest path, but they are not necessarily short. That is, sources
and receivers tend to be on the periphery of a router network (not on the backbone) and
multicast distribution trees have a tendency to sprawl across almost every router in the
network. Because multicast traffic can overwhelm a slow interface, and one packet can
easily become a hundred or a thousand on the opposite side of the backbone, it makes
sense to provide a shared tree as a distribution tree so that the multicast source could
be located more centrally in the network (on the backbone). This sharing of distribution
trees with roots in the core network is accomplished by a multicast rendezvous point.
Rendezvous Point and Rendezvous-Point Shared Trees
In a shared tree, the root of the distribution tree is a router (not a host), and is located
somewhere in the core of the network. In the primary sparse-mode multicast routing
protocol, Protocol Independent Multicast sparse mode (PIM-SM), the core router at the
root of the shared tree is the RP. Packets from the upstream source and join messages
from the downstream routers “rendezvous” at this core router.
In the RP model, other routers do not need to know the addresses of the sources for
every multicast group. All they need to know is the IP address of the RP router. The RP
router knows the sources for all multicast groups.
The RP model shifts the burden of finding sources of multicast content from each
router, where (S,G) state must name every source, to the network, where (*,G) state needs to know only the RP.
Exactly how the RP finds the unicast IP address of the source varies, but there must
be some method to determine the proper source for multicast content for a particular
group.
Consider a set of multicast routers without any active multicast traffic for a certain
group. When a router learns that an interested receiver for that group is on one of its
directly connected subnets, the router attempts to join the distribution tree for that
group back to the RP (not to the actual source of the content). In some sparse-mode protocols, the shared tree is called the rendezvous-point tree (RPT).
When the branch is created, packets can flow from the source to the RP and from
the RP to the receiver. Note that there is no guarantee that the shared tree (RPT) is
the shortest path tree to the source. Most likely it is not. However, there are ways to
“migrate” a shared tree to an SPT once the flow of packets begins. In other words, the
forwarding state can transition from (*,G) to (S,G). The formation of both types of trees
depends heavily on the operation of the RPF check and the RPF table.
PROTOCOLS FOR MULTICAST
Multicast is not a single protocol used for a specific function, like FTP. Nor is multicast
a series of separate protocols that can be used as desired between adjacent hosts
and routers to perform a function, like IS–IS and OSPF. Multicast is a series of related
protocols that must be carefully coordinated across and between an AS and often
among hosts.
The size of the family of multicast protocols is due to the complexity of source discovery
and the mechanisms used to perform this task. Most hosts can send and receive
multicast frames and packets on a LAN as easily as they handle broadcast or unicast. Routers must be capable of sending copies of a single received packet out on
more than one interface (replication), and many low-end routers cannot do this. In
addition, routers must be able to use unicast routing tables for multicast purposes,
or construct special tables for multicast information (again, many low-end routers
cannot do this).
Multicast routers must be able to maintain state on each interface with regard to
multicast traffic. That is, the router must know which multicast groups have active
receivers on an outgoing interface (called downstream interfaces) and which interface
is the “closest” to the source (called upstream interface). These interfaces vary from
group to group, one group can have more than one potential source (for redundancy
purposes), and special routers might be employed for many groups (the RPs).
Multicast Hosts and Routers
Multicast tasks are very different for hosts versus routers. At this juncture, we will
extend the multicast discussion beyond IPv4 to IPv6 and hosts. General points follow.
■ Hosts must be able to join and leave multicast groups. The major protocols here are various versions of the Internet Group Management Protocol (IGMP) in IPv4 and Multicast Listener Discovery (MLD) in IPv6.
■ Hosts (users) must know the content of multicast groups. The related Session Announcement Protocol and Session Description Protocol (SAP/SDP, defined in RFC 2974 and RFC 2327) are the standard protocols used to describe the content and some other aspects of multicast groups. These should not be used as a method of multicast source discovery.
■ Routers must be able to find the sources of multicast content, both in their own multicast (routing) domain and in others. For sparse modes, this means finding the RPs. These can be configured statically, or use protocols such as Auto-RP, anycast RP (RFC 3446), bootstrap router (BSR), or MSDP (RFC 3618). For IPv6, embedded RP is used instead of MSDP—which is not defined for IPv6 use. (This point actually applies to ASM, not SSM, discussed in material following.)
■ Routers must be able to prevent loops that replicate the same packet over and over. The techniques here are not really protocols, and include the use of scoping (limiting multicast packet hops) and RPF checks.
■ Routers must provide missing multicast information when feasible. Multicast networks can use Pragmatic General Multicast (PGM) to add some TCP features lacking in UDP to multicast networks. However, the only assurance is that you know you missed something. Application-specific mechanisms can do the same thing with simple sequence numbers.
Fortunately, only a few of these protocols are really used for multicast at present on
the Internet. The only complication is that some of the special protocols used for IPv4
multicasting do not work with IPv6, and thus different protocols perform the same
functions.
Multicast Group Membership Protocols
Multicast group membership protocols allow a router to know when a host on a
directly attached subnet, typically a LAN, wants to receive traffic from a certain multicast group. Even if more than one host on the LAN wants to receive traffic for that
multicast group, the router has to send only one copy of each packet for that multicast
group out on that interface because of the inherent broadcast nature of LANs. Only
when the router is informed by the multicast group membership protocol that there
are no interested hosts on the subnet can the packets be withheld and that leaf pruned
from the distribution tree.
Internet Group Management Protocol for IPv4
There is only one standard IPv4 multicast group membership protocol: the Internet
Group Management Protocol (IGMP). However, IGMP has several versions that are supported by hosts and routers. There are currently three versions of IGMP.
IGMPv1—The original protocol defined in RFC 1112. An explicit join message
is sent to the router, but a timeout is used to determine when hosts leave a
group. This process wastes processing cycles on the router, especially on older
or smaller routers.
IGMPv2—Among other features, IGMPv2 (RFC 2236) adds an explicit leave message to the join message so that routers can more easily determine when a
group has no interested listeners on a LAN.
IGMPv3—Among other features, IGMPv3 (RFC 3376) optimizes support for a
single source of content for a multicast group or source-specific multicast
(SSM). (RFC 1112 supported both many-to-many and one-to-many multicast, but
one-to-many is considered the more viable model for the Internet at large.)
Although the various versions of IGMP are backward compatible, it is common
for a router to run multiple versions of IGMP on LAN interfaces because backward
compatibility is achieved by dropping back to the most basic of all versions run on
a LAN. For example, if one host is running IGMPv1, any router attached to the LAN
running IGMPv2 drops back to IGMPv1 operation—effectively eliminating the IGMPv2
advantages. Running multiple IGMP versions ensures that both IGMPv1 and IGMPv2
hosts find peers for their versions on the router.
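Note that applications never build IGMP messages themselves; they simply ask the operating system to join a group, and the host's IGMP code (whatever version is in use on the LAN) sends the membership reports. A minimal C receiver sketch (ours, using the same standard socket calls as the Donahoo and Calvert receiver, with the group and port from the earlier exercise) looks like this:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (sock < 0) { perror("socket"); return 1; }

    /* Bind to the UDP port the sender is using. */
    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(22222);
    if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind"); return 1;
    }

    /* Joining the group here is what causes the host's IGMP code to send
     * membership reports on the chosen interface. */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr("239.2.2.2");
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP"); return 1;
    }

    char buf[256];
    ssize_t n = recvfrom(sock, buf, sizeof(buf) - 1, 0, NULL, NULL);
    if (n >= 0) {
        buf[n] = '\0';
        printf("received: %s\n", buf);
    }
    return 0;
}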
Multicast Listener Discovery for IPv6
IPv6 does not use IGMP to manage multicast groups. Multicast groups are an integral
part of IPv6, and the Multicast Listener Discovery (MLD) protocol is an integral part
of IPv6. Some IGMP functions are assumed by ICMPv6, but IPv6 hosts perform most
multicast functions with MLD. MLD comes in two versions: MLD version 1 (RFC 2710)
has basic functions, and MLDv2 (RFC 3810) supports SSM groups.
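The same division of labor applies in IPv6: joining a group with the standard IPV6_JOIN_GROUP socket option is what triggers the kernel's MLD reports. A minimal C sketch follows (the group ff15::2:2:2 is just an example transient, site-local group we made up for the sketch):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int sock = socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP);
    if (sock < 0) { perror("socket"); return 1; }

    /* Joining an IPv6 group triggers the kernel's MLD reports, just as
     * IP_ADD_MEMBERSHIP triggers IGMP reports in IPv4. */
    struct ipv6_mreq mreq6;
    memset(&mreq6, 0, sizeof(mreq6));
    inet_pton(AF_INET6, "ff15::2:2:2", &mreq6.ipv6mr_multiaddr);
    mreq6.ipv6mr_interface = 0;        /* 0 = let the kernel pick the interface */
    if (setsockopt(sock, IPPROTO_IPV6, IPV6_JOIN_GROUP, &mreq6, sizeof(mreq6)) < 0) {
        perror("IPV6_JOIN_GROUP"); return 1;
    }

    /* Binding to a UDP port and calling recv() would follow exactly as in
     * the IPv4 receiver sketch shown earlier. */
    return 0;
}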
Multicast Routing Protocols
There are five multicast routing protocols.
Distance-Vector Multicast Routing Protocol
This is the first of the multicast routing protocols and is hampered by a number of
limitations that make this method unattractive for large-scale Internet use. DVMRP is
a dense-mode-only protocol that uses the flood-and-prune, or implicit join method,
to deliver traffic everywhere and then determines where uninterested receivers are.
DVMRP uses source-based distribution trees in the form (S,G).
Multicast Open Shortest Path First
This protocol extends OSPF for multicast use, but only for dense mode. However,
MOSPF has an explicit join message, and thus routers do not have to flood their entire
domain with multicast traffic from every source. MOSPF uses source-based distribution
trees in the form (S,G).
PIM Dense Mode
This is Protocol Independent Multicast operating in dense mode (PIM DM), but the differences from PIM sparse mode are profound enough to consider the two modes separately. PIM also supports sparse-dense mode, but there is no special notation for that
operational mode. In contrast to DVMRP and MOSPF, PIM dense mode allows a router
to use any unicast routing protocol and performs RPF checks using the unicast routing
table. PIM dense mode has an implicit join message, so routers use the flood-and-prune
method to deliver traffic everywhere and then determine where the uninterested
receivers are. PIM dense mode uses source-based distribution trees in the form (S,G),
as do all dense-mode protocols.
PIM Sparse Mode
PIM sparse mode allows a router to use any unicast routing protocol and performs RPF
checks using the unicast routing table. However, PIM sparse mode has an explicit join
message, so routers determine where the interested receivers are and send join messages upstream to their neighbors—building trees from receivers to the RP. The Protocol Independent Multicast sparse mode uses an RP router as the initial source of multicast group traffic and therefore builds distribution trees in the form (*,G), as do all sparse-mode protocols. However, PIM sparse mode migrates to an (S,G) source-based tree if that path is shorter than through the RP for a particular multicast group's traffic.

Core-Based Trees
Core-based trees (CBT) share all of the characteristics of PIM sparse mode (sparse mode, explicit join, and shared [*,G] trees), but are said to be more efficient at finding sources than PIM sparse mode. CBT is rarely encountered outside academic discussions and the experimental RFC 2201 from September 1997. There are no large-scale deployments of CBT, commercial or otherwise. The differences among the five multicast routing protocols are summarized in Table 16.1.

Table 16.1 Major Characteristics of Multicast Routing Protocols

Multicast Routing Protocol   Dense Mode   Sparse Mode   Implicit Join   Explicit Join   (S,G) SBT    (*,G) Shared Tree
DVMRP                        Yes          No            Yes             No              Yes          No
MOSPF                        Yes          No            No              Yes             Yes          No
PIM-DM                       Yes          No            Yes             No              Yes          No
PIM-SM                       No           Yes           No              Yes             Yes, maybe   Yes, initially
CBT                          No           Yes           No              Yes             No           Yes
It is important to realize that retransmissions due to a high bit-error rate on a link or
overloaded router can make multicast as inefficient as repeated unicast.
Any-Source Multicast and SSM
RFC 1112 originally described both one-to-many (for radio and television) and many-to-many (for videoconferences and on-line gaming applications) multicasts. This model
is now known as Any-Source Multicast (ASM). To support many-to-many multicasts,
the network is responsible for source discovery. So, whenever a host expresses a desire to join a group, the network must find all the sources for that group and deliver their traffic to the receiver.
Source discovery is especially complex with interdomain scenarios (source in one
AS, receiver(s) in another). And most plans to commercialize Internet multicasts, such
as bringing radio station and television channel multicasts directly onto the Internet,
revolve around the one-to-many model exclusively. So, the one-to-many scenario has
been essentially split off from the all-embracing RFC 1112 vision and become Source-Specific Multicast (SSM, defined in RFC 3569).
As the name implies, SSM supports multicast content delivery from only one specific
source. In SSM, source discovery is not the responsibility of the network but of the
receivers (hosts). This eliminates much of the complexity of multicast mechanisms required in ASM and the use of MSDP. It also eliminates some of the scaling considerations associated with traffic on (*,G) groups.
ASM and SSM are not protocols but service models. Most of what is described in this chapter applies to ASM (the more general model). But keep in mind that SSM does away with many of the procedures covered in detail here that apply to ASM, including RPs, RPTs, and MSDP. Figure 16.4 shows the current suite of multicast protocols and how they all fit together.

FIGURE 16.4
Suite of multicast protocols showing how those for ASM, SSM, and RPF checks fit together and are used. The figure arranges the protocols in three columns (those for Any-Source Multicast, those for Source-Specific Multicast, and those for reverse-path forwarding) and two rows. At the interdomain (AS to AS) level, ASM relies on MSDP (peer-RPF flooding), RPF checks rely on MBGP (path vector), and SSM needs nothing extra. At the intradomain (same AS) level, PIM-SM and PIM-SSM (no RP) provide sparse mode, PIM-DM and DVMRP provide dense mode, and RPF information comes from link-state protocols (OSPF, M-ISIS) or distance-vector protocols (RIP, DVMRP).
Multicast Source Discovery Protocol
MSDP, described in RFC 3618, is a mechanism to connect multiple PIM-SM domains
(usually, each in an AS). Each PIM-SM domain can have its own independent RPs,
and these do not interact in any way (so MSDP is not needed in SSM scenarios). The
advantages of MSDP are that the RPs do not need any other resource to find each
other and that domains can have receivers only and get content without globally
advertising group membership. In addition, MSDP can be used with protocols other
than PIM-SM.
MSDP routers in a PIM-SM domain peer with their MSDP router peers in other
domains. The peering session uses a TCP connection to exchange control information.
Each domain has one or more of these connections in its “virtual topology.” This allows
domains to discover multicast sources in other domains. If these sources are deemed
of interest to receivers in another domain, the usual source-tree mechanism in PIM-SM
is used to deliver multicast content—but now over an interdomain distribution tree.
More details about MSDP are beyond the scope of this introductory chapter.
Frames and Multicast
Multicasting on a LAN is a good place to start an investigation of multicasting in general.
Consider a single LAN, without routers, with a multicast source sending to a certain
group. The rest of the hosts are receivers interested in the multicast group’s content.
So, the multicast source host generates packets with its unicast IP address as the source
and the group address as the destination.
One issue comes up immediately. The packet source address obviously will be
the unicast IP address of the host originating the multicast content. This translates
to the MAC address for the source address in the frame in which the packet is encapsulated. The packet’s destination address will be the multicast group. So far, so good.
But what should be the frame’s destination address that corresponds to the packet’s
multicast group address?
Using the LAN broadcast MAC address defeats the purpose of multicast, and hosts
could have access to many multicast groups. Broadcasting at the LAN level makes no
sense. Fortunately, there is an easy way out of this. The MAC address has a bit that is set
to 0 for unicast (the LAN term is individual address) and to a 1 to indicate that this
is a multicast address. Some of these addresses are reserved for multicast groups for
specific vendors or MAC-level protocols. Internet multicast applications use the range
0x01-00-5E-00-00-00 to 0x01-00-5E-FF-FF-FF. TCP/IP multicast receivers listen for frames
with one of these addresses when the application joins a multicast group, and stop
listening when the application terminates or the host leaves the group.
So, 24 bits are available to map IPv4 multicast addresses to MAC multicast addresses.
But all IPv4 addresses, including multicast addresses, are 32 bits long. There are 8 bits
left over. How should IPv4 multicast addresses be mapped to MAC multicast addresses
to minimize the chance of “collisions” (two different multicast groups mapped to the
same MAC multicast address)?
All IPv4 multicast addresses begin with the same four bits (1110), so we only have
to really worry about 4 bits (not 8). We shouldn’t drop the last bits of the IPv4 address,
because these are almost guaranteed to be host bits—depending on subnet mask. But
the high-order bits, the leftmost bits, are almost always network bits, and we're only
worried about one LAN for now.
One other bit of the remaining 24 MAC address bits is reserved (an initial 0 indicates
an Internet multicast address), so let’s just drop the 5 bits following the initial 1110 in
the IPv4 address and map the 23 remaining bits (one for one) into the last 23 bits of the
MAC address. This procedure is shown in Figure 16.5.
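For readers who prefer code to bit diagrams, here is a small C sketch of the mapping rule (our own function, not part of the Donahoo and Calvert programs). It also demonstrates the "collision" discussed next, since two different groups can map to the same frame address:

#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>   /* inet_addr(), ntohl() */

/* Map an IPv4 multicast group address (network byte order) to the
 * corresponding Ethernet multicast address: keep the 01-00-5E prefix
 * (with the extra bit set to 0 for Internet use) and copy the low-order
 * 23 bits of the group address. */
static void ipv4_group_to_mac(uint32_t group_be, uint8_t mac[6])
{
    uint32_t group = ntohl(group_be);     /* host byte order */

    mac[0] = 0x01;
    mac[1] = 0x00;
    mac[2] = 0x5E;
    mac[3] = (group >> 16) & 0x7F;        /* drops 1110 and the next 5 bits */
    mac[4] = (group >> 8) & 0xFF;
    mac[5] = group & 0xFF;
}

int main(void)
{
    uint8_t mac[6];

    /* 224.8.7.6 and 229.136.7.6 both map to 01-00-5E-08-07-06. */
    ipv4_group_to_mac(inet_addr("224.8.7.6"), mac);
    printf("%02X-%02X-%02X-%02X-%02X-%02X\n",
           mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);

    ipv4_group_to_mac(inet_addr("229.136.7.6"), mac);
    printf("%02X-%02X-%02X-%02X-%02X-%02X\n",
           mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return 0;
}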
FIGURE 16.5
How to convert from IPv4 header multicast to Ethernet MAC multicast address formats. The worked example converts the group address 232.224.202.181 (0xE8-E0-CA-B5): the leading 1110 and the five bits that follow are ignored, the multicast bit and the 0 "for Internet" bit are set, and the remaining 23 bits are copied into the MAC multicast prefix, giving the frame destination address 01:00:5E:60:CA:B5.
Note that this process means that there are 32 (2⁵) IPv4 multicast addresses that
could map to the same MAC multicast addresses. For example, multicast IPv4 addresses
224.8.7.6 and 229.136.7.6 translate to the same MAC address (0x01-00-5E-08-07-06).
This is a real concern, and because the host will accept frames sent to both multicast
groups, the IP software must filter out the unwanted group's packets. This problem is far less likely in
IPv6, but is always a concern in IPv4.
Once the MAC address for the multicast group is determined, the operating system essentially orders the NIC to join or leave the multicast group, either accepting frames sent to that address (in addition to the host's unicast address) or ignoring that multicast group's frames. It is possible for a host to receive multicast content from more than one group at the same time, of course. The procedure for IPv6 multicast packets inside frames is nearly identical, except for the MAC destination address 0x3333 prefix and other points covered later in the chapter.
IPv4 Multicast Addressing
The IPv4 addresses (Class D in the classful addressing scheme) used for multicast
range from 224.0.0.0 to 239.255.255.255. Assignment of addresses in this range is
controlled by the Internet Assigned Numbers Authority (IANA). Multicast addresses can
never be used as a source address in a packet (the source address is always the unicast
IP address of the content originator). Certain subranges within the range of addresses
are reserved for specific uses.
■ 224.0.0.0/24—The link-local multicast range (these packets never pass through routers)
■ 224.2.0.0/16—The SAP/SDP range
■ 232.0.0.0/8—The Source-Specific Multicast (SSM) range
■ 233.0.0.0/8—The AS-encoded statically assigned GLOP range defined in RFC 3180
■ 239.0.0.0/8—The administratively scoped multicast range defined in RFC 2365 (these packets may pass through a certain number of routers)
For a complete list of currently assigned IANA multicast addresses, refer to the
www.iana.org/assignments/multicast-addresses Web site. If multicast addresses had
been assigned in the same manner that unicast addresses were allocated, the Class D address space would have been exhausted long ago. However, IANA allocates static multicast addresses only for protocols. Routers cannot forward packets in these ranges. Some of these addresses are outlined in Table 16.2.

Table 16.2 Multicast Addresses Used for Various Protocols

Address                  Purpose                                                       Comment
224.0.0.0                Reserved base address                                         RFC 1112
224.0.0.1                All systems on this subnet                                    RFC 1112
224.0.0.2                All routers on this subnet
224.0.0.3                Unassigned
224.0.0.4                DVMRP routers on this subnet                                  RFC 1075
224.0.0.5                All OSPF routers on this subnet                               RFC 1583
224.0.0.6                All OSPF DRs on this subnet                                   RFC 1583
224.0.0.7                All ST (Streams protocol) routers on this subnet              RFC 1190
224.0.0.8                All ST hosts on this subnet                                   RFC 1190
224.0.0.9                All RIPv2 routers on this subnet                              RFC 1723
224.0.0.10               All Cisco IGRP routers on this subnet                         (Cisco)
224.0.0.11               All Mobile IP agents
224.0.0.12               DHCP server/relay agents                                      RFC 1884
224.0.0.13               All PIM routers                                               (IANA)
224.0.0.14-224.0.0.21    Assigned to various routing protocols and router features     (IANA)
224.0.0.22               IGMP                                                          (IANA)
224.0.0.23-224.0.0.255   See www.iana.org/assignments/multicast-addresses              (IANA)
A simple dynamic address allocation mechanism is used in the SAP/SDP block to
prevent multicast address exhaustion. Applications, such as the Session Directory Tool
(SDR), use this mechanism to randomly select an unused address in this range. This
dynamic allocation mechanism for global multicast addresses is similar to the DHCP
function, which dynamically assigns unicast addresses on a LAN.
However, some applications require static multicast addresses. So, GLOP (described
in RFC 3180) provides static multicast ranges for organizations that already have an
AS number. (GLOP is not an acronym or abbreviation—it’s just the name of the mechanism.) GLOP uses the 2-byte AS number to derive a /24 address block within the
233/8 range. It’s worth noting that there are no GLOP addresses set aside for 4-byte AS
numbers. The static multicast range is derived from the following form:
233.[first byte of AS].[second byte of AS].0/24
For example, AS 65001 is allocated 233.253.233.0/24—and only this AS can use it. The
following is an easy way to compute this address.
1. Convert the AS number to hexadecimal (65001 = 0xFDE9).
2. Convert the first byte back to decimal (0xFD = 253).
3. Convert the second byte back to decimal (0xE9 = 233).
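These three steps are just splitting the 16-bit AS number into its two bytes, which a few lines of C make plain (this helper is our own illustration, not an official tool):

#include <stdio.h>
#include <stdint.h>

/* Derive the GLOP /24 (RFC 3180) for a 2-byte AS number:
 * 233.[first byte of AS].[second byte of AS].0/24 */
static void glop_block(uint16_t as, char out[20])
{
    snprintf(out, 20, "233.%u.%u.0/24", (as >> 8) & 0xFF, as & 0xFF);
}

int main(void)
{
    char block[20];
    glop_block(65001, block);    /* 65001 = 0xFDE9 */
    printf("%s\n", block);       /* prints 233.253.233.0/24 */
    return 0;
}

Compiling and running this prints 233.253.233.0/24, matching the steps above.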
Addresses in the 239/8 range are defined as administratively scoped. Packets sent
to these addresses should not be forwarded by a router outside an administratively
defined boundary (usually a domain).
Addresses in the 232/8 range are reserved for SSM. A nice feature of SSM is that
the multicast group address no longer needs to be globally unique. The source-group
“channel,” or tuple, provides uniqueness because the receiver is expressing interest in
only one source for the group.
SSM has solved the multicast addressing allocation headache. With SSM, as well
as GLOP, administrative scoping, and SAP/SDP, IPv4 multicast address allocation is
sufficient until IPv6 becomes more common.
IPv6 Multicast Addressing
In IPv6, the number of multicast (and unicast) addresses available is not an issue. All
IPv6 multicast addresses start with 1111 1111 (0xFF). As in IPv4, no IPv6 packet can
have an IPv6 multicast address as a source address. There is really no such thing as a
“broadcast” in IPv6. Instead, devices must belong to certain multicast groups and pay
attention to packets sent to these groups. The structure of the IPv6 multicast address
is shown in Figure 16.6.
FIGURE 16.6
The IPv6 multicast address format: an 8-bit prefix of 1111 1111, a 4-bit Flags field, a 4-bit Scope field, and a 112-bit Group ID (128 bits in all). Note the presence of the scope field.
FIGURE 16.7
The IPv6 multicast group address showing how the MAC group ID is embedded: a 16-bit 0011 0011 0011 0011 (0x3333) prefix, 80 bits that must be all zeroes, and the 32-bit MAC group ID (128 bits in all).
Format Prefix
This 8-bit field is simply 1111
1111 (0xFF).
Flags
As of RFC 2373, the only flag defined for this 4-bit field is Transient (T). If 0, the multicast
address is a permanently assigned well-known address allocated by IANA. If 1, the
multicast address is not permanently assigned (transient).
Scope
This 4-bit field establishes the multicast packets' boundaries. RFC 2373 defines several
well-known scopes, including node-local (1), link-local (2), site-local (5), organization-local (8), and global (E). Packets sent to 0xFF02::X are confined to a single link and cannot pass through a router (this issue came up in the IGP chapter with RIPng).
Group ID
The IPv6 multicast group ID is 112 bits long. Permanently assigned group IDs are valid regardless of the scope value, whereas transient group IDs are valid only within a particular scope. The 112 bits of the Group ID field pose a challenge to the 48-bit MAC address (and only 23 of those bits were used in IPv4). But the solution is much simpler than in IPv4. RFC 2373 recommends using the low-order 32 bits of the Group ID and setting the high-order 16 bits of the MAC address to 0x3333. This is shown in Figure 16.7.
Naturally, there are 80 more bits that could be used in the Group ID field. For now, RFC 2373 recommends setting these 80 bits to 0s. If 32 bits of group ID (as many as 4 billion multicast groups) ever proves to be a limitation, the use of these bits can be extended in the future.
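The mapping is even easier to express in code than the IPv4 version. Here is a small C sketch (ours) that produces the frame address for the all-nodes group ff02::1, which we will see on the wire later in the chapter:

#include <stdint.h>
#include <stdio.h>
#include <netinet/in.h>
#include <arpa/inet.h>   /* inet_pton() */

/* Mapping described in the text: the Ethernet destination is 33-33
 * followed by the low-order 32 bits of the IPv6 multicast address. */
static void ipv6_group_to_mac(const struct in6_addr *group, uint8_t mac[6])
{
    mac[0] = 0x33;
    mac[1] = 0x33;
    mac[2] = group->s6_addr[12];
    mac[3] = group->s6_addr[13];
    mac[4] = group->s6_addr[14];
    mac[5] = group->s6_addr[15];
}

int main(void)
{
    struct in6_addr all_nodes;
    uint8_t mac[6];

    inet_pton(AF_INET6, "ff02::1", &all_nodes);    /* the all-nodes group */
    ipv6_group_to_mac(&all_nodes, mac);
    printf("%02X:%02X:%02X:%02X:%02X:%02X\n",      /* 33:33:00:00:00:01 */
           mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return 0;
}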
PIM-SM
The most important multicast routing protocol for the Internet today is PIM sparse mode, defined in RFC 2362. PIM-SM is ideal for a number of reasons, such as its protocol-independent nature (PIM can use regular unicast routing tables for RPF checks and other things), and it's a nice fit with SSM (in fact, not much else fits at all with SSM). So, we'll look at PIM-SM in a little more detail (also because that's what we'll be using on the Illustrated Network's routers).
If a potential receiver is interested in the content of a particular multicast group, it
sends an IGMP Join message to the local router—which must know the location of the
network RPs servicing that group. If the local router is not currently on the distribution tree for that group, the router sends a PIM Join message (not an IGMP message)
through the network until the router becomes a leaf on the shared tree (RPT) to the
RP. Once multicast packets are flowing to the receiver, the routers all check to see if
there is a shorter path from the source to the destination than through the RP. If there
is, the routers will transition the tree from an RPT to an SPT using PIM Join and Prune
messages (technically, they are PIM Join/Prune messages, but it is common to distinguish them). The SPT is rooted at the designated router of the source. All of this is done
transparently to the receivers and usually works very smoothly.
There are other reasons to transition from an RPT to an SPT, even if the SPT is
actually longer than the RPT. An RP might become quite busy, and the shortest path
might not be optimal as determined by unicast routing protocols. A lot of multicast
discussion at ISPs involves issues such as how many RPs there should be (how many
groups should each service?) and where they should be located (near their sources?
centrally?). A related issue is how routers know about RPs (statically? Auto-RP? BSR?),
but these discussions have no clear or accepted answers.
There is only one PIM-SM feature that needs to be explained. How does traffic get
from the sender’s local router to the RP? The rendezvous point could create a tree
directly to every source, but if there are a lot of sources, there is a lot of state information to maintain. It would be better if the senders' local routers could send the content
directly to the RP.
But how? The destination address of all multicast packets is a group address and not
a unicast address. So, the source’s router (actually, the DR) encapsulates the multicast
packets inside a unicast packet sent to the RP and tunnels the packet to the RP in this
form. The RP decapsulates the multicast content and makes it available for distribution
over the RPT tree.
There is much more to PIM-SM that has not been detailed here, such as PIM-SM for
SSM (sometimes seen as PIM-SSM). But it is enough to explain the interplay among host
receivers, IGMP (in IPv4), MLD (in IPv6), PIM itself, the RP, and the source.
The Resource Reservation Protocol and PGM
A lot of books and material on multicast include long discussions of the Resource
Reservation Protocol (RSVP), and some multicast routers and hosts still use RSVP to
signal the network that the multicast packet stream they will be receiving will consume
a certain amount of resources on the network. However, the most common use of RSVP
today is not with multicast but with Multiprotocol Label Switching (MPLS)—and that’s
where we’ll put RSVP.
RSVP makes sense for multicast in a restricted-bandwidth environment. But the
need for RSVP was undermined (as was ATM) by the embarrassment of bandwidth
available on LANs and router backbones (the video network YouTube today uses more
bandwidth than the entire Internet had in 2000). On slow networks, the biggest shortcoming is that you can’t reserve bandwidth you don’t have. If you do anyway, you’re
really just performing admission control (limited to those who are allowed to connect
over the network) and hosing the other applications. Everything works better with
enough bandwidth.
However, this is not to say that multicast is fine using UDP in all cases—especially
when multicast content must cross ISP boundaries, where bandwidth on these heavily
used links is often consumed by traffic. Nothing is more annoying when receiving
multicast content, voice, or video than dropped packets causing screen freezes and
unpredictable silences. So, routers and hosts can use Pragmatic General Multicast
(PGM), described in RFC 3208. PGM occupies the same place in the TCP/IP stack as
TCP itself. PGM runs on sender and receiver hosts, and on routers (which perform the
PGM router assist function).
As mentioned, the goal of PGM is not to make multicast UDP streams as reliable as
TCP. The PGM goal is to allow senders or routers (performing router assist functions)
to supply missing multicast packets if possible (such as for stock-ticker applications)
or to assure receivers that the data is indeed missing and not just delayed (it does this
by simply sequencing multicast packets). The issue is that you have to carry all of this
state information in routers, which is not good for scaling.
Multicast Routing Protocols
Now we can go back to the network. We’ll have to run a multicast routing protocol
on our routers. We’ll run PIM, which is the most popular multicast protocol. But PIM
can be configured in dense “send-everywhere” mode or sparse “only if you ask” mode.
Which should we use?
Let’s consider our router configuration. Nothing is easier to configure than dense
mode. We can just configure PIM dense mode (PIM-DM) to run on every router interface (even the LAN interfaces if we like—the PIM messages won’t hurt anything),
except for the network management interface on Juniper Networks routers (fxp0.0).
Multicast traffic is periodically flooded everywhere and pruned back as IGMP membership reports come in on local area network interfaces. This is just an exercise for our lab
network. You definitely should not try this at home. The following is the configuration
on router CE6:
set protocols pim interface all mode dense;
set protocols pim interface fxp0.0 disable;
It is not necessary to configure IGMP on the LAN interface. As long as PIM is configured, IGMPv2 is run on all interfaces that support broadcasts (including frame relay and
ATM). Of course, if a different version of IGMP—such as IGMPv1 or IGMPv3 (wincli2
was running IGMPv3, as shown in Figure 16.2)—is desired, this must be explicitly
configured.
It is more interesting and meaningful to configure PIM sparse mode, because that
is what is used, with few exceptions, on the Internet. There are two distinct configurations: one for the RP router and the other for all the non-RP routers. We'll use simple
static configuration to locate the RP router, but that’s not what is typically done in the
real world. The configuration on the RP router, which is router PE5 in this example,
follows:
set protocols pim rp local address 192.168.5.1;
set protocols pim interface all mode sparse;
set protocols pim interface fxp0.0 disable;
The local keyword means that the local router is the RP. The address is the RP
address that will be used in PIM messages between the routers. The configuration on
the non-RP router, such as P9, follows:
set protocols pim rp static address 192.168.5.1;
set protocols pim interface all mode sparse;
set protocols pim interface fxp0.0 disable;
The static keyword means that another router is the RP, located at the IP address given.
The RP address is used in PIM messages between the routers.
Once PIM is up and running on the rest of the router network (we don’t need MSDP
because the RP is known everywhere within the merged Best-Ace ISP routing domain
and this precludes interdomain ASM use anyway), wincli2 receives multicast traffic
from wincli1, as shown in Figures 16.8 and 16.9.
FIGURE 16.8
Receiving a stream of multicast traffic from wincli1 across the router network on wincli2.
FIGURE 16.9
ICMPv6 multicast packets for neighbor discovery, showing how the MAC address is embedded in
the IPv6 source address field.
IPv6 Multicast
In contrast to IPv4, where multicast sometimes seems like an afterthought compared
to the usual unicast business of the network, IPv6 is fairly teeming with multicast.
You have to do a lot to add multicast to IPv4, but IPv6 simply will not work without
multicasting. Of course, a lot of this multicast use is confined to single subnets. So,
despite being more heavily used, IPv6 multicast is not necessarily easier to deploy
(even though you don’t have to worry about MSDP).
Figure 16.9 shows a multicast IPv6 neighbor discovery packet, which contains an
ICMPv6 message (an echo request). As expected, the packet is sent to IPv6 multicast
address 0xFF02::1, and the frame is sent to the address beginning 0x33:33.
QUESTIONS FOR READERS
Figure 16.10 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 16.10
A group of routers running PIM sparse mode with sources and receivers. The figure shows multicast sources for Group A and Group B, the RP, the multicast routers running PIM sparse mode, and the LANs with their interested and uninterested hosts.
1. Generally, it is a good idea for RPs to be centrally located on the router network.
Why does this make sense?
2. In Figure 16.10, does the rightmost host, which is interested in Group B content,
have to get it initially from the RP when the source is closer?
3. Would the RP be required if the routers were running PIM dense mode?
4. Will the leftmost router with the uninterested host constantly stream multicast
traffic onto the LAN anyway?
5. Is the uninterested host on the LAN in the middle able to listen in on Group A
and Group B traffic without using IGMP to join the groups?
CHAPTER 17
MPLS and IP Switching
What You Will Learn
In this chapter, you will learn how the desire for convergence has led to the
development of various IP switching techniques. We’ll also compare and contrast
frame relay and ATM switched networks to illustrate the concepts behind IP
switching.
You will learn how MPLS is used to create LSPs to switch (instead of route)
IP packets through a routing domain. We'll see how MPLS can form the basis for a
type of VPN service offering.
One of the reasons TCP/IP and the Internet have grown so popular is that this
architecture is a promising way to create a type of "universal network" well suited
for and equally at home with voice, video, and data. The Internet started as a network
exclusively for data delivery, but has proved to be remarkably adaptable for different
classes of traffic. Some say that more than half of all telephone calls are currently carried
for part of their journey over the Internet, and this percentage will only go higher in
the future. Why not watch an entire movie or TV show over the Internet? Many now
watch episodes they missed on the Internet. Why not everything? As pointed out in
the previous chapter, multicast might not be used to maximum effect for this but video
delivery still works.
When a service provider adds television (or video in general) to Internet access and
telephony, this is called a “triple play” opportunity for the service provider. (Adding wireless services over the Internet is sometimes called a “quadruple play” or “home run.”)
This desire for networking convergence is not new. When the telephone was
invented, there were more than 30 years’ worth of telegraph line infrastructure in
place from coast to coast and in most major cities throughout the United States. The
initial telephone services used existing telegraph links to distribute telegrams, but this
was not a satisfactory solution. The telegraph network was optimized for the dots and
dashes of Morse code, not the smooth analog waveforms of voice. Early attempts to run
voice over telegraph lines stumbled not over bandwidth, but with the crosstalk induced
432
PART III Routing and Routing Protocols
bsdclient
lnxserver
wincli1
winsvr1
em0: 10.10.11.177
MAC: 00:0e:0c:3b:8f:94
(Intel_3b:8f:94)
IPv6: fe80::20e:
cff:fe3b:8f94
eth0: 10.10.11.66
MAC: 00:d0:b7:1f:fe:e6
(Intel_1f:fe:e6)
IPv6: fe80::2d0:
b7ff:fe1f:fee6
LAN2: 10.10.11.51
MAC: 00:0e:0c:3b:88:3c
(Intel_3b:88:3c)
IPv6: fe80::20e:
cff:fe3b:883c
LAN2: 10.10.11.111
MAC: 00:0e:0c:3b:87:36
(Intel_3b:87:36)
IPv6: fe80::20e:
cff:fe3b:8736
Ethernet LAN Switch with Twisted-Pair Wiring
LAN1
Los Angeles
Office
fe-1/3/0: 10.10.11.1
MAC: 00:05:85:88:cc:db
(Juniper_88:cc:db)
IPv6: fe80:205:85ff:fe88:ccdb
ge0/0
50. /3
2
CE0
lo0: 192.168.0.1
BestWireless
in Home
ge0/0
50. /3
1
ink
LL
DS
0
/0/
-0
so 9.2
5
P9
lo0: 192.168.9.1
so-0/0/1
79.2
so-0/0/3
49.2
so0/0
29. /2
2
0
/0/
-0 .1
59
so
PE5
lo0: 192.168.5.1
so
-0
45 /0/2
.2
so
so-0/0/3
49.1
45
-0
.1
/0/
2
P4
lo0: 192.168.4.1
/0
0/0
1
47.
so-0/0/1
24.2
so-
Solid rules ⫽ SONET/SDH
Dashed rules ⫽ Gig Ethernet
Note: All links use 10.0.x.y
addressing...only the last
two octets are shown.
FIGURE 17.1
The routers on the Illustrated Network will be used to illustrate MPLS. Note that we are still dealing with
the merged Best-Ace ISP and a single AS.
by the pulses of Morse code running in adjacent wires. The solution was to twist and
pair telephone wires and maintain adequate separation from telegraph wire bundles.
So, two separate networks grew up: telephone and telegraph. When cable TV came
along much later, the inadequate bandwidth of twisted-pair wire led to a third major
distinct network architecture—this one made of coaxial cable capable of delivering
50 or more (compared to the handful of broadcast channels available, that was a lot)
television channels at the same time.
Naturally, communications companies did not want to pay for, deploy, and maintain
three separate networks for separate services. It was much more efficient to use one
converged infrastructure for everything. Once deregulation came to the telecommunications industry, and the same corporate entity could deliver voice as a telephony company, video as a cable TV company, and data as an ISP, the pressure to find a “universal”
network architecture became intense. But the Internet was not the only universal network intended to be used for the convergence of voice, video, and data over the same
links. Telecommunications companies also used frame relay (FR) and asynchronous
transfer mode (ATM) networks to try to carry voice, video, and data on the same links.
Let’s see if we can “converge” these different applications onto the Illustrated
Network. This chapter will use the Illustrated Network routers exclusively. This is
shown in Figure 17.1, which also reveals something interesting when we run traceroute from bsdclient on LAN1 to bsdserver on LAN2.
bsdclient# traceroute bsdserver
traceroute to bsdserver (10.10.12.77), 64 hops max, 44 byte packets
1 10.10.11.1 (10.10.11.1) 0.363 ms 0.306 ms 0.345 ms
2 10.0.50.1 (10.0.50.1) 0.329 ms 0.342 ms 0.346 ms
3 10.0.45.1 (10.0.45.1) 0.330 ms 0.341 ms 0.346 ms
4 10.0.24.1 (10.0.24.1) 0.332 ms 0.343 ms 0.345 ms
5 10.0.12.1 (10.0.12.1) 0.329 ms 0.342 ms 0.347 ms
6 10.0.16.2 (10.0.16.2) 0.330 ms 0.341 ms 0.346 ms
7 10.10.12.77 (10.10.12.77) 0.331 ms 0.343 ms 0.347 ms
bsdclient#
The packets travel from PE5 to P4 and then on to P2 and PE1. Why shouldn’t they
flow through P9 and P7? Well, they could, but without load balancing turned on (and
it is not) PE5 has to choose P9 or P4 as the next hop. All things being equal, if all other
metrics are the same, routers typically pick the next hop with the lowest IP address. A look at the network
diagram shows this to be the case here.
There are obviously other users on the Best-Ace ISP’s network, not just those on
LAN1 and LAN2. However, it would be nice if the customer-edge (site) routers CE0 and
CE6 were always seven hops away and never any more (in other words, no matter how
traffic is routed there are always six routers between LAN1 and LAN2). This is because
most of the traffic flows between the two sites, as we have seen (on many LANs, vast
quantities of traffic usually flow among a handful of destinations).
Before the rise of the Internet, the company owning LAN1 and LAN2 would pay a
service provider (telephone company or other “common carrier”) to run a point-to-point
link between New York and Los Angeles and use it for data traffic. They might also do the
same for voice, and perhaps even for video conferences between the two sites. The nice
thing about these leased line links (links used exclusively for voice are called tie lines) is
that they make the two sites appear to be directly connected, reducing the number of
hops (and network processing delay) drastically.
But leased lines are an expensive solution (they are paid for by the mile) and are limited in application (they only connect the two sites). What else could a public network
service provider offer as a convergence solution to make the network more efficient?
We’ll take a very brief look at the ideas behind some public network attempts at
convergence (frame relay and ATM) and then see how TCP/IP itself handles the issue.
We’ll introduce Multiprotocol Label Switching (MPLS) and position this technology as
a way to make IP router networks run faster and more efficiently with IP switching.
CONVERGING WHAT?
Convergence does not mean physical convergence through channels, which has been done for
a very long time. Consider a transport network composed of a series of fiber optic links
between SONET/SDH multiplexers. The enormous bandwidth on these links can be
(and frequently is) channelized into multiple separate paths for voice bits, data bits, and
video bits on the same physical fiber. But this is not convergence.
In this chapter convergence means the combination of voice, video, and data on
the same physical channel. Convergence means more than just carrying channels on
the same physical transport. It means combining the bits representing voice, video,
and data into one stream and carrying them all over the total bandwidth on the same
“unchannelized” fiber optic link. If there are voice, video, and data channels on the link,
these are now virtual channels (or logical channels) and originate and terminate in the
same equipment—not only at the physical layer, but at some layer above the lowest.
On modern Metro Ethernet links, the convergence is done by combining the traffic
from separate VLANs on the same physical transport. The VLANs can be established
based on traffic type (voice, video, and data), customer or customer site, or both (with
an inner and outer VLAN label). In this chapter, we'll talk about MPLS—which can
work with VLANs or virtual channels.
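As a rough illustration of the inner and outer tagging idea (a sketch that is not from this chapter's network), the following Python fragment builds a stacked pair of VLAN tags. The TPID values are the standard 802.1Q and 802.1ad ones; the VLAN IDs are invented for the example.

def vlan_tag(vid, pcp=0, tpid=0x8100):
    # One 4-byte VLAN tag: a 16-bit TPID followed by a 16-bit TCI
    # (3-bit priority, 1-bit DEI, 12-bit VLAN ID).
    tci = (pcp << 13) | (vid & 0xFFF)
    return tpid.to_bytes(2, "big") + tci.to_bytes(2, "big")

# The outer (service) tag identifies the customer site and the inner (customer)
# tag the traffic type; VLAN IDs 200 and 10 are made up for illustration.
outer = vlan_tag(200, tpid=0x88A8)   # 802.1ad service tag
inner = vlan_tag(10)                 # 802.1Q customer tag
print((outer + inner).hex())         # 88a800c88100000a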
Fast Packet Switching
Before there was MPLS, there was the concept of fast packet switching to speed up
packet forwarding on converged links and through Internet network nodes. Two major
technologies were developed to address this need, and they are worth at
least a mention because they still exist in some places.
Frame Relay
Frame relay was an attempt to slim down the bulky X.25 protocol stack, the
standard for public packet-switched networks, for the new environment of home
PCs and computers at every work location in an organization. Although it predated
modern layered concepts, X.25 essentially defined the data units at the bottom three
layers—physical interface, frame structure, and packet—as an international standard.
It was mildly successful compared to the Internet, but wildly successful for a world
without the Web and satellite or cell phones. In the mid-1980s, about the only way to
communicate text to an off-shore oil platform or ships at sea was with the familiar but
terse “GA” (go ahead) greeting on a teletype over an X.25 connection.
The problem with X.25 packets (called PLP, Packet Layer Protocol, packets) was
that they weren’t IP packets, and so could not easily share or even interface with the
Internet, which had started to take off when the PC hit town. But IP didn’t have a
popular WAN frame defined (SLIP did not really use frames), so the X.25 Layer 2 frame
structure, High-level Data Link Control (HDLC)—also used in ISDN—was modified to
make it more useful in an IP environment populated by routers. In fact, routers, which
struggled with full X.25 interfaces, could easily add frame relay interfaces.
One of the biggest parts of X.25 dropped on the way to frame relay was hop-by-hop error
recovery. Today, network experts have a more nuanced and sophisticated understanding of how error
detection and recovery should be done than the heavyweight X.25 approach.
Frame relay was once popularly known as “X.25 on steroids,” a choice of analogies
that proved unfortunate for both X.25 and frame relay. But at least frame relay switch
network nodes could relay frames faster than X.25 switches could route packets.
Attempts were made to speed X.25 up prior to the frame relay makeover, such as
allowing a connection-request message to carry data, which was then processed and a
reply returned by the destination in a connection-rejected message, thus making X.25
networks as efficient for some things as a TCP/IP network with UDP. However, an X.25
network was still much more costly to build and operate than anything based on the
simple Internet architecture. The optimization to X.25 that frame relay represented is
shown in Figure 17.2.
Even with frame relay defined, there was still one nagging problem: Like X.25 before
it, frame relay was connection oriented. Only signaling protocol messages were connectionless, and many frame relay networks used “permanent virtual circuits” set up
FIGURE 17.2
How X.25 packet routing relates to frame relaying. Note that frame relay has no network layer,
leaving IP free to function independently.
(The figure contrasts the three layers needed to route X.25 packets, network, data link, and physical, with the two layers, data link and physical, needed to "relay" frame relay frames.)
with a labor-intensive process comparable to configuring router tables with hundreds
of static entries in the absence of mature routing protocols.
Connections were a large part of the reason that X.25 network nodes were switches
and not routers. A network node that handled only frame relay frames was still a
switch, and connections were now defined by a simple identifier in the frame relay
header and called “virtual circuits.” But a connection was still a connection. In the time
it took a frame relay signaling message exchange to set up a connection, IP with UDP
could send a request and receive a reply. Even for bulk data transfer, connections over
frame relay had few attractions compared to TCP for IP.
The frame relay frame itself was tailor-made for transporting IP packets over public
data networks run by large telecommunications carriers rather than privately owned
routers linked by dedicated bandwidth leased by the mile from these same carriers.
The frame relay frame structure is shown in Figure 17.3.
■ DLCI—The Data Link Connection Identifier is a 10-bit field that gives the
connection number.
■ C/R—The Command/Response bit is inherited from X.25 and not used.
■ EA—The Extended Address bit tells whether the byte is the last in the
header (headers in frame relay can be longer than 2 bytes).
■ FECN and BECN—The Forward/Backward Explicit Congestion Notification
bits are used for flow control.
■ DE—The Discard Eligible bit is used to identify frames to discard under
congested conditions.
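To make the bit layout concrete, here is a minimal Python sketch (not router code, and not from the Illustrated Network itself) that unpacks the basic 2-byte address field just described, assuming the layout shown in Figure 17.3. The sample header bytes are invented.

def parse_fr_header(header):
    # Byte 1: DLCI high 6 bits, C/R bit, EA bit (0 = another header byte follows).
    # Byte 2: DLCI low 4 bits, FECN, BECN, DE, EA bit (1 = last header byte).
    b1, b2 = header[0], header[1]
    return {
        "dlci": ((b1 >> 2) << 4) | (b2 >> 4),  # 10-bit connection identifier
        "cr":   (b1 >> 1) & 1,
        "fecn": (b2 >> 3) & 1,
        "becn": (b2 >> 2) & 1,
        "de":   (b2 >> 1) & 1,
        "ea":   b2 & 1,
    }

# A frame on DLCI 25 with the DE bit set (header bytes made up for illustration).
print(parse_fr_header(bytes([0x04, 0x93])))   # dlci 25, de 1, ea 1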
Unlike a connectionless packet, the frame relay frame needs only a connection
identifier to allow network switch nodes to route the frame. In frame relay, this is the
DLCI. A connection by definition links two hosts, source and destination. There is no
FIGURE 17.3
The basic 2-byte frame relay frame and header. The DLCI field can come in larger sizes.
(The frame is delimited by 7E flag bytes and carries a 2-byte address/control header, a payload of up to 4096 bytes, and a 2-byte frame check sequence trailer. The first header byte holds the 6 high-order DLCI bits plus the C/R and EA bits; the second holds the 4 low-order DLCI bits plus the FECN, BECN, DE, and EA bits.)
sense of “send this to DLCI 18” or “this is from DLCI 18.” Frames travel on DLCI 18,
and this implies that connections are inherently unidirectional (which they are, but
are usually set up and released in pairs) and that the connection identifiers in each
direction did not have to match (although they typically did, just to keep network
operators sane).
One of the things that complicate DLCI discussions is that unlike globally unique IP
addresses, DLCIs have local significance only. This just means that the DLCI on a frame
relay frame sent from site A on DLCI 25 could easily arrive at site B on DLCI 38. And
in between, the frame could have been passed around the switches as DLCI 18, 44, or
whatever. Site A only needs to know that the local DLCI 25 leads to site B, and site B
needs to know that DLCI 38 leads to site A, and the entire scheme still works. But it is
somewhat jarring to TCP/IP veterans.
This limits the connectivity from each site to the number of unique DLCIs that
can operate at any one time, but the DLCI header field can grow if this becomes a
problem. And frame relay connections were never supposed to be used all of the
time.
What about adding voice and video to frame relay? That was actually done, especially with voice. Frame relay was positioned as a less expensive way of linking an organization’s private voice switches (called private branch exchanges, or PBXs) than with
private voice circuits. Voice was not always packetized, but at least it was “framerized”
over these links. If the links had enough bandwidth, which was not always a given,
primitive videoconferencing (but not commercial-quality video signals that anyone
would pay to view) could be used as well.
Frame relay suffered from three problems, which proved insurmountable. First, it was
not particularly IP friendly, so frame relay switches (which did not run normal IP routing protocols) could not react to TCP/IP network conditions the way routers could.
The routers and switches remained "invisible" to each other. Second, in spite of efforts to
integrate voice and video onto the data network, frame relay was first and foremost a
data service and addressed voice and video delay concerns by grossly overconfiguring bandwidth in almost all cases. Finally, the telecommunications carriers (unlike the
ISPs) resisted easy interconnection of their frame relay networks with those of other carriers, which forced even otherwise eager customers to try to do everything with one
carrier (an often impossible task). It was a little like cell phones without any possibility
of roaming, and in ironic contrast to the carriers' own behavior as ISPs, this closed
environment was not what customers wanted or needed.
Frame relay still exists as a service offering. However, beyond being just another type of
router WAN interface, frame relay has little impact on the Internet or IP world.
Asynchronous Transfer Mode
The Asynchronous Transfer Mode (ATM) was the most ambitious of all convergence
methods. It had to be, because what ATM essentially proposed was to throw everything
out that had come before and to “Greenfield” the entire telecommunications structure
the world over. ATM was part of an all-encompassing vision of networking known as
broadband ISDN (B-ISDN), which would support all types of voice, video, and data
applications through virtual channels (and virtual connections). In this model, the Internet would yield to a global B-ISDN network—and TCP/IP to ATM.
Does this support plan for converged information sound familiar? Of course it does.
It’s pretty much what the Internet and TCP/IP do today, without B-ISDN or ATM. But
when ATM was first proposed, the Internet and TCP/IP could do none of the things
that ATM was supposed to do with ease. How did ATM handle the problems of mixing
support for bulk data transfer with the needs of delay-sensitive voice and bandwidth-hungry (and delay-sensitive) video?
ATM was the international standard for what was known as cell relay (there were
cell relay technologies other than ATM, now mostly forgotten). The cell relay name
seems to have developed out of an analogy with frame relay. Frame relay “relayed”
(switched) Layer 2 frames through network nodes instead of independently routing
Layer 3 packets. The efficiency of doing it all at a lower layer made the frame relay node
faster than a router could have been at the time.
Cell relay took it a step further, doing everything at Layer 1 (the actual bit level).
But there was no natural data unit at the physical layer, just a stream of bits. So, they
invented one 53 bytes long and called it the "cell"—apparently by analogy with the
cell in the human body, which is very small and generic, and from which everything else is
built up. Technically, in data protocol stacks, cells are a "shim" layer slipped
between the bits and the frames, because both bits and frames are still needed in hardware and software at source and destination.
Cell relay (ATM) “relayed” (switched) cells through network nodes. This could be
done entirely in hardware because cells were all exactly the same size. Imagine how
fast ATM switches would be compared to slow Layer 3 routers with two more layers
to deal with! And ATM switches had no need to allocate buffers in variable units, or to
clean up fragmented memory. The structure of the 5-byte ATM cell header is shown in
Figure 17.4 (field descriptions follow). The cell payload is always 48 bytes long.
FIGURE 17.4
The ATM cell header. Note the larger VPI fields on the network (NNI) version of the header.
(Both the UNI and NNI cell headers are 5 octets long. The UNI header carries GFC, VPI, VCI, PTI, CLP, and HEC fields; the NNI header drops the GFC and gives those 4 bits to a larger VPI.)
■ GFC—The Generic Flow Control is a 4-bit field used between a customer site and
an ATM switch, on the User-Network Interface (UNI). It is not present on the
Network–Network Interface (NNI) between ATM switches.
■ VPI—The Virtual Path Identifier is an 8- or 12-bit field used to identify paths between
sites on the ATM network. It is larger on the NNI to accommodate aggregation of
customer paths.
■ VCI—The Virtual Connection Identifier is a 16-bit field used to identify paths between
individual devices on the ATM network.
■ PTI—The Payload Type Indicator is a 3-bit field used to identify one of eight traffic
types carried in the cell.
■ CLP—The Cell Loss Priority bit serves the same function as the DE bit in frame relay,
and identifies cells to discard when congestion occurs.
■ HEC—The Header Error Control byte not only detects bit errors in the entire
40-bit header, but can also correct single bit errors.
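As with the frame relay header, a small Python sketch (again invented for illustration, not taken from the text) shows how the 5-byte UNI header breaks down. The HEC is left at zero here rather than computed as a real CRC-8.

def parse_atm_uni_header(header):
    # UNI layout: GFC (4 bits), VPI (8), VCI (16), PTI (3), CLP (1), HEC (8).
    # The NNI header simply folds the GFC bits into a 12-bit VPI.
    b0, b1, b2, b3, b4 = header
    return {
        "gfc": b0 >> 4,
        "vpi": ((b0 & 0x0F) << 4) | (b1 >> 4),
        "vci": ((b1 & 0x0F) << 12) | (b2 << 4) | (b3 >> 4),
        "pti": (b3 >> 1) & 0x7,
        "clp": b3 & 1,
        "hec": b4,
    }

# A cell on VPI 5, VCI 42 with the CLP bit set (header bytes made up for illustration).
print(parse_atm_uni_header(bytes([0x00, 0x50, 0x02, 0xA1, 0x00])))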
In contrast to frame relay, the ATM connection identifier was a two-part virtual path
identifier (VPI) and virtual channel identifier (VCI). Loosely, VPIs were for connections
between sites and VCIs were for connections between devices. ATM switches could
“route” cells based on the VPI, and the local ATM switch could take care of finding the
exact device for which the cell was destined.
Like frame relay DLCIs, ATM VPI/VCIs have local significance only. That is, the VPI/
VCI values change as the cells make their way from switch to switch and depending on
direction. Both frame relay and ATM switches essentially take a data unit in on an input
port, look up the header (DLCI or VPI/VCI label) in a table, and output the data unit
on the port indicated in the table—but also with a new label value, also provided by
the table.
This distinctive label-swapping is characteristic of switching technologies and
protocols. And, as we will see later, switching has come to the IP world with MPLS,
which takes the best of frame relay and ATM and applies it directly to IP without the
burden of “legacy” stacks (frame relay) or phantom applications (ATM and B-ISDN).
The tiny 48-byte payload of the ATM cell was intentional. It made sure that no delay-sensitive bits got stuck in a queue behind some monstrous chunk of data a thousand
times larger than the 48 voice or video bytes. Such “serialization delay” introduced
added delay and delay variation (jitter) that rendered converged voice and video almost
useless without more bandwidth than anyone could realistically afford. With ATM, all
that data encountered was a slightly elevated delay when data cells shared the total bandwidth with voice and video. But because few applications did anything with data (such
as a file) before the entire group of bits was transferred intact, ATM pioneers deemed
this a minor inconvenience at worst.
All of this sounded too good to be true to a lot of networking people, and it turned
out that it was. The problem was not with raw voice and video, which could be molded
into any form necessary for transport across a network. The issue was with data, which
came inside IP packets and had to be broken down into 48-byte units—each of which
had a 5-byte ATM cell header, and often a footer that limited it to only 30 bytes.
This was an enormous amount of overhead for data applications, which normally
added 3 or 4 bytes to an Ethernet frame for transport across a WAN. Naturally, no hardware existed to convert data frames to cells and back—and software was much too
slow—so this equipment had to be invented. Early results seemed promising, although
the frame-to-cell-and-back process was much more complex and expensive than anticipated. But after ATM caught on, prices would drop and efficiencies would be naturally
discovered. Once ATM networks were deployed, the B-ISDN applications that made the
most of them would appear. Or so it seemed.
However, by the early 1990s it turned out that making cells out of data frames was
effective as long as the bandwidth on the link used to carry both voice and video
along with the data was limited to less than that needed to carry all three at once.
In other words, if the link was limited to 50 Mbps and the voice and video data added
up to 75 Mbps, cells made sense. Otherwise, variable-length data units worked just fine.
Full-motion video was the killer at the time, with most television signals needing about
45 Mbps (and this was not even high-definition TV). Not only that, but it turned out that
the point of diminishing ATM returns (the link bandwidth at which it became slower
and more costly to make cells than simply send variable-length data units) was about
622 Mbps—lower than most had anticipated.
Of course, one major legacy of the Internet bubble was an abundance of underutilized
fiber optic links running at more than 45 Mbps, and in many cases greatly in excess of
622 Mbps. And digital video could produce stunning images with less and less bandwidth as time went on. And in that world, in many cases, ATM was left as a solution
without a problem. ATM did not suffer from lack of supporters, but it proved to be the
wrong technology to carry forward as a switching technology for IP networks.
Why Converge on TCP/IP?
Some of the general reasons TCP/IP has dominated the networking scene have been
mentioned in earlier chapters. Specifically, none of the “new” public network technologies were particularly TCP/IP friendly—and some seemed almost antagonistic. ATM
cells, for instance, would be a lot more TCP/IP friendly if the payload were 64 bytes
instead of 48 bytes. At least a lot of TCP/IP traffic would fit inside a single ATM cell
intact, making processing straightforward and efficient.
At 48 bytes, everything in TCP/IP had to be broken up into at least two cells. But the
voice people wanted the cell to be 32 bytes or smaller, in order to keep voice delays as
short as possible. It may be only a coincidence that 48 bytes is halfway between 32 and
64 bytes, but a lot of times reaching a compromise instead of making a decision annoys
both parties and leaves neither satisfied with the result. So, ATM began as a standard
by alienating the two groups (voice and data) that were absolutely necessary to make
ATM a success.
But the real blow to ATM came because a lot of TCP/IP traffic would not fit into
64-byte cells. ACKs would fit well, but TCP/IP packet sizes tend to follow a bimodal
distribution with two distinct peaks at about 64 and between 1210 and 1550 bytes.
The upper cluster is smaller and more spread out, but this represents the vast bulk of
all traffic on the Internet.
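The "cell tax" behind this argument is easy to estimate. The short Python sketch below compares the overhead for packets near the two peaks of the distribution; the 8-byte trailer stands in for the "footer" mentioned earlier and, like padding the last cell to a full 48 bytes, is an assumption made for the sake of the arithmetic.

import math

def atm_cell_tax(packet_bytes, trailer=8):
    # Cells needed once the packet plus trailer is split into 48-byte payloads,
    # each behind a 5-byte cell header; the last cell is padded to 48 bytes.
    cells = math.ceil((packet_bytes + trailer) / 48)
    return 100.0 * (cells * 53 - packet_bytes) / packet_bytes

for size in (64, 576, 1500):
    print(f"{size:4d}-byte packet -> about {atm_cell_tax(size):.0f}% overhead")

Even for the large packets that carry most of the traffic, the overhead stays above 10 percent, compared with the 3 or 4 bytes of WAN framing mentioned earlier.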
Then new architectures allowed otherwise normal IP routers to act like frame relay
and ATM switches with the addition of IP-centric MPLS. Suddenly, all of the benefits
of frame relay and ATM could be had without using unfamiliar and special equipment
(although a router upgrade might be called for).
MPLS
Rather than adding IP to fast packet switching networks, such as frame relay and ATM,
MPLS adds fast packet switching to IP router networks. We’ve already talked about
some of the differences between routing (connectionless networks) and switching
networks in Chapter 13. Table 17.1 makes the same type of comparisons from a different perspective.
The difference in the way CoS is handled is the major issue where convergence is
concerned. Naturally, the problem is to find the voice and video packets in the midst of
the data packets and make sure that delay-sensitive packets are not fighting for bandwidth
along with bulk file transfers or email. This is challenging in IP routers because there is no
fixed path set up through the network to make it easy to enforce QoS at every hop along
the way. But switching uses stable paths, which makes it easy to determine exactly which
routers and resources are consumed by the packet stream. QoS is also challenging because
you don’t have administrative control over the routers outside your own domain.
MPLS and Tunnels
Some observers do not apply the term “tunnel” to MPLS at all. They reserve the term
for wholesale violations of normal encapsulations (packet in frame in a packet, for
example). MPLS uses a special header (sometimes called a “shim” header) between
packet and frame header, a header that is not part of the usual TCP/IP suite layers.
However, RFCs (such as RFC 2547 and 4364) apply the tunnel terminology
to MPLS. MPLS headers certainly conform to general tunnel “rules” about stack
encapsulation violations. This chapter will not dwell on “MPLS tunnel” terminology but will not avoid the term either. (This note also applies to MPLS-based VPNs,
discussed in Chapter 26.)
But QoS enforcement is not the only attraction of MPLS. There are at least two
others, and probably more. One is the ability to do traffic engineering with MPLS, and
the other is that MPLS tunnels form the basis for a certain virtual private network
(VPN) scheme called Layer 3 VPNs. There are also Layer 2 VPNs, and we’ll look at them
in more detail in Chapter 26.
MPLS uses tunnels in the generic sense: The normal flow of the layers is altered at one
point or another, typically by the insertion of an “extra” header. This header is added at
one end router and removed (and processed) at the other end. In MPLS, routers form the
Table 17.1 Comparing Routing and Switching on a WAN

Characteristic            Routing                                   Switching
Network node              Router                                    Switch
Traffic flow              Each packet routed independently          Each data unit follows same path
                          hop by hop                                through network
Node coordination         Routing protocols share information       Signaling protocols set up paths
                                                                    through network
Addressing                Global, unique                            Label, local significance
Consistency of address    Unchanged source to destination           Label is swapped at each node
QoS                       Challenging                               Associated with path
FIGURE 17.5
The rationale for MPLS. The LSP forms a "shortcut" across the routing network for transit traffic.
The Border Router knows right away, thanks to BGP, that the packet for 10.10.100.0/24 must exit
at the other border router. Why route it independently at every router in between?
(The figure shows a packet for 10.10.100.0/24 entering one of the ISP's border routers from the upstream ISP, crossing several interior routers, and leaving through the other border router toward the downstream ISP that holds network 10.10.100.0/24, along with many more.)
endpoints of the tunnels. In MPLS, the header is called a label and is placed between the
IP header and the frame headers—making MPLS a kind of “Layer 2 and a half” protocol.
MPLS did not start out to be the answer to everyone’s dream for convergence or
traffic engineering or anything else. MPLS addressed a simple problem faced by every
large ISP in the world, a problem shown in Figure 17.5.
MPLS was conceived as a sort of BGP “shortcut” connecting border routers across
the ISP. As shown in the figure, a packet bound for 10.10.100.0/24 entering the border
router from the upstream ISP is known, thanks to the IBGP information, to have to exit
the ISP at the other border router. In practice, of course, this will apply to many border
routers and thousands of routes (usually most of them), but the principle is the same.
Only the local packets with destinations within the ISP technically need to be
routed by the interior routers. Transit packets can be sent directly to the border router,
FIGURE 17.6
The 32-bit MPLS label fields. Note the 3-bit CoS field, which is often related to the IP ToS header.
The label field is used to identify flows that should be kept together as they cross the network.
(The 32-bit MPLS label sits between the Layer 2 header, a PPP header in this example, and the IP packet. It holds a 20-bit label, a 3-bit CoS field, a 1-bit S flag, and an 8-bit TTL.)
if possible. MPLS provides this mechanism, which works with BGP to set up tunnels
through the ISP between the border routers (or anywhere else the ISP decides to use
them).
The structure of the label used in MPLS is shown in Figure 17.6. In the figure,
it is shown between a Layer 2 PPP frame and the Layer 3 IP packet (which is very
common).
■ Label—This 20-bit field identifies the packets included in the "flow" through the
MPLS tunnel.
■ CoS—Class-of-Service is a 3-bit field used to classify the data stream into one of
eight categories.
■ S—The Stack bit lets the router know if another label is stacked after the
current 32-bit label.
■ TTL—The Time-to-Live is an 8-bit field used in exactly the same way as the IP
packet header TTL. This value can be copied from or into the IP packet or used
in other ways.
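Because the label entry is a fixed 32-bit word, packing and unpacking it is just bit shifting, as the brief Python sketch below shows (not from the text; the CoS and TTL values in the example are arbitrary).

def parse_mpls_label(word):
    # 32-bit label entry: label (20 bits), CoS (3 bits), S (1 bit), TTL (8 bits).
    return {
        "label": (word >> 12) & 0xFFFFF,
        "cos":   (word >> 9) & 0x7,
        "s":     (word >> 8) & 0x1,
        "ttl":   word & 0xFF,
    }

def build_mpls_label(label, cos=0, s=1, ttl=64):
    return (label << 12) | (cos << 9) | (s << 8) | ttl

# Label 1023 at the bottom of the stack (the value pushed by the ingress router
# later in the chapter); CoS 0 and TTL 64 are arbitrary choices here.
word = build_mpls_label(1023)
print(hex(word), parse_mpls_label(word))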
Certain label values and ranges have been reserved for MPLS. These are outlined in
Table 17.2.
The MPLS architecture is defined in RFC 3031, and MPLS label stacking is defined in
RFC 3032 (more than one MPLS label can precede an IP packet). General traffic engineering in MPLS is described in RFC 2702, and several drafts add details and features
to these basics.
What does it mean to use traffic engineering on a router network? Consider the
Illustrated Network. We saw that traffic from LAN1 to LAN2 flows through backbone
routers P4 and P2 (reverse traffic also flows this way). But notice that P2 and P4 also
have links to and from the Internet. A lot of general Internet traffic flows through routers P2 and P4 and their links, as well as LAN1 and LAN2 traffic.
Table 17.2 MPLS Label Values and Their Uses

Value or Range                 Use
0                              IPv4 Explicit Null. Must be the last label (no stacking). Receiver
                               removes the label and routes the IPv4 packet inside.
1                              Router Alert. The IP packet inside has information for the
                               router itself, and the packet should not be forwarded.
2                              IPv6 Explicit Null. Same as label 0, but with IPv6 inside.
3                              Implicit Null. A "virtual" label that never appears in the
                               label itself. It is a table entry to request label removal by the
                               downstream router.
4–15                           Reserved.
16–1023 and 10000–99999        Ranges used in Juniper Networks routers to manually configure
                               MPLS tunnels (not used by the signaling protocols).
1024–9999                      Reserved.
100000–1048575                 Used by signaling protocols.
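Restated as code, the table amounts to a simple range check. The small Python helper below just mirrors Table 17.2.

def classify_label(value):
    # Follows Table 17.2; label values are 20 bits, so 0 through 1,048,575.
    if value == 0:
        return "IPv4 Explicit Null"
    if value == 1:
        return "Router Alert"
    if value == 2:
        return "IPv6 Explicit Null"
    if value == 3:
        return "Implicit Null (a request to pop; never appears in the label itself)"
    if 4 <= value <= 15 or 1024 <= value <= 9999:
        return "Reserved"
    if 16 <= value <= 1023 or 10000 <= value <= 99999:
        return "Manual configuration range (Juniper Networks routers)"
    if 100000 <= value <= 1048575:
        return "Used by signaling protocols"
    raise ValueError("not a 20-bit label value")

print(classify_label(1023), "/", classify_label(250000))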
So, it would make sense to “split off” the LAN1 and LAN2 traffic onto a less utilized
path through the network (for example, from PE5 to P9 to P7 to PE1). This will ease
congestion and might even be faster, even though in some configurations there might
be more hops (for example, there might be other routers between P9 and P7).
Why Not Include CE0 and CE6?
Why did we start the MPLS tunnels at the provider-edge routers instead of directly
at the customer edge, on the premises? Actually, as long as the (generally) smaller
site routers support the full suite of MPLS features and protocols there’s no reason
the tunnel could not span LAN to LAN.
However, MPLS traditionally begins and ends in the “provider cloud”—usually
on the PE routers, as in this chapter. This allows the customer routers to be more
independent and less costly, and allows reconfiguration of MPLS without access to
the customer’s routers. Of course, in some cases the customer might want ISP to
handle MPLS management—and then the CE routers certainly could be included
on the MPLS path.
There are ways to do this with IGPs, such as OSPF and IS–IS, by adjusting the link
metrics, but these solutions are not absolute and have global effects on the network.
In contrast, an MPLS tunnel can be configured from PE5 to PE1 through P9 and P7 and
only affect the routing on PE5 and PE1 that involves LAN1 and LAN2 traffic, exactly the
effect that is desired.
MPLS Terminology
Before looking at how MPLS would handle a packet sent from LAN1 to LAN2 over an
MPLS tunnel, we should look at the special terminology involved with MPLS. In no
particular order, the important terms are:
LSP—We’ve been calling them tunnels, and they are, but in MPLS the tunnel is
called a label-switched path. The LSP is a unidirectional connection following
the same path through the network.
Ingress router—The ingress router is the start of the LSP and where the label is
pushed onto the packet.
Egress router—The egress router is the end of the LSP and where the label is
popped off the packet.
Transit or intermediate router—There must be at least one transit (sometimes
called intermediate) router between ingress and egress routers. The transit
router(s) swaps labels and replaces the incoming values with the outgoing
values.
Static LSPs—These are LSPs set up by hand, much like permanent virtual circuits
(PVCs) in FR and ATM. They are difficult to change rapidly.
Signaled LSPs—These are LSPs set up by a signaling protocol used with MPLS
(there are two) and are similar to switched-virtual circuits (SVCs) in FR
and ATM.
MPLS domain—The collection of routers within a routing domain that starts and
ends all LSPs forms the MPLS domain. MPLS domains can be nested, and can be
a subset of the routing domain itself (that is, not all routers have to understand MPLS; only those on the LSP).
Push, pop, and swap—A push adds a label to an IP packet or another MPLS label.
A pop removes and processes a label from an IP packet or another MPLS label.
A swap is a pop followed by a push and replaces one label by another (with
different field values). Multiple labels can be added (push push . . .) or removed
(pop pop . . .) at the same time.
Penultimate hop popping (PHP)—Many LSPs can terminate at the same border router. This router must not only pop and process all the labels but route
all packets inside, plus all other packets that arrive from within the ISP. To
ease the load of this border router, the router one hop upstream from the
egress router (known as the penultimate router) can pop the label and simply
route the packet to the egress router (it must be one hop, so the effect is the
same). PHP is an optional feature of LSPs, and keep in mind that the LSP is still
considered to terminate at the egress router (not at the penultimate).
Constrained path LSPs—These are traffic engineering (TE) LSPs set up by a
signaling protocol that must respect certain TE constraints imposed on the
network with regard to delay, security, and so on. TE is the most intriguing
aspect of MPLS.
IGP shortcuts—Usually, LSPs are used in special router tables and only available to
routes learned by BGP (transit traffic). Interior Gateway Protocol (IGP) shortcuts allow LSPs to be installed in the main routing table and used by traffic
within the ISP itself, routes learned by OSPF or another IGP.
Signaling and MPLS
There are two signaling protocols that can be used in MPLS to automatically set up
LSPs without human intervention (other than configuring the signaling protocols
themselves!). The Resource Reservation Protocol (RSVP) was originally invented to set
up QoS “paths” from host to host through a router network, but it never scaled well or
worked as advertised. Today, RSVP has been defined in RFC 3209 as RSVP for TE and is
used as a signaling protocol for MPLS. RSVP is used almost exclusively as RSVP-TE (most
people just say RSVP) by routers to set up LSPs (explicit-path LSPs), but can still be used
for QoS purposes (constrained-path LSPs).
The Label Distribution Protocol (LDP), defined in RFC 3212, is used exclusively with
MPLS but cannot be used for adding QoS to LSPs other than using simple constraints
when setting up paths. On the other hand, LDP is trivial to configure compared to RSVP.
This is because LDP works directly from the tables created by the IGP (OSPF or IS–IS).
The lack of QoS support in LDP follows from how LDP works: LDP paths are created directly
from the IGP table and exist simply because of IGP adjacency. In
addition, LDP does not offer much if your routing platform can forward packets almost
as fast as it can switch labels. Today, the use of LDP for traffic engineering (CR-LDP) is deprecated (see the admonitions in
RFC 3468) in favor of RSVP-TE.
A lot of TCP/IP texts spend a lot of time explaining how RSVP-TE works (they deal
with LDP less often). This is more of an artifact of the original use of RSVP as a host-based protocol. It is enough to note that RSVP messages are exchanged between all
routers along the LSP from ingress to egress. The LSP label values are determined, and
TE constraints respected, hop by hop through the network until the LSP is ready for
traffic. The process is quick and efficient, but only a few parameters (such as interval
timers) can be configured, even on routers, to change RSVP operation significantly—and
none at all on hosts.
Although not discussed in detail in this introduction to MPLS, another protocol is
commonly used for MPLS signaling, as described in RFC 2547bis. BGP is a routing protocol, not a signaling protocol, but the extensions used in multiprotocol BGP (MPBGP)
make it well suited for the types of path setup tasks described in this chapter. With
MPBGP, it is possible to deploy BGP- and MPLS-based VPNs without the use of any other
signaling protocol. LSPs are established based on the routing information distributed by
MPBGP from PE to PE. MPBGP is backward compatible with “normal” BGP, and thus use
of these extensions does not require a wholesale upgrade of all routers at once.
Label Stacking
Of all the MPLS terms outlined in the previous section, the one that is essential to
understand is the concept of “nested” LSPs; that is, LSPs which include one or more
other LSPs along their path from ingress to egress. When this happens, there will be
more than one label in front of the IP packet for at least part of its journey.
It is common for many large ISPs to stack three labels in front of an IP packet. Often,
the end of two LSPs is at the same router and two labels are pushed or popped at once.
The current limit is eight labels.
There are several instances where this stacking ability comes in handy. A larger ISP
can buy a smaller ISP and simply “add” their own LSPs onto (outside) the existing ones.
In addition, when different signaling protocols are used in core routers and border
routers, these domains can be nested instead of discarding one or the other.
The general idea of nested MPLS domains with label stacking is shown in Figure 17.7.
There are five MPLS domains, each with its own way of setting up LSPs: static, RSVP,
and LDP. The figure shows the number of labels stacked at each point and the order
FIGURE 17.7
MPLS domains, showing how the domains can be nested or chained, and how multiple labels
are used.
(Five MPLS domains are shown, using static, RSVP, and LDP label setup. Inside nested Domain 2 the packet carries two stacked labels (MPLS2, MPLS1, IP); inside Domain 4 it carries three stacked labels (MPLS4, MPLS3, MPLS1, IP); and inside Domain 5 it carries three stacked labels (MPLS5, MPLS3, MPLS1, IP).)
they are stacked in front of the packet. All of the routers shown (in practice, there will
be many more) pop and process multiple labels. MPLS domains can be nested for geographical, vendor, or organizational reasons as well.
MPLS and VPNs
MPLS forms the basis for many types of VPNs used on IP networks today, especially
Layer 3 VPNs. LSPs are like the PVCs and SVCs that formed “virtually private” links
across a shared public network such as FR or ATM. LSPs are not really the same as
private leased-line links, but they appear to be to their users.
Of course, while the path is constrained, the MPLS-based Layer 3 VPN is not actually
doing anything special to secure the content of the tunnel or to protect its integrity. So,
this “security” value is limited to constraining the path. This reduces the places where
snooping or injection can occur, but it does not replace other Layer 3 VPN technology
for security (such as IPSec, discussed in Chapter 29).
Nevertheless, VPNs are often positioned as a security feature on router networks.
This is because, as with "private" circuits, hackers cannot hack into the middle of an LSP
(VPN) just by spoofing packets. There are labels to be dealt with, often nested labels.
The ingress and egress routers are more vulnerable, but it’s not as easy to harm VPNs or
the sites they connect as it is to disrupt “straight” router networks.
So, VPNs have a lot in common with MPLS and LSPs—except that the terms are
different! For example, the transit routers in MPLS are now provider (P) routers in
VPNs. VPNs are discussed further in the security chapters.
MPLS Tables
The tables used to push, pop, and swap labels in multiprotocol label switching are different from the tables used to route packets. This makes sense: MPLS uses switching,
and packets are routed.
Most MPLS tables are little more than long lists of labels with two key pieces of
information attached: the output interface to the next-hop router on the LSP and the
new value of the label. Other pieces of information can be added, but this is the absolute minimum.
What does an MPLS switching table look like? Suppose we set up an LSP between
LAN1 and LAN2 to carry packets from PE5 to PE1 through backbone routers P9 and P7
instead of through P4 and P2.
Figure 17.8 shows how the MPLS switching tables might be set up to switch a
packet from LAN1 to LAN2. Note that this has nothing to do with routed traffic going
back from LAN2 to LAN1! (In the real world, we would set up an LSP going from LAN2
to LAN1 as well.)
FIGURE 17.8
Label tables for a static LSP from PE5 (ingress) to PE1 (egress).
(PE5, the ingress router facing 10.10.11/24, pushes label 1023 and outputs on the 10.0.59/24 link to transit router P9. P9 swaps 1104 for 1023 and outputs on the 10.0.79/24 link to transit router P7. P7 swaps 1253 for 1104 and outputs on the 10.0.17/24 link to egress router PE1, which pops label 1253 and routes the packet to 10.10.12/24.)
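The label tables in Figure 17.8 can also be modeled in a few lines of Python (a sketch, not router configuration). The ingress entry is keyed on the destination prefix for LAN2, while the transit and egress entries are keyed on the incoming label; the next-hop descriptions use the addresses that appear in the configurations that follow.

tables = {
    "PE5": {"10.10.12.0/24": ("push", 1023, "to P9 at 10.0.59.2")},
    "P9":  {1023:            ("swap", 1104, "to P7 at 10.0.79.1")},
    "P7":  {1104:            ("swap", 1253, "to PE1 at 10.0.17.1")},
    "PE1": {1253:            ("pop",  None, "route to 10.10.12.0/24")},
}

label = None
for router in ("PE5", "P9", "P7", "PE1"):
    key = "10.10.12.0/24" if label is None else label   # route lookup at the ingress only
    action, new_label, where = tables[router][key]
    print(router, action, new_label if new_label else "", where)
    label = new_label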
CONFIGURING MPLS USING STATIC LSPS
Let’s build the static LSP from LAN1 to LAN2 from PE5 to P9 to P7 to PE1 that was shown
in Figure 17.8. Then we’ll show how that affects the routing table entries and run a
traceroute for packets sent from 10.10.11.0/24 (LAN1) to 10.10.12.0/24 (LAN2).
The Ingress Router
Let’s start by configuring the LSP on PE5, the ingress router, so that packets from LAN1’s
address space get an MPLS label value of 1023 and are sent to 10.0.59.2 as a next hop
on the link to P9 (so-0/0/0).
set protocols mpls static-path LAN1-to-LAN2 10.10.12.0/24 next-hop 10.0.59.2;
set protocols mpls static-path LAN1-to-LAN2 10.10.12.0/24 push 1023;
set protocols mpls static-path LAN1-to-LAN2 interface so-0/0/0;
Once the configuration is committed, the static LSP naturally shows up as a static route (signaled LSPs are referenced by the signaling protocol, RSVP or LDP).
user@PE5# show route table inet.0 protocol static
10.10.12.0/24
*[Static/5] 00:01:42
> to 10.0.59.2 via so-0/0/0. push 1023
The Transit Routers
This is how the LSP is configured on P9, the first transit (or intermediate) router.
set protocols mpls interface so-0/0/0 label-map 1023 next-hop 10.0.79.1;
set protocols mpls interface so-0/0/0 label-map 1023 swap 1104;
CHAPTER 17 MPLS and IP Switching
451
Note that this table is not organized by destination, as on the PE router, but by
the interface that the MPLS data unit arrives on. There can be many labels, but this
“label map” looks for 1023, swaps it for label 1104, and forwards it to 10.0.79.1. Note
that there was no need to look anything up in the main routing table (in Juniper
Networks routers, the interface addresses are held in hardware). Transit LSPs are
identified by the use of swap in the static route entry, but this time in the MPLS "label
table," mpls.0.
user@P9# show route table mpls.0 protocol static
1023
*[Static/5] 00:01:57
> to 10.0.79.1 via so-0/0/1. swap 1104
The link to P7 is so-0/0/1, as expected. The configuration on the P7, the second transit
router, is very similar.
set protocols mpls interface so-0/0/1 label-map 1104 next-hop 10.0.17.1;
set protocols mpls interface so-0/0/1 label-map 1104 swap 1253;
If we wanted to configure PHP, this is the router where we would enable it. The
statement swap 3 is the “magic word” that enables PHP. MPLS label value 3 says to the
local router, “Don’t really push a 3 on the packet, but instead pop the label and route
the packet inside.” The use of the label at least makes it easier to remember that the end
of the LSP is really on PE1.
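For example, if PHP were wanted on this LSP, the second label-map statement on P7 (the penultimate router here) would swap to the reserved value 3 instead of 1253. This is a sketch reusing the label-map syntax shown above, not a configuration taken from the text:

set protocols mpls interface so-0/0/1 label-map 1104 swap 3;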
The Egress Router
The configuration on the egress router, PE1, is essentially the opposite of that on the
ingress router but more similar to that on a transit router.
set protocols mpls interface so-0/0/2 label-map 1253 next-hop 10.10.12.0/24;
set protocols mpls interface so-0/0/2 label-map 1253 pop;
There is no need to tell the router what label value to pop: if it got this far, the label
value is 1253. Note that the next hop is the IP address of LAN2, which is the entire
point of the exercise. When PHP is used, there is no need for a label map for that LSP
on the egress router. When PHP is not used, the egress LSPs are identified by the use of
pop in the static route entry in mpls.0.
user@PE1# show route table mpls.0 protocol static
1253
*[Static/5] 00:02:17
> to 10.10.12.0/24 via ge-0/0/3. pop
Static LSPs are fine, but offer no protection at all against link failure. And consider
how many interfaces, labels, and other information have to be maintained and entered
by hand. In MPLS classes, most instructors make students suffer through a complex
static LSP configuration (some of which never work correctly) before allowing the use
of RSVP-TE and LDP to "automatically" set up LSPs anywhere or everywhere. It is a lesson that is not soon forgotten. (In fact, dynamic LSP configuration using RSVP-TE is so
simple that it is not even used as an example in this chapter.)
Traceroute and LSPs
How do we know that our static LSP is up and running properly? A ping that works
proves nothing about the LSP because it could have been routed, not switched. Even
one that fails proves nothing except the fact that something is broken.
But traceroute is the perfect tool to see if the LSP is up and running correctly. The
following is what it looked like before we configured the LSP.
bsdclient# traceroute bsdserver
traceroute to bsdserver (10.10.12.77), 64 hops max, 44 byte packets
1 10.10.11.1 (10.10.11.1) 0.363 ms 0.306 ms 0.345 ms
2 10.0.50.1 (10.0.50.1) 0.329 ms 0.342 ms 0.346 ms
3 10.0.45.1 (10.0.45.1) 0.330 ms 0.341 ms 0.346 ms
4 10.0.24.1 (10.0.24.1) 0.332 ms 0.343 ms 0.345 ms
5 10.0.12.1 (10.0.12.1) 0.329 ms 0.342 ms 0.347 ms
6 10.0.16.2 (10.0.16.2) 0.330 ms 0.341 ms 0.346 ms
7 10.10.12.77 (10.10.12.77) 0.331 ms 0.343 ms 0.347 ms
bsdclient#
Let’s look at it now, after the LSP.
bsdclient# traceroute bsdserver
traceroute to bsdserver (10.10.12.77), 64 hops max, 44 byte packets
1 10.10.11.1 (10.10.11.1) 0.363 ms 0.306 ms 0.345 ms
2 10.0.59.1 (10.0.59.1) 0.329 ms 0.342 ms 0.346 ms
3 10.0.16.2 (10.0.16.2) 0.330 ms 0.343 ms 0.347 ms
4 10.10.12.77 (10.10.12.77) 0.331 ms 0.343 ms 0.347 ms
bsdclient#
Only four routers have “routed” the packet. On the backbone, the packet is switched
based on the MPLS tables, and so forms one router hop. But at least we can see that the
packets are sent toward P9 (10.0.59.1) and not P4 (10.0.50.1).
The details of the path of MPLS LSPs are not visible from the hosts. Why should
they be? LSPs are tools for the service providers on our network. Only on the routers,
running a special version of traceroute, can we reveal the hop-by-hop functioning of
the LSP. When run on PE5 to trace the path to the link to CE6, traceroute “expands” the
path and provides details—showing that CE6 is still five routers away from CE0
(and that there are still six routers and seven hops between LAN1 and LAN2).
user@PE5> traceroute 10.10.16.1
traceroute to 10.10.12.0 (10.10.12.0), 30 hops max, 40 byte packets
1 10.10.12.1 (10.10.12.1) 0.851 ms 0.743 ms 0.716 ms
MPLS Label=1023 CoS=0 TTL=1 S=1
2 10.0.59.1 (10.0.59.1) 0.799 ms 0.753 ms 0.721 ms
MPLS Label=1104 CoS=0 TTL=1 S=1
3 10.0.79.1 (10.0.79.1) 0.832 ms 0.769 ms 0.735 ms
MPLS Label=1253 CoS=0 TTL=1 S=1
4 10.0.17.1 (10.0.17.1) 0.854 ms 0.767 ms 0.734 ms
5 10.0.16.1 (10.0.16.1) 0.629 ms !N 0.613 ms !N 0.582 ms !N
user@PE5>
Just to show that the LSP we set up is unidirectional, watch what happens when we
run traceroute in reverse from bsdserver on LAN2 to bsdclient on LAN1.
bsdserver# traceroute bsdclient
traceroute to bsdclient (10.10.11.177), 64 hops max, 44 byte packets
1 10.10.12.1 (10.10.12.1) 0.361 ms 0.304 ms 0.343 ms
2 10.0.16.1 (10.0.16.1) 0.331 ms 0.344 ms 0.347 ms
3 10.0.12.2 (10.0.12.2) 0.329 ms 0.340 ms 0.345 ms
4 10.0.24.2 (10.0.24.2) 0.333 ms 0.344 ms 0.346 ms
5 10.0.45.2 (10.0.45.2) 0.329 ms 0.342 ms 0.347 ms
6 10.0.50.2 (10.0.50.2) 0.330 ms 0.341 ms 0.346 ms
7 10.10.11.177 (10.10.11.177) 0.331 ms 0.343 ms 0.347 ms
bsdserver#
Packets flow through backbone routers P2 and P4, as they did before the MPLS LSP
was set up! The “old” route is used, showing that MPLS is the basis for traffic engineering
on a router network.
QUESTIONS FOR READERS
Figure 17.9 shows some of the concepts discussed in this chapter and can be used to
help you answer the following questions.
FIGURE 17.9
An MPLS LSP from ingress to egress router, showing the label values along the path. The LSP runs along the
heavy lines through the routers designated. The label values used on each link are also shown.
(A packet for 10.10.100.0/24 arrives from the upstream ISP at the ingress router, crosses the ISP along the LSP through several of the routers labeled A through F, and exits at the egress router toward the downstream ISP that holds network 10.10.100.0/24, along with many more. The label values shown on the links of the LSP are 1253, 1104, 1215, and 3.)
1. Does the LSP in Figure 17.9 use the shortest path in terms of number of routers
from ingress to egress?
2. What does traffic engineering mean as the term applies to MPLS?
3. Is there an LSP set up on the reverse path from egress to ingress router?
4. Which label is used on the LSP between routers A and B? Is this label added to
another, or swapped?
5. Is PHP used on the LSP? How can you tell?
PART IV
Application Level
Every host on the Internet typically runs a set of basic client–server applications.
This part of the book examines each one in detail.
■ Chapter 18—Dynamic Host Configuration Protocol
■ Chapter 19—The Domain Name System
■ Chapter 20—File Transfer Protocol
■ Chapter 21—SMTP and Email
■ Chapter 22—Hypertext Transfer Protocol
■ Chapter 23—Securing Sockets with SSL
CHAPTER 18
Dynamic Host Configuration Protocol
What You Will Learn
In this chapter, you will learn how IP addresses are assigned in modern IP networks.
You will learn how the Dynamic Host Configuration Protocol (DHCP) and related
protocols, such as BOOTP, combine to allow IP addresses to be assigned to devices
dynamically instead of by hand.
You will learn how users often struggle to find printers and servers whose IP
addresses “jump around,” and you will learn means of dealing with this issue.
When TCP/IP first became popular, configuration was never trivial and often complex.
Whereas many clients needed only a handful of parameters, servers often required
long lists of values. Operating systems had quickly outgrown single floppies, and most
hosts now needed hard drives just to boot themselves into existence. Routers were in
a class by themselves, especially when they connected more than two subnets—and
in the days of expensive memory and secondary storage (hard drives), routers usually
needed to load not only their configuration from a special server, but often their entire
operating systems.
A once-popular movement to “diskless workstations” hyped devices that put all of
their value into hefty processors while dispensing with expensive (and failure-prone)
hard drives altogether. Semiconductor memory was not only prohibitively expensive in
adequate quantities but universally volatile, meaning that the contents did not survive
a power failure or shutdown. How could routers and diskless workstations find the software and configuration information they needed when they were initially powered on?
RFC 951 addressed this situation by defining BOOTP, the bootstrap protocol, to find
servers offering the software and configuration files routers and other devices needed
on the subnet. The basic functions were extended in RFC 1542, which described relay
agents that could be used to find BOOTP servers almost anywhere on a network. BOOTP
did a good job at router software loading, but the configuration part (notably the IP
addresses) assigned by the device’s physical address had to be laboriously maintained
by the BOOTP server a