Front cover
TCP/IP Tutorial and
Technical Overview
Understand networking fundamentals
of the TCP/IP protocol suite
Introduces advanced concepts
and new technologies
Includes the latest
TCP/IP protocols
Lydia Parziale
David T. Britt
Chuck Davis
Jason Forrester
Wei Liu
Carolyn Matthews
Nicolas Rosselot
International Technical Support Organization
TCP/IP Tutorial and Technical Overview
December 2006
Note: Before using this information and the product it supports, read the information in
“Notices” on page xvii.
Eighth Edition (December 2006)
© Copyright International Business Machines Corporation 1989-2006. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP
Schedule Contract with IBM Corp.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
The team that wrote this redbook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii
Part 1. Core TCP/IP protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 1. Architecture, history, standards, and trends . . . . . . . . . . . . . . . 3
1.1 TCP/IP architectural model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Internetworking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 The TCP/IP protocol layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 TCP/IP applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 The roots of the Internet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 ARPANET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.2 NSFNET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.3 Commercial use of the Internet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.4 Internet2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.5 The Open Systems Interconnection (OSI) Reference Model . . . . . . 20
1.3 TCP/IP standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.1 Request for Comments (RFC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.2 Internet standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4 Future of the Internet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.1 Multimedia applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.2 Commercial use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.3 The wireless Internet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.5 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Chapter 2. Network interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1 Ethernet and IEEE 802 local area networks (LANs) . . . . . . . . . . . . . . . . . 30
2.1.1 Gigabit Ethernet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Fiber Distributed Data Interface (FDDI). . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Serial Line IP (SLIP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Point-to-Point Protocol (PPP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.1 Point-to-point encapsulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Integrated Services Digital Network (ISDN) . . . . . . . . . . . . . . . . . . . . . . . 38
2.6 X.25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Frame relay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7.1 Frame format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7.2 Interconnect issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.7.3 Data link layer parameter negotiation . . . . . . . . . . . . . . . . . . . . . . . . 43
2.7.4 IP over frame relay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.8 PPP over SONET and SDH circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.8.1 Physical layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.9 Multi-Path Channel+ (MPC+) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.10 Asynchronous transfer mode (ATM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.10.1 Address resolution (ATMARP and InATMARP) . . . . . . . . . . . . . . . 47
2.10.2 Classical IP over ATM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.10.3 ATM LAN emulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.10.4 Classical IP over ATM versus LAN emulation. . . . . . . . . . . . . . . . . 59
2.11 Multiprotocol over ATM (MPOA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.11.1 Benefits of MPOA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.11.2 MPOA logical components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.11.3 MPOA functional components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.11.4 MPOA operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.12 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Chapter 3. Internetworking protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.1 Internet Protocol (IP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1.1 IP addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1.2 IP subnets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.1.3 IP routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1.4 Methods of delivery: Unicast, broadcast, multicast, and anycast . . . 84
3.1.5 The IP address exhaustion problem . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.1.6 Intranets: Private IP addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.1.7 Network Address Translation (NAT) . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.1.8 Classless Inter-Domain Routing (CIDR) . . . . . . . . . . . . . . . . . . . . . . 95
3.1.9 IP datagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.2 Internet Control Message Protocol (ICMP) . . . . . . . . . . . . . . . . . . . . . . . 109
3.2.1 ICMP messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.2.2 ICMP applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.3 Internet Group Management Protocol (IGMP) . . . . . . . . . . . . . . . . . . . . 119
3.4 Address Resolution Protocol (ARP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.4.1 ARP overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.4.2 ARP detailed concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.4.3 ARP and subnets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
3.4.4 Proxy-ARP or transparent subnetting . . . . . . . . . . . . . . . . . . . . . . . 123
3.5 Reverse Address Resolution Protocol (RARP) . . . . . . . . . . . . . . . . . . . . 124
3.5.1 RARP concept. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.6 Bootstrap Protocol (BOOTP). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.6.1 BOOTP forwarding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.6.2 BOOTP considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.7 Dynamic Host Configuration Protocol (DHCP) . . . . . . . . . . . . . . . . . . . . 130
3.7.1 The DHCP message format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.7.2 DHCP message types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.7.3 Allocating a new network address. . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.7.4 DHCP lease renewal process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
3.7.5 Reusing a previously allocated network address . . . . . . . . . . . . . . 138
3.7.6 Configuration parameters repository . . . . . . . . . . . . . . . . . . . . . . . . 139
3.7.7 DHCP considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.7.8 BOOTP and DHCP interoperability . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.8 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Chapter 4. Transport layer protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.1 Ports and sockets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.1.1 Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.1.2 Sockets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.2 User Datagram Protocol (UDP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.2.1 UDP datagram format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.2.2 UDP application programming interface . . . . . . . . . . . . . . . . . . . . . 149
4.3 Transmission Control Protocol (TCP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.3.1 TCP concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.3.2 TCP application programming interface . . . . . . . . . . . . . . . . . . . . . 164
4.3.3 TCP congestion control algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.4 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Chapter 5. Routing protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.1 Autonomous systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
5.2 Types of IP routing and IP routing algorithms . . . . . . . . . . . . . . . . . . . . . 174
5.2.1 Static routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.2.2 Distance vector routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.2.3 Link state routing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.2.4 Path vector routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.2.5 Hybrid routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.3 Routing Information Protocol (RIP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.3.1 RIP packet types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.3.2 RIP packet format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.3.3 RIP modes of operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.3.4 Calculating distance vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.3.5 Convergence and counting to infinity . . . . . . . . . . . . . . . . . . . . . . . 185
5.3.6 RIP limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.4 Routing Information Protocol Version 2 (RIP-2) . . . . . . . . . . . . . . . . . . . 189
5.4.1 RIP-2 packet format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.4.2 RIP-2 limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.5 RIPng for IPv6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.5.1 Differences between RIPng and RIP-2 . . . . . . . . . . . . . . . . . . . . . . 193
5.5.2 RIPng packet format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.6 Open Shortest Path First (OSPF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
5.6.1 OSPF terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
5.6.2 Neighbor communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.6.3 OSPF neighbor state machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.6.4 OSPF route redistribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
5.6.5 OSPF stub areas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
5.6.6 OSPF route summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
5.7 Enhanced Interior Gateway Routing Protocol (EIGRP). . . . . . . . . . . . . . 212
5.7.1 Features of EIGRP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
5.7.2 EIGRP packet types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
5.8 Exterior Gateway Protocol (EGP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.9 Border Gateway Protocol (BGP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.9.1 BGP concepts and terminology. . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
5.9.2 IBGP and EBGP communication . . . . . . . . . . . . . . . . . . . . . . . . . . 218
5.9.3 Protocol description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
5.9.4 Path selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
5.9.5 BGP synchronization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
5.9.6 BGP aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.9.7 BGP confederations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
5.9.8 BGP route reflectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.10 Routing protocol selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
5.11 Additional functions performed by the router. . . . . . . . . . . . . . . . . . . . . 234
5.12 Routing processes in UNIX-based systems . . . . . . . . . . . . . . . . . . . . . 235
5.13 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Chapter 6. IP multicast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.1 Multicast addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
6.1.1 Multicasting on a single physical network . . . . . . . . . . . . . . . . . . . . 238
6.1.2 Multicasting between network segments . . . . . . . . . . . . . . . . . . . . 240
6.2 Internet Group Management Protocol (IGMP) . . . . . . . . . . . . . . . . . . . . 241
6.2.1 IGMP messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
6.2.2 IGMP operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.3 Multicast delivery tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.4 Multicast forwarding algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.4.1 Reverse path forwarding algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 252
6.4.2 Center-based tree algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.4.3 Multicast routing protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6.5 Distance Vector Multicast Routing Protocol (DVMRP) . . . . . . . . . . . . . . 254
6.5.1 Protocol overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6.5.2 Building and maintaining multicast delivery trees . . . . . . . . . . . . . . 256
6.5.3 DVMRP tunnels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.6 Multicast OSPF (MOSPF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.6.1 Protocol overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
6.6.2 MOSPF and multiple OSPF areas . . . . . . . . . . . . . . . . . . . . . . . . . 260
6.6.3 MOSPF and multiple autonomous systems . . . . . . . . . . . . . . . . . . 260
6.6.4 MOSPF interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.7 Protocol Independent Multicast (PIM) . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.7.1 PIM dense mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
6.7.2 PIM sparse mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.8 Interconnecting multicast domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.8.1 Multicast Source Discovery Protocol (MSDP) . . . . . . . . . . . . . . . . . 266
6.8.2 Border Gateway Multicast Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 269
6.9 The multicast backbone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
6.9.1 MBONE routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
6.9.2 Multicast applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
6.10 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Chapter 7. Mobile IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.1 Mobile IP overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
7.1.1 Mobile IP operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7.1.2 Mobility agent advertisement extensions . . . . . . . . . . . . . . . . . . . . 278
7.2 Mobile IP registration process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.2.1 Tunneling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.2.2 Broadcast datagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.2.3 Move detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.2.4 Returning home. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
7.2.5 ARP considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
7.2.6 Mobile IP security considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 286
7.3 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
Chapter 8. Quality of service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
8.1 Why QoS? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
8.2 Integrated Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
8.2.1 Service classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
8.2.2 Controlled Load Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
8.2.3 Guaranteed Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
8.2.4 The Resource Reservation Protocol (RSVP) . . . . . . . . . . . . . . . . . 296
8.2.5 Integrated Services outlook. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
8.3 Differentiated Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.3.1 Differentiated Services architecture . . . . . . . . . . . . . . . . . . . . . . . . 310
8.3.2 Organization of the DSCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
8.3.3 Configuration and administration of DS with LDAP. . . . . . . . . . . . . 322
8.4 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Chapter 9. IP version 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
9.1 IPv6 introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
9.1.1 IP growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
9.1.2 IPv6 feature overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
9.2 The IPv6 header format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
9.2.1 Extension headers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
9.2.2 IPv6 addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
9.2.3 Traffic class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
9.2.4 Flow labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
9.2.5 IPv6 security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
9.2.6 Packet sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
9.3 Internet Control Message Protocol Version 6 (ICMPv6) . . . . . . . . . . . . . 352
9.3.1 Neighbor discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
9.3.2 Multicast Listener Discovery (MLD) . . . . . . . . . . . . . . . . . . . . . . . . 365
9.4 DNS in IPv6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
9.4.1 Format of IPv6 resource records. . . . . . . . . . . . . . . . . . . . . . . . . . . 368
9.5 DHCP in IPv6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
9.5.1 DHCPv6 messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
9.6 IPv6 mobility support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
9.7 IPv6 new opportunities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
9.7.1 New infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
9.7.2 New services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
9.7.3 New research and development platforms . . . . . . . . . . . . . . . . . . . 378
9.8 Internet transition: Migrating from IPv4 to IPv6 . . . . . . . . . . . . . . . . . . . . 379
9.8.1 Dual IP stack implementation: The IPv6/IPv4 node . . . . . . . . . . . . 380
9.8.2 Tunneling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
9.8.3 Interoperability summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
9.9 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
Chapter 10. Wireless IP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
10.1 Wireless concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
10.2 Why wireless? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
10.2.1 Deployment and cost effectiveness . . . . . . . . . . . . . . . . . . . . . . . 395
10.2.2 Reachability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.2.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.2.4 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.2.5 Connectivity and reliability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.3 WiFi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.4 WiMax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.5 Applications of wireless networking. . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
10.5.1 Last mile connectivity in broadband services . . . . . . . . . . . . . . . . 402
10.5.2 Hotspots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
10.5.3 Mesh networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
10.6 IEEE standards relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . 403
Part 2. TCP/IP application protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Chapter 11. Application structure and programming interfaces . . . . . . 407
11.1 Characteristics of applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
11.1.1 The client/server model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
11.2 Application programming interfaces (APIs) . . . . . . . . . . . . . . . . . . . . . . 410
11.2.1 The socket API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
11.2.2 Remote Procedure Call (RPC) . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
11.2.3 The SNMP distributed programming interface (SNMP DPI) . . . . . 419
11.2.4 REXX sockets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
11.3 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Chapter 12. Directory and naming protocols . . . . . . . . . . . . . . . . . . . . . . 425
12.1 Domain Name System (DNS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
12.1.1 The hierarchical namespace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
12.1.2 Fully qualified domain names (FQDNs) . . . . . . . . . . . . . . . . . . . . 428
12.1.3 Generic domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
12.1.4 Country domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
12.1.5 Mapping domain names to IP addresses . . . . . . . . . . . . . . . . . . . 429
12.1.6 Mapping IP addresses to domain names: Pointer queries . . . . . . 430
12.1.7 The distributed name space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
12.1.8 Domain name resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
12.1.9 Domain Name System resource records . . . . . . . . . . . . . . . . . . . 436
12.1.10 Domain Name System messages . . . . . . . . . . . . . . . . . . . . . . . . 439
12.1.11 A simple scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
12.1.12 Extended scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
12.1.13 Transport. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
12.1.14 DNS applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
12.2 Dynamic Domain Name System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
12.2.1 Dynamic updates in the DDNS . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
12.2.2 Incremental zone transfers in DDNS. . . . . . . . . . . . . . . . . . . . . . . 456
12.2.3 Prompt notification of zone transfer . . . . . . . . . . . . . . . . . . . . . . . 457
12.3 Network Information System (NIS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
12.4 Lightweight Directory Access Protocol (LDAP) . . . . . . . . . . . . . . . . . . . 459
12.4.1 LDAP: Lightweight access to X.500 . . . . . . . . . . . . . . . . . . . . . . . 460
12.4.2 The LDAP directory server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
12.4.3 Overview of LDAP architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
12.4.4 LDAP models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
12.4.5 LDAP security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
12.4.6 LDAP URLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
12.4.7 LDAP and DCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
12.4.8 The Directory-Enabled Networks (DEN) initiative . . . . . . . . . . . . . 477
12.4.9 Web-Based Enterprise Management (WBEM) . . . . . . . . . . . . . . . 478
12.5 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
Chapter 13. Remote execution and distributed computing. . . . . . . . . . . 483
13.1 Telnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
13.1.1 Telnet operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
13.1.2 Network Virtual Terminal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
13.1.3 Telnet options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
13.1.4 Telnet command structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
13.1.5 Option negotiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
13.1.6 Telnet basic commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
13.1.7 Terminal emulation (Telnet 3270) . . . . . . . . . . . . . . . . . . . . . . . . . 492
13.1.8 TN3270 enhancements (TN3270E) . . . . . . . . . . . . . . . . . . . . . . . 493
13.1.9 Device-type negotiation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
13.2 Remote Execution Command protocol (REXEC and RSH) . . . . . . . . . 495
13.3 Introduction to the Distributed Computing Environment (DCE) . . . . . . . 496
13.3.1 DCE directory service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
13.3.2 Authentication service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
13.3.3 DCE threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
13.3.4 Distributed Time Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
13.3.5 Additional information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
13.4 Distributed File Service (DFS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
13.4.1 File naming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
13.4.2 DFS performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
13.5 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
Chapter 14. File-related protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
14.1 File Transfer Protocol (FTP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
14.1.1 An overview of FTP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
14.1.2 FTP operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
14.1.3 The active data transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
14.1.4 The passive data transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
14.1.5 Using proxy transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
14.1.6 Reply codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
14.1.7 Anonymous FTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
14.1.8 Using FTP with IPv6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
14.1.9 Securing FTP sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
14.2 Trivial File Transfer Protocol (TFTP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
14.2.1 TFTP usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
14.2.2 Protocol description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
14.2.3 TFTP packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
TCP/IP Tutorial and Technical Overview
14.2.4 Data modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
14.2.5 TFTP multicast option . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
14.2.6 Security issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
14.3 Secure Copy Protocol (SCP) and SSH FTP (SFTP) . . . . . . . . . . . . . . . 533
14.3.1 SCP syntax and usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
14.3.2 SFTP syntax and usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
14.3.3 SFTP interactive commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
14.4 Network File System (NFS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
14.4.1 NFS concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
14.4.2 File integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
14.4.3 Lock Manager protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
14.4.4 NFS file system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
14.4.5 NFS version 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
14.4.6 Cache File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
14.4.7 WebNFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
14.5 The Andrew File System (AFS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
14.6 Common Internet File System (CIFS) . . . . . . . . . . . . . . . . . . . . . . . . . . 548
14.6.1 NetBIOS over TCP/IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
14.6.2 SMB/CIFS specifics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
14.7 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
Chapter 15. Mail applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
15.1 Simple Mail Transfer Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
15.1.1 How SMTP works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
15.1.2 SMTP and the Domain Name System . . . . . . . . . . . . . . . . . . . . . 565
15.2 Sendmail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
15.2.1 Sendmail as a mail transfer agent (MTA) . . . . . . . . . . . . . . . . . . . 568
15.2.2 How sendmail works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
15.3 Multipurpose Internet Mail Extensions (MIME) . . . . . . . . . . . . . . . . . . . 571
15.3.1 How MIME works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
15.3.2 The Content-Transfer-Encoding field . . . . . . . . . . . . . . . . . . . . . . 582
15.3.3 Using non-ASCII characters in message headers . . . . . . . . . . . . 587
15.4 Post Office Protocol (POP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
15.4.1 Connection states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
15.4.2 POP3 commands and responses . . . . . . . . . . . . . . . . . . . . . . . . . 590
15.5 Internet Message Access Protocol (IMAP4) . . . . . . . . . . . . . . . . . . . . . 591
15.5.1 Fundamental IMAP4 electronic mail models . . . . . . . . . . . . . . . . . 591
15.5.2 IMAP4 states. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
15.5.3 IMAP4 commands and response interaction . . . . . . . . . . . . . . . . 594
15.5.4 IMAP4 messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
15.6 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
Chapter 16. The Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
16.1 Web browsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
16.2 Web servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
16.3 Hypertext Transfer Protocol (HTTP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
16.3.1 Overview of HTTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
16.3.2 HTTP operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
16.4 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
16.4.1 Static content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
16.4.2 Client-side dynamic content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
16.4.3 Server-side dynamic content. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
16.4.4 Developing content with IBM Web application servers . . . . . . . . . 621
16.5 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Chapter 17. Network management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
17.1 The Simple Network Management Protocol (SNMP) . . . . . . . . . . . . . . 624
17.1.1 The Management Information Base (MIB) . . . . . . . . . . . . . . . . . . 625
17.1.2 The SNMP agent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
17.1.3 The SNMP manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
17.1.4 The SNMP subagent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
17.1.5 The SNMP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
17.1.6 SNMP traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
17.1.7 SNMP versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
17.1.8 Single authentication and privacy protocol . . . . . . . . . . . . . . . . . . 647
17.2 The NETSTAT utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
17.2.1 Common NETSTAT options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
17.2.2 Sample NETSTAT report output . . . . . . . . . . . . . . . . . . . . . . . . . . 649
17.3 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
Chapter 18. Wireless Application Protocol . . . . . . . . . . . . . . . . . . . . . . . . 655
18.1 The WAP environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
18.2 Key elements of the WAP specifications. . . . . . . . . . . . . . . . . . . . . . . . 657
18.3 WAP architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
18.4 Client identifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
18.5 Multimedia messaging system (MMS) . . . . . . . . . . . . . . . . . . . . . . . . . 663
18.6 WAP push architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
18.6.1 Push framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
18.6.2 Push proxy gateway (PPG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
18.6.3 Push access control protocol (PAP) . . . . . . . . . . . . . . . . . . . . . . . 667
18.6.4 Service indication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.6.5 Push over-the-air protocol (OTA) . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.6.6 Client-side infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
18.6.7 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
18.7 The Wireless Application Environment (WAE2) . . . . . . . . . . . . . . . . . . 670
18.8 User Agent Profile (UAProf) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671
18.9 Wireless protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
18.9.1 Wireless Datagram Protocol (WDP) . . . . . . . . . . . . . . . . . . . . . . . 672
18.9.2 Wireless Profiled Transmission Control Protocol (WP-TCP) . . . . 674
18.9.3 Wireless Control Message Protocol (WCMP) . . . . . . . . . . . . . . . . 678
18.9.4 Wireless Transaction Protocol (WTP) . . . . . . . . . . . . . . . . . . . . . . 679
18.9.5 Wireless Session Protocol (WSP) . . . . . . . . . . . . . . . . . . . . . . . . . 682
18.9.6 Wireless profiled HTTP (W-HTTP) . . . . . . . . . . . . . . . . . . . . . . . . 695
18.10 Wireless security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
18.10.1 Wireless Transport Layer Security (WTLS). . . . . . . . . . . . . . . . . 696
18.10.2 Wireless Identity Module (WIM) . . . . . . . . . . . . . . . . . . . . . . . . . 701
18.11 Wireless Telephony Application (WTA) . . . . . . . . . . . . . . . . . . . . . . . . 702
18.12 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
18.13 Specifications relevant to this chapter. . . . . . . . . . . . . . . . . . . . . . . . . 703
Chapter 19. Presence over IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
19.1 Overview of the presence service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
19.2 Presence Information Data Format (PIDF) . . . . . . . . . . . . . . . . . . . . . . 714
19.3 Presence protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716
19.3.1 Binding to TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
19.3.2 Address resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
19.4 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
Part 3. Advanced concepts and new technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Chapter 20. Voice over Internet Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 723
20.1 Voice over IP (VoIP) introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
20.1.1 Benefits and applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
20.1.2 VoIP functional components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
20.2 Session Initiation Protocol (SIP) technologies. . . . . . . . . . . . . . . . . . . . 730
20.2.1 SIP request and response. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
20.2.2 Sample SIP message flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733
20.2.3 SIP protocol architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
20.3 Media Gateway Control Protocol (MGCP) . . . . . . . . . . . . . . . . . . . . . . 736
20.3.1 MGCP architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
20.3.2 MGCP primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
20.4 Media Gateway Controller (Megaco). . . . . . . . . . . . . . . . . . . . . . . . . . . 738
20.4.1 Megaco architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
20.5 ITU-T recommendation H.323 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
20.5.1 H.323 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
20.5.2 H.323 protocol stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
20.6 Summary of VoIP protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 742
20.7 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
Chapter 21. Internet Protocol Television. . . . . . . . . . . . . . . . . . . . . . . . . . 745
21.1 IPTV overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
21.1.1 IPTV requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747
21.1.2 Business benefits and applications . . . . . . . . . . . . . . . . . . . . . . . . 749
21.2 Functional components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
21.2.1 Content acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
21.2.2 CODEC (encode and decode) . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
21.2.3 Display devices and control gateway . . . . . . . . . . . . . . . . . . . . . . 751
21.2.4 IP (TV) transport . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
21.3 IPTV technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
21.3.1 Summary of protocol standards . . . . . . . . . . . . . . . . . . . . . . . . . . 753
21.3.2 Stream Control Transmission Protocol . . . . . . . . . . . . . . . . . . . . . 753
21.3.3 Session Description Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
21.3.4 Real-Time Transport Protocol (RTP) . . . . . . . . . . . . . . . . . . . . . . 756
21.3.5 Real-Time Control Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 762
21.3.6 Moving Picture Experts Group (MPEG) standards . . . . . . . . . . . . 767
21.3.7 H.261. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 769
21.4 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 770
Chapter 22. TCP/IP security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
22.1 Security exposures and solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
22.1.1 Common attacks against security . . . . . . . . . . . . . . . . . . . . . . . . . 772
22.1.2 Solutions to network security problems. . . . . . . . . . . . . . . . . . . . . 772
22.1.3 Implementations of security solutions . . . . . . . . . . . . . . . . . . . . . . 774
22.1.4 Network security policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
22.2 A short introduction to cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . 777
22.2.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777
22.2.2 Symmetric or secret-key algorithms . . . . . . . . . . . . . . . . . . . . . . . 779
22.2.3 Asymmetric or public key algorithms. . . . . . . . . . . . . . . . . . . . . . . 780
22.2.4 Hash functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
22.2.5 Digital certificates and certification authorities . . . . . . . . . . . . . . . 791
22.2.6 Random-number generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
22.2.7 Export/import restrictions on cryptography . . . . . . . . . . . . . . . . . . 793
22.3 Firewalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794
22.3.1 Firewall concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
22.3.2 Components of a firewall system . . . . . . . . . . . . . . . . . . . . . . . . . 796
22.3.3 Types of firewalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805
22.4 IP Security Architecture (IPSec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 809
22.4.1 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 810
22.4.2 Authentication Header (AH) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 813
22.4.3 Encapsulating Security Payload (ESP) . . . . . . . . . . . . . . . . . . . . . 817
22.4.4 Combining IPSec protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
22.4.5 Internet Key Exchange (IKE) protocol. . . . . . . . . . . . . . . . . . . . . . 829
22.5 SOCKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846
22.5.1 SOCKS Version 5 (SOCKSv5) . . . . . . . . . . . . . . . . . . . . . . . . . . . 848
22.6 Secure Shell (1 and 2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
22.6.1 SSH overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
22.7 Secure Sockets Layer (SSL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 854
22.7.1 SSL overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 854
22.7.2 SSL protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856
22.8 Transport Layer Security (TLS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
22.9 Secure Multipurpose Internet Mail Extension (S-MIME) . . . . . . . . . . . . 861
22.10 Virtual private networks (VPNs) overview . . . . . . . . . . . . . . . . . . . . . . 861
22.10.1 VPN introduction and benefits. . . . . . . . . . . . . . . . . . . . . . . . . . . 862
22.11 Kerberos authentication and authorization system . . . . . . . . . . . . . . . 864
22.11.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865
22.11.2 Naming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865
22.11.3 Kerberos authentication process. . . . . . . . . . . . . . . . . . . . . . . . . 866
22.11.4 Kerberos database management . . . . . . . . . . . . . . . . . . . . . . . . 870
22.11.5 Kerberos Authorization Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
22.11.6 Kerberos Version 5 enhancements. . . . . . . . . . . . . . . . . . . . . . . 871
22.12 Remote access authentication protocols. . . . . . . . . . . . . . . . . . . . . . . 872
22.13 Extensible Authentication Protocol (EAP) . . . . . . . . . . . . . . . . . . . . . . 874
22.14 Layer 2 Tunneling Protocol (L2TP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 875
22.14.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 876
22.14.2 Protocol overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 877
22.14.3 L2TP security issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 879
22.15 Secure Electronic Transaction (SET) . . . . . . . . . . . . . . . . . . . . . . . . . 880
22.15.1 SET roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
22.15.2 SET transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 881
22.15.3 The SET certificate scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 883
22.16 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885
Chapter 23. Port based network access control . . . . . . . . . . . . . . . . . . . 889
23.1 Port based network access control (NAC) overview . . . . . . . . . . . . . . . 890
23.2 Port based NAC component overview . . . . . . . . . . . . . . . . . . . . . . . . . 891
23.3 Port based network access control operation . . . . . . . . . . . . . . . . . . . . 892
23.3.1 Port based network access control functional considerations. . . . 904
23.4 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906
Chapter 24. Availability, scalability, and load balancing . . . . . . . . . . . . . 907
24.1 Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
24.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
24.3 Load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 910
24.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 910
24.5 Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 912
24.6 Virtual Router Redundancy Protocol (VRRP) . . . . . . . . . . . . . . . . . . . . 914
24.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914
24.6.2 VRRP definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
24.6.3 VRRP overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
24.6.4 Sample configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 918
24.6.5 VRRP packet format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 919
24.7 Round-robin DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 921
24.8 Alternative solutions to load balancing . . . . . . . . . . . . . . . . . . . . . . . . . 921
24.8.1 Network Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 922
24.8.2 Encapsulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 923
24.9 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
Appendix A. Multiprotocol Label Switching . . . . . . . . . . . . . . . . . . . . . . . 925
A.1 MPLS: An introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 926
A.1.1 Conventional routing versus MPLS forwarding mode. . . . . . . . . . . 926
A.1.2 Benefits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927
A.1.3 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 929
A.2 MPLS network processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 932
A.2.1 Label swapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 932
A.2.2 Label switched path (LSP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934
A.2.3 Label stack and label hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . 934
A.2.4 MPLS stacks in a BGP environment. . . . . . . . . . . . . . . . . . . . . . . . 936
A.2.5 Label distribution protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 938
A.2.6 Stream merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939
A.3 Emulating Ethernet over MPLS networks . . . . . . . . . . . . . . . . . . . . . . . . 939
A.4 Generalized Multiprotocol Label Switching (GMPLS) . . . . . . . . . . . . . . . 941
A.4.1 Benefits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 941
A.4.2 MPLS and GMPLS comparison in OTN environment. . . . . . . . . . . 942
A.4.3 How does GMPLS work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943
A.4.4 Link Management Protocol (LMP) . . . . . . . . . . . . . . . . . . . . . . . . . 944
A.4.5 Signaling for route selection and path setup. . . . . . . . . . . . . . . . . . 947
A.4.6 GMPLS considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949
A.4.7 GMPLS examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 950
A.5 RFCs relevant to this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 952
Abbreviations and acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 953
Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
Other publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
How to get IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 961
Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 961
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 963
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY
OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm
the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on
the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all conditions. IBM,
therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
© Copyright IBM Corp. 1989-2006. All rights reserved.
Trademarks
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Advanced Peer-to-Peer
IBM Global Network®
Lotus Notes®
Operating System/2®
Redbooks (logo)
RISC System/6000®
The following terms are trademarks of other companies:
SAP, and SAP logos are trademarks or registered trademarks of SAP AG in Germany and in several other
countries.
CacheFS, Enterprise JavaBeans, EJB, IPX, Java, Java Naming and Directory Interface, JavaBeans,
JavaScript, JavaServer, JavaServer Pages, JavaSoft, JDBC, JDK, JSP, JVM, J2EE, ONC, Solaris, Sun,
Sun Microsystems, WebNFS, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in
the United States, other countries, or both.
Internet Explorer, Microsoft, MSN, Windows NT, Windows, and the Windows logo are trademarks of
Microsoft Corporation in the United States, other countries, or both.
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
The TCP/IP protocol suite has become a staple of today's international society
and global economy. Continually evolving standards provide a wide and flexible
foundation on which an entire infrastructure of applications is built. Through
these we can seek entertainment, conduct business, make financial
transactions, deliver services, and much, much more.
However, because TCP/IP continues to develop and grow in order to meet the
changing needs of our communities, it might sometimes be hard to keep track of
new functionality or identify new possibilities. For this reason, the TCP/IP Tutorial
and Technical Overview provides not only an introduction to the TCP/IP protocol
suite, but also serves as a reference for advanced users seeking to keep their
TCP/IP skills aligned with current standards. It is our hope that both the novice
and the expert will find useful information in this publication.
In Part I, you will find an introduction to the core concepts and history upon which
TCP/IP is founded. Included is an introduction to the history of TCP/IP and an
overview of its current architecture. We also provide detailed discussions about
the protocols that comprise the suite, and how those protocols are most
commonly implemented.
Part II expands on the information provided in Part I, providing general
application concepts (such as file sharing) and specific application protocols
within those concepts (such as the File Transfer Protocol, or FTP). Additionally,
Part II discusses applications that might not be included in the standard TCP/IP
suite but, because of their wide use throughout the Internet community, are
considered de facto standards.
Finally, Part III addresses new concepts and advanced implementations within
the TCP/IP architecture. Of particular note, Part III examines the convergence of
many formerly disparate networks and services using IP technology. Conjointly,
this section reviews potential dangers of this IP convergence and approaches
the ever-growing standards used to secure and control access to networks and
networked resources.
We purposely kept this book platform independent. However, we recognize that
you might have a need to learn more about TCP/IP on various platforms, so the
following Web sites might assist you in further researching this topic:
򐂰 TCP/IP and System z:
򐂰 TCP/IP and System p:
򐂰 TCP/IP and System i:
򐂰 TCP/IP and System x:
The team that wrote this redbook
This redbook was produced by a team of specialists from around the world
working at the International Technical Support Organization, Poughkeepsie Center.
Lydia Parziale is a Project Leader for the ITSO team in
Poughkeepsie, New York with domestic and international
experience in technology management including software
development, project leadership, and strategic planning.
Her areas of expertise include e-business development
and database management technologies. Lydia is a
Certified IT Specialist with an MBA in Technology
Management and has been employed by IBM for 23 years in various
technology areas.
David T. Britt is a Software Engineer for IBM in Research
Triangle Park, NC, working specifically with the z/OS®
Communications Server product. He is a subject matter
expert in the Simple Network Management Protocol
(SNMP) and File Transfer Protocol (FTP), and has written
educational material for both in the form of IBM
Technotes, Techdocs, and Webcasts. He holds a degree
in Mathematical Sciences from the University of North
Carolina in Chapel Hill, and is currently pursuing a master
of science in Information Technology and Management
from the University of North Carolina in Greensboro.
Chuck Davis is a Security Architect in the U.S. He has 12
years of experience in the IT security field. He has worked at
IBM for nine years. His areas of expertise include IT
security and privacy. He has written extensively about
UNIX/Linux® and Internet security.
Jason Forrester is an IT Architect for IBM Global
Technology Services in Boulder, CO. He has more than 12
years of experience with network communications.
Specializing in IT strategy and architecture, Jason has
designed large-scale enterprise infrastructures. He holds a
CCIE certification and his work has led to multiple patents
on advanced networking concepts.
Dr. Wei Liu received his Ph.D. from Georgia Institute of
Technology. He has taught TCP/IP networks in the
University of Maryland (UMBC campus) and he has
participated in ICCCN conference organization
committees. Dr. Liu has given lectures at Sun Yat-sen
University and Shantou University in Next Generation
Networks (NGNs). With more than 30 technical
publications (in packet networks, telecommunications, and
standards), he has received several awards from ATIS
committees. Dr. Wei Liu has more than 10 years of telecom industry
experience, having participated in various network transformation projects and
service integration programs. Currently, he is investigating new infrastructure
opportunities (virtualization, network, services, security, and metadata models)
that can lead to future offerings and new capabilities.
Carolyn Matthews is an IT Architect for IBM Global
Technology Services in South Africa. She is an
infrastructure architect for one of South Africa’s largest
accounts. She also acts as a consultant, using various
IBM techniques. Carolyn holds an honors degree in
Information Systems and is currently pursuing her
master’s degree in Information Systems. Her areas of
expertise include TCP/IP networks, IT architecture, and
new technologies.
Nicolas Rosselot is a Developer from Santiago, Chile.
He has most recently been teaching an “Advanced
TCP/IP Networking” class at Andres Bello University.
Thanks to the following people for their contributions to this project and laying the
foundation for this book by writing the earlier version:
Adolfo Rodriguez, John Gatrell, John Karas, Roland Peschke, Srinath Karanam,
and Martín F. Maldonado
International Technical Support Organization, Poughkeepsie Center
Become a published author
Join us for a two- to six-week residency program! Help write an IBM® Redbook
dealing with specific products or solutions, while getting hands-on experience
with leading-edge technologies. You'll have the opportunity to team with IBM
technical professionals, Business Partners, and Clients.
Your efforts will help increase product acceptance and client satisfaction. As a
bonus, you'll develop a network of contacts in IBM development labs, and
increase your productivity and marketability.
Find out more about the residency program, browse the residency index, and
apply online at:
Comments welcome
Your comments are important to us!
We want our Redbooks™ to be as helpful as possible. Send us your comments
about this or other Redbooks in one of the following ways:
򐂰 Use the online Contact us review redbook form found at:
򐂰 Send your comments in an e-mail to:
[email protected]
򐂰 Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Part 1
The Transmission Control Protocol/Internet Protocol (TCP/IP) suite has become
the industry-standard method of interconnecting hosts, networks, and the
Internet. As such, it is seen as the engine behind the Internet and networks worldwide.
Although TCP/IP supports a host of applications, both standard and
nonstandard, these applications could not exist without the foundation of a set of
core protocols. Additionally, to understand the capabilities of TCP/IP
applications, one must first understand these core protocols.
With this in mind, Part I begins by providing a background of TCP/IP: the
current architecture, standards, and most recent trends. Next, the section
explores the two aspects vital to the IP stack itself. This portion begins with a
discussion of the network interfaces most commonly used to allow the protocol
suite to interface with the physical network media. This is followed by the
protocols that must be implemented in any stack, including protocols belonging
to the IP and transport layers.
Finally, other standard protocols exist that might not necessarily be required in
every implementation of the TCP/IP protocol suite. However, some can be very
useful given certain operational needs of the implementation. Such
protocols include IP version 6, quality of service protocols, and wireless IP.
Chapter 1.
Architecture, history,
standards, and trends
Today, the Internet and World Wide Web (WWW) are familiar terms to millions of
people all over the world. Many people depend on applications enabled by the
Internet, such as electronic mail and Web access. In addition, the increase in
popularity of business applications places additional emphasis on the Internet.
The Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite is
the engine for the Internet and networks worldwide. Its simplicity and power have
led to its becoming the single network protocol of choice in the world today. In
this chapter, we give an overview of the TCP/IP protocol suite. We discuss how
the Internet was formed, how it developed, and how it is likely to develop in the future.
1.1 TCP/IP architectural model
The TCP/IP protocol suite is so named for two of its most important protocols:
Transmission Control Protocol (TCP) and Internet Protocol (IP). A less used
name for it is the Internet Protocol Suite, which is the phrase used in official
Internet standards documents. In this book, we use the more common, shorter
term, TCP/IP, to refer to the entire protocol suite.
1.1.1 Internetworking
The main design goal of TCP/IP was to build an interconnection of networks,
referred to as an internetwork, or internet, that provided universal
communication services over heterogeneous physical networks. The clear
benefit of such an internetwork is the enabling of communication between hosts
on different networks, perhaps separated by a large geographical area.
The words internetwork and internet are simply contractions of the phrase
interconnected network. However, when written with a capital “I”, the Internet
refers to the worldwide set of interconnected networks. Therefore, the Internet is
an internet, but the reverse does not apply. The Internet is sometimes called the
connected Internet.
The Internet consists of the following groups of networks:
򐂰 Backbones: Large networks that exist primarily to interconnect other
networks. Also known as network access points (NAPs) or Internet Exchange
Points (IXPs). Currently, the backbones consist of commercial entities.
򐂰 Regional networks connecting, for example, universities and colleges.
򐂰 Commercial networks providing access to the backbones to subscribers, and
networks owned by commercial organizations for internal use that also have
connections to the Internet.
򐂰 Local networks, such as campus-wide university networks.
In most cases, networks are limited in size by the number of users that can
belong to the network, by the maximum geographical distance that the network
can span, or by the applicability of the network to certain environments. For
example, an Ethernet network is inherently limited in terms of geographical size.
Therefore, the ability to interconnect a large number of networks in some
hierarchical and organized fashion enables the communication of any two hosts
belonging to this internetwork.
Figure 1-1 shows two examples of internets. Each consists of two or more
physical networks.
[Figure content: Internet A, in which Network 1 and Network 2 are interconnected by one router; and a second example in which Networks 1, 2, and 3 are interconnected by multiple routers, also seen as one virtual network, an internet.]
Figure 1-1 Internet examples: Two interconnected sets of networks, each seen as one
logical network
Another important aspect of TCP/IP internetworking is the creation of a
standardized abstraction of the communication mechanisms provided by each
type of network. Each physical network has its own technology-dependent
communication interface, in the form of a programming interface that provides
basic communication functions (primitives). TCP/IP provides communication
services that run between the programming interface of a physical network and
user applications. It enables a common interface for these applications,
independent of the underlying physical network. The architecture of the physical
network is therefore hidden from the user and from the developer of the
application. The application need only code to the standardized communication
abstraction to be able to function under any type of physical network and
operating platform.
As is evident in Figure 1-1, to be able to interconnect two networks, we need a
computer that is attached to both networks and can forward data packets from
one network to the other; such a machine is called a router. The term IP router is
also used because the routing function is part of the Internet Protocol portion of
the TCP/IP protocol suite (see 1.1.2, “The TCP/IP protocol layers” on page 6).
To be able to identify a host within the internetwork, each host is assigned an
address, called the IP address. When a host has multiple network adapters
(interfaces), such as with a router, each interface has a unique IP address. The
IP address consists of two parts:
IP address = <network number><host number>
The network number part of the IP address identifies the network within the
internet; it is assigned by a central authority and is unique throughout the
internet. The authority for assigning the host number part of the IP address
resides with the organization that controls the network identified by the network
number. We describe the addressing scheme in detail in 3.1.1, “IP addressing”
on page 68.
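As an illustration of this two-part structure, the following sketch (not from the original text; the 16-bit network number is an assumed example, as for a classic Class B network) divides an IPv4 address into its network and host numbers:

```python
import ipaddress

def split_address(addr: str, prefix_len: int):
    """Split an IPv4 address into its network number and host number.

    prefix_len is the number of leading bits that form the network
    number (e.g., 16 bits for a classic Class B network).
    """
    value = int(ipaddress.IPv4Address(addr))
    host_bits = 32 - prefix_len
    network_number = value >> host_bits           # high-order bits
    host_number = value & ((1 << host_bits) - 1)  # low-order bits
    return network_number, host_number

# 9.67.158.5 with a 16-bit network number:
# network number 9.67 (2371), host number 158.5 (40453)
print(split_address("9.67.158.5", 16))
```

The same helper works for any split point, which is why the central authority need only hand out network numbers; the owning organization is free to assign the host-number bits as it sees fit.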
1.1.2 The TCP/IP protocol layers
Like most networking software, TCP/IP is modeled in layers. This layered
representation leads to the term protocol stack, which refers to the stack of
layers in the protocol suite. It can be used for positioning (but not for functionally
comparing) the TCP/IP protocol suite against others, such as Systems Network
Architecture (SNA) and the Open System Interconnection (OSI) model.
Functional comparisons cannot easily be extracted from this, because there are
basic differences in the layered models used by the different protocol suites.
By dividing the communication software into layers, the protocol stack allows for
division of labor, ease of implementation and code testing, and the ability to
develop alternative layer implementations. Layers communicate with those
above and below via concise interfaces. In this regard, a layer provides a service
for the layer directly above it and makes use of services provided by the layer
directly below it. For example, the IP layer provides the ability to transfer data
from one host to another without any guarantee of reliable delivery or duplicate
suppression. Transport protocols such as TCP make use of this service to
provide applications with reliable, in-order, data stream delivery.
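This service relationship can be pictured as successive encapsulation: each layer wraps the unit handed down from above with its own header before passing it further down. The following toy model (the header strings are simplified placeholders, not real TCP/IP header formats) sketches the idea:

```python
def encapsulate(app_data: bytes) -> bytes:
    """Toy encapsulation: each layer prepends its own (simplified) header."""
    tcp_segment = b"TCPHDR|" + app_data          # transport layer adds its header
    ip_datagram = b"IPHDR|" + tcp_segment        # internetwork layer wraps the segment
    frame = b"ETHHDR|" + ip_datagram + b"|FCS"   # link layer adds framing
    return frame

def decapsulate(frame: bytes) -> bytes:
    """The receiving stack strips the headers in reverse order."""
    ip_datagram = frame.removeprefix(b"ETHHDR|").removesuffix(b"|FCS")
    tcp_segment = ip_datagram.removeprefix(b"IPHDR|")
    return tcp_segment.removeprefix(b"TCPHDR|")

frame = encapsulate(b"GET /index.html")
assert decapsulate(frame) == b"GET /index.html"
```

Because each layer touches only its own header, a layer's implementation can be swapped out without disturbing the layers above or below it, which is exactly the division of labor the protocol-stack model is meant to provide.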
Figure 1-2 shows how the TCP/IP protocols are modeled in four layers.
Figure 1-2 The TCP/IP protocol stack: Each layer represents a package of functions
These layers include:
Application layer
The application layer is provided by the program that
uses TCP/IP for communication. An application is a
user process cooperating with another process usually
on a different host (there is also a benefit to application
communication within a single host). Examples of
applications include Telnet and the File Transfer
Protocol (FTP). The interface between the application
and transport layers is defined by port numbers and
sockets, which we describe in more detail in 4.1, “Ports
and sockets” on page 144.
Transport layer
The transport layer provides the end-to-end data
transfer by delivering data from an application to its
remote peer. Multiple applications can be supported
simultaneously. The most-used transport layer
protocol is the Transmission Control Protocol (TCP),
which provides connection-oriented reliable data
delivery, duplicate data suppression, congestion
control, and flow control. We discuss this in more detail
in 4.3, “Transmission Control Protocol (TCP)” on
page 149.
Another transport layer protocol is the User Datagram
Protocol (see 4.2, “User Datagram Protocol (UDP)” on
page 146). It provides connectionless, unreliable,
best-effort service. As a result, applications using UDP
as the transport protocol have to provide their own
end-to-end integrity, flow control, and congestion
control, if desired. Usually, UDP is used by
applications that need a fast transport mechanism and
can tolerate the loss of some data.
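The difference between the two transports shows up directly in the sockets API: a TCP socket is stream-oriented (SOCK_STREAM) and must complete a connection handshake before data flows, while a UDP socket is datagram-oriented (SOCK_DGRAM) and can send immediately with no delivery guarantee. A minimal sketch using Python's standard socket module (the commented-out destination host is a hypothetical placeholder):

```python
import socket

# TCP: connection-oriented, reliable byte stream.
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# tcp_sock.connect(("server.example.com", 80))  # three-way handshake first
# tcp_sock.sendall(b"data")                     # ordered, reliable delivery

# UDP: connectionless, best-effort datagrams.
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# udp_sock.sendto(b"data", ("server.example.com", 53))  # no handshake, no guarantee

tcp_sock.close()
udp_sock.close()
```

An application built on the UDP socket would itself have to detect loss, reorder data, and throttle its sending rate; the TCP socket gets all of that from the transport layer.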
Internetwork layer
The internetwork layer, also called the internet layer
or the network layer, provides the “virtual network”
image of an internet (this layer shields the higher
levels from the physical network architecture below
it). Internet Protocol (IP) is the most important
protocol in this layer. It is a connectionless protocol
that does not assume reliability from lower layers. IP
does not provide reliability, flow control, or error
recovery. These functions must be provided at a
higher level.
IP provides a routing function that attempts to deliver
transmitted messages to their destination. We discuss
IP in detail in Chapter 3, “Internetworking protocols” on
page 67. A message unit in an IP network is called an
IP datagram. This is the basic unit of information
transmitted across TCP/IP networks. Other
internetwork-layer protocols are ICMP, IGMP,
ARP, and RARP.
Network interface layer The network interface layer, also called the link layer
or the data-link layer, is the interface to the actual
network hardware. This interface may or may not
provide reliable delivery, and may be packet or stream
oriented. In fact, TCP/IP does not specify any protocol
here, but can use almost any network interface
available, which illustrates the flexibility of the IP layer.
Examples are IEEE 802.2, X.25 (which is reliable in
itself), ATM, FDDI, and even SNA. We discuss some
physical networks and interfaces in Chapter 2,
“Network interfaces” on page 29.
TCP/IP specifications do not describe or standardize
any network-layer protocols per se; they only
standardize ways of accessing those protocols from
the internetwork layer.
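For a concrete view of the IP datagram mentioned above, the following sketch unpacks the fixed 20-byte IPv4 header (format per RFC 791) from raw bytes; the sample header here is hand-built for illustration rather than captured from a network:

```python
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Unpack the 20-byte fixed IPv4 header (RFC 791)."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBHII", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL counts 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "protocol": proto,                   # 6 = TCP, 17 = UDP
        "src": struct.pack("!I", src),       # source address bytes
        "dst": struct.pack("!I", dst),       # destination address bytes
    }

# Sample: version 4, IHL 5, total length 40, TTL 64, protocol TCP (6),
# source 10.0.0.1, destination 10.0.0.2
sample = struct.pack("!BBHHHBBHII", 0x45, 0, 40, 1, 0, 64, 6, 0,
                     0x0A000001, 0x0A000002)
hdr = parse_ipv4_header(sample)
print(hdr["version"], hdr["header_len"], hdr["protocol"])  # 4 20 6
```

Every field the internetwork layer needs for its routing decision, notably the destination address and the time-to-live, is carried in this header, which is why IP can forward the datagram without looking at the transport payload.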
A more detailed layering model is included in Figure 1-3.
[Figure content: application protocols (SMTP, Telnet, FTP, Gopher, and others) at the top of the stack; network interface and hardware technologies (Ethernet, token-ring, FDDI, X.25, wireless, async, ATM, and others) at the bottom.]
Figure 1-3 Detailed architectural model
1.1.3 TCP/IP applications
The highest-level protocols within the TCP/IP protocol stack are application
protocols. They communicate with applications on other internet hosts and are
the user-visible interface to the TCP/IP protocol suite.
All application protocols have some characteristics in common:
򐂰 They can be user-written applications or applications standardized and
shipped with the TCP/IP product. Indeed, the TCP/IP protocol suite includes
application protocols such as:
– Telnet for interactive terminal access to remote internet hosts
– File Transfer Protocol (FTP) for high-speed disk-to-disk file transfers
– Simple Mail Transfer Protocol (SMTP) as an internet mailing system
These are some of the most widely implemented application protocols, but
many others exist. Each particular TCP/IP implementation will include a
lesser or greater set of application protocols.
򐂰 They use either UDP or TCP as a transport mechanism. Remember that UDP
is unreliable and offers no flow control, so in this case, the application has to
provide its own error recovery, flow control, and congestion control
functionality. It is often easier to build applications on top of TCP because it is
a reliable stream, connection-oriented, congestion-friendly, flow
control-enabled protocol. As a result, most application protocols will use TCP,
but there are applications built on UDP to achieve better performance through
increased protocol efficiencies.
򐂰 Most applications use the client/server model of interaction.
The client/server model
TCP is a peer-to-peer, connection-oriented protocol. There are no
master/subordinate relationships. The applications, however, typically use a
client/server model for communications, as demonstrated in Figure 1-4.
A server is an application that offers a service to internet users. A client is a
requester of a service. An application consists of both a server and a client part,
which can run on the same or on different systems. Users usually invoke the
client part of the application, which builds a request for a particular service and
sends it to the server part of the application using TCP/IP as a transport vehicle.
The server is a program that receives a request, performs the required service,
and sends back the results in a reply. A server can usually deal with multiple
requests and multiple requesting clients at the same time.
Figure 1-4 The client/server model of applications
Most servers wait for requests at a well-known port so that their clients know to
which port (and in turn, which application) they must direct their requests. The
client typically uses an arbitrary port called an ephemeral port for its
communication. Clients that want to communicate with a server that does not use
a well-known port must have another mechanism for learning to which port they
must address their requests. This mechanism might employ a registration
service such as portmap, which does use a well-known port.
For detailed information about TCP/IP application protocols, refer to Part 2,
“TCP/IP application protocols” on page 405.
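The well-known-port versus ephemeral-port arrangement can be demonstrated with a small loopback TCP exchange. In this sketch (ours, not from the original text), the server binds a port that stands in for a well-known port, while the client's own port is assigned ephemerally by the operating system:

```python
import socket
import threading

def serve_once(listener: socket.socket) -> None:
    """Wait at the listening port, service one request, send a reply."""
    conn, _peer = listener.accept()     # _peer[1] is the client's ephemeral port
    request = conn.recv(1024)
    conn.sendall(b"reply to " + request)
    conn.close()

def exchange(message: bytes) -> bytes:
    """One client/server request-reply round trip over loopback TCP."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))     # 0 = let the OS pick the server's port
    listener.listen(1)
    server_port = listener.getsockname()[1]   # the "well-known" port here
    t = threading.Thread(target=serve_once, args=(listener,))
    t.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", server_port))  # client targets the known port
    client.sendall(message)
    reply = client.recv(1024)
    client.close()
    t.join()
    listener.close()
    return reply

print(exchange(b"request"))   # b'reply to request'
```

Note that only the server's port must be agreed on in advance; the client never advertises its own port, because the server learns it automatically from the incoming connection.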
Bridges, routers, and gateways
There are many ways to provide access to other networks. In an internetwork,
this is done with routers. In this section, we distinguish between a router, a bridge,
and a gateway for allowing remote network access:
Bridge
Interconnects LAN segments at the network interface
layer level and forwards frames between them. A bridge
performs the function of a MAC relay, and is independent
of any higher layer protocol (including the logical link
protocol). It provides MAC layer protocol conversion, if required.
A bridge is said to be transparent to IP. That is, when an
IP host sends an IP datagram to another host on a
network connected by a bridge, it sends the datagram
directly to the host and the datagram “crosses” the bridge
without the sending IP host being aware of it.
Router
Interconnects networks at the internetwork layer level and
routes packets between them. The router must
understand the addressing structure associated with the
networking protocols it supports and take decisions on
whether, or how, to forward packets. Routers are able to
select the best transmission paths and optimal packet
sizes. The basic routing function is implemented in the IP
layer of the TCP/IP protocol stack, so any host or
workstation running TCP/IP over more than one interface
could, in theory and also with most of today's TCP/IP
implementations, forward IP datagrams. However,
dedicated routers provide much more sophisticated
routing than the minimum functions implemented by IP.
Because IP provides this basic routing function, the term
“IP router,” is often used. Other, older terms for router are
“IP gateway,” “Internet gateway,” and “gateway.” The term
gateway is now normally used for connections at a higher
layer than the internetwork layer.
A router is said to be visible to IP. That is, when a host
sends an IP datagram to another host on a network
connected by a router, it sends the datagram to the router
so that it can forward it to the target host.
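The forwarding decision a router makes can be sketched as a table lookup keyed on the destination's network number. This is a simplified model with a hypothetical forwarding table; real routers use longest-prefix matching over far larger tables with much richer policy:

```python
import ipaddress

# Hypothetical forwarding table: destination network -> outgoing interface
FORWARDING_TABLE = {
    ipaddress.ip_network("10.1.0.0/16"): "eth0",
    ipaddress.ip_network("10.2.0.0/16"): "eth1",
    ipaddress.ip_network("0.0.0.0/0"): "ppp0",   # default route
}

def next_interface(dst: str) -> str:
    """Pick the most specific (longest-prefix) route matching dst."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in FORWARDING_TABLE if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return FORWARDING_TABLE[best]

print(next_interface("10.2.33.7"))   # eth1
print(next_interface("192.0.2.9"))   # ppp0 (falls through to the default route)
```

The lookup examines only the network-number part of the destination address, which is what allows routing tables to stay far smaller than the number of hosts on the internetwork.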
Gateway
Interconnects networks at higher layers than bridges and
routers. A gateway usually supports address mapping
from one network to another, and might also provide
transformation of the data between the environments to
support end-to-end application connectivity. Gateways
typically limit the interconnectivity of two networks to a
subset of the application protocols supported on either
one. For example, a VM host running TCP/IP can be used
as an SMTP/RSCS mail gateway.
Note: The term “gateway,” when used in this sense, is not
synonymous with “IP gateway.”
A gateway is said to be opaque to IP. That is, a host
cannot send an IP datagram through a gateway; it can
only send it to a gateway. The higher-level protocol
information carried by the datagrams is then passed on by
the gateway using whatever networking architecture is
used on the other side of the gateway.
Closely related to routers and gateways is the concept of a firewall, or firewall
gateway, which is used to restrict access from the Internet or some untrusted
network to a network or group of networks controlled by an organization for
security reasons. See 22.3, “Firewalls” on page 794 for more information about firewalls.
1.2 The roots of the Internet
Networks have become a fundamental, if not the most important, part of today's
information systems. They form the backbone for information sharing in
enterprises, governmental groups, and scientific groups. That information can
take several forms. It can be notes and documents, data to be processed by
another computer, files sent to colleagues, and multimedia data streams.
A number of networks were installed in the late 1960s and 1970s, when network
design was the “state of the art” topic of computer research and sophisticated
implementers. It resulted in multiple networking models such as packet-switching
technology, collision-detection local area networks, hierarchical networks, and
many other excellent communications technologies.
The result of all this great know-how was that any group of users could find a
physical network and an architectural model suitable for their specific needs.
This ranges from inexpensive asynchronous lines with no other error recovery
than a bit-per-bit parity function, through full-function wide area networks (public
or private) with reliable protocols such as public packet-switching networks or
private SNA networks, to high-speed but limited-distance local area networks.
The downside of the development of such heterogeneous protocol suites is the
rather painful situation where one group of users wants to extend its information
system to another group of users who have implemented a different network
technology and different networking protocols. As a result, even if they could
agree on some network technology to physically interconnect the two
environments, their applications (such as mailing systems) would still not be able
to communicate with each other because of different application protocols and conventions.
This situation was recognized in the early 1970s by a group of U.S. researchers
funded by the Defense Advanced Research Projects Agency (DARPA). Their
work addressed internetworking, or the interconnection of networks. Other
official organizations became involved in this area, such as ITU-T (formerly
CCITT) and ISO. The main goal was to define a set of protocols, detailed in a
well-defined suite, so that applications would be able to communicate with other
applications, regardless of the underlying network technology or the operating
systems where those applications run.
The official organization of these researchers was the ARPANET Network
Working Group, which had its last general meeting in October 1971. DARPA
continued its research for an internetworking protocol suite, from the early
Network Control Program (NCP) host-to-host protocol to the TCP/IP protocol
suite, which took its current form around 1978. At that time, DARPA was well
known for its pioneering of packet-switching over radio networks and satellite
channels. The first real implementations of the Internet were found around 1980
when DARPA started converting the machines of its research network
(ARPANET) to use the new TCP/IP protocols. In 1983, the transition was
completed and DARPA demanded that all computers willing to connect to its ARPANET use TCP/IP.
DARPA also contracted Bolt, Beranek, and Newman (BBN) to develop an
implementation of the TCP/IP protocols for Berkeley UNIX® on the VAX and
funded the University of California at Berkeley to distribute the code free of
charge with their UNIX operating system. The first release of the Berkeley
Software Distribution (BSD) to include the TCP/IP protocol set was made
available in 1983 (4.2BSD). From that point on, TCP/IP spread rapidly among
universities and research centers and has become the standard communications
subsystem for all UNIX connectivity. The second release (4.3BSD) was
distributed in 1986, with updates in 1988 (4.3BSD Tahoe) and 1990 (4.3BSD
Reno). 4.4BSD was released in 1993. Due to funding constraints, 4.4BSD was
the last release of the BSD by the Computer Systems Research Group of the
University of California at Berkeley.
As TCP/IP internetworking spread rapidly, new wide area networks were created
in the U.S. and connected to ARPANET. In turn, other networks in the rest of the
world, not necessarily based on the TCP/IP protocols, were added to the set of
interconnected networks. The result is what is described as the Internet. We
describe some examples of the different networks that have played key roles in
this development in the next sections.
1.2.1 ARPANET
Sometimes referred to as the “grand-daddy of packet networks,” the ARPANET
was built by DARPA (which was called ARPA at that time) in the late 1960s to
accommodate research equipment on packet-switching technology and to allow
resource sharing for the Department of Defense's contractors. The network
interconnected research centers, some military bases, and government
locations. It soon became popular with researchers for collaboration through
electronic mail and other services. It was developed into a research utility run by
the Defense Communications Agency (DCA) by the end of 1975 and split in 1983
into MILNET for interconnection of military sites and ARPANET for
interconnection of research sites. This formed the beginning of the “capital I” Internet.
In 1974, the ARPANET was based on 56 Kbps leased lines that interconnected
packet-switching nodes (PSN) scattered across the continental U.S. and western
Europe. These were minicomputers running a protocol known as 1822 (after the
number of a report describing it) and dedicated to the packet-switching task.
Each PSN had at least two connections to other PSNs (to allow alternate routing
in case of circuit failure) and up to 22 ports for user computer (host) connections.
These 1822 systems offered reliable, flow-controlled delivery of a packet to a
destination node. This is the reason why the original NCP protocol was a rather
simple protocol. It was replaced by the TCP/IP protocols, which do not assume
the reliability of the underlying network hardware and can be used on
other-than-1822 networks. This 1822 protocol did not become an industry
standard, so DARPA decided later to replace the 1822 packet switching
technology with the CCITT X.25 standard.
Data traffic rapidly exceeded the capacity of the 56 Kbps lines that made up the
network, which were no longer able to support the necessary throughput. Today
the ARPANET has been replaced by new technologies in its role of backbone on
the research side of the connected Internet (see NSFNET later in this chapter),
while MILNET continues to form the backbone of the military side.
1.2.2 NSFNET
NSFNET, the National Science Foundation (NSF) Network, is a three-level
internetwork in the United States consisting of:
򐂰 The backbone: A network that connects separately administered and
operated mid-level networks and NSF-funded supercomputer centers. The
backbone also has transcontinental links to other networks such as EBONE,
the European IP backbone network.
򐂰 Mid-level networks: Three kinds of networks (regional, discipline-based, and
supercomputer consortium networks).
򐂰 Campus networks: Whether academic or commercial, connected to the
mid-level networks.
Over the years, the NSF upgraded its backbone to meet the increasing demands
of its clients:
򐂰 First backbone: Originally established by the NSF as a communications
network for researchers and scientists to access the NSF supercomputers,
the first NSFNET backbone used six DEC LSI/11 microcomputers as packet
switches, interconnected by 56 Kbps leased lines. A primary interconnection
between the NSFNET backbone and the ARPANET existed at Carnegie
Mellon, which allowed routing of datagrams between users connected to each
of those networks.
򐂰 Second backbone: The need for a new backbone appeared in 1987, when the
first one became overloaded within a few months (estimated growth at that
time was 100% per year). The NSF and MERIT, Inc., a computer network
consortium of eight state-supported universities in Michigan, agreed to
develop and manage a new, higher-speed backbone with greater
transmission and switching capacities. To manage it, they defined the
Information Services (IS), which comprises an Information Center and a
Technical Support Group. The Information Center is responsible for
information dissemination, information resource management, and electronic
communication. The Technical Support Group provides support directly to the
field. The purpose of this is to provide an integrated information system with
easy-to-use-and-manage interfaces accessible from any point in the network
supported by a full set of training services.
Merit and NSF conducted this project in partnership with IBM and MCI. IBM
provided the software, packet-switching, and network-management
equipment, while MCI provided the long-distance transport facilities. Installed
in 1988, the new network initially used 448 Kbps leased circuits to
interconnect 13 nodal switching systems (NSSs), supplied by IBM. Each NSS
was composed of nine IBM RISC systems (running an IBM version of 4.3BSD
UNIX) loosely coupled by two IBM token-ring networks (for redundancy). One
Integrated Digital Network Exchange (IDNX) supplied by IBM was installed at
each of the 13 locations, to provide:
– Dynamic alternate routing
– Dynamic bandwidth allocation
򐂰 Third backbone: In 1989, the NSFNET backbone circuit topology was
reconfigured after traffic measurements, and the speed of the leased lines
was increased to T1 (1.544 Mbps), primarily using fiber optics.
Due to the constantly increasing need for improved packet switching and
transmission capacities, three NSSs were added to the backbone and the link
speed was upgraded. The migration of the NSFNET backbone from T1 to T3
(45 Mbps) was completed in late 1992. The subsequent migration to gigabit
levels has already started and is continuing today.
In April 1995, the U.S. government discontinued its funding of NSFNET. This
was, in part, a reaction to growing commercial use of the network. About the
same time, NSFNET gradually migrated the main backbone traffic in the U.S. to
commercial network service providers, and NSFNET reverted to being a network
for the research community. The main backbone network is now run in
cooperation with MCI and is known as the vBNS (very high speed Backbone
Network Service).
NSFNET has played a key role in the development of the Internet. However,
many other networks have also played their part and also make up a part of the
Internet today.
1.2.3 Commercial use of the Internet
In recent years the Internet has grown in size and range at a greater rate than
anyone could have predicted. A number of key factors have influenced this
growth. Some of the most significant milestones have been the free distribution
of Gopher in 1991, the first posting, also in 1991, of the specification for hypertext
and, in 1993, the release of Mosaic, the first graphics-based browser. Today the
vast majority of the hosts now connected to the Internet are of a commercial
nature. This is an area of potential and actual conflict with the initial aims of the
Internet, which were to foster open communications between academic and
research institutions. However, the continued growth in commercial use of the
Internet is inevitable, so it will be helpful to explain how this evolution is taking place.
One important initiative to consider is that of the Acceptable Use Policy (AUP).
The first of these policies was introduced in 1992 and applies to the use of
NSFNET. At the heart of this AUP is a commitment “to support open research
and education.” Under “Unacceptable Uses” is a prohibition of “use for for-profit
activities,” unless covered by the General Principle or as a specifically
acceptable use. However, in spite of this apparently restrictive stance, the
NSFNET was increasingly used for a broad range of activities, including many of
a commercial nature, before reverting to its original objectives in 1995.
The provision of an AUP is now commonplace among Internet service providers,
although the AUP has generally evolved to be more suitable for commercial use.
Some networks still provide services free of any AUP.
Let us now focus on the Internet service providers who have been most active in
introducing commercial uses to the Internet. Two worth mentioning are PSINet
and UUNET, which began in the late 1980s to offer Internet access to both
businesses and individuals. The California-based CERFnet provided services
free of any AUP. An organization to interconnect PSINet, UUNET, and CERFnet
was formed soon after, called the Commercial Internet Exchange (CIX), based
on the understanding that the traffic of any member of one network may flow
without restriction over the networks of the other members. As of July 1997, CIX
had grown to more than 146 members from all over the world, connecting
member internets. At about the same time that CIX was formed, a non-profit
company, Advanced Network and Services (ANS), was formed by IBM, MCI, and
Merit, Inc. to operate T1 (subsequently T3) backbone connections for NSFNET.
This group was active in increasing the commercial presence on the Internet.
ANS formed a commercially oriented subsidiary called ANS CO+RE to provide
linkage between commercial customers and the research and education
domains. ANS CO+RE provides access to NSFNET as well as being linked to
CIX. In 1995 ANS was acquired by America Online.
In 1995, as the NSFNET was reverting to its previous academic role, the
architecture of the Internet changed from having a single dominant backbone in
the U.S. to having a number of commercially operated backbones. In order for
the different backbones to be able to exchange data, the NSF set up four
Network Access Points (NAPs) to serve as data interchange points between the
backbone service providers.
Another type of interchange is the Metropolitan Area Ethernet (MAE). Several
MAEs have been set up by Metropolitan Fiber Systems (MFS), who also have
their own backbone network. NAPs and MAEs are also referred to as public
exchange points (IXPs). Internet service providers (ISPs) typically will have
connections to a number of IXPs for performance and backup. For a current
listing of IXPs, consult the Exchange Point at:
Similar to CIX in the United States, European Internet providers formed the RIPE
(Réseaux IP Européens) organization to ensure technical and administrative
coordination. RIPE was formed in 1989 to provide a uniform IP service to users
throughout Europe. Today, the largest Internet backbones run at OC48 (2.4
Gbps) or OC192 (9.95 Gbps).
1.2.4 Internet2
The success of the Internet and the subsequent frequent congestion of the
NSFNET and its commercial replacement led to some frustration among the
research community who had previously enjoyed exclusive use of the Internet.
The university community, therefore, together with government and industry
partners, and encouraged by the funding component of the Next Generation
Internet (NGI) initiative, have formed the Internet2 project.
The NGI initiative is a federal research program that is developing advanced
networking technologies, introducing revolutionary applications that require
advanced networking technologies and demonstrating these technological
capabilities on high-speed testbeds.
The Internet2 mission is to facilitate and coordinate the development, operation,
and technology transfer of advanced, network-based applications and network
services to further U.S. leadership in research and higher education and
accelerate the availability of new services and applications on the Internet.
Internet2 has the following goals:
򐂰 Demonstrate new applications that can dramatically enhance researchers’
ability to collaborate and conduct experiments.
򐂰 Demonstrate enhanced delivery of education and other services (for instance,
health care, environmental monitoring, and so on) by taking advantage of
virtual proximity created by an advanced communications infrastructure.
򐂰 Support development and adoption of advanced applications by providing
middleware and development tools.
򐂰 Facilitate development, deployment, and operation of an affordable
communications infrastructure, capable of supporting differentiated quality of
service (QoS) based on application requirements of the research and
education community.
򐂰 Promote experimentation with the next generation of communications
technologies.
򐂰 Coordinate adoption of agreed working standards and common practices
among participating institutions to ensure end-to-end quality of service and
interoperability.
򐂰 Catalyze partnerships with governmental and private sector organizations.
򐂰 Encourage transfer of technology from Internet2 to the rest of the Internet.
򐂰 Study the impact of new infrastructure, services, and applications on higher
education and the Internet community in general.
Internet2 participants
Internet2 has 180 participating universities across the United States. Affiliate
organizations provide the project with valuable input. All participants in the
Internet2 project are members of the University Corporation for Advanced
Internet Development (UCAID).
In most respects, the partnership and funding arrangements for Internet2 will
parallel those of previous joint networking efforts of academia and government,
of which the NSFnet project is a very successful example. The United States
government will participate in Internet2 through the NGI initiative and related programs.
Internet2 also joins with corporate leaders to create the advanced network
services necessary to meet the requirements of broadband, networked
applications. Industry partners work primarily with campus-based and regional
university teams to provide the services and products needed to implement the
applications developed by the project. Major corporations currently participating
in Internet2 include Alcatel, Cisco Systems, IBM, Nortel Networks, Sprint, and
Sun Microsystems™. Additional support for Internet2 comes from collaboration
with non-profit organizations working in research and educational networking.
Affiliate organizations committed to the project include MCNC, Merit, National
Institutes of Health (NIH), and the State University System of Florida.
For more information about Internet2, see their Web page at:
1.2.5 The Open Systems Interconnection (OSI) Reference Model
The OSI (Open Systems Interconnect) Reference Model (ISO 7498) defines a
seven-layer model of data communication with physical transport at the lower
layer and application protocols at the upper layers. This model, shown in
Figure 1-5, is widely accepted as a basis for the understanding of how a network
protocol stack should operate and as a reference tool for comparing network
stack implementation.
Figure 1-5 The OSI Reference Model
Each layer provides a set of functions to the layer above and, in turn, relies on the
functions provided by the layer below. Although messages can only pass
vertically through the stack from layer to layer, from a logical point of view, each
layer communicates directly with its peer layer on other nodes.
The seven layers are:
Application
Network applications such as terminal emulation and file transfer
Presentation
Formatting of data and encryption
Session
Establishment and maintenance of sessions
Transport
Provision of reliable and unreliable end-to-end delivery
Network
Packet delivery, including routing
Data Link
Framing of units of information and error checking
Physical
Transmission of bits on the physical hardware
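The downward encapsulation and upward decapsulation implied by this layering can be illustrated with a small sketch (Python; the header contents here are toy text tags of our own, not any real protocol format, and the physical layer is omitted because it carries bits rather than adding a header):

```python
# Toy illustration of OSI-style layering: on the way down the stack each
# layer prepends its own header; the receiving stack strips them again in
# reverse order, so each layer "sees" only its peer's header.

LAYERS = ["application", "presentation", "session",
          "transport", "network", "data-link"]

def encapsulate(payload: bytes) -> bytes:
    """Wrap payload with one toy header per layer (data link ends up outermost)."""
    for layer in LAYERS:
        payload = f"[{layer}]".encode() + payload
    return payload

def decapsulate(frame: bytes) -> bytes:
    """Strip the headers again, outermost (data link) first."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"malformed frame: expected {layer} header")
        frame = frame[len(header):]
    return frame

print(encapsulate(b"data"))  # the data-link header is outermost
```

Running the round trip, `decapsulate(encapsulate(b"data"))` returns the original payload, mirroring how a message passes vertically down one stack and up its peer.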
In contrast to TCP/IP, the OSI approach started from a clean slate and defined
standards, adhering tightly to their own model, using a formal committee process
without requiring implementations. Internet protocols use a less formal
engineering approach, where anybody can propose and comment on Request
for Comments, known as RFC, and implementations are required to verify
feasibility. The OSI protocols developed slowly, and because running the full
protocol stack is resource intensive, they have not been widely deployed,
especially in the desktop and small computer market. In the meantime, TCP/IP
and the Internet were developing rapidly, with deployment occurring at a very
high rate.
1.3 TCP/IP standards
TCP/IP has been popular with developers and users alike because of its inherent
openness and perpetual renewal. The same holds true for the Internet as an
open communications network. However, this openness could easily hurt as
well as help if it were not controlled in some way.
Although there is no overall governing body to issue directives and regulations
for the Internet—control is mostly based on mutual cooperation—the Internet
Society (ISOC) serves as the standardizing body for the Internet community. It is
organized and managed by the Internet Architecture Board (IAB).
The IAB itself relies on the Internet Engineering Task Force (IETF) for issuing
new standards, and on the Internet Assigned Numbers Authority (IANA) for
coordinating values shared among multiple protocols. The RFC Editor is
responsible for reviewing and publishing new standards documents.
The IETF itself is governed by the Internet Engineering Steering Group (IESG)
and is further organized in the form of Areas and Working Groups where new
specifications are discussed and new standards are proposed.
The Internet Standards Process, described in RFC 2026, The Internet Standards
Process, Revision 3, is concerned with all protocols, procedures, and
conventions that are used in or by the Internet, whether or not they are part of the
TCP/IP protocol suite.
The overall goals of the Internet Standards Process are:
򐂰 Technical excellence
򐂰 Prior implementation and testing
򐂰 Clear, concise, and easily understood documentation
򐂰 Openness and fairness
򐂰 Timeliness
The process of standardization is summarized as follows:
򐂰 In order to have a new specification approved as a standard, applicants have
to submit that specification to the IESG where it will be discussed and
reviewed for technical merit and feasibility and also published as an Internet
draft document. This period should take no less than two weeks and no longer
than six months.
򐂰 After the IESG reaches a positive conclusion, it issues a last-call notification
to allow the specification to be reviewed by the whole Internet community.
򐂰 After the final approval by the IESG, an Internet draft is recommended to the
Internet Engineering Taskforce (IETF), another subsidiary of the IAB, for
inclusion into the Standards Track and for publication as a Request for
Comments (see 1.3.1, “Request for Comments (RFC)” on page 22).
򐂰 Once published as an RFC, a contribution may advance in status as
described in 1.3.2, “Internet standards” on page 24. It may also be revised
over time or phased out when better solutions are found.
򐂰 If the IESG does not approve the new specification, or if the document has
remained unchanged for six months after submission, it will be removed
from the Internet drafts directory.
1.3.1 Request for Comments (RFC)
The Internet protocol suite is still evolving through the mechanism of Request for
Comments (RFC). New protocols (mostly application protocols) are being
designed and implemented by researchers, and are brought to the attention of
the Internet community in the form of an Internet draft (ID). The largest source
of IDs is the Internet Engineering Task Force (IETF), which is a subsidiary of the
IAB. However, anyone can submit a memo proposed as an ID to the RFC Editor.
There is a set of rules that RFC/ID authors must follow in order for an RFC to
be accepted. These rules are themselves described in an RFC (RFC 2223),
which also indicates how to submit a proposal for an RFC.
After an RFC has been published, all revisions and replacements are published
as new RFCs. A new RFC that revises or replaces an existing RFC is said to
“update” or to “obsolete” that RFC. The existing RFC is said to be “updated by” or
“obsoleted by” the new one. For example, RFC 1542, which describes the
BOOTP protocol, is a “second edition,” being a revision of RFC 1532 and an
amendment to RFC 951. RFC 1542 is therefore labelled like this: “Obsoletes
RFC 1532; Updates RFC 951.” Consequently, there is never any confusion over
whether two people are referring to different versions of an RFC, because there
is never more than one current version.
Note: Some of these protocols can be described as impractical at best. For
instance, RFC 1149 (dated 1 April 1990) describes the transmission of IP
datagrams by carrier pigeon, and RFC 1437 (dated 1 April 1993) describes the
transmission of people by electronic mail.
Some RFCs are described as information documents, while others describe
Internet protocols. The Internet Architecture Board (IAB) maintains a list of the
RFCs that describe the protocol suite. Each of these is assigned a state and a
status.
An Internet protocol can have one of the following states:
Standard
The IAB has established this as an official protocol for the
Internet. These are separated into two groups: IP protocol and
above, protocols that apply to the whole Internet; and
network-specific protocols, generally specifications of how to do
IP on particular types of networks.
Draft standard
The IAB is actively considering this protocol as a possible
standard protocol. Substantial and widespread testing and
comments are desired. Submit comments and test results to the
IAB. There is a possibility that changes will be made in a draft
protocol before it becomes a standard.
Proposed standard
These are protocol proposals that might be considered by the
IAB for standardization in the future. Implementations and
testing by several groups are desirable. Revision of the protocol
is likely.
Experimental
A system should not implement an experimental protocol unless
it is participating in the experiment and has coordinated its use
of the protocol with the developer of the protocol.
Informational
Protocols developed by other standard organizations, or
vendors, or that are for other reasons outside the purview of the
IAB may be published as RFCs for the convenience of the
Internet community as informational protocols. Such protocols
might, in some cases, also be recommended for use on the
Internet by the IAB.
Historic
These are protocols that are unlikely to ever become standards
in the Internet, either because they have been superseded by
later developments or due to lack of interest.
Protocol status can be any of the following:
Required
A system must implement the required protocols.
Recommended
A system should implement the recommended protocols.
Elective
A system may or may not implement an elective protocol. The
general notion is that if you are going to do something like this,
you must do exactly this.
Limited use
These protocols are for use in limited circumstances. This may
be because of their experimental state, specialized nature,
limited functionality, or historic state.
Not recommended
These protocols are not recommended for general use. This
may be because of their limited functionality, specialized nature,
or experimental or historic state.
1.3.2 Internet standards
Proposed standard, draft standard, and standard protocols are described as
being on the Internet Standards Track. When a protocol reaches the standard
state, it is assigned a standard (STD) number. The purpose of STD numbers is to
clearly indicate which RFCs describe Internet standards. STD numbers
reference multiple RFCs when the specification of a standard is spread across
multiple documents. Unlike RFCs, where the number refers to a specific
document, STD numbers do not change when a standard is updated. STD
numbers do not, however, have version numbers because all updates are made
through RFCs and the RFC numbers are unique. Therefore, to clearly specify
which version of a standard one is referring to, the standard number and all of
the RFCs that it includes should be stated. For instance, the Domain Name
System (DNS) is STD 13 and is described in RFCs 1034 and 1035. To reference
the standard, a form such as “STD-13/RFC1034/RFC1035” should be used.
For some Standards Track RFCs, the status category does not always contain
enough information to be useful. It is therefore supplemented, notably for routing
protocols, by an applicability statement, which is given either in STD 1 or in a
separate RFC.
References to the RFCs and to STD numbers will be made throughout this book,
because they form the basis of all TCP/IP protocol implementations.
The following Internet standards are of particular importance:
򐂰 STD 1 – Internet Official Protocol Standards
This standard gives the state and status of each Internet protocol or standard
and defines the meanings attributed to each state or status. It is issued by the
IAB approximately quarterly. At the time of writing, this standard is in RFC 3700.
򐂰 STD 2 – Assigned Internet Numbers
This standard lists currently assigned numbers and other protocol parameters
in the Internet protocol suite. It is issued by the Internet Assigned Numbers
Authority (IANA). The current edition at the time of writing is RFC 3232.
򐂰 STD 3 – Host Requirements
This standard defines the requirements for Internet host software (often by
reference to the relevant RFCs). The standard comes in three parts:
– RFC 1122 – Requirements for Internet hosts – communications layer
– RFC 1123 – Requirements for Internet hosts – application and support
– RFC 2181 – Clarifications to the DNS Specification
򐂰 STD 4 – Router Requirements
This standard defines the requirements for IPv4 Internet gateway (router)
software. It is defined in RFC 1812 – Requirements for IPv4 Routers.
For Your Information (FYI)
A number of RFCs that are intended to be of wide interest to Internet users are
classified as For Your Information (FYI) documents. They frequently contain
introductory or other helpful information. Like STD numbers, an FYI number is
not changed when a revised RFC is issued. Unlike STDs, FYIs correspond to a
single RFC document. For example, FYI 4 – FYI on Questions and Answers:
Answers to Commonly asked “New Internet User” Questions, is currently in its
fifth edition. The RFC numbers are 1177, 1206, 1325, 1594, and 2664.
Obtaining RFCs
RFC and ID documents are available publicly and online and best obtained from
the IETF Web site:
A complete list of current Internet Standards can be found in RFC 3700 – Internet
Official Protocol Standards.
1.4 Future of the Internet
Trying to predict the future of the Internet is not an easy task. Few would have
imagined, even five years ago, the extent to which the Internet has now become
a part of everyday life in business, homes, and schools. There are a number of
things, however, about which we can be fairly certain.
1.4.1 Multimedia applications
Bandwidth requirements will continue to increase at massive rates; not only is
the number of Internet users growing rapidly, but the applications being used are
becoming more advanced and therefore consume more bandwidth. New
technologies such as dense wave division multiplexing (DWDM) are emerging to
meet these high bandwidth demands being placed on the Internet.
Much of this increasing demand is attributable to the increased use of multimedia
applications. One example is that of Voice over IP technology. As this technology
matures, we are almost certain to see a sharing of bandwidth between voice and
data across the Internet. This raises some interesting questions for phone
companies. The cost to a user of an Internet connection between Raleigh, NC,
and Santiago, Chile, is the same as that of a connection within Raleigh; this is
not so for a traditional phone connection. Inevitably, voice conversations will
become video conversations as phone calls become video conferences.
Today, it is possible to hear radio stations from almost any part of the globe
through the Internet with FM quality. We can watch television channels from all
around the world, leading to the clear potential of using the Internet as the
vehicle for delivering movies and all sorts of video signals to consumers
everywhere. It all comes at a price, however, as the infrastructure of the Internet
must adapt to such high bandwidth demands.
1.4.2 Commercial use
The Internet has been through an explosion in terms of commercial use. Today,
almost all large businesses depend on the Internet, whether for marketing, sales,
customer service, or employee access. These trends are expected to continue.
Electronic stores will continue to flourish by providing convenience to customers
that do not have time to make their way to traditional stores.
Businesses will rely more and more on the Internet as a means of communication
between branches across the globe. With the popularity of virtual private
networks (VPNs), businesses can securely conduct their internal business over a
wide area using the Internet; employees can work from home offices, yielding a
virtual office environment. Virtual meetings will probably become commonplace.
1.4.3 The wireless Internet
Perhaps the most widespread growth in the use of the Internet, however, is that
of wireless applications. Recently, there has been an incredible focus on the
enablement of wireless and pervasive computing. This focus has been largely
motivated by the convenience of wireless connectivity. For example, it is
impractical to physically connect a mobile workstation, which, by definition, is free
to roam. Constraining such a workstation to some physical geography simply
defeats the purpose. In other cases, wired connectivity simply is not feasible.
Examples include the ruins of Machu Picchu or offices in the Sistine Chapel. In
these circumstances, fixed workstations also benefit from otherwise unavailable
network access.
Protocols such as Bluetooth, IEEE 802.11, and Wireless Application Protocol
(WAP) are paving the way toward a wireless Internet. While the personal
benefits of such access are quite advantageous, even more appealing are the
business applications that are facilitated by such technology. Every business,
from factories to hospitals, could enhance their respective services. Wireless
devices will become standard equipment in vehicles, not only for the personal
enjoyment of the driver, but also for the flow of maintenance information to
your favorite automobile mechanic. The applications are limitless.
1.5 RFCs relevant to this chapter
The following RFCs provide detailed information about the connection protocols
and architectures presented throughout this chapter:
򐂰 RFC 2026 – The Internet Standards Process -- Revision 3 (October 1996)
򐂰 RFC 2223 – Instructions to RFC Authors (October 1997)
򐂰 RFC 2900 – Internet Official Protocol Standards (August 2001)
򐂰 RFC 3232 – Assigned Numbers: RFC 1700 is Replaced by an On-line
Database (January 2002)
Chapter 2.
Network interfaces
This chapter provides an overview of the protocols and interfaces that allow
TCP/IP traffic to flow over various kinds of physical networks. TCP/IP, as an
internetwork protocol suite, can operate over a vast number of physical
networks. The most common and widely used of these protocols is, of course,
Ethernet. Describing every network protocol that has provisions for natively
supporting IP is clearly beyond the scope of this redbook. However, we provide a
summary of some of the different networks most commonly used with TCP/IP.
Although this chapter primarily discusses the data link layer of the OSI model
(see 1.2.5, “The Open Systems Interconnection (OSI) Reference Model” on
page 20), it also provides some relevant physical layer technology.
© Copyright IBM Corp. 1989-2006. All rights reserved.
2.1 Ethernet and IEEE 802 local area networks (LANs)
Two frame formats (or standards) can be used on the Ethernet coaxial cable:
򐂰 The standard issued in 1978 by Xerox Corporation, Intel® Corporation, and
Digital Equipment Corporation, usually called Ethernet (or DIX Ethernet)
򐂰 The international IEEE 802.3 standard, a more recently defined standard
See Figure 2-1 for more details.
Figure 2-1 ARP: Frame formats for Ethernet and IEEE 802.3
The difference between the two standards is in the use of one of the header
fields, which contains a protocol-type number for Ethernet and the length of the
data in the frame for IEEE 802.3:
򐂰 The type field in Ethernet is used to distinguish between different protocols
running on the coaxial cable, and allows their coexistence on the same
physical cable.
򐂰 The maximum length of an Ethernet frame is 1526 bytes. This means a data
field length of up to 1500 bytes. The length of the 802.3 data field is also
limited to 1500 bytes for 10 Mbps networks, but is different for other
transmission speeds.
򐂰 In the 802.3 MAC frame, the length of the data field is indicated in the 802.3
header. The type of protocol it carries is then indicated in the 802.2 header
(higher protocol level; see Figure 2-1). In practice, however, both frame
formats can coexist on the same physical coaxial cable. This is done by using
protocol type numbers (type field) greater than 1500 in the Ethernet frame.
However, different device drivers are needed to handle each of these formats.
Therefore, for all practical purposes, the Ethernet physical layer and the IEEE
802.3 physical layer are compatible. However, the Ethernet data link layer and
the IEEE 802.3/802.2 data link layer are incompatible.
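The coexistence rule above, where a value greater than 1500 is a protocol type and anything else is an 802.3 data length, can be sketched in a few lines. This is an illustrative sketch with a function name of our own choosing; later standards formally reserve values of 1536 (hex 0600) and above for type numbers:

```python
import struct

def classify_frame(frame: bytes) -> str:
    """Classify a raw frame by the 2-byte field that follows the two
    6-byte MAC addresses: a protocol type (DIX Ethernet) or a data
    length (IEEE 802.3)."""
    (type_or_len,) = struct.unpack("!H", frame[12:14])
    if type_or_len > 1500:  # type numbers are assigned above the max data length
        return f"Ethernet II, EtherType 0x{type_or_len:04x}"
    return f"IEEE 802.3, data length {type_or_len}"

# An IP datagram in a DIX Ethernet frame carries type 2048 (hex 0800):
frame = b"\xff" * 6 + b"\x00" * 6 + b"\x08\x00" + b"ip-payload"
print(classify_frame(frame))  # Ethernet II, EtherType 0x0800
```

This is exactly why the two formats can share a cable: a single 16-bit field is unambiguous as long as type values never overlap legal length values.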
The 802.2 Logical Link Control (LLC) layer above IEEE 802.3 uses a concept
known as link service access point (LSAP), which uses a 3-byte header, where
DSAP and SSAP stand for destination and source service access point,
respectively. Numbers for these fields are assigned by an IEEE committee (see
Figure 2-2).
Figure 2-2 ARP: IEEE 802.2 LSAP header
Due to a growing number of applications using IEEE 802 as lower protocol
layers, an extension was made to the IEEE 802.2 protocol in the form of the
Subnetwork Access Protocol (SNAP) (see Figure 2-3). It is an extension to the
LSAP header in Figure 2-2, and its use is indicated by the value 170 in both the
SSAP and DSAP fields of the LSAP frame.
Figure 2-3 ARP: IEEE 802.2 SNAP header
In the evolution of TCP/IP, standards were established that describe the
encapsulation of IP and ARP frames on these networks:
򐂰 Introduced in 1984, RFC 894 – Standard for the Transmission of IP
Datagrams over Ethernet Networks specifies the use of Ethernet-type
networks only. The values assigned to the type field are:
– 2048 (hex 0800), for IP datagrams
– 2054 (hex 0806), for ARP datagrams
򐂰 Introduced in 1985, RFC 948 – Two Methods for the Transmission of IP
Datagrams over IEEE 802.3 Networks specifies two possibilities:
– The Ethernet compatible method: The frames are sent on a real IEEE
802.3 network in the same fashion as on an Ethernet network, that is,
using the IEEE 802.3 data-length field as the Ethernet type field, thereby
violating the IEEE 802.3 rules, but compatible with an Ethernet network.
– IEEE 802.2/802.3 LLC type 1 format: Using 802.2 LSAP header with IP
using the value 6 for the SSAP and DSAP fields.
The RFC indicates clearly that the IEEE 802.2/802.3 method is the preferred
method, that is, that all future IP implementations on IEEE 802.3 networks are
supposed to use the second method.
򐂰 Introduced in 1987, RFC 1010 – Assigned Numbers (now obsoleted by RFC
3232, dated 2002) notes that as a result of IEEE 802.2 evolution and the need
for more Internet protocol numbers, a new approach was developed based on
practical experiences exchanged during the August 1986 TCP Vendors
Workshop. It states, in an almost completely overlooked part of this RFC, that
all IEEE 802.3, 802.4, and 802.5 implementations should use the Subnetwork
Access Protocol (SNAP) form of the IEEE 802.2 LLC, with the DSAP and
SSAP fields set to 170 (indicating the use of SNAP), and with the SNAP header assigned as follows:
– 0 (zero) as organization code
– EtherType field:
2048 (hex 0800), for IP datagrams
2054 (hex 0806), for ARP datagrams
32821 (hex 8035), for RARP datagrams
These are the same values used in the Ethernet type field.
򐂰 In 1988, RFC 1042 – Standard for the Transmission of IP Datagrams over
IEEE 802 Networks was introduced. Because the new SNAP-based approach (very
important for implementations) passed almost unnoticed in a little note of an
unrelated RFC, it caused considerable confusion. As such, in February of 1988 it
was repeated in an RFC of its own, RFC 1042, which obsoletes RFC 948.
򐂰 In 1998, RFC 2464 – Transmission of IPv6 Packets over Ethernet Networks
was introduced. This extended Ethernet’s frame format to allow IPv6 packets
to traverse Ethernet networks.
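The SNAP form described in the RFC 1010 item above can be sketched as a hypothetical helper (the LLC octets AA AA 03 precede the 5-byte SNAP header, which carries the zero organization code and the EtherType):

```python
import struct

SNAP_SAP = 170     # 0xAA in both DSAP and SSAP indicates SNAP
UI_CONTROL = 0x03  # 802.2 LLC type 1 unnumbered information

def build_llc_snap(ethertype: int, oui: bytes = b"\x00\x00\x00") -> bytes:
    """LLC header (DSAP=SSAP=0xAA, UI) followed by the 5-byte SNAP
    header: 3-byte organization code plus 2-byte EtherType."""
    llc = struct.pack("!BBB", SNAP_SAP, SNAP_SAP, UI_CONTROL)
    return llc + oui + struct.pack("!H", ethertype)

ip_hdr = build_llc_snap(0x0800)    # 2048: IP datagrams
arp_hdr = build_llc_snap(0x0806)   # 2054: ARP datagrams
rarp_hdr = build_llc_snap(0x8035)  # 32821: RARP datagrams
assert ip_hdr == bytes([0xAA, 0xAA, 0x03, 0, 0, 0, 0x08, 0x00])
```

These EtherType values are the same ones used in the Ethernet type field, which is what makes the SNAP form easy to bridge to DIX Ethernet.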
The relevant IBM TCP/IP products implement RFC 894 for DIX Ethernet and
RFC 3232 for IEEE 802.3 networks. However, in practical situations, there are
still TCP/IP implementations that use the older LSAP method (RFC 948 or 1042).
TCP/IP Tutorial and Technical Overview
Such implementations will not communicate with the more recent
implementations (such as IBM's).
Also note that the last method covers not only the IEEE 802.3 networks, but also
the IEEE 802.4 and 802.5 networks, such as the IBM token-ring LAN.
2.1.1 Gigabit Ethernet
As advances in hardware continue to provide faster transmissions across
networks, Ethernet implementations have improved in order to capitalize on the
faster speeds. Fast Ethernet increased the speed of traditional Ethernet from 10
megabits per second (Mbps) to 100 Mbps. This was further augmented to 1000
Mbps in June of 1998, when the IEEE defined the standard for Gigabit Ethernet
(IEEE 802.3z). Finally, in 2005, the IEEE 802.3-2005 standard
introduced 10 Gigabit Ethernet, also referred to as 10GbE. 10GbE provides
transmission speeds of 10 gigabits per second (Gbps), or 10000 Mbps, 10 times
the speed of Gigabit Ethernet. However, due to the novelty of 10GbE, there are
still limitations on the adapters over which 10GbE can be used, and no one
implementation standard has yet gained commercial acceptance.
2.2 Fiber Distributed Data Interface (FDDI)
The FDDI specifications define a family of standards for 100 Mbps fiber optic
LANs that provides the physical layer and media access control sublayer of the
data link layer, as defined by the ISO/OSI Model. Proposed initially by
draft-standard RFC 1188, IP and ARP over FDDI networks became a standard in
RFC 1390 (also STD 0036). It defines the encapsulating of IP datagrams and
ARP requests and replies in FDDI frames. RFC 2467 extended this standard in
order to allow the transmission of IPv6 packets over FDDI networks. Operation
on dual MAC stations is described in informational RFC 1329. Figure 2-4 on
page 34 shows the related protocol layers.
RFC 1390 states that all frames are transmitted in standard IEEE 802.2 LLC
Type 1 Unnumbered Information format, with the DSAP and SSAP fields of the
802.2 header set to the assigned global SAP value for SNAP (decimal 170).
The 24-bit Organization Code in the SNAP header is set to zero, and the
remaining 16 bits are the EtherType from Assigned Numbers (see RFC 3232),
that is:
򐂰 2048 for IP
򐂰 2054 for ARP
The mapping of 32-bit Internet addresses to 48-bit FDDI addresses is done
through the ARP dynamic discovery procedure. The broadcast Internet
addresses (whose host address is set to all ones) are mapped to the broadcast
FDDI address (all ones).
IP datagrams are transmitted as series of 8-bit bytes using the usual TCP/IP
transmission order called big-endian or network byte order.
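As a small illustration of network byte order, Python's struct module selects big-endian layout with "!", so the most significant byte is emitted first:

```python
import struct

value = 0x0A000001                 # the IP address 10.0.0.1 as a 32-bit integer
wire = struct.pack("!I", value)    # "!" = network (big-endian) byte order

# Most significant byte goes on the wire first
assert wire == b"\x0a\x00\x00\x01"
assert struct.unpack("!I", wire)[0] == value
```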
The FDDI MAC specification (ISO 9314-2 - ISO, Fiber Distributed Data Interface
- Media Access Control) defines a maximum frame size of 4500 bytes for all
frame fields. After taking the LLC/SNAP header into account, and to allow future
extensions to the MAC header and frame status fields, the MTU of FDDI
networks is set to 4352 bytes.
Refer to the IBM Redbook Local Area Network Concepts and Products: LAN
Architect, SG24-4753, the first volume of the four-volume series LAN Concepts
and Products, for more details about the FDDI architecture.
Figure 2-4 IP and ARP over FDDI
2.3 Serial Line IP (SLIP)
The TCP/IP protocol family runs over a variety of network media: IEEE 802.3 and
802.5 LANs, X.25 lines, satellite links, and serial lines. Standards for the
encapsulation of IP packets have been defined for many of these networks, but
there is no standard for serial lines. SLIP is currently a de facto standard,
commonly used for point-to-point serial connections running TCP/IP. Even
though SLIP is not an Internet standard, it is documented by RFC 1055.
SLIP is a very simple protocol, designed quite a long time ago, and is merely a
packet framing protocol. It defines a sequence of characters that frame IP
packets on a serial line, and nothing more. It does not provide any of the
following:
򐂰 Addressing: Both computers on a SLIP link need to know each other's IP
address for routing purposes. SLIP defines only the encapsulation protocol,
not any form of handshaking or link control. Links are manually connected
and configured, including the specification of the IP address.
򐂰 Packet type identification: SLIP cannot support multiple protocols across a
single link; thus, only one protocol can be run over a SLIP connection.
򐂰 Error detection/correction: SLIP does no form of frame error detection. The
higher-level protocols should detect corrupted packets caused by errors on
noisy lines. (IP header and UDP/TCP checksums should be sufficient.)
Because retransmitting a corrupted packet over a slow line takes so long, it
would be more efficient if SLIP provided some sort of simple error correction
mechanism of its own.
򐂰 Compression: SLIP provides no mechanism for compressing frequently used
IP header fields. Many applications over slow serial links tend to be
single-user interactive TCP traffic, such as Telnet. This frequently involves
small packet sizes and inefficiencies in TCP and IP headers that do not
change much between datagrams, but that can have a noticeably detrimental
effect on interactive response times.
However, many SLIP implementations now use Van Jacobson Header
Compression. This is used to reduce the size of the combined IP and TCP
headers from 40 bytes to 3-4 bytes by recording the states of a set of TCP
connections at each end of the link and replacing the full headers with
encoded updates for the normal case, where many of the fields are
unchanged or are incremented by small amounts between successive IP
datagrams for a session. RFC 1144 describes this compression.
The SLIP protocol has been essentially replaced by the Point-to-Point Protocol
(PPP), as described in the following section.
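The character framing that SLIP does define can be sketched directly from RFC 1055, whose special values are END (0xC0), ESC (0xDB), ESC_END (0xDC), and ESC_ESC (0xDD); the encoder/decoder below is a minimal illustration, not a production implementation:

```python
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD

def slip_encode(packet: bytes) -> bytes:
    """Frame an IP packet for a serial line per RFC 1055."""
    out = bytearray([END])                 # leading END flushes line noise
    for b in packet:
        if b == END:
            out += bytes([ESC, ESC_END])   # escape a data byte equal to END
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])   # escape a data byte equal to ESC
        else:
            out.append(b)
    out.append(END)                        # closing frame delimiter
    return bytes(out)

def slip_decode(frame: bytes) -> bytes:
    """Recover the original packet from a SLIP frame."""
    out, i = bytearray(), 0
    data = frame.strip(bytes([END]))       # drop the frame delimiters
    while i < len(data):
        if data[i] == ESC:
            out.append(END if data[i + 1] == ESC_END else ESC)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

pkt = bytes([0x45, END, ESC, 0x01])
assert slip_decode(slip_encode(pkt)) == pkt
```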
2.4 Point-to-Point Protocol (PPP)
Point-to-Point Protocol (PPP) is a network-specific standard protocol with STD
number 51. Its status is elective, and it is described in RFC 1661 and RFC 1662.
The standards defined in these RFCs were later extended to allow IPv6 over
PPP, defined in RFC 2472.
There are a large number of proposed standard protocols, which specify the
operation of PPP over different kinds of point-to-point links. Each has a status of
elective. We advise you to consult STD 1 – Internet Official Protocol Standards
for a list of PPP-related RFCs that are on the Standards Track.
Point-to-point circuits in the form of asynchronous and synchronous lines have
long been the mainstay for data communications. In the TCP/IP world, the de
facto standard SLIP protocol (see 2.3, “Serial Line IP (SLIP)” on page 34) has
served admirably in this area, and is still in widespread use for dial-up TCP/IP
connections. However, SLIP has a number of drawbacks that are addressed by
the Point-to-Point Protocol.
PPP has three main components:
򐂰 A method for encapsulating datagrams over serial links.
򐂰 A Link Control Protocol (LCP) for establishing, configuring, and testing the
data-link connection.
򐂰 A family of Network Control Protocols (NCPs) for establishing and
configuring different network-layer protocols. PPP is designed to allow the
simultaneous use of multiple network-layer protocols.
Before a link is considered to be ready for use by network-layer protocols, a
specific sequence of events must happen. The LCP provides a method of
establishing, configuring, maintaining, and terminating the connection. LCP goes
through the following phases:
1. Link establishment and configuration negotiation: In this phase, link control
packets are exchanged and link configuration options are negotiated. After
options are agreed on, the link is open, but is not necessarily ready for
network-layer protocols to be started.
2. Link quality determination: This phase is optional. PPP does not specify the
policy for determining quality, but does provide low-level tools, such as echo
request and reply.
3. Authentication: This phase is optional. Each end of the link authenticates
itself with the remote end using authentication methods agreed to during
phase 1.
4. Network-layer protocol configuration negotiation: After LCP has finished the
previous phase, network-layer protocols can be separately configured by the
appropriate NCP.
5. Link termination: LCP can terminate the link at any time. This is usually done
at the request of a human user, but can happen because of a physical event.
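The phase sequence above can be expressed as a small Python sketch. The phase labels are shorthand of my own, not PPP terminology, and the sketch ignores the fact that LCP can terminate the link at any point, not only at the end:

```python
def lcp_phases(check_quality: bool = False, authenticate: bool = False):
    """Return the ordered phases an LCP-managed link passes through;
    link quality determination and authentication are both optional."""
    phases = ["establish"]        # option negotiation; link opens here
    if check_quality:
        phases.append("quality")
    if authenticate:
        phases.append("authenticate")
    phases.append("network")      # per-protocol NCP configuration
    phases.append("terminate")    # in reality possible at any time
    return phases

assert lcp_phases() == ["establish", "network", "terminate"]
assert lcp_phases(authenticate=True) == [
    "establish", "authenticate", "network", "terminate"]
```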
2.4.1 Point-to-point encapsulation
A summary of the PPP encapsulation is shown in Figure 2-5.
Figure 2-5 PPP encapsulation frame
The encapsulation fields are defined as follows:
Protocol field
The protocol field is one or two octets, and its value
identifies the datagram encapsulated in the Information
field of the packet. Up-to-date values of the Protocol field
are specified in RFC 3232.
Information field
The Information field is zero or more octets. The
Information field contains the datagram for the protocol
specified in the Protocol field. The maximum length for the
information field, including padding, but not including the
Protocol field, is termed the Maximum Receive Unit
(MRU), which defaults to 1500 octets. By negotiation,
other values can be used for the MRU.
On transmission, the information field can be padded with
an arbitrary number of octets up to the MRU. It is the
responsibility of each protocol to distinguish padding
octets from real information.
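A minimal sketch of this encapsulation, assuming the two-octet form of the Protocol field and the well-known PPP protocol number 0x0021 for IPv4 (the helper name and MRU check are illustrative, not from any PPP implementation):

```python
import struct

DEFAULT_MRU = 1500   # default Maximum Receive Unit
PROTO_IP = 0x0021    # assigned PPP protocol number for IPv4

def ppp_encapsulate(protocol: int, datagram: bytes,
                    mru: int = DEFAULT_MRU) -> bytes:
    """Two-octet Protocol field followed by the Information field.
    The Information field (including padding) must fit in the MRU."""
    if len(datagram) > mru:
        raise ValueError("Information field exceeds the negotiated MRU")
    return struct.pack("!H", protocol) + datagram

frame = ppp_encapsulate(PROTO_IP, b"\x45\x00")
assert frame[:2] == b"\x00\x21"
```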
The IP Control Protocol (IPCP) is the NCP for IP and is responsible for
configuring, enabling, and disabling the IP protocol on both ends of the
point-to-point link. The IPCP options negotiation sequence is the same as for
LCP, thus allowing the possibility of reusing the code.
One important option used with IPCP is Van Jacobson Header Compression,
which is used to reduce the size of the combined IP and TCP headers from 40
bytes to approximately 3-4 by recording the states of a set of TCP connections at
each end of the link and replacing the full headers with encoded updates for the
normal case, where many of the fields are unchanged or are incremented by
small amounts between successive IP datagrams for a session. This
compression is described in RFC 1144.
2.5 Integrated Services Digital Network (ISDN)
This section describes how to use the PPP encapsulation over ISDN
point-to-point links. PPP over ISDN is documented by elective RFC 1618.
Because the ISDN B-channel is, by definition, a point-to-point circuit, PPP is well
suited for use over these links.
The ISDN Basic Rate Interface (BRI) usually supports two B-channels with a
capacity of 64 kbps each, and a 16 kbps D-channel for control information.
B-channels can be used for voice or data, and can also be combined to carry data jointly.
The ISDN Primary Rate Interface (PRI) can support many concurrent B-channel
links (usually 30) and one 64 kbps D-channel. The PPP LCP and NCP
mechanisms are particularly useful in this situation in reducing or eliminating
manual configuration and facilitating ease of communication between diverse
implementations. The ISDN D-channel can also be used for sending PPP
packets when suitably framed, but is limited in bandwidth and often restricts
communication links to a local switch.
PPP treats ISDN channels as bit- or octet-oriented synchronous links. These
links must be full-duplex, but can be either dedicated or circuit-switched. PPP
presents an octet interface to the physical layer. There is no provision for
sub-octets to be supplied or accepted. PPP does not impose any restrictions
regarding transmission rate other than that of the particular ISDN channel
interface. PPP does not require the use of control signals. When available, using
such signals can allow greater functionality and performance.
The definition of various encodings and scrambling is the responsibility of the
DTE/DCE equipment in use. While PPP will operate without regard to the
underlying representation of the bit stream, lack of standards for transmission will
hinder interoperability as surely as lack of data link standards. The D-channel
interface requires Non-Return-To-Zero (NRZ) encoding. Therefore, it is
recommended that NRZ be used over the B-channel interface. This will allow
frames to be easily exchanged between the B- and D-channels. However, when
the configuration of the encoding is allowed, NRZ Inverted (NRZI) is
recommended as an alternative in order to ensure a minimum ones density
where required over the clear B-channel. Implementations that want to
interoperate with multiple encodings can choose to detect those encodings
automatically. Automatic encoding detection is particularly important for primary
rate interfaces to avoid extensive preconfiguration.
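As an illustration of the encoding distinction, here is a bit-level NRZI sketch, assuming the common HDLC convention in which a 0 bit produces a level transition and a 1 bit leaves the level unchanged (conventions vary, so treat the polarity as an assumption):

```python
def nrzi_encode(bits, level=0):
    """NRZI encode a bit sequence: a 0 bit toggles the line level,
    a 1 bit keeps it, so transitions mark the zeros."""
    out = []
    for b in bits:
        if b == 0:
            level ^= 1   # transition encodes a 0
        out.append(level)
    return out

# A run of ones produces no transitions, which is why a minimum
# ones density (or NRZ) matters on a clear B-channel.
assert nrzi_encode([1, 1, 1]) == [0, 0, 0]
assert nrzi_encode([0, 0]) == [1, 0]
```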
Terminal adapters conforming to V.120 can be used as a simple interface to
workstations. The terminal adapter provides asynchronous-to-synchronous
conversion. Multiple B-channels can be used in parallel. V.120 is not
interoperable with bit-synchronous links, because V.120 does not provide octet
stuffing to bit stuffing conversion. Despite the fact that HDLC, LAPB, LAPD, and
LAPF are nominally distinguishable, multiple methods of framing should not be
used concurrently on the same ISDN channel. There is no requirement that PPP
recognize alternative framing techniques, or switch between framing techniques
without specific configuration. Experience has shown that the LLC Information
Element is not reliably transmitted end to end. Therefore, transmission of the
LLC-IE should not be relied upon for framing or encoding determination. No
LLC-IE values that pertain to PPP have been assigned. Any other values that are
received are not valid for PPP links, and can be ignored for PPP service. The
LCP recommended sync configuration options apply to ISDN links. The standard
LCP sync configuration defaults apply to ISDN links. The typical network
connected to the link is likely to have an MRU size of either 1500 or 2048 bytes
or greater. To avoid fragmentation, the maximum transmission unit (MTU) at the
network layer should not exceed 1500, unless a peer MRU of 2048 or greater is
specifically negotiated.
2.6 X.25
This topic describes the encapsulation of IP over X.25 networks, in accordance
with ISO/IEC and CCITT standards. IP over X.25 networks is documented by
RFC 1356 (which obsoletes RFC 877). RFC 1356 is a Draft Standard with a
status of elective. The substantive change to the IP encapsulation over X.25 is
an increase in the IP datagram MTU size, the X.25 maximum data packet size,
the virtual circuit management, and the interoperable encapsulation over X.25 of
protocols other than IP between multiprotocol routers and bridges.
One or more X.25 virtual circuits are opened on demand when datagrams arrive
at the network interface for transmission. Protocol data units (PDUs) are sent as
X.25 complete packet sequences. That is, PDUs begin on X.25 data packet
boundaries and the M bit (more data) is used to fragment PDUs that are larger
than one X.25 data packet in length. In the IP encapsulation, the PDU is the IP
datagram. The first octet in the call user data (CUD) field (the first data octet in
the Call Request packet) is used for protocol demultiplexing in accordance with
the Subsequent Protocol Identifier (SPI) in ISO/IEC TR 9577. This field contains
a one octet network-layer protocol identifier (NLPID), which identifies the
network-layer protocol encapsulated over the X.25 virtual circuit. For the Internet
community, the NLPID has four relevant values:
򐂰 The value hex CC (binary 11001100, decimal 204) is IP.
CCITT Recommendations I.465 and V.120, “Data Terminal Equipment Communications over the
Telephone Network with Provision for Statistical Multiplexing”, CCITT Blue Book, Volume VIII,
Fascicle VIII.1, 1988
򐂰 The value hex 81 (binary 10000001, decimal 129) identifies ISO/IEC 8473
򐂰 The value hex 82 (binary 10000010, decimal 130) is used specifically for
ISO/IEC 9542 (ES-IS). If there is already a circuit open to carry CLNP, it is not
necessary to open a second circuit to carry ES-IS.
򐂰 The value hex 80 (binary 10000000, decimal 128) identifies the use of the
IEEE Subnetwork Access Protocol (SNAP) to further encapsulate and identify
a single network-layer protocol. The SNAP-encapsulated protocol is identified
by including a five-octet SNAP header in the Call Request CUD field
immediately following the hex 80 octet. SNAP headers are not included in the
subsequent X.25 data packets. Only one SNAP-encapsulated protocol can
be carried over a virtual circuit opened using this encoding.
The value hex 00 identifies the null encapsulation used to multiplex multiple
network-layer protocols over the same circuit. RFC 3232 contains one other
non-CCITT and non-ISO/IEC value that has been used for Internet X.25
encapsulation identification, namely hex C5 (binary 11000101, decimal 197) for
Blacker X.25. This value may continue to be used, but only by prior
preconfiguration of the sending and receiving X.25 interfaces to support this
value. The hex CD (binary 11001101, decimal 205), listed in RFC 3232 for
ISO-IP, is also used by Blacker and can only be used by prior preconfiguration of
the sending and receiving X.25 interfaces.
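The NLPID values above amount to a demultiplexing table keyed on the first CUD octet; the following lookup is an illustrative sketch (the function and table names are my own), using only the values listed in this section:

```python
NLPID_NAMES = {
    0xCC: "IP",
    0x81: "ISO/IEC 8473 (CLNP)",
    0x82: "ISO/IEC 9542 (ES-IS)",
    0x80: "SNAP",
    0x00: "null encapsulation",
    0xC5: "Blacker X.25",            # only by prior preconfiguration
    0xCD: "ISO-IP (also Blacker)",   # only by prior preconfiguration
}

def demux_call_user_data(cud: bytes) -> str:
    """Inspect the first octet of the X.25 Call Request user data
    and report which network-layer protocol it announces."""
    nlpid = cud[0]
    return NLPID_NAMES.get(nlpid, "unknown NLPID 0x%02X" % nlpid)

assert demux_call_user_data(b"\xcc\x45\x00") == "IP"
assert demux_call_user_data(b"\x00") == "null encapsulation"
```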
Each system must only accept calls for protocols it can process. Every Internet
system must be able to accept the CC encapsulation for IP datagrams. Systems
that support NLPIDs other than hex CC (for IP) should allow their use to be
configured on a per-peer address basis. The Null encapsulation, identified by a
NLPID encoding of hex 00, is used in order to multiplex multiple network-layer
protocols over one circuit. When the Null encapsulation is used, each X.25
complete packet sequence sent on the circuit begins with a one-octet NLPID,
which identifies the network-layer protocol data unit contained only in that
particular complete packet sequence. Further, if the SNAP NLPID (hex 80) is
used, the NLPID octet is immediately followed by the five-octet SNAP header,
which is then immediately followed by the encapsulated PDU. The encapsulated
network-layer protocol can differ from one complete packet sequence to the next
over the same circuit.
Use of the single network-layer protocol circuits is more efficient in terms of
bandwidth if only a limited number of protocols are supported by a system. It also
allows each system to determine exactly which protocols are supported by its
communicating partner. Other advantages include being able to use X.25
accounting to detail each protocol and different quality of service or flow control
windows for different protocols. The Null encapsulation, for multiplexing, is useful
when a system, for any reason (such as implementation restrictions or network
cost considerations), can only open a limited number of virtual circuits
simultaneously. This is the method most likely to be used by a multiprotocol
router to avoid using an unreasonable number of virtual circuits. If performing
IEEE 802.1d bridging across X.25 is required, the Null encapsulation must be used.
IP datagrams must, by default, be encapsulated on a virtual circuit opened with
the CC CUD. Implementations can also support up to three other possible
encapsulations of IP:
򐂰 IP datagrams can be contained in multiplexed data packets on a circuit using
the Null encapsulation. Such data packets are identified by an NLPID of hex 00.
򐂰 IP can be encapsulated within the SNAP encapsulation on a circuit. This
encapsulation is identified by containing, in the 5-octet SNAP header, an
Organizationally Unique Identifier (OUI) of hex 00-00-00 and Protocol
Identifier (PID) of hex 08-00.
򐂰 On a circuit using the Null encapsulation, IP can be contained within the
SNAP encapsulation of IP in multiplexed data packets.
2.7 Frame relay
The frame relay network provides a number of virtual circuits that form the basis
for connections between stations attached to the same frame relay network. The
resulting set of interconnected devices forms a private frame relay group, which
can be either fully interconnected with a complete mesh of virtual circuits, or only
partially interconnected. In either case, each virtual circuit is uniquely identified at
each frame relay interface by a data link connection identifier (DLCI). In most
circumstances, DLCIs have strictly local significance at each frame relay
interface. Frame relay is documented in RFC 2427, and is expanded in RFC
2590 to allow the transmission of IPv6 packets.
2.7.1 Frame format
All protocols must encapsulate their packets within a Q.922 Annex A frame.
Additionally, frames contain the necessary information to identify the protocol
carried within the protocol data unit (PDU), thus allowing the receiver to properly
International Telecommunication Union, “ISDN Data Link Layer Specification for Frame Mode
Bearer Services,” ITU-T Recommendation Q.922, 1992
process the incoming packet (refer to Figure 2-6 on page 43). The format will be
as follows:
򐂰 The control field is the Q.922 control field. The UI (0x03) value is used unless
it is negotiated otherwise. The use of XID (0xAF or 0xBF) is permitted.
򐂰 The pad field is used to align the data portion (beyond the encapsulation
header) of the frame to a two octet boundary. If present, the pad is a single
octet and must have a value of zero.
򐂰 The Network Level Protocol ID (NLPID) field is administered by ISO and the
ITU. It contains values for many different protocols, including IP, CLNP, and
IEEE Subnetwork Access Protocol (SNAP). This field tells the receiver what
encapsulation or what protocol follows. Values for this field are defined in
ISO/IEC TR 9577. An NLPID value of 0x00 is defined within ISO/IEC TR
9577 as the null network layer or inactive set. Because it cannot be
distinguished from a pad field, and because it has no significance within the
context of this encapsulation scheme, an NLPID value of 0x00 is invalid under
the frame relay encapsulation.
There is no commonly implemented minimum or maximum frame size for frame
relay. A network must, however, support at least a 262-octet maximum.
Generally, the maximum will be greater than or equal to 1600 octets, but each
frame relay provider will specify an appropriate value for its network. A frame
relay data terminal equipment (DTE) must allow the maximum acceptable frame
size to be configurable.
“Information technology -- Protocol identification in the network layer,” ISO/IEC TR 9577, 1999
Figure 2-6 shows the format for a frame relay packet.
Figure 2-6 Frame relay packet format
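The encapsulation rules above can be sketched for the routed-IP case. This is an illustrative fragment only: it builds just the control and NLPID octets, omitting the Q.922 address octets that precede them in a real frame, and the helper names are assumptions:

```python
UI = 0x03        # Q.922 unnumbered information control field
NLPID_IP = 0xCC  # NLPID value for a routed IP packet

def fr_encapsulate_ip(datagram: bytes) -> bytes:
    """Control field plus NLPID 0xCC, then the IP datagram. No pad
    octet is needed: this 2-octet header already ends the
    encapsulation on a two-octet boundary."""
    return bytes([UI, NLPID_IP]) + datagram

def fr_decapsulate(payload: bytes):
    """Split a routed frame relay payload into control, NLPID, data."""
    control, nlpid = payload[0], payload[1]
    if nlpid == 0x00:
        # 0x00 cannot be distinguished from a pad octet, so it is
        # invalid as an NLPID under frame relay encapsulation
        raise ValueError("NLPID 0x00 is invalid under frame relay")
    return control, nlpid, payload[2:]

frame = fr_encapsulate_ip(b"\x45\x00")
assert fr_decapsulate(frame) == (UI, NLPID_IP, b"\x45\x00")
```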
2.7.2 Interconnect issues
There are two basic types of data packets that travel within the frame relay
network: routed packets and bridged packets. These packets have distinct
formats and must contain an indicator that the destination can use to correctly
interpret the contents of the frame. This indicator is embedded within the NLPID
and SNAP header information.
2.7.3 Data link layer parameter negotiation
Frame relay stations may choose to support the Exchange Identification (XID)
specified in Appendix III of Q.922. This XID exchange allows the following
parameters to be negotiated at the initialization of a frame relay circuit: maximum
frame size, retransmission timer, and the maximum number of outstanding
information (I) frames.
If this exchange is not used, these values must be statically configured by mutual
agreement of data link connection (DLC) endpoints, or must be defaulted to the
values specified in Section 5.9 of Q.922.
There are situations in which a frame relay station might want to dynamically
resolve a protocol address over permanent virtual circuits (PVCs). This can be
accomplished using the standard Address Resolution Protocol (ARP)
encapsulated within a SNAP-encoded frame relay packet.
Because of the inefficiencies of emulating broadcasting in a frame relay
environment, a new address resolution variation was developed. It is called
Inverse ARP, and describes a method for resolving a protocol address when the
hardware address is already known. In a frame relay network, the known
hardware address is the DLCI. Support for Inverse ARP is not required to
implement this specification, but it has proven useful for frame relay interfaces.
Stations must be able to map more than one IP address in the same IP subnet to
a particular DLCI on a frame relay interface. This need arises from applications
such as remote access, where servers must act as ARP proxies for many dial-in
clients, each assigned a unique IP address while sharing bandwidth on the same
DLC. The dynamic nature of such applications results in frequent address
association changes with no effect on the DLC's status.
As with any other interface that uses ARP, stations can learn the associations
between IP addresses and DLCIs by processing unsolicited (gratuitous) ARP
requests that arrive on the DLC. If one station wants to inform its peer station on
the other end of a frame relay DLC of a new association between an IP address
and that PVC, it should send an unsolicited ARP request with the source IP
address equal to the destination IP address, and both set to the new IP address
being used on the DLC. This allows a station to “announce” new client
connections on a particular DLCI. The receiving station must store the new
association, and remove any old existing association, if necessary, from any
other DLCI on the interface.
2.7.4 IP over frame relay
Internet Protocol (IP) datagrams sent over a frame relay network conform to the
encapsulation described previously. Within this context, IP can be encapsulated
in two different ways: an NLPID value indicating IP, or an NLPID value indicating SNAP.
Although both of these encapsulations are supported under the given definitions,
it is advantageous to select only one method as the appropriate mechanism for
encapsulating IP data. Therefore, encapsulate IP data using the NLPID value of
0xcc, indicating an IP packet. This option is more efficient, because it transmits
48 fewer bits without the SNAP header and is consistent with the encapsulation
of IP in an X.25 network.
2.8 PPP over SONET and SDH circuits
This discussion describes the use of the PPP encapsulation over Synchronous
Optical Network (SONET) and Synchronous Digital Hierarchy (SDH) links, which
is documented by RFC 2615. Because SONET and SDH are, by definition,
point-to-point circuits, PPP is well suited for use over these links. SONET is an
octet-synchronous multiplex scheme that defines a family of standard rates and
formats. Despite the name, it is not limited to optical links. Electrical
specifications have been defined for single-mode fiber, multimode fiber, and
CATV 75 ohm coaxial cable. The transmission rates are integral multiples of
51.840 Mbps, which can be used to carry T3/E3 bit-synchronous signals. The
allowed multiples are currently specified as shown in Table 2-1. Additionally, the
CCITT Synchronous Digital Hierarchy defines a subset of SONET
transmission rates beginning at 155.52 Mbps.
Table 2-1 SONET speed hierarchy
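Because every SONET rate is an integral multiple of the STS-1 base rate, the hierarchy can be computed rather than memorized; the helper below is a small illustrative sketch:

```python
STS1_MBPS = 51.840   # SONET base rate (STS-1)

def sts_rate(n: int) -> float:
    """Line rate of STS-n in Mbps, as an integral multiple of 51.840."""
    return round(STS1_MBPS * n, 3)

assert sts_rate(1) == 51.84
assert sts_rate(3) == 155.52    # STS-3c / STM-1, the basic rate for PPP
assert sts_rate(12) == 622.08   # STS-12c / STM-4
```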
2.8.1 Physical layer
PPP presents an octet interface to the physical layer. There is no provision for
sub-octets to be supplied or accepted. SONET and SDH links are full-duplex by
definition. The octet stream is mapped into the SONET/SDH Synchronous
Payload Envelope (SPE) with the octet boundaries aligned with the SPE octet
boundaries. No scrambling is needed during insertion into the SPE. The path
signal label is intended to indicate the contents of the SPE. The experimental
value of 207 (hex CF) is used to indicate PPP. The multiframe indicator is
currently unused and must be zero.
The basic rate for PPP over SONET/SDH is that of STS-3c/STM-1 at 155.52
Mbps. The available information bandwidth is 149.76 Mbps, which is the
STS-3c/STM-1 SPE with section, line, and path inefficiencies removed. This is
the same upper layer mapping that is used for ATM and FDDI. Lower signal rates
must use the Virtual Tributary (VT) mechanism of SONET/SDH. This maps
existing signals up to T3/E3 rates asynchronously into the SPE or uses available
clocks for bit-synchronous and byte-synchronous mapping. Higher signal rates
should conform to the SDH STM series rather than the SONET STS series as
equipment becomes available. The STM series progresses in powers of 4
instead of 3 and employs fewer steps, which is likely to simplify multiplexing and integration.
2.9 Multi-Path Channel+ (MPC+)
The MPC support is a protocol layer that allows multiple read and write
subchannels to be treated as a single transmission group between the host and
channel-attached devices. One level of MPC support, high performance data
transfer (HPDT), also referred to as MPC+, provides more efficient transfer of
data than non-HPDT MPC connections. Multi-Path Channel+ (MPC+)
connections enable you to define a single transmission group (TG) that uses
multiple write-direction and read-direction subchannels. Because each
subchannel operates in only one direction, the half-duplex turnaround time that
occurs with other channel-to-channel connections is reduced.
If at least one read and one write path is allocated successfully, the MPC+
channel connection is activated. Additional paths (defined but not online) in an
MPC+ group can later be dynamically added to the active group using the
MVS™ VARY device ONLINE command.
For example, if there is a need for an increase in capacity to allow for extra traffic
over a channel, additional paths can be added to the active group without
disruption. Similarly, paths can be deleted from the active group when no longer
needed using the MVS VARY device OFFLINE command.
2.10 Asynchronous transfer mode (ATM)
ATM-based networks are of increasing interest for both local and wide area
applications. The ATM architecture is different from the standard LAN
architectures and, for this reason, changes are required so that traditional LAN
products will work in the ATM environment. In the case of TCP/IP, the main
change required is in the network interface to provide support for ATM.
There are several approaches already available, two of which are important to
the transport of TCP/IP traffic. We describe these in 2.10.2, “Classical IP over
ATM” on page 50 and 2.10.3, “ATM LAN emulation” on page 56. We also
compare them in 2.10.4, “Classical IP over ATM versus LAN emulation” on
page 59.
2.10.1 Address resolution (ATMARP and InATMARP)
The address resolution in an ATM logical IP subnet is done by the ATM Address
Resolution Protocol (ATMARP), based on RFC 826 (also STD 37), and the
Inverse ATM Address Resolution Protocol (InATMARP), based on RFC 2390.
ATMARP is the same protocol as the ARP protocol, with extensions needed to
support ARP in a unicast server ATM environment. InATMARP is the same
protocol as the original InARP protocol, but applied to ATM networks. Use of
these protocols differs depending on whether permanent virtual connections
(PVCs) or switched virtual connections (SVCs) are used.
Both ATMARP and InATMARP are defined in RFC 2225, a proposed standard
with a status of elective. We describe the encapsulation of ATMARP and
InATMARP requests/replies in 2.10.2, “Classical IP over ATM” on page 50.
The ARP protocol resolves a host's hardware address for a known IP address.
The InATMARP protocol resolves a host's IP address for a known hardware
address. In a switched environment, you first establish a virtual connection (VC)
of either a permanent virtual connection (PVC) or switched virtual connection
(SVC) in order to communicate with another station. Therefore, you know the
exact hardware address of the partner by administration, but the IP address is
unknown. InATMARP provides dynamic address resolution. InARP uses the
same frame format as the standard ARP, but defines two new operation codes:
- InARP request = 8
- InARP reply = 9
See “ARP packet generation” on page 120 for more details.
Chapter 2. Network interfaces
Basic InATMARP operates essentially the same as ARP, with the exception that
InATMARP does not broadcast requests. This is because the hardware address
of the destination station is already known. A requesting station simply formats a
request by inserting its source hardware and IP address and the known target
hardware address. It then zero fills the target protocol address field and sends it
directly to the target station. For every InATMARP request, the receiving station
formats a reply using the source address from the request as the target address
of the reply. Both sides update their ARP tables. The hardware type value for
ATM is 19 decimal and the EtherType field is set to 0x0806, which indicates
ARP, according to RFC 3232.
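As an illustrative sketch of these fixed fields (not the full RFC 2225 packet, which also carries variable-length ATM number and subaddress fields), the ARP header reused by InATMARP can be packed as follows; the helper name and the 20-byte NSAP hardware-length default are assumptions:

```python
import struct

# Fixed ARP header fields reused by InATMARP:
# hardware type 19 = ATM, protocol type 0x0800 = IPv4,
# opcode 8 = InARP request, 9 = InARP reply.
# The real RFC 2225 packet adds variable-length ATM address
# fields (type/length octets) that are omitted in this sketch.
HRD_ATM = 19
PRO_IPV4 = 0x0800
OP_INARP_REQUEST = 8
OP_INARP_REPLY = 9

def inarp_header(opcode: int, hln: int = 20, pln: int = 4) -> bytes:
    """Pack the 8 fixed bytes: hrd, pro, hln, pln, op (network order)."""
    return struct.pack("!HHBBH", HRD_ATM, PRO_IPV4, hln, pln, opcode)

request = inarp_header(OP_INARP_REQUEST)
assert len(request) == 8
```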
Address resolution in a PVC environment
In a PVC environment, each station uses the InATMARP protocol to determine
the IP addresses of all other connected stations. The resolution is done for those
PVCs that are configured for LLC/SNAP encapsulation. It is the responsibility of
each IP station supporting PVCs to revalidate ARP table entries as part of the
aging process.
Address resolution in an SVC environment
SVCs require support for ATMARP in the non-broadcast environment of ATM. To
meet this need, a single ATMARP server must be located within the Logical IP
Subnetwork (LIS) (see “The Logical IP Subnetwork (LIS)” on page 53). This
server has authoritative responsibility for resolving the ATMARP requests of all IP
members within the LIS. For an explanation of ATM terms, refer to 2.10.2,
“Classical IP over ATM” on page 50.
The server itself does not actively establish connections. It depends on the
clients in the LIS to initiate the ATMARP registration procedure. An individual
client connects to the ATMARP server using a point-to-point VC. The server,
upon the completion of an ATM call/connection of a new VC specifying
LLC/SNAP encapsulation, will transmit an InATMARP request to determine the
IP address of the client. The InATMARP reply from the client contains the
information necessary for the ATMARP server to build its ATMARP table cache.
This table consists of:
- IP address
- ATM address
- Time stamp
- Associated VC
This information is used to generate replies to the ATMARP requests it receives.
Note: The ATMARP server mechanism requires that each client be
administratively configured with the ATM address of the ATMARP server.
ARP table add/update algorithm
Consider the following points:
- If the ATMARP server receives a new IP address in an InATMARP reply, the
IP address is added to the ATMARP table.
- If the InATMARP IP address duplicates a table entry IP address and the
InATMARP ATM address does not match the table entry ATM address, and
there is an open VC associated with that table entry, the InATMARP
information is discarded and no modifications to the table are made.
- When the server receives an ATMARP request over a VC, where the source
IP and ATM address match the association already in the ATMARP table and
the ATM address matches that associated with the VC, the server updates
the timeout on the source ATMARP table entry. For example, if the client is
sending ATMARP requests to the server over the same VC that it used to
register its ATMARP entry, the server notes that the client is still “alive” and
updates the timeout on the client's ATMARP table entry.
- When the server receives an ARP_REQUEST over a VC, it examines the
source information. If there is no IP address associated with the VC over
which the ATMARP request was received and if the source IP address is not
associated with any other connection, the server adds this station to its
ATMARP table. This is not the normal way because, as mentioned earlier, it
is the responsibility of the client to register at the ATMARP server.
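The add/update rules above can be sketched as a small table class; the class name, entry layout, and return strings are illustrative, not taken from any real implementation:

```python
import time

# Sketch of the ATMARP server add/update rules described above.
# The table maps IP -> {"atm": ..., "ts": ..., "vc": ...}.
class AtmarpTable:
    def __init__(self):
        self.entries = {}

    def on_inatmarp_reply(self, ip, atm, vc, now=None):
        now = now if now is not None else time.time()
        entry = self.entries.get(ip)
        if entry is None:
            # New IP address: add it to the table.
            self.entries[ip] = {"atm": atm, "ts": now, "vc": vc}
            return "added"
        if entry["atm"] != atm and entry["vc"] is not None:
            # Duplicate IP, different ATM address, open VC:
            # discard the information, leave the table unchanged.
            return "discarded"
        entry.update(atm=atm, ts=now, vc=vc)
        return "updated"

    def on_arp_request(self, ip, atm, vc, now=None):
        now = now if now is not None else time.time()
        entry = self.entries.get(ip)
        if entry and entry["atm"] == atm and entry["vc"] == vc:
            # Client is still "alive": refresh the timeout only.
            entry["ts"] = now
            return "refreshed"
        if entry is None:
            # Unregistered station: add it (not the normal path).
            self.entries[ip] = {"atm": atm, "ts": now, "vc": vc}
            return "added"
        return "ignored"
```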
ATMARP table aging
ATMARP table entries are valid:
- In clients for a maximum time of 15 minutes
- In servers for a minimum time of 20 minutes
Prior to aging an ATMARP table entry, the ATMARP server generates an
InARP_REQUEST on any open VC associated with that entry and decides what
to do according to the following rules:
- If an InARP_REPLY is received, that table entry is updated and not deleted.
- If there is no open VC associated with the table entry, the entry is deleted.
Therefore, if the client does not maintain an open VC to the server, the client
must refresh its ATMARP information with the server at least once every 20
minutes. This is done by opening a VC to the server and exchanging the initial
InATMARP packets.
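The server-side aging rules can be sketched as follows, assuming a `ts`/`vc` entry layout and a hypothetical `inarp_probe` hook that sends the InARP_REQUEST and reports whether a reply came back:

```python
# Aging constants from the text: client entries are valid at most
# 15 minutes, server entries at least 20 minutes.
CLIENT_MAX_AGE = 15 * 60   # seconds
SERVER_MIN_AGE = 20 * 60

def server_age_entry(entry, now, inarp_probe):
    """Return True if the entry survives, False if it is deleted.

    `inarp_probe(vc)` is an assumed hook: it sends an InARP_REQUEST
    on the entry's open VC and returns True on an InARP_REPLY.
    """
    if now - entry["ts"] < SERVER_MIN_AGE:
        return True                 # not old enough to age out
    if entry["vc"] is None:
        return False                # no open VC: delete the entry
    if inarp_probe(entry["vc"]):
        entry["ts"] = now           # reply received: refresh, keep
        return True
    return False
```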
The client handles the table updates according to the following rules:
- When an ATMARP table entry ages, the ATMARP client invalidates this table
entry.
- If there is no open VC associated with the invalidated entry, that entry is
deleted.
- In the case of an invalidated entry and an open VC, the ATMARP client
revalidates the entry prior to transmitting any non-address resolution traffic on
that VC. There are two possibilities:
– In the case of a PVC, the client validates the entry by transmitting an
InARP_REQUEST and updating the entry on receipt of an InARP_REPLY.
– In the case of an SVC, the client validates the entry by transmitting an
ARP_REQUEST to the ATMARP server and updating the entry on receipt
of an ARP_REPLY.
- If a VC with an associated invalidated ATMARP table entry is closed, that
table entry is removed.
As mentioned earlier, every ATM IP client that uses SVCs must know its
ATMARP server's ATM address for the particular LIS. This address must be
configured at every client during customization. There is at present no
well-known ATMARP server address defined.
2.10.2 Classical IP over ATM
The definitions for implementations of classical IP over asynchronous transfer
mode (ATM) are described in RFC 2225, which is a proposed standard with a
status of elective. This RFC considers only the application of ATM as a direct
replacement for the “wires” and local LAN segments connecting IP endstations
(members) and routers operating in the classical LAN-based paradigm. Issues
raised by MAC level bridging and LAN emulation are not covered. Additionally, IP
over ATM was expanded by RFC 2492, which defines the transmission of IPv6
over ATM.
For ATM Forum's method of providing ATM migration, see 2.10.3, “ATM LAN
emulation” on page 56.
Initial deployment of ATM provides a LAN segment replacement for:
- Ethernets, token rings, or FDDI networks
- Local area backbones between existing (non-ATM) LANs
- Dedicated circuits or frame relay PVCs between IP routers
RFC 2225 also describes extensions to the ARP protocol (RFC 826) in order to
work over ATM. We discuss this separately in 2.10.1, “Address resolution
(ATMARP and InATMARP)” on page 47.
First, some ATM basics:
All information (voice, image, video, data, and so on) is
transported through the network in very short (48 data
bytes plus a 5-byte header) blocks called cells.
Information flow is along paths (called virtual channels)
set up as a series of pointers through the network. The
cell header contains an identifier that links the cell to the
correct path that it will take toward its destination.
Cells on a particular virtual channel always follow the
same path through the network and are delivered to the
destination in the same order in which they were received.
Hardware-based switching
ATM is designed such that simple hardware-based logic
elements can be employed at each node to perform the
switching. On a link of 1 Gbps, a new cell arrives and a
cell is transmitted every 0.43 microseconds. There is not a
lot of time to decide what to do with an arriving packet.
Virtual Connection (VC)
ATM provides a virtual connection switched environment.
VC setup can be done on either a permanent virtual
connection (PVC) or a dynamic switched virtual
connection (SVC) basis. SVC call management is
performed by implementations of the Q.93B protocol.
End-user interface
The only way for a higher layer protocol to communicate
across an ATM network is over the ATM Adaptation Layer
(AAL). The function of this layer is to perform the mapping
of protocol data units (PDUs) into the information field of
the ATM cell and vice versa. There are four different AAL
types defined: AAL1, AAL2, AAL3/4, and AAL5. These
AALs offer different services for higher layer protocols.
Here are the characteristics of AAL5, which is used for
TCP/IP:
- Message mode and streaming mode
- Assured delivery
- Non-assured delivery (used by TCP/IP)
- Blocking and segmentation of data
- Multipoint operation
AAL5 provides the same functions as a LAN at the
Medium Access Control (MAC) layer. The AAL type is
known by the VC endpoints through the cell setup
mechanism and is not carried in the ATM cell header. For
PVCs, the AAL type is administratively configured at the
endpoints when the connection (circuit) is set up. For
SVCs, the AAL type is communicated along the VC path
through Q.93B as part of call setup establishment and the
endpoints use the signaled information for configuration.
ATM switches generally do not care about the AAL type of
VCs. The AAL5 format specifies a packet format with a
maximum size of 64 KB - 1 byte of user data. The
primitives, which the higher layer protocol has to use in
order to interface with the AAL layer (at the AAL service
access point, or SAP), are rigorously defined. When a
high-layer protocol sends data, that data is processed first
by the adaptation layer, then by the ATM layer, and then
the physical layer takes over to send the data to the ATM
network. The cells are transported by the network and
then received on the other side first by the physical layer,
then processed by the ATM layer, and then by the
receiving AAL. When all this is complete, the information
(data) is passed to the receiving higher layer protocol.
The total function performed by the ATM network has
been the non-assured transport (it might have lost some)
of information from one side to the other. Looked at from a
traditional data processing viewpoint, all the ATM network
has done is to replace a physical link connection with
another kind of physical connection. All the higher layer
network functions must still be performed (for example,
IEEE 802.2).
An ATM Forum endpoint address is either encoded as a
20-byte OSI NSAP-based address (used for private
network addressing, three formats possible) or is an
E.164 Public UNI address (telephone number style
address used for public ATM networks).5
Broadcast, multicast There are currently no broadcast functions similar to
LANs provided. But there is a multicast function available.
The ATM term for multicast is point-to-multipoint
connection.
5 The ATM Forum is a worldwide organization, aimed at promoting ATM within the industry and the
user community. The membership includes more than 500 companies representing all sectors of
the communications and computer industries, as well as a number of government agencies,
research organizations, and users.
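The per-cell timing quoted under "Hardware-based switching" above follows from simple arithmetic (53-byte cells on a 1 Gbps link), written out here as a trivial sketch:

```python
# Each ATM cell is 48 payload bytes plus a 5-byte header. On a
# 1 Gbps link, one cell is serialized every 53 * 8 / 1e9 seconds,
# which rounds to the 0.43 microseconds cited in the text.
CELL_BYTES = 48 + 5
LINK_BPS = 1_000_000_000

cell_time_us = CELL_BYTES * 8 / LINK_BPS * 1e6
print(round(cell_time_us, 3))  # 0.424
```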
The Logical IP Subnetwork (LIS)
The term LIS was introduced to map the logical IP structure to the ATM network.
In the LIS scenario, each separate administrative entity configures its hosts and
routers within a closed logical IP subnetwork (same IP network/subnet number
and address mask). Each LIS operates and communicates independently of
other LISs on the same ATM network. Hosts that are connected to an ATM
network communicate directly to other hosts within the same LIS. This implies
that all members of a LIS are able to communicate through ATM with all other
members in the same LIS. (VC topology is fully meshed.) Communication to
hosts outside of the local LIS is provided through an IP router. This router is an
ATM endpoint attached to the ATM network that is configured as a member of
one or more LISs. This configuration might result in a number of separate LISs
operating over the same ATM network. Hosts of differing IP subnets must
communicate through an intermediate IP router, even though it might be possible
to open a direct VC between the two IP members over the ATM network.
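The LIS forwarding rule described above can be sketched as a next-hop decision; the function name, addresses, and router parameter are illustrative:

```python
import ipaddress

# LIS rule: hosts in the same logical IP subnet communicate directly
# over ATM (a direct VC); anything outside the LIS goes through the
# configured IP router, even if a direct VC would be possible.
def next_hop(dst_ip: str, lis_prefix: str, router_ip: str) -> str:
    lis = ipaddress.ip_network(lis_prefix)
    if ipaddress.ip_address(dst_ip) in lis:
        return dst_ip      # same LIS: open a direct VC to the host
    return router_ip       # different LIS: forward via the router

print(next_hop("192.168.1.20", "192.168.1.0/24", "192.168.1.1"))
```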
Multiprotocol encapsulation
If you want to use more than one type of network protocol (IP, IPX™, and so on)
concurrently over a physical network, you need a method of multiplexing the
different protocols. This can be done in the case of ATM either by VC-based
multiplexing or LLC encapsulation. If you choose VC-based multiplexing, you
have to have a VC for each different protocol between the two hosts. The LLC
encapsulation provides the multiplexing function at the LLC layer and therefore
needs only one VC. TCP/IP uses, according to RFC 2225 and 2684, the second
method, because this kind of multiplexing was already defined in RFC 1042 for
all other LAN types, such as Ethernet, token ring, and FDDI. With this definition,
IP uses ATM simply as a LAN replacement. All the other benefits ATM has to
offer, such as transportation of isochronous traffic, and so on, are not used.
There is an IETF working group with the mission of improving the IP
implementation and of interfacing with the ATM Forum in order to represent the
interests of the Internet community in future standards.
To be exact, the TCP/IP PDU is encapsulated in an IEEE 802.2 LLC header
followed by an IEEE 802.1a SubNetwork Attachment Point (SNAP) header and
carried within the payload field of an AAL5 CPCS-PDU (Common Part
Convergence Sublayer). The following figure shows the AAL5 CPCS-PDU
format (Figure 2-7).
Figure 2-7 AAL5 CPCS-PDU format
CPCS-PDU Payload The CPCS-PDU payload is shown in Figure 2-8 on
page 55.
The Pad field pads out the CPCS-PDU to fit exactly into
the ATM cells.
The CPCS-UU (User-to-User identification) field is used
to transparently transfer CPCS user-to-user information.
This field has no function for the encapsulation and can
be set to any value.
The Common Part Indicator (CPI) field aligns the
CPCS-PDU trailer with 64 bits.
The Length field indicates the length, in bytes, of the
payload field. The maximum value is 65535, which is
64 KB - 1.
The CRC field protects the entire CPCS-PDU, except the
CRC field itself.
The following figure shows the payload format for routed IP PDUs (Figure 2-8).
Figure 2-8 CPCS-PDU payload format for IP PDUs
LLC A 3-byte LLC header with the format DSAP-SSAP-Ctrl.
For IP data, it is set to 0xAA-AA-03 to indicate the
presence of a SNAP header. The Ctrl field always has the
value 0x03, specifying Unnumbered Information
Command PDU.
OUI The 3-byte Organizationally Unique Identifier (OUI)
identifies an organization that administers the meaning of
the following 2-byte Protocol Identifier (PID). To specify
an EtherType in PID, the OUI has to be set to 0x00-00-00.
PID The Protocol Identifier (PID) field specifies the protocol
type of the following PDU. For IP datagrams, the assigned
EtherType or PID is 0x08-00.
IP PDU Normal IP datagram, starting with the IP header.
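The fixed byte values above make the LLC/SNAP prefix easy to sketch; the helper name is an illustration, not a real API:

```python
# RFC 2225 LLC/SNAP prefix for a routed IPv4 PDU:
# LLC = AA-AA-03 (SNAP follows, Unnumbered Information),
# OUI = 00-00-00 (EtherType carried in PID), PID = 08-00 (IPv4).
LLC_SNAP_IPV4 = bytes([0xAA, 0xAA, 0x03, 0x00, 0x00, 0x00, 0x08, 0x00])

def encapsulate_ip(datagram: bytes) -> bytes:
    """Prefix an IP datagram with the 8-byte LLC/SNAP header."""
    return LLC_SNAP_IPV4 + datagram

frame = encapsulate_ip(b"\x45" + b"\x00" * 19)  # dummy 20-byte IP header
assert len(frame) == 28 and frame[:3] == b"\xaa\xaa\x03"
```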
The default MTU size for IP members in an ATM network is discussed in RFC
2225 and defined to be 9180 bytes. The LLC/SNAP header is 8 bytes; therefore,
the default ATM AAL5 PDU size is 9188 bytes. The possible MTU values range
between zero and 65535. You are allowed to change the MTU size, but then all
members of a LIS must be changed as well in order to have the same value.
RFC 1755 recommends that all implementations should support MTU sizes up to
and including 64 KB.
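A sketch of the AAL5 CPCS-PDU framing described earlier, assuming the 8-byte trailer layout (CPCS-UU, CPI, Length, CRC) and using zlib's CRC-32 as an approximation of the AAL5 CRC convention:

```python
import struct
import zlib

# AAL5 CPCS-PDU: payload + pad + 8-byte trailer, where the pad makes
# the whole PDU fill an exact number of 48-byte cell payloads.
CELL_PAYLOAD = 48
TRAILER_LEN = 8

def build_cpcs_pdu(payload: bytes, cpcs_uu: int = 0, cpi: int = 0) -> bytes:
    # Pad so payload + pad + trailer is a multiple of 48 bytes.
    pad_len = (-(len(payload) + TRAILER_LEN)) % CELL_PAYLOAD
    body = payload + b"\x00" * pad_len
    # Trailer: CPCS-UU (1), CPI (1), Length of the payload (2).
    body += struct.pack("!BBH", cpcs_uu, cpi, len(payload))
    # CRC over everything except the CRC field itself (approximated
    # here with zlib's CRC-32; real AAL5 bit conventions may differ).
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return body + struct.pack("!I", crc)

pdu = build_cpcs_pdu(b"x" * 100)
assert len(pdu) % CELL_PAYLOAD == 0    # fits exactly into ATM cells
```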
The address resolution in an ATM network is defined as an extension of the ARP
protocol and is described in 2.10.1, “Address resolution (ATMARP and
InATMARP)” on page 47.
There is no mapping from IP broadcast or multicast addresses to ATM broadcast
or multicast addresses available. But there are no restrictions for transmitting or
receiving IP datagrams specifying any of the four standard IP broadcast address
forms as described in RFC 1122. Members, upon receiving an IP broadcast or IP
subnet broadcast for their LIS, must process the packet as though addressed to
that station.
2.10.3 ATM LAN emulation
Another approach to provide a migration path to a native ATM network is ATM
LAN emulation. ATM LAN emulation is still under development by ATM Forum
working groups. For the IETF approach, see 2.10.2, “Classical IP over ATM” on
page 50. There is no ATM Forum implementation agreement available covering
virtual LANs over ATM, but there are some basic agreements on the different
proposals made to the ATM Forum. The following descriptions are based on the
IBM proposals.
The concept of ATM LAN emulation is to construct a system such that the
workstation application software “thinks” it is a member of a real shared medium
LAN, such as a token ring. This method maximizes the reuse of existing LAN
software and significantly reduces the cost of migration to ATM. In PC LAN
environments, for example, the LAN emulation layer can be implemented under
the NDIS/ODI-type interface. With such an implementation, all the higher layer
protocols, such as IP, IPX, NetBIOS, and SNA, can be run over ATM networks
without any change.
Refer to Figure 2-9 for the implementation of token ring and Ethernet.
Figure 2-9 Ethernet and token-ring LAN emulation
LAN emulation layer (workstation software)
Each workstation that performs the LE function needs to have software to
provide the LE service. This software is called the LAN emulation layer (LE
layer). It provides the interface to existing protocol support (such as IP, IPX,
IEEE 802.2 LLC, NetBIOS, and so on) and emulates the functions of a real,
shared medium LAN. This means that no changes are needed to existing LAN
application software to use ATM services. The LE layer interfaces to the ATM
network through a hardware ATM adapter.
The primary function of the LE layer is to transfer encapsulated LAN frames
(arriving from higher layers) to their destination either directly (over a direct VC)
or through the LE server. This is done by using AAL5 services provided by ATM.
Each LE layer has one or more LAN addresses as well as an ATM address.
A separate instance (logical copy or LE client) of the LE layer is needed in each
workstation for each different LAN or type of LAN to be supported. For example,
if both token-ring and Ethernet LAN types are to be emulated, you need two LE
layers. In fact, they will probably just be different threads within the same copy of
the same code, but they are logically separate LE layers. Use separate LE layers
also if one workstation needs to be part of two different emulated token-ring
LANs. Each separate LE layer needs a different MAC address, but can share the
same physical ATM connection (adapter).
LAN emulation server
The basic function of the LE server is to provide directory, multicast, and address
resolution services to the LE layers in the workstations. It also provides a
connectionless data transfer service to the LE layers in the workstations, if
required.
Each emulated LAN must have an LE server. It would be possible to have
multiple LE servers sharing the same hardware and code (via multithreading),
but the LE servers are logically separate entities. As for the LE layers, an
emulated token-ring LAN cannot have members that are emulating an Ethernet
LAN. Thus, an instance of an LE server is dedicated to a single type of LAN
emulation. The LE server can be physically internal to the ATM network or
provided in an external device, but logically it is always an external function that
simply uses the services provided by ATM to do its job.
Default VCs
A default VC is a connection between an LE layer in a workstation and the LE
server. These connections can be permanent or switched.
All LE control messages are carried between the LE layer and the LE server on
the default VC. Encapsulated data frames can also be sent on the default VC.
The presence of the LE server and the default VCs is necessary for the LE
function to be performed.
Direct VCs
Direct VCs are connections between LE layers in the end systems. They are
always switched and set up on demand. If the ATM network does not support
switched connections, you cannot have direct VCs, and all the data must be sent
through the LE server on default VCs. If there is no direct VC available for any
reason, data transfer must take place through the LE server. (There is no other
way.)
Direct VCs are set up on request by an LE layer. (The server cannot set them up,
because there is no third-party call setup function in ATM.) The ATM address of
a destination LE layer is provided to a requesting LE layer by the LE server.
Direct VCs stay in place until one of the partner LE layers decides to end the
connection (because there is no more data).
Initialization
During initialization, the LE layer (workstation) establishes the default VC with
the LE server. It also discovers its own ATM address, which is needed if it is to
later set up direct VCs.
Registration
In this phase, the LE layer (workstation) registers its MAC addresses with the LE
server. Other things, such as filtering requirements (optional), can be provided.
Management and resolution
This is the method used by ATM endstations to set up direct VCs with other
endstations (LE layers). This function includes mechanisms for learning the ATM
address of a target station, mapping the MAC address to an ATM address,
storing the mapping in a table, and managing the table.
For the server, this function provides the means for supporting the use of direct
VCs by endstations. This includes a mechanism for mapping the MAC address of
an end system to its ATM address, storing the information, and providing it to a
requesting endstation.
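The client side of this resolution flow can be sketched as a local cache in front of an LE server query; the class and the lookup hook are hypothetical:

```python
# Sketch of the LE client resolution path: look up the target MAC in a
# local cache; on a miss, ask the LE server (via an assumed hook) and
# store the MAC-to-ATM mapping before opening a direct VC.
class LeClient:
    def __init__(self, le_server_lookup):
        self._lookup = le_server_lookup   # MAC -> ATM address, via LE server
        self._cache = {}

    def resolve(self, mac: str) -> str:
        if mac not in self._cache:
            self._cache[mac] = self._lookup(mac)   # query the LE server
        return self._cache[mac]

calls = []
def fake_server(mac):
    calls.append(mac)
    return "atm:" + mac

client = LeClient(fake_server)
client.resolve("02:00:00:00:00:01")
client.resolve("02:00:00:00:00:01")
assert calls == ["02:00:00:00:00:01"]   # server consulted once; cache hit after
```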
This structure maintains full LAN function and can support most higher layer LAN
protocols. Reliance on the server for data transfer is minimized by using switched
VCs for the transport of most bulk data.
2.10.4 Classical IP over ATM versus LAN emulation
These two approaches to providing an easier way to migrate to ATM were
designed with different goals in mind.
Classical IP over ATM defines an encapsulation and address resolution method.
The definitions are made for IP only and not for use with other protocols. So if
you have applications requiring other protocol stacks (such as IPX or SNA), IP
over ATM will not provide a complete solution. However, if you only have TCP or
UDP-based applications, this might be the better solution, because this
specialized adaptation of the IP protocol to the ATM architecture is likely to
produce fewer inefficiencies than a more global solution. Another advantage of
this implementation is the use of some ATM-specific functions, such as large
MTU sizes.
The major goal of the ATM Forum's approach is to run layer 3 and higher
protocols unchanged over the ATM network. This means that existing protocols,
for example, TCP/IP, IPX, NetBIOS, and SNA, and their applications can use the
benefits of the fast ATM network without any changes. The mapping for all
protocols is already defined. The LAN emulation (LANE) layer provides all the
services of a classic LAN; thus, the upper layer does not know of the existence of
ATM. This is both an advantage and a disadvantage, because the knowledge of
the underlying network could be used to provide a more effective
implementation.
In the near future, both approaches will be used depending on the particular
requirements. Over time, when the mapping of applications to ATM is fully
defined and implemented, the scheme of a dedicated ATM implementation might
be used.
2.11 Multiprotocol over ATM (MPOA)
The objectives of MPOA are to:
- Provide end-to-end layer 3 internetworking connectivity across an ATM
network. This is for hosts that are attached either:
– Directly to the ATM network
– Indirectly to the ATM network on an existing LAN
- Support the distribution of the internetwork layer (for example, an IP subnet)
across traditional and ATM-attached devices. This removes the port-to-layer 3
network restriction of routers and enables the building of protocol-based
virtual LANs.
- Ensure interoperability among the distributed routing components while
allowing flexibility in implementations.
- Address how internetwork-layer protocols use the services of an ATM
network.
moment in the MPOA subworking group is entirely focused on IP.
2.11.1 Benefits of MPOA
MPOA represents the transition from LAN emulation to direct exploitation of ATM
by the internetwork-layer protocols. The advantages are:
- Protocols see ATM as more than just another link. Therefore, we are able to
exploit the facilities of ATM.
- Increases efficiency of the traditional LAN frame structure.
The MPOA solution has the following benefits over both Classical IP (RFC 2225)
and LAN emulation solutions:
- Lower latency by allowing direct connectivity between end systems that can
cut across subnet boundaries. This is achieved by minimizing the need for
multiple hops through ATM routers for communication between end systems
on different virtual LANs.
- Higher aggregate layer 3 forwarding capacity by distributing processing
functions to the edge of the network.
- Allows mapping of specific flows to specific QoS characteristics.
- Allows a layer 3 subnet to be distributed across a physical network.
2.11.2 MPOA logical components
The MPOA solution consists of a number of logical components and information
flows between those components. The logical components are of two kinds:
MPOA server
MPOA servers maintain complete knowledge of the MAC
and internetworking layer topologies for the IASGs they
serve. To accomplish this, they exchange information among
themselves and with MPOA clients.
MPOA client
MPOA clients maintain local caches of mappings (from
packet prefix to ATM information). These caches are
populated by requesting the information from the appropriate
MPOA server on an as-needed basis.
The layer 3 addresses associated with an MPOA client
represent either the layer 3 address of the client itself, or the
layer 3 addresses reachable through the client. (The client
is an edge device or router.)
An MPOA client will connect to its MPOA server to register
the client's ATM address and the layer 3 addresses
reachable by the client.
2.11.3 MPOA functional components
The mapping between the logical and physical components are split between the
following layers:
- MPOA functional group layer
- LAN emulation layer
- Physical layer
The MPOA solution will be implemented into various functional groups that
include:
- Internetwork Address Sub-Group (IASG): A range of internetwork layer
addresses (for example, an IPv4 subnet). Therefore, if a host operates two
internetwork-layer protocols, it will be a member of, at least, two IASGs.
- Edge Device Functional Group (EDFG): EDFG is the group of functions
performed by a device that provides internetworking level connections
between a traditional subnetwork and ATM.
– An EDFG implements layer 3 packet forwarding, but does not execute any
routing protocols (these are executed in the RSFG).
– Two types of EDFG are allowed, simple and smart:
• Smart EDFGs request resolution of internetwork addresses (that is,
they send a query ARP-type frame if they do not have an entry for the
destination).
• Simple EDFGs send a frame via a default class to a default
destination if no entry exists.
– A coresident proxy LEC function is required.
- ATM-Attached Host Functional Group (AHFG): AHFG is the group of
functions performed by an ATM-attached host that is participating in the
MPOA network.
A coresident proxy LEC function is optional.
Within an IASG, LAN emulation is used as a transport mechanism to either
traditional devices or LAN emulation devices, in which case access to a LEC
is required. If the AHFG will not be communicating with LANE or other
devices, a co-resident LEC is not required.
- IASG Coordination Functional Group (ICFG): ICFG is the group of functions
used to coordinate the distribution of a single IASG across multiple traditional
LAN ports on one or more EDFG or ATM device, or both. The ICFG tracks the
location of the functional components so that it is able to respond to queries
for layer 3 addresses.
- Default Forwarder Function Group (DFFG): In the absence of direct
client-to-client connectivity, the DFFG provides default forwarding for traffic
destined either within or outside the IASG.
– Provides internetwork layer multicast forwarding in an IASG; that is, the
DFFG acts as the multicast server (MCS) in an MPOA-based MARS
environment.
– Provides proxy LAN emulation function for AHFGs (that is, for AHFGs that
do not have a LANE client) to enable AHFGs to send/receive traffic with
earlier enterprise-attached systems.
- Route Server Functional Group (RSFG): RSFG performs internetworking
level functions in an MPOA network. This includes:
– Running conventional internetworking routing protocols (for example,
RIP and OSPF)
– Providing address resolution between IASGs, handling requests, and
building responses
- Remote Forwarder Functional Group (RFFG): RFFG is the group of functions
performed in association with forwarding traffic from a source to a destination,
where these can be either an IASG or an MPOA client. An RFFG is
synonymous with the default router function of a typical IPv4 subnet.
Note: One or more of these functional groups can co-reside in the same
physical entity. MPOA allows arbitrary physical locations of these groups.
2.11.4 MPOA operation
The MPOA system operates as a set of functional groups that exchange
information in order to exhibit the desired behavior. To provide an overview of the
MPOA system, the behavior of the components is described in a sequence
ordered by significant events:
Configuration
Ensures that all functional groups have the
appropriate set of administrative information.
Registration and discovery
Includes the functional groups informing each
other of their existence and of the identities of
attached devices and EDFGs informing the
ICFG of earlier devices.
Destination resolution: The action of determining the route description given
a destination internetwork layer address and possibly other information (for
example, QoS). This is the part of the MPOA system that allows it to perform
cut-through (with respect to IASG boundaries).
Data transfer: Getting internetworking layer data from one MPOA client to
another.
Intra-IASG coordination: The function that enables IASGs to be spread
across multiple physical interfaces.
Routing protocol support: Enables the MPOA system to interact with
traditional internetworks.
Spanning tree support: Enables the MPOA system to interact with existing
extended LANs.
Replication support: Provides for replication of key components for reasons
of capacity or resilience.
2.12 RFCs relevant to this chapter
The following RFCs provide detailed information about the connection protocols
and architectures presented throughout this chapter:
򐂰 RFC 826 – Ethernet Address Resolution Protocol: Or converting network
protocol addresses to 48.bit Ethernet address for transmission on Ethernet
hardware (November 1982)
򐂰 RFC 894 – Standard for the Transmission of IP Datagrams over Ethernet
Networks (April 1984)
򐂰 RFC 948 – Two Methods for the Transmission of IP Datagrams over IEEE
802.3 Networks (June 1985)
򐂰 RFC 1042 – Standard for the Transmission of IP Datagrams over IEEE 802
Networks (February 1988)
򐂰 RFC 1055 – Nonstandard for Transmission of IP Datagrams over Serial
Lines: SLIP (June 1988)
򐂰 RFC 1144 – Compressing TCP/IP Headers for Low-Speed Serial Links
(February 1990)
򐂰 RFC 1188 – Proposed Standard for the Transmission of IP Datagrams over
FDDI Networks (October 1990)
򐂰 RFC 1329 – Thoughts on Address Resolution for Dual MAC FDDI Networks
(May 1992)
򐂰 RFC 1356 – Multiprotocol Interconnect on X.25 and ISDN in the Packet Mode
(August 1992)
򐂰 RFC 1390 – Transmission of IP and ARP over FDDI Networks (January 1993)
򐂰 RFC 1618 – PPP over ISDN (May 1994)
򐂰 RFC 1661 – The Point-to-Point Protocol (PPP) (July 1994)
򐂰 RFC 1662 – PPP in HDLC-Like Framing (July 1994)
򐂰 RFC 1755 – ATM Signaling Support for IP over ATM (February 1995)
򐂰 RFC 2225 – Classical IP and ARP over ATM (April 1998)
򐂰 RFC 2390 – Inverse Address Resolution Protocol (September 1998)
򐂰 RFC 2427 – Multiprotocol Interconnect over Frame Relay (September 1998)
򐂰 RFC 2464 – Transmission of IPv6 Packets over Ethernet Networks
(December 1998)
򐂰 RFC 2467 – Transmission of IPv6 Packets over FDDI Networks (December 1998)
򐂰 RFC 2472 – IP Version 6 over PPP (December 1998)
򐂰 RFC 2492 – IPv6 over ATM Networks (January 1999)
򐂰 RFC 2590 – Transmission of IPv6 Packets over Frame Relay Networks (May 1999)
򐂰 RFC 2615 – PPP over SONET/SDH (June 1999)
򐂰 RFC 2684 – Multiprotocol Implementation over ATM Adaptation Layer 5
(September 1999)
򐂰 RFC 3232 – Assigned Numbers: RFC 1700 is Replaced by an On-line
Database (January 2002)
Chapter 3. Internetworking protocols
This chapter provides an overview of the most important and common protocols
associated with the TCP/IP internetwork layer. These include:
򐂰 Internet Protocol (IP)
򐂰 Internet Control Message Protocol (ICMP)
򐂰 Address Resolution Protocol (ARP)
򐂰 Dynamic Host Configuration Protocol (DHCP)
These protocols perform datagram addressing, routing, and delivery; dynamic
address configuration; and resolution between internetwork layer addresses
and network interface layer addresses.
3.1 Internet Protocol (IP)
IP is a standard protocol with STD number 5. The standard also includes ICMP
(see 3.2, “Internet Control Message Protocol (ICMP)” on page 109) and IGMP
(see 3.3, “Internet Group Management Protocol (IGMP)” on page 119). IP has a
status of required.
The current IP specification is in RFC 950, RFC 919, and RFC 922, together with
RFC 3260 and RFC 3168 (which update RFC 2474) and RFC 1349 (which updates RFC 791).
Refer to 3.8, “RFCs relevant to this chapter” on page 140 for further details
regarding the RFCs.
IP is the protocol that hides the underlying physical network by creating a virtual
network view. It is an unreliable, best-effort, and connectionless packet delivery
protocol. Note that best-effort means that the packets sent by IP might be lost,
arrive out of order, or even be duplicated. IP assumes higher layer protocols will
address these anomalies.
One of the reasons for using a connectionless network protocol was to minimize
the dependency on specific computing centers that used hierarchical
connection-oriented networks. The United States Department of Defense
intended to deploy a network that would remain operational even if parts of the
country were destroyed. The Internet has since proven this design sound.
3.1.1 IP addressing
IP addresses are represented by a 32-bit unsigned binary value, usually
expressed in a dotted decimal format. For example, is a valid IP
address. The numeric form is used by IP software. The mapping between the IP
address and an easier-to-read symbolic host name is done by the Domain Name
System (DNS), discussed in 12.1, “Domain Name System (DNS)” on page 426.
The IP address
IP addressing standards are described in RFC 1166. To identify a host on the
Internet, each host is assigned an address, the IP address, or in some cases, the
Internet address. When the host is attached to more than one network, it is called
multihomed and has one IP address for each network interface. The IP address
consists of a pair of numbers:
IP address = <network number><host number>
The network number portion of the IP address is administered by one of three
Regional Internet Registries (RIR):
򐂰 American Registry for Internet Numbers (ARIN): This registry is responsible
for the administration and registration of Internet Protocol (IP) numbers for
North America, South America, the Caribbean, and sub-Saharan Africa.
򐂰 Réseaux IP Européens (RIPE): This registry is responsible for the
administration and registration of Internet Protocol (IP) numbers for Europe,
the Middle East, and parts of Africa.
򐂰 Asia Pacific Network Information Centre (APNIC): This registry is responsible
for the administration and registration of Internet Protocol (IP) numbers within
the Asia Pacific region.
IP addresses are 32-bit numbers represented in a dotted decimal form (as the
decimal representation of four 8-bit values concatenated with dots). For
example, is an IP address with 128.2 being the network number and
7.9 being the host number. Next, we explain the rules used to divide an IP
address into its network and host parts.
The binary format of the IP address is:
10000000 00000010 00000111 00001001
IP addresses are used by the IP protocol to uniquely identify a host on the
Internet (or more generally, any internet). Strictly speaking, an IP address
identifies an interface that is capable of sending and receiving IP datagrams.
One system can have multiple such interfaces. However, both hosts and routers
must have at least one IP address, so this simplified definition is acceptable. IP
datagrams (the basic data packets exchanged between hosts) are transmitted by
a physical network attached to the host. Each IP datagram contains a source IP
address and a destination IP address. To send a datagram to a certain IP
destination, the target IP address must be translated or mapped to a physical
address. This might require transmissions in the network to obtain the
destination's physical network address. (For example, on LANs, the Address
Resolution Protocol, discussed in 3.4, “Address Resolution Protocol (ARP)” on
page 119, is used to translate IP addresses to physical MAC addresses.)
Class-based IP addresses
The first bits of the IP address specify how the rest of the address should be
separated into its network and host part. The terms network address and netID
are sometimes used instead of network number, but the formal term, used in
RFC 1166, is network number. Similarly, the terms host address and hostID are
sometimes used instead of host number.
There are five classes of IP addresses. They are shown in Figure 3-1.
Figure 3-1 IP: Assigned classes of IP addresses
Class A addresses
These addresses use 7 bits for the <network> and 24 bits
for the <host> portion of the IP address. This allows for
2^7-2 (126) networks, each with 2^24-2 (16777214) hosts:
a total of more than 2 billion addresses.
Class B addresses
These addresses use 14 bits for the <network> and 16
bits for the <host> portion of the IP address. This allows
for 2^14-2 (16382) networks, each with 2^16-2 (65534)
hosts: a total of more than 1 billion addresses.
Class C addresses
These addresses use 21 bits for the <network> and 8 bits
for the <host> portion of the IP address. That allows for
2^21-2 (2097150) networks, each with 2^8-2 (254) hosts:
a total of more than half a billion addresses.
Class D addresses
These addresses are reserved for multicasting (a sort of
broadcasting, but in a limited area, and only to hosts
using the same Class D address).
Class E addresses
These addresses are reserved for future or experimental
TCP/IP Tutorial and Technical Overview
A Class A address is suitable for networks with an extremely large number of
hosts. Class C addresses are suitable for networks with a small number of hosts.
This means that medium-sized networks (those with more than 254 hosts or
where there is an expectation of more than 254 hosts) must use Class B
addresses. However, the number of small- to medium-sized networks has been
growing very rapidly. It was feared that if this growth had been allowed to
continue unabated, all of the available Class B network addresses would have
been used by the mid-1990s. This was termed the IP address exhaustion
problem (refer to 3.1.5, “The IP address exhaustion problem” on page 86).
The division of an IP address into two parts also separates the responsibility for
selecting the complete IP address. The network number portion of the address is
assigned by the RIRs. The host number portion is assigned by the authority
controlling the network. As shown in the next section, the host number can be
further subdivided: This division is controlled by the authority that manages the
network. It is not controlled by the RIRs.
Reserved IP addresses
A component of an IP address with a value of all bits 0 or all bits 1 has a special
meaning:
򐂰 All bits 0: An address with all bits zero in the host number portion is
interpreted as this host (IP address with <host address>=0). All bits zero in
the network number portion is this network (IP address with <network
address>=0). When a host wants to communicate over a network, but does
not yet know the network IP address, it can send packets with <network
address>=0. Other hosts in the network interpret the address as meaning this
network. Their replies contain the fully qualified network address, which the
sender records for future use.
򐂰 All bits 1: An address with all bits one is interpreted as all networks or all
hosts. For example, means all hosts on network 128.2 (a Class B
network). This is called a directed broadcast address because it contains both
a valid <network address> and a broadcast <host address>.
򐂰 Loopback: The Class A network is defined as the loopback
network. Addresses from that network are assigned to interfaces that process
data within the local system. These loopback interfaces do not access a
physical network.
Special use IP addresses
RFC 3330 discusses special use IP addresses. We provide a brief description of
these IP addresses in Table 3-1.
Table 3-1 Special use IP addresses
Address block       Present use         “This” network         Public-data networks          Cable television networks         Reserved but subject to allocation       Reserved but subject to allocation    Link local       Reserved but subject to allocation       Reserved but subject to allocation        Test-Net      6to4 relay anycast        Network interconnect device benchmark testing  Reserved but subject to allocation           Reserved for future use
3.1.2 IP subnets
Due to the explosive growth of the Internet, the principle of assigned IP
addresses became too inflexible to allow easy changes to local network
configurations. Those changes might occur when:
򐂰 A new type of physical network is installed at a location.
򐂰 Growth of the number of hosts requires splitting the local network into two or
more separate networks.
򐂰 Growing distances require splitting a network into smaller networks, with
gateways between them.
To avoid having to request additional IP network addresses, the concept of IP
subnetting was introduced. The assignment of subnets is done locally. The entire
network still appears as one IP network to the outside world.
The host number part of the IP address is subdivided into a second network
number and a host number. This second network is termed a subnetwork or
subnet. The main network now consists of a number of subnets. The IP address
is interpreted as:
<network number><subnet number><host number>
The combination of subnet number and host number is often termed the local
address or the local portion of the IP address. Subnetting is implemented in a
way that is transparent to remote networks. A host within a network that has
subnets is aware of the subnetting structure. A host in a different network is not.
This remote host still regards the local part of the IP address as a host number.
The division of the local part of the IP address into a subnet number and host
number is chosen by the local administrator. Any bits in the local portion can be
used to form the subnet. The division is done using a 32-bit subnet mask. Bits
with a value of zero in the subnet mask indicate positions ascribed to the host
number. Bits with a value of one indicate positions ascribed to the subnet
number. The bit positions in the subnet mask belonging to the original network
number are set to ones but are not used (in some platform configurations, this
value was specified with zeros instead of ones, but either way it is not used). Like
IP addresses, subnet masks are usually written in dotted decimal form.
The special treatment of all bits zero and all bits one applies to each of the three
parts of a subnetted IP address just as it does to both parts of an IP address that
has not been subnetted (see “Reserved IP addresses” on page 71). For
example, subnetting a Class B network can use one of the following schemes:
򐂰 The first octet is the subnet number; the second octet is the host number. This
gives 2^8-2 (254) possible subnets, each having up to 2^8-2 (254) hosts. Recall
that we subtract two from the possibilities to account for the all-ones and all-zeros
cases. The subnet mask is
򐂰 The first 12 bits are used for the subnet number and the last four for the host
number. This gives 2^12-2 (4094) possible subnets but only 2^4-2 (14) hosts per
subnet. The subnet mask is
In this example, there are several other possibilities for assigning the subnet and
host portions of the address. The number of subnets and hosts and any future
requirements need to be considered before defining this structure. In the last
example, the subnetted Class B network has 16 bits to be divided between the
subnet number and the host number fields. The network administrator defines
either a larger number of subnets each with a small number of hosts, or a smaller
number of subnets each with many hosts.
When assigning the subnet part of the local address, the objective is to assign a
number of bits to the subnet number and the remainder to the local address.
Therefore, it is normal to use a contiguous block of bits at the beginning of the
local address part for the subnet number. This makes the addresses more
readable. (This is particularly true when the subnet occupies 8 or 16 bits.) With
this approach, either of the previous subnet masks is “acceptable.” Masks
whose one-bits do not form a single contiguous block are “unacceptable.” In fact,
most TCP/IP implementations do not support non-contiguous subnet masks, and
their use is universally discouraged.
Types of subnetting
There are two types of subnetting: static and variable length. Variable length
subnetting is more flexible than static. Native IP routing and RIP Version 1
support only static subnetting. However, RIP Version 2 supports variable length
subnetting (refer to Chapter 5, “Routing protocols” on page 171).
Static subnetting
Static subnetting implies that all subnets obtained from the same network use the
same subnet mask. Although this is simple to implement and easy to maintain, it
might waste address space in small networks. Consider a network of four hosts
using a subnet mask of This allocation wastes 250 IP addresses.
All hosts and routers are required to support static subnetting.
Variable length subnetting
When variable length subnetting or variable length subnet masks (VLSM) are
used, allocated subnets within the same network can use different subnet
masks. A small subnet with only a few hosts can use a mask that accommodates
this need. A subnet with many hosts requires a different subnet mask. The ability
to assign subnet masks according to the needs of the individual subnets helps
conserve network addresses. Variable length subnetting divides the network so
that each subnet contains sufficient addresses to support the required number of
hosts.
An existing subnet can be split into two parts by adding another bit to the subnet
portion of the subnet mask. Other subnets in the network are unaffected by the
change.
Mixing static and variable length subnetting
Not every IP device includes support for variable length subnetting. Initially, it
appears that the presence of a host that only supports static subnetting prevents
the use of variable length subnetting. This is not the case. Routers
interconnecting the subnets are used to hide the different masks from hosts.
Hosts continue to use basic IP routing. This offloads subnetting complexities to
dedicated routers.
Static subnetting example
Consider the Class A network shown in Figure 3-2.
Figure 3-2 IP: Class A address without subnets
Use the IP address, shown in Figure 3-3.
00001001 01000011 00100110 00000001
(a 32-bit address, in decimal notation)
Figure 3-3 IP address
The IP address is (Class A) with 9 as the <network address> and
67.38.1 as the <host address>.
The network administrator might want to choose the bits from 8 to 25 to indicate
the subnet address. In that case, the bits from 26 to 31 indicate the host
addresses. Figure 3-4 shows the subnetted address derived from the original
Class A address.
Figure 3-4 IP: Class A address with subnet mask and subnet address
A bit mask, known as the subnet mask, is used to identify which bits of the
original host address field indicate the subnet number. In the previous example,
the subnet mask is (or 11111111 11111111 11111111
11000000 in bit notation). Note that, by convention, the <network address> is
included in the mask as well.
Because of the all bits 0 and all bits 1 restrictions, this defines 2^18-2 (262142)
valid subnets. This split provides 262142 subnets, each with a maximum of
2^6-2 (62) hosts.
The value applied to the subnet number takes the value of the full octet, with
non-significant bits set to zero. For example, the hexadecimal value 01 in this
subnet mask assumes an 8-bit value of 01000000. Applying the subnet mask to the sample Class A address of provides the
following information:
00001001 01000011 00100110 00000001 = (Class A address)
11111111 11111111 11111111 11000000   (subnet mask)
===================================== logical_AND
00001001 01000011 00100110 00000000 = (subnet base address)
This leaves a host address of:
-------- -------- -------- --000001 = 1 (host address)
IP will recognize all host addresses as being on the local network for which the
logical_AND operation described earlier produces the same result. This is
important for routing IP datagrams in subnet environments (refer to 3.1.3, “IP
routing” on page 77).
The subnet number is:
-------- 01000011 00100110 00------ = 68760 (subnet number)
This subnet number is a relative number. That is, it is the 68760th subnet of
network 9 with the given subnet mask. This number bears no resemblance to the
actual IP address that this host has been assigned ( It has no
meaning in terms of IP routing.
The division of the original <host address> into <subnet><host> is chosen by the
network administrator. The values of all zeroes and all ones in the <subnet> field
are reserved.
Variable length subnetting example
Consider a corporation that has been assigned a Class C network. The
corporation has the requirement to split this address range into five separate
networks, each with the following number of hosts:
򐂰 Subnet 1: 50 hosts
򐂰 Subnet 2: 50 hosts
򐂰 Subnet 3: 50 hosts
򐂰 Subnet 4: 30 hosts
򐂰 Subnet 5: 30 hosts
This cannot be achieved with static subnetting. For this example, static
subnetting divides the network into four subnets, each with 64 hosts, or eight
subnets, each with 32 hosts. This subnet allocation does not meet the stated
requirements.
To divide the network into five subnets, multiple masks need to be defined. Using
a mask of, the network can be divided into four subnets, each
with 64 hosts. The fourth subnet can be further divided into two subnets, each
with 32 hosts, by using a mask of There will be three subnets
each with 64 hosts and two subnets each with 32 hosts. This satisfies the stated
requirements and eliminates the possibility of a high number of wasted host
addresses.
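The split can be sketched numerically, treating the Class C range as offsets 0 through 255 (a hedged illustration; the function name and offsets are assumptions, not from the book):

```python
# Carve a Class C range (offsets 0-255) into variable length subnets:
# four 64-address (/26) blocks, then split the fourth into two
# 32-address (/27) blocks, as described in the text.
def subnets(base: int, prefix: int, count: int):
    size = 2 ** (32 - prefix)
    return [base + i * size for i in range(count)]

quarters = subnets(0, 26, 4)           # four 64-address blocks
halves = subnets(quarters[3], 27, 2)   # split the fourth block in two
plan = quarters[:3] + halves           # three /26 plus two /27 subnets

print(plan)   # [0, 64, 128, 192, 224]
# Usable hosts: 62 per /26 and 30 per /27, meeting the 50/50/50/30/30 need.
```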
Determining the subnet mask
Usually, hosts will store the subnet mask in a configuration file. However,
sometimes this cannot be done, for example, as in the case of a diskless
workstation. The ICMP protocol includes two messages: address mask request
and address mask reply. These allow hosts to obtain the correct subnet mask
from a server (refer to “Address Mask Request (17) and Address Mask Reply
(18)” on page 116).
Addressing routers and multihomed hosts
Whenever a host has a physical connection to multiple networks or subnets, it is
described as being multihomed. By default, all routers are multihomed because
their purpose is to join networks or subnets. A multihomed host has different IP
addresses associated with each network adapter. Each adapter connects to a
different subnet or network.
3.1.3 IP routing
An important function of the IP layer is IP routing. This provides the basic
mechanism for routers to interconnect different physical networks. A device can
simultaneously function as both a normal host and a router.
A router of this type is referred to as a router with partial routing information. The
router only has information about four kinds of destinations:
򐂰 Hosts that are directly attached to one of the physical networks to which the
router is attached.
򐂰 Hosts or networks for which the router has been given explicit definitions.
򐂰 Hosts or networks for which the router has received an ICMP redirect
message.
򐂰 A default for all other destinations.
Additional protocols are needed to implement a full-function router. These types
of routers are essential in most networks, because they can exchange
information with other routers in the environment. We review the protocols used
by these routers in Chapter 5, “Routing protocols” on page 171.
There are two types of IP routing: direct and indirect.
Direct routing
If the destination host is attached to the same physical network as the source
host, IP datagrams can be directly exchanged. This is done by encapsulating the
IP datagram in the physical network frame. This is called direct delivery and is
referred to as direct routing.
Indirect routing
Indirect routing occurs when the destination host is not connected to a network
directly attached to the source host. The only way to reach the destination is
through one or more IP gateways. (Note that in TCP/IP terminology, the terms
gateway and router are used interchangeably. This describes a system that
performs the duties of a router.) The address of the first gateway (the first hop) is
called an indirect route in the IP routing algorithm. The address of the first
gateway is the only information needed by the source host to send a packet to
the destination host.
In some cases, there may be multiple subnets defined on the same physical
network. If the source and destination hosts connect to the same physical
network but are defined in different subnets, indirect routing is used to
communicate between the pair of devices. A router is needed to forward traffic
between subnets.
Figure 3-5 shows an example of direct and indirect routes. Here, host C has a
direct route to hosts B and D, and an indirect route to host A via gateway B.
Figure 3-5 IP: Direct and indirect routes
IP routing table
The determination of direct routes is derived from the list of local interfaces. It is
automatically composed by the IP routing process at initialization. In addition, a
list of networks and associated gateways (indirect routes) can be configured.
This list is used to facilitate IP routing. Each host keeps the set of mappings
between the following:
򐂰 Destination IP network addresses
򐂰 Routes to next gateways
This information is stored in a table called the IP routing table. Three types of
mappings are in this table:
򐂰 The direct routes describing locally attached networks
򐂰 The indirect routes describing networks reachable through one or more
gateways
򐂰 The default route that contains the (direct or indirect) route used when the
destination IP network is not found in the mappings of the previous two types
Figure 3-6 presents a sample network.
Figure 3-6 IP: Routing table scenario
The routing table of host D might contain the following (symbolic) entries
(Table 3-2).
Table 3-2 Host D sample entries
Because D is directly attached to its local network, it maintains a direct route for
this network. To reach the networks local to hosts E and B, however, it must
have an indirect route through E and B, respectively, because these networks
are not directly attached to it.
The routing table of host F might contain the following (symbolic) entries
(Table 3-3).
Table 3-3 Host F sample entries
Because every host not on its local network must be reached through host E,
host F simply maintains a default route through E.
IP routing algorithm
IP uses a unique algorithm to route datagrams, as illustrated in Figure 3-7.
Figure 3-7 IP: Routing without subnets
To differentiate between subnets, the IP routing algorithm is updated, as shown
in Figure 3-8.
Figure 3-8 IP: Routing with subnets
Some implications of this change include:
򐂰 This algorithm represents a change to the general IP algorithm. Therefore, to
be able to operate this way, the particular gateway must contain the new
algorithm. Some implementations might still use the general algorithm, and
will not function within a subnetted network, although they can still
communicate with hosts in other networks that are subnetted.
򐂰 As IP routing is used in all of the hosts (and not just the routers), all of the
hosts in the subnet must have:
– An IP routing algorithm that supports subnetting
– The same subnet mask (unless subnets are formed within the subnet)
򐂰 If the IP implementation on any of the hosts does not support subnetting, that
host will be able to communicate with any host in its own subnet but not with
any machine on another subnet within the same network. This is because the
host sees only one IP network and its routing cannot differentiate between an
IP datagram directed to a host on the local subnet and a datagram that should
be sent through a router to a different subnet.
In case one or more hosts do not support subnetting, an alternative way to
achieve the same goal exists in the form of proxy-ARP. This does not require any
changes to the IP routing algorithm for single-homed hosts. It does require
changes on routers between subnets in the network (refer to 3.4.4, “Proxy-ARP
or transparent subnetting” on page 123).
Figure 3-9 illustrates the entire IP routing algorithm.
Take the destination IP address.
Bitwise AND each local interface address with its local subnet mask.
Bitwise AND the destination IP address with each local subnet mask.
If the results match, deliver directly using the corresponding local interface.
Otherwise, if there is an indirect route for the destination network, deliver
indirectly to the corresponding router's IP address.
Otherwise, if a default route is defined, deliver indirectly to the default
router's IP address.
Otherwise, send an ICMP “network unreachable” error message.
Figure 3-9 IP: Routing algorithm (with subnets)
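The routing decision in Figure 3-9 can be rendered as a minimal Python sketch (names and data structures are illustrative; addresses and masks are 32-bit integers for brevity):

```python
# A minimal rendering of the subnetted IP routing decision:
# direct delivery, then explicit indirect routes, then the default
# route, and finally an ICMP "network unreachable" error.
def route(dest, interfaces, indirect_routes, default_router=None):
    # Direct: destination ANDed with a local mask matches a local interface.
    for if_addr, if_mask in interfaces:
        if dest & if_mask == if_addr & if_mask:
            return ("direct", if_addr)
    # Indirect: an explicit route exists for the destination network.
    for (net, net_mask), router in indirect_routes.items():
        if dest & net_mask == net:
            return ("indirect", router)
    # Default route, if configured; otherwise the datagram is unreachable.
    if default_router is not None:
        return ("default", default_router)
    return ("error", "ICMP network unreachable")

# Host 128.2.7.9/16 delivering to 128.2.1.1: same network, so direct.
print(route(0x80020101, [(0x80020709, 0xFFFF0000)], {}))
```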
3.1.4 Methods of delivery: Unicast, broadcast, multicast, and anycast
The majority of IP addresses refer to a single recipient; this is called a unicast
address. Unicast connections specify a one-to-one relationship between a single
source and a single destination. Additionally, there are three special types of IP
addresses used for addressing multiple recipients: broadcast addresses,
multicast addresses, and anycast addresses. Figure 3-10 shows their operation.
Figure 3-10 IP: Packet delivery modes
A connectionless protocol can send unicast, broadcast, multicast, or anycast
messages. A connection-oriented protocol can only use unicast addresses (a
connection must exist between a specific pair of hosts).
Broadcast addresses are never valid as a source address; they can appear only
as a destination address. The different types of broadcast addresses include:
򐂰 Limited broadcast address: This uses the address (all bits 1
in all parts of the IP address). It refers to all hosts on the local subnet. This is
recognized by every host. The hosts do not need any IP configuration
information. Routers do not forward this packet.
One exception to this rule is called BOOTP forwarding. The BOOTP protocol
uses the limited broadcast address to allow a diskless workstation to contact
a boot server. BOOTP forwarding is a configuration option available on some
routers. Without this facility, a separate BOOTP server is required on each
subnet (refer to 3.6, “Bootstrap Protocol (BOOTP)” on page 125).
򐂰 Network-directed broadcast address: This is used in an unsubnetted
environment. The network number is a valid network number and the host
number is all ones (for example, This address refers to all
hosts on the specified network. Routers should forward these broadcast
messages. This is used in ARP requests (refer to 3.4, “Address Resolution
Protocol (ARP)” on page 119) on unsubnetted networks.
򐂰 Subnet-directed broadcast address: If the network number is a valid network
number, the subnet number is a valid subnet number, and the host number is
all ones, the address refers to all hosts on the specified subnet. Because the
sender's subnet and the target subnet might have a different subnet mask,
the sender must somehow determine the subnet mask in use at the target.
The broadcast is performed by the router that delivers the datagram into the
target subnet.
򐂰 All-subnets-directed broadcast address: If the network number is a valid
network number, the network is subnetted, and the local part is all ones, the
address refers to all hosts on all subnets in the specified network. In principle,
routers can propagate broadcasts for all subnets but are not required to do
so. In practice, they do not. There are very few circumstances where such a
broadcast is desirable, and if misconfigured, it can lead to problems. Consider
a misconfigured host in a subnetted Class A network: If the device is
configured with the all-subnets-directed broadcast address as its local
broadcast address instead of the subnet-directed broadcast address, all of
the routers in the network will forward the request to all clients.
If routers do respect all-subnets-directed broadcast address, they use an
algorithm called reverse path forwarding to prevent the broadcast messages
from multiplying out of control. See RFC 922 for more details about this
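As a sketch of the broadcast forms described above, Python's standard ipaddress module can compute the directed broadcast address (host bits all ones) for a given network; the example networks 128.2.0.0/16 and 128.2.42.0/24 are hypothetical, and the limited broadcast address is simply the fixed all-ones address:

```python
import ipaddress

# The limited broadcast address is fixed: all 32 bits set.
LIMITED_BROADCAST = ipaddress.IPv4Address("255.255.255.255")

def directed_broadcast(network: str) -> ipaddress.IPv4Address:
    """Return the directed broadcast address (host bits all ones)
    for a network given in prefix notation, e.g. '128.2.0.0/16'."""
    return ipaddress.ip_network(network).broadcast_address

# Network-directed broadcast for an unsubnetted Class B network:
print(directed_broadcast("128.2.0.0/16"))     # 128.2.255.255
# Subnet-directed broadcast for one subnet of that network:
print(directed_broadcast("128.2.42.0/24"))    # 128.2.42.255
```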
Multicasting
If an IP datagram is broadcast to a subnet, it is received by every host on the
subnet. Each host processes the packet to determine if the target protocol is
active. If it is not active, the IP datagram is discarded. Multicasting avoids this by
selecting destination groups.
Each group is represented by a Class D IP address. For each multicast address,
a set of zero or more hosts listens for packets addressed to that address.
This set of hosts is called the host group. Packets sent to a multicast address are
forwarded only to the members of the corresponding host group. Multicast
enables one-to-many connections (refer to Chapter 6, “IP multicast” on
page 237).
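The Class D range that carries these group addresses can be recognized from the leading bits of the first octet; the following minimal Python sketch classifies an address and tests for multicast membership:

```python
def address_class(ip: str) -> str:
    """Classify a dotted-decimal IPv4 address by its leading bits;
    Class D (first octet 224-239, leading bits 1110) is the
    multicast range 224.0.0.0 through 239.255.255.255."""
    first = int(ip.split(".")[0])
    if first < 128:
        return "A"
    if first < 192:
        return "B"
    if first < 224:
        return "C"
    if first < 240:
        return "D"  # multicast group address
    return "E"      # reserved

def is_multicast(ip: str) -> bool:
    return address_class(ip) == "D"

print(is_multicast("224.0.0.1"))   # True: the all-hosts group
print(is_multicast("192.0.2.1"))   # False
```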
Anycasting
Sometimes, the same IP services are provided by different hosts. For example, a
user wants to download a file using FTP and the file is available on multiple FTP
servers. Hosts that implement the same service provide an anycast address to
other hosts that require the service. Connections are made to the first host in the
anycast address group to respond. This process is used to guarantee the service
is provided by the host with the best connection to the receiver.
The anycast service is included in IPv6 (refer to 9.2.2, “IPv6 addressing” on
page 339).
3.1.5 The IP address exhaustion problem
The number of networks on the Internet has been approximately doubling
annually for a number of years. However, the usage of the Class A, B, and C
networks differs greatly. Nearly all of the new networks assigned in the late
1980s were Class B, and in 1990 it became apparent that if this trend continued,
the last Class B network number would be assigned during 1994. However,
Class C networks were hardly being used.
The reason for this trend was that most potential users found a Class B network
to be large enough for their anticipated needs, because it accommodates up to
65534 hosts, while a Class C network, with a maximum of 254 hosts, severely
restricts the potential growth of even a small initial network. Furthermore, most of
the Class B networks being assigned were small ones. Relatively few networks
need as many as 65,534 host addresses, but equally few would find 254 hosts
an adequate limit. In summary, although the Class A,
Class B, and Class C divisions of the IP address are logical and easy-to-use
(because they occur on byte boundaries), with hindsight, they are not the most
practical because Class C networks are too small to be useful for most
organizations, while Class B networks are too large to be densely populated by
any but the largest organizations.
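The host counts behind this trade-off follow directly from the width of the network part of the address; a one-line Python sketch of the arithmetic:

```python
def host_capacity(network_bits: int) -> int:
    """Usable host addresses in a network whose network part is
    network_bits wide: 2^(32 - network_bits), minus the all-zeros
    and all-ones host numbers, which are reserved."""
    return 2 ** (32 - network_bits) - 2

print(host_capacity(16))  # Class B: 65534
print(host_capacity(24))  # Class C: 254
```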
In May 1996, all Class A addresses were either allocated or assigned, as well as
61.95 percent of Class B and 36.44 percent of Class C IP network addresses.
The terms assigned and allocated in this context have the following meanings:
򐂰 Assigned: The number of network numbers in use. The Class C figures are
somewhat inaccurate, because the figures do not include many Class C
networks in Europe, which were allocated to RIPE and subsequently
assigned but which are still recorded as allocated.
򐂰 Allocated: This includes all of the assigned networks and additionally, those
networks that have either been reserved by IANA (for example, the 63 Class
A networks are all reserved by IANA) or have been allocated to regional
registries by IANA and will subsequently be assigned by those registries.
Another way to look at these numbers is to examine the proportion of the
address space that has been used. For example, the Class A address space is
as big as the rest combined, and a single Class A network can theoretically have
as many hosts as 66,000 Class C networks.
Since 1990, the number of assigned Class B networks has been increasing at a
much lower rate than the total number of assigned networks and the anticipated
exhaustion of the Class B network numbers has not yet occurred. The reason for
this is that the policies on network number allocation were changed in late 1990
to preserve the existing address space, in particular to avert the exhaustion of
the Class B address space. The new policies can be summarized as follows:
򐂰 The upper half of the Class A address space (network numbers 64 to 127) is
reserved indefinitely to allow for the possibility of using it for transition to a
new numbering scheme.
򐂰 Class B networks are only assigned to organizations that can clearly
demonstrate a need for them. The same is, of course, true for Class A
networks. The requirements for a Class B network are that the requesting
organization:
– Has a subnetting plan that documents more than 32 subnets within its
organizational network
– Has more than 4096 hosts
Any requirements for a Class A network are handled on an individual case
basis.
򐂰 Organizations that do not fulfill the requirements for a Class B network are
assigned a consecutively numbered block of Class C network numbers.
򐂰 The lower half of the Class C address space (network numbers 192.0.0
through 207.255.255) is divided into eight blocks, which are allocated to
regional authorities as follows:
192.0.0 - 193.255.255    Multi-regional
194.0.0 - 195.255.255    Europe
196.0.0 - 197.255.255    Others
198.0.0 - 199.255.255    North America
200.0.0 - 201.255.255    Central and South America
202.0.0 - 203.255.255    Pacific Rim
204.0.0 - 205.255.255    Others
206.0.0 - 207.255.255    Others
Network numbers above this range have subsequently been allocated by
IANA to the regional registries and to other bodies:
208.0.0 - 209.255.255    North America (ARIN)
210.0.0 - 211.255.255    Pacific Rim (APNIC)
212.0.0 - 213.255.255    Europe (RIPE NCC)
214.0.0 - 215.255.255    US Department of Defense
216.0.0 - 216.255.255    North America (ARIN)
217.0.0 - 217.255.255    Europe (RIPE NCC)
218.0.0 - 218.255.255    Pacific Rim (APNIC)
219.0.0 - 222.255.255    Pacific Rim (APNIC)
The ranges defined as Others are for use where flexibility outside the
constraints of regional boundaries is required. The range defined as
multi-regional includes the Class C networks that were assigned before this
new scheme was adopted. The 192 networks were assigned by the InterNIC
and the 193 networks were previously allocated to RIPE in Europe.
򐂰 Where an organization has a range of Class C network numbers, the range
provided is assigned as a bit-wise contiguous range of network numbers, and
the number of networks in the range is a power of 2. That is, all IP addresses
in the range have a common prefix, and every address with that prefix is
within the range. For example, a European organization requiring 1500 IP
addresses would be assigned eight Class C network numbers (2048 IP
addresses) from the number space reserved for European networks (194.0.0
through 195.255.255) and the first of these network numbers would be
divisible by eight. A range of addresses satisfying these rules is 194.32.136
through 194.32.143, in which case the range consists of all of the IP
addresses with the 21-bit prefix 194.32.136, or B '110000100010000010001'.
The maximum number of network numbers assigned contiguously is 64,
corresponding to a prefix of 18 bits. An organization requiring more than 4096
addresses but less than 16,384 addresses can request either a Class B or a
range of Class C addresses. In general, the number of Class C networks
assigned is the minimum required to provide the necessary number of IP
addresses for the organization on the basis of a two-year outlook. However,
in some cases, an organization can request multiple networks to be treated
separately. For example, an organization with 600 hosts is normally assigned
four Class C networks. However, if those hosts were distributed across 10
LANs with between 50 and 70 hosts per LAN, such an allocation can cause
serious problems, because the organization would have to find 10 subnets
within a 10-bit local address range. This means at least some of the LANs
have a subnet mask of 255.255.255.192, which allows only 62 hosts per LAN.
The intent of the rules is not to force such an organization into complex
subnetting of small networks, so the organization should request 10 different
Class C numbers, one for each LAN.
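The sizing rule for contiguous Class C blocks (round the network count up to a power of 2 so that the range shares a common prefix) can be sketched as a small Python function; the function name and return shape are illustrative, not part of any standard:

```python
import math

def class_c_block(addresses_needed: int):
    """Return (number_of_class_c_networks, prefix_length) for a
    bit-wise contiguous Class C allocation: the count of /24
    networks is rounded up to a power of 2, and the common prefix
    shortens by one bit for each doubling."""
    networks = math.ceil(addresses_needed / 254)
    networks = 1 << (networks - 1).bit_length()  # round up to a power of 2
    prefix = 24 - (networks.bit_length() - 1)
    return networks, prefix

# The European organization requiring 1500 addresses from the text:
print(class_c_block(1500))   # (8, 21) -> eight Class C networks, 21-bit prefix
```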
The current rules are in RFC 2050, which updates RFC 1466. The reasons for
the rules for the allocation of Class C network numbers will become apparent in
the following sections. The use of Class C network numbers in this way has
averted the exhaustion of the Class B address space, but it is not a permanent
solution to the overall address space constraints that are fundamental to IP. We
discuss a long-term solution in Chapter 9, “IP version 6” on page 327.
3.1.6 Intranets: Private IP addresses
Another approach to conserve the IP address space is described in RFC 1918.
This RFC relaxes the rule that IP addresses must be globally unique. It reserves
part of the global address space for use in networks that do not require
connectivity to the Internet. Typically these networks are administered by a
single organization. Three ranges of addresses have been reserved for this
purpose:
򐂰 10.0.0.0: A single Class A network
򐂰 172.16.0.0 through 172.31.0.0: 16 contiguous Class B networks
򐂰 192.168.0.0 through 192.168.255.0: 256 contiguous Class C networks
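These three RFC 1918 ranges can be checked with Python's standard ipaddress module; a minimal sketch:

```python
import ipaddress

# The three RFC 1918 ranges reserved for private internets.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),      # one Class A network
    ipaddress.ip_network("172.16.0.0/12"),   # 16 contiguous Class B networks
    ipaddress.ip_network("192.168.0.0/16"),  # 256 contiguous Class C networks
]

def is_rfc1918(ip: str) -> bool:
    """True if ip falls in one of the reserved private ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_RANGES)

print(is_rfc1918("192.168.1.10"))  # True
print(is_rfc1918("9.1.2.3"))       # False
```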
Any organization can use any address in these ranges. However, because these
addresses are not globally unique, they are not defined to any external routers.
Routers in networks not using private addresses, particularly those operated by
Internet service providers, are expected to quietly discard all routing information
regarding these addresses. Routers in an organization using private addresses
are expected to limit all references to private addresses to internal links. They
should neither externally advertise routes to private addresses nor forward IP
datagrams containing private addresses to external routers.
Hosts having only a private IP address do not have direct IP layer connectivity to
the Internet. All connectivity to external Internet hosts must be provided with
application gateways (refer to “Application-level gateway (proxy)” on page 798),
SOCKS (refer to 22.5, “SOCKS” on page 846), or Network Address Translation
(NAT), which is discussed in the next section.
3.1.7 Network Address Translation (NAT)
This section explains Traditional Network Address Translation (NAT), Basic
NAT, and Network Address Port Translation (NAPT). NAT is also known as IP
masquerading. It provides a mapping between internal IP addresses and
officially assigned external addresses.
Originally, NAT was suggested as a short-term solution to the IP address
exhaustion problem. Also, many organizations have, in the past, used locally
assigned IP addresses, not expecting to require Internet connectivity.
There are two variations of traditional NAT, Basic NAT and NAPT. Traditional
NAT is defined in RFC 3022 and discussed in RFC 2663. The following sections
provide a brief discussion of Traditional NAT, Basic NAT, and NAPT based on
RFC 3022.
Traditional NAT
The idea of Traditional NAT (hereafter referred to as NAT) is based on the fact
that only a small number of the hosts in a private network are communicating
outside of that network. If each host is assigned an IP address from the official IP
address pool only when they need to communicate, only a small number of
official addresses are required.
NAT might be a solution for networks that have private address ranges or
unofficial addresses and want to communicate with hosts on the Internet. When
a proxy server, SOCKS server, or firewall is not available, or does not meet
specific requirements, NAT might be used to manage the traffic between the
internal and external network without advertising the internal host addresses.
Basic NAT
Consider an internal network based on the private IP address space whose
users want to use an application protocol for which there is no application
gateway. The only option is to establish IP-level connectivity between hosts in the
internal network and hosts on the Internet. Because the routers in the Internet
would not know how to route IP packets back to a private IP address, there is no
point in sending IP packets with private IP addresses as source IP addresses
through a router into the Internet.
As shown in Figure 3-11, Basic NAT takes the IP address of an outgoing packet
and dynamically translates it to an officially assigned global address. For
incoming packets, it translates the assigned address to an internal address.
Figure 3-11 Basic Network Address Translation (NAT)
From the point of view of two hosts that exchange IP packets with each other,
one in the
internal network and one in the external network, the NAT itself is transparent
(see Figure 3-12).
Figure 3-12 NAT seen from the external network
Basic NAT translation mechanism
For each outgoing IP packet, the source address is checked by the NAT
configuration rules. If a rule matches the source address, the address is
translated to a global address from the address pool. The predefined address
pool contains the addresses that NAT can use for translation. For each incoming
packet, the destination address is checked to determine whether it is an address
used by NAT. When this is true,
the address is translated to the original internal address. Figure 3-13 shows the
Basic NAT configuration.
Figure 3-13 Basic NAT configuration
When Basic NAT translates an address for an IP packet, the checksum is also
adjusted. For FTP packets, the task is even more difficult, because the packets
can contain addresses in the data of the packet. For example, the FTP PORT
command contains an IP address in ASCII. These addresses should also be
translated correctly; checksum updates and TCP sequence and
acknowledgement updates should also be made accordingly.
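The checksum adjustment mentioned above can be sketched in Python. This is a full recomputation of the standard Internet checksum (RFC 791 header layout) rather than the incremental update a real NAT implementation might use, and the sample addresses in the usage below are hypothetical:

```python
import struct

def ip_checksum(header: bytes) -> int:
    """Standard Internet checksum: one's complement of the one's
    complement sum of the header taken as 16-bit words; the
    checksum field must be zeroed before summing."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rewrite_source(header: bytearray, new_src: bytes) -> None:
    """Replace the source address (bytes 12-15) and refresh the
    header checksum (bytes 10-11), as a NAT must do per packet."""
    header[12:16] = new_src
    header[10:12] = b"\x00\x00"              # zero before recomputing
    struct.pack_into("!H", header, 10, ip_checksum(bytes(header)))

# A well-known sample header; after rewriting, a valid header
# checksums to zero when the stored checksum is included in the sum.
hdr = bytearray(bytes.fromhex("4500003c1c4640004006b1e6ac100a63ac100a0c"))
rewrite_source(hdr, bytes([10, 0, 0, 1]))
print(hex(ip_checksum(bytes(hdr))))          # 0x0
```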
In order to make the routing tables work, the IP network design needs to choose
addresses as though connecting two or more IP networks or subnets through a
router. The NAT IP addresses need to come from separate networks or subnets,
and the addresses need to be unambiguous with respect to other networks or
subnets in the non-secure network. If the external network is the Internet, the
NAT addresses need to come from a public network or subnet; in other words,
the NAT addresses need to be assigned by IANA.
The assigned addresses need to be reserved in a pool in order to use them when
needed. If connections are established from the internal network, NAT can just
pick the next free public address in the NAT pool and assign that to the
requesting internal host. The NAT service keeps track of which internal IP
addresses are mapped to which external IP addresses at any given point in time,
so it will be able to map a response it receives from the external network into the
corresponding secure IP address.
When the NAT service assigns IP addresses on a demand basis, it needs to
know when to return the external IP address to the pool of available IP
addresses. There is no connection setup or tear-down at the IP level, so there is
nothing in the IP protocol itself that the NAT service can use to determine when
an association between an internal IP address and a NAT external IP address is
no longer needed. Because TCP is a connection-oriented protocol, it is possible
to obtain connection status information from the TCP header (whether the
connection has ended or not), while UDP does not include such information.
Therefore, configure a timeout value that instructs NAT how long to keep an
association in an idle state before returning the external IP address to the free
NAT pool. Generally, the default value for this parameter is 15 minutes.
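The demand-based pool and idle timeout described above can be sketched as a toy Python class; this is not how any particular NAT product manages its pool, the a.b.2.x addresses echo Figure 3-11's notation, and the 900-second default mirrors the 15-minute figure in the text:

```python
import time

class BasicNat:
    """Toy sketch of a Basic NAT address pool: internal addresses
    are bound to free public addresses on demand and released
    after an idle timeout (default here: 900 s = 15 minutes)."""

    def __init__(self, pool, timeout=900.0):
        self.free = list(pool)      # unused public addresses
        self.bindings = {}          # internal -> [public, last_used]
        self.timeout = timeout

    def translate(self, internal, now=None):
        now = time.monotonic() if now is None else now
        self._expire(now)
        if internal not in self.bindings:
            public = self.free.pop(0)   # IndexError if the pool is empty
            self.bindings[internal] = [public, now]
        self.bindings[internal][1] = now
        return self.bindings[internal][0]

    def _expire(self, now):
        """Return addresses of idle bindings to the free pool."""
        for internal, (public, last) in list(self.bindings.items()):
            if now - last > self.timeout:
                del self.bindings[internal]
                self.free.append(public)

nat = BasicNat(["a.b.2.1", "a.b.2.2"])
print(nat.translate("10.0.0.1"))   # a.b.2.1
```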
Network administrators also need to instruct NAT whether all the internal hosts
are allowed to use NAT or not. This can be done by using corresponding
configuration commands. If hosts in the external network need to initiate
connections to hosts in the internal network, NAT needs to be configured in
advance as to which external NAT address matches which internal IP address.
Thus, a static mapping should be defined to allow connections from outside
networks to a specific host in the internal network. Note that the external NAT
addresses as statically mapped to internal IP addresses should not overlap with
the addresses specified as belonging to the pool of external addresses that the
NAT service can use on a demand basis.
The external name server can, for example, have an entry for a mail gateway
that runs on a computer in the internal network. The external name server
resolves the public host name of the internal mail gateway to the statically
mapped IP address (the external address), and the remote mail server sends a
connection request to this IP address. When that request comes to the NAT
service on the external interface, the NAT service looks into its mapping rules to
see if it has a static mapping between the specified external public IP address
and an internal IP address. If so, it translates the IP address and forwards the IP
packet into the internal network to the mail gateway.
Network Address Port Translation (NAPT)
The difference between Basic NAT and NAPT is that Basic NAT is limited to only
translating IP addresses, while NAPT is extended to include IP address and
transport identifier (such as TCP/UDP port or ICMP query ID).
As shown in Figure 3-14, Network Address Port Translation is able to translate
many network addresses and their transport identifiers into a single network
address with many transport identifiers, or more specifically, ports.
Figure 3-14 Network Address Port Translation
NAPT maps private addresses to a single globally unique address. Therefore,
the binding is from the private address and private port to the assigned address
and assigned port. NAPT permits multiple nodes in a local network to
simultaneously access remote networks using the single IP address assigned to
their router.
In NAPT, modifications to the IP header are similar to those of Basic NAT.
However, for TCP/UDP sessions, modifications must be extended to include
translation of the source port for outbound packets and destination port for
inbound packets in the TCP/UDP header. In addition to TCP/UDP sessions,
ICMP messages, with the exception of the REDIRECT message type, can also
be monitored by the NAPT service running on the router. ICMP query type
packets are translated similarly to TCP/UDP packets in that the identifier field in
the ICMP message header is uniquely mapped to a query identifier of the
registered IP address.
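The port-level binding that distinguishes NAPT from Basic NAT can be sketched in a few lines of Python; the class, the a.b.65.1 address, and the base port of 8000 (echoing Figure 3-14's notation) are illustrative assumptions:

```python
class Napt:
    """Toy sketch of NAPT: many (internal address, port) pairs are
    multiplexed onto one public address, distinguished by assigned
    ports starting at a hypothetical base of 8000."""

    def __init__(self, public_ip, base_port=8000):
        self.public_ip = public_ip
        self.next_port = base_port
        self.out = {}   # (internal_ip, src_port) -> assigned port
        self.back = {}  # assigned port -> (internal_ip, src_port)

    def outbound(self, src_ip, src_port):
        """Translate an outbound packet's source (address, port)."""
        key = (src_ip, src_port)
        if key not in self.out:
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.out[key]

    def inbound(self, dst_port):
        """Map an inbound destination port back to the session."""
        return self.back[dst_port]   # KeyError for unknown sessions

napt = Napt("a.b.65.1")
print(napt.outbound("10.0.0.1", 1234))   # ('a.b.65.1', 8000)
print(napt.outbound("10.0.0.2", 1234))   # ('a.b.65.1', 8001)
```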
NAT limitations
The NAT limitations are mentioned in RFC 3022 and RFC 2663. We discuss
some of the limitations here.
NAT works fine for IP addresses in the IP header. Some application protocols
exchange IP address information in the application data part of an IP packet, and
NAT will generally not be able to handle translation of IP addresses in the
application protocol. Currently, most of the implementations handle the FTP
protocol. It should be noted that implementation of NAT for specific applications
that have IP information in the application data is more sophisticated than the
standard NAT implementations.
NAT is compute intensive, even with the assistance of a sophisticated checksum
adjustment algorithm, because each data packet is subject to NAT lookup and
modification.
It is mandatory that all requests and responses pertaining to a session be routed
through the same router that is running the NAT service.
Translation of outbound TCP/UDP fragments (that is, those originating from
private hosts) in a NAPT setup will not work (refer to “Fragmentation” on
page 104). This is because only the first fragment contains the TCP/UDP header
that is necessary to associate the packet to a session for translation purposes.
Subsequent fragments do not contain TCP/UDP port information, but simply
carry the same fragmentation identifier specified in the first fragment. When the
target host receives the two unrelated datagrams, carrying the same
fragmentation ID and from the same assigned host address, it is unable to
determine to which of the two sessions the datagrams belong. Consequently,
both sessions will be corrupted.
NAT changes some of the address information in an IP packet. This becomes an
issue when IPSec is used. Refer to 22.4, “IP Security Architecture (IPSec)” on
page 809 and 22.10, “Virtual private networks (VPNs) overview” on page 861.
When end-to-end IPSec authentication is used, a packet whose address has
been changed will always fail its integrity check under the Authentication Header
protocol, because any change to any bit in the datagram will invalidate the
integrity check value that was generated by the source. Because IPSec protocols
offer some solutions to the addressing issues that were previously handled by
NAT, there is no need for NAT when all hosts that compose a given virtual
private network use globally unique (public) IP addresses. Address hiding can be
achieved by the IPSec tunnel mode. If a company uses private addresses within
its intranet, the IPSec tunnel mode can keep them from ever appearing in
cleartext in the public Internet, which eliminates the need for NAT.
3.1.8 Classless Inter-Domain Routing (CIDR)
Standard IP routing understands only Class A, B, and C network addresses.
Within each of these networks, subnetting can be used to provide better
granularity. However, there is no way to specify that multiple Class C networks
are related. The result of this is termed the routing table explosion problem: A
Class B network of 3000 hosts requires one routing table entry at each backbone
router. The same environment, if addressed as a range of Class C networks,
requires 16 entries.
The solution to this problem is called Classless Inter-Domain Routing (CIDR).
CIDR is described in RFCs 1518 to 1520. CIDR does not route according to the
class of the network number (thus the term classless). It is based solely on the
high order bits of the IP address. These bits are called the IP prefix.
Each CIDR routing table entry contains a 32-bit IP address and a 32-bit network
mask, which together give the length and value of the IP prefix. This is
represented as the tuple <IP_address network_mask>. For example, to address
a block of eight Class C addresses with one single routing table entry, the
following representation suffices: <>. This refers,
from a backbone point of view, to the Class C network range from
to as one single network. This is illustrated in Figure 3-15.
11000000 00100000 10001000 00000000 = 192.32.136.0   (Class C address)
11111111 11111111 11111000 00000000 = 255.255.248.0  (network mask)
===================================
11000000 00100000 10001             = 192.32.136     (IP prefix)

11000000 00100000 10001111 00000000 = 192.32.143.0   (Class C address)
11111111 11111111 11111000 00000000 = 255.255.248.0  (network mask)
===================================
11000000 00100000 10001             = 192.32.136     (same IP prefix)

Figure 3-15 Classless Inter-Domain Routing: IP supernetting example
This process of combining multiple networks into a single entry is referred to as
supernetting. Routing is based on network masks that are shorter than the
natural network mask of an IP address. This contrasts with subnetting (see 3.1.2,
“IP subnets” on page 72) where the subnet masks are longer than the natural
network mask.
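The supernetting example of Figure 3-15 can be reproduced with Python's standard ipaddress module, which collapses the eight former Class C networks into the single /21 route:

```python
import ipaddress

# The eight Class C networks 192.32.136.0 through 192.32.143.0
# share a 21-bit prefix and collapse into one routing table entry.
class_c = [ipaddress.ip_network(f"192.32.{n}.0/24") for n in range(136, 144)]
supernet = list(ipaddress.collapse_addresses(class_c))
print(supernet)              # [IPv4Network('192.32.136.0/21')]
print(supernet[0].netmask)   # 255.255.248.0
```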
The current Internet address allocation policies and the assumptions on which
those policies were based are described in RFC 1518. They can be summarized
as follows:
򐂰 IP address assignment reflects the physical topology of the network and not
the organizational topology. Wherever organizational and administrative
boundaries do not match the network topology, they should not be used for
the assignment of IP addresses.
򐂰 In general, network topology will closely follow continental and national
boundaries. Therefore, IP addresses should be assigned on this basis.
򐂰 There will be a relatively small set of networks that carry a large amount of
traffic between routing domains. These networks will be interconnected in a
non-hierarchical way that crosses national boundaries. These networks are
referred to as transit routing domains (TRDs). Each TRD will have a unique IP
prefix. TRDs will not be organized in a hierarchical way when there is no
appropriate hierarchy. However, whenever a TRD is wholly within a
continental boundary, its IP prefix should be an extension of the continental IP
prefix.
򐂰 There will be many organizations that have attachments to other
organizations that are for the private use of those two organizations. The
attachments do not carry traffic intended for other domains (transit traffic).
Such private connections do not have a significant effect on the routing
topology and can be ignored.
򐂰 The great majority of routing domains will be single-homed. That is, they will
be attached to a single TRD. They should be assigned addresses that begin
with that TRD's IP prefix. All of the addresses for all single-homed domains
attached to a TRD can therefore be aggregated into a single routing table
entry for all domains outside that TRD.
򐂰 There are a number of address assignment schemes that can be used for
multihomed domains. These include:
– The use of a single IP prefix for the domain. External routers must have an
entry for the organization that lies partly or wholly outside the normal
hierarchy. Where a domain is multihomed, but all of the attached TRDs
themselves are topologically nearby, it is appropriate for the domain's IP
prefix to include those bits common to all of the attached TRDs. For
example, if all of the TRDs were wholly within the United States, an IP
prefix implying an exclusively North American domain is appropriate.
– The use of one IP prefix for each attached TRD with hosts in the domain
having IP addresses containing the IP prefix of the most appropriate TRD.
The organization appears to be a set of routing domains.
– Assigning an IP prefix from one of the attached TRDs. This TRD becomes
a default TRD for the domain but other domains can explicitly route by one
of the alternative TRDs.
– The use of IP prefixes to refer to sets of multihomed domains having the
TRD attachments. For example, there can be an IP prefix to refer to
single-homed domains attached to network A, one to refer to
single-homed domains attached to network B, and one to refer to
dual-homed domains attached to networks A and B.
Each of these has various advantages, disadvantages, and side effects. For
example, the first approach tends to result in inbound traffic entering the
target domain closer to the sending host than the second approach.
Therefore, a larger proportion of the network costs are incurred by the
receiving organization.
Because multihomed domains vary greatly in character, none of these
schemes is suitable for every domain. There is no single policy that is best.
RFC 1518 does not specify any rules for choosing between them.
CIDR implementation
The implementation of CIDR in the Internet is primarily based on Border
Gateway Protocol Version 4 (see 5.9, “Border Gateway Protocol (BGP)” on
page 215). The implementation strategy, described in RFC 1520, involves a
staged process through the routing hierarchy beginning with backbone routers.
Network service providers are divided into four types:
򐂰 Type 1: Those providers that cannot employ any default inter-domain routing.
򐂰 Type 2: Those providers that use default inter-domain routing but require
explicit routes for a substantial proportion of the assigned IP network
numbers.
򐂰 Type 3: Those providers that use default inter-domain routing and
supplement it with a small number of explicit routes.
򐂰 Type 4: Those providers that perform inter-domain routing using only default
routes.
The CIDR implementation began with the Type 1 network providers, then the
Type 2, and finally the Type 3 providers. CIDR has already been widely deployed
in the backbone and more than 190,000 class-based routes have been replaced
by approximately 92,000 CIDR-based routes (through unique announced
prefixes).
3.1.9 IP datagram
The unit of transfer in an IP network is called an IP datagram. It consists of an IP
header and data relevant to higher-level protocols. See Figure 3-16 for details.
Figure 3-16 IP: Format of a base IP datagram (the IP datagram is carried as
data within the physical network's frame, behind the physical network header)
IP can provide fragmentation and reassembly of datagrams. The maximum
length of an IP datagram is 65,535 octets. All IP hosts must support 576 octets
datagrams without fragmentation.
Fragments of a datagram each have a header. The header is essentially copied
from the original datagram. A fragment is treated as a normal IP datagram while
being transported to its destination. However, if one of the fragments gets lost,
the complete datagram is considered lost. Because IP does not provide any
acknowledgment mechanism, the remaining fragments are discarded by the
destination host.
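As a sketch of how the fragmentation fields and the rest of the fixed 20-octet header described in the next subsection sit in the datagram, the following Python function unpacks them with the standard struct module; the sample header bytes in the test are a commonly cited valid example, and the dictionary keys are our own naming:

```python
import struct

def parse_ipv4_header(data: bytes) -> dict:
    """Unpack the fixed 20-octet IPv4 header (RFC 791 layout).
    Options, when present (HLEN > 5), follow the fixed part."""
    vers_hlen, tos, total_len, ident, flags_frag, ttl, proto, cksum = \
        struct.unpack("!BBHHHBBH", data[:12])
    src, dst = data[12:16], data[16:20]
    return {
        "version": vers_hlen >> 4,
        "hlen_words": vers_hlen & 0x0F,          # header length in 32-bit words
        "service_type": tos,
        "total_length": total_len,
        "identification": ident,
        "df": (flags_frag >> 14) & 1,            # Do not Fragment flag
        "mf": (flags_frag >> 13) & 1,            # More Fragments flag
        "fragment_offset": flags_frag & 0x1FFF,  # in 64-bit units
        "ttl": ttl,
        "protocol": proto,                       # 6 = TCP, 17 = UDP
        "header_checksum": cksum,
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }
```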
IP datagram format
The IP datagram header has a minimum length of 20 octets, as illustrated in
Figure 3-17.
Figure 3-17 IP: Format of an IP datagram header
򐂰 VERS: The field contains the IP protocol version. The current version is 4.
Version 5 is an experimental version. Version 6 is the version for IPv6 (see
9.2, “The IPv6 header format” on page 330).
򐂰 HLEN: The length of the IP header counted in 32-bit quantities. This does not
include the data field.
򐂰 Service Type: The service type is an indication of the quality of service
requested for this IP datagram. This field contains the information illustrated
in Figure 3-18.
Figure 3-18 IP: Service type
– Precedence: This field specifies the nature and priority of the datagram:
• 000: Routine
• 001: Priority
• 010: Immediate
• 011: Flash
• 100: Flash override
• 101: Critical
• 110: Internetwork control
• 111: Network control
– TOS: Specifies the type of service value:
• 1000: Minimize delay
• 0100: Maximize throughput
• 0010: Maximize reliability
• 0001: Minimize monetary cost
• 0000: Normal service
A detailed description of the type of service is in the RFC 1349 (refer to
8.1, “Why QoS?” on page 288).
– MBZ: Reserved for future use.
򐂰 Total Length: The total length of the datagram, header and data.
򐂰 Identification: A unique number assigned by the sender to aid in reassembling
a fragmented datagram. Each fragment of a datagram has the same
identification number.
򐂰 Flags: This field contains control flags illustrated in Figure 3-19.
Figure 3-19 IP: Flags
– 0: Reserved, must be zero.
– DF (Do not Fragment): 0 means allow fragmentation; 1 means do not
allow fragmentation.
– MF (More Fragments): 0 means that this is the last fragment of the
datagram; 1 means that additional fragments will follow.
򐂰 Fragment Offset: This is used to aid the reassembly of the full datagram. The
value in this field contains the number of 64-bit segments (header bytes are
not counted) contained in earlier fragments. If this is the first (or only)
fragment, this field contains a value of zero.
򐂰 Time to Live: This field specifies the time (in seconds) the datagram is
allowed to travel. Theoretically, each router processing this datagram is
supposed to subtract its processing time from this field. In practice, a router
processes the datagram in less than 1 second. Therefore, the router
subtracts one from the value in this field. The TTL becomes a hop-count
metric rather than a time metric. When the value reaches zero, it is assumed
that this datagram has been traveling in a closed loop and is discarded. The
initial value should be set by the higher-level protocol that creates the
datagram.
򐂰 Protocol Number: This field indicates the higher-level protocol to which IP
should deliver the data in this datagram. These include:
0: Reserved
1: Internet Control Message Protocol (ICMP)
2: Internet Group Management Protocol (IGMP)
3: Gateway-to-Gateway Protocol (GGP)
4: IP (IP encapsulation)
5: Stream
6: Transmission Control Protocol (TCP)
8: Exterior Gateway Protocol (EGP)
9: Interior Gateway Protocol (IGP)
17: User Datagram Protocol (UDP)
41: Simple Internet Protocol (SIP)
50: SIPP Encap Security Payload (ESP)
51: SIPP Authentication Header (AH)
89: Open Shortest Path First (OSPF) IGP
The complete list is in STD 2 – Assigned Internet Numbers.
򐂰 Header Checksum: This field is a checksum for the information contained in
the header. If the header checksum does not match the contents, the
datagram is discarded.
򐂰 Source IP Address: The 32-bit IP address of the host sending this datagram.
򐂰 Destination IP Address: The 32-bit IP address of the destination host for this
datagram.
򐂰 Options: An IP implementation is not required to be capable of generating
options in a datagram. However, all IP implementations are required to be
able to process datagrams containing options. The Options field is variable in
length (there can be zero or more options). There are two option formats. The
format for each is dependent on the value of the option number found in the
first octet:
– A type octet alone is illustrated in Figure 3-20.
Figure 3-20 IP: A type byte
– A type octet, a length octet, and one or more option data octets, as
illustrated in Figure 3-21.
Figure 3-21 IP: A type byte, a length byte, and one or more option data bytes
The type byte has the same structure in both cases, as illustrated in
Figure 3-22.
Figure 3-22 IP: The type byte structure
– fc (Flag copy): This field indicates whether (1) or not (0) the option field is
copied when the datagram is fragmented.
– class: The option class is a 2-bit unsigned integer:
0: Control
1: Reserved
2: Debugging and measurement
3: Reserved
– option number: The option number is a 5-bit unsigned integer:
0: End of option list. It has a class of 0, the fc bit is set to zero, and it
has no length byte or data. That is, the option list is terminated by a
X'00' byte. It is only required if the IP header length (which is a multiple
of 4 bytes) does not match the actual length of the options.
1: No operation. It has a class of 0, the fc bit is not set, and there is no
length byte or data. That is, a X'01' byte is a NOP. It can be used to
align fields in the datagram.
2: Security. It has a class of 0, the fc bit is set, and there is a length
byte with a value of 11 and 8 bytes of data. It is used for security
information needed by U.S. Department of Defense requirements.
3: Loose source routing. It has a class of 0, the fc bit is set, and there is
a variable length data field. We discuss this option in more detail later.
4: Internet time stamp. It has a class of 2, the fc bit is not set, and there
is a variable length data field. The total length can be up to 40 bytes.
We discuss this option in more detail later.
7: Record route. It has a class of 0, the fc bit is not set, and there is a
variable length data field. We discuss this option in more detail later.
8: Stream ID. It has a class of 0, the fc bit is set, and there is a length
byte with a value of 4 and one data byte. It is used with the SATNET
stream protocol.
9: Strict source routing. It has a class of 0, the fc bit is set, and there is
a variable length data field. We discuss this option in more detail later.
– length: This field counts the length (in octets) of the option, including the
type and length fields.
– option data: This field contains data relevant to the specific option.
򐂰 Padding: If an option is used, the datagram is padded with all-zero octets up
to the next 32-bit boundary.
򐂰 Data: The data contained in the datagram. It is passed to the higher-level
protocol specified in the protocol field.
When an IP datagram travels from one host to another, it can pass through
different physical networks. Each physical network has a maximum frame size.
This is called the maximum transmission unit (MTU). It limits the length of a
datagram that can be placed in one physical frame.
IP implements a process to fragment datagrams exceeding the MTU. The
process creates a set of datagrams within the maximum size. The receiving host
reassembles the original datagram. IP requires that each link support a minimum
MTU of 68 octets. This is the sum of the maximum IP header length (60 octets)
and the minimum possible length of data in a non-final fragment (8 octets). If any
network provides a lower value than this, fragmentation and reassembly must be
implemented in the network interface layer. This must be transparent to IP. IP
implementations are not required to handle unfragmented datagrams larger than
576 bytes. In practice, most implementations will accommodate larger values.
An unfragmented datagram has an all-zero fragmentation information field. That
is, the more fragments flag bit is zero and the fragment offset is zero. The
following steps fragment the datagram:
1. The DF flag bit is checked to see if fragmentation is allowed. If the bit is set,
the datagram will be discarded and an ICMP error returned to the originator.
2. Based on the MTU value, the data field is split into two or more parts. All
newly created data portions must have a length that is a multiple of 8 octets,
with the exception of the last data portion.
3. Each data portion is placed in an IP datagram. The headers of these
datagrams are minor modifications of the original:
– The more fragments flag bit is set in all fragments except the last.
– The fragment offset field in each is set to the location this data portion
occupied in the original datagram, relative to the beginning of the original
unfragmented datagram. The offset is measured in 8-octet units.
– If options were included in the original datagram, the high order bit of the
option type byte determines if this information is copied to all fragment
datagrams or only the first datagram. For example, source route options
are copied in all fragments.
– The header length field of the new datagram is set.
– The total length field of the new datagram is set.
– The header checksum field is re-calculated.
4. Each of these fragmented datagrams is now forwarded as a normal IP
datagram. IP handles each fragment independently. The fragments can
traverse different routers to the intended destination. They can be subject to
further fragmentation if they pass through networks specifying a smaller MTU.
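The splitting performed in step 2 and the offsets set in step 3 can be sketched as follows. This is a simplified illustration only (options handling and checksum recomputation are omitted, and the fragment is represented as a small dictionary rather than a real datagram):

```python
def fragment(data: bytes, mtu: int, header_len: int = 20):
    """Split a datagram payload per the steps above (simplified sketch)."""
    max_data = ((mtu - header_len) // 8) * 8    # data per fragment: 8-octet multiple
    frags = []
    offset = 0
    while offset < len(data):
        chunk = data[offset:offset + max_data]
        more = offset + len(chunk) < len(data)  # MF set on all but the last
        frags.append({"offset_units": offset // 8, "MF": int(more), "data": chunk})
        offset += len(chunk)
    return frags
```

With an MTU of 68 octets and a 20-octet header, each fragment carries 48 octets of data, so a 100-octet payload yields fragments at 8-octet offsets 0, 6, and 12.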
At the destination host, the data is reassembled into the original datagram. The
identification field set by the sending host is used together with the source and
destination IP addresses in the datagram. Fragmentation does not alter this field.
In order to reassemble the fragments, the receiving host allocates a storage
buffer when the first fragment arrives. The host also starts a timer. When
subsequent fragments of the datagram arrive, the data is copied into the buffer
storage at the location indicated by the fragment offset field. When all fragments
have arrived, the complete original unfragmented datagram is restored.
Processing continues as for unfragmented datagrams.
If the timer is exceeded and fragments remain outstanding, the datagram is
discarded. The initial value of this timer is called the IP datagram time to live
(TTL) value. It is implementation-dependent. Some implementations allow it to
be configured.
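The buffer-copy behavior described above can be sketched as follows. This is a minimal illustration, assuming the same small fragment dictionaries as before; it handles out-of-order arrival but not overlapping fragments or the reassembly timer:

```python
def reassemble(frags):
    """Copy each fragment into the buffer at fragment offset * 8; the
    MF=0 fragment fixes the total datagram length (simplified sketch)."""
    buf = bytearray()
    received = 0
    total = None
    for f in frags:                      # fragments may arrive in any order
        start = f["offset_units"] * 8
        end = start + len(f["data"])
        if end > len(buf):
            buf.extend(b"\x00" * (end - len(buf)))
        buf[start:end] = f["data"]
        received += len(f["data"])
        if f["MF"] == 0:
            total = end                  # last fragment defines the total length
    return bytes(buf) if total is not None and received == total else None
```

Returning None models the "fragments remain outstanding" case: the datagram is not complete and, in a real stack, would eventually be discarded when the timer expires.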
The netstat command can be used on some IP hosts to list the details of
fragmentation.
IP datagram routing options
The IP datagram Options field provides two methods for the originator of an IP
datagram to explicitly provide routing information. It also provides a method for
an IP datagram to determine the route that it travels.
Loose source routing
The loose source routing option, also called the loose source and record route
(LSRR) option, provides a means for the source of an IP datagram to supply
explicit routing information. This information is used by the routers when
forwarding the datagram to the destination. It is also used to record the route, as
illustrated in Figure 3-23.
route data //
Figure 3-23 IP: Loose source routing option
The fields of this header include:
Type (10000011, decimal 131)
This is the value of the option type octet for loose
source routing.
Length
This field contains the length of this option field,
including the type and length fields.
Pointer
This field points to the option data at the next IP
address to be processed. It is counted relative to the
beginning of the option, so its minimum value is four. If
the pointer is greater than the length of the option, the
end of the source route is reached and further routing
is to be based on the destination IP address (as for
datagrams without this option).
Route data
This field contains a series of 32-bit IP addresses.
When a datagram arrives at its destination and the source route is not empty
(pointer < length), the receiving host:
1. Takes the next IP address in the route data field (the one indicated by the
pointer field) and puts it in the destination IP address field of the datagram.
2. Puts the local IP address in the source list at the location pointed to by the
pointer field. The IP address for this is the local IP address corresponding to
the network on which the datagram will be forwarded. (Routers are attached
to multiple physical networks and thus have multiple IP addresses.)
3. Increments the pointer by 4.
4. Transmits the datagram to the new destination IP address.
This procedure ensures that the return route is recorded in the route data (in
reverse order) so that the final recipient uses this data to construct a loose
source route in the reverse direction. This is a loose source route because the
forwarding router is allowed to use any route and any number of intermediate
routers to reach the next address in the route.
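The four numbered steps above amount to a small amount of pointer arithmetic. A sketch, with the function name and list representation chosen for illustration:

```python
def process_source_route(route_data, pointer, local_addr):
    """Apply steps 1-4 above. `pointer` is in octets from the start of the
    option (minimum value 4); `route_data` is the list of addresses."""
    idx = (pointer - 4) // 4              # route-data slot the pointer selects
    new_dest = route_data[idx]            # 1. next address becomes the destination
    route_data[idx] = local_addr          # 2. record the outgoing local address
    pointer += 4                          # 3. advance the pointer
    return new_dest, route_data, pointer  # 4. forward to new_dest
```

After the final hop, the route data holds the recorded return path in reverse order, as the text describes.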
Strict source routing
The strict source routing option, also called the strict source and record route
(SSRR) option, uses the same principle as loose source routing except the
intermediate router must send the datagram to the next IP address in the source
route through a directly connected network. It cannot use an intermediate router.
If this cannot be done, an ICMP Destination Unreachable error message is
issued. Figure 3-24 gives an overview of the SSRR option.
Figure 3-24 IP: Strict source routing option
The fields of this header include:
Type (10001001, decimal 137)
The value of the option type byte for strict source
routing.
Length
This information is described in “Loose source routing”
on page 105.
Pointer
This information is described in “Loose source routing”
on page 105.
Route data
A series of 32-bit IP addresses.
Record route
This option provides a means to record the route traversed by an IP datagram. It
functions similarly to the source routing option. However, this option provides an
empty routing data field. This field is filled in as the datagram traverses the
network. Sufficient space for this routing information must be provided by the
source host. If the data field is filled before the datagram reaches its destination,
the datagram is forwarded with no further recording of the route. Figure 3-25
gives an overview of the record route option.
Figure 3-25 IP: Record route option
The fields of this header include:
Type (00000111, decimal 7)
The value of the option type byte for record route.
Length
This information is described in “Loose source routing”
on page 105.
Pointer
This information is described in “Loose source routing”
on page 105.
Route data
A series of 32-bit IP addresses.
Internet time stamp
A time stamp is an option forcing some (or all) of the routers along the route to
the destination to put a time stamp in the option data. The time stamps are
measured in seconds and can be used for debugging purposes. They cannot be
used for performance measurement for two reasons:
򐂰 Because most IP datagrams are forwarded in less than one second, the time
stamps are not precise.
򐂰 Because IP routers are not required to have synchronized clocks, they may
not be accurate.
Figure 3-26 gives an overview of the Internet time stamp option.
Figure 3-26 IP: Internet time stamp option
The fields of this option include:
Type (01000100, decimal 68)
This field is the value of the option type for the internet
time stamp option.
Length
This field contains the total length of this option,
including the type and length fields.
Pointer
This field points to the next time stamp to be
processed (first free time stamp).
Oflw (overflow)
This field contains the number of devices that cannot
register time stamps due to a lack of space in the data
field.
Flag
This field is a 4-bit value that indicates how time
stamps are to be registered:
0: Time stamps only, stored in consecutive 32-bit
words.
1: Each time stamp is preceded by the IP address of
the registering device.
3: The IP address fields are prespecified; an IP
device only registers when it finds its own address
in the list.
Time stamp
A 32-bit time stamp recorded in milliseconds since
midnight UT (GMT).
The originating host must compose this option with a sufficient data area to hold
all the time stamps. If the time stamp area becomes full, no further time stamps
are added.
3.2 Internet Control Message Protocol (ICMP)
ICMP is a standard protocol with STD number 5. That standard also includes IP
(see 3.1, “Internet Protocol (IP)” on page 68) and IGMP (see 6.2, “Internet Group
Management Protocol (IGMP)” on page 241). Its status is required. It is
described in RFC 792 with updates in RFC 950. ICMPv6 used for IPv6 is
discussed in 9.3, “Internet Control Message Protocol Version 6 (ICMPv6)” on
page 352.
Path MTU Discovery is a draft standard protocol with a status of elective. It is
described in RFC 1191.
ICMP Router Discovery is a proposed standard protocol with a status of elective.
It is described in RFC 1256.
When a router or a destination host must inform the source host about errors in
datagram processing, it uses the Internet Control Message Protocol (ICMP).
ICMP can be characterized as follows:
򐂰 ICMP uses IP as though ICMP were a higher-level protocol (that is, ICMP
messages are encapsulated in IP datagrams). However, ICMP is an integral
part of IP and must be implemented by every IP module.
򐂰 ICMP is used to report errors, not to make IP reliable. Datagrams can still be
undelivered without any report on their loss. Reliability must be implemented
by the higher-level protocols using IP services.
򐂰 ICMP cannot be used to report errors with ICMP messages. This avoids
infinite repetitions. ICMP responses are sent in response to ICMP query
messages (ICMP types 0, 8, 9, 10, and 13 through 18).
򐂰 For fragmented datagrams, ICMP messages are only sent about errors with
the first fragment. That is, ICMP messages never refer to an IP datagram with
a non-zero fragment offset field.
򐂰 ICMP messages are never sent in response to datagrams with a broadcast or
a multicast destination address.
򐂰 ICMP messages are never sent in response to a datagram that does not have
a source IP address representing a unique host. That is, the source address
cannot be zero, a loopback address, a broadcast address, or a multicast
address.
򐂰 RFC 792 states that ICMP messages can be generated to report IP datagram
processing errors. However, this is not required. In practice, routers will
almost always generate ICMP messages for errors. For destination hosts,
ICMP message generation is implementation dependent.
3.2.1 ICMP messages
ICMP messages are described in RFC 792 and RFC 950, belong to STD 5, and
are mandatory.
ICMP messages are sent in IP datagrams. The IP header has a protocol number
of 1 (ICMP) and a type of service of zero (routine). The IP data field contains the
ICMP message shown in Figure 3-27.
Figure 3-27 ICMP: Message format
The message contains the following components:
Type
Specifies the type of the message:
0: Echo reply
3: Destination unreachable
4: Source quench
5: Redirect
8: Echo
9: Router advertisement
10: Router solicitation
11: Time exceeded
12: Parameter problem
13: Time stamp request
14: Time stamp reply
17: Address mask request
18: Address mask reply
37: Domain name request
38: Domain name reply
Some of these message types are defined in separate RFCs: RFC 1256
(router discovery), RFC 1393 (traceroute), and RFC 1788 (domain name
messages).
Code
Contains the error code for the datagram reported by this
ICMP message. The interpretation is dependent on the
message type.
Checksum
Contains the checksum for the ICMP message starting
with the ICMP Type field. If the checksum does not
match the contents, the datagram is discarded.
Data
Contains information for this ICMP message. Typically, it
will contain the portion of the original IP message for
which this ICMP message was generated.
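The checksum used here (and in the IP header) is the standard Internet checksum of RFC 1071: the 16-bit one's complement of the one's complement sum of the message taken as 16-bit words. A minimal sketch:

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (RFC 1071)."""
    if len(data) % 2:                    # pad odd-length data with a zero octet
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return (~total) & 0xFFFF
```

Verification on receipt is the same operation: summing a message that already contains a correct checksum yields zero.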
Each of the ICMP messages is described individually.
Echo (8) and Echo Reply (0)
Echo is used to detect if another host is active in the network. It is used by the
Ping command (refer to “Ping” on page 117). The sender initializes the identifier,
sequence number, and data field. The datagram is then sent to the destination
host. The recipient changes the type to Echo Reply and returns the datagram to
the sender. See Figure 3-28 for more details.
Figure 3-28 Echo and Echo Reply
Destination Unreachable (3)
If this message is received from an intermediate router, it means that the router
regards the destination IP address as unreachable.
If this message is received from the destination host, it means that either the
protocol specified in the protocol number field of the original datagram is not
active or the specified port is inactive. (Refer to 4.2, “User Datagram Protocol
(UDP)” on page 146 for additional information regarding ports.) See Figure 3-29
for more details.
Figure 3-29 ICMP: Destination Unreachable
The ICMP header code field contains one of the following values:
0: Network unreachable
1: Host unreachable
2: Protocol unreachable
3: Port unreachable
4: Fragmentation needed but the Do Not Fragment bit was set
5: Source route failed
6: Destination network unknown
7: Destination host unknown
8: Source host isolated (obsolete)
9: Destination network administratively prohibited
10: Destination host administratively prohibited
11: Network unreachable for this type of service
12: Host unreachable for this type of service
13: Communication administratively prohibited by filtering
14: Host precedence violation
15: Precedence cutoff in effect
These are detailed in RFC 792, RFC 1812 (updated by RFC 2644), and
RFC 1122 (updated by RFC 4379), which forms part of STD 3 – Host
Requirements.
If a router implements the Path MTU Discovery protocol, the format of the
destination unreachable message is changed for code 4. This includes the MTU
of the link that did not accept the datagram. See Figure 3-30 for more details.
Figure 3-30 ICMP: Fragmentation required with link MTU
Source Quench (4)
If this message is received from an intermediate router, it means that the router
did not have the buffer space needed to queue the datagram.
If this message is received from the destination host, it means that the incoming
datagrams are arriving too quickly to be processed.
The ICMP header code field is always zero.
See Figure 3-31 for more details.
Figure 3-31 ICMP: Source Quench
Redirect (5)
If this message is received from an intermediate router, it means that the host
should send future datagrams for the network to the router whose IP address is
specified in the ICMP message. This preferred router will always be on the same
subnet as the host that sent the datagram and the router that returned the IP
datagram. The router forwards the datagram to its next hop destination. This
message will not be sent if the IP datagram contains a source route.
The ICMP header code field will have one of the following values:
0: Network redirect
1: Host redirect
2: Network redirect for this type of service
3: Host redirect for this type of service
See Figure 3-32 for more details.
Figure 3-32 ICMP: Redirect
Router Advertisement (9) and Router Solicitation (10)
ICMP messages 9 and 10 are optional. They are described in RFC 1256, which
is elective. See Figure 3-33 and Figure 3-34 on page 114 for details.
Figure 3-33 ICMP: Router Advertisement
Figure 3-34 ICMP: Router Solicitation
The fields of these messages include:
Number of addresses
The number of entries in the message.
Entry length
The length of an entry in 32-bit units. This is 2 (32 bits for
the IP address and 32 bits for the preference value).
Lifetime
The number of seconds that an entry will be considered
valid.
Router address
One of the sender's IP addresses.
Preference level
A signed 32-bit level indicating the preference to be
assigned to this address when selecting a default router.
Each router on a subnet is responsible for advertising its
own preference level. Larger values imply higher
preference; smaller values imply lower. The default is
zero, which is in the middle of the possible range. A value
of X'80000000' (-2^31) indicates the router should never
be used as a default router.
The ICMP header code field is zero for both of these messages.
These two messages are used if a host or a router supports the router discovery
protocol. Routers periodically advertise their IP addresses on those subnets
where they are configured to do so. Advertisements are made on the all-systems
multicast address ( or the limited broadcast address
( The default behavior is to send advertisements every 10
minutes with a TTL value of 1800 (30 minutes). Routers also reply to solicitation
messages they receive. They might reply directly to the soliciting host, or they
might wait a short random interval and reply with a multicast.
Hosts can send solicitation messages. Solicitation messages are sent to the
all-routers multicast address ( or the limited broadcast address
( Typically, three solicitation messages are sent at 3-second
intervals. Alternatively, a host can wait for periodic advertisements. Each time a
host receives an advertisement with a higher preference value, it updates its
default router. The host also sets the TTL timer for the new entry to match the
value in the advertisement. When the host receives a new advertisement for its
current default router, it resets the TTL value to that in the new advertisement.
This process also provides a mechanism for routers to declare themselves
unavailable. They send an advertisement with a TTL value of zero.
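The preference-based selection described above can be sketched as follows; the advertisement representation (a list of small dictionaries) is illustrative:

```python
def select_default_router(adverts):
    """Pick the advertised address with the highest preference level.
    X'80000000' (-2**31) marks a router that must never be the default."""
    NEVER = -2**31
    usable = [a for a in adverts if a["preference"] != NEVER]
    if not usable:
        return None
    return max(usable, key=lambda a: a["preference"])["address"]
```

A real host would also honor each entry's lifetime, dropping entries whose timers expire without a refreshing advertisement.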
Time Exceeded (11)
If this message is received from an intermediate router, it means that the time to
live field of an IP datagram has expired.
If this message is received from the destination host, it means that the IP
fragment reassembly time to live timer has expired while the host is waiting for a
fragment of the datagram. The ICMP header code field can have one of the
following values:
0: Transit TTL exceeded
1: Reassembly TTL exceeded
See Figure 3-35 for more details.
Figure 3-35 ICMP: Time Exceeded
Parameter Problem (12)
This message indicates that a problem was encountered during processing of
the IP header parameters. The pointer field indicates the octet in the original IP
datagram where the problem was encountered. The ICMP header code field can
have one of the following values:
0: Unspecified error
1: Required option missing
See Figure 3-36 for more details.
Figure 3-36 ICMP: Parameter Problem
Timestamp Request (13) and Timestamp Reply (14)
These two messages are for debugging and performance measurements. They
are not used for clock synchronization.
The sender initializes the identifier and sequence number (which is used if
multiple time stamp requests are sent), sets the originate time stamp, and sends
the datagram to the recipient. The receiving host fills in the receive and transmit
time stamps, changes the type to time stamp reply, and returns it to the original
sender. The datagram has two time stamps if there is a perceptible time
difference between the receipt and transmit times. In practice, most
implementations perform the two (receipt and reply) in one operation. This sets
the two time stamps to the same value. Time stamps are the number of
milliseconds elapsed since midnight UT (GMT).
See Figure 3-37 for details.
Figure 3-37 ICMP: Timestamp Request and Timestamp Reply
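The time stamp values themselves are easy to produce. A sketch of computing "milliseconds since midnight UT" as carried in these messages (the function name is illustrative):

```python
from datetime import datetime, timezone

def icmp_timestamp_now() -> int:
    """Milliseconds elapsed since midnight UT (GMT)."""
    now = datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((now - midnight).total_seconds() * 1000)
```

The value always lies in the range 0 to 86,399,999 (one day's worth of milliseconds).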
Address Mask Request (17) and Address Mask Reply (18)
An address mask request is used by a host to determine the subnet mask used
on an attached network. Most hosts are configured with their subnet mask or
masks. However, some, such as diskless workstations, must obtain this
information from a server. A host uses RARP (see 3.5, “Reverse Address
Resolution Protocol (RARP)” on page 124) to obtain its IP address. To obtain a
subnet mask, the host broadcasts an address mask request. Any host in the
network that has been configured to send address mask replies will fill in the
subnet mask, convert the packet to an address mask reply, and return it to the
sender. The ICMP header code field is zero.
See Figure 3-38 on page 117 for more details.
Figure 3-38 ICMP: Address Mask Request and Reply
3.2.2 ICMP applications
There are two simple and widely used applications based on ICMP: Ping and
Traceroute. Ping uses the ICMP Echo and Echo Reply messages to determine
whether a host is reachable. Traceroute sends IP datagrams with low TTL values
so that they expire en route to a destination. It uses the resulting ICMP Time
Exceeded messages to determine where in the internet the datagrams expired
and pieces together a view of the route to a host. We discuss these applications
in the following sections.
Ping
Ping is the simplest of all TCP/IP applications. It sends IP datagrams to a
specified destination host and measures the round trip time to receive a
response. The word ping, which is used as a noun and a verb, is taken from the
sonar operation to locate an underwater object. It is also an abbreviation for
Packet InterNet Groper.
Generally, the first test of reachability for a host is to attempt to ping it. If you can
successfully ping a host, other applications such as Telnet or FTP should be able
to reach that host. However, with the advent of security measures on the Internet,
particularly firewalls (see 22.3, “Firewalls” on page 794), which control access to
networks by application protocol or port number, or both, this is no longer
necessarily true. The ICMP protocol can be restricted on the firewall, and
therefore the host cannot be successfully pinged.
The syntax that is used in different implementations of ping varies from platform
to platform. A common format for using the ping command is:
ping host
Where host is the destination, either a symbolic name or an IP address.
Most platforms allow you to specify the following values:
Packet size
The size of the data portion of the packet.
Count
The number of packets (echo requests) to send.
Record routes
Record the route per count hop.
Time stamp
Time stamp each count hop.
Endless ping
Ping until manually stopped.
Resolve address
Resolve the host address to the host name.
Time to Live (TTL)
The time (in seconds) the datagram is allowed to travel.
Type of Service (TOS)
The type of internet service quality.
Source route
Loose source route or strict source route of host lists.
Timeout
The timeout to wait for each reply.
No fragmentation
The fragment flag is not set.
Ping uses the ICMP Echo and Echo Reply messages (refer to “Echo (8) and
Echo Reply (0)” on page 111). Because ICMP is required in every TCP/IP
implementation, hosts do not require a separate server to respond to ping
requests.
Ping is useful for verifying an IP installation. The following variations of the
command each require the operation of a different portion of an IP installation:
򐂰 ping loopback: Verifies the operation of the base TCP/IP software.
򐂰 ping my-IP-address: Verifies whether the physical network device can be
accessed.
򐂰 ping a-remote-IP-address: Verifies whether the network can be accessed.
򐂰 ping a-remote-host-name: Verifies the operation of the name server (or the
flat namespace resolver, depending on the installation).
Traceroute
The Traceroute program is used to determine the route IP datagrams follow
through the network.
Traceroute is based on ICMP and UDP. It sends an IP datagram with a TTL of 1
to the destination host. The first router decrements the TTL to 0, discards the
datagram, and returns an ICMP Time Exceeded message to the source. In this
way, the first router in the path is identified. This process is repeated with
successively larger TTL values to identify the exact series of routers in the path
to the destination host.
Traceroute sends UDP datagrams to the destination host. These datagrams
reference a port number outside the standard range. When an ICMP Port
Unreachable message is received, the source determines the destination host
has been reached.
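A real traceroute needs raw sockets and operating-system privileges, but the algorithm itself can be shown with a simple simulation; the function and the `path` list of router names are purely illustrative:

```python
def traceroute_sim(path, max_hops=30):
    """Probe with TTL = 1, 2, 3, ... Each probe 'expires' at hop TTL,
    eliciting a Time Exceeded from that router; the destination instead
    answers with Port Unreachable, ending the trace."""
    route = []
    for ttl in range(1, max_hops + 1):
        hop = path[ttl - 1]          # router where this probe's TTL reaches 0
        route.append(hop)
        if ttl == len(path):         # destination reached: stop probing
            break
    return route
```

The max_hops bound mirrors the common default of 30 hops, after which real implementations give up.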
3.3 Internet Group Management Protocol (IGMP)
IGMP is a standard protocol with STD number 5. That standard also includes IP
(see 3.1, “Internet Protocol (IP)” on page 68) and ICMP (see 3.2, “Internet
Control Message Protocol (ICMP)” on page 109). Its status is recommended. It is
described in RFC 1112 with updates in RFC 2236.
Similar to ICMP, the Internet Group Management Protocol (IGMP) is also an
integral part of IP. It allows hosts to participate in IP multicasts. IGMP further
provides routers with the capability to check if any hosts on a local subnet are
interested in a particular multicast.
Refer to 6.2, “Internet Group Management Protocol (IGMP)” on page 241 for a
detailed review of IGMP.
3.4 Address Resolution Protocol (ARP)
Address Resolution Protocol (ARP) is a network-specific standard protocol. The
address resolution protocol is responsible for converting the higher-level protocol
addresses (IP addresses) to physical network addresses. It is described in
RFC 826.
3.4.1 ARP overview
On a single physical network, individual hosts are known in the network by their
physical hardware address. Higher-level protocols address destination hosts in
the form of a symbolic address (IP address in this case). When such a protocol
wants to send a datagram to destination IP address w.x.y.z, the device driver
does not understand this address.
Therefore, a module (ARP) is provided that will translate the IP address to the
physical address of the destination host. It uses a lookup table (sometimes
referred to as the ARP cache) to perform this translation.
When the address is not found in the ARP cache, a broadcast is sent out in the
network with a special format called the ARP request. If one of the machines in
the network recognizes its own IP address in the request, it will send an ARP
reply back to the requesting host. The reply will contain the physical hardware
address of the host and source route information (if the packet has crossed
bridges on its path). Both this address and the source route information are
stored in the ARP cache of the requesting host. All subsequent datagrams to this
destination IP address can now be translated to a physical address, which is
used by the device driver to send out the datagram in the network.
Chapter 3. Internetworking protocols
An exception to the rule is the asynchronous transfer mode (ATM) technology,
where ARP cannot be implemented in the physical layer as described previously.
Therefore, every host, upon initialization, must register with an ARP server in
order to be able to resolve IP addresses to hardware addresses (also see 2.10,
“Asynchronous transfer mode (ATM)” on page 47).
ARP was designed to be used on networks that support hardware broadcast.
This means, for example, that ARP will not work on an X.25 network.
3.4.2 ARP detailed concept
ARP is used on IEEE 802 networks as well as on the older DIX Ethernet
networks to map IP addresses to physical hardware addresses (see 2.1,
“Ethernet and IEEE 802 local area networks (LANs)” on page 30). To do this, it is
closely related to the device driver for that network. In fact, the ARP
specifications in RFC 826 only describe its functionality, not its implementation.
The implementation depends to a large extent on the device driver for a network
type and they are usually coded together in the adapter microcode.
ARP packet generation
If an application wants to send data to a certain IP destination address, the IP
routing mechanism first determines the IP address of the next hop of the packet
(it can be the destination host itself, or a router) and the hardware device on
which it should be sent. If it is an IEEE 802.3/4/5 network, the ARP module must
be consulted to map the <protocol type, target protocol address> to a physical
address.
The ARP module tries to find the address in its ARP cache. If it finds the
matching pair, it gives the corresponding 48-bit physical address back to the
caller (the device driver), which then transmits the packet. If it does not find the
pair in its table, it discards the packet (the assumption is that a higher-level
protocol will retransmit) and generates a network broadcast of an ARP request.
See Figure 3-39 on page 121 for more details.
Figure 3-39 shows the packet layout: a physical layer header (x bytes), the
hardware address space (2 bytes), the protocol address space (2 bytes), the
hardware address byte length n (1 byte), the protocol address byte length m
(1 byte), the operation code (2 bytes), the hardware address of the sender
(n bytes), the protocol address of the sender (m bytes), the hardware address
of the target (n bytes), and the protocol address of the target (m bytes).
Figure 3-39 ARP: Request/reply packet
򐂰 Hardware address space: Specifies the type of hardware; examples are
Ethernet or Packet Radio Net.
򐂰 Protocol address space: Specifies the type of protocol, same as the
EtherType field in the IEEE 802 header (IP or ARP).
򐂰 Hardware address length: Specifies the length (in bytes) of the hardware
addresses in this packet. For IEEE 802.3 and IEEE 802.5, this is 6.
򐂰 Protocol address length: Specifies the length (in bytes) of the protocol
addresses in this packet. For IP, this is 4.
򐂰 Operation code: Specifies whether this is an ARP request (1) or reply (2).
򐂰 Source/target hardware address: Contains the physical network hardware
addresses. For IEEE 802.3, these are 48-bit addresses.
򐂰 Source/target protocol address: Contains the protocol addresses. For
TCP/IP, these are the 32-bit IP addresses.
For the ARP request packet, the target hardware address is the only undefined
field in the packet.
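The field layout above can be packed directly with the standard library. The following is a minimal sketch for an Ethernet/IPv4 ARP request; the function name and the addresses used in it are illustrative, not part of any real API.

```python
import socket
import struct

ARP_REQUEST, ARP_REPLY = 1, 2  # operation codes

def build_arp_request(sender_mac, sender_ip, target_ip):
    """Pack a 28-byte ARP request following the Figure 3-39 layout."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,                          # hardware address space: Ethernet
        0x0800,                     # protocol address space: IP (EtherType)
        6,                          # hardware address length (bytes)
        4,                          # protocol address length (bytes)
        ARP_REQUEST,                # operation code
        sender_mac,                 # hardware address of sender
        socket.inet_aton(sender_ip),
        b"\x00" * 6,                # target hardware address: undefined
        socket.inet_aton(target_ip),
    )
```

Note that the target hardware address is left as zeros, matching the statement that it is the only undefined field in a request.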
ARP packet reception
When a host receives an ARP packet (either a broadcast request or a
point-to-point reply), the receiving device driver passes the packet to the ARP
module, which treats it as shown in Figure 3-40.
In outline, the algorithm in Figure 3-40 is: first check that the receiver has
the specified hardware type and speaks the specified protocol; otherwise, the
packet is discarded. Set flag = false. If the pair <protocol type, sender
protocol address> is already in the table, update the table with the sender
hardware address and set flag = true. If the receiver is the target protocol
address and flag = false, add the triplet <protocol type, sender protocol
address, sender hardware address> to the table. Finally, if the opcode is a
request, swap the source and target addresses in the ARP packet, put the local
addresses in the source address fields, and send the packet back as an ARP
reply to the requesting host.
Figure 3-40 ARP: Packet reception
The requesting host will receive this ARP reply, and will follow the same
algorithm to treat it. As a result of this, the triplet <protocol type, protocol
address, hardware address> for the desired host will be added to its lookup table
(ARP cache). The next time a higher-level protocol wants to send a packet to that
host, the ARP module will find the target hardware address and the packet will be
sent to that host.
Note that because the original ARP request was a broadcast in the network, all
hosts on that network will have updated the sender's hardware address in their
table (only if it was already in the table).
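The reception algorithm can be sketched in Python. The packet here is a plain dictionary rather than a parsed frame, the hardware-type and protocol checks at the top of the flowchart are omitted for brevity, and all names are illustrative.

```python
# Sketch of the ARP packet-reception algorithm (Figure 3-40).
# Note the asymmetry: an existing cache entry is always refreshed,
# but a new triplet is only added if we are the target.
def handle_arp(my_protocol_addr, my_hardware_addr, cache, pkt):
    """Update the ARP cache; return a reply dict for a request to us, else None."""
    key = (pkt["protocol_type"], pkt["sender_protocol_addr"])
    merge_flag = False
    if key in cache:                        # pair already in table: update it
        cache[key] = pkt["sender_hardware_addr"]
        merge_flag = True
    if pkt["target_protocol_addr"] != my_protocol_addr:
        return None                         # not addressed to this host
    if not merge_flag:                      # add the new triplet
        cache[key] = pkt["sender_hardware_addr"]
    if pkt["opcode"] != 1:                  # only a request produces a reply
        return None
    return {                                # swap source and target addresses
        "protocol_type": pkt["protocol_type"],
        "opcode": 2,
        "sender_protocol_addr": my_protocol_addr,
        "sender_hardware_addr": my_hardware_addr,
        "target_protocol_addr": pkt["sender_protocol_addr"],
        "target_hardware_addr": pkt["sender_hardware_addr"],
    }
```

A host that is not the target thus learns nothing new from a broadcast request unless the sender was already in its table, exactly as noted above.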
3.4.3 ARP and subnets
The ARP protocol remains unchanged in the presence of subnets. Remember
that each IP datagram first goes through the IP routing algorithm. This algorithm
selects the hardware device driver that should send out the packet. Only then,
the ARP module associated with that device driver is consulted.
3.4.4 Proxy-ARP or transparent subnetting
Proxy-ARP is described in RFC 1027, which is a subset of the method proposed
in RFC 925. It is another method to construct local subnets, without the need for
a modification to the IP routing algorithm, but with modifications to the routers
that interconnect the subnets.
Proxy-ARP concept
Consider one IP network that is divided into subnets and interconnected by
routers. We use the “old” IP routing algorithm, which means that no host knows
about the existence of multiple physical networks. Consider hosts A and B, which
are on different physical networks within the same IP network, and a router R
between the two subnetworks as illustrated in Figure 3-41.
Figure 3-41 ARP: Hosts interconnected by a router
When host A wants to send an IP datagram to host B, it first has to determine the
physical network address of host B through the use of the ARP protocol.
Because host A cannot differentiate between the physical networks, its IP routing
algorithm thinks that host B is on the local physical network and sends out a
broadcast ARP request. Host B does not receive this broadcast, but router R
does. Router R understands subnets, that is, it runs the subnet version of the IP
routing algorithm and it will be able to see that the destination of the ARP request
(from the target protocol address field) is on another physical network. If router
R's routing tables specify that the next hop to that other network is through a
different physical device, it will reply to the ARP as though it were host B, saying
that the network address of host B is that of the router R itself.
Host A receives this ARP reply, puts it in its cache, and will send future IP
packets for host B to the router R. The router will forward such packets to the
correct subnet.
The result is transparent subnetting:
򐂰 Normal hosts (such as A and B) do not know about subnetting, so they use
the “old” IP routing algorithm.
򐂰 The routers between subnets have to:
– Use the subnet IP routing algorithm.
– Use a modified ARP module, which can reply on behalf of other hosts.
See Figure 3-42 for more details.
" o ld " I P
r o u tin g
s u b n e t r o u tin g
a n d m o d ifie d A R P
Figure 3-42 ARP: Proxy-ARP router
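The router's decision can be sketched as follows. This is a minimal illustration of the proxy-ARP idea, assuming a router that knows which subnet sits behind each interface; the interface names, subnets, and function name are made-up examples.

```python
import ipaddress

# Proxy-ARP decision (RFC 1027 concept): answer an ARP request with the
# router's own hardware address when the target host is reachable through
# a different interface than the one the request arrived on.
def proxy_arp_reply(router_mac, interfaces, arrival_if, target_ip):
    """interfaces: name -> subnet string. Returns MAC to reply with, or None."""
    target = ipaddress.ip_address(target_ip)
    for name, subnet in interfaces.items():
        if target in ipaddress.ip_network(subnet):
            # Reply only if the target sits behind another interface;
            # on the local segment, the target answers for itself.
            return router_mac if name != arrival_if else None
    return None
```

Host A would receive the router's MAC address in the reply and cache it as host B's address, which is exactly the transparent-subnetting effect described above.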
3.5 Reverse Address Resolution Protocol (RARP)
Reverse Address Resolution Protocol (RARP) is a network-specific standard
protocol. It is described in RFC 903.
Some network hosts, such as diskless workstations, do not know their own IP
address when they are booted. To determine their own IP address, they use a
mechanism similar to ARP, but now the hardware address of the host is the
known parameter and the IP address the queried parameter. It differs more
fundamentally from ARP in that a RARP server must exist in the network,
maintaining a preconfigured database of mappings from hardware addresses to
protocol addresses.
3.5.1 RARP concept
The reverse address resolution is performed the same way as the ARP address
resolution, using the same packet format (see Figure 3-39 on page 121).
An exception is the operation code field, which now takes the following values:
3 for the RARP request
4 for the RARP reply
And of course, the physical header of the frame will now indicate RARP as the
higher-level protocol (8035 hex) instead of ARP (0806 hex) or IP (0800 hex) in
the EtherType field.
Some differences arise from the concept of RARP itself:
򐂰 ARP only assumes that every host knows the mapping between its own
hardware address and protocol address. RARP requires one or more server
hosts in the network to maintain a database of mappings between hardware
addresses and protocol addresses so that they will be able to reply to
requests from client hosts.
򐂰 Due to the size this database can take, part of the server function is usually
implemented outside the adapter's microcode, with optionally a small cache in
the microcode. The microcode part is then only responsible for reception and
transmission of the RARP frames, the RARP mapping itself being taken care
of by server software running as a normal process on the host machine.
򐂰 The nature of this database also requires some software to create and update
the database manually.
򐂰 If there are multiple RARP servers in the network, the RARP requester only
uses the first RARP reply received on its broadcast RARP request and
discards the others.
3.6 Bootstrap Protocol (BOOTP)
The Bootstrap Protocol (BOOTP) enables a client workstation to initialize with a
minimal IP stack and request its IP address, a gateway address, and the address
of a name server from a BOOTP server. If BOOTP is to be used in your network,
the server and client are usually on the same physical LAN segment. BOOTP
can only be used across bridged segments when source-routing bridges are
being used, or across subnets, if you have a router capable of BOOTP
forwarding.
BOOTP is a draft standard protocol. Its status is recommended. The BOOTP
specifications are in RFC 951. There are also updates to BOOTP, some relating
to interoperability with DHCP (see 3.7, “Dynamic Host Configuration Protocol
(DHCP)” on page 130), described in RFC 1542, which updates RFC 951, and
RFC 2132. The updates to BOOTP are draft standards with a status of elective
and recommended, respectively.
The BOOTP protocol was originally developed as a mechanism to enable
diskless hosts to be remotely booted over a network as workstations, routers,
terminal concentrators, and so on. It allows a minimum IP protocol stack with no
configuration information to obtain enough information to begin the process of
downloading the necessary boot code. BOOTP does not define how the
downloading is done, but this process typically uses TFTP (see also 14.2, “Trivial
File Transfer Protocol (TFTP)” on page 529), as described in RFC 906. Although
still widely used for this purpose by diskless hosts, BOOTP is also commonly
used solely as a mechanism to deliver configuration information to a client that
has not been manually configured.
The BOOTP process involves the following steps:
1. The client determines its own hardware address; this is normally in a ROM on
the hardware.
2. A BOOTP client sends its hardware address in a UDP datagram to the server.
Figure 3-43 on page 127 shows the full contents of this datagram. If the client
knows its IP address or the address of the server, it should use them, but in
general, BOOTP clients have no IP configuration data at all. If the client does
not know its own IP address, it uses 0.0.0.0. If the client does not know the
server's IP address, it uses the limited broadcast address (255.255.255.255).
The UDP port number is 67.
3. The server receives the datagram and looks up the hardware address of the
client in its configuration file, which contains the client's IP address. The
server fills in the remaining fields in the UDP datagram and returns it to the
client using UDP port 68. One of three methods can be used to do this:
– If the client knows its own IP address (it was included in the BOOTP
request), the server returns the datagram directly to this address. It is
likely that the ARP cache in the server's protocol stack will not know the
hardware address matching the IP address. ARP will be used to
determine it as normal.
– If the client does not know its own IP address (it was 0.0.0.0 in the BOOTP
request), the server must concern itself with its own ARP cache.
– ARP on the server cannot be used to find the hardware address of the
client because the client does not know its IP address and so cannot reply
to an ARP request. This is called the “chicken and egg” problem. There
are two possible solutions:
– If the server has a mechanism for directly updating its own ARP cache
without using ARP itself, it does so and then sends the datagram directly.
– If the server cannot update its own ARP cache, it must send a
broadcast reply.
4. When it receives the reply, the BOOTP client will record its own IP address
(allowing it to respond to ARP requests) and begin the bootstrap process.
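The request built in step 2 can be sketched with standard-library packing. This is a minimal illustration of the 300-byte BOOTP message (per RFC 951, with the RFC 1542 flags field); the function name is hypothetical.

```python
import socket
import struct

# A BOOTPREQUEST from a client that knows only its hardware address:
# every IP field is 0.0.0.0 and the broadcast flag is set because the
# client cannot yet receive unicast IP datagrams.
def build_bootrequest(xid, client_mac):
    zero_ip = socket.inet_aton("0.0.0.0")
    return struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s64s",
        1,              # code: request
        1,              # H/W type: Ethernet
        6,              # hardware address length
        0,              # hops, set to 0 by the client
        xid,            # transaction ID
        0,              # seconds since the boot process started
        0x8000,         # flags: broadcast bit set
        zero_ip,        # client IP address
        zero_ip,        # your IP address
        zero_ip,        # server IP address
        zero_ip,        # router (relay agent) IP address
        client_mac.ljust(16, b"\x00"),   # client hardware address
        b"\x00" * 64,   # server host name
        b"\x00" * 128,  # boot file name
        b"\x00" * 64,   # vendor-specific area
    )
```

The datagram would then be sent from UDP port 68 to port 67 at the limited broadcast address.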
Figure 3-43 gives an overview of the BOOTP message format.
The message comprises the code, H/W type, length, and hops fields (1 byte
each), the transaction ID (4 bytes), the seconds and flags fields (2 bytes
each), the client, your, server, and router IP addresses (4 bytes each), the
client hardware address (16 bytes), the server host name (64 bytes), the boot
file name (128 bytes), and the vendor-specific area (64 bytes).
Figure 3-43 BOOTP message format
Code
Indicates a request (1) or a reply (2).
H/W type
The type of hardware, for example, 1 for Ethernet or 6 for
IEEE 802 networks.
Refer to STD 2 – Assigned Internet Numbers for a
complete list.
Length
Hardware address length in bytes. Ethernet and token
ring both use 6, for example.
Hops
The client sets this to 0.
It is incremented by a router that relays the request to
another server and is used to identify loops. RFC 951
suggests that a value of 3 indicates a loop.
Transaction ID
A random number used to match this boot request with
the response it generates.
Seconds
Set by the client. It is the elapsed time in seconds
since the client started its boot process.
Flags field
The most significant bit of the flags field is used as a
broadcast flag. All other bits must be set to zero; they
are reserved for future use. Normally, BOOTP servers
attempt to deliver BOOTREPLY messages directly to a
client using unicast delivery. The destination address
in the IP header is set to the BOOTP your IP address
and the MAC address is set to the BOOTP client
hardware address. If a host is unable to receive a
unicast IP datagram until it knows its IP address, this
broadcast bit must be set to indicate to the server that
the BOOTREPLY must be sent as an IP and MAC
broadcast. Otherwise, this bit must be set to zero.
Client IP address
Set by the client, either to its known IP address or 0.0.0.0.
Your IP address
Set by the server if the client IP address field was 0.0.0.0.
Server IP address
Set by the server.
Router IP address
This is the address of a BOOTP relay agent, not a
general IP router to be used by the client. It is set by
the forwarding agent when BOOTP forwarding is used
(see 3.6.1, “BOOTP forwarding” on page 129).
Client hardware address
Set by the client and used by the server to identify
which registered client is booting.
Server host name
Optional server host name terminated by X'00'.
Boot file name
The client either leaves this null or specifies a generic
name, such as router indicating the type of boot file to
be used. The server returns the fully qualified file name
of a boot file suitable for the client. The value is
terminated by X'00'.
Vendor-specific area
Optional vendor-specific area. Clients should always
fill the first four bytes with a “magic cookie.” If a
vendor-specific magic cookie is not used, the client
should use 99.130.83.99 followed by an end tag (255)
and set the remaining bytes to zero. The
vendor-specific area can also contain BOOTP Vendor
extensions. These are options that can be passed to
the client at boot time along with its IP address. For
example, the client can also receive the address of a
default router, the address of a domain name server,
and a subnet mask. BOOTP shares the same options
as DHCP, with the exception of several DHCP-specific
options. See RFC 2132 for full details.
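Parsing the vendor-specific area can be sketched as follows. This illustrative helper assumes the standard RFC 2132 layout: a 4-byte magic cookie, then tagged options, with tag 0 as a pad byte and tag 255 as the end marker; the function name is made up.

```python
# Parse a BOOTP/DHCP vendor area: magic cookie 99.130.83.99, followed
# by tag/length/value options (tag 0 = pad, tag 255 = end).
MAGIC_COOKIE = bytes([99, 130, 83, 99])

def parse_vendor_area(data):
    """Return a dict mapping option tag -> raw value bytes."""
    if data[:4] != MAGIC_COOKIE:
        return {}
    options, i = {}, 4
    while i < len(data):
        tag = data[i]
        if tag == 255:          # end tag: stop parsing
            break
        if tag == 0:            # pad byte has no length field
            i += 1
            continue
        length = data[i + 1]
        options[tag] = data[i + 2:i + 2 + length]
        i += 2 + length
    return options
```

For example, option tag 1 (the subnet mask) would appear as a 4-byte value in the returned dictionary.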
After the BOOTP client has processed the reply, it can proceed with the transfer
of the boot file and execute the full boot process. See RFC 906 for the
specification of how this is done with TFTP. In the case of a diskless host, the full
boot process will normally replace the minimal IP protocol stack, loaded from
ROM, and used by BOOTP and TFTP, with a normal IP protocol stack
transferred as part of the boot file and containing the correct customization for
the client.
3.6.1 BOOTP forwarding
The BOOTP client uses the limited broadcast address for BOOTP requests,
which requires the BOOTP server to be on the same subnet as the client.
BOOTP forwarding is a mechanism for routers to forward BOOTP requests
across subnets. It is a configuration option available on most routers. The router
configured to forward BOOTP requests is known as a BOOTP relay agent.
A router will normally discard any datagrams containing illegal source addresses,
such as 0.0.0.0, which is used by a BOOTP client. A router will also generally
discard datagrams with the limited broadcast destination address. However, a
BOOTP relay agent will accept such datagrams from BOOTP clients on port 67.
The process carried out by a BOOTP relay agent on receiving a
BOOTPREQUEST is as follows:
1. When the BOOTP relay agent receives a BOOTPREQUEST, it first checks
the hops field to check the number of hops already completed in order to
decide whether to forward the request. The threshold for the allowable
number of hops is normally configurable.
2. If the relay agent decides to relay the request, it checks the contents of the
router IP address field. If this field is zero, it fills this field with the IP address
of the interface on which the BOOTPREQUEST was received. If this field
already has an IP address of another relay agent, it is not touched.
3. The value of the hops field is incremented.
4. The relay agent then forwards the BOOTPREQUEST to one or more BOOTP
servers. The address of the BOOTP server or servers is preconfigured at the
relay agent. The BOOTPREQUEST is normally forwarded as a unicast frame,
although some implementations use broadcast forwarding.
5. When the BOOTP server receives the BOOTPREQUEST with the non-zero
router IP address field, it sends an IP unicast BOOTREPLY to the BOOTP
relay agent at the address in this field on port 67.
6. When the BOOTP relay agent receives the BOOTREPLY, the H/W type,
length, and client hardware address fields in the message supply sufficient
link layer information to return the reply to the client. The relay agent checks
the broadcast flag. If this flag is set, the agent forwards the BOOTREPLY to
the client as a broadcast. If the broadcast flag is not set, the relay agent
sends a reply as a unicast to the address specified in your IP address.
When a router is configured as a BOOTP relay agent, the BOOTP forwarding
task is considerably different from the task of switching datagrams between
subnets normally carried out by a router. Forwarding of BOOTP messages can
be considered to be receiving BOOTP messages as a final destination, and then
generating new BOOTP messages to be forwarded to another destination.
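The relay-agent handling of a BOOTPREQUEST (steps 1 through 4 above) can be sketched in Python. The message is a plain dictionary here, and the hop threshold and server list stand in for the relay agent's configuration; all names are illustrative.

```python
# Sketch of BOOTP relay-agent request handling: drop on a suspected
# loop, fill in the router IP address field if it is still zero,
# increment hops, then forward to the preconfigured servers.
def relay_bootrequest(msg, receiving_if_ip, servers, max_hops=3):
    """Return (forward_targets, forwarded_msg), or (None, None) to drop."""
    if msg["hops"] >= max_hops:
        return None, None                       # hop threshold reached: drop
    if msg["router_ip"] == "0.0.0.0":           # first relay fills in its
        msg = dict(msg, router_ip=receiving_if_ip)  # receiving interface
    msg = dict(msg, hops=msg["hops"] + 1)       # count this relay hop
    return servers, msg                         # normally unicast to each
```

The server then sends its BOOTREPLY as a unicast to the address left in the router IP address field, on port 67, completing step 5.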
3.6.2 BOOTP considerations
The use of BOOTP allows centralized configuration of multiple clients. However,
it requires a static table to be maintained with an IP address preallocated for
every client that is likely to attach to the BOOTP server, even if the client is
seldom active. This means that there is no relief on the number of IP addresses
required. There is a measure of security in an environment using BOOTP,
because a client will only be allocated an IP address by the server if it has a valid
MAC address.
3.7 Dynamic Host Configuration Protocol (DHCP)
DHCP is a draft standard protocol. Its status is elective. The current DHCP
specifications are in RFC 2131 with updates in RFC 3396 and RFC 4361. The
specifications are also in RFC 2132 with updates in RFC 3442, RFC 3942, and
RFC 4361.
The Dynamic Host Configuration Protocol (DHCP) provides a framework for
passing configuration information to hosts on a TCP/IP network. DHCP is based
on the BOOTP protocol, adding the capability of automatic allocation of reusable
network addresses and additional configuration options. For information
regarding BOOTP, refer to 3.6, “Bootstrap Protocol (BOOTP)” on page 125.
DHCP messages use UDP port 67, the BOOTP server's well-known port and
UDP port 68, the BOOTP client's well-known port. DHCP participants can
interoperate with BOOTP participants. See 3.7.8, “BOOTP and DHCP
interoperability” on page 140 for further details.
DHCP consists of two components:
򐂰 A protocol that delivers host-specific configuration parameters from a DHCP
server to a host
򐂰 A mechanism for the allocation of temporary or permanent network
addresses to hosts
IP requires the setting of many parameters within the protocol implementation
software. Because IP can be used on many dissimilar kinds of network
hardware, values for those parameters cannot be guessed at or assumed to
have correct defaults. The use of a distributed address allocation scheme based
on a polling/defense mechanism, for discovery of network addresses already in
use, cannot guarantee unique network addresses because hosts might not
always be able to defend their network addresses.
DHCP supports three mechanisms for IP address allocation:
򐂰 Automatic allocation
DHCP assigns a permanent IP address to the host.
򐂰 Dynamic allocation
DHCP assigns an IP address for a limited period of time. Such a network
address is called a lease. This is the only mechanism that allows automatic
reuse of addresses that are no longer needed by the host to which they were
assigned.
򐂰 Manual allocation
The host's address is assigned by a network administrator.
3.7.1 The DHCP message format
The format of a DHCP message is shown in Figure 3-44.
transaction ID
flags field
client IP address
your IP address
server IP address
router IP address
client hardware address
(16 bytes)
server host name
(64 bytes)
boot file name
(128 bytes)
(312 bytes)
Figure 3-44 DHCP message format
Code
Indicates a request (1) or a reply (2).
H/W type
The type of hardware, for example, 1 for Ethernet or 6 for
IEEE 802 networks.
Refer to STD 2 – Assigned Internet Numbers for a
complete list.
Length
Hardware address length in bytes.
Hops
The client sets this to 0. It is incremented by a router
that relays the request to another server and is used to
identify loops. RFC 951 suggests that a value of 3
indicates a loop.
Transaction ID
A random number used to match this boot request with
the response it generates.
Seconds
Set by the client. It is the elapsed time in seconds
since the client started its boot process.
Flags field
The most significant bit of the flags field is used as a
broadcast flag. All other bits must be set to zero, and
are reserved for future use. Normally, DHCP servers
attempt to deliver DHCP messages directly to a client
using unicast delivery. The destination address in the
IP header is set to the DHCP your IP address and the
MAC address is set to the DHCP client hardware
address. If a host is unable to receive a unicast IP
datagram until it knows its IP address, this broadcast
bit must be set to indicate to the server that the DHCP
reply must be sent as an IP and MAC broadcast.
Otherwise, this bit must be set to zero.
Client IP address
Set by the client. Either its known IP address, or 0.0.0.0.
Your IP address
Set by the server if the client IP address field was 0.0.0.0.
Server IP address
Set by the server.
Router IP address
This is the address of a BOOTP relay agent, not a
general IP router to be used by the client. It is set by
the forwarding agent when BOOTP forwarding is used
(see 3.6.1, “BOOTP forwarding” on page 129).
Client hardware address
Set by the client. DHCP defines a client identifier
option that is used for client identification. If this option
is not used, the client is identified by its MAC address.
Server host name
Optional server host name terminated by X'00'.
Boot file name
The client either leaves this null or specifies a generic
name, such as router, indicating the type of boot file to
be used. In a DHCPDISCOVER request, this is set to
null. The server returns a fully qualified directory path
name in a DHCPOFFER request. The value is
terminated by X'00'.
Options
The first four bytes of the options field of the DHCP
message contain the magic cookie (99.130.83.99).
The remainder of the options field consists of tagged
parameters that are called options. See RFC 2132,
with updates in RFC 3942, for details.
3.7.2 DHCP message types
DHCP messages fall into one of the following categories:
򐂰 DHCPDISCOVER: Broadcast by a client to find available DHCP servers.
򐂰 DHCPOFFER: Response from a server to a DHCPDISCOVER, offering an IP
address and other configuration parameters.
򐂰 DHCPREQUEST: Message from a client to servers that does one of the
following:
– Requests the parameters offered by one of the servers and declines all
other offers.
– Verifies a previously allocated address after a system or network change
(a reboot for example).
– Requests the extension of a lease on a particular address.
򐂰 DHCPACK: Acknowledgement from server to client with parameters,
including IP address.
򐂰 DHCPNAK: Negative acknowledgement from server to client, indicating that
the client's lease has expired or that a requested IP address is incorrect.
򐂰 DHCPDECLINE: Message from client to server indicating that the offered
address is already in use.
򐂰 DHCPRELEASE: Message from client to server cancelling remainder of a
lease and relinquishing network address.
򐂰 DHCPINFORM: Message from a client that already has an IP address
(manually configured, for example), requesting further configuration
parameters from the DHCP server.
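When decoding packets, these message types appear as the value of DHCP option 53 (RFC 2132), so a lookup table is often convenient. The table name is illustrative.

```python
# DHCP message types as carried in option 53 (RFC 2132).
DHCP_MESSAGE_TYPES = {
    1: "DHCPDISCOVER",
    2: "DHCPOFFER",
    3: "DHCPREQUEST",
    4: "DHCPDECLINE",
    5: "DHCPACK",
    6: "DHCPNAK",
    7: "DHCPRELEASE",
    8: "DHCPINFORM",
}
```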
3.7.3 Allocating a new network address
This section describes the client/server interaction if the client does not know its
network address. Assume that the DHCP server has a block of network
addresses from which it can satisfy requests for new addresses. Each server
also maintains a database of allocated addresses and leases in permanent local
storage.
The DHCP client/server interaction steps are illustrated in Figure 3-45.
Figure 3-45 DHCP client and DHCP server interaction
The following procedure describes the DHCP client/server interaction steps
illustrated in Figure 3-45:
1. The client broadcasts a DHCPDISCOVER message on its local physical
subnet. At this point, the client is in the INIT state. The DHCPDISCOVER
message might include some options such as network address suggestion or
lease duration.
2. Each server responds with a DHCPOFFER message that includes an
available network address (your IP address) and other configuration options.
The servers record the address as offered to the client to prevent the same
address being offered to other clients in the event of further
DHCPDISCOVER messages being received before the first client has
completed its configuration.
3. The client receives one or more DHCPOFFER messages from one or more
servers. The client chooses one based on the configuration parameters
offered and broadcasts a DHCPREQUEST message that includes the server
identifier option to indicate which message it has selected and the requested
IP address option taken from your IP address in the selected offer.
4. In the event that no offers are received, if the client has knowledge of a
previous network address whose lease is still valid, the client can continue
to use that address until the lease expires.
5. The servers receive the DHCPREQUEST broadcast from the client. Those
servers not selected by the DHCPREQUEST message use the message as
notification that the client has declined that server's offer. The server selected
in the DHCPREQUEST message commits the binding for the client to
persistent storage and responds with a DHCPACK message containing the
configuration parameters for the requesting client. The combination of the
client hardware address and the assigned network address constitutes a unique
identifier for the client's lease and is used by both the client and server to
identify a lease referred to in any DHCP messages. The your IP address field in the
DHCPACK messages is filled in with the selected network address.
6. The client receives the DHCPACK message with configuration parameters.
The client performs a final check on the parameters, for example, with ARP
for allocated network address, and notes the duration of the lease and the
lease identification cookie specified in the DHCPACK message. At this point,
the client is configured.
7. If the client detects a problem with the parameters in the DHCPACK message
(the address is already in use in the network, for example), the client sends a
DHCPDECLINE message to the server and restarts the configuration
process. The client should wait a minimum of ten seconds before restarting
the configuration process to avoid excessive network traffic in case of
looping. On receipt of a DHCPDECLINE, the server must mark the offered
address as unavailable (and possibly inform the system administrator that
there is a configuration problem).
8. If the client receives a DHCPNAK message, the client restarts the
configuration process.
9. The client may choose to relinquish its lease on a network address by
sending a DHCPRELEASE message to the server. The client identifies the
lease to be released by including its network address and its hardware
address.
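The server side of the DISCOVER/OFFER/REQUEST/ACK exchange (steps 1 through 5 above) can be sketched as follows. This is a deliberately simplified illustration: a list stands in for the server's address pool and a dictionary for its persistent lease database, and all names are made up.

```python
# Sketch of a DHCP server's allocation state machine. An offered
# address is recorded so it is not offered to another client before
# the first client completes its configuration; a DHCPREQUEST for
# that address commits the binding and is answered with DHCPACK.
class DhcpServer:
    def __init__(self, pool):
        self.free = list(pool)    # addresses available for offers
        self.offered = {}         # client id -> offered address
        self.bound = {}           # client id -> committed lease

    def discover(self, client_id):
        """Handle a DHCPDISCOVER: reserve and offer an address."""
        addr = self.free.pop(0)
        self.offered[client_id] = addr
        return ("DHCPOFFER", addr)

    def request(self, client_id, addr):
        """Handle a DHCPREQUEST: commit the binding or refuse."""
        if self.offered.get(client_id) != addr:
            return ("DHCPNAK", None)
        self.bound[client_id] = self.offered.pop(client_id)
        return ("DHCPACK", addr)
```

A fuller model would also time out unclaimed offers and handle DHCPDECLINE by marking the address unavailable, as described in step 7.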
3.7.4 DHCP lease renewal process
This section describes the interaction between DHCP servers and clients that
have already been configured and the process that ensures lease expiration and
renewal.
The process involves the following steps:
1. When a server sends the DHCPACK to a client with IP address and
configuration parameters, it also registers the start of the lease time for that
address. This lease time is passed to the client as one of the options in the
DHCPACK message, together with two timer values, T1 and T2. The client is
rightfully entitled to use the given address for the duration of the lease time.
On applying the received configuration, the client also starts the timers T1
and T2. At this time, the client is in the BOUND state. Times T1 and T2 are
options configurable by the server, but T1 must be less than T2, and T2 must
be less than the lease time. According to RFC 2132, T1 defaults to (0.5 *
lease time) and T2 defaults to (0.875 * lease time).
2. When timer T1 expires, the client will send a DHCPREQUEST (unicast) to the
server that offered the address, asking to extend the lease for the given
configuration. The client is now in the RENEWING state. The server usually
responds with a DHCPACK message indicating the new lease time, and
timers T1 and T2 are reset at the client accordingly. The server also resets its
record of the lease time. In normal circumstances, an active client continually
renews its lease in this way indefinitely, without the lease ever expiring.
3. If no DHCPACK is received until timer T2 expires, the client enters the
REBINDING state. It now broadcasts a DHCPREQUEST message to extend
its lease. This request can be confirmed by a DHCPACK message from any
DHCP server in the network.
4. If the client does not receive a DHCPACK message after its lease has
expired, it has to stop using its current TCP/IP configuration. The client can
then return to the INIT state, issuing a DHCPDISCOVER broadcast to try and
obtain any valid address.
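The timer arithmetic and state progression described in these steps can be sketched as follows. This is an illustrative model, not code from any DHCP implementation; the function names are our own, and the RFC 2132 defaults for T1 and T2 are applied when the server does not configure them:

```python
# Sketch of the DHCP renewal timers and client states described above.
# Illustrative only; names are ours. RFC 2132 defaults: T1 = 0.5 * lease,
# T2 = 0.875 * lease, with T1 < T2 < lease time required.

def renewal_timers(lease_seconds, t1=None, t2=None):
    """Return (T1, T2) for a lease, applying the RFC 2132 defaults."""
    t1 = 0.5 * lease_seconds if t1 is None else t1
    t2 = 0.875 * lease_seconds if t2 is None else t2
    if not (t1 < t2 < lease_seconds):
        raise ValueError("require T1 < T2 < lease time")
    return t1, t2

def client_state(elapsed, lease_seconds):
    """Map elapsed time since the DHCPACK to the client state in the steps above."""
    t1, t2 = renewal_timers(lease_seconds)
    if elapsed < t1:
        return "BOUND"        # step 1: using the address normally
    if elapsed < t2:
        return "RENEWING"     # step 2: unicast DHCPREQUEST to the original server
    if elapsed < lease_seconds:
        return "REBINDING"    # step 3: broadcast DHCPREQUEST to any server
    return "INIT"             # step 4: lease expired, restart with DHCPDISCOVER
```

For a one-hour lease, for example, the client enters RENEWING after 30 minutes and REBINDING after 52.5 minutes, matching the T1 and T2 defaults.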
Figure 3-46 shows the DHCP process and the changing client state during that
process.
Figure 3-46 DHCP client state and DHCP process
3.7.5 Reusing a previously allocated network address
If the client remembers and wants to reuse a previously allocated network
address, the following steps are carried out:
1. The client broadcasts a DHCPREQUEST message on its local subnet. The
DHCPREQUEST message includes the client's network address.
2. A server with knowledge of the client's configuration parameters responds
with a DHCPACK message to the client (provided the lease is still current),
renewing the lease at the same time.
3. If the client's lease has expired, the server with knowledge of the client
responds with DHCPNACK.
4. The client receives the DHCPACK message with configuration parameters.
The client performs a final check on the parameters and notes the duration of
the lease and the lease identification cookie specified in the DHCPACK
message. At this point, the client is configured and its T1 and T2 timers are
reset.
5. If the client detects a problem with the parameters in the DHCPACK
message, the client sends a DHCPDECLINE message to the server and
restarts the configuration process by requesting a new network address. If the
client receives a DHCPNAK message, it cannot reuse its remembered
network address. It must instead request a new address by restarting the
configuration process as described in 3.7.3, “Allocating a new network
address” on page 134.
For further information, refer to the previously mentioned RFCs.
3.7.6 Configuration parameters repository
DHCP provides persistent storage of network parameters for network clients. A
DHCP server stores a key-value entry for each client, the key being some unique
identifier, for example, an IP subnet number and a unique identifier within the
subnet (normally a hardware address), and the value contains the configuration
parameters last allocated to this particular client.
One effect of this is that a DHCP client will tend always to be allocated the
same IP address by the server, provided the pool of addresses is not
over-subscribed and the previous address has not already been allocated to
another client.
3.7.7 DHCP considerations
DHCP dynamic allocation of IP addresses and configuration parameters relieves
the network administrator of a great deal of manual configuration work. The
ability for a device to be moved from network to network and to automatically
obtain valid configuration parameters for the current network can be of great
benefit to mobile users. Also, because IP addresses are allocated only when
clients are actually active, reasonably short lease times (combined with the
fact that mobile clients each need only one address) make it possible to
reduce the total number of addresses in use in an organization.
However, consider the following points when DHCP is implemented:
򐂰 DHCP is built on UDP, which is inherently insecure. In normal operation, an
unauthorized client can connect to a network and obtain a valid IP address
and configuration. To prevent this, it is possible to preallocate IP addresses to
particular MAC addresses (similar to BOOTP), but this increases the
administration workload and removes the benefit of recycling of addresses.
Unauthorized DHCP servers can also be set up, sending false and potentially
disruptive information to clients.
򐂰 In a DHCP environment where automatic or dynamic address allocation is
used, it is generally not possible to predetermine the IP address of a client at
any particular point in time. In this case, if static DNS servers are also used,
the DNS servers will not likely contain valid host name to IP address
mappings for the clients. If having client entries in the DNS is important for the
network, you can use DHCP to manually assign IP addresses to those clients
and then administer the client mappings in the DNS accordingly.
3.7.8 BOOTP and DHCP interoperability
The format of DHCP messages is based on the format of BOOTP messages,
which enables BOOTP and DHCP clients to interoperate in certain
circumstances. Every DHCP message contains a DHCP message type (53)
option. Any message without this option is assumed to be from a BOOTP client.
Support for BOOTP clients at a DHCP server must be configured by a system
administrator, if required. The DHCP server responds to BOOTPREQUEST
messages with BOOTPREPLY, rather than DHCPOFFER. Any DHCP server
that is not configured in this way will discard any BOOTPREQUEST frames sent
to it. A DHCP server can offer static addresses, or automatic addresses (from its
pool of unassigned addresses), to a BOOTP client (although not all BOOTP
implementations will understand automatic addresses). If an automatic address
is offered to a BOOTP client, that address must have an infinite lease time,
because the client will not understand the DHCP lease mechanism.
DHCP messages can be forwarded by routers configured as BOOTP relay
agents.
3.8 RFCs relevant to this chapter
The following RFCs provide detailed information about the connection protocols
and architectures presented throughout this chapter:
򐂰 RFC 791 – Internet Protocol (September 1981)
򐂰 RFC 792 – Internet Control Message Protocol (September 1981)
򐂰 RFC 826 – Ethernet Address Resolution Protocol: Or converting network
protocol addresses to 48.bit Ethernet address for transmission on Ethernet
hardware (November 1982)
򐂰 RFC 903 – A Reverse Address Resolution Protocol (June 1984)
򐂰 RFC 906 – Bootstrap loading using TFTP (June 1984)
򐂰 RFC 919 – Broadcasting Internet Datagrams (October 1984)
򐂰 RFC 922 – Broadcasting Internet datagrams in the presence of subnets
(October 1984)
򐂰 RFC 925 – Multi-LAN address resolution (October 1984)
򐂰 RFC 950 – Internet Standard Subnetting Procedure (August 1985)
򐂰 RFC 951 – Bootstrap Protocol (September 1985)
򐂰 RFC 1027 – Using ARP to implement transparent subnet gateways
(October 1987)
򐂰 RFC 1112 – Host extensions for IP multicasting (August 1989)
򐂰 RFC 1122 – Requirements for Internet Hosts – Communication Layers
(October 1989)
򐂰 RFC 1166 – Internet numbers (July 1990)
򐂰 RFC 1191 – Path MTU discovery (November 1990)
򐂰 RFC 1256 – ICMP Router Discovery Messages (September 1991)
򐂰 RFC 1349 – Type of Service in the Internet Protocol Suite (July 1992)
򐂰 RFC 1393 – Traceroute Using an IP Option (January 1993)
򐂰 RFC 1466 – Guidelines for Management of IP Address Space (May 1993)
򐂰 RFC 1518 – An Architecture for IP Address Allocation with CIDR
(September 1993)
򐂰 RFC 1519 – Classless Inter-Domain Routing (CIDR): an Address Assignment
and Aggregation Strategy (September 1993)
򐂰 RFC 1520 – Exchanging Routing Information Across Provider Boundaries in
the CIDR Environment (September 1993)
򐂰 RFC 1542 – Clarifications and Extensions for the Bootstrap Protocol
(October 1993)
򐂰 RFC 1788 – ICMP Domain Name Messages (April 1995)
򐂰 RFC 1812 – Requirements for IP Version 4 Routers (June 1995)
򐂰 RFC 1918 – Address Allocation for Private Internets (February 1996)
򐂰 RFC 2050 – Internet Registry IP Allocation Guidelines (November 1996)
򐂰 RFC 2131 – Dynamic Host Configuration Protocol (March 1997)
򐂰 RFC 2132 – DHCP Options and BOOTP Vendor Extensions (March 1997)
򐂰 RFC 2236 – Internet Group Management Protocol, Version 2
(November 1997)
Chapter 3. Internetworking protocols
򐂰 RFC 2474 – Definition of the Differentiated Services Field (DS Field) in the
IPv4 and IPv6 Headers (December 1998)
򐂰 RFC 2644 – Changing the Default for Directed Broadcasts in Routers
(August 1999)
򐂰 RFC 2663 – IP Network Address Translator (NAT) Terminology and
Considerations (August 1999)
򐂰 RFC 3022 – Traditional IP Network Address Translator (Traditional NAT)
(January 2001)
򐂰 RFC 3168 – The Addition of Explicit Congestion Notification (ECN) to IP
(September 2001)
򐂰 RFC 3260 – New Terminology and Clarifications for Diffserv (April 2002)
򐂰 RFC 3330 – Special-Use IPv4 Addresses (September 2002)
򐂰 RFC 3396 – Encoding Long Options in the Dynamic Host Configuration
Protocol (DHCPv4) (November 2002)
򐂰 RFC 3442 – The Classless Static Route Option for Dynamic Host
Configuration Protocol (DHCP) version 4 (December 2002)
򐂰 RFC 3942 – Reclassifying Dynamic Host Configuration Protocol version 4
(DHCPv4) Options (November 2004)
򐂰 RFC 4361 – Node-specific Client Identifiers for Dynamic Host Configuration
Protocol Version Four (DHCPv4) (February 2006)
򐂰 RFC 4379 – Detecting Multi-Protocol Label Switched (MPLS) Data Plane
Failures (February 2006)
Chapter 4.
Transport layer protocols
This chapter provides an overview of the most important and commonly used
protocols of the TCP/IP transport layer. These include:
򐂰 User Datagram Protocol (UDP)
򐂰 Transmission Control Protocol (TCP)
By building on the functionality provided by the Internet Protocol (IP), the
transport protocols deliver data to applications executing in the internet. This is
done by making use of ports, as described in 4.1, “Ports and sockets” on
page 144. The transport protocols can provide additional functionality such as
congestion control, reliable data delivery, duplicate data suppression, and flow
control as is done by TCP.
4.1 Ports and sockets
This section introduces the concepts of the port and socket, which are needed to
determine which local process at a given host actually communicates with which
process, at which remote host, using which protocol. If this sounds confusing,
consider the following points:
򐂰 An application process is assigned a process identifier number (process ID),
which is likely to be different each time that process is started.
򐂰 Process IDs differ between operating system platforms, so they are not
uniform.
򐂰 A server process can have multiple connections to multiple clients at a time,
thus simple connection identifiers are not unique.
The concept of ports and sockets provides a way to uniformly and uniquely
identify connections and the programs and hosts that are engaged in them,
irrespective of specific process IDs.
4.1.1 Ports
Each process that wants to communicate with another process identifies itself to
the TCP/IP protocol suite by one or more ports. A port is a 16-bit number used by
the host-to-host protocol to identify to which higher-level protocol or application
program (process) it must deliver incoming messages. There are two types of
ports:
򐂰 Well-known: Well-known ports belong to standard servers, for example,
Telnet uses port 23. Well-known port numbers range between 1 and 1023
(prior to 1992, the range between 256 and 1023 was used for UNIX-specific
servers). Well-known port numbers are typically odd, because early systems
using the port concept required an odd/even pair of ports for duplex
operations. Most servers require only a single port. Exceptions are the
BOOTP server, which uses two: 67 and 68 (see 3.6, “Bootstrap Protocol
(BOOTP)” on page 125) and the FTP server, which uses two: 20 and 21 (see
14.1, “File Transfer Protocol (FTP)” on page 514).
The well-known ports are controlled and assigned by the Internet Assigned
Number Authority (IANA) and on most systems can only be used by system
processes or by programs executed by privileged users. Well-known ports
allow clients to find servers without configuration information. The well-known
port numbers are defined in STD 2 – Assigned Internet Numbers.
򐂰 Ephemeral: Some clients do not need well-known port numbers because they
initiate communication with servers, and the port number they are using is
contained in the UDP/TCP datagrams sent to the server. Each client process
is allocated a port number, for as long as it needs, by the host on which it is
running. Ephemeral port numbers have values greater than 1023, normally in
the range of 1024 to 65535.
Ephemeral ports are not controlled by IANA and can be used by ordinary
user-developed programs on most systems.
Confusion, due to two different applications trying to use the same port numbers
on one host, is avoided by writing those applications to request an available port
from TCP/IP. Because this port number is dynamically assigned, it can differ
from one invocation of an application to the next.
UDP, TCP, and ISO TP-4 all use the same port principle. To the best possible
extent, the same port numbers are used for the same services on top of UDP,
TCP, and ISO TP-4.
Note: Normally, a server will use either TCP or UDP, but there are exceptions.
For example, domain name servers (see 12.1, “Domain Name System (DNS)”
on page 426) use both UDP port 53 and TCP port 53.
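The two port ranges described above can be captured in a small helper. This is an illustrative sketch, not part of any standard API; the function name is ours:

```python
# Illustrative classifier (names are ours) for the port ranges described above:
# well-known ports are 1-1023; ephemeral ports are 1024-65535.

def classify_port(port):
    if not 0 < port <= 65535:
        raise ValueError("port must be a nonzero 16-bit number")
    return "well-known" if port <= 1023 else "ephemeral"
```

For example, Telnet's port 23 and DNS's port 53 classify as well-known, while a typical client source port such as 49152 is ephemeral.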
4.1.2 Sockets
The socket interface is one of several application programming interfaces to the
communication protocols (see 11.2, “Application programming interfaces (APIs)”
on page 410). Designed to be a generic communication programming interface,
the socket API was first introduced with 4.2BSD (Berkeley Software Distribution).
Although it has never been formally standardized, the Berkeley socket API has
become a de facto industry standard abstraction for TCP/IP socket
implementations.
Consider the following terminology:
򐂰 A socket is a special type of file handle, which is used by a process to request
network services from the operating system.
򐂰 A socket address is the triple:
<protocol, local-address, local-port>
For example, in the TCP/IP (version 4) suite:
<tcp,, 8080>
򐂰 A conversation is the communication link between two processes.
Chapter 4. Transport layer protocols
򐂰 An association is the 5-tuple that completely specifies the two processes that
comprise a connection:
<protocol, local-address, local-port, foreign-address, foreign-port>
In the TCP/IP (version 4) suite, the following could be a valid association:
<tcp,, 1500,, 22>
򐂰 A half-association is either one of the following, which each specify half of a
connection:
<protocol, local-address, local-process>
<protocol, foreign-address, foreign-process>
The half-association is also called a socket or a transport address. That is, a
socket is an endpoint for communication that can be named and addressed in
a network.
Two processes communicate through TCP sockets. The socket model provides
a process with a full-duplex byte stream connection to another process. The
application need not concern itself with the management of this stream; these
facilities are provided by TCP.
TCP uses the same port principle as UDP to provide multiplexing. Like UDP,
TCP uses well-known and ephemeral ports. Each side of a TCP connection has
a socket that can be identified by the triple <TCP, IP address, port number>. If
two processes are communicating over TCP, they have a logical connection that
is uniquely identifiable by the two sockets involved, that is, by the combination
<TCP, local IP address, local port, remote IP address, remote port>. Server
processes are able to manage multiple conversations through a single port.
Refer to 11.2.1, “The socket API” on page 410 for more information about socket
APIs.
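The 5-tuple association described above can be modeled directly. The following sketch is illustrative (the type and function names are ours, and the addresses are arbitrary examples); it shows how one association splits into its two half-associations, that is, its two sockets:

```python
from collections import namedtuple

# Illustrative model (names are ours) of the association and half-association
# tuples defined above.
Association = namedtuple(
    "Association",
    ["protocol", "local_address", "local_port", "foreign_address", "foreign_port"],
)

def half_associations(assoc):
    """Split a full association into its local and foreign halves (sockets)."""
    local = (assoc.protocol, assoc.local_address, assoc.local_port)
    foreign = (assoc.protocol, assoc.foreign_address, assoc.foreign_port)
    return local, foreign

# Example association: a local process on port 1500 connected to SSH (port 22).
conn = Association("tcp", "", 1500, "", 22)
local_socket, foreign_socket = half_associations(conn)
```

Each half-association names one endpoint of the connection; together the two sockets uniquely identify it.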
4.2 User Datagram Protocol (UDP)
UDP is a standard protocol with STD number 6. UDP is described by RFC 768 –
User Datagram Protocol. Its status is standard, and almost every TCP/IP
implementation intended for transferring small units of data, or for applications
that can tolerate some data loss (such as multimedia streaming), includes UDP.
UDP is basically an application interface to IP. It adds no reliability, flow-control,
or error recovery to IP. It simply serves as a multiplexer/demultiplexer for sending
and receiving datagrams, using ports to direct the datagrams, as shown in
Figure 4-1. For a more detailed discussion of ports, refer to 4.1, “Ports and
sockets” on page 144.
Figure 4-1 UDP: Demultiplexing based on ports
UDP provides a mechanism for one application to send a datagram to another.
The UDP layer can be regarded as being extremely thin and is, consequently,
very efficient, but it requires the application to take responsibility for error
recovery and so on.
Applications sending datagrams to a host need to identify a target that is more
specific than the IP address, because datagrams are normally directed to certain
processes and not to the system as a whole. UDP provides this by using ports.
We discuss the port concept in 4.1, “Ports and sockets” on page 144.
4.2.1 UDP datagram format
Each UDP datagram is sent within a single IP datagram. Although the IP
datagram might be fragmented during transmission, the receiving IP
implementation will reassemble it before presenting it to the UDP layer. All IP
implementations are required to accept datagrams of 576 bytes, which means
that, allowing for a maximum-size IP header of 60 bytes, a UDP datagram of 516
bytes is acceptable to all implementations. Many implementations will accept
larger datagrams, but this is not guaranteed.
Chapter 4. Transport layer protocols
The UDP datagram has an 8-byte header, as described in Figure 4-2 on
page 148.
Figure 4-2 UDP: Datagram format
Source Port
Indicates the port of the sending process. It is the port to
which replies are addressed.
Destination Port
Specifies the port of the destination process on the
destination host.
Length
The length (in bytes) of this user datagram, including the
header.
Checksum
An optional 16-bit one's complement of the one's
complement sum of a pseudo-IP header, the UDP
header, and the UDP data. In Figure 4-3, we see a
pseudo-IP header. It contains the source and destination
IP addresses, the protocol, and the UDP length.
Figure 4-3 UDP: Pseudo-IP header
The pseudo-IP header effectively extends the checksum to include the
original (unfragmented) IP datagram.
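The checksum computation just described can be illustrated concretely. The following sketch is ours, not code from any implementation; it builds the pseudo-IP header (source IP, destination IP, a zero byte, the protocol number 17 for UDP, and the UDP length), computes the one's complement checksum, and fills it into a small example segment:

```python
import struct

# Illustrative sketch (ours) of the UDP checksum: a 16-bit one's complement of
# the one's complement sum over the pseudo-IP header plus the UDP header
# and data. The checksum field is zero while the sum is computed.

def ones_complement_sum(data):
    if len(data) % 2:                      # pad odd-length data with a zero byte
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total > 0xFFFF:                  # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def udp_checksum(src_ip, dst_ip, udp_segment):
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    checksum = 0xFFFF & ~ones_complement_sum(pseudo + udp_segment)
    return checksum or 0xFFFF              # RFC 768: an all-zero result is sent as 0xFFFF

# Build a small segment with a zero checksum field, then fill the checksum in.
src = bytes([192, 0, 2, 1])                                # example addresses
dst = bytes([192, 0, 2, 2])
header = struct.pack("!HHHH", 1024, 53, 12, 0)             # ports, length 12, checksum 0
segment = header + b"data"
filled = segment[:6] + struct.pack("!H", udp_checksum(src, dst, segment)) + segment[8:]
```

A receiver validates the datagram by summing the pseudo-header plus the received segment, checksum included; a correct datagram sums to 0xFFFF.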
4.2.2 UDP application programming interface
The application interface offered by UDP is described in RFC 768. It provides for:
򐂰 The creation of new receive ports
򐂰 The receive operation that returns the data bytes and an indication of source
port and source IP address
򐂰 The send operation that has, as parameters, the data, source, and destination
ports and addresses
The way this interface is implemented is left to the discretion of each vendor.
Be aware that UDP and IP do not provide guaranteed delivery, flow-control, or
error recovery, so these must be provided by the application.
Standard applications using UDP include:
򐂰 Trivial File Transfer Protocol (see 14.2, “Trivial File Transfer Protocol (TFTP)”
on page 529).
򐂰 Domain Name System name server (see 12.1, “Domain Name System (DNS)”
on page 426).
򐂰 Remote Procedure Call, used by the Network File System (see both 11.2.2,
“Remote Procedure Call (RPC)” on page 415 and 14.4, “Network File System
(NFS)” on page 538).
򐂰 Simple Network Management Protocol (see 17.1, “The Simple Network
Management Protocol (SNMP)” on page 624).
򐂰 Lightweight Directory Access Protocol (see 12.4, “Lightweight Directory
Access Protocol (LDAP)” on page 459).
4.3 Transmission Control Protocol (TCP)
TCP is a standard protocol with STD number 7. TCP is described by RFC 793 –
Transmission Control Protocol. Its status is standard, and in practice, every
TCP/IP implementation that is not used exclusively for routing will include TCP.
TCP provides considerably more facilities for applications than UDP. Specifically,
this includes error recovery, flow control, and reliability. TCP is a
connection-oriented protocol, unlike UDP, which is connectionless. Most of the
user application protocols, such as Telnet and FTP, use TCP. The two processes
communicate with each other over a TCP connection (InterProcess
Communication, or IPC), as shown in Figure 4-4. In the figure, processes 1 and 2
communicate over a TCP connection carried by IP datagrams. See 4.1, “Ports
and sockets” on page 144 for more details about ports and sockets.
Figure 4-4 TCP: Connection between processes
4.3.1 TCP concept
As noted earlier, the primary purpose of TCP is to provide a reliable logical circuit
or connection service between pairs of processes. It does not assume reliability
from the lower-level protocols (such as IP), so TCP must guarantee this itself.
TCP can be characterized by the following facilities it provides for the
applications using it:
򐂰 Stream data transfer: From the application's viewpoint, TCP transfers a
contiguous stream of bytes through the network. The application does not
have to bother with chopping the data into basic blocks or datagrams. TCP
does this by grouping the bytes into TCP segments, which are passed to the
IP layer for transmission to the destination. Also, TCP itself decides how to
segment the data, and it can forward the data at its own convenience.
Sometimes, an application needs to be sure that all the data passed to TCP
has actually been transmitted to the destination. For that reason, a push
function is defined. It will push all remaining TCP segments still in storage to
the destination host. The normal close connection function also pushes the
data to the destination.
򐂰 Reliability: TCP assigns a sequence number to each byte transmitted, and
expects a positive acknowledgment (ACK) from the receiving TCP layer. If
the ACK is not received within a timeout interval, the data is retransmitted.
Because the data is transmitted in blocks (TCP segments), only the sequence
number of the first data byte in the segment is sent to the destination host.
The receiving TCP uses the sequence numbers to rearrange the segments
when they arrive out of order, and to eliminate duplicate segments.
򐂰 Flow control: The receiving TCP, when sending an ACK back to the sender,
also indicates to the sender the number of bytes it can receive (beyond the
last received TCP segment) without causing overrun and overflow in its
internal buffers. This is sent in the ACK in the form of the highest sequence
number it can receive without problems. This mechanism is also referred to
as a window-mechanism, and we discuss it in more detail later in this chapter.
򐂰 Multiplexing: Achieved through the use of ports, just as with UDP.
򐂰 Logical connections: The reliability and flow control mechanisms described
here require that TCP initializes and maintains certain status information for
each data stream. The combination of this status, including sockets,
sequence numbers, and window sizes, is called a logical connection. Each
connection is uniquely identified by the pair of sockets used by the sending
and receiving processes.
򐂰 Full duplex: TCP provides for concurrent data streams in both directions.
The window principle
A simple transport protocol might use the following principle: send a packet and
then wait for an acknowledgment from the receiver before sending the next
packet. If the ACK is not received within a certain amount of time, retransmit the
packet. See Figure 4-5 for more details.
Figure 4-5 TCP: The window principle
Although this mechanism ensures reliability, it only uses a part of the available
network bandwidth.
Now, consider a protocol where the sender groups its packets to be transmitted,
as in Figure 4-6, and uses the following rules:
򐂰 The sender can send all packets within the window without receiving an ACK,
but must start a timeout timer for each of them.
򐂰 The receiver must acknowledge each packet received, indicating the
sequence number of the last well-received packet.
򐂰 The sender slides the window on each ACK received.
Figure 4-6 TCP: Message packets
As shown in Figure 4-7, the sender can transmit packets 1 to 5 without waiting for
any acknowledgment.
Figure 4-7 TCP: Window principle
As shown in Figure 4-8, at the moment the sender receives ACK 1
(acknowledgment for packet 1), it can slide its window one packet to the right.
Figure 4-8 TCP: Message packets
At this point, the sender can also transmit packet 6.
Imagine some special cases:
򐂰 Packet 2 gets lost: The sender will not receive ACK 2, so its window will
remain in position 1 (as in Figure 4-8 on page 153). In fact, because the
receiver did not receive packet 2, it will acknowledge packets 3, 4, and 5 with
an ACK 1, because packet 1 was the last one received in sequence. At the
sender's side, eventually a timeout will occur for packet 2 and it will be
retransmitted. Note that reception of this packet by the receiver will generate
ACK 5, because it has now successfully received all packets 1 to 5, and the
sender's window will slide four positions upon receiving this ACK 5.
򐂰 Packet 2 did arrive, but the acknowledgment gets lost: The sender does not
receive ACK 2, but will receive ACK 3. ACK 3 is an acknowledgment for all
packets up to 3 (including packet 2) and the sender can now slide its window
to packet 4.
This window mechanism ensures:
򐂰 Reliable transmission.
򐂰 Better use of the network bandwidth (better throughput).
򐂰 Flow control, because the receiver can delay replying to a packet with an
acknowledgment, based on its free buffer space and the window size of the
communication.
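The lost-packet scenario described above can be simulated with a short sketch. This is illustrative code of ours, not from any TCP implementation; it shows how cumulative acknowledgments cause later packets to produce duplicate ACKs until the gap is filled:

```python
# Illustrative simulation (ours) of the cumulative-ACK behavior described
# above: the receiver always acknowledges the last packet received in
# sequence, so a gap makes every later packet generate the same ACK.

class Receiver:
    def __init__(self):
        self.received = set()
        self.highest_in_order = 0

    def accept(self, seq):
        """Receive packet `seq` and return the cumulative ACK number."""
        self.received.add(seq)
        while self.highest_in_order + 1 in self.received:
            self.highest_in_order += 1
        return self.highest_in_order

rx = Receiver()
acks = [rx.accept(seq) for seq in [1, 3, 4, 5]]   # packet 2 is lost
# Packets 3, 4, and 5 are each acknowledged with ACK 1, because packet 1
# was the last one received in sequence.
retransmit_ack = rx.accept(2)                      # timeout: packet 2 retransmitted
# Now all of packets 1 to 5 are present, so the receiver jumps to ACK 5.
```

This mirrors the first special case in the text: the retransmission of packet 2 generates ACK 5, and the sender's window slides four positions at once.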
The window principle applied to TCP
The previously discussed window principle is used in TCP, but with a few
differences:
򐂰 Because TCP provides a byte-stream connection, sequence numbers are
assigned to each byte in the stream. TCP divides this contiguous byte stream
into TCP segments to transmit them. The window principle is used at the byte
level, that is, the segments sent and ACKs received will carry byte-sequence
numbers and the window size is expressed as a number of bytes, rather than
a number of packets.
򐂰 The window size is determined by the receiver when the connection is
established and is variable during the data transfer. Each ACK message will
include the window size that the receiver is ready to deal with at that particular
moment.
The sender's data stream, shown in Figure 4-9, can now be seen as four
categories of bytes:
Figure 4-9 TCP: Window principle applied to TCP
򐂰 Bytes that are transmitted and have been acknowledged
򐂰 Bytes that are sent but not yet acknowledged
򐂰 Bytes that can be sent without waiting for any acknowledgment
򐂰 Bytes that cannot be sent yet
Remember that TCP will block bytes into segments, and a TCP segment only
carries the sequence number of the first byte in the segment.
TCP segment format
Figure 4-10 shows the TCP segment format.
Figure 4-10 TCP: Segment format
Source Port
The 16-bit source port number, used by the receiver to
reply.
Destination Port
The 16-bit destination port number.
Sequence Number
The sequence number of the first data byte in this
segment. If the SYN control bit is set, the sequence
number is the initial sequence number (n) and the first
data byte is n+1.
Acknowledgment Number
If the ACK control bit is set, this field contains the value of
the next sequence number that the receiver is expecting
to receive.
Data Offset
The number of 32-bit words in the TCP header. It
indicates where the data begins.
Reserved
Six bits reserved for future use; must be zero.
URG
Indicates that the urgent pointer field is significant in this
segment.
ACK
Indicates that the acknowledgment field is significant in
this segment.
PSH
Push function.
RST
Resets the connection.
SYN
Synchronizes the sequence numbers.
FIN
No more data from sender.
Window
Used in ACK segments. It specifies the number of data
bytes, beginning with the one indicated in the
acknowledgment number field, that the receiver (the
sender of this segment) is willing to accept.
Checksum
The 16-bit one's complement of the one's complement
sum of all 16-bit words in a pseudo-header, the TCP
header, and the TCP data. While computing the
checksum, the checksum field itself is considered zero.
The pseudo-header is the same as that used by UDP for
calculating the checksum. It is a pseudo-IP-header, only
used for the checksum calculation, with the format shown
in Figure 4-11.
Figure 4-11 TCP: Pseudo-IP header
Urgent Pointer
Points to the first data octet following the urgent data.
Only significant when the URG control bit is set.
Options
Just as in the case of IP datagram options, options can be
either:
– A single byte containing the option number
– A variable length option in the following format as shown in Figure 4-12
option data...
Figure 4-12 TCP: IP datagram option, variable length option
There are currently seven options defined, as shown in Table 4-1.
Table 4-1 TCP: IP datagram options
Kind  Meaning
0     End of option list
1     No operation
2     Maximum segment size
3     Window scale
4     SACK-permitted
5     SACK
8     Time stamps
Maximum segment size option
This option is only used during the establishment of
the connection (SYN control bit set) and is sent from
the side that is to receive data to indicate the
maximum segment length it can handle. If this option
is not used, any segment size is allowed. See
Figure 4-13 for more details.
Figure 4-13 TCP: Maximum segment size
Window scale option
This option is not mandatory. Both sides must send
the Window scale option in their SYN segments to
enable window scaling in their direction. The Window
scale option expands the definition of the TCP window
to 32 bits: the receiver rebuilds the 32-bit window size
from the standard 16-bit window size in the header
and the scale factor carried in the SYN segment. The
option is negotiated during the handshake; there is no
way to change it after the connection has been
established. See Figure 4-14 for more details.
Figure 4-14 TCP: Window scale option
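The scaling arithmetic is a simple left shift. The following sketch is illustrative (the function name is ours); RFC 1323, which defines the option, caps the shift count at 14:

```python
# Illustrative sketch (ours) of TCP window scaling: the advertised 16-bit
# window is shifted left by the scale factor agreed during the handshake.
# RFC 1323 caps the shift count at 14.

def effective_window(advertised_window, scale_factor):
    if not 0 <= advertised_window <= 0xFFFF:
        raise ValueError("advertised window is a 16-bit field")
    if not 0 <= scale_factor <= 14:
        raise ValueError("scale factor must be between 0 and 14")
    return advertised_window << scale_factor
```

With the maximum shift of 14, the 16-bit field can advertise a window of just over one gigabyte instead of 64 KB.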
SACK-permitted option
This option is set when selective acknowledgment is
used in that TCP connection. See Figure 4-15 for
more details.
SACK option
Selective Acknowledgment (SACK) allows the receiver
to inform the sender exactly which segments have been
received successfully, so the sender needs to retransmit
only the segments that were actually lost. However,
each lost range adds to the size of the option, so the
number of blocks that can be reported by the SACK
option is limited to four. Within that limit, the SACK
option should report the most recently received data.
See Figure 4-16 for more details.
Left Edge of 1st Block
Right Edge of 1st Block
Left Edge of Nth Block
Right Edge of Nth Block
Figure 4-16 TCP: SACK option
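The left and right edges above delimit contiguous blocks of out-of-order data. A sketch of how a receiver might coalesce received byte ranges into at most four SACK blocks (function name and ranges illustrative):

```python
def sack_blocks(ranges, max_blocks=4):
    """Merge received out-of-order byte ranges (left edge, right edge)
    into contiguous SACK blocks; at most max_blocks fit in the option."""
    blocks = []
    for left, right in sorted(ranges):
        if blocks and left <= blocks[-1][1]:
            # Overlaps or touches the previous block: extend it.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], right))
        else:
            blocks.append((left, right))
    return blocks[:max_blocks]

# Bytes 2000-2499 and 3000-3499 arrived while earlier data is still missing:
print(sack_blocks([(3000, 3500), (2000, 2500)]))  # [(2000, 2500), (3000, 3500)]
```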
Timestamps option
The timestamps option sends a time stamp value that
indicates the current value of the time stamp clock of
the TCP sending the option. The Timestamp Echo
Value can only be used if the ACK bit is set in the TCP
header. See Figure 4-17 for more details.
TS Value
TS Echo Reply
Figure 4-17 TCP: Timestamps option
Padding
All-zero bytes are used to fill up the TCP header to a total
length that is a multiple of 32 bits.
Acknowledgments and retransmissions
TCP sends data in variable length segments. Sequence numbers are based on a
byte count. Acknowledgments specify the sequence number of the next byte that
the receiver expects to receive.
Consider that a segment gets lost or corrupted. In this case, the receiver will
acknowledge all further well-received segments with an acknowledgment
referring to the first byte of the missing packet. The sender will stop transmitting
when it has sent all the bytes in the window. Eventually, a timeout will occur and
the missing segment will be retransmitted.
Figure 4-18 illustrates an example in which a window size of 1500 bytes and
segments of 500 bytes are used.
The sender transmits Segment 1 (seq. 1000), and the receiver answers with
ACK 1500. Segment 2 (seq. 1500) gets lost. On receiving ACK 1500, the sender
slides the window and transmits Segment 3 (seq. 2000) and Segment 4
(seq. 2500); the window size is now reached, so the sender waits for an ACK.
The receiver gets one of these frames and replies with ACK 1500 (it is still
expecting byte 1500). The sender receives this ACK 1500, which does not slide
the window. Finally, the timeout for Segment 2 expires, and the missing
segment is retransmitted.
Figure 4-18 TCP: Acknowledgment and retransmission process
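The receiver's behavior in the example follows from cumulative acknowledgments: the ACK always names the next byte expected in order. A minimal sketch, assuming the 500-byte segments of the figure (function name illustrative):

```python
def next_ack(received_seqs, start_seq=1000, seg_size=500):
    """Return the cumulative ACK value: the first in-order byte not yet
    received, given the starting sequence numbers of arrived segments."""
    expected = start_seq
    while expected in received_seqs:
        expected += seg_size
    return expected

# Segments 1, 3, and 4 arrived; segment 2 (seq. 1500) was lost:
print(next_ack({1000, 2000, 2500}))  # 1500 -- the receiver still expects byte 1500
```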
A problem now arises: the sender knows that segment 2 is lost or
corrupted, but knows nothing about segments 3 and 4. The sender
should at least retransmit segment 2, but it could also retransmit segments 3 and
4 (because they are within the current window). It is possible that:
򐂰 Segment 3 has been received, and we do not know about segment 4. It might
be received, but ACK did not reach us yet, or it might be lost.
򐂰 Segment 3 was lost, and we received the ACK 1500 on the reception of
segment 4.
Each TCP implementation is free to react to a timeout as its implementers
choose. It can retransmit only segment 2, but in the second case, we will be waiting
again until segment 3 times out. In this case, we lose all of the throughput
advantages of the window mechanism. Or TCP might immediately resend all of
the segments in the current window.
Whatever the choice, maximal throughput is lost. This is because the ACK does
not contain a second acknowledgment sequence number indicating the actual
frame received.
Variable timeout intervals
Each TCP should implement an algorithm to adapt the timeout values to be used
for the round trip time of the segments. To do this, TCP records the time at which
a segment was sent, and the time at which the ACK is received. A weighted
average is calculated over several of these round trip times, to be used as a
timeout value for the next segment or segments to be sent.
This is an important feature, because delays can vary in an IP network, depending
on multiple factors, such as the load of an intermediate low-speed network or the
saturation of an intermediate IP gateway.
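One common realization of such a weighted average is the estimator of RFC 6298, which also tracks the RTT variance; a simplified sketch (class name illustrative, times in seconds):

```python
class RttEstimator:
    """Smoothed round-trip time and retransmission timeout (RFC 6298 style)."""
    ALPHA, BETA = 1 / 8, 1 / 4   # standard smoothing gains

    def __init__(self):
        self.srtt = None     # smoothed round-trip time
        self.rttvar = None   # round-trip time variance

    def sample(self, rtt: float) -> float:
        """Fold in one measured RTT and return the new timeout (RTO)."""
        if self.srtt is None:                 # first measurement
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return max(1.0, self.srtt + 4 * self.rttvar)  # RTO, floored at 1 second
```

Successive samples pull the timeout toward the measured round-trip times, while the variance term keeps it safely above them.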
Establishing a TCP connection
Before any data can be transferred, a connection has to be established between
the two processes. One of the processes (usually the server) issues a passive
OPEN call, the other an active OPEN call. The passive OPEN call remains
dormant until another process tries to connect to it by an active OPEN.
As shown in Figure 4-19, three TCP segments are exchanged in the network:
1) SYN
2) SYN ACK SEQ:4999 ACK:1000
3) ACK SEQ:1000 ACK:5000
Figure 4-19 TCP: Connection establishment
This whole process is known as a three-way handshake. Note that the
exchanged TCP segments include the initial sequence numbers from both sides,
to be used on subsequent data transfers.
Closing the connection is done implicitly by sending a TCP segment with the FIN
bit (no more data) set. Because the connection is full-duplex (that is, there are
two independent data streams, one in each direction), the FIN segment only
closes the data transfer in one direction. The other process will now send the
remaining data it still has to transmit and also ends with a TCP segment where
the FIN bit is set. The connection is deleted (status information on both sides)
after the data stream is closed in both directions.
The following is a list of the different states of a TCP connection:
򐂰 LISTEN: Awaiting a connection request from another TCP layer.
򐂰 SYN-SENT: A SYN has been sent, and TCP is awaiting the response SYN.
򐂰 SYN-RECEIVED: A SYN has been received, a SYN has been sent, and TCP
is awaiting an ACK.
򐂰 ESTABLISHED: The three-way handshake has been completed.
򐂰 FIN-WAIT-1: The local application has issued a CLOSE. TCP has sent a FIN,
and is awaiting an ACK or a FIN.
򐂰 FIN-WAIT-2: A FIN has been sent, and an ACK received. TCP is awaiting a
FIN from the remote TCP layer.
򐂰 CLOSE-WAIT: TCP has received a FIN, and has sent an ACK. It is awaiting a
close request from the local application before sending a FIN.
򐂰 CLOSING: A FIN has been sent, a FIN has been received, and an ACK has
been sent. TCP is awaiting an ACK for the FIN that was sent.
򐂰 LAST-ACK: A FIN has been received, and an ACK and a FIN have been sent.
TCP is awaiting an ACK.
򐂰 TIME-WAIT: FINs have been received and ACK’d, and TCP is waiting two
MSLs to remove the connection from the table.
򐂰 CLOSED: An imaginary state, indicating that the connection has been
removed from the connection table.
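A few of these transitions can be captured as a lookup table; a partial sketch (the event names are illustrative simplifications of the RFC 793 events):

```python
# Partial TCP state transition table: (state, event) -> next state.
TRANSITIONS = {
    ("CLOSED", "passive_open"):   "LISTEN",
    ("CLOSED", "active_open"):    "SYN-SENT",      # send SYN
    ("LISTEN", "recv_syn"):       "SYN-RECEIVED",  # send SYN, ACK
    ("SYN-SENT", "recv_syn_ack"): "ESTABLISHED",   # send ACK
    ("SYN-RECEIVED", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"):     "FIN-WAIT-1",    # send FIN
    ("ESTABLISHED", "recv_fin"):  "CLOSE-WAIT",    # send ACK
    ("FIN-WAIT-1", "recv_ack"):   "FIN-WAIT-2",
    ("FIN-WAIT-2", "recv_fin"):   "TIME-WAIT",     # send ACK, wait two MSLs
    ("CLOSE-WAIT", "close"):      "LAST-ACK",      # send FIN
    ("LAST-ACK", "recv_ack"):     "CLOSED",
}

def step(state: str, event: str) -> str:
    return TRANSITIONS[(state, event)]

# The client side of the three-way handshake:
state = step("CLOSED", "active_open")   # SYN-SENT
state = step(state, "recv_syn_ack")
print(state)  # ESTABLISHED
```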
4.3.2 TCP application programming interface
The TCP application programming interface is not fully defined. Only some base
functions it should provide are described in RFC 793 – Transmission Control
Protocol. As is the case with most RFCs in the TCP/IP protocol suite, a great
degree of freedom is left to the implementers, thereby allowing for optimal
operating system-dependent implementations, resulting in better efficiency and
greater throughput.
The following function calls are described in the RFC:
򐂰 Open: Establishes a connection. Takes several parameters, such as:
– Active/Passive
– Foreign socket
– Local port number
– Timeout value (optional)
This returns a local connection name, which is used to reference this
particular connection in all other functions.
򐂰 Send: Causes data in a referenced user buffer to be sent over the connection.
Can optionally set the URGENT flag or the PUSH flag.
򐂰 Receive: Copies incoming TCP data to a user buffer.
򐂰 Close: Closes the connection; causes a push of all remaining data and a TCP
segment with FIN flag set.
򐂰 Status: An implementation-dependent call that can return information, such as:
– Local and foreign socket
– Send and receive window sizes
– Connection state
– Local connection name
򐂰 Abort: Causes all pending Send and Receive operations to be aborted, and a
RESET to be sent to the foreign TCP.
For full details, see RFC 793 – Transmission Control Protocol.
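On systems that provide the Berkeley sockets API, these RFC 793 calls map roughly onto familiar socket operations. A sketch of the passive and active OPEN over the loopback interface (addresses and payload illustrative):

```python
import socket

# Passive OPEN: bind to a local port and await a connection request.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0 = let the system pick one
server.listen(1)
port = server.getsockname()[1]

# Active OPEN: connecting triggers the three-way handshake.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
conn, _ = server.accept()

client.sendall(b"hello")               # Send
print(conn.recv(5))                    # Receive -> b'hello'

# Close: each side sends a FIN; the connection ends in both directions.
client.close(); conn.close(); server.close()
```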
4.3.3 TCP congestion control algorithms
One big difference between TCP and UDP is the congestion control algorithm.
The TCP congestion algorithm prevents a sender from overrunning the capacity
of the network (for example, slower WAN links). TCP can adapt the sender's rate
to network capacity and attempt to avoid potential congestion situations. In order
to understand the difference between TCP and UDP, understanding basic TCP
congestion control algorithms is very helpful.
Several congestion control enhancements have been added and suggested to
TCP over the years. This is still an active and ongoing research area, but modern
implementations of TCP contain four intertwined algorithms as basic Internet
standards:
򐂰 Slow start
򐂰 Congestion avoidance
򐂰 Fast retransmit
򐂰 Fast recovery
Slow start
Old implementations of TCP start a connection with the sender injecting multiple
segments into the network, up to the window size advertised by the receiver.
Although this is OK when the two hosts are on the same LAN, if there are routers
and slower links between the sender and the receiver, problems can arise. Some
intermediate routers cannot handle it, packets get dropped, retransmissions
result, and performance is degraded.
The algorithm to avoid this is called slow start. It operates by observing that the
rate at which new packets should be injected into the network is the rate at which
the acknowledgments are returned by the other end. Slow start adds another
window to the sender's TCP: the congestion window, called cwnd. When a new
connection is established with a host on another network, the congestion window
is initialized to one segment (for example, the segment size announced by the
other end, or the default, typically 536 or 512).
Note: Congestion control is defined in RFC 2581. Additionally, RFC 3390
updates RFC 2581 such that TCP implementations can initialize the
congestion window to between two and four segments, with an upper limit of
4 K.
Each time an ACK is received, the congestion window is increased by one
segment. The sender can transmit up to the lower of the congestion window and
the advertised window. The congestion window is flow control imposed by the
sender, while the advertised window is flow control imposed by the receiver. The
former is based on the sender's assessment of perceived network congestion;
the latter is related to the amount of available buffer space at the receiver for this
connection.
The sender starts by transmitting one segment and waiting for its ACK. When
that ACK is received, the congestion window is incremented from one to two, and
two segments can be sent. When each of those two segments is acknowledged,
the congestion window is increased to four. This provides an exponential growth,
although it is not exactly exponential, because the receiver might delay its ACKs,
typically sending one ACK for every two segments that it receives.
At some point, the capacity of the IP network (for example, slower WAN links)
can be reached, and an intermediate router will start discarding packets. This
tells the sender that its congestion window has gotten too large. See Figure 4-20
for an overview of slow start in action.
Figure 4-20 TCP: Slow start in action
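Counting in whole segments, this growth can be sketched as follows; with one ACK per segment, each round trip doubles the congestion window (a simplification that ignores delayed ACKs):

```python
def slow_start_rounds(rounds, cwnd=1):
    """Return cwnd (in segments) at the start of each round trip.
    Every ACK increases cwnd by one segment, so a round trip in
    which cwnd segments are acknowledged doubles the window."""
    history = [cwnd]
    for _ in range(rounds):
        cwnd += cwnd              # one increment per ACK received
        history.append(cwnd)
    return history

print(slow_start_rounds(4))  # [1, 2, 4, 8, 16]
```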
Congestion avoidance
The assumption of the algorithm is that packet loss caused by damage is very
small (much less than 1%). Therefore, the loss of a packet signals congestion
somewhere in the network between the source and destination. There are two
indications of packet loss:
򐂰 A timeout occurs.
򐂰 Duplicate ACKs are received.
Congestion avoidance and slow start are independent algorithms with different
objectives. But when congestion occurs, TCP must slow down its transmission
rate of packets into the network and invoke slow start to get things going again.
In practice, they are implemented together.
Congestion avoidance and slow start require that two variables be maintained for
each connection:
򐂰 A congestion window, cwnd
򐂰 A slow start threshold size, ssthresh
The combined algorithm operates as follows:
1. Initialization for a given connection sets cwnd to one segment and ssthresh to
65535 bytes.
2. The TCP output routine never sends more than the lower value of cwnd or the
receiver's advertised window.
3. When congestion occurs (timeout or duplicate ACK), one-half of the current
window size is saved in ssthresh. Additionally, if the congestion is indicated
by a timeout, cwnd is set to one segment.
4. When new data is acknowledged by the other end, increase cwnd, but the
way it increases depends on whether TCP is performing slow start or
congestion avoidance. If cwnd is less than or equal to ssthresh, TCP is in
slow start; otherwise, TCP is performing congestion avoidance.
Slow start continues until TCP is halfway to where it was when congestion
occurred (because it recorded half of the window size that caused the problem in
step 3), and then congestion avoidance takes over. Slow start has cwnd begin at
one segment, and incremented by one segment every time an ACK is received.
As mentioned earlier, this opens the window exponentially: send one segment,
then two, then four, and so on.
Congestion avoidance dictates that cwnd be incremented by
segsize*segsize/cwnd each time an ACK is received, where segsize is the
segment size and cwnd is maintained in bytes. This is a linear growth of cwnd,
compared to slow start's exponential growth. The increase in cwnd should be at
most one segment each round-trip time (regardless of how many ACKs are
received in that round-trip time), while slow start increments cwnd by the number
of ACKs received in a round-trip time. Many implementations incorrectly add a
small fraction of the segment size (typically the segment size divided by 8) during
congestion avoidance. This is wrong and should not be emulated in future
releases. See Figure 4-21 for an example of TCP slow start and congestion
avoidance in action.
Figure 4-21 TCP: Slow start and congestion avoidance behavior in action (cwnd plotted against round-trip times)
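The combined behavior reduces to a small per-event update rule. A sketch in bytes (the 1460-byte segment size is an assumption, typical for Ethernet; function names illustrative):

```python
SEGSIZE = 1460  # assumed maximum segment size, in bytes

def on_ack(cwnd, ssthresh):
    """Grow cwnd when an ACK for new data arrives."""
    if cwnd <= ssthresh:
        return cwnd + SEGSIZE                 # slow start: exponential growth
    return cwnd + SEGSIZE * SEGSIZE // cwnd   # congestion avoidance: linear growth

def on_timeout(cwnd):
    """React to congestion signaled by a timeout: save half the
    current window in ssthresh and fall back to one segment."""
    ssthresh = max(cwnd // 2, 2 * SEGSIZE)
    return SEGSIZE, ssthresh
```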
Fast retransmit
Fast retransmit avoids having TCP wait for a timeout to resend lost segments.
Modifications to the congestion avoidance algorithm were proposed in 1990.
Before describing the change, realize that TCP can generate an immediate
acknowledgment (a duplicate ACK) when an out-of-order segment is received.
This duplicate ACK should not be delayed. The purpose of this duplicate ACK is
to let the other end know that a segment was received out of order and to tell it
what sequence number is expected.
Because TCP does not know whether a duplicate ACK is caused by a lost
segment or just a reordering of segments, it waits for a small number of duplicate
ACKs to be received. It is assumed that if there is just a reordering of the
segments, there will be only one or two duplicate ACKs before the reordered
segment is processed, which will then generate a new ACK. If three or more
duplicate ACKs are received in a row, it is a strong indication that a segment has
been lost. TCP then performs a retransmission of what appears to be the missing
segment, without waiting for a retransmission timer to expire. See Figure 4-22 for
an overview of TCP fast retransmit in action.
Figure 4-22 TCP: Fast retransmit in action (the sender transmits packets 1 through 6; packet 3 is lost, and after duplicate ACKs the sender retransmits packet 3 without waiting for a timeout)
Fast recovery
After fast retransmit sends what appears to be the missing segment, congestion
avoidance, but not slow start, is performed. This is the fast recovery algorithm. It
is an improvement that allows high throughput under moderate congestion,
especially for large windows.
The reason for not performing slow start in this case is that the receipt of the
duplicate ACKs tells TCP more than just a packet has been lost. Because the
receiver can only generate the duplicate ACK when another segment is received,
that segment has left the network and is in the receiver's buffer. That is, there is
still data flowing between the two ends, and TCP does not want to reduce the
flow abruptly by going into slow start. The fast retransmit and fast recovery
algorithms are usually implemented together as follows:
1. When the third duplicate ACK in a row is received, set ssthresh to one-half
the current congestion window, cwnd, but no less than two segments.
Retransmit the missing segment. Set cwnd to ssthresh plus three times the
segment size. This inflates the congestion window by the number of
segments that have left the network and the other end has cached (3).
2. Each time another duplicate ACK arrives, increment cwnd by the segment
size. This inflates the congestion window for the additional segment that has
left the network. Transmit a packet, if allowed by the new value of cwnd.
3. When the next ACK arrives that acknowledges new data, set cwnd to
ssthresh (the value set in step 1). This ACK is the acknowledgment of the
retransmission from step 1, one round-trip time after the retransmission.
Additionally, this ACK acknowledges all the intermediate segments sent
between the lost packet and the receipt of the first duplicate ACK. This step is
congestion avoidance, because TCP is down to one-half the rate it was at
when the packet was lost.
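Steps 1 through 3 can be sketched as a small event handler (a simplification; sizes in bytes, with an assumed 1460-byte segment):

```python
SEGSIZE = 1460  # assumed segment size, in bytes

class FastRecovery:
    """Sketch of the fast retransmit / fast recovery reaction to duplicate ACKs."""
    def __init__(self, cwnd):
        self.cwnd, self.ssthresh, self.dup_acks = cwnd, 0, 0

    def on_dup_ack(self):
        """Return True when the missing segment should be retransmitted."""
        self.dup_acks += 1
        if self.dup_acks == 3:                        # step 1
            self.ssthresh = max(self.cwnd // 2, 2 * SEGSIZE)
            self.cwnd = self.ssthresh + 3 * SEGSIZE   # inflate for cached segments
            return True
        if self.dup_acks > 3:                         # step 2
            self.cwnd += SEGSIZE                      # another segment left the network
        return False

    def on_new_ack(self):
        """Step 3: deflate to ssthresh and continue in congestion avoidance."""
        self.cwnd, self.dup_acks = self.ssthresh, 0
```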
4.4 RFCs relevant to this chapter
The following RFCs provide detailed information about the connection protocols
and architectures presented throughout this chapter:
򐂰 RFC 761 – DoD standard Transmission Control Protocol (January 1980)
򐂰 RFC 768 – User Datagram Protocol (August 1980)
򐂰 RFC 793 – Transmission Control Protocol (September 1981), updated by
RFC 3168 – The Addition of Explicit Congestion Notification (ECN) to IP
(September 2001)
Chapter 5.
Routing protocols
This chapter provides an overview of IP routing and discusses the various
routing protocols used.
One of the basic functions provided by the IP protocol is the ability to form
connections between different physical networks. A system that performs this
function is called an IP router. This type of device attaches to two or more
physical networks and forwards datagrams between the networks.
When sending data to a remote destination, a host passes datagrams to a local
router. The router forwards the datagrams toward the final destination. They
travel from one router to another until they reach a router connected to the
destination’s LAN segment. Each router along the end-to-end path selects the
next hop device used to reach the destination. The next hop represents the next
device along the path to reach the destination. It is located on a physical network
connected to this intermediate system. Because this physical network differs
from the one on which the system originally received the datagram, the
intermediate host has forwarded (that is, routed) the IP datagram from one
physical network to another.
© Copyright IBM Corp. 1989-2006. All rights reserved.
Figure 5-1 shows an environment where Host C is positioned to forward packets
between network X and network Y.
Figure 5-1 IP routing operations (Host C attaches to both network X and network Y, and forwards packets between them on behalf of hosts such as Host A)
The IP routing table in each device is used to forward packets between network
segments. The basic table contains information about a router’s locally
connected networks. The configuration of the device can be extended to contain
information detailing remote networks. This information provides a more
complete view of the overall environment.
A robust routing protocol provides the ability to dynamically build and manage
the information in the IP routing table. As network topology changes occur, the
routing tables are updated with minimal or no manual intervention. This chapter
details several IP routing protocols and how each protocol manages this
process.
Note: In other sections of this book, the position of each protocol within the
layered model of the OSI protocol stack is shown. The routing function is
included as part of the internetwork layer. However, the primary function of a
routing protocol is to exchange routing information with other routers. In this
respect, routing protocols behave more like an application protocol. Therefore,
this chapter makes no attempt to represent the position of these protocols
within the overall protocol stack.
Note: Early IP routing documentation often referred to an IP router as an IP
gateway.
5.1 Autonomous systems
The definition of an autonomous system (AS) is integral to understanding the
function and scope of a routing protocol. An AS is defined as a logical portion of
a larger IP network. An AS normally consists of an internetwork within an
organization. It is administered by a single management authority. As shown in
Figure 5-2, an AS can connect to other autonomous systems managed by the
same organization. Alternatively, it can connect to other public or private
networks.
Figure 5-2 Autonomous systems (autonomous systems A, B, and C, each under a single management authority)
Some routing protocols are used to determine routing paths within an AS. Others
are used to interconnect a set of autonomous systems:
򐂰 Interior Gateway Protocols (IGPs): Interior Gateway Protocols allow routers to
exchange information within an AS. Examples of these protocols are Open
Shortest Path First (OSPF) and Routing Information Protocol (RIP).
򐂰 Exterior Gateway Protocols (EGPs): Exterior Gateway Protocols allow the
exchange of summary information between autonomous systems. An
example of this type of routing protocol is Border Gateway Protocol (BGP).
Chapter 5. Routing protocols
Figure 5-2 on page 173 depicts the interaction between Interior and Exterior
Gateway Protocols. It shows the Interior Gateway Protocols used to maintain
routing information within each AS. The figure also shows the Exterior Gateway
Protocols maintaining the routing information between autonomous systems.
Within an AS, multiple interior routing processes can be used. When this occurs,
the AS must appear to other autonomous systems as having a single coherent
interior routing plan. The AS must present a consistent view of the internal
topology.
5.2 Types of IP routing and IP routing algorithms
Routing algorithms build and maintain the IP routing table on a device. There are
two primary methods used to build the routing table:
򐂰 Static routing: Static routing uses preprogrammed definitions representing
paths through the network.
򐂰 Dynamic routing: Dynamic routing algorithms allow routers to automatically
discover and maintain awareness of the paths through the network. This
automatic discovery can use a number of currently available dynamic routing
protocols. The difference between these protocols is the way they discover
and calculate new routes to destination networks. They can be classified into
four broad categories:
– Distance vector protocols
– Link state protocols
– Path vector protocols
– Hybrid protocols
The remainder of this section describes the operation of each algorithm.
There are several reasons for the multiplicity of protocols:
򐂰 Routing within a network and routing between networks typically have
different requirements for security, stability, and scalability. Different routing
protocols have been developed to address these requirements.
򐂰 New protocols have been developed to address the observed deficiencies in
established protocols.
򐂰 Different-sized networks can use different routing algorithms. Small to
medium-sized networks often use routing protocols that reflect the simplicity
of the environment.
However, these protocols do not scale to support large, interconnected networks.
More complex routing algorithms are required to support these environments.
5.2.1 Static routing
Static routing is manually performed by the network administrator. The
administrator is responsible for discovering and propagating routes through the
network. These definitions are manually programmed in every routing device in
the environment.
After a device has been configured, it simply forwards packets out the
predetermined ports. There is no communication between routers regarding the
current topology of the network.
In small networks with minimal redundancy, this process is relatively simple to
administer. However, there are several disadvantages to this approach for
maintaining IP routing tables:
򐂰 Static routes require a considerable amount of coordination and maintenance
in non-trivial network environments.
򐂰 Static routes cannot dynamically adapt to the current operational state of the
network. If a destination subnetwork becomes unreachable, the static routes
pointing to that network remain in the routing table. Traffic continues to be
forwarded toward that destination. Unless the network administrator updates
the static routes to reflect the new topology, traffic is unable to use any
alternate paths that may exist.
Normally, static routes are used only in simple network topologies. However,
there are additional circumstances when static routing can be attractive. For
example, static routes can be used:
򐂰 To manually define a default route. This route is used to forward traffic when
the routing table does not contain a more specific route to the destination.
򐂰 To define a route that is not automatically advertised within a network.
򐂰 When utilization or line tariffs make it undesirable to send routing
advertisement traffic through lower-capacity WAN connections.
򐂰 When complex routing policies are required. For example, static routes can
be used to guarantee that traffic destined for a specific host traverses a
designated network path.
򐂰 To provide a more secure network environment. The administrator is aware of
all subnetworks defined in the environment. The administrator specifically
authorizes all communication permitted between these subnetworks.
򐂰 To provide more efficient resource utilization. This method of routing table
management requires no network bandwidth to advertise routes between
neighboring devices. It also uses less processor memory and CPU cycles to
calculate network paths.
5.2.2 Distance vector routing
Distance vector algorithms are examples of dynamic routing protocols. These
algorithms allow each device in the network to automatically build and maintain a
local IP routing table.
The principle behind distance vector routing is simple. Each router in the
internetwork maintains the distance or cost from itself to every known destination.
This value represents the overall desirability of the path. Paths associated with a
smaller cost value are more attractive to use than paths associated with a larger
value. The path represented by the smallest cost becomes the preferred path to
reach the destination.
This information is maintained in a distance vector table. The table is periodically
advertised to each neighboring router. Each router processes these
advertisements to determine the best paths through the network.
The main advantage of distance vector algorithms is that they are typically easy
to implement and debug. They are very useful in small networks with limited
redundancy. However, there are several disadvantages with this type of protocol:
򐂰 During an adverse condition, the length of time for every device in the
network to produce an accurate routing table is called the convergence time.
In large, complex internetworks using distance vector algorithms, this time
can be excessive. While the routing tables are converging, networks are
susceptible to inconsistent routing behavior. This can cause routing loops or
other types of unstable packet forwarding.
򐂰 To reduce convergence time, a limit is often placed on the maximum number
of hops contained in a single route. Valid paths exceeding this limit are not
usable in distance vector networks.
򐂰 Distance vector routing tables are periodically transmitted to neighboring
devices. They are sent even if no changes have been made to the contents of
the table. This can cause noticeable periods of increased utilization in
reduced capacity environments.
Enhancements to the basic distance vector algorithm have been developed to
reduce the convergence and instability exposures. We describe these
enhancements in 5.3.5, “Convergence and counting to infinity” on page 185.
RIP is a popular example of a distance vector routing protocol.
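At its core, processing a neighbor's advertisement is a Bellman-Ford relaxation: add the cost of the link to the neighbor, and keep the cheaper path. A minimal sketch (network names and costs illustrative):

```python
def dv_update(table, neighbor, link_cost, neighbor_table):
    """Merge a neighbor's advertised distance vector into our routing
    table, which maps destination -> (cost, next hop)."""
    changed = False
    for dest, cost in neighbor_table.items():
        candidate = link_cost + cost
        if dest not in table or candidate < table[dest][0]:
            table[dest] = (candidate, neighbor)
            changed = True
    return changed

routes = {"NetA": (0, None)}                    # directly connected network
dv_update(routes, "R2", 1, {"NetA": 5, "NetB": 2})
print(routes["NetB"])  # (3, 'R2') -- reach NetB through neighbor R2 at cost 3
```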
5.2.3 Link state routing
The growth in the size and complexity of networks in recent years has
necessitated the development of more robust routing algorithms. These
algorithms address the shortcoming observed in distance vector protocols.
These algorithms use the principle of a link state to determine network topology.
A link state is the description of an interface on a router (for example, IP address,
subnet mask, type of network) and its relationship to neighboring routers. The
collection of these link states forms a link state database.
The process used by link state algorithms to determine network topology is as
follows:
1. Each router identifies all other routing devices on the directly connected
networks.
associated cost of each link. This is performed through the exchange of link
state advertisements (LSAs) with other routers in the network.
3. Using these advertisements, each router creates a database detailing the
current network topology. The topology database in each router is identical.
4. Each router uses the information in the topology database to compute the
most desirable routes to each destination network. This information is used to
update the IP routing table.
Shortest-Path First (SPF) algorithm
The SPF algorithm is used to process the information in the topology database. It
provides a tree-representation of the network. The device running the SPF
algorithm is the root of the tree. The output of the algorithm is the list of
shortest-paths to each destination network. Figure 5-3 on page 178 provides an
example of the shortest-path algorithm executed on router A.
Link state database (neighbor-cost pairs for each router):
A: B-2, C-1
B: A-2, D-4
C: A-1, D-1, E-3
D: C-1, B-4, E-3
E: C-3, D-3
Figure 5-3 Shortest-Path First (SPF) example
Because each router is processing the same set of LSAs, each router creates an
identical link state database. However, because each device occupies a different
place in the network topology, the application of the SPF algorithm produces a
different tree for each router.
The OSPF protocol is a popular example of a link state routing protocol.
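The SPF computation itself is Dijkstra's algorithm run over the link state database. A compact sketch over an illustrative five-router topology (costs chosen to match Figure 5-3):

```python
import heapq

def spf(lsdb, root):
    """Dijkstra's shortest-path-first over a link state database:
    lsdb maps router -> {neighbor: link cost}. Returns the lowest
    total cost from root to every reachable router."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue                      # stale queue entry
        for neighbor, link in lsdb[node].items():
            new_cost = cost + link
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist

lsdb = {"A": {"B": 2, "C": 1}, "B": {"A": 2, "D": 4},
        "C": {"A": 1, "D": 1, "E": 3}, "D": {"C": 1, "B": 4, "E": 3},
        "E": {"C": 3, "D": 3}}
print(spf(lsdb, "A"))  # {'A': 0, 'B': 2, 'C': 1, 'D': 2, 'E': 4}
```

Because each router runs the same computation from its own position, each obtains a different shortest-path tree from the identical database.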
5.2.4 Path vector routing
Path vector routing is discussed in RFC 1322; the following paragraphs are
based on the RFC.
The path vector routing algorithm is somewhat similar to the distance vector
algorithm in the sense that each border router advertises the destinations it can
reach to its neighboring router. However, instead of advertising networks in
terms of a destination and the distance to that destination, networks are
advertised as destination addresses and path descriptions to reach those
destinations.
A route is defined as a pairing between a destination and the attributes of the
path to that destination, thus the name, path vector routing, where the routers
receive a vector that contains paths to a set of destinations.
The path, expressed in terms of the domains (or confederations) traversed so
far, is carried in a special path attribute that records the sequence of routing
domains through which the reachability information has passed. The path
represented by the smallest number of domains becomes the preferred path to
reach the destination.
The main advantage of a path vector protocol is its flexibility. There are several
other advantages to using a path vector protocol:
• The computational complexity is smaller than that of a link state protocol.
The path vector computation consists of evaluating a newly arrived route and
comparing it with the existing one, while conventional link state computation
requires execution of an SPF algorithm.
• Path vector routing does not require all routing domains to have
homogeneous policies for route selection; route selection policies used by
one routing domain are not necessarily known to other routing domains.
Supporting heterogeneous route selection policies would have serious
implications for computational complexity, but the path vector protocol allows
each domain to make its route selection autonomously, based only on local
policies, and can therefore accommodate heterogeneous route selection with
little additional cost.
• Only the domains whose routes are affected by the changes have to
recompute their routes.
• Suppression of routing loops is implemented through the path attribute, in
contrast to link state and distance vector algorithms, which use a globally
defined, monotonically increasing metric for route selection. Therefore,
different confederation definitions are accommodated because looping is
avoided by the use of full path information.
• Route computation precedes routing information dissemination. Therefore,
only routing information associated with the routes selected by a domain is
distributed to adjacent domains.
• Path vector routing has the ability to selectively hide information.
However, there are disadvantages to this approach, including:
• Topology changes only result in the recomputation of routes affected by these
changes, which is more efficient than complete recomputation. However,
because of the inclusion of full path information with each distance vector, the
effect of a topology change can propagate farther than in traditional distance
vector algorithms.
• Unless the network topology is fully meshed or is able to appear so, routing
loops can become an issue.
BGP is a popular example of a path vector routing protocol.
5.2.5 Hybrid routing
The last category of routing protocols is hybrid protocols. These protocols
attempt to combine the positive attributes of both distance vector and link state
protocols. Like distance vector, hybrid protocols use metrics to assign a
preference to a route. However, the metrics are more accurate than conventional
distance vector protocols. Like link state algorithms, routing updates in hybrid
protocols are event driven rather than periodic. Networks using hybrid protocols
tend to converge more quickly than networks using distance vector protocols.
Finally, these protocols potentially reduce the costs of link state updates and
distance vector advertisements.
Although open hybrid protocols exist, this category is almost exclusively
associated with the proprietary EIGRP algorithm. EIGRP was developed by
Cisco Systems, Inc.
5.3 Routing Information Protocol (RIP)
RIP is an example of an interior gateway protocol designed for use within small
autonomous systems. RIP is based on the Xerox XNS routing protocol. Early
implementations of RIP were readily accepted because the code was
incorporated in the Berkeley Software Distribution (BSD) UNIX-based operating
system. RIP is a distance vector protocol.
In mid-1988, the IETF issued RFC 1058, with updates in RFC 2453, which
describes the standard operations of a RIP system. However, the RFC was
issued after many RIP implementations had been completed. For this reason,
some RIP systems do not support the entire set of enhancements to the basic
distance vector algorithm (for example, poison reverse and triggered updates).
5.3.1 RIP packet types
The RIP protocol specifies two packet types. These packets can be sent by any
device running the RIP protocol:
• Request packets: A request packet queries neighboring RIP devices to obtain
their distance vector table. The request indicates if the neighbor should return
either a specific subset or the entire contents of the table.
• Response packets: A response packet is sent by a device to advertise the
information maintained in its local distance vector table. The table is sent
during the following situations:
– The table is automatically sent every 30 seconds.
– The table is sent as a response to a request packet generated by another
RIP node.
– If triggered updates are supported, the table is sent when there is a
change to the local distance vector table. We discuss triggered updates in
“Triggered updates” on page 188.
When a response packet is received by a device, the information contained in
the update is compared against the local distance vector table. If the update
contains a lower cost route to a destination, the table is updated to reflect the
new path.
5.3.2 RIP packet format
RIP uses a specific packet format to share information about the distances to
known network destinations. RIP packets are transmitted using UDP datagrams.
RIP sends and receives datagrams using UDP port 520.
RIP datagrams have a maximum size of 512 octets. Updates larger than this size
must be advertised in multiple datagrams. In LAN environments, RIP datagrams
are sent using the MAC all-stations broadcast address and an IP network
broadcast address. In point-to-point or non-broadcast environments, datagrams
are specifically addressed to the destination device.
The RIP packet format is shown in Figure 5-4.
Figure 5-4 RIP packet format (version = 1; AFI X'0002' is the address family
identifier for IP; the routing entry may be repeated)
A 512 byte packet size allows a maximum of 25 routing entries to be included in
a single RIP advertisement.
5.3.3 RIP modes of operation
RIP hosts have two modes of operation:
• Active mode: Devices operating in active mode advertise their distance vector
table and also receive routing updates from neighboring RIP hosts. Routing
devices are typically configured to operate in active mode.
• Passive (or silent) mode: Devices operating in this mode simply receive
routing updates from neighboring RIP devices. They do not advertise their
distance vector table. End stations are typically configured to operate in
passive mode.
5.3.4 Calculating distance vectors
The distance vector table describes each destination network. The entries in this
table contain the following information:
• The destination network (vector) described by this entry in the table.
• The associated cost (distance) of the most attractive path to reach this
destination. This provides the ability to differentiate between multiple paths to
a destination. In this context, the terms distance and cost can be misleading.
They have no direct relationship to physical distance or monetary cost.
• The IP address of the next-hop device used to reach the destination network.
Each time a routing table advertisement is received by a device, it is processed
to determine if any destination can be reached by a lower cost path. This is done
using the RIP distance vector algorithm. The algorithm can be summarized as:
• At router initialization, each device contains a distance vector table listing
each directly attached network and its configured cost. Typically, each network
is assigned a cost of 1. This represents a single hop through the network. The
total number of hops in a route is equal to the total cost of the route. However,
cost can be changed to reflect other measurements such as utilization,
speed, or reliability.
• Each router periodically (typically every 30 seconds) transmits its distance
vector table to each of its neighbors. The router can also transmit the table
when a topology change occurs. Each router uses this information to update
its local distance vector table:
– The total cost to each destination is calculated by adding the cost reported
in a neighbor's distance vector table to the cost of the link to that neighbor.
The path with the least cost is stored in the distance vector table.
– All updates automatically supersede the previous information in the
distance vector table. This allows RIP to maintain the integrity of the
routes in the routing table.
• The IP routing table is updated to reflect the least-cost path to each
destination.
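The update step of the algorithm can be sketched as follows (a minimal illustration with invented names; a real RIP implementation also handles route timeouts and expiry):

```python
def process_advertisement(table, neighbor_ip, link_cost, advertised):
    """Merge a neighbor's distance vector into the local table.

    table maps destination -> (cost, next_hop); advertised maps
    destination -> cost as reported by the neighbor.
    """
    for destination, reported_cost in advertised.items():
        total = reported_cost + link_cost
        current = table.get(destination)
        # Take the lower-cost path; an update from the current next hop
        # always supersedes the stored route, preserving table integrity.
        if current is None or total < current[0] or current[1] == neighbor_ip:
            table[destination] = (total, neighbor_ip)

table = {"192.168.1.0": (1, "direct")}
process_advertisement(table, "10.0.0.2", 1, {"192.168.2.0": 1})
print(table["192.168.2.0"])  # (2, '10.0.0.2')
```

The total cost is the neighbor's reported cost plus the cost of the link to that neighbor, exactly as described above.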
Figure 5-5 illustrates the distance vector tables for three routers (R2, R3, and
R4) within a simple network.
Figure 5-5 A sample distance vector routing table
5.3.5 Convergence and counting to infinity
Given sufficient time, this algorithm will correctly calculate the distance vector
table on each device. However, during this convergence time, erroneous routes
may propagate through the network. Figure 5-6 shows this problem.
Figure 5-6 Counting to infinity sample network ((n) = network cost)
This network contains four interconnected routers. Each link has a cost of 1,
except for the link connecting router C and router D; this link has a cost of 10.
The costs have been defined so that forwarding packets on the link connecting
router C and router D is undesirable. After the network has converged, each
device has routing information describing all networks.
For example, to reach the target network, the routers have the following routes:
Router D to the target network: Directly connected network. Metric is 1.
Router B to the target network: Next hop is router D. Metric is 2.
Router C to the target network: Next hop is router B. Metric is 3.
Router A to the target network: Next hop is router B. Metric is 3.
Consider an adverse condition where the link connecting router B and router D
fails. After the network has reconverged, all routes use the link connecting router
C and router D to reach the target network. However, this reconvergence time
can be considerable. Figure 5-7 illustrates how the routes to the target network
are updated throughout the reconvergence period. For simplicity, this figure
assumes all routers send updates at the same time.
Figure 5-7 Network convergence sequence
Reconvergence begins when router B notices that the route to router D is
unavailable. Router B is able to immediately remove the failed route because the
link has timed out. However, a considerable amount of time passes before the
other routers remove their references to the failed route. This is described in the
sequence of updates shown in Figure 5-7:
1. Prior to the adverse condition occurring, router A and router C have a route to
the target network through router B.
2. The adverse condition occurs when the link connecting router D and router B
fails. Router B recognizes that its preferred path to the target network is now
invalid.
3. Router A and router C continue to send updates reflecting the route through
router B. This route is actually invalid because the link connecting router D
and router B has failed.
4. Router B receives the updates from router A and router C. Router B believes
it should now route traffic to the target network through either router A or
router C. In reality, this is not a valid route, because the routes in router A and
router C are vestiges of the previous route through router B.
5. Using the routing advertisement sent by router B, router A and router C are
able to determine that the route through router B has failed. However, router
A and router C now believe the preferred route exists through the partner.
Network convergence continues as router A and router C engage in an extended
period of mutual deception. Each device claims to be able to reach the target
network through the partner device. The path to reach the target network now
contains a routing loop.
The manner in which the costs in the distance vector table increment gives rise
to the term counting to infinity. The costs continue to increment, theoretically to
infinity. To minimize this exposure, whenever a network is unavailable, the
incrementing of metrics through routing updates must be halted as soon as it is
practical to do so. In a RIP environment, costs continue to increment until they
reach a maximum value of 16. This limit is defined in RFC 1058.
A side effect of the metric limit is that it also limits the number of hops a packet
can traverse from source network to destination network. In a RIP environment,
any path exceeding 15 hops is considered invalid. The routing algorithm will
discard these paths.
There are two enhancements to the basic distance vector algorithm that can
minimize the counting to infinity problem:
• Split horizon with poison reverse
• Triggered updates
These enhancements do not impact the maximum metric limit.
Split horizon
The excessive convergence time caused by counting to infinity can be reduced
with the use of split horizon. This rule dictates that routing information is
prevented from exiting the router on an interface through which the information
was received.
The basic split horizon rule is not supported in RFC 1058. Instead, the standard
specifies the enhanced split horizon with poison reverse algorithm. The basic
rule is presented here for background and completeness. The enhanced
algorithm is reviewed in the next section.
The incorporation of split horizon modifies the sequence of routing updates
shown in Figure 5-7 on page 186. The new sequence is shown in Figure 5-8. The
tables show that convergence occurs considerably faster using the split horizon
rule.
Figure 5-8 Network convergence with split horizon
The limitation to this rule is that each node must wait for the route to the
unreachable destination to time out before the route is removed from the
distance vector table. In RIP environments, this timeout is at least three minutes
after the initial outage. During that time, the device continues to provide
erroneous information to other nodes about the unreachable destination. This
propagates routing loops and other routing anomalies.
Split horizon with poison reverse
Poison reverse is an enhancement to the standard split horizon implementation.
It is supported in RFC 1058. With poison reverse, all known networks are
advertised in each routing update. However, those networks learned through a
specific interface are advertised as unreachable in the routing announcements
sent out to that interface.
This drastically improves convergence time in complex, highly-redundant
environments. With poison reverse, when a routing update indicates that a
network is unreachable, routes are immediately removed from the routing table.
This breaks erroneous, looping routes before they can propagate through the
network. This approach differs from the basic split horizon rule where routes are
eliminated through timeouts.
Poison reverse has no benefit in networks with no redundancy (single path
environments).
One disadvantage to poison reverse is that it might significantly increase the size
of routing announcements exchanged between neighbors. This is because all
routes in the distance vector table are included in each announcement. Although
this is generally not an issue on local area networks, it can cause periods of
increased utilization on lower-capacity WAN connections.
Triggered updates
Like split horizon with poison reverse, algorithms implementing triggered updates
are designed to reduce network convergence time. With triggered updates,
whenever a router changes the cost of a route, it immediately sends the modified
distance vector table to neighboring devices. This mechanism ensures that
topology change notifications are propagated quickly, rather than at the normal
periodic interval.
Triggered updates are supported in RFC 1058.
5.3.6 RIP limitations
There are a number of limitations observed in RIP environments:
• Path cost limits: The resolution to the counting to infinity problem enforces a
maximum cost for a network path. This places an upper limit on the maximum
network diameter. Networks requiring paths greater than 15 hops must use
an alternate routing protocol.
• Network-intensive table updates: Periodic broadcasting of the distance vector
table can result in increased utilization of network resources. This can be a
concern in reduced-capacity segments.
• Relatively slow convergence: RIP, like other distance vector protocols, is
relatively slow to converge. The algorithms rely on timers to initiate routing
table advertisements.
• No support for variable length subnet masking: Route advertisements in a
RIP environment do not include subnet masking information. This makes it
impossible for RIP networks to deploy variable length subnet masks.
5.4 Routing Information Protocol Version 2 (RIP-2)
The IETF recognizes two versions of RIP:
• RIP Version 1 (RIP-1): This protocol is described in RFC 1058.
• RIP Version 2 (RIP-2): RIP-2 is also a distance vector protocol designed for
use within an AS. It was developed to address the limitations observed in
RIP-1. RIP-2 is described in RFC 2453 (STD 56), published in November
1998.
In practice, the term RIP refers to RIP-1. Whenever you encounter the term RIP
in TCP/IP literature, it is safe to assume that the reference is to RIP Version 1
unless otherwise stated. This same convention is used in this document.
However, when the two versions are being compared, the term RIP-1 is used to
avoid confusion.
RIP-2 is similar to RIP-1. It was developed to extend RIP-1 functionality in small
networks. RIP-2 provides these additional benefits not available in RIP-1:
• Support for CIDR and VLSM: RIP-2 supports supernetting (that is, CIDR) and
variable-length subnet masking. This support was the major reason the new
standard was developed. This enhancement positions the standard to
accommodate a degree of addressing complexity not supported in RIP-1.
• Support for multicasting: RIP-2 supports the use of multicasting rather than
simple broadcasting of routing announcements. This reduces the processing
load on hosts not listening for RIP-2 messages. To ensure interoperability
with RIP-1 environments, this option is configured on each network interface.
• Support for authentication: RIP-2 supports authentication of any node
transmitting route advertisements. This prevents fraudulent sources from
corrupting the routing table.
• Support for RIP-1: RIP-2 is fully interoperable with RIP-1. This provides
backward-compatibility between the two standards.
As noted in the RIP-1 section, one notable shortcoming in the RIP-1 standard is
the implementation of the metric field. RIP-1 specifies the metric as a value
between 0 and 16. To ensure compatibility with RIP-1 networks, RIP-2 preserves
this definition. In both standards, network paths with a hop-count greater than
15 are interpreted as unreachable.
5.4.1 RIP-2 packet format
The original RIP-1 specification was designed to support future enhancements.
The RIP-2 standard was able to capitalize on this feature. RIP-2 developers
noted that a RIP-1 packet already contains a version field and that 50% of the
octets are unused.
Figure 5-9 illustrates the contents of a RIP-2 packet. The packet is shown with
authentication information. The first entry in the update contains either a routing
entry or an authentication entry. If the first entry is an authentication entry, 24
additional routing entries can be included in the message. If there is no
authentication information, 25 routing entries can be provided.
Figure 5-9 RIP-2 packet format (authentication type 0 = no authentication,
2 = password data; the authentication data field carries the password if type 2 is
selected; each routing entry contains a route tag, IP address, subnet mask, and
next hop, and may be repeated)
The use of the command field, IP address field, and metric field in a RIP-2
message is identical to the use in a RIP-1 message. Otherwise, the changes
implemented in a RIP-2 packet include:
Version              The value contained in this field must be two. This
instructs RIP-1 routers to ignore any information
contained in the previously unused fields.
AFI (Address Family) A value of x'0002' indicates the address contained in the
network address field is an IP address. A value of
x'FFFF' indicates an authentication entry.
Authentication Type This field defines the remaining 16 bytes of the
authentication entry. A value of 0 indicates no
authentication. A value of two indicates the authentication
data field contains password data.
Authentication Data This field contains a 16-byte password.
Route Tag            This field is intended to differentiate between internal and
external routes. Internal routes are learned through RIP-2
within the same network or AS.
Subnet Mask          This field contains the subnet mask of the referenced
network.
Next Hop             This field contains a recommendation about the next hop
the router should use when sending datagrams to the
referenced network.
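To illustrate, the following sketch interprets a 20-octet RIP-2 entry using the layout described above (the helper name and sample values are invented):

```python
import socket
import struct

def parse_rip2_entry(entry):
    """Interpret one 20-octet RIP-2 entry (illustrative).

    An AFI of x'FFFF' marks an authentication entry, which must be the
    first entry in the packet; x'0002' marks an ordinary IP routing entry.
    """
    afi = struct.unpack("!H", entry[:2])[0]
    if afi == 0xFFFF:
        auth_type = struct.unpack("!H", entry[2:4])[0]
        return {"auth_type": auth_type, "password": entry[4:].rstrip(b"\x00")}
    route_tag, ip, mask, next_hop, metric = struct.unpack("!H4s4s4sI", entry[2:])
    return {
        "route_tag": route_tag,
        "ip": socket.inet_ntoa(ip),
        "subnet_mask": socket.inet_ntoa(mask),
        "next_hop": socket.inet_ntoa(next_hop),
        "metric": metric,
    }

auth = struct.pack("!HH16s", 0xFFFF, 2, b"secret")
route = struct.pack("!HH4s4s4sI", 2, 0, socket.inet_aton("10.1.0.0"),
                    socket.inet_aton("255.255.0.0"),
                    socket.inet_aton("10.0.0.2"), 1)
print(parse_rip2_entry(auth)["password"])   # b'secret'
print(parse_rip2_entry(route)["metric"])    # 1
```

Note that the password travels in clear text, which is the vulnerability discussed in the limitations section below.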
5.4.2 RIP-2 limitations
RIP-2 was developed to address many of the limitations observed in RIP-1.
However, the path cost limits and slow convergence inherent in RIP-1 networks
are also concerns in RIP-2 environments.
In addition to these concerns, there are limitations to the RIP-2 authentication
process. The RIP-2 standard does not encrypt the authentication password. It is
transmitted in clear text. This makes the network vulnerable to attack by anyone
with direct physical access to the environment.
5.5 RIPng for IPv6
RIPng was developed to allow routers within an IPv6-based network to exchange
information used to compute routes. It is documented in RFC 2080. We provide
additional information regarding IPv6 in 9.1, “IPv6 introduction” on page 328.
Like the other protocols in the RIP family, RIPng is a distance vector protocol
designed for use within a small autonomous system. RIPng uses the same
algorithms, timers, and logic used in RIP-2.
RIPng has many of the same limitations inherent in other distance vector
protocols. Path cost restrictions and convergence time remain a concern in
RIPng networks.
5.5.1 Differences between RIPng and RIP-2
There are two important distinctions between RIP-2 and RIPng:
• Support for authentication: The RIP-2 standard includes support for
authenticating a node transmitting routing information. RIPng does not
include any native authentication support. Rather, RIPng uses the security
features inherent in IPv6. In addition to authentication, these security features
provide the ability to encrypt each RIPng packet. This can control the set of
devices that receive the routing information. One consequence of using IPv6
security features is that the AFI field within the RIPng packet is eliminated.
There is no longer a need to distinguish between authentication entries and
routing entries within an advertisement.
• Support for IPv6 addressing formats: The fields contained in RIPng packets
were updated to support the longer IPv6 address format.
5.5.2 RIPng packet format
RIPng packets are transmitted using UDP datagrams. RIPng sends and receives
datagrams using UDP port number 521.
The format of a RIPng packet is similar to the RIP-2 format. Specifically, both
packets contain a 4 octet command header followed by a set of 20 octet route
entries. The RIPng packet format is shown in Figure 5-10.
Figure 5-10 RIPng packet format (a command header followed by route table
entries, which may be repeated)
The use of the command field and the version field is identical to the use in a
RIP-2 packet. However, the fields containing routing information have been
updated to accommodate the 16 octet IPv6 address. These fields are used
differently than the corresponding fields in a RIP-1 or RIP-2 packet. The format of
the RTE is shown in Figure 5-11.
Figure 5-11 Route table entry (RTE): IPv6 prefix, route tag, prefix length, and
metric
In RIPng, the combination of the IP prefix and the prefix length identifies the
route to be advertised. The metric remains encoded in a 1 octet field. This length
is sufficient because RIPng uses a maximum hop-count of 16.
Another difference between RIPng and RIP-2 is the process used to determine
the next hop. In RIP-2, each route table entry contains a next hop field. In RIPng,
including this information in each RTE would have doubled the size of the
advertisement. Therefore, in RIPng, the next hop is included in a special type of
RTE. The specified next hop applies to each subsequent routing table entry in
the advertisement. The format of an RTE used to specify the next hop is shown
in Figure 5-12.
Figure 5-12 Next hop route table entry (RTE): an IPv6 next hop address with the
metric 0x'FF' used to distinguish a next hop entry
The next hop RTE is identified by a value of 0x’FF’ in the metric field. This
reserved value is outside the valid range of metrics.
The use of RTEs and next hop RTEs is shown in Figure 5-13.
Figure 5-13 Using the RIPng RTE (routing entries 1-3, next hop RTE A, routing
entries 4-5, next hop RTE B, routing entry 6)
In this example, the first three routing entries do not have a corresponding next
hop RTE. The address prefixes specified by these entries will be routed through
the advertising router. The prefixes included in routing entries 4 and 5 will route
through the next hop address specified in the next hop RTE A. The prefix
included in routing entry 6 will route through the next hop address specified in the
next hop RTE B.
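The layout of such an advertisement can be sketched as follows, assuming the RFC 2080 RTE formats shown in Figure 5-11 and Figure 5-12 (helper names and addresses are invented):

```python
import socket
import struct

NEXT_HOP_METRIC = 0xFF  # reserved metric value marking a next hop RTE

def route_rte(prefix, prefix_len, metric, tag=0):
    """Build a 20-octet RIPng route table entry (RFC 2080 layout)."""
    return struct.pack("!16sHBB",
                       socket.inet_pton(socket.AF_INET6, prefix),
                       tag, prefix_len, metric)

def next_hop_rte(next_hop):
    """A next hop RTE applies to every subsequent route table entry."""
    return struct.pack("!16sHBB",
                       socket.inet_pton(socket.AF_INET6, next_hop),
                       0, 0, NEXT_HOP_METRIC)

# Entries 1-2 route through the advertising router; entry 3 routes
# through the next hop fe80::2 given by the preceding next hop RTE.
body = (route_rte("2001:db8:1::", 48, 1) +
        route_rte("2001:db8:2::", 48, 2) +
        next_hop_rte("fe80::2") +
        route_rte("2001:db8:3::", 48, 1))
print(len(body))  # 80  (four 20-octet RTEs)
```

Sharing one next hop RTE among several routing entries is what avoids doubling the advertisement size, as described above.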
5.6 Open Shortest Path First (OSPF)
The Open Shortest Path First (OSPF) protocol is another example of an interior
gateway protocol. It was developed as a non-proprietary routing alternative to
address the limitations of RIP. Initial development started in 1988 and was
finalized in 1991. Subsequent updates to the protocol continue to be published.
The current version of the standard is documented in RFC 2328.
OSPF provides a number of features not found in distance vector protocols.
Support for these features has made OSPF a widely-deployed routing protocol in
large networking environments. In fact, RFC 1812 – Requirements for IPv4
Routers, lists OSPF as the only required dynamic routing protocol. The following
features contribute to the continued acceptance of the OSPF standard:
• Equal cost load balancing: The simultaneous use of multiple paths can
provide more efficient utilization of network resources.
• Logical partitioning of the network: This reduces the propagation of outage
information during adverse conditions. It also provides the ability to aggregate
routing announcements that limit the advertisement of unnecessary subnet
information.
• Support for authentication: OSPF supports the authentication of any node
transmitting route advertisements. This prevents fraudulent sources from
corrupting the routing tables.
• Faster convergence time: OSPF provides instantaneous propagation of
routing changes. This expedites the convergence time required to update
network topologies.
• Support for CIDR and VLSM: This allows the network administrator to
efficiently allocate IP address resources.
OSPF is a link state protocol. As with other link state protocols, each OSPF
router executes the SPF algorithm (“Shortest-Path First (SPF) algorithm” on
page 177) to process the information stored in the link state database. The
algorithm produces a shortest-path tree detailing the preferred routes to each
destination network.
5.6.1 OSPF terminology
OSPF uses specific terminology to describe the operation of the protocol.
OSPF areas
OSPF networks are divided into a collection of areas. An area consists of a
logical grouping of networks and routers. The area can coincide with geographic
or administrative boundaries. Each area is assigned a 32-bit area ID.
Subdividing the network provides the following benefits:
• Within an area, every router maintains an identical topology database
describing the routing devices and links within the area. These routers have
no knowledge of topologies outside the area. They are only aware of routes to
these external destinations. This reduces the size of the topology database
maintained by each router.
• Areas limit the potentially explosive growth in the number of link state
updates. Most LSAs are distributed only within an area.
• Areas reduce the CPU processing required to maintain the topology
database. The SPF algorithm is limited to managing changes within the area.
Backbone area and area 0
All OSPF networks contain at least one area. This area is known as area 0 or the
backbone area. Additional areas can be created based on network topology or
other design requirements.
In networks containing multiple areas, the backbone physically connects to all
other areas. OSPF expects all areas to announce routing information directly into
the backbone. The backbone then announces this information into other areas.
Figure 5-14 on page 198 depicts a network with a backbone area and four
additional areas.
Intra-area, area border, and AS boundary routers
There are three classifications of routers in an OSPF network. Figure 5-14
illustrates the interaction of these devices.
Figure 5-14 OSPF router types (AS 10 with backbone area 0, areas 1 through 4,
and AS external links; ASBR = AS boundary router, ABR = area border router,
IA = intra-area router)
Intra-area routers          This class of router is logically located entirely
within an OSPF area. Intra-area routers
maintain a topology database for their local
area.
Area border routers (ABR)   This class of router is logically connected to two
or more areas. One area must be the backbone
area. An ABR is used to interconnect areas.
They maintain a separate topology database for
each attached area. ABRs also execute
separate instances of the SPF algorithm for
each area.
AS boundary routers (ASBR)  This class of router is located at the periphery of
an OSPF internetwork. It functions as a gateway
exchanging reachability information between the
OSPF network and other routing environments.
ASBRs are responsible for announcing AS external link advertisements through
the AS. We provide more information about external link advertisements in 5.6.4,
“OSPF route redistribution” on page 208.
Each router is assigned a 32-bit router ID (RID). The RID uniquely identifies the
device. One popular implementation assigns the RID from the lowest-numbered
IP address configured on the router.
Physical network types
OSPF categorizes network segments into three types. The frequency and types
of communication occurring between OSPF devices connected to these
networks are impacted by the network type:
• Point-to-point: Point-to-point networks directly link two routers.
• Multi-access: Multi-access networks support the attachment of more than two
routers. They are further subdivided into two types:
– Broadcast networks have the capability of simultaneously directing a
packet to all attached routers. This capability uses an address that is
recognized by all devices. Ethernet and token-ring LANs are examples of
OSPF broadcast multi-access networks.
– Non-broadcast networks do not have broadcasting capabilities. Each
packet must be specifically addressed to every router in the network. X.25
and frame relay networks are examples of OSPF non-broadcast
multi-access networks.
• Point-to-multipoint: Point-to-multipoint networks are a special case of
multi-access, non-broadcast networks. In a point-to-multipoint network, a
device is not required to have a direct connection to every other device. This
is known as a partially meshed environment.
Neighbor routers and adjacencies
Routers that share a common network segment establish a neighbor relationship
on the segment. Routers must agree on the following information to become
neighbors:
• Area ID: The routers must belong to the same OSPF area.
• Authentication: If authentication is defined, the routers must specify the same
password.
• Hello and dead intervals: The routers must specify the same timer intervals
used in the Hello protocol. We describe this protocol further in “OSPF packet
types” on page 203.
• Stub area flag: The routers must agree that the area is configured as a stub
area. We describe stub areas further in 5.6.5, “OSPF stub areas” on
page 210.
After two routers have become neighbors, an adjacency relationship can be
formed between the devices. Neighboring routers are considered adjacent when
they have synchronized their topology databases. This occurs through the
exchange of link state information.
Designated and backup designated router
The exchange of link state information between neighbors can create significant
quantities of network traffic. To reduce the total bandwidth required to
synchronize databases and advertise link state information, a router does not
necessarily develop adjacencies with every neighboring device:
򐂰 Multi-access networks: Adjacencies are formed between an individual router
and the (backup) designated router.
򐂰 Point-to-point networks: An adjacency is formed between both devices.
Each multi-access network elects a designated router (DR) and backup
designated router (BDR). The DR performs two key functions on the network
segment:
򐂰 It forms adjacencies with all routers on the multi-access network. This causes
the DR to become the focal point for forwarding LSAs.
򐂰 It generates network link advertisements listing each router connected to the
multi-access network. For additional information regarding network link
advertisements, see “Link state advertisements and flooding” on page 201.
The BDR forms the same adjacencies as the designated router. It assumes DR
functionality when the DR fails.
Each router is assigned an 8-bit priority, indicating its ability to be selected as the
DR or BDR. A router priority of zero indicates that the router is not eligible to be
selected. The priority is configured on each interface in the router.
TCP/IP Tutorial and Technical Overview
Figure 5-15 illustrates the relationship between neighbors. No adjacencies are
formed between routers that are not selected to be the DR or BDR.
Figure 5-15 Relationship between adjacencies and neighbors
Link state database
The link state database is also called the topology database. It contains the set of
link state advertisements describing the OSPF network and any external
connections. Each router within the area maintains an identical copy of the link
state database.
Note: RFC 2328 uses the term link state database in preference to topology
database. The former term describes the contents of the database; the latter
term is more descriptive of the database's purpose. This book has previously
used the term topology database for this reason. However, for the remainder
of the OSPF section, we refer to it as the
link state database.
Link state advertisements and flooding
The contents of an LSA describe an individual network component (that is,
router, segment, or external destination). LSAs are exchanged between adjacent
OSPF routers. This is done to synchronize the link state database on each
device.
When a router generates or modifies an LSA, it must communicate this change
throughout the network. The router starts this process by forwarding the LSA to
each adjacent device. Upon receipt of the LSA, these neighbors store the
information in their link state database and communicate the LSA to their
neighbors. This store and forward activity continues until all devices receive the
update. This process is called reliable flooding. Two steps are taken to ensure
that this flooding effectively transmits changes without overloading the network
with excessive quantities of LSA traffic:
򐂰 Each router stores the LSA for a period of time before propagating the
information to its neighbors. If, during that time, a new copy of the LSA
arrives, the router replaces the stored version. However, if the new copy is
outdated, it is discarded.
򐂰 To ensure reliability, each link state advertisement must be acknowledged.
Multiple acknowledgements can be grouped together into a single
acknowledgement packet. If an acknowledgement is not received, the original
link state update packet is retransmitted.
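The first of the two safeguards above can be sketched as a small store-and-flood decision. This is a hedged illustration only: newness is approximated here by a higher sequence number, whereas RFC 2328 also compares checksum and age, and the dictionary keys are invented for the example.

```python
# A sketch of the flooding decision: a received LSA replaces the stored
# copy only if it is newer (approximated by a higher sequence number).
def install_lsa(db, lsa):
    key = (lsa["type"], lsa["id"], lsa["adv_router"])
    stored = db.get(key)
    if stored is None or lsa["seq"] > stored["seq"]:
        db[key] = lsa   # newer instance: store it and flood onward
        return True
    return False        # outdated copy: discard, do not flood

db = {}
flood_first = install_lsa(db, {"type": 1, "id": "net-a", "adv_router": "r1", "seq": 5})
flood_stale = install_lsa(db, {"type": 1, "id": "net-a", "adv_router": "r1", "seq": 4})
```

The boolean return value stands in for the "propagate to neighbors" step of the reliable-flooding procedure.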
Link state advertisements contain five types of information. Together, these
advertisements provide the information needed to describe the entire
OSPF network and any external environments:
򐂰 Router LSAs: This type of advertisement describes the state of the router's
interfaces (links) within the area. They are generated by every OSPF router.
The advertisements are flooded throughout the area.
򐂰 Network LSAs: This type of advertisement lists the routers connected to a
multi-access network. They are generated by the DR on a multi-access
segment. The advertisements are flooded throughout the area.
򐂰 Summary LSAs (Type-3 and Type-4): This type of advertisement is generated
by an ABR. There are two types of summary link advertisements:
– Type-3 summary LSAs describe routes to destinations in other areas
within the OSPF network (inter-area destinations).
– Type-4 summary LSAs describe routes to ASBRs. Summary LSAs are
used to exchange reachability information between areas. Normally,
information is announced into the backbone area. The backbone then
injects this information into other areas.
򐂰 AS external LSAs: This type of advertisement describes routes to
destinations external to the OSPF network. They are generated by an ASBR.
The advertisements are flooded throughout all areas in the OSPF network.
Figure 5-16 illustrates the different types of link state advertisements.
[Figure content: router links are advertised by each router and describe the
state and cost of the router's links; network links are advertised by the
designated router and describe all routers attached to the segment; summary
links are advertised between an area (Area X) and the backbone (Area 0);
external links describe destinations outside the OSPF network.]
Figure 5-16 OSPF link state advertisements
OSPF packet types
OSPF packets are transmitted in IP datagrams. They are not encapsulated
within TCP or UDP packets. The IP header uses protocol identifier 89. OSPF
packets are sent with an IP ToS of 0 and an IP precedence of internetwork
control. This is used to obtain preferential processing for the packets. We discuss
ToS and IP precedence further in “Integrated Services” on page 288.
Wherever possible, OSPF uses multicast facilities to communicate with
neighboring devices. In broadcast and point-to-point environments, packets are
sent to the reserved multicast address 224.0.0.5, which RFC 2328 refers to as
the AllSPFRouters address. In non-broadcast environments, packets are addressed
to the neighbor’s specific IP address.
All OSPF packets share the common header shown in Figure 5-17. The header
provides general information including area identifier, RID, checksum, and
authentication information.
[Figure content: the header fields, in order, are: version (= 2 for OSPFv2);
packet type (1 = Hello, 2 = Database description, 3 = Link state request,
4 = Link state update, 5 = Link state acknowledgement); packet length;
router ID; area ID; checksum; authentication type (0 = no authentication,
1 = simple password); and authentication data (the password when type 1 is
selected).]
Figure 5-17 OSPF common header
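The common header in Figure 5-17 can be illustrated with a short parsing sketch. This is a hypothetical example assuming the 24-byte OSPFv2 header layout from RFC 2328; the sample values (RID 10.1.1.1, backbone area) are invented.

```python
import struct

OSPF_PROTOCOL_NUMBER = 89  # OSPF rides directly over IP, protocol 89

def parse_ospf_header(data):
    """Parse the 24-byte OSPFv2 common header (assumed RFC 2328 layout)."""
    version, pkt_type, length, router_id, area_id, checksum, au_type = \
        struct.unpack("!BBH4s4sHH", data[:16])
    return {
        "version": version,
        "type": pkt_type,             # 1 = hello ... 5 = link state ack
        "length": length,
        "router_id": ".".join(str(b) for b in router_id),
        "area_id": ".".join(str(b) for b in area_id),
        "checksum": checksum,
        "au_type": au_type,           # 0 = none, 1 = simple password
        "auth_data": data[16:24],     # 64 bits of authentication data
    }

# A hand-built hello header for illustration (RID 10.1.1.1, backbone area).
sample = struct.pack("!BBH4s4sHH8s", 2, 1, 44,
                     bytes([10, 1, 1, 1]), bytes(4), 0, 0, bytes(8))
header = parse_ospf_header(sample)
```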
The type field identifies the OSPF packet as one of five possible types:
Hello                        This packet type discovers and maintains neighbor
                             relationships.
Database description         This packet type describes the set of LSAs contained
                             in the router's link state database.
Link state request           This packet type requests a more current instance of
                             an LSA from a neighbor.
Link state update            This packet type provides a more current instance of
                             an LSA to a neighbor.
Link state acknowledgement   This packet type acknowledges receipt of a newly
                             received LSA.
We describe the use of these packets in the next section.
5.6.2 Neighbor communication
OSPF is responsible for determining the optimum set of paths through a network.
To accomplish this, each router exchanges LSAs with other routers in the
network. The OSPF protocol defines a number of activities to accomplish this
information exchange:
򐂰 Discovering neighbors
򐂰 Electing a designated router
򐂰 Establishing adjacencies and synchronizing databases
The five OSPF packet types are used to support these information exchanges.
Discovering neighbors: The OSPF Hello protocol
The Hello protocol discovers and maintains relationships with neighboring
routers. Hello packets are periodically sent out to each router interface. The
packet contains the RID of other routers whose hello packets have already been
received over the interface.
When a device sees its own RID in the hello packet generated by another router,
the two devices establish a neighbor relationship.
The hello packet also contains the router priority, DR identifier, and BDR
identifier. These parameters are used to elect the DR on multi-access networks.
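The two-way check described above is simple enough to express directly. A minimal sketch, with invented field names for the hello contents:

```python
# Bidirectional communication is established once a router finds its own
# RID listed among the neighbors carried in the other router's hello packet.
def hello_is_two_way(my_rid, received_hello):
    return my_rid in received_hello["seen_neighbors"]

hello_from_r2 = {"router_id": "2.2.2.2",
                 "seen_neighbors": ["1.1.1.1", "3.3.3.3"]}
```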
Electing a designated router
All multi-access networks must have a DR. A BDR can also be selected. The
backup ensures there is no extended loss of routing capability if the DR fails.
The DR and BDR are selected using information contained in hello packets. The
device with the highest OSPF router priority on a segment becomes the DR for
that segment. The same process is repeated to select the BDR. In case of a tie,
the router with the highest RID is selected. A router declared the DR is ineligible
to become the BDR.
After elected, the DR and BDR proceed to establish adjacencies with all routers
on the multi-access segment.
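The election rules above can be sketched as a tiny function. This is a deliberate simplification: the real RFC 2328 election also prefers the incumbent DR and BDR to avoid churn, which this sketch omits.

```python
# Simplified DR/BDR selection: highest priority wins, highest RID breaks
# ties, and priority 0 means the router is ineligible.
def elect_dr_bdr(routers):
    """routers: iterable of (priority, rid) pairs; rid as an integer."""
    eligible = sorted((r for r in routers if r[0] > 0), reverse=True)
    dr = eligible[0] if eligible else None
    bdr = eligible[1] if len(eligible) > 1 else None
    return dr, bdr

dr, bdr = elect_dr_bdr([(1, 3), (2, 1), (0, 9), (2, 5)])
```

Tuples sort first on priority and then on RID, so `reverse=True` yields exactly the "highest priority, then highest RID" ordering.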
Establishing adjacencies and synchronizing databases
Neighboring routers are considered adjacent when they have synchronized their
link state databases. A router does not develop an adjacency with every
neighboring device. On multi-access networks, adjacencies are formed only with
the DR and BDR. This is a two-step process.
Step 1: Database exchange process
The first phase of database synchronization is the database exchange process.
This occurs immediately after two neighbors attempt to establish an adjacency.
The process consists of an exchange of database description packets. The
packets contain a list of the LSAs stored in the local database.
During the database exchange process, the routers form a master/subordinate
relationship. The master is the first to transmit. Each packet is identified by a
sequence number. Using this sequence number, the subordinate acknowledges
each database description packet from the master. The subordinate also
includes its own set of link state headers in the acknowledgements.
Step 2: Database loading
During the database exchange process, each router notes the link state headers
for which the neighbor has a more current instance (all advertisements are time
stamped). After the process is complete, each router requests the more current
information from the neighbor. This request is made with a link state request
packet.
When a router receives a link state request, it must reply with a set of link state
update packets providing the requested LSA. Each transmitted LSA is
acknowledged by the receiver. This process is similar to the reliable flooding
procedure used to transmit topology changes throughout the network.
Every LSA contains an age field indicating the time in seconds since the origin of
the advertisement. The age continues to increase after the LSA is installed in the
topology database. It also increases during each hop of the flooding process.
When the maximum age is reached, the LSA is no longer used to determine
routing information and is discarded from the link state database. This age is also
used to distinguish between two otherwise identical copies of an advertisement.
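The aging behavior can be sketched with two small helpers. The MaxAge value of 3600 seconds is the RFC 2328 constant; the one-second per-hop increment is an assumed InfTransDelay for illustration.

```python
# LSA age grows while installed and on each flooding hop; at MaxAge the
# LSA is no longer used for routing. InfTransDelay is assumed to be 1 s.
MAX_AGE = 3600
INF_TRANS_DELAY = 1

def age_on_hop(age):
    """Increment the age as the LSA crosses one flooding hop."""
    return min(age + INF_TRANS_DELAY, MAX_AGE)

def is_usable(age):
    return age < MAX_AGE
```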
5.6.3 OSPF neighbor state machine
The OSPF specification defines a set of neighbor states and the events that can
cause a neighbor to transition from one state to another. A state machine is used
to describe these transitions:
򐂰 Down: This is the initial state. It indicates that no recent information has been
received from any device on the segment.
򐂰 Attempt: This state is used on non-broadcast networks. It indicates that a
neighbor appears to be inactive. Attempts continue to reestablish contact.
򐂰 Init: Communication with the neighbor has started, but bidirectional
communication has not been established. Specifically, a hello packet was
received from the neighbor, but the local router was not listed in the
neighbor's hello packet.
򐂰 2-way: Bidirectional communication between the two routers has been
established. Adjacencies can be formed. Neighbors are eligible to be elected
as designated routers.
򐂰 ExStart: The neighbors are starting to form an adjacency.
򐂰 Exchange: The two neighbors are exchanging their topology databases.
򐂰 Loading: The two neighbors are synchronizing their topology databases.
򐂰 Full: The two neighbors are fully adjacent and their databases are
synchronized.
Network events cause a neighbor’s OSPF state to change. For example, when a
router receives a hello packet from a neighboring device, the OSPF neighbor
state changes from Down to Init. When bidirectional communication has been
established, the neighbor state changes from Init to 2-Way. RFC 2328 contains a
complete description of the events causing a state change.
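The states and events above can be modeled as a transition table. This sketch covers only the "happy path" transitions named in the text; the event names follow RFC 2328 but the table is a partial, illustrative subset, not the full state machine.

```python
# A partial sketch of the OSPF neighbor state machine, keyed by
# (state, event). Pairs not listed leave the state unchanged.
TRANSITIONS = {
    ("Down", "HelloReceived"): "Init",
    ("Init", "2-WayReceived"): "2-Way",
    ("2-Way", "AdjOK?"): "ExStart",
    ("ExStart", "NegotiationDone"): "Exchange",
    ("Exchange", "ExchangeDone"): "Loading",
    ("Loading", "LoadingDone"): "Full",
}

def next_state(state, event):
    return TRANSITIONS.get((state, event), state)

# Walk a neighbor from Down all the way to Full.
state = "Down"
for event in ("HelloReceived", "2-WayReceived", "AdjOK?",
              "NegotiationDone", "ExchangeDone", "LoadingDone"):
    state = next_state(state, event)
```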
OSPF virtual links and transit areas
Virtual links are used when a network does not support the standard OSPF
network topology. This topology defines a backbone area that directly connects
to each additional OSPF area. The virtual link addresses two conditions:
򐂰 It can logically connect the backbone area when it is not contiguous.
򐂰 It can connect an area to the backbone when a direct connection does not
exist.
A virtual link is established between two ABRs sharing a common non-backbone
area. The link is treated as a point-to-point link. The common area is known as a
transit area. Figure 5-18 on page 208 illustrates the interaction between virtual
links and transit areas when used to connect an area to the backbone.
[Figure content: area 1 connects to the backbone (area 0) through a virtual
link crossing transit area 2.]
Figure 5-18 OSPF virtual link and transit areas
This diagram shows that area 1 does not have a direct connection to the
backbone. Area 2 can be used as a transit area to provide this connection. A
virtual link is established between the two ABRs located in area 2. Establishing
this virtual link logically extends the backbone area to connect to area 1.
A virtual link is used only to transmit routing information. It does not carry regular
traffic between the remote area and the backbone. This traffic, in addition to the
virtual link traffic, is routed using the standard intra-area routing within the transit
area.
5.6.4 OSPF route redistribution
Route redistribution is the process of introducing external routes into an OSPF
network. These routes can be either static routes or routes learned through
another routing protocol. They are advertised into the OSPF network by an
ASBR. These routes become OSPF external routes. The ASBR advertises these
routes by flooding OSPF AS external LSAs throughout the entire OSPF network.
The routes describe an end-to-end path consisting of two portions:
򐂰 External portion: This is the portion of the path external to the OSPF network.
When these routes are distributed into OSPF, the ASBR assigns an initial
cost. This cost represents the external cost associated with traversing the
external portion of the path.
򐂰 Internal portion: This is the portion of the path internal to the OSPF network.
Costs for this portion of the network are calculated using standard OSPF
metrics.
OSPF differentiates between two types of external routes. They differ in the way
the cost of the route is calculated. The ASBR is configured to redistribute the
route as:
򐂰 External type 1: The total cost of the route is the sum of the external cost and
any internal OSPF costs.
򐂰 External type 2: The total cost of the route is always the external cost. This
ignores any internal OSPF costs required to reach the ASBR.
Figure 5-19 illustrates an example of the types of OSPF external routes.
[Figure content: the ASBR redistributes a route with external cost 50. R1
reaches the ASBR at internal cost 10 and R2 at internal cost 15. R1 routing
table: E1 cost 60, E2 cost 50. R2 routing table: E1 cost 65, E2 cost 50.]
Figure 5-19 OSPF route redistribution
In this example, the ASBR is redistributing a route into the OSPF network for a
subnet located within the RIP network. The route is announced
into OSPF with an external cost of 50. This represents the cost for the portion of
the path traversing the RIP network:
򐂰 If the ASBR redistributed the route as an E1 route, R1 will contain an external
route to this subnet with a cost of 60 (50 + 10). R2 will have an external route
with a cost of 65 (50 + 15).
򐂰 If the ASBR redistributed the route as an E2 route, both R1 and R2 will
contain an external route to this subnet with a cost of 50. Any costs
associated with traversing segments within the OSPF network are not
included in the total cost to reach the destination.
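The two cost rules reduce to one line of arithmetic each. A sketch using the numbers from the example above:

```python
# E1 adds the internal OSPF cost to reach the ASBR to the external cost;
# E2 uses the external cost alone.
def external_route_cost(route_type, external_cost, internal_cost):
    if route_type == "E1":
        return external_cost + internal_cost
    return external_cost  # E2 ignores the internal cost

r1_e1 = external_route_cost("E1", 50, 10)  # R1's cost to the ASBR is 10
r2_e1 = external_route_cost("E1", 50, 15)  # R2's cost to the ASBR is 15
```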
5.6.5 OSPF stub areas
OSPF allows certain areas to be defined as a stub area. A stub area is created
when the ABR connecting to a stub area excludes AS external LSAs from being
flooded into the area. This is done to reduce the size of the link state database
maintained within the stub area routers. Because there are no specific routes to
external networks, routing to these destinations is based on a default route
generated by the ABR. The link state databases maintained within the stub area
contain only the default route and the routes from within the OSPF environment
(for example, intra-area and inter-area routes).
Because a stub area does not allow external LSAs, a stub area cannot contain
an ASBR. No external routes can be generated from within the stub area.
Stub areas can be deployed when there is a single exit point connecting the area
to the backbone. An area with multiple exit points can also be a stub area.
However, there is no guarantee that packets exiting the area will follow an
optimal path. This is because each ABR generates a default route, and there is
no way to associate traffic with a specific default route.
All routers within the area must be configured as stub routers. This configuration
is verified through the exchange of hello packets.
Not-so-stubby areas
An extension to the stub area concept is the not-so-stubby area (NSSA). This
alternative is documented in RFC 3101. An NSSA is similar to a stub area in that
the ABR servicing the NSSA does not flood any external routes into the NSSA.
The only routes flooded into the NSSA are the default route and any other routes
from within the OSPF environment (for example, intra-area and inter-area).
However, unlike a stub area, an ASBR can be located within an NSSA. This
ASBR can generate external routes. Therefore, the link state databases
maintained within the NSSA contain the default route, routes from within the
OSPF environment (for example, intra-area and inter-area routes), and the
external routes generated by the ASBR within the area.
The ABR servicing the NSSA floods the external routes from within the NSSA
throughout the rest of the OSPF network.
5.6.6 OSPF route summarization
Route summarization is the process of consolidating multiple contiguous routing
entries into a single advertisement. This reduces the size of the link state
database and the IP routing table. In an OSPF network, summarization is
performed at a border router. There are two types of summarization:
򐂰 Inter-area route summarization: Inter-area summarization is performed by the
ABR for an area. It is used to summarize route advertisements originating
within the area. The summarized route is announced into the backbone.
The backbone receives the aggregated route and announces the summary
into other areas.
򐂰 External route summarization: This type of summarization applies specifically
to external routes injected into OSPF. This is performed by the ASBR
distributing the routes into the OSPF network. Figure 5-20 illustrates an
example of OSPF route summarization.
[Figure content: an ASBR advertises an external summary route into OSPF
area 2; the ABR for area 1 advertises an inter-area summary route into the
backbone (area 0).]
Figure 5-20 OSPF route summarization
In this figure, the ASBR is advertising a single summary route for the 64
subnetworks located in the RIP environment. This single summary route is
flooded throughout the entire OSPF network. In addition, the ABR is generating a
single summary route for the 64 subnetworks located in area 1. This summary
route is flooded through area 0 and area 2. Depending on the configuration of the
ASBR, the inter-area summary route can also be redistributed into the RIP
network.
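The arithmetic behind consolidating 64 contiguous subnets can be checked with the standard library. The 10.1.x.0 prefixes below are assumed for illustration; the figure does not give actual addresses.

```python
# 64 contiguous /24 subnets collapse into a single /18 summary route,
# because 2**(24 - 18) = 64.
import ipaddress

subnets = [ipaddress.ip_network(f"10.1.{i}.0/24") for i in range(64)]
summary = list(ipaddress.collapse_addresses(subnets))
```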
5.7 Enhanced Interior Gateway Routing Protocol
The Enhanced Interior Gateway Routing Protocol (EIGRP) is categorized as a
hybrid routing protocol. Similar to a distance vector algorithm, EIGRP uses
metrics to determine network paths. However, like a link state protocol, topology
updates in an EIGRP environment are event driven.
EIGRP, as the name implies, is an interior gateway protocol designed for use
within an AS. In properly designed networks, EIGRP has the potential for
improved scalability and faster convergence over standard distance vector
algorithms. EIGRP is also better positioned to support complex, highly redundant
networks.
EIGRP is a proprietary protocol developed by Cisco Systems, Inc. At the time of
this writing, it is not an IETF standard protocol.
5.7.1 Features of EIGRP
EIGRP has several capabilities. Some of these capabilities are also available in
distance vector or link state algorithms.
򐂰 EIGRP maintains a list of alternate routes that can be used if a preferred path
fails. When the path fails, the new route is immediately installed in the IP
routing table. No route recomputation is performed.
򐂰 EIGRP allows partial routing updates. When EIGRP discovers a neighboring
router, each device exchanges its entire routing table. After the initial
information exchange, only routing table changes are propagated. There is no
periodic rebroadcasting of the entire routing table.
򐂰 EIGRP uses a low amount of bandwidth. During normal network operations,
only hello packets are transmitted through a stable network.
򐂰 EIGRP supports supernetting (CIDR) and variable length subnet masks
(VLSM). This enables the network administrator to efficiently allocate IP
address resources.
򐂰 EIGRP supports the ability to summarize routing announcements. This limits
the advertisement of unnecessary subnet information.
򐂰 EIGRP can provide network layer routing for multiple protocols, such as
AppleTalk, IPX, and IP.
򐂰 EIGRP supports the simultaneous use of multiple unequal cost paths to a
destination. Each route is installed in the IP routing table. EIGRP also
intelligently load balances traffic over the multiple paths.
򐂰 EIGRP uses a topology table to install routes into the IP routing table. The
topology table lists all destination networks currently advertised by
neighboring routers. The table contains all the information needed to build a
set of distances and vectors to each destination.
򐂰 EIGRP maintains a table to track the state of each adjacent neighbor. This is
called a neighbor table.
򐂰 EIGRP can guarantee the ordered delivery of packets to a neighbor.
However, not all types of packets must be reliably transmitted. For example,
in a network that supports multicasting, there is no need to send individual,
acknowledged hello packets to each neighbor. To provide efficient operation,
reliability is provided only when needed. This improves convergence time in
networks containing varying speed connections.
Neighbor discovery and recovery
EIGRP can dynamically learn about other routers on directly attached networks.
This is similar to the Hello protocol used for neighbor discovery in an OSPF
network.
Devices in an EIGRP network exchange hello packets to verify each neighbor is
operational. Like OSPF, the frequency used to exchange packets is based on the
network type. Packets are exchanged at a five-second interval on high-bandwidth
links (for example, LAN segments). On lower-bandwidth connections, hello
packets are exchanged every 60 seconds.
Also like OSPF, EIGRP uses a hold timer to remove inactive neighbors. This
timer indicates the amount of time that a device will continue to consider a
neighbor active without receiving a hello packet from the neighbor.
EIGRP routing algorithm
EIGRP does not rely on periodic updates to converge on the topology. Instead, it
builds a topology table containing each of its neighbor’s advertisements. Unlike a
distance vector protocol, this data is not discarded.
EIGRP processes the information in the topology table to determine the best
paths to each destination network. EIGRP implements an algorithm known as
Diffusing Update ALgorithm (DUAL).
Route recomputation
For a specific destination, the successor is the neighbor router currently used for
packet forwarding. This device has the least-cost path to the destination and is
guaranteed not to be participating in a routing loop. A feasible successor
assumes forwarding responsibility when the current successor router fails. The
set of feasible successors represent the devices that can become a successor
without requiring a route recomputation or introducing routing loops.
A route recomputation occurs when there is no known feasible successor to the
destination. The process starts with a router sending a multicast query packet to
determine if any neighbor is aware of a feasible successor to the destination. A
neighbor replies if it has a feasible successor.
If the neighbor does not have a feasible successor, the neighbor can return a
query indicating it also is performing a route recomputation. When the link to a
neighbor fails, all routes that used that neighbor as the only feasible successor
require a route recomputation.
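The successor and feasible-successor bookkeeping can be sketched as follows. This is a hedged illustration of DUAL's feasibility condition (reported distance strictly below the current feasible distance); the neighbor names and metric values are invented.

```python
# The successor is the least-cost neighbor; a feasible successor is any
# other neighbor whose reported distance is strictly below the current
# feasible distance (DUAL's loop-freedom test), so it can take over
# without triggering a route recomputation.
def successor_and_feasible(neighbors, feasible_distance):
    """neighbors: {name: (reported_distance, total_cost_via_neighbor)}."""
    successor = min(neighbors, key=lambda n: neighbors[n][1])
    feasible = [n for n, (rd, _) in neighbors.items()
                if n != successor and rd < feasible_distance]
    return successor, feasible

succ, feas = successor_and_feasible(
    {"r-a": (10, 20), "r-b": (15, 25), "r-c": (30, 35)},
    feasible_distance=20)
```

Here `r-b` qualifies as a feasible successor (reported distance 15 < 20), while `r-c` does not, so losing `r-a` would not force a recomputation but losing both `r-a` and `r-b` would.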
5.7.2 EIGRP packet types
EIGRP uses five types of packets to establish neighbor relationships and
advertise routing information:
򐂰 Hello/acknowledgement: These packets are used for neighbor discovery.
They are multicast on each network segment. Unicast responses
to the hello packet are returned. A hello packet without any data is considered
an acknowledgement.
򐂰 Updates: These packets are used to convey reachability information for each
destination. When a new neighbor is discovered, unicast update packets are
exchanged to allow each neighbor to build their topology table. Other types of
advertisements (for example, metric changes) use multicast packets. Update
packets are always transmitted reliably.
򐂰 Queries and replies: These packets are exchanged when a destination enters
an active state. A multicast query packet is sent to determine if any neighbor
contains a feasible successor to the destination. Unicast reply packets are
sent to indicate that the neighbor does not need to go into an active state
because a feasible successor has been identified. Query and reply packets
are transmitted reliably.
򐂰 Request: These packets are used to obtain specific information from a
neighbor. These packets are used in route server applications.
5.8 Exterior Gateway Protocol (EGP)
EGP is an exterior gateway protocol of historical interest. It was one of the first
protocols developed for communication between autonomous systems. It is
described in RFC 904.
EGP assumes the network contains a single backbone and a single path exists
between any two autonomous systems. Due to this limitation, the current use of
EGP is minimal. In practice, EGP has been replaced by BGP.
EGP is based on periodic polling using a hello/I-hear-you message exchange.
These are used to monitor neighbor reachability and solicit update responses.
The gateway connecting to an AS is permitted to advertise only those destination
networks reachable within the local AS. It does not advertise reachability
information about its EGP neighbors outside the AS.
5.9 Border Gateway Protocol (BGP)
The Border Gateway Protocol (BGP) is an exterior gateway protocol. It was
originally developed to provide a loop-free method of exchanging routing
information between autonomous systems. BGP has since evolved to support
aggregation and summarization of routing information.
BGP is an IETF draft standard protocol described in RFC 4271. The version
described in this RFC is BGP Version 4. Following standard convention, this
document uses the term BGP when referencing BGP Version 4.
5.9.1 BGP concepts and terminology
BGP uses specific terminology to describe the operation of the protocol.
Figure 5-21 illustrates this terminology.
[Figure content: BGP speakers in several autonomous systems, connected by
IBGP sessions within an AS and EBGP sessions between autonomous systems.]
Figure 5-21 Components of a BGP network
BGP uses the following terms:
򐂰 BGP speaker: A router configured to support BGP.
򐂰 BGP neighbors (peers): A pair of BGP speakers that exchange routing
information. There are two types of BGP neighbors:
– Internal (IBGP) neighbor: A pair of BGP speakers within the same AS.
– External (EBGP) neighbor: A pair of BGP neighbors, each in a different
AS. These neighbors typically share a directly connected network.
򐂰 BGP session: A TCP session connecting two BGP neighbors. The session is
used to exchange routing information. The neighbors monitor the state of the
session by sending keepalive messages.1
򐂰 Traffic type: BGP defines two types of traffic:
– Local: Traffic local to an AS either originates or terminates within the AS.
Either the source or the destination IP address resides in the AS.
– Transit: Any traffic that is not local traffic is transit traffic. One of the goals
of BGP is to minimize the amount of transit traffic.
򐂰 AS type: BGP defines three types of autonomous systems:
– Stub: A stub AS has a single connection to one other AS. A stub AS
carries only local traffic.
– Multihomed: A multihomed AS has connections to two or more
autonomous systems. However, a multihomed AS has been configured so
that it does not forward transit traffic.
– Transit: A transit AS has connections to two or more autonomous systems
and carries both local and transit traffic. The AS can impose policy
restrictions on the types of transit traffic that will be forwarded.
Depending on the configuration of the BGP devices within AS 2 in Figure 5-21
on page 216, this autonomous system can be either a multihomed AS or a
transit AS.
򐂰 AS number: A 16-bit number uniquely identifying an AS.
򐂰 AS path: A list of AS numbers describing a route through the network. A BGP
neighbor communicates paths to its peers.
򐂰 Routing policy: A set of rules constraining the flow of data packets through the
network. Routing policies are not defined in the BGP protocol. Rather, they
are used to configure a BGP device. For example, a BGP device can be
configured so that:
– A multihomed AS can refuse to act as a transit AS. This is accomplished
by advertising only those networks contained within the AS.
– A multihomed AS can perform transit AS routing for a restricted set of
adjacent autonomous systems. It does this by tailoring the routing
advertisements sent to EBGP peers.
– An AS can optimize traffic to use a specific AS path for certain categories
of traffic.
򐂰 Network layer reachability information (NLRI): NLRI is used by BGP to
advertise routes. It consists of a set of networks represented by the tuple
<length,prefix>. For example, a tuple with a length of 14 represents a
CIDR route with a 14-bit prefix.
1 This keepalive message is implemented in the application layer. It is independent of the keepalive
message available in many TCP implementations.
򐂰 Routes and paths: A route associates a destination with a collection of
attributes describing the path to the destination. The destination is specified in
NLRI format. The path is reported as a collection of path attributes. This
information is advertised in UPDATE messages. For additional information
describing the UPDATE message, see 5.9.3, “Protocol description” on
page 220.
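The <length, prefix> tuple maps mechanically onto CIDR notation. As an illustrative sketch (not from the original text; the helper name is ours), Python's `ipaddress` module can expand such a tuple:

```python
import ipaddress

def nlri_to_network(length: int, prefix: bytes) -> ipaddress.IPv4Network:
    """Expand a BGP NLRI <length, prefix> tuple into a CIDR network.
    On the wire, only ceil(length / 8) octets of the prefix are carried,
    so the prefix is padded with zero octets before decoding."""
    padded = prefix + b"\x00" * (4 - len(prefix))
    addr = ipaddress.IPv4Address(int.from_bytes(padded, "big"))
    return ipaddress.ip_network(f"{addr}/{length}")

# A /14 route needs only ceil(14 / 8) = 2 prefix octets on the wire:
print(nlri_to_network(14, b"\x0a\x18"))   # 10.24.0.0/14
```

The compact encoding is why NLRI scales well: a single advertisement carries only the bits that matter for the prefix length.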
5.9.2 IBGP and EBGP communication
BGP does not replace the IGP operating within an AS. Instead, it cooperates with
the IGP to establish communication between autonomous systems. BGP within
an AS is used to advertise the local IGP routes. These routes are advertised to
BGP peers in other autonomous systems. Figure 5-22 on page 219 illustrates
the communication that occurs between BGP peers. This example shows four
autonomous systems. AS 2, AS 3, and AS 4 each have an EBGP connection to
AS 1. A full mesh of IBGP sessions exists between BGP devices within AS 1.
A network is located within AS 3. Using BGP, the existence of this
network is advertised to the rest of the environment:
򐂰 R4 in AS 3 uses its EBGP connection to announce the network to AS 1.
򐂰 R1 in AS 1 uses its IBGP connections to announce the network to R2 and R3.
򐂰 R2 in AS 1 uses its EBGP session to announce the network into AS 2. R3 in
AS 1 uses its EBGP session to announce the network into AS 4.
TCP/IP Tutorial and Technical Overview
Figure 5-22 EBGP and IBGP communication
Several additional operational issues are shown in Figure 5-22:
򐂰 Role of BGP and the IGP: The diagram shows that while BGP alone carries
information between autonomous systems, both BGP and the IGP are used
to carry information through an AS.
򐂰 Establishing the TCP session between peers: Before establishing a BGP
session, a device verifies that routing information is available to reach the
peer:
– EBGP peers: EBGP peers typically share a directly connected network.
The routing information needed to exchange BGP packets between these
peers is trivial.
– IBGP peers: IBGP peers can be located anywhere within the AS. They do
not need to be directly connected. BGP relies on the IGP to locate a peer.
Packet forwarding between IBGP peers uses IGP-learned routes.
򐂰 Full mesh of BGP sessions within an AS: IBGP speakers assume that a full
mesh of BGP sessions has been established between peers in the same AS. In
Figure 5-22 on page 219, all three BGP peers in AS 1 are interconnected with
BGP sessions.
When a BGP speaker receives a route update from an IBGP peer, the receiving
speaker uses EBGP to propagate the update to external peers. Because the
receiving speaker assumes that a full mesh of IBGP sessions has been
established, it does not propagate the update to other IBGP peers.
For example, assume that there was no IBGP session between R1 and R3 in
Figure 5-22 on page 219. R1 receives the update from AS 3. R1 forwards the
update to its BGP peers, namely R2. R2 receives the IBGP update and forwards
it to its EBGP peers, namely R6. No update is sent to R3. If R3 needs to receive
this information, R1 and R3 must be configured to be BGP peers.
5.9.3 Protocol description
BGP establishes a reliable TCP connection between peers. Sessions are
established using TCP port 179. BGP assumes the transport connection will
manage fragmentation, retransmission, acknowledgement, and sequencing.
When two speakers initially form a BGP session, they exchange their entire
routing table. This routing information contains the complete AS path used to
reach each destination. The information avoids the routing loops and
counting-to-infinity behavior observed in RIP networks. After the entire table has
been exchanged, changes to the table are communicated as incremental updates.
BGP packet types
All BGP packets contain a standard header. The header specifies the BGP
packet type. The valid BGP packet types include:
򐂰 OPEN: This message type establishes a BGP session between two peer
nodes.
򐂰 UPDATE: This message type transfers routing information between BGP
peers.
򐂰 NOTIFICATION: This message is sent when an error condition is detected.
򐂰 KEEPALIVE: This message determines whether peers are reachable.
Note: RFC 1771 uses uppercase to name BGP messages. The same convention
is used in this section.
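The standard header defined in RFC 4271 consists of a 16-octet marker, a 2-octet length, and a 1-octet type. The following Python fragment is our own sketch of decoding that header (the function and constant names are illustrative):

```python
import struct

HEADER_LEN = 19  # 16-octet marker + 2-octet length + 1-octet type
TYPE_NAMES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}

def parse_header(data: bytes):
    """Decode the standard header that precedes every BGP message."""
    marker, length, msg_type = struct.unpack("!16sHB", data[:HEADER_LEN])
    if not HEADER_LEN <= length <= 4096:
        raise ValueError("invalid BGP message length")
    return TYPE_NAMES.get(msg_type, "UNKNOWN"), length

# A KEEPALIVE is simply the bare 19-octet header with an all-ones marker:
keepalive = b"\xff" * 16 + struct.pack("!HB", HEADER_LEN, 4)
print(parse_header(keepalive))   # ('KEEPALIVE', 19)
```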
Figure 5-23 shows the flow of these message types between two autonomous
systems.
Figure 5-23 BGP message flows between BGP speakers
Opening and confirming a BGP connection
After a TCP session has been established between two peer nodes, each router
sends an OPEN message to the neighbor. The OPEN message includes:
򐂰 The originating router's AS number and BGP router identifier.
򐂰 A suggested value for the hold timer. We discuss the function of this timer in
the next section.
򐂰 Optional parameters. This information is used to authenticate a peer.
An OPEN message contains support for authenticating the identity of a BGP
peer. However, the BGP standard does not specify a specific authorization
mechanism. This allows BGP peers to select any supported authorization
mechanism.
An OPEN message is acknowledged by a KEEPALIVE message. After peer
routers have established a BGP connection, they can exchange additional
messages.
Maintaining the BGP connection
BGP does not use any transport-based keepalive to determine if peers are
reachable. Instead, BGP messages are periodically exchanged between peers.
If no messages are received from the peer for the duration specified by the hold
timer, the originating router assumes that an error has occurred. When this
happens, an error notification is sent to the peer and the connection is closed.
RFC 4271 recommends a 90 second hold timer and a 30 second keepalive
interval.
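The hold-timer bookkeeping can be sketched as follows. The timer values follow RFC 4271's suggestions, while the class itself is our own illustration (timestamps are passed in explicitly so the logic stays testable):

```python
HOLD_TIME = 90.0                      # seconds, suggested by RFC 4271
KEEPALIVE_INTERVAL = HOLD_TIME / 3    # 30 seconds

class PeerLiveness:
    """Track when the last message (KEEPALIVE or UPDATE) arrived from a peer."""

    def __init__(self, now: float = 0.0):
        self.last_received = now

    def message_received(self, now: float) -> None:
        self.last_received = now

    def hold_timer_expired(self, now: float) -> bool:
        # If nothing arrives within the hold time, the peer is assumed down:
        # a NOTIFICATION is sent and the connection is closed.
        return now - self.last_received > HOLD_TIME
```

Sending a keepalive every third of the hold time gives each peer several chances to refresh the timer before the session is declared dead.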
Sending reachability information
Reachability information is exchanged between peers in UPDATE messages.
BGP does not require a periodic refresh of the entire BGP routing table.
Therefore, each BGP speaker must retain a copy of the current BGP routing
table used by each peer. This information is maintained for the duration of the
connection. After neighbors have performed the initial exchange of complete
routing information, only incremental updates to that information are exchanged.
An UPDATE message is used to advertise feasible routes or withdraw infeasible
routes. The message can simultaneously advertise a feasible route and withdraw
multiple infeasible routes from service. Figure 5-24 depicts the format of an
UPDATE message:
򐂰 Network layer reachability information (NLRI).
򐂰 Path attributes (we discuss path attributes in “Path attributes” on page 223).
򐂰 Withdrawn routes.
Common Header (Type = 2)
Unfeasible Routes Length
Withdrawn Routes
Total Path Attribute Length
Path Attributes
Network Layer Reachability Information
Figure 5-24 BGP UPDATE message
Several path attributes can be used to describe a route.
Withdrawn routes
The unfeasible routes length field indicates the total length of the withdrawn
routes field.
The withdrawn routes field provides a list of IP address prefixes that are not
feasible or are no longer in service. These addresses need to be withdrawn from
the BGP routing table. The withdrawn routes are represented in the same
tuple-format as the NLRI.
Notification of error conditions
A BGP device can observe error conditions impacting the connection to a peer.
NOTIFICATION messages are sent to the neighbor when these conditions are
detected. After the message is sent, the BGP transport connection is closed.
This means that all resources for the BGP connection are deallocated. The
routing table entries associated with the remote peer are marked as invalid.
Finally, other peers are notified that these routes are invalid.
Notification messages include an error code and an error subcode. The error
codes provided by BGP include:
򐂰 Message header error
򐂰 OPEN message error
򐂰 UPDATE message error
򐂰 Hold timer expired
򐂰 Finite state machine error
򐂰 Cease
The error subcode further qualifies the specific error. Each error code can have
multiple subcodes associated with it.
5.9.4 Path selection
BGP is a path vector protocol. In path vector routing, the path is expressed in
terms of the domains (or confederations) traversed so far. The best path is
obtained by comparing the number of domains of each feasible route. However,
inter-AS routing complicates this process. There are no universally agreed-upon
metrics that can be used to evaluate external paths. Each AS has its own set of
criteria for path evaluation.
Path attributes
Path attributes are used to describe and evaluate a route. Peers exchange path
attributes along with other routing information. When a device advertises a route,
it can add or modify the path attributes before advertising the route to a peer. The
combination of attributes are used to select the best path.
Each path attribute is placed into one of four separate categories:
򐂰 Well-known mandatory: The attribute must be recognized by all BGP
implementations. It must be sent in every UPDATE message.
򐂰 Well-known discretionary: The attribute must be recognized by all BGP
implementations. However, it is not required to be sent in every UPDATE
message.
򐂰 Optional transitive: It is not required that every BGP implementation
recognize this type of attribute. A path with an unrecognized optional
transitive attribute is accepted and simply forwarded to other BGP peers.
򐂰 Optional non-transitive: It is not required that every BGP implementation
recognize this type of attribute. These attributes can be ignored and not
passed along to other BGP peers.
BGP defines seven attribute types to define an advertised route:
򐂰 ORIGIN: This attribute defines the origin of the path information. Valid
selections are IGP (interior to the AS), EGP, or INCOMPLETE. This is a
well-known mandatory attribute.
򐂰 AS_PATH: This attribute defines the set of autonomous systems that must be
traversed to reach the advertised network. Each BGP device prepends its AS
number onto the AS path sequence before sending the routing information to
an EBGP peer. Using the sample network depicted in Figure 5-22 on
page 219, R4 advertises a network with an AS_PATH of <3>. When the
update traverses AS 1, R2 prepends its own AS number to it. When the
routing update reaches R6, the AS_PATH attribute for the network is <1
3>. This is a well-known mandatory attribute.
򐂰 NEXT_HOP: This attribute defines the IP address of the next hop used to
reach the destination. This is a well-known mandatory attribute.
For routing updates received over EBGP connections, the next hop is
typically the IP address of the EBGP neighbor in the remote AS. BGP
specifies that this next hop is passed without modification to each IBGP
neighbor. As a result, each IBGP neighbor must have a route to reach the
neighbor in the remote AS. Figure 5-25 on page 225 illustrates this behavior.
Figure 5-25 NEXT_HOP attribute
In this example, when a routing update for a network in AS 3 is sent, R1
receives the update with the NEXT_HOP attribute set to the address of the
EBGP neighbor in AS 3. When this update is forwarded to R3, the next hop
address remains unchanged. R3 must have appropriate routing information to
reach this address. Otherwise, R3 will drop packets destined for AS 3
because the next hop is unreachable.
򐂰 MULTI_EXIT_DISC (multi-exit discriminator, MED): This attribute is used to
discriminate among multiple exit points to a neighboring AS. If this information
is received from an EBGP peer, it is propagated to each IBGP peer. This
attribute is not propagated to peers in other autonomous systems. If all other
attributes are equal, the exit point with the lowest MED value is preferred.
This is an optional non-transitive attribute. MED is discussed further in RFC
4451.
򐂰 LOCAL_PREF (local preference): This attribute is used by a BGP speaker to
inform other speakers within the AS of the originating speaker's degree of
preference for the advertised route. Unlike MED, this attribute is used only
within an AS. The value of the local preference is not distributed outside an
AS. If all other attributes are equal, the route with the higher degree of
preference is preferred. This is a well-known discretionary attribute.
򐂰 ATOMIC_AGGREGATE: This attribute is used when a BGP peer receives
advertisements for the same destination identified in multiple, non-matching
routes (that is, overlapping routes). One route describes a smaller set of
destinations (a more specific prefix), other routes describe a larger set of
destinations (a less specific prefix). This attribute is used by the BGP speaker
to inform peers that it has selected the less specific route without selecting
the more specific route. This is a well-known discretionary attribute. A route
with this attribute included may actually traverse autonomous systems not
listed in the AS_PATH.
򐂰 AGGREGATOR: This attribute indicates the last AS number that formed the
aggregate route, followed by the IP address of the BGP speaker that formed
the aggregate route. For further information about route aggregation, refer to
5.9.6, “BGP aggregation” on page 228. This is an optional transitive attribute.
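The AS_PATH mechanics described above can be sketched in a few lines of Python (our own illustration, not from the text): each speaker prepends its AS number before an EBGP advertisement, and a receiver rejects any path already containing its own AS number, which is how BGP avoids inter-AS routing loops:

```python
def advertise_ebgp(as_path, local_as):
    """Prepend the local AS number before sending to an EBGP peer."""
    return [local_as] + as_path

def accept_route(as_path, local_as):
    """Reject a route whose AS_PATH already contains our own AS (a loop)."""
    return local_as not in as_path

path = [3]                         # R4 in AS 3 originates the route
path = advertise_ebgp(path, 1)     # re-advertised by AS 1 toward AS 2
print(path)                        # [1, 3]
print(accept_route(path, 3))       # False: AS 3 would see itself in the path
```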
Decision process
The process to select the best path uses the path attributes describing each
route. The attributes are analyzed and a degree of preference is assigned.
Because there can be multiple paths to a given destination, the route selection
process determines the degree of preference for each feasible route. The path
with the highest degree of preference is selected as the best path. This is the
path advertised to each BGP neighbor. Route aggregation can also be
performed during this process. Where there are multiple paths to a destination,
BGP tracks each individual path. This allows faster convergence to the alternate
path when the primary path fails.
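A drastically simplified sketch of such a degree-of-preference comparison follows. This is our own illustration, not the normative decision process; real implementations apply many more tie-breakers:

```python
def best_route(routes):
    """Pick the most preferred route from a list of candidate dicts.
    Higher LOCAL_PREF wins; then shorter AS_PATH; then lower MED."""
    return min(routes, key=lambda r: (-r["local_pref"],
                                      len(r["as_path"]),
                                      r["med"]))

candidates = [
    {"peer": "R2", "local_pref": 100, "as_path": [2, 3], "med": 0},
    {"peer": "R3", "local_pref": 100, "as_path": [4, 2, 3], "med": 0},
    {"peer": "R4", "local_pref": 200, "as_path": [5, 2, 3], "med": 0},
]
print(best_route(candidates)["peer"])   # R4: highest LOCAL_PREF wins
```

Note that the losing candidates are retained, which is what allows the fast switch to an alternate path when the primary fails.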
5.9.5 BGP synchronization
Figure 5-26 on page 227 shows an example of an AS providing transit service. In
this example, AS 1 is used to transport traffic between AS 3 and AS 4. Within
AS 1, R2 is not configured for BGP. However, R2 is used for communication
between R1 and R3. Traffic between these two BGP nodes physically traverses
through R2.
Using the routing update flow described earlier, the AS 3 network is
advertised using the EBGP connection between R4 and R1. R1 passes the
network advertisement to R3 using its existing IBGP connection. Because R2 is
not configured for BGP, it is unaware of any networks in AS 3. A problem occurs
if R3 needs to communicate with a device in AS 3. R3 passes the traffic to R2.
However, because R2 does not have any routes to AS 3 networks, the traffic is
discarded. If R3 advertises the network to AS 4, the problem continues. If AS 4
needs to communicate with a device in AS 3, the packets are forwarded from R5
to R3. R3 forwards the packets to R2 where they are discarded.
Figure 5-26 BGP synchronization
This situation is addressed by the synchronization rule of BGP. The rule states
that a transit AS will not advertise a route before all routers within the AS have
learned about the route. In this example, R3 will not advertise the existence of
the networks in AS 3 until R2 has built a proper routing table.
There are three methods to implement the synchronization rule:
򐂰 Enable BGP on all devices within the transit AS. In this solution, R2 has an
IBGP session with both R1 and R3. R2 learns of the network at the
same time it is advertised to R3. At that time, R3 announces the routes to its
peer in AS 4.
򐂰 Redistribute the routes into the IGP used within the transit area. In this
solution, R1 redistributes the network into the IGP within AS 1. R3
learns of the network through two routing protocols: BGP and the IGP. After
R3 learns of the network through the IGP, it is certain that other routers within
the AS have also learned of the routes. At that time, R3 announces the routes
to its peer in AS 4.
򐂰 Encapsulate the transit traffic across the AS. In this solution, transit traffic is
encapsulated within IP datagrams addressed to the exit gateway. Because
this does not require the IGP to carry exterior routing information, no
synchronization is required between BGP and the IGP. R3 can immediately
announce the routes to its peer in AS 4.
5.9.6 BGP aggregation
The major improvement introduced in BGP Version 4 was support for CIDR and
route aggregation. These features allow BGP peers to consolidate multiple
contiguous routing entries into a single advertisement. This significantly enhances
the scalability of BGP in large internetworking environments. Figure 5-27 on
page 229 illustrates these functions.
Figure 5-27 BGP route aggregation
This diagram depicts three autonomous systems interconnected by BGP. In this
example, a set of contiguous networks is located within AS 3.
To reduce the size of routing announcements, R4 aggregates these individual
networks into a single route entry prior to advertising it into AS 1. The single
entry represents a valid CIDR supernet even though it is an illegal
Class C network.
BGP aggregate routes contain additional information within the AS_PATH path
attribute. When aggregate entries are generated from a set of more specific
routes, the AS_PATH attributes of the more specific routes are combined. For
example, in Figure 5-27, the aggregate route is announced from
AS 1 into AS 2. This aggregate represents the set of more specific routes
deployed within AS 1 and AS 3. When this aggregate route is sent to AS 2, the
AS_PATH attribute consists of <1 3>. This is done to prevent routing information
loops. A loop can occur if AS 1 generated an aggregate with an AS_PATH
attribute of <1>. If AS 2 had a direct connection to AS 3, the route with the
less-specific AS_PATH advertised from AS 1 can generate a loop. This is
because AS 2 does not know this aggregate contains networks located within
AS 3.
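The consolidation step itself can be illustrated with Python's `ipaddress` module. The prefixes below are hypothetical examples chosen for the sketch, not the addresses elided from the figure:

```python
import ipaddress

# Four contiguous /24 networks, such as an AS might hold internally:
routes = [ipaddress.ip_network(f"198.51.{i}.0/24") for i in range(100, 104)]

# Aggregate them into the shortest covering set of CIDR prefixes:
summary = list(ipaddress.collapse_addresses(routes))
print(summary)   # [IPv4Network('198.51.100.0/22')]
```

A peer then advertises the single /22 instead of four /24 entries, which is exactly the routing-table reduction that CIDR support in BGP-4 provides.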
5.9.7 BGP confederations
BGP requires that all speakers within a single AS have a fully meshed set of
IBGP connections. This can be a scaling problem in networks containing a large
number of IBGP peers. The use of BGP confederations addresses this problem.
A BGP confederation creates a set of autonomous systems that represent a
single AS to peers external to the confederation. This removes the full mesh
requirement and reduces management complexity.
Figure 5-28 illustrates the operation of a BGP confederation. In this sample
network, AS 1 contains eight BGP speakers. A standard BGP network would
require 28 IBGP sessions to fully mesh the speakers.
Figure 5-28 BGP confederations
A confederation divides the AS into a set of domains. In this example, AS 1
contains three domains. Devices within a domain have a fully meshed set of
IBGP connections. Each domain also has an EBGP connection to other domains
within the confederation. In the example network, R1, R2, and R3 have fully
meshed IBGP sessions. R1 has an EBGP session within the confederation to
R4. R3 has an EBGP session outside the confederation to R9.
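The session arithmetic behind this design is simply n(n-1)/2 per fully meshed group. A quick sketch, under the assumption (ours, for illustration) that the eight speakers are split into domains of three, three, and two:

```python
def full_mesh(n: int) -> int:
    """Number of IBGP sessions needed to fully mesh n speakers."""
    return n * (n - 1) // 2

print(full_mesh(8))   # 28 sessions without confederations
# With domains of 3, 3, and 2 speakers, only the intra-domain meshes
# remain (plus a handful of inter-domain EBGP links):
print(full_mesh(3) + full_mesh(3) + full_mesh(2))   # 7 intra-domain sessions
```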
Each router in the confederation is assigned a confederation ID. A member of the
confederation uses this ID in all communications with devices outside the
confederation. In this example, each router is assigned a confederation ID of
AS 1.
All communications from AS 1 to AS 2 or AS 3 appear to have originated from
the confederation ID of AS 1. Even though communication between domains
within a confederation occurs with EBGP, the domains exchange routing updates
as though they were connected by IBGP. Specifically, the information contained
in the NEXT_HOP, MULTI_EXIT_DISC, and LOCAL_PREF attributes is
preserved between domains. The confederation appears to be a single AS to
other autonomous systems.
BGP confederations are described in RFC 3065. At the time of this writing, this is
a proposed standard. Regardless, BGP confederations have been widely
deployed throughout the Internet. Numerous vendors support this feature.
5.9.8 BGP route reflectors
Route reflectors are another solution to address the requirement for a full mesh
of IBGP sessions between peers in an AS. As noted previously, when a BGP
speaker receives an update from an IBGP peer, the receiving speaker
propagates the update only to EBGP peers. The receiving speaker does not
forward the update to other IBGP peers.
Route reflectors relax this restriction. BGP speakers are permitted to advertise
IBGP learned routes to certain IBGP peers. Figure 5-29 on page 232 depicts an
environment using route reflectors. R1 is configured as a route reflector for R2
and R3. R2 and R3 are route reflector clients of R1. No IBGP session is defined
between R2 and R3. When R3 receives an EBGP update from AS 3, it is passed
to R1 using IBGP. Because R1 is configured as a reflector, R1 forwards the
IBGP update to R2.
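The reflection rule can be sketched as follows. This is a simplification of the RFC 4456 behavior, and the function and router names are our own:

```python
def reflect_targets(source, clients, nonclients):
    """Return the IBGP peers to which a route reflector re-advertises an update.
    Updates from a client go to all other clients and all non-client peers;
    updates from a non-client IBGP peer are reflected to clients only."""
    if source in clients:
        return (clients - {source}) | nonclients
    return set(clients)

# R1 reflects R3's update to its other client R2 (no non-client peers here):
print(reflect_targets("R3", {"R2", "R3"}, set()))   # {'R2'}
```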
Figure 5-29 also illustrates the interaction between route reflectors and
conventional BGP speakers within an AS.
Figure 5-29 BGP route reflector
In Figure 5-29, R1, R2, and R3 are in the route reflector domain. R6, R7, and R8
are conventional BGP speakers containing a full mesh of IBGP peer connections.
In addition, each of these speakers is peered with the route reflector. This
configuration permits full IBGP communication within AS 1.
Although not shown in Figure 5-29, an AS can contain more than one route
reflector. When this occurs, each reflector treats other reflectors as a
conventional IBGP peer.
Route reflectors are described in RFC 4456. At the time of this writing, this is a
proposed standard.
5.10 Routing protocol selection
The choice of a routing protocol is a major decision for the network administrator.
It has a major impact on overall network performance. The selection depends on
network complexity, size, and administrative policies.
The protocol chosen for one type of network might not be appropriate for other
types of networks. Each unique environment must be evaluated against a
number of fundamental design requirements:
򐂰 Scalability to large environments: The potential growth of the network dictates
the importance of this requirement. If support is needed for large,
highly-redundant networks, consider link state or hybrid algorithms. Distance
vector algorithms do not scale into these environments.
򐂰 Stability during outages: Distance vector algorithms might introduce network
instability during outage periods. The counting to infinity problems (5.3.5,
“Convergence and counting to infinity” on page 185) can cause routing loops
or other non-optimal routing paths. Link state or hybrid algorithms reduce the
potential for these problems.
򐂰 Speed of convergence: Triggered updates provide the ability to immediately
initiate convergence when a failure is detected. All three types of protocols
support this feature. One contributing factor to convergence is the time
required to detect a failure. In OSPF and EIGRP networks, a series of hello
packets must be missed before convergence begins. In RIP environments,
subsequent route advertisements must be missed before convergence is
initiated. These detection times increase the time required to restore
connectivity.
򐂰 Metrics: Metrics provide the ability to groom appropriate routing paths through
the network. Link state algorithms consider bandwidth when calculating
routes. EIGRP improves this to include network delay in the route calculation.
򐂰 Support for VLSM: The availability of IP address ranges dictates the
importance of this requirement. In environments with a constrained supply of
addresses, the network administrator must develop an addressing scheme
that intelligently overlays the network. VLSM is a major component of this
plan. The use of private address ranges can also address this concern.
򐂰 Vendor interoperability: The types of devices deployed in a network indicate
the importance of this requirement. If the network contains equipment from a
number of vendors, use standard routing protocols. The IETF has dictated the
operating policies for the distance vector and link state algorithms described
in this document. Implementing these algorithms avoids any interoperability
problems encountered with nonstandard protocols.
򐂰 Ease of implementation: Distance vector protocols are the simplest routing
protocol to configure and maintain. Because of this, these protocols have the
largest implementation base. Limited training is required to perform problem
resolution in these environments.
In small, non-changing environments, static routes are also simple to
implement. These definitions change only when sites are added or removed
from the network. The administrator must assess the importance of each of
these requirements when determining the appropriate routing protocol for a
given environment.
5.11 Additional functions performed by the router
The main functions performed by a router relate to managing the IP routing table
and forwarding data. However, the router should be able to provide information
alerting other devices to potential network problems.
This information is provided by the ICMP protocol described in 3.2, “Internet
Control Message Protocol (ICMP)” on page 109. The information includes:
򐂰 ICMP Destination Unreachable: The destination address specified in the IP
packet references an unknown IP network.
򐂰 ICMP Redirect: Redirect forwarding of traffic to a more suitable router along
the path to the destination.
򐂰 ICMP Source Quench: Congestion problems (for example, too many
incoming datagrams for the available buffer space) have been encountered in
a device along the path to the destination.
򐂰 ICMP Time Exceeded: The Time-to-Live field of an IP datagram has reached
zero. The packet is not able to be delivered to the final destination.
In addition, each IP router should support the following base ICMP operations
and messages:
򐂰 Parameter problem: This message is returned to the packet’s source if a
problem with the IP header is found. The message indicates the type and
location of the problem. The router discards the errored packet.
򐂰 Address mask request/reply: A router must implement support for receiving
ICMP Address Mask Request messages and responding with ICMP Address
Mask Reply messages.
򐂰 Timestamp: The router must return a Timestamp Reply to every Timestamp
message that is received. It should be designed for minimum variability in
delay. To synchronize the clock on the router, the UDP Time Server Protocol
or the Network Time Protocol (NTP) can be used.
򐂰 Echo request/reply: A router must implement an ICMP Echo server function
that receives requests sent to the router and sends corresponding replies.
The router can ignore ICMP Echo requests addressed to IP broadcast or IP
multicast addresses.
5.12 Routing processes in UNIX-based systems
This chapter focuses on protocols available in standard IP routers. However,
several of these protocols are also available in UNIX-based systems.
These protocols are often implemented using one of two processes:
򐂰 Routed (pronounced route-D): This is a basic routing process for interior
routing. It is supplied with the majority of TCP/IP implementations. It
implements the RIP protocol.
򐂰 Gated (pronounced gate-D): This is a more sophisticated process allowing for
both interior and exterior routing. It can implement a number of protocols
including OSPF, RIP-2, and BGP-4.
5.13 RFCs relevant to this chapter
The following RFCs provide detailed information about the connection protocols
and architectures presented throughout this chapter:
򐂰 RFC 904 – Exterior Gateway Protocol formal specification (April 1984)
򐂰 RFC 1058 – Routing Information Protocol (June 1988)
򐂰 RFC 1322 – A Unified Approach to Inter-Domain Routing (May 1992)
򐂰 RFC 1812 – Requirements for IP Version 4 Routers (June 1995)
򐂰 RFC 2080 – RIPng for IPv6 (January 1997)
򐂰 RFC 2328 – OSPF Version 2 (April 1998)
򐂰 RFC 2453 – RIP Version 2 (November 1998)
򐂰 RFC 3065 – Autonomous System Confederations for BGP (February 2001)
򐂰 RFC 3101 – The OSPF Not-So-Stubby Area (NSSA) Option (January 2003)
򐂰 RFC 4271 – A Border Gateway Protocol 4 (BGP-4) (January 2006)
򐂰 RFC 4451 – BGP MULTI_EXIT_DISC (MED) Considerations (March 2006)
򐂰 RFC 4456 – BGP Route Reflection: An Alternative to Full Mesh Internal BGP
(IBGP) (April 2006)
Chapter 6. IP multicast
In early IP networks, a packet could be sent to either a single device (unicast) or
to all devices (broadcast). A single transmission destined for a group of devices
was not possible. However, during the past few years, a new set of applications
has emerged. These applications use multicast transmissions to enable efficient
communication between groups of devices. Data is transmitted to a single
multicast IP address and received by any device that needs to obtain the data.
This chapter describes the interoperation between IP multicasting, Internet
Group Management Protocol (IGMP), and multicast routing protocols.
© Copyright IBM Corp. 1989-2006. All rights reserved.
6.1 Multicast addressing
Multicast devices use Class D IP addresses to communicate. These addresses
are contained in the range encompassing 224.0.0.0 through 239.255.255.255.
For each multicast address, there exists a set of zero or more hosts that listen for
packets transmitted to the address. This set of devices is called a host group. A
host that sends packets to a specific group does not need to be a member of the
group. The host might not even know the current members in the group. There
are two types of host groups:
򐂰 Permanent: Applications that are part of this type of group have an IP address
permanently assigned by the IANA. Membership in this type of host group is
not permanent; a host can join or leave the group as required. A permanent
group continues to exist even if it has no members. The list of IP addresses
assigned to permanent host groups is included in RFC 3232. These reserved
addresses include:
– 224.0.0.0: Reserved base address
– 224.0.0.1: All systems on this subnet
– 224.0.0.2: All routers on this subnet
– 224.0.0.9: All RIP2 routers
Other address examples include those reserved for OSPF (refer to 5.6, “Open
Shortest Path First (OSPF)” on page 196). They include:
– 224.0.0.5: All OSPF routers
– 224.0.0.6: OSPF designated routers
Additionally, IGMPv3 (defined in RFC 3376) reserves the following address:
– 224.0.0.22: All IGMPv3-capable multicast routers
An application can use DNS to obtain the IP address assigned to a
permanent host group (refer to 12.1, “Domain Name System (DNS)” on
page 426) using the domain mcast.net. It can determine the permanent group
from an address by using a pointer query (refer to 12.1.6, “Mapping IP
addresses to domain names: Pointer queries” on page 430) in the
224.in-addr.arpa domain.
򐂰 Transient: Any group that is not permanent is transient. The group is available
for dynamic assignment as needed. Transient groups cease to exist when the
number of members drops to zero.
6.1.1 Multicasting on a single physical network
This process is straightforward. The sending process specifies a destination IP
multicast address. The device driver converts this IP address to the
corresponding Ethernet address and sends the packet to the destination. The
destination process informs its network device drivers that it wants to receive
datagrams destined for a given multicast address. The device driver enables
reception of packets for that address.
In contrast to standard IP unicast traffic forwarding, the mapping between the IP
multicast destination address and the data-link address is not done with ARP
(see 3.4, “Address Resolution Protocol (ARP)” on page 119). Instead, a static
mapping has been defined. In an Ethernet network, multicasting is supported if
the high-order octet of the data-link address is 0x'01'. The IANA has reserved the
range 0x’01005E000000' through 0x'01005E7FFFFF' for multicast addresses.
This range provides 23 usable bits. The 32-bit multicast IP address is mapped to
an Ethernet address by placing the low-order 23 bits of the Class D address into
the low-order 23 bits of the IANA reserved address block. Figure 6-1 shows the
mapping of a multicast IP address to the corresponding Ethernet address.
Figure 6-1 Mapping of Class D IP addresses to Ethernet addresses
Because the high-order five bits of the IP multicast group are ignored, 32
different multicast groups are mapped to the same Ethernet address. Because of
this non-unique mapping, filtering by the device driver is required. This is done by
checking the destination address in the IP header before passing the packet to
the IP layer. This ensures the receiving process does not receive spurious
datagrams. There are two additional reasons why filtering might be needed:
򐂰 Some LAN adapters are limited to a finite number of concurrent multicast
addresses. When this limit is exceeded, they receive all multicast packets.
򐂰 The filters in some LAN adapters use a hash table value rather than the entire
multicast address. If two addresses with the same hash value are used at the
same time, the filter might pass excess packets.
Despite this requirement for software filtering, multicast transmissions still
impose less overhead on hosts not participating in a specific session. In particular,
hosts that are not participating in a host group are not listening for the multicast
address. In this situation, multicast packets are filtered by lower-layer network
interface hardware.
6.1.2 Multicasting between network segments
Multicast traffic is not limited to a single physical network. However, there are
inherent dangers when multicasting between networks. If the environment
contains multiple routers, specific precautions must be taken to ensure multicast
packets do not continuously loop through the network. It is simple to create a
multicast routing loop. To address this, multicast routing protocols have been
developed to deliver packets while simultaneously avoiding routing loops and
excess transmissions.
There are two requirements to multicast data across multiple networks:
򐂰 Determining multicast participants: A mechanism for determining if a
multicast datagram needs to be forwarded on a specific network. This
mechanism is defined in RFC 3376, Internet Group Management Protocol
(IGMP), Version 3.
򐂰 Determining multicast scope: A mechanism for determining the scope of a
transmission. Unlike unicast addresses, multicast addresses can extend
through the entire Internet.
The TTL field in a multicast datagram can be used to determine the scope of
a transmission. Like other datagrams, each multicast datagram has a Time
To Live (TTL) field. The value contained in this field is decremented at each
hop. When a host or multicast router receives a datagram, packet processing
depends on both the TTL value and the destination IP address:
– TTL = 0: A multicast datagram received with a TTL value of zero is
restricted to the source host.
– TTL = 1: A multicast datagram with a TTL value of one reaches all hosts
on the subnet that are members of the group. Multicast routers decrement
the value to zero. However, unlike unicast datagrams, no ICMP Time
Exceeded error message is returned to the source host. Datagram
expiration is a standard occurrence in multicast environments.
– TTL = 2 (or more): A multicast datagram with this TTL value reaches all
hosts on the subnet that are members of the group. The action performed
by multicast routers depends on the specific group address:
–– This range of addresses is intended for
single-hop multicast applications. Multicast routers will not forward
datagrams with destination addresses in this range.
Even though multicast routers will not forward datagrams within this
address range, a host must still report membership in a group within
this range. The report is used to inform other hosts on the subnet that
the reporting host is a member of the group.
Other: Datagrams with any other valid Class D destination address are
forwarded as normal by the multicast router. The TTL value is
decremented by one at each hop.
This allows a host to implement an expanding ring search to locate the
nearest server listening to a specific multicast address. The host sends
out a datagram with a TTL value of 1 (same subnet) and waits for a
reply. If no reply is received, the host resends the datagram with a TTL
value of 2. If no reply is received, the host continues to systematically
increment the TTL value until the nearest server is found.
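The expanding ring search described above can be sketched as follows. The probe abstraction, discovery payload, group address, and port are our own illustrative choices, not part of any standard:

```python
import socket

def expanding_ring_search(probe, max_ttl=10):
    """Call probe(ttl) with TTL = 1, 2, ... until a server responds.

    Returns (ttl, server) for the nearest responder, or None if the
    search reaches max_ttl without a reply.
    """
    for ttl in range(1, max_ttl + 1):
        server = probe(ttl)
        if server is not None:
            return ttl, server
    return None

def make_udp_probe(group, port, timeout=2.0):
    """Build a probe that multicasts a datagram with a given TTL and
    waits briefly for a unicast reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    def probe(ttl):
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(b"DISCOVER", (group, port))
        try:
            _, server = sock.recvfrom(1024)
            return server
        except socket.timeout:
            return None
    return probe

# probe = make_udp_probe("", 5000)   # hypothetical group and port
# nearest = expanding_ring_search(probe)
```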
6.2 Internet Group Management Protocol (IGMP)
The Internet Group Management Protocol is used by hosts to join or leave a
multicast host group. Group membership information is exchanged between a
specific host and the nearest multicast router.
IGMP is best regarded as an extension to ICMP. It occupies the same position in
the IP protocol stack.
IGMP functions are integrated directly into IPv6, because all IPv6 hosts are
required to support multicasting (refer to 9.3.2, “Multicast Listener Discovery
(MLD)” on page 365). In IPv4, multicasting and IGMP support is optional.
6.2.1 IGMP messages
IGMP messages are encapsulated in IP datagrams. To indicate an IGMP packet,
the IP header contains a protocol number of 2. For IGMP version 2 (defined by
RFC 2236), the IP data field contains the 8-octet IGMP message shown in
Figure 6-2.
Figure 6-2 IGMP message format
The fields in the IGMP message contain the following information:
򐂰 Type: This field specifies the type of IGMP packet:
– 0x’11’: Specifies a membership query packet. This is sent by a multicast
router. There are two subtypes of membership query messages:
General Query: This is used to learn which groups have members on
an attached network.
Group-Specific Query: This is used to learn if a particular group has
any members on an attached network.
– 0x’12’: Specifies an IGMPv1 membership report packet. This is sent by a
multicast host to signal participation in a specific multicast host group.
– 0x’16’: Specifies an IGMPv2 membership report packet.
– 0x’17’: Specifies a leave group packet. This is sent by a multicast host.
򐂰 Max resp time: This field is used in membership query messages. It specifies
the maximum allowed time a host can wait before sending a corresponding
report. Varying this setting allows routers to tune the leave latency, that is,
the time between the last host leaving a group and the time the
routing protocol is notified that there are no more members.
򐂰 Checksum: This field contains a 16-bit checksum.
򐂰 Class D Address: This field contains a valid multicast group address. It is
used in a report packet.
IGMPv3 messages
The IGMPv2 message format has been extended in IGMP version 3, defined in
RFC 3376 (which obsoletes RFC 2236). Version 3 allows receivers to subscribe
to or exclude a specific set of sources within a multicast group. To accommodate
this, the 0x’11’ type membership query packet has been altered, and a new
IGMP packet type of 0x’22’ has been added. However, all IGMPv3
implementations must still support packet types 0x’12’, 0x’16’, and 0x’17’.
Figure 6-3 illustrates the expanded version 3 membership query packet.
Figure 6-3 The IGMPv3 membership query message (fields shown include Type = 0x’11’, Max. Resp. Code, Group Address, Number of Sources (N), and Source Address [1] through [N])
The fields from this message are as follows:
򐂰 Type: This remains unchanged and, in the case of the membership query
message, has a value of 0x’11’.
򐂰 Max resp code: This field has been changed from the IGMPv2 convention,
and distinguishes between a maximum response code and a maximum
response time. The time can be determined from the code as follows:
– If the maximum response code is less than 128, the value of the maximum
response code is also the maximum response time.
– If the maximum response code is 128 or greater, the maximum
response code represents a floating-point value denoted as follows:
Bit 0 = 1
Bits 1-3 = exp
Bits 4-7 = mant
Maximum response time = (mant OR 0x’10’) << (exp + 3)
Note: Maximum response time, for both IGMPv2 and IGMPv3, is
measured in tenths of a second.
This can be better described as creating a 5-bit string starting with 1
followed by the 4 bits of mant, and then shifting this string exp+3 to the left.
The resulting value is the maximum response time. For example, assume
that the value of maximum response code is decimal 178. The bit string
representation of this is 10110010. From this, the fields of the maximum
response code are:
Byte 0 = 1
exp = 011
mant = 0010
The subsequent calculations are (note that the shift amount exp + 3 is
computed in binary: 011 + 011 = 110, decimal 6):
(mant OR 0x’10’) = (0010 OR 10000) = 10010
10010 << (exp + 3) = 10010 << 110 = 10010000000
Binary 10010000000 = Decimal 1152
Therefore, when the maximum response code is decimal 178, the
maximum response time is 1152 tenths of a second.
򐂰 Checksum: This field contains a 16-bit checksum, and remains unchanged
from its version 2 counterpart.
򐂰 Group Address: This field contains the Class D address, and remains
unchanged from its version 2 counterpart.
򐂰 Resv: This field is reserved. It is set to zero on transmission and ignored on
򐂰 S Flag: When set to 1, this field indicates that any receiving multicast routers
should suppress the timer updates normally performed upon hearing
a query.
򐂰 QRV: This field is the Querier’s Robustness Variable. The QRV was added in
IGMPv3, and is used in tuning timer values for expected packet loss. The
higher the value of the QRV, the more tolerant the environment is for lost
packets within a domain. However, increasing the QRV also increases the
latency required in detecting a problem. Routers adopt the value in this field
from the most recently received query as their own robustness variable. If this
value exceeds the limit of 7, it is reset to 0.
򐂰 QQIC: This field is the Querier’s Query Interval Code. It specifies, in
seconds, the query interval used by the originator of this query. The
calculations to convert this code into the actual interval time are the same
as those used for the maximum response code.
򐂰 Number of Sources (N): This field indicates how many source addresses are
contained within the message. The maximum value for this field is determined
by the MTU allowed in the network.
򐂰 Source Addresses: This set of fields is a vector of N IP unicast addresses,
where the value N corresponds to the Number of Sources (N) field.
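The maximum response code decoding described above can be expressed compactly; this is a sketch, and the function name is ours:

```python
def max_resp_time(code: int) -> int:
    """Decode an IGMPv3 Max Resp Code into tenths of a second.

    Codes below 128 are taken literally; otherwise bits 1-3 are an
    exponent and bits 4-7 a mantissa: (mant | 0x10) << (exp + 3).
    """
    if code < 128:
        return code
    exp = (code >> 4) & 0x7   # bits 1-3
    mant = code & 0xF         # bits 4-7
    return (mant | 0x10) << (exp + 3)

# The worked example: code 178 decodes to 1152 tenths of a second.
assert max_resp_time(178) == 1152
```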
Additionally, IGMPv3 adds a new type of 0x’22’, which is the IGMPv3
Membership Report. Figure 6-4 illustrates the format for this message.
Figure 6-4 The IGMPv3 membership report message (fields shown include Type = 0x’22’, Number of Group Records (M), and Group Record [1] through [M])
In Figure 6-4 on page 245, the Type, Checksum, and Number of Group Records
fields are equivalent to their IGMPv3 message query counterparts. However, in
the membership report message, each Group Record is broken up, as illustrated
in Figure 6-5.
Figure 6-5 IGMPv3 group record (fields shown include Record Type, Number of Sources (N), Multicast Address, Source Address [1] through [N], and Auxiliary Data)
The fields within the group record are described as follows:
򐂰 Record Type: This field indicates whether the group record type is a
current-state, filter-mode-change, or source-list-change record. The values
are as follows:
– Current-state records are sent by a system in response to a query
received on an interface, and report the current reception of that interface.
There are two group record values that denote a current-state record:
MODE_IS_INCLUDE: This indicates that the interface has a filter mode
of INCLUDE for the specified multicast addresses.
MODE_IS_EXCLUDE: This indicates that the interface has a filter
mode of EXCLUDE for the specified multicast addresses.
– Filter-mode-change records are sent by a system whenever an interface’s
state changes for a particular multicast address. There are two group
record values which denote a filter-mode-change record:
CHANGE_TO_INCLUDE_MODE: This indicates that the interface has
changed to the INCLUDE filter mode for the specified multicast
addresses.
CHANGE_TO_EXCLUDE_MODE: This indicates that the interface has
changed to the EXCLUDE filter mode for the specified multicast
addresses.
– Source-list-change records are sent by a system whenever an interface
wants to alter the list of source addresses without altering its state. There
are two group record values that denote a source-list-change record:
ALLOW_NEW_SOURCES: This indicates that the interface has
changed such that it wants to receive messages from additional
sources. If the filter is an INCLUDE filter, the specified multicast
addresses will be added. If it is an EXCLUDE filter, the specified
multicast addresses will be removed.
BLOCK_OLD_SOURCES: This indicates that the interface has
changed such that it no longer wants to receive messages from
additional sources. If the filter is an INCLUDE filter, the specified
multicast addresses will be removed. If it is an EXCLUDE filter, the
specified multicast addresses will be added.
We discuss these group record types in greater detail in “IGMPv3 specific host
operations” on page 248.
6.2.2 IGMP operation
Both hosts and multicast routers participate in IGMP functions.
Host operations
To receive multicast datagrams, a host must join a host group. When a host is
multihomed, it can join groups on one or more of its attached interfaces. If a host
joins the same group on multiple interfaces, the multicast messages received by
the host can be different. For example, is the group for all hosts on this
subnet. Messages in this group received through one subnet will always be
different from those on another subnet.
Multiple processes on a single host can listen for messages from the same
group. When this occurs, the host joins the group once. The host internally tracks
each process interested in the group.
To join a group, the host sends an IGMP membership report packet through an
attached interface. The report is addressed to the desired multicast group. A host
does not need to join the all hosts group ( Membership in this group is
automatic.
IGMPv3 specific host operations
In IGMPv3, hosts specify a list of multicast addresses from which they want to
receive messages, or a list of multicast addresses from which they do not want to
receive messages. Hosts can then later alter these lists to add or remove
multicast addresses. This can be achieved using the filter-mode-change and
source-list-change records.
Note: If no interface state exists, it is created using the filter-mode-change
records.
The use of these records is demonstrated using Table 6-1. In this example, the
current state indicates what subsets of multicast addresses (A and B) are
currently included or excluded. The desired state indicates the subsets desired to
be included or excluded. The records needed to achieve the desired state show
the records that must be sent to achieve this change. Note that the group record
types are abbreviated as follows: To_in (CHANGE_TO_INCLUDE_MODE), To_ex
(CHANGE_TO_EXCLUDE_MODE), Allow (ALLOW_NEW_SOURCES), and Block
(BLOCK_OLD_SOURCES).
Table 6-1 IGMPv3 list changes using group record types
Current state    Desired state    Records needed to achieve desired state
Include ()       Include (A)      To_in (A)
Include (A)      Include (B)      Allow (B-A), Block (A-B)
Include (B)      Exclude (A)      To_ex (A)
Exclude (A)      Exclude (B)      Allow (A-B), Block (B-A)
Exclude (B)      Include (A)      To_in (A)
These steps are summarized as follows:
1. No source address list currently exists. The subset A is added by
issuing a CHANGE_TO_INCLUDE_MODE specifying the A subset.
2. Subset A is currently included. However, only subset B is desired. To do this,
first an ALLOW_NEW_SOURCES message is issued to add all of subset B
except for those addresses already included in A. This is followed by a
BLOCK_OLD_SOURCES message to exclude all of subset A except for
those addresses which also belong to B.
3. Now only subset B is included. However, we want to change the filter to
EXCLUDE, and to specify only subset A. This is done with one
CHANGE_TO_EXCLUDE_MODE specifying subset A.
4. After step 3, only subset A is excluded. Now, we want to exclude only
subset B. First issue an ALLOW_NEW_SOURCES message to remove all of
subset A from the excluded list except those addresses also in subset B.
Then add all of the addresses in subset B to the exclude list except for those
also in subset A.
5. Now only subset B is excluded. We want to change the filter to INCLUDE and
add A to the list. Use the CHANGE_TO_INCLUDE_MODE specifying
subset A.
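The transitions above reduce to set operations. The following sketch (names and data representation are ours) reproduces the rows of Table 6-1:

```python
def records_for_change(cur_mode, cur_set, new_mode, new_set):
    """Compute the IGMPv3 group records needed to move between filter states.

    A mode change emits a single TO_IN or TO_EX record; a source-list
    change within the same mode emits ALLOW and BLOCK records.
    """
    if cur_mode != new_mode:
        kind = "TO_IN" if new_mode == "INCLUDE" else "TO_EX"
        return [(kind, new_set)]
    if cur_mode == "INCLUDE":
        # Allow sources newly included, block those no longer included
        return [("ALLOW", new_set - cur_set), ("BLOCK", cur_set - new_set)]
    # EXCLUDE mode: ALLOW removes from the excluded list, BLOCK adds to it
    return [("ALLOW", cur_set - new_set), ("BLOCK", new_set - cur_set)]

# Step 2 of the example: Include (A) to Include (B) needs
# Allow (B-A) followed by Block (A-B).
A, B = {"s1", "s2"}, {"s2", "s3"}
print(records_for_change("INCLUDE", A, "INCLUDE", B))
```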
Multicast router operations
When a host attempts to join a group, multicast routers on the subnet receive the
membership report packet and create an entry in their local group database. This
database tracks the group membership of the router’s directly attached networks.
Each entry in the database is of the format [group, attached network]. This
indicates that the attached network has at least one IP host belonging to the
group. Multicast routers listen to all multicast addresses to detect these reports.
The information in the local group database is used to forward multicast
datagrams. When the router receives a datagram, it is forwarded out each
interface containing hosts belonging to the group.
To verify group membership, multicast routers regularly send an IGMP query
message to the all-hosts multicast address ( Each host that still wants to be a
member of a group sends a reply. RFC 3376 specifies this verification should by
default occur every 125 seconds. To avoid bursts of traffic on the subnet, replies
to query messages are sent using a random delay. Because routers do not track
the number of hosts in each group, any host that hears another device claim
membership cancels any pending membership replies. If no hosts claim
membership within the specified interval, the multicast router assumes no hosts
on that network are members of the group.
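The randomized reply behavior can be sketched as a simple simulation; this is not actual IGMP code, and the names are ours:

```python
import random

def reporting_host(members, max_resp_time, rng=random.Random(42)):
    """Simulate IGMP report suppression on one subnet.

    Each group member picks a random delay in [0, max_resp_time]; the
    first report heard on the subnet cancels every other member's
    pending report, so only one host actually replies to the query.
    """
    delays = {host: rng.uniform(0, max_resp_time) for host in members}
    return min(delays, key=delays.get)   # the host whose timer fires first

winner = reporting_host(["hostA", "hostB", "hostC"], max_resp_time=100)
```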
IGMP snooping switches
One potential problem when implementing IGMP is the flooding of a network
segment with multicast packets, even though there might not be any nodes on
that segment that have any interest in receiving the packets. Although the
amount of processing involved in receiving these packets presents no significant
cost, the flooding of these packets does have the potential to waste bandwidth.
This problem is documented in RFC 4541, which provides recommendations and
suggested rules to allow network switches to “snoop” on IGMP traffic. By doing
so, switches can analyze the data contained within the IGMP header and
determine if the traffic needs to be forwarded to every segment to which the
switch is connected. By doing this, switches can reduce the amount of
unnecessary IGMP traffic flooding uninterested networks, thereby reserving the
segment’s bandwidth for the traffic with which the hosts on those networks are
concerned.
6.3 Multicast delivery tree
IGMP only specifies the communication occurring between receiving hosts and
their local multicast router. Routing of packets between multicast routers is
managed by a separate routing protocol. We describe these protocols in 6.4,
“Multicast forwarding algorithms” on page 252. Figure 6-6 on page 251 shows
that multicast routing protocols and IGMP operate in different sections of the
multicast delivery tree.
Figure 6-6 Multicast delivery tree
Figure 6-6 shows the tree formed between a multicast sender and the set of
receivers. If there are no hosts connected to a multicast router that have joined
this specific group, no multicast packets for the group should be delivered to the
branch connecting these hosts. These branches are pruned from the delivery tree.
This action reduces the size of the tree to the minimum number of branches
needed to reach every group member. New sections of the tree can be
dynamically added as new members join the group. This grafts new sections to
the delivery tree.
6.4 Multicast forwarding algorithms
Multicast algorithms are used to establish paths through the network. These
paths allow multicast traffic to effectively reach all group members. Each
algorithm should address the following set of requirements:
򐂰 The algorithm must route data only to group members.
򐂰 The algorithm must optimize the path from source to destinations.
򐂰 The algorithm must maintain loop-free routes.
򐂰 The algorithm must provide scalable signaling functions used to create and
maintain group membership.
򐂰 The algorithm must not concentrate traffic on a subset of links.
Several algorithms have been developed for use in multicast routing protocols.
These algorithms have varying levels of success addressing these design
requirements. We review two algorithms in the following sections.
6.4.1 Reverse path forwarding algorithm
The reverse path forwarding (RPF) algorithm uses a multicast delivery tree to
forward datagrams from the source to each member in the multicast group. As
shown in Figure 6-7, packets are replicated only at necessary branches in the
delivery tree.
Figure 6-7 Reverse path forwarding (RPF)
To track the membership of individual groups, trees are calculated and updated
dynamically.
The algorithm maintains a reverse path table used to reach each source. This
table maps every known source network to the preferred interface used to reach
the source. When forwarding data, if the datagram arrives through the interface
used to transmit datagrams back to the source, the datagram is forwarded
through every appropriate downstream interface. Otherwise, the datagram
arrived through a sub-optimal path and is discarded. Using this process,
duplicate packets caused by network loops are filtered.
The use of RPF provides two benefits:
򐂰 RPF guarantees the fastest delivery for multicast data. In this configuration,
traffic follows the shortest path from the source to each destination.
򐂰 A different tree is computed for each source node. Packet delivery is
distributed over multiple network links. This results in more efficient use of
network resources.
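The RPF check itself reduces to a single table lookup, as in this sketch (names and data representation are ours):

```python
def rpf_forward(source_net, arrival_iface, rpf_table, downstream_ifaces):
    """Reverse path forwarding check (a simplified sketch).

    rpf_table maps each known source network to the interface used to
    reach that source. A datagram is replicated downstream only if it
    arrived on that interface; otherwise it took a suboptimal path and
    is discarded, which filters duplicates caused by loops.
    """
    if rpf_table.get(source_net) != arrival_iface:
        return []                    # fails the RPF check: drop
    return downstream_ifaces         # forward on the delivery tree

rpf_table = {"": "eth0"}
print(rpf_forward("", "eth0", rpf_table, ["eth1", "eth2"]))
```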
6.4.2 Center-based tree algorithm
The center-based tree (CBT) algorithm describes another method to determine
optimum paths between members of a multicast group. The algorithm describes
the following steps:
1. A center point in the network is chosen. This fixed point represents the center
of the multicast group.
2. Each recipient sends a join request directed towards the center point. This is
accomplished using an IGMP membership report for that group.
3. The request is processed by all intermediate devices located between the
multicast recipient and the center point. If the router receiving the request is
already a member of the tree, it marks one more interface as belonging to the
group. If this is the first join request, the router forwards the request one step
further toward the center point.
This procedure builds a delivery tree for each multicast group. The tree is
identical for all sources. Each router maintains a single tree for the entire group.
This contrasts with the process used in the RPF algorithm. The RPF algorithm
builds a tree for each sender in a multicast group.
Because there is no requirement for the source to be a member of the group,
multicast packets from a source are forwarded toward the center point until they
reach a router belonging to the tree. At this stage, the packets are forwarded
using the multicast processing of the center-based tree.
The disadvantage to the center-based tree algorithm is that it might build a
suboptimal path for some sources and receivers.
6.4.3 Multicast routing protocols
A number of multicast routing protocols have been developed using these
algorithms:
򐂰 Distance Vector Multicast Routing Protocol (DVMRP)
򐂰 Multicast OSPF (MOSPF)
򐂰 Protocol Independent Multicast (PIM)
The remainder of this chapter describes these protocols.
6.5 Distance Vector Multicast Routing Protocol
DVMRP is an established multicast routing protocol originally defined in RFC
1075. The standard was first implemented as the mrouted process available on
many UNIX systems. It has since been enhanced to support RPF. DVMRP is an
interior gateway protocol. It is used to build per-source per-group multicast
delivery trees within an autonomous system (AS).
DVMRP does not route unicast datagrams. Any router that processes both
multicast and unicast datagrams must be configured with two separate routing
processes. Because separate processes are used, multicast and unicast traffic
might not follow the same path through the network.
6.5.1 Protocol overview
DVMRP is described as a broadcast and prune multicast routing protocol:
򐂰 DVMRP builds per-source broadcast trees based on routing exchanges.
򐂰 DVMRP dynamically prunes the per-source broadcast tree to create a
multicast delivery tree. DVMRP uses the RPF algorithm to determine the set
of downstream interfaces used to forward multicast traffic.
Neighbor discovery
DVMRP routers dynamically discover each neighbor by periodically sending
neighbor probe messages on each local interface. These messages are sent to
the all-DVMRP-routers multicast address ( Each message contains a
list of neighbor DVMRP routers for which neighbor probe messages have been
received. This allows a DVMRP router to verify it has been seen by each
neighbor.
After a router has received a probe message that contains its address in the
neighbor list, the pair of routers establish a two-way neighbor adjacency.
Routing table creation
DVMRP computes the set of reverse paths used in the RPF algorithm. To ensure
that all DVMRP routers have a consistent view of the path connecting to a
source, a routing table is exchanged between each neighbor router. DVMRP
implements its own unicast routing protocol. This routing protocol is similar to
RIP.
The algorithm is based on hop counts. DVMRP requires a metric to be
configured on every interface. Each router advertises the network number, mask,
and metric of each interface. The router also advertises routes received from
neighbor routers. Like other distance vector protocols, when a route is received,
the interface metric is added to the advertised metric. This adjusted metric is
used to determine the best upstream path to the source.
DVMRP has one important difference from RIP. RIP manages routing and
datagram forwarding to a particular unicast destination. DVMRP manages the
return path to the source of a particular multicast datagram.
Dependent downstream routers
In addition to providing a consistent view of paths to source networks,
exchanging routing information provides an additional benefit. DVMRP uses this
mechanism to notify upstream routers that a specific downstream router requires
them to forward multicast traffic.
DVMRP accomplishes this by using the poison reverse technique (refer to “Split
horizon with poison reverse” on page 188). If a downstream router selects an
upstream router as the next hop to a particular source, routing updates from the
downstream router specify a metric of infinity for the source network. When the
upstream router receives the advertisement, it adds the downstream router to a
list of dependent downstream routers for this source. This technique provides the
information needed to prune the multicast delivery tree.
Designated forwarder
When two or more multicast routers are connected to a multi-access network,
duplicate packets can be forwarded to the network. DVMRP prevents this
possibility by electing a designated forwarder for each source.
When the routers exchange their routing table, each learns the peer’s metric to
reach the source network. The router with the lowest metric is responsible for
forwarding data to the shared network. If multiple routers have the same metric,
the router with the lowest IP address becomes the designated forwarder for the
network.
6.5.2 Building and maintaining multicast delivery trees
As previously mentioned, the RPF algorithm is used to forward multicast
datagrams. If a datagram was received on the interface representing the best
path to the source, the router forwards the datagram out a set of downstream
interfaces. This set contains each downstream interface included in the multicast
delivery tree.
Building the multicast delivery tree
A multicast router forwards datagrams to two types of devices: downstream
dependent routers and hosts that are members of a particular multicast group. If
a multicast router has no dependent downstream neighbors through a specific
interface, the network is a leaf network. The delivery tree is built using routing
information detailing these different types of destinations.
Adding leaf networks
If the downstream interface connects to a leaf network, packets are forwarded
only if there are hosts that are members of the specific multicast group. The
router obtains this information from the IGMP local group database. If the group
address is listed in the database, and the router is the designated forwarder for
the source, the interface is included in the multicast delivery tree. If there are no
group members, the interface is excluded.
Adding non-leaf networks
Initially, all non-leaf networks are included in the multicast delivery tree. This
allows each downstream router to participate in traffic forwarding for each group.
Pruning the multicast delivery tree
Routers connected to leaf networks remove an interface when there are no
longer any active members participating in the specific multicast group. When
this occurs, multicast packets are no longer forwarded through the interface.
If a router is able to remove all of its downstream interfaces for a specific group, it
notifies its upstream neighbor that it no longer needs traffic from that particular
source and group pair. This notification is accomplished by sending a prune
message to the upstream neighbor. If the upstream neighbor receives prune
messages from each of the dependent downstream routers on an interface, the
upstream router can remove this interface from the multicast delivery tree.
If the upstream router is able to prune all of its interfaces from the tree, it sends a
prune message to its upstream router. This continues until all unnecessary
branches have been removed from the delivery tree.
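The upward propagation of prunes can be sketched with a simple tree model; the data structure is ours, not DVMRP's wire format:

```python
def prune(tree, router):
    """Propagate prunes up a multicast delivery tree (simplified sketch).

    tree maps each router to [upstream, set_of_active_downstream_branches].
    When a router's last downstream branch is pruned away, it sends a
    prune message to its upstream neighbor, which repeats the check.
    """
    while router is not None:
        upstream, downstream = tree[router]
        if downstream:                          # still has active branches
            break
        if upstream is not None:
            tree[upstream][1].discard(router)   # prune message upstream
        router = upstream

# A keeps its leaf network, so pruning B stops at A.
tree = {"A": [None, {"B", "leaf-net"}], "B": ["A", set()]}
prune(tree, "B")
```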
Maintaining prune information
In order to remove outdated prune information, each prune message contains a
prune lifetime timer. This indicates the length of time that the prune will remain in
effect. If the interface is still pruned when the timer expires, the interface is
reconnected to the multicast delivery tree. If this causes unwanted multicast
datagrams to be delivered to a downstream device, the prune mechanism is
reinitiated.
Grafting pruned networks
Because IP multicast supports dynamic group membership, hosts can join a
multicast group at any time. When this occurs, DVMRP routers use graft
messages to reattach the network to the multicast delivery tree. A graft message
is sent as a result of receiving an IGMP membership report for a group that has
previously been pruned. Separate graft messages are sent to the appropriate
upstream neighbor for each source network that has been pruned.
Receipt of a graft message is acknowledged with a graft ACK message. This
enables the sender to differentiate between a lost graft packet and an inactive
device. If an acknowledgment is not received within the graft timeout period, the
request is retransmitted. The purpose of the graft ACK message is to
acknowledge the receipt of a graft message. It does not imply any action has
been taken as a result of the request. Therefore, all graft request messages are
acknowledged even if they do not cause any action to be taken by the receiving
router.
6.5.3 DVMRP tunnels
Some IP routers might not be configured to support native multicast routing.
DVMRP provides the ability to tunnel IP multicast datagrams through networks
containing non-multicast routers. The datagrams are encapsulated in unicast IP
packets and forwarded through the network. This behavior is shown in
Figure 6-8.
Figure 6-8 DVMRP tunnels
When the packet is received at the remote end of the tunnel, it is decapsulated
and forwarded through the subnetwork using standard DVMRP multicast operations.
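The encapsulation step can be sketched as follows. This is a hedged Python illustration using dictionaries as stand-ins for real IP headers; the field names are simplifications, not DVMRP's actual packet layout:

```python
def encapsulate(multicast_pkt, tunnel_src, tunnel_dst):
    """Wrap a multicast datagram in a unicast IP packet addressed to the
    remote tunnel endpoint, so non-multicast routers can carry it."""
    return {"src": tunnel_src, "dst": tunnel_dst,
            "proto": 4,               # IP-in-IP (IPPROTO_IPIP)
            "payload": multicast_pkt}

def decapsulate(unicast_pkt):
    """At the remote end of the tunnel, recover the original multicast
    datagram for normal multicast forwarding."""
    assert unicast_pkt["proto"] == 4
    return unicast_pkt["payload"]
```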
6.6 Multicast OSPF (MOSPF)
MOSPF is a multicast extension to OSPF Version 2 (refer to 5.6, “Open Shortest
Path First (OSPF)” on page 196), defined in RFC 1584. Unlike DVMRP, MOSPF
is not a separate multicast routing protocol. It is used in networks that already
use OSPF for unicast IP routing. The multicast extensions leverage the existing
OSPF topology database to create a source-rooted shortest path delivery tree.
MOSPF forwards multicast datagrams using both source and destination
address. This contrasts with the standard OSPF algorithm, which relies solely on
destination address.
6.6.1 Protocol overview
We present an overview of MOSPF in the following sections.
Group-membership LSA
The location of every group member must be communicated to the rest of the
environment. This ensures that multicast datagrams are forwarded to each
member. OSPF adds a new type of link state advertisement (the
group-membership-LSA) to track the location of each group member. These
LSAs are stored in the OSPF link state database. This database describes the
topology of the AS.
Designated routers
On each network segment, one MOSPF router is selected to be the designated
router (DR). This router is responsible for generating periodic IGMP host
membership queries. It is also responsible for listening to the IGMP membership
reports. Routers ignore any report received on a network where they are not the
DR. This ensures that each network segment appears in the local group
database of at most one router. It also prevents datagrams from being duplicated
as they are delivered to local group members.
Every router floods a group-membership-LSA for each multicast group having at
least one entry in the router’s local group database. This LSA is flooded
throughout the OSPF area.
Shortest-path delivery trees
The path used to forward a multicast datagram is calculated by building a
shortest-path delivery tree rooted at the datagram's source (refer to
“Shortest-Path First (SPF) algorithm” on page 177). This tree is built from
information contained in the link state database. Any branch in the shortest-path
delivery tree that does not have a corresponding group-membership-LSA is
pruned. These branches do not contain any multicast members for the specific group.
Initially, shortest-path delivery trees are built when the first datagram is received.
The results are cached for use by subsequent datagrams having the same
source and destination. The tree is recomputed when a link state change occurs
or when the cache information times out.
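The tree construction, pruning, and caching just described can be sketched as follows. This is a simplified illustration; the graph representation, function names, and cache policy are assumptions made for this example, not the MOSPF implementation:

```python
from heapq import heappush, heappop

def shortest_path_tree(links, source):
    """Dijkstra over the link state database. links maps a node to a list
    of (neighbor, cost) pairs. Returns a parent map rooted at source."""
    dist, parent, heap = {source: 0}, {source: None}, [(0, source)]
    while heap:
        d, u = heappop(heap)
        if d > dist[u]:
            continue
        for v, c in links.get(u, []):
            if d + c < dist.get(v, float("inf")):
                dist[v], parent[v] = d + c, u
                heappush(heap, (d + c, v))
    return parent

def prune(parent, members):
    """Keep only branches leading to a group member (i.e., nodes with a
    corresponding group-membership-LSA); all other branches are removed."""
    keep = set()
    for m in members:
        n = m
        while n is not None and n not in keep:
            keep.add(n)
            n = parent.get(n)
    return {n: p for n, p in parent.items() if n in keep}

_cache = {}  # (source, group) -> pruned tree; flushed on link state change

def delivery_tree(links, source, group, members):
    key = (source, group)
    if key not in _cache:
        _cache[key] = prune(shortest_path_tree(links, source), members)
    return _cache[key]
```

Because every router runs the same computation over an identical database, all routers arrive at the same tree, which is why MOSPF needs no explicit tree-building protocol messages.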
In an MOSPF network, all routers calculate an identical shortest-path delivery
tree for a specific multicast datagram. There is a single path between the
datagram source and any specific destination. This means that unlike OSPF's
treatment of unicast traffic, MOSPF has no provision for equal-cost multipath.
6.6.2 MOSPF and multiple OSPF areas
OSPF allows an AS to be split into areas. Although this has several traffic
management benefits, it limits the topology information maintained in each
router. A router is only aware of the network topology within the local area. When
building shortest-path trees in these environments, the information contained in
the link state database is not sufficient to describe the complete path between
each source and destination. This can lead to non-optimal path selection.
Within an OSPF area, the area border router (ABR) forwards routing information
and data traffic between areas. The corresponding functions in an MOSPF
environment are performed by an inter-area multicast forwarder. This device
forwards group membership information and multicast datagrams between
areas. An OSPF ABR can also function as an MOSPF inter-area multicast forwarder.
Because group-membership-LSAs are only flooded within an area, a process to
convey membership information between areas is required. To accomplish this,
each inter-area multicast forwarder summarizes the attached areas’ group
membership requirements and forwards this information into the OSPF
backbone. This announcement consists of a group-membership-LSA listing each
group containing members in the non-backbone area. The advertisement
performs the same function as the summary LSAs generated in a standard
OSPF area.
However, unlike route summarization in a standard OSPF network,
summarization for multicast group membership in MOSPF is asymmetric.
Membership information for the non-backbone area is summarized into the
backbone. However, this information is not re-advertised into other
non-backbone areas.
To forward multicast data traffic between areas, a wildcard multicast receiver is
used. This is a router to which all multicast traffic, regardless of destination, is
forwarded. In non-backbone areas, all inter-area multicast forwarders are
wildcard multicast receivers. This ensures that all multicast traffic originating in a
non-backbone area is forwarded to an inter-area multicast forwarder. This router
sends the multicast datagrams to the backbone area. Because the backbone has
complete knowledge of all group membership information, the datagrams are
then forwarded to the appropriate group members in other areas.
6.6.3 MOSPF and multiple autonomous systems
An analogous situation to inter-area multicast routing exists when at least one
multicast device resides in another AS. In both cases, the shortest path tree
describing the complete path from source to destination cannot be built.
In this environment, an ASBR in the MOSPF domain is configured as an inter-AS
multicast forwarder. This router is also configured with an inter-AS multicast
routing protocol. Although the MOSPF standard does not dictate the operations
of the inter-AS protocol, it does assume the protocol forwards datagrams using
RPF principles. Specifically, MOSPF assumes that a multicast datagram whose
source is outside the domain will enter the domain at a point that is advertising
(into OSPF) the best route to the source. MOSPF uses this information to
calculate the path of the datagram through the domain.
MOSPF designates an inter-AS multicast forwarder as a wildcard multicast
receiver. As with inter-area communications, this ensures that the receiver
remains on all pruned shortest-path delivery trees. It receives all multicast
datagrams, regardless of destination. Because this device has complete
knowledge of all group membership outside the AS, datagrams can be forwarded
to group members in other autonomous systems.
6.6.4 MOSPF interoperability
Routers configured to support an MOSPF network can be intermixed with
non-multicast OSPF routers. Both types of routers interoperate when forwarding
unicast data traffic. However, forwarding IP multicast traffic is limited to the
MOSPF domain. Unlike DVMRP, MOSPF does not provide the ability to tunnel
multicast traffic through non-multicast routers.
6.7 Protocol Independent Multicast (PIM)
The complexity associated with MOSPF led to the development and
deployment of PIM. PIM is another multicast routing protocol. Unlike MOSPF,
PIM is independent of any underlying unicast routing protocol. It interoperates
with all existing unicast routing protocols.
PIM defines two modes of operation:
򐂰 Dense mode (PIM-DM), specified in RFC 3973
򐂰 Sparse mode (PIM-SM), specified in RFC 2362
Dense mode and sparse mode refer to the density of group members within an
area. In a random sampling, a group is considered dense if the probability of
finding at least one group member within the sample is high. This holds even if
the sample size is reasonably small. A group is considered sparse if the
probability of finding group members within the sample is low.
PIM provides the ability to switch between sparse mode and dense mode. It also
permits both modes to be used within the same group.
6.7.1 PIM dense mode
The PIM-DM protocol implements the RPF process described in 6.4.1, “Reverse
path forwarding algorithm” on page 252. Specifically, when a PIM-DM device
receives a packet, it validates the incoming interface with the existing unicast
routing table. If the incoming interface reflects the best path back to the source,
the router floods the multicast packet. The packet is sent out to every interface
that has not been pruned from the multicast delivery tree.
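The RPF decision just described can be sketched as follows. This is a simplified illustration assuming a unicast routing table keyed by source prefix; the function signature is invented for this example:

```python
def forward(pkt_source, in_ifc, unicast_route, interfaces, pruned):
    """PIM-DM forwarding decision: accept a multicast packet only if it
    arrived on the interface the unicast routing table would use to reach
    the source (the RPF check), then flood it out every other interface
    that has not been pruned from the delivery tree.

    unicast_route maps a source to its best outgoing interface (assumed
    symmetric, as PIM-DM requires)."""
    if unicast_route.get(pkt_source) != in_ifc:
        return []   # fails the RPF check: drop the packet
    return [i for i in interfaces if i != in_ifc and i not in pruned]
```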
Unlike DVMRP, PIM-DM does not attempt to compute multicast specific routes.
Rather, it assumes that the routes in the unicast routing table are symmetric.
Similar to operations in a DVMRP environment, a PIM-DM device initially
assumes all downstream interfaces need to receive multicast traffic. The router
floods datagrams to all areas of the network. If some areas do not have receivers
for the specific multicast group, PIM-DM reactively prunes these branches from
the delivery tree. This reactive pruning is done because PIM-DM does not obtain
downstream receiver information from the unicast routing table. Figure 6-9
contains an example of a PIM-DM pruning.
Figure 6-9 PIM-DM flood and prune operation
PIM-DM is relatively simple to implement. The only assumption is that a router is
able to retain a list of prune requests.
PIM-DM benefits
Given the flood and prune methodology used in PIM-DM, use this protocol in
environments where the majority of hosts within a domain need to receive the
multicast data. In these environments, the majority of networks will not be pruned
from the delivery tree. The inefficiencies associated with flooding are minimal. This
configuration is also appropriate when:
򐂰 Senders and receivers are in close proximity to each other.
򐂰 There are few senders and many receivers.
򐂰 The volume of multicast traffic is high.
򐂰 The stream of multicast traffic is constant.
Unlike DVMRP, PIM-DM does not support tunnels to transmit multicast traffic
through non-multicast capable networks. Therefore, the network administrator
must ensure that each device connected to the end-to-end path is multicast capable.
6.7.2 PIM sparse mode
The PIM-SM protocol uses a variant of the center-based tree algorithm. In a
PIM-SM network, a rendezvous point (RP) is analogous to the center point
described in the algorithm. Specifically, an RP is the location in the network
where multicast senders connect to multicast receivers. Receivers join a tree
rooted at the RP. Senders register their existence with the RP. Initially, traffic
from the sender flows through the RP to reach each receiver.
The benefit of PIM-SM is that unlike DVMRP and PIM-DM networks, multicast
data is blocked from a network segment unless a downstream device specifically
asks to receive the data. This has the potential to significantly reduce the amount
of traffic traversing the network. It also implies that no pruning information is
maintained for locations with no receivers. This information is maintained only in
devices connected to the multicast delivery tree. Because of these benefits,
PIM-SM is currently the most popular multicast routing protocol used in the Internet.
Building the PIM-SM multicast delivery tree
The basic PIM-SM interaction with the RP is:
1. A multicast router sends periodic join messages to a group-specific RP. Each
router along the path toward the RP builds and sends join requests to the RP.
This builds a group-specific multicast delivery tree rooted at the RP. Like
other multicast protocols, the tree is actually a reverse path tree, because join
requests follow a reverse path from the receiver to the RP. Figure 6-10 shows
this function.
Figure 6-10 Creating the RP-rooted delivery tree
2. The multicast router connecting to the source initially encapsulates each
multicast packet in a register message. These messages are sent to the RP.
The RP decapsulates these unicast messages and forwards the data packets
to the set of downstream receivers. Figure 6-11 shows this function.
Figure 6-11 Registering a source
3. The RP-based delivery tree can reflect suboptimal routes to some receivers.
To optimize these connections, the router can create a source-based
multicast delivery tree. Figure 6-12 shows this function.
Figure 6-12 Establishing a source-based delivery tree
4. After the router receives multicast packets through both the source-based
delivery tree and the RP-based delivery tree, PIM prune messages are sent
toward the RP to prune this branch of the tree. When complete, multicast data
from the source flows only through the source-based delivery tree.
Figure 6-13 shows this function.
Figure 6-13 Eliminating the RP-based delivery tree
The switch to the source-based tree occurs after a significant number of packets
have been received from the source. To implement this policy, the router
monitors the quantity of packets received through the RP-based delivery tree.
When this data rate exceeds a configured threshold, the device initiates the
switch to the source-based delivery tree.
RP selection
An RP is selected as part of standard PIM-SM operations. An RP is mapped to
each specific multicast group. To perform this mapping, a router configured to
support PIM-SM distributes a list of candidate RPs to all other routers in the
environment. When a mapping needs to be performed, each router hashes the
multicast group address into an IP address that represents the RP.
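The group-to-RP mapping can be sketched as follows. This is a deterministic stand-in, not the actual hash function (which is defined in the PIM-SM specification, RFC 2362); it only illustrates that every router, hashing independently over the same candidate list, selects the same RP:

```python
import hashlib

def select_rp(group_addr, candidate_rps):
    """Map a multicast group address onto one RP from the shared
    candidate-RP list. Because the hash is deterministic and the list is
    distributed to all routers, every router picks the same RP."""
    digest = hashlib.sha256(group_addr.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(candidate_rps)
    return candidate_rps[index]
```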
PIM-SM benefits
PIM-SM is optimized for environments containing a large number of multicast
data streams. Each stream should flow to a relatively small number of the LAN
segments. For these groups, the flooding and pruning associated with PIM-DM
and DVMRP is an inefficient use of network bandwidth. PIM-SM is also
appropriate when:
򐂰 There are few receivers in a group.
򐂰 Senders and receivers are separated by WAN links.
򐂰 The stream of multicast traffic is intermittent.
Like PIM-DM, PIM-SM assumes that the route obtained through the unicast
routing protocol supports multicast routing. Therefore, the network administrator
must ensure that each device connected to the end-to-end path is multicast capable.
6.8 Interconnecting multicast domains
Early multicast development focused on a flat network topology. This differed
from the hierarchical topology deployed in the Internet. As in the Internet,
multicast developers soon realized the need to perform inter-domain routing.
New classes of protocol have been proposed to address this deficiency. There
are currently two approaches to interconnecting multicast domains.
6.8.1 Multicast Source Discovery Protocol (MSDP)
MSDP, defined in RFC 3618, is a protocol to logically connect multiple PIM-SM
domains. It is used to find active multicast sources in other domains. RPs in
separate autonomous systems communicate through MSDP to exchange
information about these multicast sources. Each domain continues to use its own
RP. There is no direct dependence on other RPs for intra-domain operations.
Typically, each AS will contain one MSDP-speaking RP. Other autonomous
systems create MSDP peer sessions with this RP. These sessions are used to
exchange the lists of source speakers in each specific multicast group.
MSDP is not directly involved in multicast data delivery. Its purpose is to discover
sources in other domains. If a device in one domain wants to receive data from a
source in another domain, the data is delivered using the standard operations
within PIM-SM.
MSDP peers will usually be separated by multiple hops. All devices used to
support this remote communication must be multicast capable.
MSDP relies heavily on the Multiprotocol Extensions to BGP (MBGP) for
interdomain communication. For an MBGP overview, see “Multiprotocol
extensions for BGP-4” on page 268.
MSDP operations
To establish multicast communications between a source in one domain and
receivers in other domains, MSDP uses the following steps:
1. Following standard PIM-SM operations, the DR for the source sends a PIM
register message to the RP. That data packet is decapsulated by the RP and
forwarded down the shared tree to receivers in the same domain.
2. The packet is also re-encapsulated in a source-active (SA) message. This
message is sent to all MSDP peers. The SA message identifies the source
and group. This operation occurs when the source becomes active.
3. When an RP for a domain receives an SA message from a MSDP peer, it
determines if it has any members interested in the group described by the SA
message. If there is an interested party, the RP triggers a join toward the
source. After the path has been established and the RP is forwarding data,
the receiver can switch to the shortest-path tree directly to the source. This is
done with standard PIM-SM conventions. Each MSDP peer receives and
forwards SA messages. Each peer examines the MBGP routing table to
determine the next-hop peer towards the originating RP. The router forwards
the SA message to all MSDP peers other than the RPF peer.
4. If the router receives the SA message from a peer that is not the next-hop
peer, the message is discarded. This floods the announcements through the
entire network. It is similar to the RPF processing described in 6.4.1, “Reverse
path forwarding algorithm” on page 252.
See Figure 6-14 for an overview of MSDP operations.
Figure 6-14 MSDP operations
MSDP limitations
MSDP is currently an experimental RFC, but is deployed in numerous network
environments. Because of the periodic flood and prune messages associated
with MSDP, this protocol does not scale to address the potential needs of the
Internet. It is expected that MSDP will be replaced with the Border Gateway
Multicast Protocol (BGMP). We review this protocol in 6.8.2, “Border Gateway
Multicast Protocol” on page 269.
Multiprotocol extensions for BGP-4
When interconnecting multicast domains, it is possible that unicast routing might
select a path containing devices that do not support multicast traffic. This results
in multicast join messages not reaching the intended destination.
To solve this problem, MSDP uses the multiprotocol extensions for BGP-4
(MBGP) defined in RFC 2858. This is a set of extensions allowing BGP to
maintain separate routing tables for different protocols. Therefore, MBGP can
create routes for both unicast and multicast traffic. The multicast routes can
bypass the portions of the environment that do not support multicast. It
permits links to be dedicated to multicast traffic. Alternatively, it can limit the
resources used to support each type of traffic.
The information associated with the multicast routes is used by PIM to build
distribution trees. The standard services for filtering and preference setting are
available with MBGP.
6.8.2 Border Gateway Multicast Protocol
The Border Gateway Multicast Protocol (BGMP), defined in RFC 3913, is a
multicast routing protocol that builds shared domain trees. Like PIM-SM, BGMP
chooses a global root for the delivery tree. However, in BGMP, the root is a
domain, not a single router. This allows connectivity to the domain to be
maintained whenever any path is available to the domain.
Similar to the cooperation between an Exterior Gateway Protocol (EGP) and an
Interior Gateway Protocol (IGP) in a unicast environment, BGMP is used as the
inter-domain multicast protocol. Any multicast IGP can be used internally.
BGMP operates between border routers in each domain instead of using an RP.
Join messages are used to construct trees between domains. Border routers
learn from the multicast IGP whenever a host is interested in participating in an
interdomain multicast group. When this occurs, the border router sends a join
message to the root domain. Peer devices forward this request towards the root.
This forms the shared tree used for multicast delivery.
The BGMP specification requires multicast addresses to be allocated to
particular domains. The specification suggests the use of the Multicast
Address-Set Claim (MASC) protocol, defined in RFC 2909, to achieve this result.
MASC is a separate protocol allowing domains to claim temporary responsibility
for a range of addresses. However, the BGMP specification does not mandate the use of MASC.
6.9 The multicast backbone
The Internet multicast backbone (MBONE) was established in March 1992. It
was initially deployed to provide hands-on experience with multicast protocols.
The first uses provided audio multicasting of IETF meetings. At that time, 20 sites
were connected to the backbone. Two years later, simultaneous audio and video
transmissions were distributed to more than 500 participants located in 15
countries. Since then, the MBONE has been used to broadcast NASA Space
Shuttle missions, rock concerts, and numerous technical conferences.
Commercial and private use of the MBONE continues to increase.
The multicast backbone started as a virtual overlay network using much of the
physical Internet infrastructure. At that time, multicast routing was not supported
in standard routing devices. The first MBONE points-of-presence were UNIX
systems configured with the mrouted routing process. Today, the MBONE is still
operational, but multicast connectivity is natively included in many Internet
routers. Additionally, because multicast ability is a standard feature of IPv6,
MBONE will most likely become obsolete as IPv6 becomes more widely deployed.
6.9.1 MBONE routing
Multicast traffic does not flow to every Internet location. Until that occurs,
MBONE will consist of a set of multicast network islands. These islands are
interconnected through virtual tunnels. The tunnels bridge through areas that do
not support multicast traffic.
A router that needs to send multicast packets to another multicast island
encapsulates the packets in unicast packets. These encapsulated packets are
transmitted through the standard Internet routers. The destination address
contained in the unicast packets is the endpoint of the tunnel. The router at the
remote end of the tunnel removes the encapsulation header and forwards the
multicast packets to the receiving devices.
Figure 6-15 shows an overview of an MBONE tunnel’s structure.
Figure 6-15 MBONE tunnel
MBONE tunnels have associated metric and threshold parameters. The metric
parameter is used as a cost in the multicast routing algorithm. The routing
algorithm uses this value to select the best path through the network. Figure 6-16
on page 271 depicts an environment containing four multicast sites
interconnected through MBONE tunnels. The tunnels have been assigned
different metric values to skew traffic forwarding through the network.
Figure 6-16 MBONE tunnel metric
A multicast packet sent from router 1 to router 2 should not use the tunnel
directly connecting router 1 and router 2. The cost of the alternate path using
router 3 and router 4 is 5 (1 + 2 + 2). This is more attractive than the direct path
between router 1 and router 2; this path has a cost of 8.
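The metric arithmetic above can be checked with a few lines of Python; the tunnel costs are the illustrative values from Figure 6-16, and the router names are shorthand:

```python
# Per-tunnel metrics, matching the example: the direct router 1 - router 2
# tunnel costs 8, while the path through routers 3 and 4 costs 1 + 2 + 2.
tunnel_cost = {("r1", "r3"): 1, ("r3", "r4"): 2, ("r4", "r2"): 2,
               ("r1", "r2"): 8}

def path_cost(path):
    """Sum the metrics of each tunnel along a path of router names."""
    return sum(tunnel_cost[(a, b)] for a, b in zip(path, path[1:]))
```

So `path_cost(["r1", "r3", "r4", "r2"])` is 5, which the routing algorithm prefers over the direct tunnel's cost of 8.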
The threshold parameter limits the distribution of multicast packets. It specifies a
minimum TTL for a multicast packet forwarded into an established tunnel. The
TTL is decremented by 1 at each multicast router.
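The threshold check can be sketched as follows, as a hypothetical illustration (the function name and return convention are invented for this example):

```python
def forward_into_tunnel(ttl, threshold):
    """A multicast packet enters a tunnel only if its remaining TTL meets
    the tunnel's configured minimum. The TTL is first decremented by 1,
    as at every multicast router. Returns the new TTL, or None if the
    packet is not forwarded into the tunnel."""
    ttl -= 1
    return ttl if ttl >= threshold else None
```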
In the future, most Internet routers will provide direct support for IP multicast.
This will eliminate the need for multicast tunnels. The current MBONE
implementation is only a temporary solution. It will become obsolete when
multicasting is fully supported in every Internet router.
6.9.2 Multicast applications
The first multicast applications provided audio conferencing functions. These
applications have increased in usability and functionality. Recently, development
of multicast systems has accelerated. New and improved applications are being
delivered to support:
򐂰 Multimedia conferencing: These tools have been used on the MBONE for
several years. They support many-to-many audio-only or audio-video
communication. When used in conjunction with whiteboard applications,
these conferences enhance collaboration while requiring minimal bandwidth.
򐂰 Data distribution: These tools provide the ability to simultaneously deliver
data to large numbers of receivers. For example, a central site can efficiently
push updated data files to each district office.
򐂰 Gaming and simulation: These applications have been readily available.
However, the integration of multicast services allow the applications to scale
to a large number of users. Multicast groups can represent different sections
of the game or simulation. As users move from one section to the next, they
exit and join different multicast groups.
򐂰 Real-time data multicast: These applications distribute real-time data to large
numbers of users. For example, stock ticker information can be provided to
sets of workstations. The use of multicast groups can tailor the information
received by a specific device.
Many of these applications use UDP instead of the usual TCP transport support.
With TCP, reliability and flow control mechanisms have not been optimized for
real-time broadcasting of multimedia data. Frequently, the potential to lose a
small percentage of packets is preferred to the transmission delays introduced
with TCP.
In addition to UDP, most applications use the Real-Time Transport Protocol
(refer to 21.3.4, “Real-Time Transport Protocol (RTP)” on page 756). This
protocol provides mechanisms to continuously transmit multimedia data streams
through the Internet without incurring additional delays.
6.10 RFCs relevant to this chapter
The following RFCs provide detailed information about the multicasting protocols
and architectures presented throughout this chapter:
򐂰 RFC 1075 – Distance Vector Multicast Routing Protocol (November 1988)
򐂰 RFC 1112 – Host extensions for IP multicasting (August 1989)
򐂰 RFC 1584 – Multicast Extensions to OSPF (March 1994)
򐂰 RFC 2236 – Internet Group Management Protocol, Version 2
(November 1997)
򐂰 RFC 2362 – Protocol Independent Multicast-Sparse Mode (PIM-SM):
Protocol Specification (June 1998)
򐂰 RFC 2858 – Multiprotocol Extensions for BGP-4 (February 1998)
򐂰 RFC 2909 – The Multicast Address-Set Claim (MASC) Protocol
(September 2000)
򐂰 RFC 3232 – Assigned Numbers: RFC 1700 is Replaced by an On-line
Database (January 2002)
򐂰 RFC 3376 – Internet Group Management Protocol, Version 3 (October 2002)
򐂰 RFC 3618 – Multicast Source Discovery Protocol (MSDP) (October 2003)
򐂰 RFC 3913 – Border Gateway Multicast Protocol (BGMP): Protocol
Specification (September 2004)
򐂰 RFC 3973 – Protocol Independent Multicast - Dense Mode (PIM-DM):
Protocol Specification (January 2005)
򐂰 RFC 4541 – Considerations for Internet Group Management Protocol (IGMP)
and Multicast Listener Discovery (MLD) Snooping Switches (May 2006)
Chapter 7.
Mobile IP
The increasingly mobile nature of the workforce presents problems for the
configuration and operation of mobile network devices. It is possible to allocate
multiple sets of configuration parameters to a device, but this obviously means
an increased workload for the administrator and the user. Perhaps more
important, however, is that this type of configuration is wasteful with respect to
the number of IP addresses allocated.
In DHCP and DDNS environments, DHCP provides a device with a valid IP
address for the point at which it is attached to the network. DDNS provides a
method of locating that device by its host name, no matter where that device
happens to be attached to a network and what IP address it has been allocated.
An alternative approach to the problem of dealing with mobile devices is provided
in RFC 3344 – IP Mobility Support. IP Mobility Support, commonly referred to as
Mobile IP, is a proposed standard, with a status of elective.
7.1 Mobile IP overview
Mobile IP enables a device to maintain the same IP address (its home address)
wherever it attaches to the network. (Obviously, a device with an IP address
plugged into the wrong subnet will normally be unreachable.) However, the
mobile device also has a care-of address, which connects to the subnet where it
is currently located. The care-of address is managed by a home agent, which is a
device on the home subnet of the mobile device. Any packet addressed to the IP
address of the mobile device is intercepted by the home agent and then
forwarded to the care-of address through a tunnel. After it arrives at the end of
the tunnel, the datagram is delivered to the mobile device. The mobile node
generally uses its home address as the source address of all datagrams that it sends.
Mobile IP can help resolve address shortage problems and reduce administrative
workload, because each device that needs to attach to the network at multiple
locations only requires a single IP address.
The following terminology is used in a mobile IP network configuration:
Home address
The static IP address allocated to a mobile node. It does
not change, no matter where the node attaches to the network.
Home network
A subnet with a network prefix matching the home
address of the mobile node. Datagrams intended for the
home address of the mobile node will always be routed to
this network.
Tunnel
The path followed by an encapsulated datagram.
Visited network
A network to which the mobile node is connected (other
than the node's home network).
Home agent
A router on the home network of the mobile node that
maintains current location information for the node and
tunnels datagrams for delivery to the node when it is away
from home.
Foreign agent
A router on a visited network that registers the presence
of a mobile node and detunnels and forwards datagrams
to the node that have been tunneled by the mobile node's
home agent.
7.1.1 Mobile IP operation
Mobility agents (home agents and foreign agents) advertise their presence in the
network by means of agent advertisement messages, which are ICMP router
advertisement messages with extensions (see Figure 7-3 on page 280). A
mobile node can also explicitly request one of these messages with an agent
solicitation message. When a mobile node connects to the network and receives
one of these messages, it is able to determine whether it is on its home network
or a foreign network. If the mobile node detects that it is on its home network, it
will operate normally, without the use of mobility services. In addition, if it has just
returned to the home network, having previously been working elsewhere, it will
deregister itself with the home agent. This is done through the exchange of a
registration request and registration reply.
If, however, the mobile node detects, from an agent advertisement, that it has
moved to a foreign network, it obtains a care-of address for the foreign network.
This address can be obtained from the foreign agent (a foreign agent care-of
address, which is the address of the foreign agent itself), or it can be obtained by
some other mechanism, such as DHCP (in which case, it is known as a
co-located care-of address). The use of co-located care-of addresses has the
advantage that the mobile node does not need a foreign agent to be present at
every network that it visits, but it does require that a pool of IP addresses be
made available for visiting mobile nodes by the DHCP server.
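The mobile node's reaction to an agent advertisement, as described in the two paragraphs above, can be sketched as follows. The function and its return values are invented for this simplified example:

```python
def on_agent_advertisement(home_prefix, advertised_prefix, foreign_agent_addr):
    """Decide how a mobile node reacts to an agent advertisement: if the
    advertised network prefix matches its home network, it deregisters
    (it is back home and needs no mobility services); otherwise it
    registers using the foreign agent's address as its care-of address.
    A co-located care-of address obtained via DHCP would be handled
    analogously."""
    if advertised_prefix == home_prefix:
        return ("deregister", None)
    return ("register", foreign_agent_addr)
```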
Note that communication between a mobile node and a foreign agent takes place
at the link layer level. It cannot use the normal IP routing mechanism, because
the mobile node's IP address does not belong to the subnet in which it is
currently located.
After the mobile node has received its care-of address, it needs to register itself
with its home agent. This can be done through the foreign agent, which forwards
the request to the home agent, or directly with the home agent (see Figure 7-4 on
page 281).
After the home agent has registered the care-of address for the mobile node in
its new position, any datagram intended for the home address of the mobile node
is intercepted by the home agent and tunneled to the care-of address. The tunnel
endpoint can be at a foreign agent (if the mobile node has a foreign agent care-of
address), or at the mobile node itself (if it has a co-located care-of address). Here
the original datagram is removed from the tunnel and delivered to the mobile node.
The mobile node will generally respond to the received datagram using standard
IP routing mechanisms.
Chapter 7. Mobile IP
Figure 7-1 shows a mobile IP operation.
Figure 7-1 Mobile IP operation: (1) Host A sends a datagram to Mobile Node B,
routed to the 9.180.128 network. (2) The home agent intercepts the datagram
and tunnels it to B's care-of address. (3) The foreign agent detunnels the
datagram and forwards it to the mobile node. (4) Mobile Node B replies to A
using standard routing.
7.1.2 Mobility agent advertisement extensions
The mobility agent advertisement consists of an ICMP router advertisement with
one or more of the following extensions, as shown in Figure 7-2.
Figure 7-2 Mobility agent advertisement extension
Length
(6 + [4*N]), where N is the number of care-of addresses.
Sequence number
The number of advertisements sent by this agent since it
was initialized.
Registration lifetime
The longest lifetime, in seconds, that this agent will accept
in a registration request. A value of 0xffff indicates infinity.
This field bears no relationship to the lifetime field in the
router advertisement itself.
R
Registration required. The mobile node must register with
this agent rather than use a co-located care-of address.
B
Busy. The foreign agent cannot accept additional
registrations.
H
Home agent. This agent offers service as a home agent
on this link.
F
Foreign agent. This agent offers service as a foreign
agent on this link.
M
Minimal encapsulation. This agent receives tunneled
datagrams that use minimal encapsulation.
G
GRE encapsulation. This agent receives tunneled
datagrams that use GRE encapsulation.
V
Van Jacobson header compression. This agent supports
the use of Van Jacobson header compression over the link
with any registered mobile node.
Reserved
This area is ignored.
Care-of address(es)
The care-of address or addresses advertised by this
agent. At least one must be included if the F bit is set.
Note that a foreign agent might be too busy to service additional mobile nodes at
certain times. However, it must continue to send agent advertisements (with the
B bit set) so that mobile nodes that are already registered will know that the
agent has not failed and that they are still in range of the foreign agent.
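The extension fields listed above can be decoded directly from the wire format. The following is a minimal Python sketch, assuming the RFC 3344 layout (type 16, length, sequence number, registration lifetime, a flags byte, a reserved byte, then the care-of addresses); the function name and the returned dictionary are illustrative, not from any real API:

```python
import struct

# Flag bits in the mobility agent advertisement extension.
FLAG_R, FLAG_B, FLAG_H, FLAG_F, FLAG_M, FLAG_G, FLAG_V = (
    0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02)

def parse_agent_advertisement_ext(data):
    """Parse a mobility agent advertisement extension (type 16).

    `data` is the raw extension: type, length, sequence number,
    registration lifetime, flags, reserved, then N care-of addresses.
    """
    ext_type, length, sequence, lifetime, flags, _reserved = struct.unpack(
        "!BBHHBB", data[:8])
    if ext_type != 16:
        raise ValueError("not a mobility agent advertisement extension")
    # Length is 6 + 4*N, where N is the number of care-of addresses.
    n_addrs = (length - 6) // 4
    addrs = [
        ".".join(str(b) for b in data[8 + 4 * i: 12 + 4 * i])
        for i in range(n_addrs)
    ]
    return {
        "sequence": sequence,
        "lifetime": lifetime,
        "foreign_agent": bool(flags & FLAG_F),
        "home_agent": bool(flags & FLAG_H),
        "busy": bool(flags & FLAG_B),
        "care_of_addresses": addrs,
    }
```

A mobile node would inspect `home_agent` and `foreign_agent` in the result to decide whether it is on its home network or a foreign network.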
The prefix lengths extension can follow the mobility agent advertisement
extension. It is used to indicate the number of bits that need to be applied to each
router address (in the ICMP router advertisement portion of the message) when
network prefixes are being used for move detection. See Figure 7-3 for more details.
Figure 7-3 Prefix-lengths extensions
Length
The number of router address entries in the router
advertisement portion of the agent advertisement.
Prefix length(s)
The number of leading bits that make up the network
prefix for each of the router addresses in the router
advertisement portion of the agent advertisement. Each
prefix length is a separate byte, in the order that the router
addresses are listed.
7.2 Mobile IP registration process
RFC 3344 defines two different procedures for mobile IP registration: The mobile
node can register through a foreign agent, which relays the registration to the
mobile node's home agent, or it can register directly with its home agent. The
following rules are used to determine which of these registration processes is used:
• If the mobile node has obtained its care-of address from a foreign agent, it
must register through that foreign agent.
• If the mobile node is using a co-located care-of address, but has received an
agent advertisement from a foreign agent on this subnet (which has the R bit
(registration required) set in that advertisement), it registers through the
agent. This mechanism allows for accounting to take place on foreign
subnets, even if DHCP and co-located care-of address is the preferred
method of address allocation.
• If the mobile node is using a co-located care-of address but has not received
such an advertisement, it must register directly with its home agent.
• If the mobile node returns to its home network, it must deregister directly with
its home agent.
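The rules above amount to a small decision procedure. A hedged sketch in Python (the function and argument names are illustrative, not from any real API):

```python
def registration_mode(care_of_from_foreign_agent, r_bit_advertised, at_home):
    """Decide how a mobile node registers, following the RFC 3344 rules.

    All three arguments are booleans describing the node's situation.
    """
    if at_home:
        # Returning home: deregister directly with the home agent.
        return "deregister directly with home agent"
    if care_of_from_foreign_agent:
        # Care-of address came from a foreign agent: register through it.
        return "register through the foreign agent"
    if r_bit_advertised:
        # Co-located care-of address, but a foreign agent on this subnet
        # set the R bit, so registration must still go through the agent.
        return "register through the foreign agent"
    return "register directly with home agent"
```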
The registration process involves the exchange of registration request and
registration reply messages, which are UDP datagrams. The registration request
is sent to port 434. The request consists of a UDP header, followed by the fields
shown in Figure 7-4.
Figure 7-4 Mobile IP: Registration request
S
Simultaneous bindings. If this bit is set, the home agent
keeps any previous bindings for this node as well as
adding the new binding. The home agent will then forward
any datagrams for the node to multiple care-of addresses.
This capability is particularly intended for wireless mobile
nodes.
B
Broadcast datagrams. If this bit is set, the home agent
tunnels any broadcast datagrams on the home network to
the mobile node.
D
Decapsulation by mobile node. The mobile node is using
a co-located care-of address and will, itself, decapsulate
the datagrams sent to it.
M
Minimal encapsulation should be used for datagrams
tunneled to the mobile node.
G
GRE encapsulation should be used for datagrams
tunneled to the mobile node.
V
Van Jacobson compression should be used over the link
between agent and mobile node.
rsv
Reserved bits. Sent as zero.
Lifetime
The number of seconds remaining before the registration
will be considered expired. A value of zero indicates a
request for deregistration. 0xffff indicates infinity.
Home address
The home IP address of the mobile node.
Home agent
The IP address of the mobile node's home agent.
Care-of address
The IP address for the end of the tunnel.
Identification
A 64-bit identification number constructed by the mobile
node and used for matching registration requests with
replies.
Extensions
A number of extensions are defined, all relating to
authentication of the registration process. See RFC 3344
for full details.
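The fixed portion of the registration request can be packed as a simple byte string. A sketch assuming the field layout described above (type 1, a flags byte, a 16-bit lifetime, three 32-bit addresses, and the 64-bit identification); `build_registration_request` is an illustrative name, and the authentication extensions that a real request must carry are omitted:

```python
import struct

def build_registration_request(flags, lifetime, home_addr, home_agent,
                               care_of, identification):
    """Pack the fixed portion of a registration request (sent to UDP 434).

    Addresses are dotted-decimal strings; `identification` is a 64-bit
    integer used to match this request against the eventual reply.
    """
    def ip(addr):
        return bytes(int(p) for p in addr.split("."))
    # Type 1 = registration request, then the S|B|D|M|G|V|rsv flags byte,
    # 16-bit lifetime, home address, home agent, care-of address,
    # and the 64-bit identification.
    return (struct.pack("!BBH", 1, flags, lifetime)
            + ip(home_addr) + ip(home_agent) + ip(care_of)
            + struct.pack("!Q", identification))
```

The resulting 24-byte body would be carried in a UDP datagram addressed to port 434 on the home agent (or relayed through the foreign agent).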
The mobility agent responds to a registration request with a registration reply,
with the destination port copied from the source port of the registration request.
Figure 7-5 shows the registration reply format.
Figure 7-5 Mobile IP: Registration reply
Code
Indicates the result of the registration request:
• Registration accepted.
• Registration accepted, but simultaneous bindings unsupported.
• Registration denied by the foreign agent.
• Registration denied by the home agent.
Lifetime
The number of seconds remaining before the registration
is considered expired. (The Code field must be 0 or 1.)
Home address
Home IP address of the mobile node.
Home agent
IP address of the mobile node's home agent.
Identification
A 64-bit identification number used for matching
registration requests with replies.
Extensions
A number of extensions are defined, all relating to
authentication of the registration process.
For full details of these messages, refer to RFC 3344.
7.2.1 Tunneling
The home agent examines the destination IP address of all datagrams arriving
on the home network. If the address matches a mobile node currently registered
as being away from home, the home agent tunnels the datagram (using IP in IP
encapsulation) to the care-of address for that mobile node. The home agent is
also likely to be a router on the home network, in which case it will also receive
datagrams addressed to mobile nodes that are not currently registered as being
away from home. For such datagrams, the home agent assumes that the mobile
node is at home and forwards them onto the home network.
When a foreign agent receives a datagram sent to its advertised care-of address,
it compares the inner destination address with its list of registered visitors. If it
finds a match, the foreign agent forwards the decapsulated datagram to the
appropriate mobile node. If there is no match, the datagram is discarded. (The
foreign agent must not forward such a datagram based on its original IP header;
otherwise, a routing loop occurs.)
If the mobile node is using a co-located care-of address, the end of the tunnel
lies at the mobile node itself. The mobile node is responsible for decapsulating
the datagrams received from the home agent.
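The home agent's forwarding decision described above can be sketched as a lookup in a table of mobility bindings; the data structure and function name below are illustrative, not part of any real router API:

```python
def home_agent_forward(datagram_dst, mobility_bindings):
    """Sketch of the home agent's per-datagram forwarding decision.

    `mobility_bindings` maps home addresses of nodes registered as away
    from home to their current care-of addresses.
    """
    care_of = mobility_bindings.get(datagram_dst)
    if care_of is None:
        # Node not registered as away: assume it is at home and deliver
        # the datagram on the home network as a normal router would.
        return ("deliver_on_home_network", datagram_dst)
    # IP in IP encapsulation: the original datagram becomes the payload
    # of a new datagram addressed to the care-of address.
    return ("tunnel", care_of)
```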
7.2.2 Broadcast datagrams
If the home agent receives a broadcast datagram, it must not forward it to mobile
nodes unless the mobile node specifically requested forwarding of broadcasts in
its registration request. In this case, it forwards the datagram in one of the
following manners:
• If the mobile node has a co-located care-of address, the home agent simply
encapsulates the datagram and tunnels it directly to the care-of address.
• If the mobile node has a foreign agent care-of address, the home agent first
encapsulates the broadcast in a unicast datagram addressed to the home
address of the node. It then encapsulates and tunnels this datagram to the
care-of address. In this way, the foreign agent, when it decapsulates the
datagram, knows to which of its registered mobile nodes it needs to forward
the broadcast.
7.2.3 Move detection
Mobile IP is designed not just for mobile users who regularly move from one site
to another and attach their mobile computers to different subnets each time, but
also for truly dynamic mobile users (for example, users of a wireless connection
from an aircraft). Two mechanisms are defined that allow the mobile node to
detect when it has moved from one subnet to another. When the mobile node
detects that it has moved, it must re-register with a care-of address on the new
foreign network. The two methods of move detection are as follows:
• Foreign agents constantly advertise their presence in the network by
means of agent advertisements. When the mobile node receives an agent
advertisement from its foreign agent, it starts a timer based on the lifetime
field in the advertisement. If the mobile node has not received another
advertisement from the same foreign agent by the time the lifetime has
expired, the mobile node assumes that it has lost contact with that agent. If, in
the meantime, it has received an advertisement from another foreign agent, it
immediately attempts registration with the new agent. If it has not received
any further agent advertisements, it uses agent solicitation to try to locate a
new foreign agent with which to register.
• The mobile node checks whether any newly received agent advertisements
are on the same subnet as its current care-of address. If the network prefix is
different, the mobile node assumes that it has moved. On expiration of its
current care-of address, the mobile node registers with the foreign agent that
sent the new agent advertisement.
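The first, lifetime-based method can be sketched as a small watchdog. All names below are illustrative, and a real implementation would track multiple agents and run actual timers:

```python
import time

class AgentWatchdog:
    """Minimal sketch of lifetime-based move detection.

    Tracks the last advertisement from the current foreign agent; if the
    advertised lifetime expires without a fresh advertisement, the node
    assumes it has lost contact with that agent.
    """
    def __init__(self):
        self.agent = None
        self.expires_at = 0.0

    def advertisement_received(self, agent_id, lifetime_seconds, now=None):
        now = time.monotonic() if now is None else now
        if self.agent is None or agent_id == self.agent:
            # Same agent (or first one seen): restart the lifetime timer.
            self.agent = agent_id
            self.expires_at = now + lifetime_seconds
            return "refreshed"
        if now >= self.expires_at:
            # Current agent's lifetime expired and another agent is
            # advertising: register with the new agent immediately.
            self.agent = agent_id
            self.expires_at = now + lifetime_seconds
            return "moved: register with new agent"
        return "ignored: still bound to current agent"

    def contact_lost(self, now=None):
        now = time.monotonic() if now is None else now
        return self.agent is not None and now >= self.expires_at
```

When `contact_lost()` becomes true and no other advertisement has arrived, the node would fall back to agent solicitation, as described above.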
7.2.4 Returning home
When the mobile node receives an agent advertisement from its own home
agent, it knows that it has returned to its home network. Before deregistering with
the home agent, the mobile node must configure its routing table for operation on
the home subnet.
7.2.5 ARP considerations
Mobile IP requires two extensions to ARP to cope with the movement of mobile
nodes. These are:
Proxy ARP
An ARP reply sent by one node on behalf of another that
is either unable or unwilling to answer an ARP request on
its own behalf
Gratuitous ARP
An ARP packet sent as a local broadcast packet by one
node that causes all receiving nodes to update an entry in
their ARP cache
When a mobile node is registered as being on a foreign network, its home agent
will use proxy ARP in response to any ARP request seeking the mobile node's
MAC address. The home agent responds to the request, giving its own MAC
address.
When a mobile node moves from its home network and registers itself with a
foreign network, the home agent does a gratuitous ARP broadcast to update the
ARP caches of all local nodes in the network. The MAC address used is again
the MAC address of the home agent.
When a mobile node returns to its home network, having been previously
registered at a foreign network, gratuitous ARP is again used to update ARP
caches of all local nodes, this time with the real MAC address of the mobile node.
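As an illustration of the gratuitous ARP broadcast described above, the following sketch builds the raw Ethernet frame for an ARP reply in which sender and target protocol addresses are both the advertised IP, so every receiver updates its cache entry for that IP. The function name is illustrative:

```python
import struct

def gratuitous_arp(sender_mac, sender_ip):
    """Build an Ethernet frame carrying a gratuitous ARP reply.

    `sender_mac` is 6 raw bytes (the MAC to install in ARP caches, e.g.
    the home agent's); `sender_ip` is a dotted-decimal string.
    """
    ip = bytes(int(p) for p in sender_ip.split("."))
    broadcast = b"\xff" * 6
    # Ethernet header: broadcast destination, ARP ethertype 0x0806.
    ether = broadcast + sender_mac + struct.pack("!H", 0x0806)
    # ARP body: Ethernet/IPv4, hlen 6, plen 4, opcode 2 (reply).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    # Sender and target protocol addresses are both `ip`, which is what
    # makes the reply "gratuitous".
    arp += sender_mac + ip + broadcast + ip
    return ether + arp
```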
7.2.6 Mobile IP security considerations
The mobile computing environment has many potential security vulnerabilities,
particularly when wireless links, which are especially exposed to
eavesdropping, are in use. The tunnel between a home agent and the care-of
address of a mobile node can also be susceptible to interception unless a strong
authentication mechanism is implemented as part of the registration process.
RFC 3344 specifies implementation of keyed MD5 for the authentication protocol
and advocates the use of additional mechanisms (such as encryption) for
environments where total privacy is required.
7.3 RFCs relevant to this chapter
The following RFC provides detailed information about the connection protocols
and architectures presented throughout this chapter:
RFC 3344 – IP Mobility Support (August 2002)
Chapter 8.
Quality of service
With the increased use of IP-based networks, including the Internet, there has
been a growing focus on providing the necessary network resources to certain
applications. That is, it has become better understood that some applications are
more “important” than others and therefore demand preferential treatment
throughout an internetwork. Additionally, applications have different demands,
such as the real-time requirements of low latency and high bandwidth.
This chapter discusses the topic of traffic prioritization, or quality of service
(QoS). It explains why QoS can be desirable in an intranet, as well as in the
Internet, and presents the two main approaches to implementing QoS in TCP/IP networks:
• Integrated Services
• Differentiated Services
© Copyright IBM Corp. 1989-2006. All rights reserved.
8.1 Why QoS?
In the Internet and intranets of today, bandwidth is an important subject. More
and more people are using the Internet for private and business purposes. The
amount of data that is being transmitted through the Internet is increasing
exponentially. Multimedia applications, such as IP telephony and
videoconferencing systems, need a lot more bandwidth than the applications that
were used in the early years of the Internet. While traditional Internet
applications, such as WWW, FTP, or Telnet, cannot tolerate packet loss but are
less sensitive to variable delays, most real-time applications show just the
opposite behavior: They can compensate for a reasonable amount of packet
loss but are usually very sensitive to highly variable delays.
This means that without any bandwidth control, the quality of these real-time
streams depends on the bandwidth that is currently available. Low or unstable
bandwidth causes bad quality in real-time transmissions by leading to, for
example, dropouts and hangs. Even the quality of a transmission using the
Real-time Transport Protocol (RTP) depends on the utilization of the underlying IP delivery service.
Therefore, certain concepts are necessary to guarantee a specific quality of
service (QoS) for real-time applications on the Internet. A QoS can be described
as a set of parameters that describe the quality (for example, bandwidth, buffer
usage, priority, and CPU usage) of a specific stream of data. The basic IP
protocol stack provides only one QoS, which is called best-effort. Packets are
transmitted from point to point without any guarantee of a specific bandwidth or
minimum time delay. With the best-effort traffic model, Internet requests are
handled on a first-come, first-served basis. This means that all requests have
the same priority and are handled one after the other. There is no way to make
bandwidth reservations for specific connections or to raise the priority of
special requests. Therefore, new strategies were developed to provide
predictable services for the Internet.
Today, there are two main approaches to bringing QoS to the Internet and
IP-based internetworks: Integrated Services and Differentiated Services.
Integrated Services
Integrated Services bring enhancements to the IP network model to support
real-time transmissions and guaranteed bandwidth for specific flows. In this
case, we define a flow as a distinguishable stream of related datagrams from a
unique sender to a unique receiver that results from a single user activity and
requires the same QoS.
For example, a flow might consist of one video stream between a given host pair.
To establish the video connection in both directions, two flows are necessary.
Each application that initiates data flows can specify which QoSs are required for
this flow. If the videoconferencing tool needs a minimum bandwidth of 128 kbps
and a minimum packet delay of 100 ms to assure a continuous video display,
such a QoS can be reserved for this connection.
Differentiated Services
Differentiated Services mechanisms do not use per-flow signaling, and as a
result, do not consume per-flow state within the routing infrastructure. Different
service levels can be allocated to different groups of users, which means that all
traffic is distributed into groups or classes with different QoS parameters. This
reduces the amount of maintenance required in comparison to Integrated Services.
8.2 Integrated Services
The Integrated Services (IS) model was defined by an IETF working group to be
the keystone of the planned IS Internet. This Internet architecture model includes
the currently used best-effort service and the new real-time service that provides
functions to reserve bandwidth on the Internet and internetworks. IS was
developed to optimize network and resource utilization for new applications, such
as real-time multimedia, which requires QoS guarantees. Because of routing
delays and congestion losses, real-time applications do not work very well on the
current best-effort Internet. Video conferencing, video broadcast, and audio
conferencing software need guaranteed bandwidth to provide video and audio of
acceptable quality. Integrated Services makes it possible to divide the Internet
traffic into the standard best-effort traffic for traditional uses and application data
flows with guaranteed QoS.
To support the Integrated Services model, an Internet router must be able to
provide an appropriate QoS for each flow, in accordance with the service model.
The router function that provides different qualities of service is called traffic
control. It consists of the following components:
Packet scheduler
The packet scheduler manages the forwarding of different
packet streams in hosts and routers, based on their
service class, using queue management and various
scheduling algorithms. The packet scheduler must ensure
that the packet delivery corresponds to the QoS
parameter for each flow. A scheduler can also police or
shape the traffic to conform to a certain level of service.
The packet scheduler must be implemented at the point
where packets are queued. This is typically the output
driver level of an operating system and corresponds to the
link layer protocol.
Packet classifier
The packet classifier identifies packets of an IP flow in
hosts and routers that will receive a certain level of
service. To realize effective traffic control, each incoming
packet is mapped by the classifier into a specific class. All
packets that are classified in the same class get the same
treatment from the packet scheduler. The choice of a
class is based on the source and destination IP address
and port number in the existing packet header or an
additional classification number, which must be added to
each packet. A class can correspond to a broad category
of flows.
For example, all video flows from a video conference with
several participants can belong to one service class. But it
is also possible that only one flow belongs to a specific
service class.
Admission control
The admission control contains the decision algorithm
that a router uses to determine if there are enough routing
resources to accept the requested QoS for a new flow. If
there are not enough free routing resources, accepting a
new flow would impact earlier guarantees and the new
flow must be rejected. If the new flow is accepted, the
reservation instance in the router assigns the packet
classifier and the packet scheduler to reserve the
requested QoS for this flow. Admission control is invoked
at each router along a reservation path to make a local
accept/reject decision at the time a host requests a
real-time service. The admission control algorithm must
be consistent with the service model.
Admission control is sometimes confused with policy
control, which is a packet-by-packet function, processed
by the packet scheduler. It ensures that a host does not
violate its promised traffic characteristics. Nevertheless,
to ensure that QoS guarantees are honored, the
admission control will be concerned with enforcing
administrative policies on resource reservations. Some
policies will be used to check the user authentication for a
requested reservation. Unauthorized reservation requests
can be rejected. As a result, admission control can play
an important role in accounting costs for Internet
resources in the future.
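The packet classifier's mapping described above can be sketched as a lookup keyed on addresses and ports; the packet and class-table structures below are illustrative, not a real router API:

```python
def classify(packet, classes):
    """Map a packet to a service class, as the packet classifier does.

    `packet` is a dict with source/destination address and port;
    `classes` maps such 4-tuples to service-class names.
    """
    key = (packet["src"], packet["dst"], packet["sport"], packet["dport"])
    # All packets mapped to the same class receive the same treatment
    # from the packet scheduler; unmatched traffic stays best-effort.
    return classes.get(key, "best-effort")
```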
Figure 8-1 shows the operation of the Integrated Service model within a host and
a router.
Figure 8-1 The Integrated Service model
Integrated Services use the Resource Reservation Protocol (RSVP) for the
signalling of the reservation messages. The IS instances communicate through
RSVP to create and maintain flow-specific states in the endpoint hosts and in
routers along the path of a flow. See 8.2.4, “The Resource Reservation Protocol
(RSVP)” on page 296 for a detailed description of the RSVP protocol. As shown
in Figure 8-1, the application that wants to send data packets in a reserved flow
communicates with the reservation instance RSVP. The RSVP protocol tries to
set up a flow reservation with the requested QoS, which will be accepted if the
application fulfilled the policy restrictions and the routers can handle the
requested QoS. RSVP advises the packet classifier and packet scheduler in
each node to process the packets for this flow adequately. When the application
then delivers the data packets to the classifier in the first node, which has
mapped this flow into a specific service class complying with the requested
QoS, the flow is recognized by the sender IP address and passed to the packet
scheduler. The packet scheduler forwards the packets, depending on their
service class, to the next router or, finally, to the receiving host.
is a simplex protocol, QoS reservations are only made in one direction, from the
sending node to the receiving node. If the application in our example wants to
cancel the reservation for the data flow, it sends a message to the reservation
instance, which frees the reserved QoS resources in all routers along the path.
The resources can then be used for other flows. The IS specifications are
defined in RFC 1633.
8.2.1 Service classes
The Integrated Services model uses different classes of service that are defined
by the Integrated Services IETF working group. Depending on the application,
those service classes provide tighter or looser bounds on QoS controls. The
current IS model includes the Guaranteed Service, which is defined in RFC 2212,
and the Controlled Load Service, which is defined in RFC 2211. To understand
these service classes, some terms need to be explained. Because the IS model
provides per-flow reservations, each flow is assigned a flow descriptor. The flow
descriptor defines the traffic and QoS characteristics for a specific flow of data
packets. In the IS specifications, the flow descriptor consists of a filter
specification (filterspec) and a flow specification (flowspec), as illustrated in
Figure 8-2.
Flow Descriptor
Figure 8-2 Flow descriptor
The filterspec identifies the packets that belong to a specific flow with the sender
IP address and source port. The information from the filterspec is used in the
packet classifier. The flowspec contains a set of parameters that are called the
invocation information. The invocation information divides into two groups:
• Traffic Specification (Tspec)
• Service Request Specification (Rspec)
The Tspec describes the traffic characteristics of the requested service. In the IS
model, this Tspec is represented with a token bucket filter. This principle defines
a data-flow control mechanism that adds tokens to a buffer (bucket) at periodic
time intervals and allows a data packet to leave the sender only if
there are at least as many tokens in the bucket as the packet length of the data
packet. This strategy allows precise control of the time interval between two data
packets in the network. The token bucket system is specified by two parameters:
the token rate r, which represents the rate at which tokens are placed into the
bucket, and the bucket capacity b. Both r and b must be positive. Figure 8-3
illustrates the token bucket model.
Figure 8-3 Token bucket filter
The parameter r specifies the long-term data rate and is measured in bytes of IP
datagrams per second. The value of this parameter can range from 1 byte per
second to 40 terabytes per second. The parameter b specifies the burst data rate
allowed by the system and is measured in bytes. The value of this parameter can
range from 1 byte to 250 gigabytes. The range of values allowed for these
parameters is intentionally large in order to be prepared for future network
technologies. The network elements are not expected to support the full range of
the values. Traffic that passes the token bucket filter must obey the rule that over
all time periods T (seconds), the amount of data sent does not exceed rT+b,
where r and b are the token bucket parameters.
Two other token bucket parameters are also part of the Tspec: the minimum
policed unit m and the maximum packet size M. The parameter m specifies the
minimum IP datagram size in bytes. Smaller packets are counted against the
token bucket filter as being of size m. The parameter M specifies the maximum
packet size in bytes that conforms to the Tspec. Network elements must reject a
service request if the requested maximum packet size is larger than the MTU
size of the link. In summary, the token bucket filter is a policing function that
isolates the packets that conform to the traffic specifications from the ones that
do not conform.
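The policing rules above (token rate r, bucket capacity b, minimum policed unit m, maximum packet size M) can be sketched as a small Python class; the class name and interface are illustrative:

```python
class TokenBucketFilter:
    """Token bucket policer following the Tspec description above.

    r is the token rate (bytes/second), b the bucket capacity (bytes),
    m the minimum policed unit, and M the maximum packet size.
    """
    def __init__(self, r, b, m, M):
        self.r, self.b, self.m, self.M = r, b, m, M
        self.tokens = b          # bucket starts full
        self.last = 0.0

    def conforms(self, packet_len, now):
        """Return True if a packet of packet_len bytes may be sent at
        time `now` (seconds); consumes tokens on success."""
        if packet_len > self.M:
            return False         # larger than the maximum packet size
        # Refill at rate r, never beyond the bucket capacity b.
        self.tokens = min(self.b, self.tokens + (now - self.last) * self.r)
        self.last = now
        # Packets smaller than m are counted as being of size m.
        cost = max(packet_len, self.m)
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

Because the bucket never holds more than b tokens and refills at rate r, the bytes accepted over any period T cannot exceed rT + b, which is exactly the conformance rule stated above.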
The Service Request Specification (Rspec) specifies the quality of service the
application wants to request for a specific flow. This information depends on the
type of service and the needs of the QoS requesting application. It can consist of
a specific bandwidth, a maximum packet delay, or a maximum packet loss rate.
In the IS implementation, the information from Tspec and Rspec is used in the
packet scheduler.
8.2.2 Controlled Load Service
The Controlled Load Service is intended to support the class of applications that
are highly sensitive to overloaded conditions in the Internet, such as real-time
applications. These applications work well on underloaded networks, but
degrade quickly under overloaded conditions. If an application uses the
Controlled Load Service, the performance of a specific data flow does not
degrade if the network load increases.
The Controlled Load Service offers only one service level, which is intentionally
minimal. There are no optional features or capabilities in the specification. The
service offers only a single function. It approximates best-effort service over
lightly loaded networks. This means that applications that make QoS
reservations using Controlled Load Services are provided with service closely
equivalent to the service provided to uncontrolled (best-effort) traffic under lightly
loaded conditions. In this context, lightly loaded conditions means that a very
high percentage of transmitted packets will be successfully delivered to the
destination, and the transit delay for a very high percentage of the delivered
packets will not greatly exceed the minimum transit delay.
Each router in a network that accepts requests for Controlled Load Services
must ensure that adequate bandwidth and packet processing resources are
available to handle QoS reservation requests. This can be realized with active
admission control. Before a router accepts a new QoS reservation, represented
by the Tspec, it must consider all important resources, such as link bandwidth,
router or switch port buffer space, and the computational capacity for packet forwarding.
The Controlled Load Service class does not accept or make use of specific target
values for control parameters, such as bandwidth, delay, or loss. Applications
that use Controlled Load Services must be prepared to tolerate small amounts
of packet loss and packet delay.
QoS reservations using Controlled Load Services need to provide a Tspec that
consists of the token bucket parameters r and b, as well as the minimum policed
unit m and the maximum packet size M. An Rspec is not necessary, because
Controlled Load Services does not provide functions to reserve a fixed
bandwidth or guarantee minimum packet delays. Controlled Load Service
provides QoS control only for traffic that conforms to the Tspec that was provided
at setup time. This means that the service guarantees only apply for packets that
respect the token bucket rule that over all time periods T, the amount of data sent
cannot exceed rT+b.
Controlled Load Service is designed for applications that can tolerate a
reasonable amount of packet loss and delay, such as audio and
videoconferencing software.
8.2.3 Guaranteed Service
The Guaranteed Service model provides functions that assure that datagrams
will arrive within a guaranteed delivery time. This means that every packet of a
flow that conforms to the traffic specifications will arrive no later than the
maximum delay time that is specified in the flow descriptor. Guaranteed Service is used for
applications that need a guarantee that a datagram will arrive at the receiver not
later than a certain time after it was transmitted by its source.
For example, real-time multimedia applications, such as video and audio
broadcasting systems that use streaming technologies, cannot use datagrams
that arrive after their proper play-back time. Applications that have hard real-time
requirements, such as real-time distribution of financial data (share prices), will
also require guaranteed service. Guaranteed Service does not minimize jitter
(the difference between the minimal and maximal datagram delays), but it
controls the maximum queuing delay.
The Guaranteed Service model represents the extreme end of delay control for
networks. Other service models providing delay control have much weaker delay
restrictions. Therefore, Guaranteed Service is only useful if it is provided by
every router along the reservation path.
Guaranteed Service gives applications considerable control over their delay. It is
important to understand that the delay in an IP network has two parts: a fixed
transmission delay and a variable queuing delay. The fixed delay depends on the
chosen path, which is determined not by guaranteed service but by the setup
mechanism. All data packets in an IP network have a minimum delay that is
limited by the speed of light and the turnaround time of the data packets in all
routers on the routing path. The queuing delay is determined by Guaranteed
Service and it is controlled by two parameters: the token bucket (in particular, the
bucket size b) and the bandwidth R that is requested for the reservation. These
parameters are used to construct the fluid model for the end-to-end behavior of a
flow that uses Guaranteed Service.
The fluid model specifies the service that would be provided by a dedicated link
between sender and receiver that provides the bandwidth R. In the fluid model,
the flow's service is completely independent from the service for other flows. The
definition of Guaranteed Service relies on the result that the fluid delay of a flow
obeying a token bucket (r,b) and being served by a line with bandwidth R is
bounded by b/R as long as R is not less than r. Guaranteed Service
approximates this behavior with the service rate R, where R is now a share of
bandwidth through the routing path and not the bandwidth of a dedicated line.
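As a worked sketch of the fluid-model bound (ignoring the per-hop error terms that RFC 2212 adds to the pure fluid bound), the queuing delay limit b/R can be computed directly; the function name is ours:

```python
def guaranteed_queuing_delay_bound(r, b, R):
    """Fluid-model queuing delay bound for a (r, b) token-bucket flow
    served at rate R: delay <= b / R, valid only when R >= r."""
    if R < r:
        raise ValueError("reserved rate R must be at least the token rate r")
    return b / R

# A flow with bucket depth b = 64000 bytes, reserved at R = 125000 B/s
# (1 Mbps), sees at most 64000 / 125000 = 0.512 s of queuing delay.
print(guaranteed_queuing_delay_bound(r=100_000, b=64_000, R=125_000))  # prints: 0.512
```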
In the Guaranteed Service model, Tspec and Rspec are used to set up a flow
reservation. The Tspec is represented by the token bucket parameters. The
Rspec contains the parameter R, which specifies the bandwidth for the flow
reservation. The Guaranteed Service model is defined in RFC 2212.
8.2.4 The Resource Reservation Protocol (RSVP)
The Integrated Services model uses the Resource Reservation Protocol (RSVP)
to set up and control QoS reservations. RSVP is defined in RFC 2205 and has
the status of a proposed standard. Because RSVP is an Internet control protocol
and not a routing protocol, it requires an existing routing protocol to operate. The
RSVP protocol runs on top of IP and UDP and must be implemented in all
routers on the reservation path. The key concepts of RSVP are flows and
reservations.
An RSVP reservation applies to a specific flow of data packets on a specific path
through the routers. As described in 8.1, “Why QoS?” on page 288, a flow is
defined as a distinguishable stream of related datagrams from a unique sender
to a unique receiver. If the receiver is a multicast address, a flow can reach
multiple receivers. RSVP provides the same service for unicast and multicast
flows. Each flow is identified by RSVP by its destination IP address and
destination port. Each flow has a dedicated flow descriptor, which contains the
QoS that the flow requires. The RSVP protocol does not understand the
contents of the flow descriptor. It is carried as an opaque object by RSVP and is
delivered to the router's traffic control functions (packet classifier and scheduler)
for processing.
Because RSVP is a simplex protocol, reservations are only made in one direction.
For duplex connections, such as video and audio conferences, where each
sender is also a receiver, it is necessary to set up two RSVP sessions, one for
each direction.
The RSVP protocol is receiver-initiated. Using RSVP signalling messages, the
sender advertises a specific QoS to the receiver, which sends an RSVP
reservation message back with the QoS that should be reserved for the flow
from the sender to the receiver. This design accommodates the different QoS
requirements of heterogeneous receivers in large multicast groups. The sender
does not need to know the characteristics of all possible receivers to structure
the reservations.
To establish a reservation with RSVP, the receivers send reservation requests to
the senders, depending on their system capabilities. For example, suppose that
a fast workstation and a slow PC both want to receive a high-quality MPEG
video stream with 30 frames per second and a data rate of 1.5 Mbps. The
workstation has enough CPU performance to decode the video stream, but the
PC can only decode 10 frames per second. If the video server announces to the
two receivers that it can provide the 1.5 Mbps video stream, the workstation can
return a reservation request for the full 1.5 Mbps. But the PC does not need the
full bandwidth for its flow because it cannot decode all frames. So the PC may
send a reservation request for a flow with 10 frames per second and 500 kbps.
RSVP operation
A basic part of a resource reservation is the path: the route that a packet flow
takes through the different routers from the sender to the receiver. All packets
that belong to a specific flow use the same path. The path is established by
messages that the sender generates, which travel in the same direction as the flow. Each
sender host periodically sends a path message for each data flow it originates.
The path message contains traffic information that describes the QoS for a
specific flow. Because RSVP does not handle routing by itself, it uses the
information from the routing tables in each router to forward the RSVP
messages.
When the path message reaches the first RSVP router, the router stores the IP
address in the last hop field in the message, which is the address of the
sender. The router then inserts its own IP address into the last hop field, sends
the path message to the next router, and the process repeats until the message
reaches the receiver. At the end of this process, each router knows the
address of the previous router, and the path can be traversed backward.
Figure 8-4 shows the process of the path definition.
Figure 8-4 RSVP path definition process
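The previous-hop recording that Figure 8-4 illustrates can be simulated in a few lines of Python. This is a sketch of the bookkeeping only, with hypothetical names; it is not an RSVP implementation:

```python
def propagate_path(sender, routers, receiver):
    """Simulate RSVP Path message propagation: each hop stores the address
    found in the message's last-hop (PHOP) field, then writes its own
    address into that field before forwarding downstream."""
    phop_state = {}          # node -> address of its previous hop
    last_hop = sender
    for node in routers + [receiver]:
        phop_state[node] = last_hop   # remember who sent us the Path message
        last_hop = node               # we become the previous hop downstream
    return phop_state

state = propagate_path("sender", ["Router 1", "Router 2", "Router 3"], "receiver")
print(state)   # each node knows its upstream neighbor, so Resv can walk back
```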
Routers that have received a path message are prepared to process resource
reservations for a flow. All packets that belong to this flow take the same
route through the routers (the route that was defined with the path messages).
After the path messages have been sent, the state of the system is as follows: All
receivers know that a sender can provide a specific QoS for a flow, and all routers
know about the possible resource reservation for this flow.
Now if a receiver wants to reserve QoS for this flow, it sends a reservation (Resv)
message. The reservation message contains the QoS requested from this
receiver for a specific flow and is represented by the filterspec and flowspec that
form the flow descriptor. The receiver sends the Resv message to the last router
in the path with the address it received from the path message. Because every
RSVP-capable device knows the address of the previous device on the path,
reservation messages travel the path in reverse direction toward the sender and
establish the resource reservation in every router. Figure 8-5 shows the flow of
the reservation messages through the routers.
Figure 8-5 RSVP Resv messages flow
At each node, a reservation request initiates two actions:
1. QoS reservation on this link
The RSVP process passes the request to the admission control and policy
control instance on the node. The admission control checks if the router has
the necessary resources to establish the new QoS reservation, and the policy
control checks if the application has the authorization to make QoS requests.
If one of these tests fails, the reservation is rejected and the RSVP process
returns a ResvErr error message to the appropriate receiver. If both checks
succeed, the node uses the filterspec information in the Resv message to set
the packet classifier and the flowspec information to set the packet scheduler.
After this, the packet classifier will recognize the packets that belong to this
flow, and the packet scheduler will obtain the desired QoS defined by the
flowspec.
Figure 8-6 shows the reservation process in an RSVP router.
Figure 8-6 RSVP reservation process
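The two checks illustrated in Figure 8-6 can be sketched as follows. The data structures and names are hypothetical, and flowspecs are reduced to a single bandwidth number for simplicity:

```python
def process_resv(node, flow_descriptor):
    """Sketch of the two checks a node performs on a Resv message.
    node: dict with 'free_bandwidth' (bps) and an 'authorized' set;
    flow_descriptor: dict with 'flowspec' (requested bps) and 'filterspec'."""
    flowspec = flow_descriptor["flowspec"]
    filterspec = flow_descriptor["filterspec"]

    # Admission control: does this node have the necessary resources?
    if flowspec > node["free_bandwidth"]:
        return "ResvErr"                 # reject: not enough resources
    # Policy control: is this reservation administratively permitted?
    # (simplified: permit reservations only for known senders)
    if filterspec["sender"] not in node["authorized"]:
        return "ResvErr"                 # reject: no authorization

    # Both checks passed: program classifier and scheduler, forward upstream.
    node["classifier"] = filterspec      # recognize packets of this flow
    node["scheduler"] = flowspec         # grant the requested QoS
    node["free_bandwidth"] -= flowspec
    return "forward upstream"

node = {"free_bandwidth": 1_000_000, "authorized": {"sender-1"}}
print(process_resv(node, {"flowspec": 500_000,
                          "filterspec": {"sender": "sender-1"}}))
```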
2. Forwarding of the reservation request
After a successful admission and policy check, a reservation request is
propagated upstream toward the sender. In a multicast environment, a
receiver can get data from multiple senders. The set of sender hosts to which
a given reservation request is propagated is called the scope of that request.
The reservation request that is forwarded by a node after a successful
reservation can differ from the request that was received from the previous
hop downstream. One possible reason for this is that the traffic control
mechanism may modify the flowspec hop-by-hop. Another more important
reason is that in a multicast environment, reservations from different
downstream branches, but for the same sender, are merged together as they
travel across the upstream path. This merging is necessary to conserve
resources in the routers.
A successful reservation request propagates upstream along the multicast
tree until it reaches a point where an existing reservation is equal to or greater
than that being requested. At this point, the arriving request is merged with
the reservation in place and does not need to be forwarded further.
Figure 8-7 shows the reservation merging for a multicast flow.
Figure 8-7 RSVP reservation merging for multicast flow
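The merging shown in Figure 8-7 can be sketched in one line if flowspecs are reduced to a single bandwidth number (a simplification; real flowspec merging compares full token bucket parameters):

```python
def merge_reservations(requests):
    """Merge Resv requests from different downstream branches for the same
    sender: only the largest request is forwarded upstream (sketch with
    flowspecs reduced to a single bandwidth number in bps)."""
    return max(requests)

# Two downstream branches ask for 500 kbps and 1.5 Mbps; the router
# forwards a single merged 1.5 Mbps request toward the sender.
print(merge_reservations([500_000, 1_500_000]))  # prints: 1500000
```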
When the reservation request reaches the sender, the QoS reservation has
been established in every router on the path, and the application can start to
send packets downstream to the receivers. The packet classifier and the packet
scheduler in each router make sure that the packets are forwarded according
to the requested QoS.
This type of reservation is only reasonable if all routers on the path support
RSVP. If only one router does not support resource reservation, the service
cannot be guaranteed for the whole path because of the “best effort”
characteristics of normal routers. A router on the path that does not support
RSVP is a bottleneck for the flow.
A receiver that originates a reservation request can also request a
confirmation message that indicates that the request was installed in the
network. The receiver includes a confirmation request in the Resv message
and gets a ResvConf message if the reservation was established successfully.
RSVP resource reservations maintain soft-state in routers and hosts, which
means that a reservation is canceled if RSVP does not send refresh
messages along the path for an existing reservation. This allows route
changes without resulting in protocol inefficiencies. Path messages must also
be refreshed periodically, because the path state fields in the routers are reset
after a timeout period.
Path and reservation states can also be deleted with RSVP teardown
messages. There are two types of teardown messages:
– PathTear messages
PathTear messages travel downstream from the point of initiation to all
receivers, deleting the path state as well as all dependent reservation
states in each RSVP-capable device.
– ResvTear messages
ResvTear messages travel upstream from the point of initiation to all
senders, deleting reservation states in all routers and hosts.
A teardown request can be initiated by senders, receivers, or routers that
notice a state timeout. Because of the soft-state principle of RSVP
reservations, it is not really necessary to explicitly tear down an old
reservation. Nevertheless, we recommend that all end hosts send a teardown
request if an existing reservation is no longer needed.
RSVP reservation styles
Users of multicast multimedia applications often receive flows from different
senders. In the reservation process described in “RSVP operation” on page 297,
a receiver must initiate a separate reservation request for each flow it wants to
receive. But RSVP provides a more flexible way to reserve QoS for flows from
different senders. A reservation request includes a set of options that are called
the reservation style. One of these options deals with the treatment of
reservations for different senders within the same session. The receiver can
establish a distinct reservation for each sender or make a single shared
reservation for all packets from the senders in one session.
Another option defines how the senders for a reservation request are selected. It
is possible to specify an explicit list or a wildcard that selects the senders
belonging to one session. In an explicit sender-selection reservation, a filterspec
must identify exactly one sender. In a wildcard sender-selection, the filterspec is
not needed. Figure 8-8 shows the reservation styles that are defined with this
reservation option.
                  Distinct reservations      Shared reservations
Explicit sender   Fixed-Filter (FF) style    Shared-Explicit (SE) style
Wildcard sender   (Not defined)              Wildcard-Filter (WF) style
Figure 8-8 RSVP reservation styles
Wildcard-Filter (WF) The Wildcard-Filter style uses the options shared
reservation and wildcard sender selection. This
reservation style establishes a single reservation for all
senders in a session. Reservations from different senders
are merged together along the path so that only the
biggest reservation request reaches the senders.
A wildcard reservation is forwarded upstream to all
sender hosts. If new senders appear in the session (for
example, new members join a videoconference), the
reservation is extended to these new senders.
Fixed-Filter (FF)
The Fixed-Filter style uses the options distinct
reservation and explicit sender selection. This means
that a distinct reservation is created for data packets from
a particular sender. Packets from different senders that
are in the same session do not share reservations.
Shared-Explicit (SE) The Shared-Explicit style uses the options shared
reservation and explicit sender selection. This means that
a single reservation covers flows from a specified subset
of senders. Therefore, a sender list must be included in
the reservation request from the receiver.
Reservations established in shared style (WF and SE) are mostly used for
multicast applications. For this type of application, it is unlikely that several data
sources transmit data simultaneously, so it is not necessary to reserve QoS for
each sender.
For example, in an audio conference that consists of five participants, every
station sends a data stream with 64 kbps. With a Fixed-Filter style reservation, all
members of the conference must establish four separate 64 kbps reservations
for the flows from the other senders. But in an audio conference, usually only one
or two people speak at the same time. Therefore, it is sufficient to reserve a
bandwidth of 128 kbps for all senders, because most audio conferencing
software uses silence suppression, which means that if a person does not speak,
no packets are sent. This can be realized if every receiver makes one shared
reservation of 128 kbps for all senders.
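The arithmetic behind this example can be summarized as follows (a sketch; the variable names are ours):

```python
participants = 5
stream_kbps = 64
concurrent_speakers = 2   # silence suppression: at most two talk at once

# Fixed-Filter: every receiver reserves a distinct flow per other sender.
ff_per_receiver = (participants - 1) * stream_kbps       # 4 * 64 = 256 kbps

# Shared (WF or SE): one reservation sized for the concurrent speakers.
shared_per_receiver = concurrent_speakers * stream_kbps  # 2 * 64 = 128 kbps

print(ff_per_receiver, shared_per_receiver)  # prints: 256 128
```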
Using the Shared-Explicit style, all receivers must explicitly identify all other
senders in the conference. With the Wildcard-Filter style, the reservation counts
for every sender that matches the reservation specifications. If, for example, the
audio conferencing tool sends the data packets to a specific port, the
receivers can make a Wildcard-Filter reservation for all packets with this
destination port.
RSVP message format
An RSVP message basically consists of a common header, followed by a body
consisting of a variable number of objects. The number and the content of these
objects depend on the message type. The message objects contain the
information that is necessary to realize resource reservations, for example, the
flow descriptor or the reservation style. In most cases, the order of the objects in
an RSVP message makes no logical difference. RFC 2205 recommends that an
RSVP implementation should use the object order defined in the RFC, but
accept the objects in any permissible order. Figure 8-9 shows the common
header of an RSVP message.
Figure 8-9 RSVP common header
Vers
4-bit RSVP protocol version number. The current version is 1.
Flags
4-bit field that is reserved for flags. No flags are defined yet.
Message type
8-bit field that specifies the message type: 1 Path, 2 Resv, 3 PathErr, 4 ResvErr, 5 PathTear, 6 ResvTear, 7 ResvConf.
RSVP checksum
16-bit field. The checksum can be used by receivers of an RSVP message to detect errors in the transmission of this message.
Send TTL
8-bit field, which contains the IP TTL value with which the message was sent.
RSVP length
16-bit field that contains the total length of the RSVP message, including the common header and all objects that follow. The length is counted in bytes.
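Based on the field layout above, the 8-byte common header can be parsed with Python's struct module. This is an illustrative sketch, not an RSVP implementation:

```python
import struct

def parse_rsvp_common_header(data):
    """Parse the 8-byte RSVP common header (RFC 2205), big-endian:
    version/flags, message type, checksum, send TTL, reserved, length."""
    vers_flags, msg_type, checksum, send_ttl, _reserved, length = \
        struct.unpack("!BBHBBH", data[:8])
    return {
        "version": vers_flags >> 4,       # upper 4 bits
        "flags": vers_flags & 0x0F,       # lower 4 bits (none defined)
        "msg_type": msg_type,             # 1 = Path, 2 = Resv, ...
        "checksum": checksum,
        "send_ttl": send_ttl,
        "length": length,                 # total message length in bytes
    }

# Version 1, Path message (type 1), send TTL 64, total length 64 bytes.
hdr = parse_rsvp_common_header(bytes([0x10, 0x01, 0x00, 0x00,
                                      0x40, 0x00, 0x00, 0x40]))
print(hdr["version"], hdr["msg_type"], hdr["length"])  # prints: 1 1 64
```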
The RSVP objects that follow the common header consist of a 32-bit header and
one or more 32-bit words. Figure 8-10 shows the RSVP object header.
Length (bytes) | Class-Number | C-Type | (Object contents)
Figure 8-10 RSVP object header
Length
16-bit field that contains the object length in bytes. This must be a multiple of 4. The minimum length is 4 bytes.
Class-Number
Identifies the object class. The following classes are defined:
The NULL object has a Class-Number of zero. The
length of this object must be at least 4, but can be any
multiple of 4. The NULL object can appear anywhere
in the object sequence of an RSVP message. The
content is ignored by the receiver.
The Session object contains the IP destination
address, the IP protocol ID, and the destination port to
define a specific session for the other objects that
follow. The Session object is required in every RSVP message.
The RSVP_HOP object contains the IP address of the
node that sent this message and a logical outgoing
interface handle. For downstream messages (for
example, path messages), the RSVP_HOP object
represents a PHOP (previous hop) object, and for
upstream messages (for example, Resv messages), it
represents an NHOP (next hop) object.
The Time_Values object contains the refresh period
for path and reservation messages. If these messages
are not refreshed within the specified time period, the
path or reservation state is canceled.
The Style object defines the reservation style and
some style-specific information that is not in Flowspec
or Filterspec. The Style object is required in every
Resv message.
The Flowspec object specifies the required QoS in reservation (Resv) messages.
The Filterspec object defines which data packets
receive the QoS specified in the Flowspec.
This object contains the sender IP address and
additional demultiplexing information, which is used to
identify a sender. The Sender_Template is required in
every Path message.
This object defines the traffic characteristics of a data
flow from a sender. The Sender_Tspec is required in
all path messages.
The Adspec object is used to provide advertising
information to the traffic control modules in the RSVP
nodes along the path.
The Error_Spec object specifies an error in a PathErr or ResvErr
message, or a confirmation in a ResvConf message.
The Policy_Data object contains information that allows a policy
module to decide whether an associated reservation is
administratively permitted. It can be used in
Path, Resv, PathErr, or ResvErr messages.
The Integrity object contains cryptographic data to
authenticate the originating node and to verify the
contents of an RSVP message.
The Scope object contains an explicit list of sender
hosts to which the information in the message is
sent. The object can appear in a Resv, ResvErr, or
ResvTear message.
The Resv_Confirm object contains the IP address of a receiver that
requests confirmation for its reservation. It can be
used in a Resv or ResvConf message.
C-Type
The C-Type specifies the object type within the class number. Different object types are used for IPv4 and IPv6 addresses.
Object contents
The object content depends on the object type and has a
maximum length of 65528 bytes.
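An RSVP object can be parsed the same way; a sketch using Python's struct module, with a function name of our choosing:

```python
import struct

def parse_rsvp_object(data):
    """Parse one RSVP object: a 4-byte header (length, class-number,
    c-type) followed by the object contents. The length includes the
    header and must be a multiple of 4 (RFC 2205)."""
    length, class_num, c_type = struct.unpack("!HBB", data[:4])
    if length < 4 or length % 4 != 0:
        raise ValueError("invalid RSVP object length")
    return class_num, c_type, data[4:length]

# A NULL object (Class-Number 0) of minimum length 4 carries no contents.
cls, ctype, body = parse_rsvp_object(bytes([0x00, 0x04, 0x00, 0x01]))
print(cls, ctype, body)  # prints: 0 1 b''
```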
All RSVP messages are built of a variable number of objects. The recommended
object order for the most important RSVP messages, the path message and the
Resv message, is shown in Figure 8-11, which gives an overview of the format of
the RSVP path message. Objects that can appear in a path message, but that
are not required, are in parentheses.
Common Header | (Integrity) | Session | RSVP_Hop | Time_Values | (Policy_Data) | Sender_Template | Sender_Tspec | (Adspec)
Figure 8-11 RSVP path message format
If the Integrity object is used in the path message, it must immediately follow the
common header. The order of the other objects can differ in different RSVP
implementations, but the order shown in Figure 8-11 is recommended by the
RFC.
The RSVP Resv message looks similar to the path message. Figure 8-12
shows the objects used for reservation messages.
Common Header | (Integrity) | Session | RSVP_Hop | Time_Values | (Resv_Confirm) | (Scope) | (Policy_Data) | Style | Flow Descriptor List
Figure 8-12 RSVP Resv message format
As in the path message, the Integrity object must follow the common header if it
is used. Another restriction applies for the Style object and the following flow
descriptor list. They must occur at the end of the message. The order of the other
objects follows the recommendation from the RFC.
For a detailed description of the RSVP message structure and the handling of
the different reservation styles in reservation messages, consult RFC 2205.
8.2.5 Integrated Services outlook
Integrated Services is designed to provide end-to-end quality of service (QoS) to
applications over heterogeneous networks. This means that Integrated Services
has to be supported by several different network types and devices. It also
means that elements within the network, such as routers, need information to
provide the requested service for an end-to-end QoS flow. This information is
set up in routers by the Resource Reservation Protocol (RSVP). RSVP is a
signaling protocol that can carry Integrated Services information.
Although RSVP can be used to request resources from the network, it is
Integrated Services that defines the needed service types, quantifies resource
requirements, and determines the availability of the requested resources.
There are some factors that have prevented the deployment of RSVP and, thus,
Integrated Services in the Internet. These include:
򐂰 Only a small number of hosts currently generate RSVP signalling. Although
the number is expected to grow in the near future, many applications cannot
generate RSVP signalling.
򐂰 Integrated Services is based on flow-state and flow-processing. If
flow-processing rises dramatically, it might become a scalability concern for
large networks.
򐂰 The necessary policy control mechanisms, such as access control
authentication and accounting, have only recently become available.
The requirements of the market will determine whether Integrated Services with
RSVP inspires service providers to use these protocols. This also requires that
network devices (for example, routers) have the required software support.
Another aspect also needs to be considered: Support of Integrated Services
running over Differentiated Services networks is a possibility. This solution offers:
򐂰 End-to-end QoS for applications, such as IP telephony and video on demand.
򐂰 Intserv enables hosts to request per-flow, quantifiable resources along
the end-to-end path, including feedback about admission to the resources.
򐂰 Diffserv eliminates the need for per-flow state and per-flow processing, and
therefore enables scalability across large networks.
8.3 Differentiated Services
The Differentiated Services (DS) concept was developed by the IETF DiffServ
working group; the core specifications are defined in RFC 2474 and RFC 2475.
This section gives an overview of the basics and the ideas behind providing
service differentiation in the Internet. Because aspects of the concept are still
evolving, some of the specifications mentioned in this book might change in
later definitions of Differentiated Services.
The goal of DS development is to provide differentiated classes of service for
Internet traffic to support various types of applications and meet specific
business requirements. DS offers predictable performance (delay, throughput,
packet loss, and so on) for a given load at a given time. The difference between
Integrated Services, described in 8.2.5, “Integrated Services outlook” on
page 308 and Differentiated Services is that DS provides scalable service
discrimination in the Internet without the need for per-flow state and signaling at
every hop. It is not necessary to perform a unique QoS reservation for each flow.
With DS, the Internet traffic is split into different classes with different QoS
characteristics.
A central component of DS is the service level agreement (SLA). An SLA is a
service contract between a client and a service provider that specifies the details
of the traffic classifying and the corresponding forwarding service a client should
receive. A client can be a user organization or another DS domain. The service
provider must assure that the traffic of a client, with whom it has an SLA, gets the
contracted QoS. Therefore, the service provider's network administration must
set up the appropriate service policies and measure the network performance to
guarantee the agreed traffic performance.
To distinguish the data packets from different clients in DS-capable network
devices, the IP packets are modified in a specific field. A small bit-pattern, called
the DS field, in each IP packet is used to mark the packets that receive a
particular forwarding treatment at each network node. The DS field uses the
space of the former TOS octet in the IPv4 IP header and the traffic class octet in
the IPv6 header. All network traffic inside of a domain receives a service that
depends on the traffic class that is specified in the DS field.
To provide services that conform to an SLA, the following mechanisms must be combined
in a network:
򐂰 Setting bits in the DS field (TOS octet) at network edges and administrative
boundaries.
򐂰 Using those bits to determine how packets are treated by the routers inside
the network.
򐂰 Conditioning the marked packets at network boundaries in accordance with
the QoS requirements of each service.
The currently defined DS architecture only provides service differentiation in one
direction and is therefore asymmetric. Development of a complementary
symmetric architecture is a topic of current research. The following section
describes the DS architecture in more detail.
8.3.1 Differentiated Services architecture
Unlike Integrated Services, QoS guarantees made with Differentiated Services
are static and stay long-term in routers. This means that applications using DS
do not need to set up QoS reservations for specific data packets. All traffic that
passes DS-capable networks can receive a specific QoS. The data packets must
be marked with the DS field that is interpreted by the routers in the network.
Per-hop behavior (PHB)
The DS field uses six bits to determine the Differentiated Services Code Point
(DSCP). This code point is used by each node in the network to select the PHB.
A two-bit currently unused (CU) field is reserved. The value of the CU bits is
ignored by Differentiated Services-compliant nodes when the PHB is selected
for received packets. Figure 8-13 shows the structure of the defined DS field.
Figure 8-13 DS field
Each DS-capable network device must have information about how packets with
different DS fields should be handled. In the DS specifications, this information is
called the per-hop behavior (PHB). It is a description of the forwarding treatment
a packet receives at a given network node. The DSCP value in the DS field is
used to select the PHB a packet experiences at each node. To provide
predictable services, per-hop behaviors need to be available in all routers in a
Differentiated Services-capable network. The PHB can be described as a set of
parameters inside of a router that can be used to control how packets are
scheduled onto an output interface. This can be a number of separate queues
with priorities that can be set, parameters for queue lengths, or drop algorithms
and drop preference weights for packets.
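A sketch of how a node extracts the DSCP from the DS field (the EF codepoint 101110 follows RFC 2598; the function name is ours):

```python
def dscp_of(tos_octet):
    """Extract the 6-bit DSCP from the DS field (the former IPv4 TOS
    octet / IPv6 traffic class octet): the upper six bits. The two
    low-order CU bits are ignored."""
    return tos_octet >> 2

DEFAULT_PHB = 0b000000   # best-effort default PHB
EF_PHB = 0b101110        # Expedited Forwarding codepoint (RFC 2598)

print(dscp_of(0xB8) == EF_PHB, dscp_of(0x00) == DEFAULT_PHB)  # prints: True True
```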
DS requires routers that support queue scheduling and management to prioritize
outbound packets and control the queue depth to minimize congestion in the
network. The traditional FIFO queuing in common Internet routers provides no
service differentiation and can lead to network performance problems. The
packet treatment inside of a router depends on the router's capabilities and its
particular configuration, and it is selected by the DS field in the IP packet. For
example, if an IP packet reaches a router with eight queues that all have
different priorities, the DS field can be used to select the queue that handles
this packet. The scale reaches from zero, for the lowest priority, to seven for the
highest priority. See Figure 8-14 for an example.
Incoming Traffic -> Queue 7 (Highest Priority), Queue 6, Queue 5, Queue 4, Queue 3, Queue 2, Queue 1, Queue 0 (Lowest Priority) -> Outgoing Traffic
Figure 8-14 DS routing example
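The queue selection in Figure 8-14 can be sketched as a simple table lookup; the mapping shown is a hypothetical local policy, not a standardized one:

```python
def select_queue(dscp, dscp_to_priority):
    """Map a packet's DSCP to one of eight priority queues (0 = lowest,
    7 = highest); unknown codepoints fall back to best effort (queue 0)."""
    return dscp_to_priority.get(dscp, 0)

# Hypothetical local policy: EF traffic to the top queue, default to 0.
policy = {0b101110: 7, 0b000000: 0}
print(select_queue(0b101110, policy), select_queue(0b011010, policy))  # prints: 7 0
```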
Another example is a router that has a single queue with multiple drop priorities
for data packets. It uses the DS field to select the drop preference for the packets
in the queue. A value of zero means “it is most likely to drop this packet,” and
seven means “it is least likely to drop this packet.” Another possible configuration
is four queues with two levels of drop preference in each.
To make sure that the per-hop behaviors in each router are functionally
equivalent, certain common PHBs must be defined in future DS specifications to
avoid having the same DS field value causing different forwarding behaviors in
different routers of one DS domain. This means that in future DS specifications,
some unique PHB values must be defined that represent specific service
classes. All routers in one DS domain must know which service a packet with a
specific PHB should receive. The DiffServ Working Group will propose PHBs that
should be used to provide differentiated services. Some of these proposed PHBs
will be standardized; others might have widespread use, and still others might
remain experimental.
PHBs will be defined in groups. A PHB group is a set of one or more PHBs that
can only be specified and implemented simultaneously because of queue
servicing or queue management policies that apply to all PHBs in one group. A
default PHB must be available in all DS-compliant nodes. It represents the
standard best-effort forwarding behavior available in existing routers. When no
other agreements are in place, it is assumed that packets belong to this service
level. The IETF working group recommends that you use the DSCP value
000000 in the DS field to define the default PHB.
Another PHB that is proposed for standardization is the Expedited Forwarding
(EF) PHB. It is a high priority behavior that is typically used for network control
traffic, such as routing updates. The value 101110 in the DSCP field of the DS
field is recommended for the EF PHB.
8.3.2 Organization of the DSCP
There are some IANA considerations concerning the DSCP. The codepoint
space for the DSCP distinguishes between 64 codepoint values. The proposal is
to divide this space into three pools: one for standard actions and two for
experimental or local use, where one of the two experimental pools may be
reclaimed for standard use in the future (see Table 8-1).
Table 8-1 DSCP pools
Pool   Codepoint space   Assignment policy
1      xxxxx0            Standard action
2      xxxx11            Experimental/local use
3      xxxx01            Future experimental/local use
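Pool membership depends only on the low-order bits of the codepoint (Pool 1 = xxxxx0, Pool 2 = xxxx11, Pool 3 = xxxx01, as later standardized in RFC 2474); a sketch with a function name of our choosing:

```python
def dscp_pool(dscp):
    """Classify a 6-bit DSCP into its assignment pool (RFC 2474):
    Pool 1 = xxxxx0 (standard action), Pool 2 = xxxx11
    (experimental/local use), Pool 3 = xxxx01 (reserved for future
    experimental/local use)."""
    if dscp & 0b1 == 0:
        return 1                              # low bit 0: standard pool
    return 2 if dscp & 0b11 == 0b11 else 3    # 11 -> pool 2, 01 -> pool 3

print(dscp_pool(0b101110), dscp_pool(0b000011), dscp_pool(0b000001))  # prints: 1 2 3
```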
Differentiated Services domains
The setup of QoS guarantees is not made for specific end-to-end connections
but for well-defined Differentiated Services domains. The IETF working group
defines a Differentiated Services domain as a contiguous portion of the Internet
over which a consistent set of Differentiated Services policies are administered in
a coordinated fashion. It can represent different administrative domains or
autonomous systems, different trust regions, and different network technologies,
such as cell or frame-based techniques, hosts, and routers. A DS domain
consists of boundary components that are used to connect different DS domains
to each other and interior components that are only used inside of the domains.
Chapter 8. Quality of service
Figure 8-15 shows the use of boundary and interior components for two DS domains.
B = Boundary Component
I = Interior Component
Figure 8-15 DS domain
A DS domain normally consists of one or more networks under the same
administration. This can be, for example, a corporate intranet or an Internet
service provider (ISP). The administrator of the DS domain is responsible for
ensuring that adequate resources are provisioned and reserved to support the
SLAs offered by the domain. Network administrators must use appropriate
measurement techniques to monitor if the network resources in a DS domain are
sufficient to satisfy all authorized QoS requests.
DS boundary nodes
All data packets that travel from one DS domain to another must pass a
boundary node, which can be a router, a host, or a firewall. A DS boundary node
that handles traffic leaving a DS domain is called an egress node and a boundary
node that handles traffic entering a DS domain is called an ingress node.
Normally, DS boundary nodes act both as an ingress node and an egress node,
depending on the traffic direction. The ingress node must ensure that packets
entering a domain receive the same QoS that they received in the domain they
previously traversed. A DS egress node performs conditioning functions on traffic that
is forwarded to a directly connected peering domain. The traffic conditioning is
done inside of a boundary node by a traffic conditioner. It classifies, marks, and
possibly conditions packets that enter or leave the DS domain. A traffic
conditioner consists of the following components:
A classifier selects packets based on their packet header
and forwards the packets that match the classifier rules
for further processing. The DS model specifies two types
of packet classifiers:
• Multi-field (MF) classifiers, which can classify on the
DS field as well as on any other IP header field, for
example, the IP address and the port number, like
an RSVP classifier
• Behavior Aggregate (BA) classifiers, which only
classify on the bits in the DS field
Traffic meters measure whether the forwarding of the packets
that are selected by the classifier corresponds to the
traffic profile that describes the QoS for the SLA between
client and service provider. A meter passes state
information to other conditioning functions to trigger a
particular action for each packet, which either does or
does not comply with the requested QoS requirements.
DS markers set the DS field of the incoming IP packets to
a particular bit pattern. The PHB is set in the first 6 bits of
the DS field so that the marked packets are forwarded
inside of the DS domain according to the SLA between
service provider and client.
Packet shapers and droppers cause conformance to
some configured traffic properties, for example, a token
bucket filter, as described in 8.2.1, “Service classes” on
page 292. They use different methods to bring the stream
into compliance with a traffic profile. Shapers delay some
or all of the packets. A shaper usually has a finite-size
buffer, and packets can be discarded if there is not
sufficient buffer space to hold the delayed packets.
Droppers discard some or all of the packets. This process
is known as policing the stream. A dropper can be
implemented as a special case of a shaper by setting the
shaper buffer size to zero packets.
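The interplay of meter, marker, and dropper can be sketched with a token-bucket policer. All class names and parameters below are illustrative, and the dropper is modeled (as described above) as a shaper with a zero-size buffer:

```python
import time

class TokenBucketPolicer:
    """Illustrative token-bucket meter: packets within the traffic profile
    are forwarded (and would keep their marked DSCP); out-of-profile
    packets are dropped, i.e., policed."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate = rate_bps / 8.0        # token fill rate in bytes/second
        self.capacity = burst_bytes       # bucket depth (burst size)
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def conform(self, packet_len: int) -> bool:
        # Refill tokens for the time elapsed since the last packet.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_len <= self.tokens:
            self.tokens -= packet_len
            return True                   # in profile: forward
        return False                      # out of profile: drop
```

A shaper would instead queue the non-conforming packet until enough tokens accumulate; the policer simply discards it.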
The traffic conditioner is mainly used in DS boundary components, but it can also
be implemented in an interior component. Figure 8-16 shows the cooperation of
the traffic conditioner components.
Figure 8-16 DS traffic conditioner
The traffic conditioner in a boundary component makes sure that packets that
transit the domain are correctly marked to select a PHB from one of the PHB
groups supported within the domain. This is necessary because different DS
domains can have different groups of PHBs, which means that the same entry in
the DS field can be interpreted differently in different domains.
For example, in the first domain a packet traverses, all routers have four queues
with different queue priorities (0-3). Packets with a PHB value of three are routed
with the highest priority. But in the next domain the packet travels through, all
routers have eight different queues and all packets with the PHB value of seven
are routed with the highest priority. The packet that was forwarded in the first
domain with high priority has only medium priority in the second domain. This
might violate the SLA contract between the client and the service provider.
Therefore, the traffic conditioner in the boundary router that connects the two
domains must assure that the PHB value is remarked from three to seven if the
packet travels from the first to the second domain.
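At the boundary router, such remarking can be as simple as a lookup table mapping the upstream domain's PHB values to the downstream domain's. The table below is hypothetical, built around the four-queue/eight-queue example above:

```python
# Hypothetical PHB remapping for packets crossing from a domain with four
# priority levels (0-3) into one with eight (0-7). Only the mapping 3 -> 7
# comes from the example in the text; the rest are illustrative.
REMARK_MAP = {0: 0, 1: 2, 2: 4, 3: 7}

def remark(phb: int) -> int:
    """Remark a PHB at the domain boundary; unknown values fall back to
    best-effort (0)."""
    return REMARK_MAP.get(phb, 0)
```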
Figure 8-17 shows an example of remarking data packets that travel through two
different domains.
Figure 8-17 Remarking data packets
If a data packet travels through multiple domains, the DS field can be remarked
at every boundary component to guarantee the QoS that was contracted in the
SLA. The SLA contains the details of the Traffic Conditioning Agreement (TCA)
that specifies classifier rules and temporal properties of a traffic stream. The TCA
contains information about how the metering, marking, discarding, and shaping
of packets must be done in the traffic conditioner to fulfill the SLA. The TCA
information must be available in all boundary components of a DS network to
guarantee that packets passing through different DS domains receive the same
service in each domain.
DS interior components
The interior components of a DS domain select the forwarding behavior for
packets based on their DS field. The interior component is usually a router that
contains a traffic prioritization algorithm. Because the value of the DS field
normally does not change inside of a DS domain, all interior routers must use the
same traffic forwarding policies to comply with the QoS agreement. Data packets
with different PHB values in the DS field receive different QoSs according to the
QoS definitions for this PHB. Because all interior routers in a domain use the
same policy functions for incoming traffic, the traffic conditioning inside of an
interior node is done only by a packet classifier. It selects packets based on their
PHB value or other IP header fields and forwards the packets to the queue
management and scheduling instance of the node.
Figure 8-18 shows the traffic conditioning in an interior node.
Figure 8-18 DS interior component
Traffic classifying and prioritized routing is done in every interior component of a
DS domain. After a data packet has crossed a domain, it reaches the boundary
router of the next domain and possibly gets remarked to cross this domain with
the requested QoS.
Source domains
The IETF DS working group defines a source domain as the domain that
contains one or more nodes that originate the traffic that receives a particular
service. Traffic sources and intermediate nodes within a source domain can
perform traffic classification and conditioning functions. The traffic that is sent
from a source domain can be marked by the traffic sources directly or by
intermediate nodes before leaving the source domain.
In this context, it is important to understand that the first PHB marking of the data
packets is not done by the sending application itself. Applications need not be
aware of the availability of Differentiated Services in a network. Therefore,
applications using DS networks do not need to be rewritten to support DS. This is
an important difference from Integrated Services, where applications must
support the RSVP protocol directly, which requires code changes.
The first PHB marking of packets that are sent from an application is done in the
source host or the first router the packet passes. The packets are identified with
their IP address and source port. For example, a client has an SLA with a service
provider that guarantees a higher priority for the packets sent by an audio
application. The audio application sends the data packets through a specific port
and can be recognized in multi-field classifiers. This classifier type recognizes
the IP address and port number of a packet and can distinguish the packets from
different applications. If the host contains a traffic conditioner with an MF
classifier, the IP packet can be marked with the appropriate PHB value and
consequently receives the QoSs that are requested by the client. If the host does
not contain a traffic conditioner, the initial marking of the packets is done by the
first router in the source domain that supports traffic conditioning. Figure 8-19
shows the initial marking of a packet inside of a host and a router.
Figure 8-19 Initial marking of data packets
In our example, the DS network has the policy that the packets from the audio
application should have higher priority than other packets. The sender host can
mark the DS field of all outgoing packets with a DS codepoint that indicates
higher priority. Alternatively, the first-hop router directly connected to the
sender's host can classify the traffic and mark the packets with the correct DS
codepoint. The source DS domain is responsible for ensuring that the
aggregated traffic toward its provider DS domain conforms to the SLA between
client and service provider. The boundary node of the source domain should also
monitor that the provided service conforms to the requested service and can
police, shape, or re-mark packets as necessary.
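The multi-field classification described in this example can be sketched as follows. The rule fields, addresses, and DSCP value are illustrative, not from the book:

```python
# Sketch of multi-field (MF) classification for initial marking in the
# source host or first-hop router: packets from the audio application
# (matched on source IP address and port) get a high-priority DSCP.
AUDIO_RULE = {"src_ip": "192.0.2.10", "src_port": 5004, "dscp": 0b101110}

def initial_mark(packet: dict) -> dict:
    """Mark matching packets with the rule's DSCP; everything else keeps
    the default (best-effort) codepoint."""
    if (packet.get("src_ip") == AUDIO_RULE["src_ip"]
            and packet.get("src_port") == AUDIO_RULE["src_port"]):
        packet["dscp"] = AUDIO_RULE["dscp"]
    else:
        packet.setdefault("dscp", 0)
    return packet
```

A behavior aggregate (BA) classifier, by contrast, would look only at the DS field already present in the packet.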
Integrated Services (Intserv) over Diffserv networks
The basic idea is to use both architectures to provide an end-to-end, quantitative
QoS, which will also allow scalability. This will be achieved by applying the
Intserv model end-to-end across a network containing one or more Diffserv
regions.
Intserv views the Diffserv regions as virtual links connecting Intserv-capable
routers or hosts running Intserv. Within the Diffserv regions, the routers are
implemented with specific PHB definitions to provide aggregate traffic control.
The total amount of traffic that is admitted into the Diffserv region may be limited
by a determined policy at the edges of the Diffserv network. The Intserv traffic
has to be adapted to the limits of the Diffserv region.
There are two possible approaches for connecting Intserv networks with Diffserv networks:
򐂰 Resources within the Diffserv network or region include RSVP-aware devices
that participate in RSVP signalling.
򐂰 Resources within the Diffserv region include no RSVP signalling.
In our sample (Figure 8-20), we describe the second configuration, because it is
expected to become the most common approach.
Figure 8-20 Integrated Services and RSVP using Differentiated Services
This configuration consists of two RSVP-capable intranets that are connected to
Diffserv regions. Within the intranets, there are hosts using RSVP to
communicate the quantitative QoS requirements with a partner host in another
intranet running QoS-aware applications.
The intranets contain edge routers R1 and R4, which are adjacent to the Diffserv
region interface. They are connected with the border routers R2 and R3, which
are located within the Diffserv region.
The RSVP signalling process is initiated by the service requesting application on
the host (for example, hostA). Traffic control on the host can mark the DSCP in
the transmitted packets and shape transmitted traffic to the requirements of the
Intserv services in use.
The RSVP end-to-end signaling messages are exchanged between the hosts in
the intranets. Thus, the RSVP/Intserv resource reservation is accomplished
outside the Diffserv region.
The edge routers act as admission control agents of the Diffserv network. They
process the signaling messages from the hosts in both intranets and apply
admission control. Admission control is based on the resources available within
the Diffserv region and on the company's policy defined for the intranets.
Because the border routers in our sample (R2 and R3) are not aware of RSVP,
they act as pure Diffserv routers. These routers control and submit packets
based on the specified DSCP and the agreement for the host's aggregate traffic.
The Diffserv network or region supports aggregate traffic control and is assumed
to be incapable of MF classification. Therefore, any RSVP messages will pass
transparently with negligible performance impact through a Diffserv network.
The next aspect to be considered is the mapping of the Intserv service type and
the set of quantitative parameters, known as flowspec. Because Diffserv
networks use PHB or a set of PHBs, a mapping of the Intserv flowspec has to be
defined accordingly. The mapping value is a bit combination in the DSCP.
However, this mapping has to be viewed under bandwidth management
considerations for the Diffserv network.
The DSCP value must be made known to all routers in the Diffserv network.
The question arises of how the DSCP will be propagated to these routers.
There are two choices:
򐂰 DSCPs can be marked at the entrance of the Diffserv region (at the boundary
routers). In this case, they can be also re-marked at the exit of the Diffserv
region (at the other boundary router).
򐂰 DSCP marking can occur in a host or in a router of the intranet. In this case,
the appropriate mapping needs to be communicated to the marking device.
This can be provided by RSVP.
The following sequence shows how an application obtains end-to-end QoS:
1. HostA, attached to an intranet, requests a service from hostB, attached to
another intranet. Both intranets are connected by a Diffserv network (see
Figure 8-20 on page 320).
2. HostA generates an RSVP PATH message, which describes the traffic offered
by the sending application.
3. The PATH message is sent over the intranet to router R1. Standard
RSVP/Intserv processing is done by the network devices within the intranet.
4. The PATH state is installed in router R1, and the message is forwarded to router R2
in the Diffserv network.
5. The PATH message is ignored in the Diffserv network at R2 and R3. It is sent
to R4 in the intranet and to hostB.
6. When the PATH message is received by hostB, an RSVP RESV message is
built, indicating the offered traffic of a specific Intserv service type.
7. The RESV message is sent to the Diffserv network.
8. The Diffserv network transparently transmits the message to router R1.
9. In R1, the RESV message triggers admission control processing. This means
requested resources in the initial RSVP/Intserv request are compared to the
resources available in the Diffserv network at the corresponding Diffserv
service level. The corresponding service level is determined by the Intserv to
Diffserv mapping function. The availability of resources is determined by the
capacity defined in the SLA.
10.If R1 approves the request, the RESV message is admitted and is allowed to
be sent to the sender, hostA. R1 updates its tables with reduced capacity
available at the admitted service level on this particular transmit interface.
11.If the RESV message is not rejected by any RSVP node in the intranet, it will
be received at hostA. The QoS process interprets the receipt of the message
as an indication that the specified message flow has been admitted for the
specified Intserv service type. It also learns the DSCP marking, which will be
used for subsequent packets to be sent for this flow.
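The admission check performed by R1 in step 9 can be sketched as a simple capacity comparison per Diffserv service level. The level names and capacities below are illustrative, not from the book:

```python
# Hypothetical admission control at edge router R1: a RESV request is
# admitted only if capacity remains at the mapped Diffserv service level
# (capacity per level as defined in the SLA).
SLA_CAPACITY_KBPS = {"EF": 1000, "AF": 4000}

def admit(used_kbps: dict, service_level: str, requested_kbps: int) -> bool:
    """Compare the request against the remaining SLA capacity; on success,
    book the bandwidth on this transmit interface."""
    available = SLA_CAPACITY_KBPS[service_level] - used_kbps.get(service_level, 0)
    if requested_kbps <= available:
        used_kbps[service_level] = used_kbps.get(service_level, 0) + requested_kbps
        return True
    return False
```

If the request is admitted, R1 forwards the RESV message toward hostA with its tables updated to show the reduced capacity (step 10).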
8.3.3 Configuration and administration of DS with LDAP
In a Differentiated Services network, the service level information must be
provided to all network elements to ensure correct administrative control of
bandwidth, delay, or dropping preferences for a given customer flow. All DS
boundary components must have the same policy information for the defined
service levels. This makes sure that the packets marked with the DS field receive
the same service in all DS domains. If only one domain in the DS network has
different policy information, it is possible that the data packets passing this
domain will not receive the service that was contracted in the SLA between client
and service provider.
Network administrators can define different service levels for different clients and
manually provide this information to all boundary components. This policy
information remains static in the network components until the next manual update.
But in dynamic network environments, it is necessary to enable flexible
definitions of class-based packet handling behaviors and class-based policy
control. Administrative policies can change in a running environment, so it is
necessary to store the policies in a directory-based repository. The policy
information from the directory can be distributed across multiple physical servers,
but the administration is done for a single entity by the network administrator.
The directory information must be propagated on all network elements, such as
hosts, proxies, and routers, that use the policy information for traffic conditioning
in the DS network. In today's heterogeneous environments, it is likely that
network devices and administrative tools are developed by different vendors.
Therefore, it is necessary to use a standardized format to store the administrative
policies in the directory server function and a standardized mechanism to provide
the directory information to the DS boundary components, which act as directory
clients. These functions are provided by the Lightweight Directory Access
Protocol (LDAP), which is a commonly used industry standard for directory
accessing. LDAP is a widely deployed and simple protocol for directory access.
Policy rules for different service levels are stored in directories as LDAP
schemas and can be downloaded to devices that implement the policies, such as
hosts, routers, policy servers, or proxies.
Figure 8-21 shows the cooperation of the DS network elements with the LDAP server.
Figure 8-21 Administration of DS components with LDAP
Using Differentiated Services with IPSec
The IPSec protocol (described in 22.4.3, “Encapsulating Security Payload (ESP)”
on page 817) does not use the DS field in an IP header for its cryptographic
calculations. Therefore, modification of the DS field by a network node has no
effect on IPSec's end-to-end security, because it cannot cause any IPSec
integrity check to fail. This makes it possible to use IPSec-secured packets in DS networks.
IPSec's tunnel mode provides security for the encapsulated IP header's DS field.
A tunnel mode IPSec packet contains an outer header that is supplied by the
tunnel start point and an encapsulated inner header that is supplied by the host
that originally sent the packet.
Processing of the DS field in the presence of IPSec tunnels then works as follows:
1. The node where the IPSec tunnel begins encapsulates the incoming IP
packets with an outer IP header and sets the DS field of the outer header
according to the SLA in the local DS domain.
2. The secured packet travels through the DS network, and intermediate nodes
modify the DS field in the outer IP header, as appropriate.
3. If a packet reaches the end of an IPSec tunnel, the outer IP header is stripped
off by the tunnel end node and the packet is forwarded using the information
contained in the inner (original) IP header.
4. If the DS domain of the original datagram is different from the DS domain
where the IPSec tunnel ends, the tunnel end node must modify the DS field of
the inner header to match the SLA in its domain. The tunnel end node then
effectively acts as a DS ingress node.
5. As the packet travels in the DS network on the other side of the IPSec tunnel,
intermediate nodes use the original IP header to modify the DS field.
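Steps 1 through 4 above can be sketched with two illustrative helpers. The dictionary structures are stand-ins for real headers, not an IPSec implementation:

```python
# Sketch of DS-field handling around an IPSec tunnel: the outer DS field
# follows the local domain's SLA; the protected inner DS field is only
# remarked at the tunnel end if the packet enters a different DS domain.

def tunnel_encap(inner_packet: dict, local_sla_dscp: int) -> dict:
    """Tunnel start point: wrap the packet and set the outer DS field."""
    return {"outer_dscp": local_sla_dscp, "inner": inner_packet}

def tunnel_decap(tunneled: dict, egress_domain_dscp=None) -> dict:
    """Tunnel end node: strip the outer header; if the original packet's
    domain differs from the local one, remark the inner DS field (the
    tunnel end node acts as a DS ingress node)."""
    packet = tunneled["inner"]
    if egress_domain_dscp is not None:
        packet["dscp"] = egress_domain_dscp
    return packet
```

Intermediate nodes only ever touch `outer_dscp`, which is why IPSec integrity checks on the inner packet are unaffected.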
8.4 RFCs relevant to this chapter
The following RFCs provide detailed information about the connection protocols
and architectures presented throughout this chapter:
򐂰 RFC 1349 – Type of Service in the Internet Protocol Suite (July 1992)
򐂰 RFC 1633 – Integrated Services in the Internet Architecture: An Overview
(June 1994)
򐂰 RFC 4495 – A Resource Reservation Protocol (RSVP) Extension for the
Reduction of Bandwidth of a Reservation Flow (May 2006)
򐂰 RFC 2206 – RSVP Management Information Base Using SMIv2
(September 1997)
򐂰 RFC 2207 – RSVP Extensions for IPSEC Data Flows (September 1997)
򐂰 RFC 2208 – Resource Reservation Protocol (RSVP) – Version 1 Applicability
Statement (September 1997)
򐂰 RFC 2209 – Resource Reservation Protocol (RSVP) – Version 1 Message
Processing Rules (September 1997)
򐂰 RFC 2210 – The Use of RSVP with IETF Integrated Services
(September 1997)
򐂰 RFC 2211 – Specification of the Controlled Load Network Element Service
(September 1997)
򐂰 RFC 2212 – Specification of Guaranteed Quality of Service
(September 1997)
򐂰 RFC 2474 – Definition of the Differentiated Services Field (DS Field) in the
IPv4 and IPv6 Headers (December 1998)
Chapter 9.
IP version 6
This chapter discusses the concepts and architecture of IP version 6. This
chapter includes the following topics:
򐂰 An overview of IPv6 new features
򐂰 An examination of the IPv6 packet format
򐂰 An explanation of additional IPv6 functions
򐂰 A description of associated protocols
򐂰 A review of IPv6 mobility applications
򐂰 A projection of new opportunities opened up by IPv6
򐂰 An explanation of the needs for next generation IPv6
򐂰 An exploration into IPv4-to-IPv6 transition paths
© Copyright IBM Corp. 1989-2006. All rights reserved.
9.1 IPv6 introduction
Internet Protocol version 6 (IPv6) is the replacement for version 4 (IPv4). This
section discusses the expanded address space capabilities of IPv6 and includes
a brief feature overview.
9.1.1 IP growth
The Internet is growing extremely rapidly. The latest Internet Domain Survey1,
conducted in January 2006, counted 395 million hosts (394,991,609 hosts, to be
exact), as shown in Figure 9-1.
Figure 9-1 Growth in the total number of IP hosts
The IPv4 addressing scheme, with a 32-bit address field, provides for over 4
billion possible addresses, so it might seem more than adequate to the task of
addressing all of the hosts on the Internet, since there appears to be room to
accommodate roughly 10 times as many Internet hosts. Unfortunately, this is not the
case for a number of reasons, including the following:
򐂰 An IP address is divided into a network portion and a local portion which are
administered separately. Although the address space within a network may
be very sparsely filled, allocating a portion of the address space (range of IP
addresses) to a particular administrative domain makes all addresses within
that range unavailable for allocation elsewhere.
Source: Internet Software Consortium
򐂰 The address space for networks is structured into Class A, B, and C networks
of differing sizes, and the space within each needs to be considered separately.
򐂰 The IP addressing model requires that unique network numbers be assigned
to all IP networks, whether or not they are actually connected to the Internet.
򐂰 It is anticipated that growth of TCP/IP usage into new areas outside the
traditional connected PC will shortly result in a rapid explosion of demand for
IP addresses. For example, widespread use of TCP/IP for interconnecting
hand-held devices, electronic point-of-sale terminals or for Web-enabled
television receivers (all devices that are now available) will enormously
increase the number of IP hosts.
These factors mean that the address space is much more constrained than our
simple analysis would indicate. This problem is called IP Address Exhaustion.
Methods of relieving this problem are already being employed, but eventually,
the present IP address space will be exhausted. The Internet Engineering Task
Force (IETF) set up a working group on Address Lifetime Expectations (ALE)
with the express purpose of providing estimates of when exhaustion of the IP address space will
become an intractable problem. Their final estimates (reported in the ALE
working group minutes for December 1994) were that the IP address space
would be exhausted at some point between 2005 and 2011. Since then, their
position may have changed somewhat, in that the use of Classless Inter Domain
Routing (CIDR) and the increased use of Dynamic Host Configuration Protocol
(DHCP) may have relieved pressure on the address space.
But on the other hand, current growth rates are probably exceeding that
expectation. Apart from address exhaustion, other restrictions in IPv4 also call
for the definition of a new IP protocol:
1. Even with the use of CIDR, routing tables, primarily in the IP backbone
routers, are growing too large to be manageable.
2. Traffic priority, or class of service, is vaguely defined, scarcely used, and not
at all enforced in IPv4, but highly desirable for modern real-time applications.
3. The number of mobile data applications and devices is growing quickly, and
IPv4 has difficulty managing forwarding addresses and realizing
visitor-location network authentication.
4. There is no direct security support in IPv4. Various open and proprietary
security solutions cause interoperability concerns. As the Internet becomes
the fabric of everyday life, security enhancements should be built into the
basic IP protocol.
In view of these issues, the IETF established an IPng (IP next generation)
working group to make recommendations for the IP Next Generation Protocol.
Eventually, the specification for Internet Protocol, Version 6 (IPv6) was produced
in RFC 2460.
9.1.2 IPv6 feature overview
IPv6 offers the following significant features:
򐂰 A dramatically larger address space, which is said to be sufficient for at least
the next 30 years
򐂰 Globally unique and hierarchical addressing, based on prefixes rather than
address classes, to keep routing tables small and backbone routing efficient
򐂰 A mechanism for the auto-configuration of network interfaces
򐂰 Support for encapsulation of itself and other protocols
򐂰 Class of service that distinguishes types of data
򐂰 Improved multicast routing support (in preference to broadcasting)
򐂰 Built-in authentication and encryption
򐂰 Transition methods to migrate from IPv4
򐂰 Compatibility methods to coexist and communicate with IPv4
Note: IPv6 uses the term packet rather than datagram. The meaning is the
same, although the formats are different.
IPv6 uses the term node for any system running IPv6, that is, a host or a
router. An IPv6 host is a node that does not forward IPv6 packets that are not
explicitly addressed to it. A router is a node that forwards IP packets not
addressed to it.
9.2 The IPv6 header format
The format of the IPv6 packet header has been simplified from its counterpart in
IPv4. The length of the IPv6 header increases to 40 bytes (from 20 bytes) and
contains two 16-byte addresses (source and destination), preceded by 8 bytes of
control information, as shown in Figure 9-2 on page 331. The IPv4 header has
two 4-byte addresses preceded by 12 bytes of control information and possibly
followed by option data. The reduction of the control information and the
elimination of options in the header for most IP packets are intended to optimize
the processing time per packet in a router. The infrequently used fields that have
been removed from the header are moved to optional extension headers when
they are required.
vers | traffic class | flow label
payload length | nxt hdr | hop limit
source address
destination address
data ...
Figure 9-2 IP header format
Version
4-bit Internet Protocol version number: 6.
Traffic class
8-bit traffic class value. See 9.2.3, “Traffic class” on
page 345.
Flow label
20-bit field. See 9.2.4, “Flow labels” on page 346.
Payload length
The length of the packet in bytes (excluding this header)
encoded as a 16-bit unsigned integer. If length is greater
than 64 KB, this field is 0 and an option header (Jumbo
Payload) gives the true length.
Next header
Indicates the type of header immediately following the
basic IP header. It can indicate an IP option header or an
upper layer protocol. The protocol numbers used are the
same as those used in IPv4. The next header field is also
used to indicate the presence of extension headers, which
provide the mechanism for appending optional information
to the IPv6 packet. The following values appear in IPv6
packets, in addition to those mentioned for IPv4:
41 IPv6 Header
45 Interdomain Routing Protocol
46 Resource Reservation Protocol
58 IPv6 ICMP Packet
The following values are all extension headers:
0 Hop-by-Hop Options Header
43 IPv6 Routing Header
44 IPv6 Fragment Header
50 Encapsulating Security Payload
51 IPv6 Authentication Header
59 No Next Header
60 Destination Options Header
We discuss different types of extension headers in 9.2.1,
“Extension headers” on page 333.
Hop limit
This is similar to the IPv4 TTL field but it is now measured
in hops and not seconds. It was changed for two reasons:
򐂰 IP normally forwards datagrams faster than one hop per second, and the
TTL field is always decremented on each hop, so, in practice, it is
measured in hops and not seconds.
򐂰 Many IP implementations do not expire outstanding datagrams on the basis
of elapsed time.
The packet is discarded after the hop limit is decremented
to zero.
Source address
A 128-bit address. We discuss IPv6 addresses in 9.2.2,
“IPv6 addressing” on page 339.
Destination address A 128-bit address. We discuss IPv6 addresses in 9.2.2,
“IPv6 addressing” on page 339.
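The fixed 40-byte layout described above can be packed with Python's struct module. This is a minimal sketch with illustrative field values, not a full IPv6 implementation:

```python
import socket
import struct

def build_ipv6_header(tclass: int, flow: int, payload_len: int,
                      next_hdr: int, hop_limit: int, src: str, dst: str) -> bytes:
    """Pack the fixed 40-byte IPv6 header: 4 bytes of version, traffic
    class, and flow label, then the 16-bit payload length, the next-header
    and hop-limit bytes, and the two 16-byte addresses."""
    ver_tc_fl = (6 << 28) | ((tclass & 0xFF) << 20) | (flow & 0xFFFFF)
    return (struct.pack("!IHBB", ver_tc_fl, payload_len, next_hdr, hop_limit)
            + socket.inet_pton(socket.AF_INET6, src)
            + socket.inet_pton(socket.AF_INET6, dst))

# Example: an empty packet (next header 59, "No Next Header").
header = build_ipv6_header(0, 0, 0, 59, 64, "2001:db8::1", "2001:db8::2")
```

The first nibble on the wire is always 6, and the header is always exactly 40 bytes, which is what allows routers to skip option parsing in the common case.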
A comparison of the IPv4 and IPv6 header formats shows that a number of IPv4
header fields have no direct equivalents in the IPv6 header:
򐂰 Type of service
Type of service issues in IPv6 are handled using the flow concept, described
in 9.2.4, “Flow labels” on page 346.
򐂰 Identification, fragmentation flags, and fragment offset
Fragmented packets have an extension header rather than fragmentation
information in the IPv6 header. This reduces the size of the basic IPv6
header, because higher-level protocols, particularly TCP, tend to avoid
fragmentation of datagrams (this reduces the IPv6 header processing costs
for the normal case). As noted later, IPv6 does not fragment packets en route
to their destinations, only at the source.
򐂰 Header checksum
Because transport protocols implement checksums, and because IPv6
includes an optional authentication header that can also be used to ensure
integrity, IPv6 does not provide checksum monitoring of IP packets.
Both TCP and UDP include a pseudo IP header in the checksums they use,
so in these cases, the IP header in IPv4 is being checked twice.
TCP and UDP, and any other protocols using the same checksum
mechanisms running over IPv6, will continue to use a pseudo IP header
although, obviously, the format of the pseudo IPv6 header will be different
from the pseudo IPv4 header. ICMP, IGMP, and any other protocols that do
not use a pseudo IP header over IPv4 use a pseudo IPv6 header in their checksums.
򐂰 Options
All optional values associated with IPv6 packets are contained in extension
headers, ensuring that the basic IP header is always the same size.
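To make the pseudo-header idea concrete, here is a small Python sketch that computes an RFC 1071-style Internet checksum over a hypothetical UDP datagram prefixed with an IPv6 pseudo-header (source and destination addresses, 32-bit upper-layer length, and Next Header value); the addresses and ports are invented for illustration:

```python
import struct

def internet_checksum(data: bytes) -> int:
    """One's-complement sum over 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                       # pad odd-length input
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                        # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def ipv6_pseudo_header(src: bytes, dst: bytes, length: int, next_hdr: int) -> bytes:
    """IPv6 pseudo-header: 16-byte source and destination addresses,
    32-bit upper-layer packet length, 3 zero bytes, Next Header value."""
    return src + dst + struct.pack("!I", length) + b"\x00\x00\x00" + bytes([next_hdr])

# Hypothetical UDP datagram (ports 5000 -> 53, no payload, checksum field zero).
src = bytes.fromhex("fe800000000000000000000000000001")
dst = bytes.fromhex("fe800000000000000000000000000002")
udp = struct.pack("!HHHH", 5000, 53, 8, 0)
csum = internet_checksum(ipv6_pseudo_header(src, dst, len(udp), 17) + udp)
```

A receiver repeating the sum over the datagram, with the checksum field filled in, obtains zero; that is how the check is verified.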
9.2.1 Extension headers
Every IPv6 packet starts with the basic header. In most cases, this will be the
only header necessary to deliver the packet. Sometimes, however, it is
necessary for additional information to be conveyed along with the packet to the
destination or to intermediate systems on route (information that would
previously have been carried in the Options field in an IPv4 datagram). Extension
headers are used for this purpose.
Extension headers are placed immediately after the IPv6 basic packet header
and are counted as part of the payload length. Each extension header (with the
exception of 59) has its own 8-bit Next Header field as the first byte of the header
that identifies the type of the following header. This structure allows IPv6 to chain
multiple extension headers together. Figure 9-3 on page 334 shows an example
packet with multiple extension headers.
Figure 9-3 IPv6 packet containing multiple extension headers. In the example, the basic header (traffic class, flow label, payload length, hop limit, source and destination addresses) carries Next Header = 0, chaining to a hop-by-hop options header (Next Header = 43), a routing header (Next Header = 44), a fragment header (Next Header = 51), and an authentication header (Next Header = 6), followed by the TCP header and data.
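The chaining rule can be sketched in a few lines of Python. The dictionary below stands in for parsing real header bytes and simply records each header's own Next Header value, using the values from Figure 9-3 (an illustration, not a packet parser):

```python
# Next Header values for the extension headers discussed in this chapter.
EXTENSION_HEADERS = {0: "hop-by-hop options", 43: "routing", 44: "fragment",
                     51: "authentication", 60: "destination options"}

def walk_extension_chain(first_next_header: int, next_fields: dict) -> list:
    """Follow the Next Header chain until a non-extension value appears,
    which marks the start of the higher-level protocol data."""
    chain, current = [], first_next_header
    while current in EXTENSION_HEADERS:
        chain.append(EXTENSION_HEADERS[current])
        current = next_fields[current]
    chain.append("upper-layer protocol %d" % current)
    return chain

# The packet of Figure 9-3: basic header -> 0 -> 43 -> 44 -> 51 -> 6 (TCP).
print(walk_extension_chain(0, {0: 43, 43: 44, 44: 51, 51: 6}))
```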
The length of each header varies, depending on type, but is always a multiple of
8 bytes. There are a limited number of IPv6 extension headers, any one of which
can be present only once in the IPv6 packet (with the exception of the
Destination Options Header, 60, which can appear more than once). IPv6 nodes
that originate packets are required to place extension headers in a specific order
(numeric order, with the exception of 60), although IPv6 nodes that receive
packets are not required to verify that this is the case. The order is important for
efficient processing at intermediate routers. Routers will generally only be
interested in the hop-by-hop options and the routing header. After the router has
read this far, it does not need to read further in the packet and can immediately
forward. When the Next Header field contains a value other than one for an
extension header, this indicates the end of the IPv6 headers and the start of the
higher-level protocol data.
IPv6 allows for encapsulation of IPv6 within IPv6 (“tunneling”). This is done with
a Next Header value of 41 (IPv6). The encapsulated IPv6 packet can have its
own extension headers. Because the size of a packet is calculated by the
originating node to match the path MTU, IPv6 routers should not add extension
headers to a packet, but instead encapsulate the received packet within an IPv6
packet of their own making (which can be fragmented if necessary).
With the exception of the hop-by-hop header (which must immediately follow the
IP header if present), extension headers are not processed by any router on the
packet's path except the final one.
Hop-by-hop header
A hop-by-hop header contains options that must be examined by every node the
packet traverses, as well as the destination node. It must immediately follow the
IPv6 header (if present) and is identified by the special value 0 in the Next
Header field of the IPv6 basic header. (This value is not actually a protocol
number but a special case to identify this unique type of extension header).
Hop-by-hop headers contain variable length options of the format shown in
Figure 9-4 (commonly known as the Type-Length-Value (TLV) format).
Figure 9-4 IPv6 Type-Length-Value (TLV) option format
Option type     The type of the option. The option types all have a
                common format (Figure 9-5).

Figure 9-5 IPv6 Type-Length-Value (TLV) option type format

Action          A 2-bit number, indicating how an IPv6 node that does
                not recognize the option should treat it:
                0    Skip the option and continue.
                1    Discard the packet quietly.
                2    Discard the packet and inform the sender
                     with an ICMP Unrecognized Type message.
                3    Discard the packet and inform the sender
                     with an ICMP Unrecognized Type message
                     unless the destination address is a multicast
                     address.
C               If set, this bit indicates that the value of the option
                might change en route. If this bit is set, the entire
                Option Data field is excluded from any integrity
                calculations performed on the packet.
Option          The remaining bits define the option, for example,
                Jumbo Payload Length.
Option length   The length of the option value field in bytes.
Option value    The value of the option. This is dependent on the type.
Hop-by-hop header option types
You might have noticed that each extension header is an integer multiple of 8
bytes long in order to retain 8-byte alignment for subsequent headers. This is
done not purely for “neatness” but because processing is much more efficient if
multibyte values are positioned on natural boundaries in memory (and today's
processors have natural word sizes of 32 or 64 bits).
In the same way, individual options are also aligned so that multibyte values are
positioned on their natural boundaries. In many cases, this will result in the
option headers being longer than otherwise necessary, but still allow nodes to
process packets more quickly. To allow this alignment, two padding options are
used in hop-by-hop headers:
Pad1    A X'00' byte used for padding a single byte. For longer padding
        sequences, use the PadN option.
PadN    An option in the TLV format (described earlier). The length byte gives
        the number of bytes of padding after the minimum two that are
        required.
The third option type in a hop-by-hop header is the Jumbo Payload Length. This
option is used to indicate a packet with a payload size in excess of 65,535 bytes
(which is the maximum size that can be specified by the 16-bit Payload Length
field in the IPv6 basic header). When this option is used, the Payload Length in
the basic header must be set to zero. This option carries the total packet size,
less the 40 byte basic header. See Figure 9-6 for details.
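As an illustration of the TLV layout, Pad1, PadN, and the option-type bit fields described above, the following sketch walks the options area of a hop-by-hop header (the function is hypothetical, not a full parser):

```python
def parse_hop_by_hop_options(data: bytes) -> list:
    """Parse a hop-by-hop options area into
    (action_bits, change_bit, option_number, value) tuples."""
    options, i = [], 0
    while i < len(data):
        opt_type = data[i]
        if opt_type == 0x00:              # Pad1: a single X'00' byte, no length/value
            i += 1
            continue
        length = data[i + 1]
        value = data[i + 2:i + 2 + length]
        if opt_type != 0x01:              # PadN (X'01') is padding; skip it too
            action = opt_type >> 6        # top 2 bits: unrecognized-option action
            change = (opt_type >> 5) & 1  # C bit: value may change en route
            options.append((action, change, opt_type & 0x1F, value))
        i += 2 + length
    return options

# A Jumbo Payload option (type X'C2', length 4) preceded by a Pad1 byte;
# the 4-byte value 0x00010000 = 65,536 exceeds the 16-bit payload limit.
opts = parse_hop_by_hop_options(bytes([0x00, 0xC2, 0x04, 0x00, 0x01, 0x00, 0x00]))
```

For the Jumbo Payload option (type X'C2' = binary 110 00010), the sketch reports action bits 3 (discard and inform unless multicast), change bit 0, and option number 2.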
Figure 9-6 Jumbo Payload Length option (option type X'C2', followed by the 32-bit Jumbo Payload Length value)
Routing header
The path that a packet takes through the network is normally determined by the
network itself. Sometimes, however, the source wants more control over the
route taken by the packet. It might want, for example, for certain data to take a
slower but more secure route than would normally be taken. The routing header
(see Figure 9-7 on page 338) allows a path through the network to be
predefined. The routing header is identified by the value 43 in the preceding Next
Header field. It has its Next Header field as the first byte and a single byte routing
type as the second byte. The only type defined initially is type 0, strict/loose
source routing, which operates in a similar way to source routing in IPv4 (see “IP
datagram routing options” on page 105).
Figure 9-7 IPv6 routing header (Next Header, header length, routing type, and segments left fields, followed by the list of addresses)
Next hdr         The type of header after this one.
Hdr length       Length of this routing header, not including the first 8
                 bytes.
Routing type     Type of routing header. Currently, this can only have the
                 value 0, meaning strict/loose source routing.
Segments left    Number of route segments remaining, that is, number of
                 explicitly listed intermediate nodes still to be visited before
                 reaching the final destination.
Address 1..n
A series of 16-byte IPv6 addresses that make up the
source route.
The first hop on the required path of the packet is indicated by the destination
address in the basic header of the packet. When the packet arrives at this
address, the router swaps the next address from the routing extension header
with the destination address in the basic header. The router also decrements the
segments left field by one, and then forwards the packet.
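The swap-and-decrement behavior can be sketched as follows; the function and the list-of-strings route representation are invented for illustration:

```python
def process_routing_header(destination, addresses, segments_left):
    """Sketch of type-0 routing header processing at the node named in
    the basic header's destination: swap in the next listed address and
    decrement Segments Left."""
    if segments_left == 0:
        return destination, addresses, 0      # this node is the final destination
    index = len(addresses) - segments_left    # next address still to be visited
    addresses = addresses[:]                  # copy; real routers rewrite in place
    addresses[index], destination = destination, addresses[index]
    return destination, addresses, segments_left - 1

# Source sends via router R1, with route (R2, FINAL) and Segments Left = 2.
hop1 = process_routing_header("R1", ["R2", "FINAL"], 2)
hop2 = process_routing_header(*hop1)
```

After the first hop the basic-header destination is R2 and R1 has been written back into the address list; after the second, the destination is the final node with no segments left.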
Fragment header
The source node determines the maximum transmission unit or MTU for a path
before sending a packet. If the packet to be sent is larger than the MTU, the
packet is divided into pieces, each of which is a multiple of 8 bytes and carries a
fragment header. We provide details about the fragmentation header in “IPv6
packet fragmentation” on page 351.
Authentication header
The authentication header is used to ensure that a received packet has not been
altered in transit and that it really came from the claimed sender. The
authentication header is identified by the value 51 in the preceding Next Header
field. For the format of the authentication header and further details about
authentication, refer to 9.2.5, “IPv6 security” on page 347.
Encapsulating Security Payload
The Encapsulated Security Payload (ESP) is a special extension header, in that
it can appear anywhere in a packet between the basic header and the upper
layer protocol. All data following the ESP header is encrypted. For further details,
see 9.2.5, “IPv6 security” on page 347.
Destination options header
This has the same format as the hop-by-hop header, but it is only examined by
the destination node or nodes. Normally, the destination options are
intended only for the final destination, and the destination options header will be
immediately before the upper-layer header. However, destination options can
also be intended for intermediate nodes, in which case, they must precede a
routing header. A single packet can, therefore, include two destination options
headers. Currently, only the Pad1 and PadN types of options are specified for
this header (see “Hop-by-hop header” on page 335). The value for the preceding
Next Header field is 60.
9.2.2 IPv6 addressing
The IPv6 address model is specified in RFC 4291 – IP Version 6 Addressing
Architecture. IPv6 uses a 128-bit address instead of the 32-bit address of IPv4.
That theoretically allows for as many as
340,282,366,920,938,463,463,374,607,431,768,211,456 addresses. Even when
used with the same efficiency as today's IPv4 address space, that still allows for
50,000 addresses per square meter of land on Earth.
The IPv6 address provides flexibility and scalability:
򐂰 It allows multilevel subnetting and allocation from a global backbone to an
individual subnet within an organization.
򐂰 It improves multicast scalability and efficiency through scope constraints.
򐂰 It adds a new address for server node clusters, where one server can
respond to a request to a group of nodes.
򐂰 The large IPv6 address space is organized into a hierarchical structure to
reduce the size of backbone routing tables.
IPv6 addresses are represented in the form of eight hexadecimal numbers
separated by colons, for example:

   FE80:0000:0000:0000:0001:0800:23E7:F5DB

To shorten the notation of addresses, leading zeroes in any of the groups can be
omitted, for example:

   FE80:0:0:0:1:800:23E7:F5DB

Finally, a group of all zeroes, or consecutive groups of all zeroes, can be
substituted by a double colon, for example:

   FE80::1:800:23E7:F5DB
Note: The double colon shortcut can be used only once in the notation of
an IPv6 address. If there are more groups of all zeroes that are not
consecutive, only one can be substituted by the double colon; the others
have to be noted as 0.
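Python's standard ipaddress module applies exactly these shortening rules, which is convenient for checking a notation by hand:

```python
import ipaddress

full = ipaddress.IPv6Address("FE80:0000:0000:0000:0001:0800:23E7:F5DB")
print(full.exploded)     # all eight groups with leading zeroes
print(full.compressed)   # leading zeroes dropped, zero run replaced by ::

# Only one run of zero groups may be replaced; the longest run wins.
print(ipaddress.IPv6Address("1080:0:0:7:0:0:0:1").compressed)
```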
The IPv6 address space is organized using format prefixes, similar to telephone
country and area codes, that logically divide it in the form of a tree so that a route
from one network to another can easily be found. The following prefixes have
been assigned so far (Table 9-1).
Table 9-1 IPv6: Format prefix allocation

Prefix (bin)    Start of range (hex)  Mask length  Fraction of space  Allocation
0000 0000       0::                   /8           1/256              Reserved (includes the loopback and unspecified addresses)
0000 001        200::                 /7           1/128              Reserved for NSAP allocation
0000 010        400::                 /7           1/128              Reserved for IPX allocation
001             2000::                /3           1/8                Global unicast
1111 1110 10    FE80::                /10          1/1024             Link-local unicast
1111 1110 11    FEC0::                /10          1/1024             Site-local unicast
1111 1111       FF00::                /8           1/256              Multicast
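A sketch of classifying an address against these format prefixes using the standard ipaddress module (the function name is invented; the reserved ranges fall into the catch-all):

```python
import ipaddress

def format_prefix(addr: str) -> str:
    """Classify an IPv6 address by the format prefixes of Table 9-1."""
    a = ipaddress.IPv6Address(addr)
    prefixes = [("FF00::/8", "multicast"),
                ("FE80::/10", "link-local unicast"),
                ("FEC0::/10", "site-local unicast"),
                ("2000::/3", "global unicast")]
    for prefix, name in prefixes:
        if a in ipaddress.IPv6Network(prefix):
            return name
    return "reserved/other"

print(format_prefix("2001:db8::1"))   # global unicast
print(format_prefix("FF02::1"))       # multicast
```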
In the following sections, we describe the types of addresses that IPv6 defines.
Unicast address
A unicast address is an identifier assigned to a single interface. Packets sent to
that address will only be delivered to that interface. Special purpose unicast
addresses are defined as follows:
򐂰 Loopback address (::1): This address is assigned to a virtual interface over
which a host can send packets only to itself. It is equivalent to the IPv4
loopback address
򐂰 Unspecified address (::): This address is used as a source address by hosts
while performing autoconfiguration. It is equivalent to the IPv4 unspecified
򐂰 IPv4-compatible address (::<IPv4_address>): Addresses of this kind are used
when IPv6 traffic needs to be tunneled across existing IPv4 networks. The
endpoint of such tunnels can be either hosts (automatic tunneling) or routers
(configured tunneling). IPv4-compatible addresses are formed by placing 96
bits of zero in front of a valid 32-bit IPv4 address. For example, the address (hex 01020304) becomes ::0102:0304.
򐂰 IPv4-mapped address (::FFFF:<IPv4_address>): Addresses of this kind are
used when an IPv6 host needs to communicate with an IPv4 host. This
requires a dual stack host or router for header translations. For example, if an
IPv6 node wants to send data to a host with an IPv4 address of, it
uses a destination address of ::FFFF:0102:0304.
򐂰 Link-local address: Addresses of this kind can be used only on the physical
network to which a host's interface is attached.
򐂰 Site-local address: Addresses of this kind cannot be routed into the Internet.
They are the equivalent of IPv4 networks for private use (,,
and
Global unicast address format
IPv6 unicast addresses are aggregatable with prefixes of arbitrary bit-length,
similar to IPv4 addresses under Classless Inter-Domain Routing.
The latest global unicast address format, as specified in RFC 3587 – IPv6
Global Unicast Address Format and RFC 4291 – IP Version 6 Addressing Architecture, is
expected to become the predominant format used for IPv6 nodes connected to
the Internet.
Note: This note is intended for readers who worked on the previous unicast
format. For new readers, you can skip this special note.
The historical IPv6 unicast address used a two-level allocation scheme which
has been replaced by a coordinated allocation policy defined by the Regional
Internet Registries (RIRs). There are two reasons for this major change:
򐂰 Part of the motivation for obsoleting the old TLA/NLA structure is technical;
for instance, there is concern that TLA/NLA is not the technically best
approach at this stage of the deployment of IPv6.
򐂰 Another part of the reason for new allocation of IPv6 addresses is related
to policy and to the stewardship of the IP address space and routing table
size, which the RIRs have been managing for IPv4.
The Site-Level Aggregator (SLA) field in the original unicast address
structure remains in function, but with a different name, “subnet ID,”
which we describe later in the unicast address format.
Figure 9-8 shows the general format for IPv6 global unicast addresses.
Figure 9-8 Global unicast address format: an n-bit Global Routing Prefix, an m-bit Subnet ID, and a (128-n-m)-bit Interface ID
Global Routing Prefix
A value assigned to a site for a cluster of subnets/links.
The global routing prefix is designed to be structured
hierarchically by the RIRs and ISPs.
Subnet ID
An identifier of a subnet within the site. The subnet field is
designed to be structured hierarchically by site
administrators.
Interface ID
Interface identifiers in IPv6 unicast addresses are used to
identify interfaces on a link. They are required to be
unique within a subnet prefix. Do not assign the same
interface identifier to different nodes on a link. They can
also be unique over a broader scope. In some cases, an
interface's identifier will be derived directly from that
interface's link layer address. The same interface
identifier can be used on multiple interfaces on a single
node as long as they are attached to different subnets.
All unicast addresses, except those that start with binary value 000, have
interface IDs that are 64 bits long and constructed in Modified EUI-64 format.
Multicast address
A multicast address is an identifier assigned to a set of interfaces on multiple
hosts. Packets sent to that address will be delivered to all interfaces
corresponding to that address. (See 3.3, “Internet Group Management Protocol
(IGMP)” on page 119 for more information about IP multicasting.) There are no
broadcast addresses in IPv6, their function being superseded by multicast
addresses. Figure 9-9 shows the format of an IPv6 multicast address.
Group ID
Flags Scope
Figure 9-9 IPv6 multicast address format
Format prefix   1111 1111.
Flags           Set of four flag bits. Only the low-order bit currently has
                any meaning, as follows:
                0    Permanent address assigned by a numbering
                     authority.
                1    Transient address. Addresses of this kind can
                     be established by applications as required.
                     When the application ends, the address will be
                     released by the application and can be reused.
Scope           4-bit value indicating the scope of the multicast. Possible
                values are:
                1    Confined to interfaces on the local node
                     (node-local).
                2    Confined to nodes on the local link (link-local).
                5    Confined to the local site (site-local).
                8    Confined to the organization.
                E    Global scope.
Group ID        Identifies the multicast group.
For example, if the NTP servers group is assigned a permanent multicast
address, with a group ID of X'101', then:
򐂰 FF02::101 means all NTP servers on the same link as the sender
򐂰 FF05::101 means all NTP servers on the same site as the sender
Certain special purpose multicast addresses are predefined as follows:
FF01::1     All interfaces, node-local. Defines all interfaces on the host
            itself.
FF02::1     All nodes, link-local. Defines all systems on the local link.
FF01::2     All routers, node-local. Defines all routers local to the host
            itself.
FF02::2     All routers, link-local. Defines all routers on the same link
            as the host.
FF05::2     All routers, site-local. Defines all routers on the same site
            as the host.
FF02::B     Mobile agents, link-local.
FF02::1:2   All DHCP agents, link-local.
FF05::1:3   All DHCP servers, site-local.
For a more complete listing of reserved multicast addresses, see the IANA
document IPv6 Multicast Addresses. That document also
defines a special multicast address known as the solicited node address, which
has the format FF02::1:FFxx:xxxx, where xx:xxxx is taken from the last 24 bits of
a node's unicast address. For example, the node with the IPv6 address
4025::01:800:100F:7B5B belongs to the multicast group FF02::1:FF0F:7B5B.
The solicited node address is used by ICMP for neighbor discovery and to detect
duplicate addresses. See 9.3, “Internet Control Message Protocol Version 6
(ICMPv6)” on page 352 for further details.
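The solicited node derivation is mechanical, as the following sketch with the standard ipaddress module shows (the helper function is invented), reusing the example address above:

```python
import ipaddress

def solicited_node(unicast: str) -> ipaddress.IPv6Address:
    """Build the solicited-node multicast address FF02::1:FFxx:xxxx
    from the last 24 bits of a unicast address."""
    low24 = int(ipaddress.IPv6Address(unicast)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)

print(solicited_node("4025::01:800:100F:7B5B"))   # ff02::1:ff0f:7b5b
```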
Anycast address
An anycast address is a special type of unicast address that is assigned to
interfaces on multiple hosts. Packets sent to such an address will be delivered to
the nearest interface with that address. Routers determine the nearest interface
based upon their definition of distance, for example, hops in case of RIP or link
state in case of OSPF.
Anycast addresses use the same format as unicast addresses and are
indistinguishable from them. However, a node that has been assigned an
anycast address must be configured to be aware of this fact.
RFC 4291 currently specifies the following restrictions on anycast addresses:
򐂰 An anycast address must not be used as the source address of a packet.
򐂰 Any anycast address can only be assigned to a router.
A special anycast address, the subnet-router address, is predefined. This
address consists of the subnet prefix for a particular subnet followed by trailing
zeroes. This address can be used when a node needs to contact a router on a
particular subnet and it does not matter which router is reached (for example,
when a mobile node needs to communicate with one of the mobile agents on its
“home” subnet).
9.2.3 Traffic class
The 8-bit traffic class field allows applications to specify a certain priority for the
traffic they generate, thus introducing the concept of class of service. This
enables the prioritization of packets, as in Differentiated Services. For a
comparison of how priority traffic can be handled in an IPv4 network, see 8.1,
“Why QoS?” on page 288.
The structure of the traffic class field is illustrated in Figure 9-10 on page 346. It contains the following fields:
Differentiated Services Code Point (6 bits)
It provides various code sets to mark the per-hop behavior for a
packet belonging to a service class.
Explicit Congestion Notification (2 bits)
It allows routers to set congestion indications instead of simply
dropping the packets. This avoids delays in retransmissions, while
allowing active queuing management.
Figure 9-10 Traffic class field: bits 0 to 5 carry the Differentiated Services Code Point; bits 6 and 7 carry the Explicit Congestion Notification
9.2.4 Flow labels
IPv6 introduces the concept of a flow, which is a series of related packets from a
source to a destination that requires a particular type of handling by the
intervening routers, for example, real-time service. The nature of that handling
can either be conveyed by options attached to the datagrams (that is, by using
the IPv6 hop-by-hop options header) or by a separate protocol (such as the
Resource Reservation Protocol (RSVP)). Refer to “RSVP operation” on page 297.
All packets belonging to the same flow must be sent with the same source
address, destination address, and flow label. The handling requirement for a
particular flow label is known as the state information; this is cached at the
router. When packets with a known flow label arrive at the router, the router can
efficiently decide how to route and forward the packets without having to
examine the rest of the header for each packet.
The maximum lifetime of any flow-handling state established along a flow's path
must be specified as part of the description of the state-establishment
mechanism, for example, the resource reservation protocol or the flow-setup
hop-by-hop option. A source must not reuse a flow label for a new flow within the
maximum lifetime of any flow-handling state that might have been established for
the prior use of that flow label.
There can be multiple active flows between a source and a destination, as well
as traffic that is not associated with any flow. Each flow is distinctly labelled by
the 20-bit flow label field in the IPv6 packet. A flow is uniquely identified by the
combination of a source address and a non-zero flow label. Packets that do not
belong to a flow carry a flow label of zero.
A flow label is assigned to a flow by the flow's source node. New flow labels must
be chosen (pseudo-)randomly and uniformly from the range 1 to FFFFF hex. The
purpose of the random allocation is to make any set of bits within the Flow Label
field suitable for use as a hash key by routers for looking up the state associated
with the flow.
See RFC 3697 for further details about the use of the flow label.
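A source-side sketch of these allocation rules (20-bit labels chosen randomly and uniformly from 1 to X'FFFFF', and not reused while flow state may persist); the class and its interface are invented:

```python
import secrets

class FlowLabelAllocator:
    """Sketch: pick 20-bit flow labels (pseudo-)randomly and uniformly
    from 1..0xFFFFF, never reusing one while its state may be alive."""
    def __init__(self):
        self.in_use = set()

    def allocate(self) -> int:
        while True:
            label = secrets.randbelow(0xFFFFF) + 1   # uniform over 1..0xFFFFF
            if label not in self.in_use:
                self.in_use.add(label)
                return label

    def release(self, label: int) -> None:
        """Call only after the maximum lifetime of any flow-handling
        state for this label has expired."""
        self.in_use.discard(label)

alloc = FlowLabelAllocator()
label = alloc.allocate()
```

Random allocation makes any subset of the label bits usable as a hash key when a router looks up cached flow state.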
9.2.5 IPv6 security
There are two optional headers defined for security purposes:
򐂰 Authentication Header (AH)
򐂰 Encapsulated Security Payload (ESP)
AH and ESP in IPv6 support authentication, data integrity, and optionally
confidentiality. AH conveys the authentication information in an IP package,
while ESP carries the encrypted data of the IP package.
Either or both can be implemented alone or combined in order to achieve
different levels of user security requirements. Note that they can also be
combined with other optional headers to provision security features. For example,
a routing header can be used to list the intermediate secure nodes for a packet to
visit on the way, thus allowing the packet to travel only through secure routers.
IPv6 requires support for IPSec as a mandatory standard. This mandate
provides a standards-based solution for network security needs and promotes
interoperability.
Authentication header
The authentication header is used to ensure that a received packet has not been
altered in transit and that it really came from the claimed sender (Figure 9-11).
The authentication header is identified by the value 51 in the preceding Next
Header field. The format of the authentication header and further details are
specified in RFC 4302.
Figure 9-11 IPv6 security authentication header: Security Parameters Index (SPI), Sequence Number (SN) field, and Integrity Check Value (ICV)
Security Parameters Index (SPI)
The SPI is an arbitrary 32-bit value that is used by a
receiver to identify the Security Association (SA) to which
an incoming packet is bound.
For a unicast SA, the SPI can be used by itself to specify
an SA, or it can be used in conjunction with the IPSec
protocol type (in this case AH). The SPI field is mandatory.
Traffic to unicast SAs described earlier must be supported
by all AH implementations.
If an IPSec implementation supports multicast, it must
support multicast SAs using a special de-multiplexing
algorithm.
Sequence Number
This unsigned 32-bit field contains a counter value that
increases by one for each packet sent, that is, a per-SA
packet sequence number.
For a unicast SA or a single-sender multicast SA, the
sender must increment this field for every transmitted
packet. Sharing an SA among multiple senders is
permitted, though generally not recommended.
The field is mandatory and must always be present even
if the receiver does not elect to enable the anti-replay
service for a specific SA. Processing of the Sequence
Number field is at the discretion of the receiver, but all AH
implementations must be capable of performing the
processing. Thus, the sender must always transmit this
field, but the receiver need not act upon it.
The sender's counter and the receiver's counter are
initialized to 0 when an SA is established. The first packet
sent using a given SA will have a sequence number of 1;
if anti-replay is enabled (the default), the transmitted
sequence number must never be allowed to cycle.
Therefore, the sender's counter and the receiver's counter
must be reset (by establishing a new SA and thus a new
key) prior to the transmission of the 2^32nd packet on an SA.
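The never-cycle rule can be sketched as a per-SA counter object that forces a rekey instead of wrapping (illustrative only; real implementations track this inside the SA database):

```python
class AhSequenceNumber:
    """Per-SA 32-bit sequence counter: starts at 0, so the first packet
    carries 1, and it must never wrap while anti-replay is enabled."""
    MAX = 2**32 - 1

    def __init__(self):
        self.counter = 0

    def next(self) -> int:
        if self.counter >= self.MAX:
            # A new SA (and a new key) must be established instead of cycling.
            raise OverflowError("sequence number would cycle; rekey the SA")
        self.counter += 1
        return self.counter

sn = AhSequenceNumber()
first = sn.next()   # 1
```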
Extended (64-bit) Sequence Number (ESN)
To support high-speed IPSec implementations, a new
option for sequence numbers should be offered, as an
extension to the current, 32-bit sequence number field.
Use of an Extended Sequence Number (ESN) must be
negotiated by an SA management protocol. The ESN
feature is applicable to multicast as well as unicast SAs.
Integrity Check Value (ICV)
This is a variable-length field that contains the Integrity
Check Value (ICV) for this packet. The field must be an
integral multiple of 32 bits (IPv4) or 64 bits (IPv6) in
length, padded if necessary. All implementations must
support such padding and must insert only enough
padding to satisfy the IPv4/IPv6 alignment requirements.
Encapsulating Security Payload
The Encapsulated Security Payload (ESP) is defined in RFC 4303. All data
following the ESP header is encrypted. Figure 9-12 illustrates the ESP structure
with the additional field explained after the figure.
The packet begins with the Security Parameters Index (SPI) and Sequence
Number (SN). Following these fields is the Payload Data, which has a
substructure that depends on the choice of encryption algorithm and mode and
on the use of TFC padding. Following the Payload Data are Padding and Pad
Length fields and the Next Header field. The optional Integrity Check Value (ICV)
field completes the packet.
Figure 9-12 IPv6 ESP: Security Parameters Index (SPI), Sequence Number field, Payload Data, Padding, Pad Length, Next Header, and Integrity Check Value (ICV)
Payload Data
Payload Data is a variable-length field containing data
(from the original IP packet). It is a mandatory field and is
an integral number of bytes in length.
If the algorithm used to encrypt the payload requires cryptographic
synchronization data, for example, an Initialization Vector (IV), this data is carried
in the Payload field.
Any encryption algorithm that requires explicit, per-packet synchronization
data must indicate the length, any structure for such data, and the location of this
data. If such synchronization data is implicit, the algorithm for deriving the data must
be part of the algorithm definition.
Note that the beginning of the next layer protocol header must be aligned relative
to the beginning of the ESP header. For IPv6, the alignment is a multiple of 8
bytes.
9.2.6 Packet sizes
All IPv6 nodes are expected to dynamically determine the maximum
transmission unit (MTU) supported by all links along a path (as described in RFC
1981 – Path MTU Discovery for IP version 6) and source nodes will only send packets that do
not exceed the path MTU. IPv6 routers will, therefore, not have to fragment
packets in the middle of multihop routes and allow much more efficient use of
paths that traverse diverse physical transmission media. IPv6 requires that every
link supports an MTU of 1280 bytes or greater.
IPv6 packet fragmentation
The source node determines the maximum transmission unit or MTU for a path
before sending a packet. If the packet to be sent is larger than the MTU, the
packet is divided into pieces, each of which is a multiple of 8 bytes and carries a
fragment header. The fragment header is identified by the value 44 in the
preceding Next Header field and has the following format (Figure 9-13).
Figure 9-13 IPv6 fragment header: Next Header, a reserved byte, the fragment offset, the M (more) flag, and the fragment identification
Nxt hdr                  The type of next header after this one.
Reserved                 8-bit reserved field; initialized to zero for transmission and
                         ignored on reception.
Fragment offset          A 13-bit unsigned integer giving the offset, in 8-byte units,
                         of the following data relative to the start of the original
                         data before it was fragmented.
Res                      2-bit reserved field; initialized to zero for transmission and
                         ignored on reception.
M                        More flag. If set, it indicates that this is not the last
                         fragment.
Fragment identification  This is an unambiguous identifier used to identify
                         fragments of the same datagram. This is very similar to
                         the IPv4 Identifier field, but it is twice as wide.
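A sketch of the source-side splitting rule, where every fragment except the last carries a multiple of 8 bytes and offsets are expressed in 8-byte units (function and parameter names are invented):

```python
def fragment_payload(payload: bytes, mtu_payload: int) -> list:
    """Split a payload for IPv6 fragment headers, returning
    (offset_in_8_byte_units, more_flag, data) per fragment."""
    chunk = (mtu_payload // 8) * 8        # round down to an 8-byte multiple
    fragments, offset = [], 0
    while offset < len(payload):
        data = payload[offset:offset + chunk]
        more = offset + chunk < len(payload)  # M flag: not the last fragment
        fragments.append((offset // 8, more, data))
        offset += chunk
    return fragments

# 100 bytes of payload with 30 bytes of room -> 24-byte fragments.
frags = fragment_payload(bytes(100), 30)
```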
9.3 Internet Control Message Protocol Version 6 (ICMPv6)
IP concerns itself with moving data from one node to another. However, in order
for IP to perform this task successfully, there are many other functions that need
to be carried out: error reporting, route discovery, and diagnostics, to name a
few. All these tasks are carried out by the Internet Control Message Protocol
(see 3.2, “Internet Control Message Protocol (ICMP)” on page 109). In addition,
ICMPv6 carries out the tasks of conveying multicast group membership
information, a function that was previously performed by the IGMP protocol in
IPv4 (see 3.3, “Internet Group Management Protocol (IGMP)” on page 119) and
address resolution, previously performed by ARP (see 3.4, “Address Resolution
Protocol (ARP)” on page 119).
ICMPv6 messages and their use are specified in RFC 4443 – Internet Control
Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6)
Specification and RFC 2461 – Neighbor Discovery for IP Version 6 (IPv6). Both
RFCs are draft standards with a status of elective.
Every ICMPv6 message is preceded by an IPv6 header (and possibly some IP
extension headers). The ICMPv6 header is identified by a Next Header value of
58 in the immediately preceding header.
ICMPv6 messages all have a similar format, shown in Figure 9-14.

Figure 9-14 ICMPv6 general message format: Type, Code, Checksum, and the body of the ICMP message

Type             There are two classes of ICMPv6 messages. Error
                 messages have a Type from 0 to 127. Informational
                 messages have a Type from 128 to 255.
                 1      Destination Unreachable
                 2      Packet Too Big
                 3      Time (Hop Count) Exceeded
                 4      Parameter Problem
                 128    Echo Request
                 129    Echo Reply
                 130    Group Membership Query
                 131    Group Membership Report
                 132    Group Membership Reduction
                 133    Router Solicitation
                 134    Router Advertisement
                 135    Neighbor Solicitation
                 136    Neighbor Advertisement
                 137    Redirect Message
Code             Varies according to message type.
Checksum         Used to detect data corruption in the ICMPv6 message
                 and parts of the IPv6 header.
Body of message  Varies according to message type.
For full details of ICMPv6 messages for all types, refer to RFC 4443.
9.3.1 Neighbor discovery
Neighbor discovery is an ICMPv6 function that enables a node to identify other
hosts and routers on its links. The node needs to know of at least one router so
that it knows where to forward packets if a target node is not on its local link.
Neighbor discovery also allows a router to redirect a node to use a more
appropriate router if the node has initially made an incorrect choice.
Address resolution
Figure 9-15 shows a simple Ethernet LAN segment with four IPv6 workstations.
Figure 9-15 IPv6 address resolution example
Workstation A needs to send data to workstation B. It knows the IPv6 address of
workstation B, but it does not know how to send a packet, because it does not
know workstation B's MAC address. To find this information, it sends a neighbor
solicitation message, of the format shown in Figure 9-16.
Figure 9-16 Neighbor solicitation message format (IPv6 header: Traffic Class, Flow Label, Payload = 32, Next = 58, Hops = 255, Source Address FE80::0800:5A12:3456, Destination Address FF02::1:5A12:3458; ICMPv6 message: Type = 135, Code = 0, Reserved = 0, Target Address FE80::0800:5A12:3458, option with Opt Code = 1, Opt Len = 1, Source Link Layer Address 08005A123456)
Notice the following important fields in the IP header of this packet:

Next header: 58 (for the following ICMP message header).

Hops: Any solicitation packet that does not have hops set to 255 is discarded. This ensures that the solicitation has not crossed a router.

Destination address: This address is the solicited node address for the target workstation (a special type of multicast; see page 344). Every workstation must respond to its own solicited node address, but other workstations will simply ignore it. This is an improvement over ARP in IPv4, which uses broadcast frames that have to be processed by every node on the link.
In the ICMP message itself, notice:

Type: 135 (Neighbor Solicitation).

Target address: This is the known IP address of the target workstation.

Source link layer address: This is useful to the target workstation and saves it from having to initiate a neighbor discovery process of its own when it sends a packet back to the source workstation.
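The message just described can be assembled byte for byte. The following Python sketch builds the ICMPv6 body of Figure 9-16, including the checksum, which ICMPv6 computes over a pseudo-header taken from the IPv6 header (source and destination addresses, payload length, and the Next Header value 58); the pseudo-header rule is the standard ICMPv6 one and is stated here as background, not taken from the text above:

```python
# Sketch: building the neighbor solicitation of Figure 9-16.
# Addresses and the MAC are the example values from the figure.
import socket
import struct

def checksum(data: bytes) -> int:
    """One's-complement sum over 16-bit words (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_ns(src: str, dst: str, target: str, mac: bytes) -> bytes:
    """Build an ICMPv6 neighbor solicitation (Type 135, Code 0)."""
    tgt = socket.inet_pton(socket.AF_INET6, target)
    # Type = 135, Code = 0, Checksum = 0 (placeholder), Reserved = 0,
    # then the source link-layer address option (Opt Code 1, Opt Len 1).
    body = struct.pack("!BBHI", 135, 0, 0, 0) + tgt
    body += struct.pack("!BB", 1, 1) + mac
    # Pseudo-header: src, dst, upper-layer length, 3 zero octets, 58.
    pseudo = (socket.inet_pton(socket.AF_INET6, src)
              + socket.inet_pton(socket.AF_INET6, dst)
              + struct.pack("!IBBBB", len(body), 0, 0, 0, 58))
    csum = checksum(pseudo + body)
    return body[:2] + struct.pack("!H", csum) + body[4:]

ns = build_ns("FE80::0800:5A12:3456", "FF02::1:5A12:3458",
              "FE80::0800:5A12:3458", bytes.fromhex("08005A123456"))
```

Note that the resulting message is 32 bytes long, which matches the Payload = 32 shown in the figure's IPv6 header.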
The response to the neighbor solicitation message is a neighbor advertisement,
which has the following format (Figure 9-17).
Figure 9-17 Neighbor advertisement message (IPv6 header: Traffic Class, Flow Label, Payload = 32, Next = 58, Hops = 255, Source Address FE80::0800:5A12:3458, Destination Address FE80::0800:5A12:3456; ICMPv6 message: Type = 136, Code = 0, Reserved = 0, Target Address FE80::0800:5A12:3458, option with Opt Code = 2, Opt Len = 1, Target Link Layer Address 08005A123458)
The neighbor advertisement is addressed directly back to workstation A. The
ICMP message option contains the target IP address together with the target's
link layer (MAC) address. Note also the following flags in the advertisement
message:

Router flag: This bit is set on if the sender of the advertisement is a router.

Solicited flag: This bit is set on if the advertisement is in response to a solicitation.

Override flag: When this bit is set on, the receiving node must update an existing cached link layer entry in its neighbor cache.
After workstation A receives this packet, it commits the information to memory in
its neighbor cache, and then forwards the data packet that it originally wanted to
send to workstation B.
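The caching behavior, including the effect of the Override flag, can be sketched as follows (an illustrative Python model, not the cache layout of any particular TCP/IP stack):

```python
# Sketch of a neighbor cache honoring the Override flag described
# above: an advertisement only replaces an existing cached link-layer
# address when the Override bit is set; a new entry is always created.

class NeighborCache:
    def __init__(self):
        self._entries = {}  # IPv6 address -> link-layer (MAC) address

    def update(self, ipv6: str, mac: str, override: bool) -> None:
        """Apply a received neighbor advertisement to the cache."""
        if ipv6 not in self._entries or override:
            self._entries[ipv6] = mac

    def lookup(self, ipv6: str):
        return self._entries.get(ipv6)

cache = NeighborCache()
# First advertisement creates the entry regardless of Override:
cache.update("FE80::0800:5A12:3458", "08005A123458", override=False)
# Without Override set, an existing entry is left unchanged:
cache.update("FE80::0800:5A12:3458", "08005A999999", override=False)
# With Override set, the receiving node must update the entry:
cache.update("FE80::0800:5A12:3458", "08005A123458", override=True)
```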
Neighbor advertisement messages can also be sent by a node to force updates
to neighbor caches if it becomes aware that its link layer address has changed.
Router and prefix discovery
Figure 9-15 on page 354 shows a very simple network example. In a larger
network, particularly one connected to the Internet, the neighbor discovery
process is used to find nodes on the same link in exactly the same way.
However, it is more than likely that a node will need to communicate not just with
other nodes on the same link but with nodes on other network segments that
might be anywhere in the world. In this case, there are two important pieces of
information that a node needs to know:
- The address of a router that the node can use to reach the rest of the world
- The prefix (or prefixes) that define the range of IP addresses on the same link
as the node that can be reached without going through a router
Routers use ICMP to convey this information to hosts, by means of router
advertisements. The format of the router advertisement message is shown in
Figure 9-18 on page 358. The message generally has one or more attached
options; this example shows all three possible options.
Figure 9-18 Router advertisement message format (IPv6 header: Traffic Class, Flow Label, Payload = 64, Next = 58, Hops = 255, Source Address, Destination Address FF02::1; ICMPv6 message: Type = 134, Code = 0, Hop Limit, M and O flags, Router Lifetime, Reachable Time, Retransmission Timer; Option 1: Opt Type = 1, Opt Len = 1, Source Link Address; Option 2: Opt Type = 5, Opt Len = 1, MTU; Option 3: Opt Type = 3, Opt Len = 4, Prefix Len, L and A flags, Valid Lifetime, Preferred Lifetime, Prefix)
Notice the following important fields in the IP header of this packet:

Next header: 58 (for the following ICMP message header).

Hops: Any advertisement packet that does not have hops set to 255 is discarded. This ensures that the packet has not crossed a router.

Destination address: This address is the special multicast address defining all systems on the local link.
In the ICMP message itself:

Type: 134 (Router Advertisement).

Hop limit: The default value that a node should place in the Hop Count field of its outgoing IP packets.

M: 1-bit Managed Address Configuration Flag (see "Stateless address autoconfiguration" on page 363).

O: 1-bit Other Stateful Configuration Flag (see "Stateless address autoconfiguration" on page 363).

Router lifetime: How long the node should consider this router to be available. If this time period is exceeded and the node has not received another router advertisement message, the node should consider this router to be unavailable.

Reachable time: This sets a parameter for all nodes on the local link. It is the time in milliseconds that the node should assume a neighbor is still reachable after having received a response to a neighbor solicitation.

Retransmission timer: This sets the time, in milliseconds, that nodes should allow between retransmitting neighbor solicitation messages if no initial response is received.
The three possible options in a router advertisement message are:

Option 1 (source link address): Allows a receiving node to respond directly to the router without having to do a neighbor solicitation.

Option 5 (MTU): Specifies the maximum transmission unit size for the link. For some media, such as Ethernet, this value is fixed, so this option is not necessary.

Option 3 (prefix): Defines the address prefix for the link. Nodes use this information to determine when they do, and do not, need to use a router. Prefix options used for this purpose have the L (link) bit set on. Prefix options are also used as part of address configuration, in which case the A bit is set on. See "Stateless address autoconfiguration" on page 363 for further details.
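These options share a common type-length-value layout in which the length field counts units of 8 octets; that is why the figures show Opt Len = 1 for the 8-byte link-layer address option. A minimal option walker can be sketched as follows (Python; the 8-octet unit rule and the requirement to discard zero-length options are from RFC 2461, stated here as assumptions rather than taken from the text):

```python
# Sketch of walking the options attached to a neighbor discovery
# message. Each option is type (1 octet), length (1 octet, in units
# of 8 octets), then data. A zero-length option is malformed.

def parse_nd_options(data: bytes):
    """Return (type, value) pairs from a neighbor discovery option block."""
    offset = 0
    options = []
    while offset + 2 <= len(data):
        opt_type, opt_len = data[offset], data[offset + 1]
        if opt_len == 0:          # malformed; the packet must be discarded
            raise ValueError("zero-length ND option")
        end = offset + opt_len * 8
        options.append((opt_type, data[offset + 2:end]))
        offset = end
    return options

# The 8-byte source link-layer address option from Figure 9-16:
raw = bytes([1, 1]) + bytes.fromhex("08005A123456")
```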
A router constantly sends unsolicited advertisements at a frequency defined in
the router configuration. A node might, however, want to obtain information about
the nearest router without having to wait for the next scheduled advertisement
(for example, a new workstation that has just attached to the network). In this
case, the node can send a router solicitation message. The format of the router
solicitation message is shown in Figure 9-19.
Figure 9-19 Router solicitation message format (IPv6 header: Traffic Class, Flow Label, Payload = 16, Next = 58, Hops = 255, Source Address, Destination Address FF02::2; ICMPv6 message: Type = 133, Code = 0, Reserved = 0, option with Opt Type = 1, Opt Len = 1, Source Link Address)
Notice the following important fields in the IP header of this packet:

Next header: 58 (for the following ICMP message header).

Hops: Any solicitation packet that does not have hops set to 255 is discarded. This ensures that the packet has not crossed a router.

Destination address: This address is the special multicast address defining all routers on the local link.

In the ICMP message itself:

Type: 133 (Router Solicitation).

Option 1 (source link address): Allows the receiving router to respond directly to the node without having to do a neighbor solicitation.
Each router that receives the solicitation message responds with a router
advertisement sent directly to the node that sent the solicitation (not to the all
systems link-local multicast address).
The router advertisement mechanism ensures that a node will always be aware
of one or more routers through which it is able to connect to devices outside of its
local links. However, in a situation where a node is aware of more than one
router, it is likely that the default router selected when sending data will not
always be the most suitable router to select for every packet. In this case,
ICMPv6 allows for redirection to a more efficient path for a particular destination.
Figure 9-20 Redirection example (node X, router A, router B, node Y)
Consider the simple example shown in Figure 9-20. Node X is aware of routers A
and B, having received router advertisement messages from both. Node X wants
to send data to node Y. By comparing node Y's IP address against the local link
prefix, node X knows that node Y is not on the local link and that it must therefore
use a router. Node X selects router A from its list of default routers and forwards
the packet. Obviously, this is not the most efficient path to node Y. As soon as
router A has forwarded the packet to node Y (through router B), router A sends a
redirect message to node X. The format of the redirect message (complete with
IP header) is shown in Figure 9-21.
Figure 9-21 Redirect message format (IPv6 header: Traffic Class, Flow Label, Payload Length, Next = 58, Hops = 255, Source Address (router A), Destination Address (node X); ICMPv6 message: Type = 137, Code = 0, Reserved = 0, Target Address (router B), Destination Address (node Y); Option 2: Opt Type = 2, Opt Len = 1, Target Link Address (router B); Option 4: Opt Type = 4, Opt Length, Reserved = 0, IP Header and Data of the redirected packet)
The fields to note in the message are:

Type: 137 (Redirect).

Target address: This is the address of the router that should be used when trying to reach node Y.

Destination address: Node Y's IP address.

Option 2 (target link layer address): Gives the link layer address of router B so that node X can reach it without a neighbor solicitation.

Option 4 (redirected header): Includes the original packet sent by node X, full IP header, and as much of the data that will fit so that the total size of the redirect message does not exceed 576 bytes.
Neighbor unreachability detection
An additional responsibility of the neighbor discovery function of ICMPv6 is
neighbor unreachability detection (NUD).
A node actively tracks the reachability state of the neighbors to which it is
sending packets. It can do this in two ways: either by monitoring the upper-layer
protocols to see if a connection is making forward progress (for example, TCP
acknowledgments are being received), or by issuing specific neighbor solicitations
to check that the path to a target host is still available. When a path to a neighbor
appears to be failing, appropriate action is taken to try to recover the link. This
includes restarting the address resolution process or deleting a neighbor cache
entry so that a new router can be tried in order to find a working path to the
destination.
NUD is used for all paths between nodes, including host-to-host, host-to-router,
and router-to-host. NUD can also be used for router-to-router communication if
the routing protocol being used does not already include a similar mechanism.
For further information about neighbor unreachability detection, refer to RFC
2461.
Stateless address autoconfiguration
Although the 128-bit address field of IPv6 solves a number of problems inherent
in IPv4, the size of the address itself represents a potential problem to the
TCP/IP administrator. Because of this, IPv6 has been designed with the
capability to automatically assign an address to an interface at initialization time,
with the intention that a network can become operational with minimal to no
action on the part of the TCP/IP administrator. IPv6 nodes generally use
autoconfiguration to obtain their IPv6 address. This can be achieved using
DHCP (see 9.5, “DHCP in IPv6” on page 371), which is known as stateful
autoconfiguration, or by stateless autoconfiguration, which is a new feature of
IPv6 and relies on ICMPv6.
The stateless autoconfiguration process is defined in RFC 2462 – IPv6 Stateless
Address Autoconfiguration. It consists of the following steps:
1. During system startup, the node begins the autoconfiguration by obtaining an
interface token from the interface hardware, for example, a 48-bit MAC
address on token-ring or Ethernet networks.
2. The node creates a tentative link-local unicast address by combining the
well-known link-local prefix (FE80::/10) with the interface token.
3. The node attempts to verify that this tentative address is unique by issuing a
neighbor solicitation message with the tentative address as the target. If the
address is already in use, the node will receive a neighbor advertisement in
response, in which case the autoconfiguration process stops. (Manual
configuration of the node is then required.)
4. If no response is received, the node assigns the link-level address to its
interface. The host then sends one or more router solicitations to the
all-routers multicast group. If there are any routers present, they will respond
with a router advertisement. If no router advertisement is received, the node
attempts to use DHCP to obtain an address and configuration information. If
no DHCP server responds, the node continues using the link-level address
and can communicate with other nodes on the same link only.
5. If a router advertisement is received in response to the router solicitation, this
message contains several pieces of information that tell the node how to
proceed with the autoconfiguration process (see Figure 9-18 on page 358):
– M flag: Managed address configuration.
If this bit is set, the node uses DHCP to obtain its IP address.
– O flag: Other stateful configuration.
If this bit is set, the node uses DHCP to obtain other configuration
parameters.
– Prefix option: If the router advertisement has a prefix option with the A bit
(autonomous address configuration flag) set on, the prefix is used for
stateless address autoconfiguration.
6. If stateless address configuration is used, the prefix is taken from the router
advertisement and added to the interface token to form the global unicast IP
address, which is assigned to the network interface.
7. The working node continues to receive periodic router advertisements. If the
information in the advertisement changes, the node must take appropriate
action.
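Steps 1 and 2 of the process can be sketched concretely. The following Python example derives an interface token from a 48-bit MAC address and combines it with the well-known link-local prefix FE80::/10; the FF:FE insertion and universal/local bit flip are the conventional EUI-64 expansion for Ethernet (RFC 2464), which is an assumption beyond the steps listed above, and the MAC value is an arbitrary example:

```python
# Sketch of steps 1-2 of stateless autoconfiguration: build a
# tentative link-local address from a 48-bit MAC address.
import ipaddress

def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Combine the FE80::/10 prefix with an EUI-64-style interface token."""
    octets = bytearray(int(b, 16) for b in mac.split(":"))
    octets[0] ^= 0x02                      # flip the universal/local bit
    # Insert FF:FE between the upper and lower halves of the MAC.
    token = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + b"\x00" * 6 + token)
```

Before the node assigns this tentative address, step 3 above still applies: it must be verified as unique via a neighbor solicitation.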
Note that it is possible to use both stateless and stateful configuration
simultaneously. It is quite likely that stateless configuration will be used to obtain
the IP address, but DHCP will then be used to obtain further configuration
information. However, plug-and-play configuration is possible in both small and
large networks without the requirement for DHCP servers.
The stateless address configuration process, together with the fact that more
than one address can be allocated to the same interface, also allows for the
graceful renumbering of all the nodes on a site (for example, if a switch to a new
network provider necessitates new addressing) without disruption to the network.
For further details, refer to RFC 2462.
9.3.2 Multicast Listener Discovery (MLD)
The process used by a router to discover the members of a particular multicast
group is known as Multicast Listener Discovery (MLD). MLD is a subset of
ICMPv6 and provides the equivalent function of IGMP for IPv4 (see 3.3, “Internet
Group Management Protocol (IGMP)” on page 119). This information is then
provided by the router to whichever multicast routing protocol is being used so
that multicast packets are correctly delivered to all links where there are nodes
listening for the appropriate multicast address.
MLD is specified in RFC 2710 – Multicast Listener Discovery (MLD) for IPv6.
MLD uses ICMPv6 messages of the format shown in Figure 9-22.
Figure 9-22 MLD message format (IPv6 header: Vers., Traffic Class, Flow Label, Payload Length, Next = 58, Hops = 1, link-local Source Address, Destination Address; ICMPv6 message: Type, Code = 0, Checksum, Max. Response Delay, Reserved, IP Multicast Address)
Note the following fields in the IPv6 header of the message:

Next header: 58 (for the following ICMPv6 message header).

Hops: Always set to 1.

Source address: A link-local source address is used.
In the MLD message itself, notice:

Type: There are three types of MLD messages:

Multicast listener query: There are two types of queries:
– General query: Used to find which multicast addresses are being listened for on a link.
– Multicast-address-specific query: Used to find if any nodes are listening for a specific multicast address on a link.

Multicast listener report: Used by a node to report that it is listening to a multicast address.

Multicast listener done: Used by a node to report that it is ceasing to listen to a multicast address.

Reserved: Set to 0 by the sender and ignored by receivers.

Max response delay: This sets the maximum allowed delay before a responding report must be sent. This parameter is only valid in query messages. Increasing this parameter can prevent sudden bursts of high traffic if there are a lot of responders on a network.

Multicast address: In a query message, this field is set to zero for a general query, or set to the specific IPv6 multicast address for a multicast-address-specific query. In a report or done message, this field contains the multicast address being listened for.
A router uses MLD to learn which multicast addresses are being listened for on
each of its attached links. The router only needs to know that nodes listening for
a particular address are present on a link; it does not need to know the unicast
address of those listening nodes, or how many listening nodes are present.
A router periodically sends a General Query on each of its links to the all nodes
link-local address (FF02::1). When a node listening for any multicast addresses
receives this query, it sets a delay timer (which can be anything between 0 and
maximum response delay) for each multicast address for which it is listening. As
each timer expires, the node sends a multicast listener report message
containing the appropriate multicast address. If a node receives another node's
report for a multicast address while it has a timer still running for that address, it
stops its timer and does not send a report for that address. This prevents
duplicate reports from being sent and, together with the timer mechanism,
prevents excess or bursty traffic from being generated.
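The timer-and-suppression behavior just described can be simulated. The following Python sketch models listeners choosing random delays and suppressing their own reports once another node has reported the same address (the event model is illustrative only; no packets are involved):

```python
# Sketch of MLD report suppression after a General Query: each node
# sets a random timer (0 .. max response delay) per listened-for
# address, and cancels its pending report for an address once it
# hears another node's report for that address.
import random

def simulate_general_query(listeners, max_delay_ms, seed=0):
    """listeners: dict node -> set of multicast addresses listened for.
    Returns the (node, address) reports actually sent, in time order."""
    rng = random.Random(seed)
    # One pending (delay, node, address) timer per listened-for address.
    timers = [(rng.uniform(0, max_delay_ms), node, addr)
              for node, addrs in listeners.items() for addr in addrs]
    reported = set()
    sent = []
    for _, node, addr in sorted(timers):
        if addr in reported:
            continue            # heard another node's report: suppress
        reported.add(addr)
        sent.append((node, addr))
    return sent
```

However the random delays fall, exactly one report per multicast address is sent, which is all the router needs.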
The router manages a list of, and sets a timer for, each multicast address it is
aware of on each of its links. If one of these timers expires without a report being
received for that address, the router assumes that no nodes are still listening for
that address, and the address is removed from the list. Whenever a report is
received, the router resets the timer for that particular address.
When a node has finished listening to a multicast address, if it was the last node
on a link to send a report to the router (that is, its timer delay was not interrupted
by the receipt of another node's report), it sends a multicast listener done
message to the router. If the node was interrupted by another node before its
timer expired, it assumes that other nodes are still listening to the multicast
address on the link and therefore does not send a done message.
When a router receives a done message, it sends a multicast-address-specific
query on the link. If no report is received in response to this query, the
router assumes that there are no nodes still listening to this multicast address
and removes the address from its list.
9.4 DNS in IPv6
With the introduction of 128-bit addresses, IPv6 makes it even more difficult for
the network user to be able to identify another network user by means of the IP
address of his or her network device. The use of the Domain Name System
(DNS) therefore becomes even more of a necessity.
A number of extensions to DNS are specified to support the storage and retrieval
of IPv6 addresses. These are defined in RFC 3596 – DNS Extensions to Support
IP Version 6, which is a proposed standard with elective status. However, there
is also work in progress on usability enhancements to this RFC, described in an
Internet draft of the same name.
The following extensions are specified:
- A new resource record type, AAAA, which maps the domain name to the IPv6
address
- A new domain (IP6.INT), which is used to support address-to-domain name lookups
- A change to the definition of existing queries so that they will perform correct
processing on both A and AAAA record types
9.4.1 Format of IPv6 resource records
RFC 3596 defines the format of the AAAA record as similar to an A resource
record, but with the 128-bit IPv6 address encoded in the data section and a Type
value of 28 (decimal).
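Because the AAAA data section is simply the 128-bit address in network byte order, encoding it is a one-liner. A minimal sketch (Python; the Type value 28 matches RFC 3596 as quoted above, everything else is illustrative):

```python
# Sketch: AAAA record RDATA per RFC 3596 is the 16-byte binary
# address in network byte order, with a record Type value of 28.
import ipaddress

AAAA_TYPE = 28

def encode_aaaa_rdata(address: str) -> bytes:
    """Return the 16-byte RDATA for an AAAA record."""
    return ipaddress.IPv6Address(address).packed

rdata = encode_aaaa_rdata("2222:0:1:2:3:4:5678:9ABC")
```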
A special domain, IP6.INT, is defined for inverse (address-to-host name) lookups
(similar to the IN-ADDR.ARPA domain used in IPv4). As in IPv4, the address must
be entered in reverse order, but hexadecimal digits are used rather than decimal
digits. For example, for the IPv6 address:
2222:0:1:2:3:4:5678:9ABC
The inverse domain name entry is:
cba98765400030002000100000002222.IP6.INT.
So, if the previous address relates to a node, we might expect to
see the following entries in the name server zone data:
99999 IN AAAA 2222:0:1:2:3:4:5678:9ABC
cba98765400030002000100000002222.IP6.INT. IN PTR
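The nibble reversal used in the PTR owner name above can be computed mechanically: expand the address to its 32 hexadecimal digits, reverse them, and append the IP6.INT suffix. A short Python sketch (with the dot separators between nibbles that are omitted from the zone entry above for clarity):

```python
# Sketch: derive the IP6.INT inverse-lookup name for an IPv6 address
# by reversing its 32 hexadecimal nibbles.
import ipaddress

def ip6_int_name(address: str) -> str:
    """Return the nibble-reversed IP6.INT domain name for `address`."""
    # `exploded` gives the full 8-group form, e.g. 2222:0000:...:9abc.
    nibbles = ipaddress.IPv6Address(address).exploded.replace(":", "")
    return ".".join(reversed(nibbles)) + ".IP6.INT."
```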
Proposed changes to resource records
The IPv6 addressing system has been designed to allow for multiple addresses
on a single interface and to facilitate address renumbering (for example, when a
company changes one of its service providers). RFC 3596 – DNS Extensions to
Support IP Version 6 proposes changes to the format of the AAAA resource
record to simplify network renumbering.
All characters making up the reversed IPv6 address in this PTR entry should be separated by a
period (.). These have been omitted in this example for clarity.
The proposed format of the data section of the AAAA record is shown in
Figure 9-23.
Figure 9-23 AAAA resource record: Proposed data format (IPv6 address, prefix length, domain name)

IPv6 address: A 128-bit address (contains only the lower bits of the address; the remainder is defined by the prefix).

Prefix length: The length of the prefix (0-128).

Domain name: The domain name of the prefix.
To see how this format works, consider the example shown in Figure 9-24.
Figure 9-24 Prefix numbering example
Site X is multihomed to two providers, PROV1 and PROV2. PROV1 gets its
transit services from top-level provider TOP1. PROV2 gets its service from
TOP2. TOP1 has the top-level aggregate (TLA ID + format prefix) of 2111. TOP2
has the TLA of 2222.
TOP1 has assigned the next-level aggregate (NLA) of 00AB to PROV1. PROV2
has been assigned the NLA of 00BC by TOP2.
PROV1 has assigned the subscriber identifier 00A1 to site X. PROV2 has
assigned the subscriber identifier 00B1 to site X.
Node ND1, at site X, which has the interface token of 10005A123456, is
therefore configured with two IP addresses, one formed from each provider's
prefix.
Site X is represented by the domain name TEST.COM. Each provider has its own
domain: PROV1.COM, PROV2.COM, TOP1.COM, and TOP2.COM. In each of these
domains, an IP6 subdomain is created that is used to hold prefixes. The node
ND1 can now be represented by the following entries in the DNS:
ND1.TEST.COM AAAA ::1000:5A12:3456 80
IP6.TOP1.COM AAAA 2111::
IP6.TOP2.COM AAAA 2222::
This format simplifies the job of the DNS administrator considerably and makes
renumbering changes much easier to implement. Say, for example, site X
decides to stop using links from providers PROV1 and PROV2 and invests in a
connection direct from the top-level service provider TOP1 (which allocates the
next-level aggregate 00CD to site X). The only change necessary in the DNS is
for the two IP6.TEST.COM entries to be replaced with a single entry pointing at
the new prefix.
9.5 DHCP in IPv6
Although IPv6 introduces stateless address autoconfiguration, DHCP retains its
importance as the stateful alternative for those sites that want to have more
control over their addressing scheme. Used together with stateless
autoconfiguration, DHCP provides a means of passing additional configuration
options to nodes after they have obtained their addresses. (See 3.7, “Dynamic
Host Configuration Protocol (DHCP)” on page 130 for a detailed description of
DHCP.)
RFC 3315 defines DHCP in IPv6, and RFC 3736 defines stateless DHCP for
IPv6.
DHCPv6 has some significant differences from DHCPv4, because it takes
advantage of some of the inherent enhancements of the IPv6 protocol. Some of
the principal differences include:
- As soon as a client boots, it already has a link-local IP address, which it can
use to communicate with a DHCP server or a relay agent.
- The client uses multicast addresses to contact the server, rather than the
broadcast addresses used in IPv4.
- IPv6 allows the use of multiple IP addresses per interface, and DHCPv6 can
provide more than one address when requested.
- Some DHCP options are now unnecessary. Default routers, for example, are
now obtained by a client using IPv6 neighbor discovery.
- DHCP messages (including address allocations) appear in IPv6 message
extensions, rather than in the IP header as in IPv4.
- There is no requirement for BOOTP compatibility.
- There is a new reconfigure message, which is used by the server to send
configuration changes to clients (for example, the reduction in an address
lifetime). Clients must continue to listen for reconfigure messages after they
have received their initial configuration.
9.5.1 DHCPv6 messages
The following DHCPv6 messages are currently defined:

DHCP Solicit: This is an IP multicast message. The DHCP client sends the message to FF02::1:2, the well-known multicast address for all DHCP agents (relays and servers). If received by a relay, the relay forwards the message to FF05::1:3, the well-known multicast address for all DHCP servers.

DHCP Advertise: This is a unicast message sent in response to a DHCP Solicit. A DHCP server will respond directly to the soliciting client if on the same link, or through the relay agent if the DHCP Solicit was forwarded by a relay. The advertise message can contain one or more extensions (DHCP options).

DHCP Request: After the client has located the DHCP server, the DHCP request (a unicast message) is sent to request an address, configuration parameters, or both. The request must be forwarded by a relay if the server is not on the same link as the client. The request can contain extensions (options specified by the client) that can be a subset of all the options available on the server.

DHCP Reply: An IP unicast message sent in response to a DHCP request (can be sent directly to the client or through a relay). Extensions contain the address, parameters, or both committed to the client.

DHCP Release: An IP unicast message sent by the client to the server, informing the server of resources that are being released.

DHCP Reconfigure: An IP unicast or multicast message, sent by the server to one or more clients, to inform them that there is new configuration information available. The client must respond to this message with a DHCP request to request these new changes from the server.
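The well-known multicast addresses above, and the routing of a Solicit depending on whether a relay is involved, can be captured in a few lines. In this Python sketch, the two multicast addresses are the ones quoted in the text, while the numeric message codes are the RFC 3315 values (for example, Solicit = 1), which are not given above:

```python
# Sketch: DHCPv6 well-known multicast addresses (from the text) and
# message codes (RFC 3315 values, added here for illustration).

ALL_DHCP_AGENTS = "FF02::1:2"    # all relays and servers, link scope
ALL_DHCP_SERVERS = "FF05::1:3"   # all servers, site scope

MESSAGE_CODES = {
    "Solicit": 1,
    "Advertise": 2,
    "Request": 3,
    "Reply": 7,
    "Release": 8,
    "Reconfigure": 10,
}

def solicit_destination(via_relay: bool) -> str:
    """A client multicasts its Solicit to all agents on the link; a
    relay that receives it forwards it to the all-servers address."""
    return ALL_DHCP_SERVERS if via_relay else ALL_DHCP_AGENTS
```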
For further details about DHCPv6, refer to RFC 3315 and RFC 3736.
9.6 IPv6 mobility support
There are some unique requirements in mobile network applications. For
example, while a mobile station or mobile node is always logically identified by its
home address, it can physically move around in the IPv6 Internet. For a mobile
node to remain reachable while moving, each mobile node has to get a
temporary address when it attaches to a visited network.
While situated away from its home, a mobile node is associated with a care-of
address, which provides information about the mobile node's current location.
IPv6 packets addressed to a mobile node's home address are transparently
routed to its care-of address. IPv6 nodes cache the binding of a mobile
node's home address with its care-of address, and then send any packets
destined for the mobile node directly to it at this care-of address.
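The binding cache just described can be sketched as a mapping from home address to care-of address with a lifetime, after which packets fall back to routing via the home network. This Python model is purely illustrative (the field names, timing model, and the 2001:DB8 documentation addresses in the usage are assumptions, not Mobile IPv6 wire format):

```python
# Sketch of a Mobile IPv6 binding cache: home address -> care-of
# address, each binding valid only for a limited lifetime.
import time

class BindingCache:
    def __init__(self):
        self._bindings = {}   # home address -> (care-of address, expiry)

    def update(self, home: str, care_of: str, lifetime_s: float) -> None:
        """Record or refresh a binding for `lifetime_s` seconds."""
        self._bindings[home] = (care_of, time.monotonic() + lifetime_s)

    def route_to(self, home: str) -> str:
        """Return the address packets for `home` should be sent to."""
        entry = self._bindings.get(home)
        if entry and time.monotonic() < entry[1]:
            return entry[0]       # send directly to the care-of address
        self._bindings.pop(home, None)
        return home               # fall back to routing via the home network
```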
At any visited location, there are usually multiple service providers competing
in the wireless market, so multiple network prefixes are available.
Mobile IPv6 provides binding support for the address on an attached visiting
network. The native IPv6 routing header also supports route selection, so that a
packet can go through desired networks. This capability allows network service
provider selection and, as a result, enforces a security and service policy by
going through only authorized gateway service nodes.
There are certain enhancements in IPv6 that are particularly well suited to the
mobile environment, including:
- A mobile node uses a temporary address while away from its home location. It
can use the IPv6 Destination Options header to store its home address. An
intended destination can access this field to get the mobile node's home
address for substitution when processing the packet.
- A mobile station can use the routing header to make its packets follow a
particular path in order to connect through a selected service provider network.
- Also, most packets sent to a mobile node while it is away from its home
location can be tunneled by using IPv6 routing (extension) headers, rather
than a complete encapsulation, as used in Mobile IPv4, which reduces the
processing cost of delivering packets to mobile nodes.
- Unlike Mobile IPv4, there is no requirement for routers to act as “foreign
agents” on behalf of the mobile node, because neighbor discovery and
address autoconfiguration allow the node to operate away from home without
any special support from a local router.
- The dynamic home agent address discovery mechanism in Mobile IPv6
returns a single reply to the mobile node. The directed broadcast approach
used in IPv4 returns separate replies from each home agent.
To better use the native IPv6 capabilities in next generation (3G) wireless
networks and services, the IPv6 working group and the 3rd Generation Partnership
Project (3GPP) working group have conducted joint discussions. By adopting
native IPv6 features (for example, IPv6 address prefix allocation),
they ensure that handsets are compatible with mobile computers in sharing
drivers and related software.
On top of the native IPv6 support for mobility, standard extensions are added to
ensure that any node, whether mobile or stationary, can communicate efficiently
with a mobile node. Additional Mobile IPv6 features include:
- Mobile IPv6 allows a mobile node to move from one link to another without
changing the mobile node's “home address.” Packets can be routed to the
mobile node using this address regardless of the mobile node's current point
of attachment to the Internet. The mobile node can also continue to
communicate with other nodes (stationary or mobile) after moving to a new
link. The movement of a mobile node away from its home link is thus
transparent to transport and higher-layer protocols and applications.
- The Mobile IPv6 protocol is just as suitable for mobility across homogeneous
media as for mobility across heterogeneous media. For example, Mobile IPv6
facilitates node movement from one Ethernet segment to another as well as
node movement from an Ethernet segment to a wireless LAN cell, with the
mobile node's IP address remaining unchanged in spite of such movement.
򐂰 You can think of the Mobile IPv6 protocol as solving the network layer mobility
management problem. Some mobility management problems, for example,
handover among wireless transceivers, each of which covers only a very
small geographic area, have been solved using link layer techniques. As
another example, in many current wireless LAN products, link layer mobility
mechanisms allow a “handover” of a mobile node from one cell to another,
re-establishing link layer connectivity to the node in each new location.
Note: In mobility terminology, a handover deals with moving from one cell
to another. But the concept can be generalized into a
wireless-wireline integration environment. For example:
򐂰 Layer-2 handover provides a process by which the mobile node
changes from one link layer connection to another in a change of a
wireless or wireline access point.
򐂰 Subsequent to an L2 handover, a mobile node detects a change in an
on-link subnet prefix that requires a change in the primary care-of
address (an L3 handover).
򐂰 Mobile IPv6 route optimization avoids congestion of the home network by
getting a mobile node and a correspondent node to communicate directly.
Route optimization can operate securely even without prearranged security
associations.
򐂰 Support for route optimization is a fundamental part of the protocol, rather
than a nonstandard set of extensions. It is expected that route optimization
can be deployed on a global scale between all mobile nodes and
correspondent nodes.
TCP/IP Tutorial and Technical Overview
򐂰 The IPv6 Neighbor Unreachability Detection assures symmetric reachability
between the mobile node and its default router in the current location. Most
packets sent to a mobile node while away from home in Mobile IPv6 are sent
using an IPv6 routing header rather than IP encapsulation, increasing
efficiencies when compared to Mobile IPv4.
򐂰 Mobile IPv6 is decoupled from any particular link layer, because it uses IPv6
Neighbor Discovery instead of Address Resolution Protocol (ARP). This also
improves the robustness of the protocol.
򐂰 Mobile IPv6 defines a new IPv6 protocol, using the Mobility header to carry
the following messages:
– Home Test Init
– Home Test
– Care-of Test Init
– Care-of Test
These four messages perform the return routability procedure from the
mobile node to a correspondent node.
– Binding Update and Acknowledgement
A Binding Update is used by a mobile node to notify a node or the mobile
node's home agent of its current binding. The Binding Update sent to the
mobile node's home agent to register its primary care-of address is
marked as a “home registration.”
– Binding Refresh Request
A Binding Refresh Request is used by a correspondent node to request
that a mobile node reestablish its binding with the correspondent node.
The association of the home address of a mobile node with a “care-of”
address for that mobile node is known as a binding, and it remains in
effect only for a limited lifetime.
򐂰 Mobile IPv6 also introduces four new ICMP message types, two for use in the
dynamic home agent address discovery mechanism, and two for
renumbering and mobile configuration mechanisms:
– The following two new ICMP message types are used for home agent
address discovery: Home Agent Address Discovery Request and Home
Agent Address Discovery Reply.
– The next two message types are used for network renumbering and
address configuration on the mobile node: Mobile Prefix Solicitation and
Mobile Prefix Advertisement.
In summary, IPv6 provides native support for mobile applications. Additional
extensions have also been added to Mobile IPv6 protocols. The IETF has been
cooperating with other standards organizations, such as 3GPP, on Mobile IPv6.
For further information about mobility support in IPv6, refer to RFC 3775.
9.7 IPv6 new opportunities
IPv6 opens up new opportunities in infrastructure and services, as well as in
research.
9.7.1 New infrastructure
As new internet appliances are added into the IP world, the Internet becomes a
new infrastructure in multiple dimensions:
򐂰 IPv6 can serve as the next generation wireless core network infrastructure.
As described in 9.6, “IPv6 mobility support” on page 372, various capabilities
in security, addressing, tunneling, and so on have enabled mobility.
򐂰 Additional sensor devices can be connected to the IPv6 backbone, each with
an individual IP address. These collective sensor networks will become part
of the fabric of the IPv6 network infrastructure.
򐂰 “Smart” networks with sufficient bandwidth and quality of service make the
Internet available for phone calls and multimedia applications. We expect that
the next generation IPv6 network will replace the traditional telephone
network to become the dominant telecommunication infrastructure.
򐂰 As virtualization is widely deployed in both computing data centers and
network services, the IPv6 functions become mandatory in security, in flow
label processing, and so on. Next generation data centers and network
services will evolve around the IPv6 platforms.
򐂰 IPv6 can create a new virtual private network (VPN) infrastructure, with
inherently built-in tunneling capabilities. It also decouples security boundaries
from the organization perimeter in the security policy. We expect that network
virtualization is possible with IPv6 VPN on-demand provisioning.
򐂰 Inside a computer, the traditional I/O bus architecture might be replaced by a
pure IP packet exchanged structure. This scheme might further improve the
network computing infrastructure by separating the computing and storage
components physically.
TCP/IP Tutorial and Technical Overview
9.7.2 New services
The basic features and new functions in IPv6 provide stimulation to new services
creation and deployment. Here are some high-level examples. We encourage
you to refer to Part 3, “Advanced concepts and new technologies” on page 721
for more details.
򐂰 Presence Service (refer to Chapter 19, “Presence over IP” on page 707) can
be developed on top of Location Based Service (LBS). For example, in pure
LBS, movie theaters can post attractive title advertisements to a patron’s
mobile device when entering the movie zone. In Presence Service, users can
set up additional preferences and other policy attributes. As a result, the
underlying network services can be aware of user preference and privacy
requirements. So, rather than pushing the advertisement to all patrons in the
movie zone, advertisements are filtered and tailored according to
“do-not-disturb” or “category-specific” preferences.
򐂰 Anonymous Request Service (ARS) can be developed by exploiting the new
IPv6 address allocation functions. For example, a location address can use a
random but unique link ID to send packets in reporting ethical or policy
violations within an enterprise or in government services.
򐂰 Voice and Video over IP (which we call V2oIP in IPv6) will replace traditional
phone service and provide video services over IPv6. For details about VoIP,
refer to Chapter 20, “Voice over Internet Protocol” on page 723. For details
about IPTV, refer to Chapter 21, “Internet Protocol Television” on page 745.
򐂰 Always On Services (AOS) allows V2oIPv6 to be ready for service with ease
of use. Communication sessions can be kept alive and active using IPv6
mobility functions as well as the IPv6 QoS capability. The “always on”
availability is independent of location, movement, or infrastructure.
򐂰 On-demand Routing Services (ORS) eliminates routing table updates for
unused routes, balancing slow-path and fast-path processing especially in
V2oIPv6 environment.
򐂰 IPv6 Management Service (IMS) provides address automatic inventory,
service provisioning, and service assurance services.
򐂰 IPv6 Operation Service (IOS) supplies on demand configuration, logging,
diagnosis, and control services.
򐂰 IPv6 Testing Service (ITS) provides capabilities in functional conformance
and performance testing for implementations of IETF IPv6 standards or
RFCs. Interoperability testing is also a key ITS service.
9.7.3 New research and development platforms
In addition to new opportunities for users and network service vendors, there are
IPv6 research opportunities for educational and research and development
institutions as well. For example:
򐂰 Historically, one of the IETF IP next generation (IPng) projects was the
development of the 6Bone, which is an Internet-wide virtual network, layered
on top of the physical IPv4 Internet. The 6Bone consists of many islands
supporting IPv6 packets, linked by tunnels across the existing IPv4 backbone.
The 6Bone was widely used for testing of IPv6 protocols and products.
On June 6, 2006, the 6Bone was phased out per agreement with the IETF
IPv6 community.
For more information, see:
򐂰 The 6NET project demonstrated that growth of the Internet can be met using
new IPv6 technology. 6NET built a native IPv6-based network connecting 16
European countries. The network allows IPv6 service testing and
interoperability with enterprise applications.
For more information, see:
򐂰 Internet2 built an experimental IPv6 infrastructure. The Internet2 consortium
(not a network) established an IPv6 working group to perform research and
education in the following areas:
– Infrastructure engineering, operations, and deployment
– Education for campus network engineers
– Exploring the motivation for use of IPv6
For more information, see:
򐂰 Another regional IPv6 example is the Moonv6 project, one of the largest
native IPv6 networks in existence.
For more information, see:
New open research problems in IPv6 include:
򐂰 IPv6 and next generation network architecture design: While IPv6 and
associated protocols have solved problems of message specification and
control management, the architecture of the next generation IPv6 network
itself is still under experiment.
򐂰 Network infrastructure and service management: Peer-to-peer (P2P) network
applications have flooded the Internet, yet network and service management
and control capabilities are lacking. While we should maintain the access
and openness of the Internet, the business and commercial reality in the IP
space requires fundamental rethinking about network and service
management infrastructure support.
򐂰 Security: In addition to the native security functions supplied in IPv6
protocols, IPv6 network security architecture needs to define how to extend
security across upper layers of IP networks:
– An integrated security infrastructure binds application security policies
to underlying network security capabilities.
– An integrated security infrastructure also combines content protection into
a distribution and transport security layer.
򐂰 Real-time control capability: IPv6 quality of service features provide real-time
support of voice and multimedia applications. Additional research topics
include signaling and integration with IP multimedia subsystems.
򐂰 IPv6 network virtualization: Automatic configuration inventory and
provisioning capabilities have to be studied in order to allocate networking
resources and transport on demand.
9.8 Internet transition: Migrating from IPv4 to IPv6
If the Internet is to realize the benefits of IPv6, a period of transition will be
necessary when new IPv6 hosts and routers are deployed alongside existing
IPv4 systems. RFC 2893 – Transition Mechanisms for IPv6 Hosts and Routers
and RFC 2185 – Routing Aspects of IPv6 Transition define a number of
mechanisms to be employed to ensure both compatibility between old and new
systems and a gradual transition that does not impact the functionality of the
Internet. These techniques are sometimes collectively termed Simple Internet
Transition (SIT). The transition employs the following techniques:
򐂰 Dual-stack IP implementations for hosts and routers that must interoperate
between IPv4 and IPv6.
򐂰 Embedding of IPv4 addresses in IPv6 addresses. IPv6 hosts will be assigned
addresses that are interoperable with IPv4, and IPv4 host addresses will be
mapped to IPv6.
򐂰 IPv6-over-IPv4 tunneling mechanisms for carrying IPv6 packets across IPv4
router networks.
򐂰 IPv4/IPv6 header translation. This technique is intended for use when
implementation of IPv6 is well advanced and only a few IPv4-only systems
remain.
9.8.1 Dual IP stack implementation: The IPv6/IPv4 node
The simplest way to ensure that a new IPv6 node maintains compatibility with
existing IPv4 systems is to provide a dual IP stack implementation. An IPv6/IPv4
node can send and receive either IPv6 packets or IPv4 datagrams, depending on
the type of system with which it is communicating. The node will have both a
128-bit IPv6 address and a 32-bit IPv4 address, which do not necessarily need to
be related. Figure 9-25 shows a dual stack IPv6/IPv4 system communicating with
both IPv6 and IPv4 systems on the same link.
Figure 9-25 IPv6/IPv4 dual stack system
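The dual-stack behavior described above can be sketched with a single IPv6 socket. The following illustrative Python fragment is not from the text: on platforms that support it, clearing the IPV6_V6ONLY option lets one AF_INET6 socket also accept IPv4 connections, with IPv4 peers appearing as IPv4-mapped addresses; the port value 0 simply requests any free port.

```python
import socket

# One AF_INET6 socket that, with IPV6_V6ONLY cleared, also accepts IPv4
# connections; IPv4 peers appear as IPv4-mapped addresses such as
# ::ffff:192.0.2.1.
srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
srv.bind(("::", 0))   # the unspecified address covers both stacks
srv.listen(5)
```

Whether a single socket can serve both stacks this way is platform dependent; some systems require separate IPv4 and IPv6 sockets.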
The IPv6/IPv4 node can use stateless or stateful autoconfiguration to obtain its
IPv6 address. It can also use any method to obtain its IPv4 address, such as
DHCP, BOOTP, or manual configuration. However, if the node is to perform
automatic tunneling, the IPv6 address must be an IPv4-compatible address, with
the low order 32-bits of the address serving as the IPv4 address. (See 9.2.2,
“IPv6 addressing” on page 339.)
Conceptually, the dual stack model envisages a doubling-up of the protocols in
the internetwork layer only. However, related changes are obviously needed in
all transport-layer protocols in order to operate when using either stack.
Application changes are also needed if the application is to exploit IPv6
capabilities, such as the increased address space of IPv6.
When an IPv6/IPv4 node wants to communicate with another system, it needs to
know the capabilities of that system and which type of packet it should send. The
DNS plays a key role here. As described in Table 12-2 on page 438, a new
resource record type, AAAA, is defined for mapping host names to IPv6
addresses. The results of a name server lookup determine how a node will
attempt to communicate with that system. The records found in the DNS for a
node depend on which protocols it is running:
򐂰 IPv4-only nodes only have A records containing IPv4 addresses in the DNS.
򐂰 IPv6/IPv4 nodes that can interoperate with IPv4-only nodes have AAAA
records containing IPv4-compatible IPv6 addresses and A records containing
the equivalent IPv4 addresses.
򐂰 IPv6-only nodes that cannot interoperate with IPv4-only nodes have only
AAAA records containing IPv6 addresses.
Because IPv6/IPv4 nodes make decisions about which protocols to use based
on the information returned by the DNS, the incorporation of AAAA records in the
DNS is a prerequisite to interoperability between IPv6 and IPv4 systems. Note
that name servers do not necessarily need to use an IPv6-capable protocol
stack, but they must support the additional record type.
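The lookup-then-decide behavior can be sketched in Python using getaddrinfo, which surfaces AAAA results as AF_INET6 entries and A results as AF_INET entries (choose_destination is a hypothetical helper name, not a standard API):

```python
import socket

def choose_destination(hostname, port=80):
    """Prefer an AAAA (IPv6) result when the lookup returns one;
    otherwise fall back to an A (IPv4) result."""
    candidates = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # AF_INET6 entries come from AAAA records, AF_INET from A records.
    for preferred in (socket.AF_INET6, socket.AF_INET):
        for family, _type, _proto, _name, sockaddr in candidates:
            if family == preferred:
                return family, sockaddr
    raise OSError("no usable address records for " + hostname)
```

Real stacks apply richer selection policy (RFC 3484 default address selection), but the principle is the same: the record types present in the DNS drive the protocol choice.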
9.8.2 Tunneling
When IPv6 or IPv6/IPv4 systems are separated from other similar systems with
which they want to communicate by older IPv4 networks, IPv6 packets must be
tunneled through the IPv4 network.
IPv6 packets are tunnelled over IPv4 very simply: The IPv6 packet is
encapsulated in an IPv4 datagram, or in other words, a complete IPv4 header is
added to the IPv6 packet. The presence of the IPv6 packet within the IPv4
datagram is indicated by a protocol value of 41 in the IPv4 header.
There are two kinds of tunneling of IPv6 packets over IPv4 networks: automatic
and configured.
Automatic tunneling
Automatic tunneling relies on IPv4-compatible addresses. The decision of when
to tunnel is made by an IPv6/IPv4 host that has a packet to send across an
IPv4-routed network area, and it follows the following rules:
򐂰 If the destination is an IPv4 or an IPv4-mapped address, send the packet
using IPv4 because the recipient is not IPv6-capable. Otherwise, if the
destination is on the same subnet, send it using IPv6, because the recipient is
IPv6-capable.
򐂰 If the destination is not on the same subnet but there is at least one default
router on the subnet that is IPv6-capable, or there is a route configured to an
IPv6 router for that destination, send it to that router using IPv6. Otherwise, if
the address is an IPv4-compatible address, send the packet using automatic
IPv6-over-IPv4 tunneling. Otherwise, the destination is a node with an
IPv6-only address that is connected through an IPv4-routed area, which is not
also IPv6-routed. Therefore, the destination is unreachable.
Note: The IP address must be IPv4-compatible for tunneling to be used.
Automatic tunneling cannot be used to reach IPv6-only addresses,
because they cannot be addressed using IPv4. Packets from IPv6/IPv4
nodes to IPv4-mapped addresses are not tunnelled, because they refer to
IPv4-only nodes.
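A minimal sketch of these decision rules follows, with illustrative boolean flags standing in for the routing-table and neighbor-cache state a real stack would consult. The helper is_ipv4_compatible is our own; the standard ipaddress module has no such predicate.

```python
from ipaddress import IPv6Address

def is_ipv4_compatible(addr: IPv6Address) -> bool:
    # ::a.b.c.d -- the high-order 96 bits are zero (excluding :: and ::1)
    return addr.packed[:12] == bytes(12) and int(addr) > 1

def delivery_method(dst: IPv6Address, on_link=False, ipv6_router=False) -> str:
    if dst.ipv4_mapped is not None:
        return "send over IPv4"        # recipient is IPv4-only
    if on_link:
        return "send native IPv6"      # same subnet
    if ipv6_router:
        return "send IPv6 via router"  # prefer a router over a tunnel
    if is_ipv4_compatible(dst):
        return "automatic tunnel"      # encapsulate with protocol 41
    return "unreachable"               # IPv6-only across an IPv4-only area
```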
These rules emphasize the use of an IPv6 router in preference to a tunnel for
three reasons:
򐂰 It is more efficient, because there is no encapsulating IPv4 header.
򐂰 IPv6-only features are available.
򐂰 The IPv6 routing topology will be used when it is deployed in preference to
the pre-existing IPv4 topology.
A node does not need to know whether it is attached to an IPv6-routed or an
IPv4-routed area; it will always use an IPv6 router if one is configured on its
subnet and will use tunneling if one is not (in which case it can infer that it is
attached to an IPv4-routed area).
Automatic tunneling can be either host-to-host, or it can be router-to-host. A
source host will send an IPv6 packet to an IPv6 router if possible, but that router
might not be able to do the same, and will have to perform automatic tunneling to
the destination host itself. Because of the preference for the use of IPv6 routers
rather than tunneling, the tunnel will always be as “short” as possible. However,
the tunnel will always extend all of the way to the destination host. Because IPv6
uses the same hop-by-hop routing paradigm, a host cannot determine if the
packet will eventually emerge into an IPv6-complete area before it reaches the
destination host. In order to use a tunnel that does not extend all of the way to
the recipient, configured tunneling must be used.
The mechanism used for automatic tunneling is very simple:
1. The encapsulating IPv4 datagram uses the low-order 32 bits of the IPv6
source and destination addresses to create the equivalent IPv4 addresses
and sets the protocol number to 41 (IPv6).
2. The receiving node's network interface layer identifies the incoming packet
(or fragments, if the IPv4 datagram was fragmented) as belonging to IPv4 and
passes them upward to the IPv4 part of the dual IPv6/IPv4 internetwork layer.
3. The IPv4 layer then receives the datagram in the normal way, reassembling
fragments if necessary, notes the protocol number of 41, removes the IPv4
header, and passes the original IPv6 packet “sideways” to the IPv6 part of the
internetwork layer.
4. The IPv6 code then processes the original packet as normal. Because the
destination IPv6 address in the packet is the IPv6 address of the node (an
IPv4-compatible address matching the IPv4 address used in the
encapsulating IPv4 datagram), the packet is at its final destination. IPv6 then
processes any extension headers as normal and then passes the packet's
remaining payload to the next protocol listed in the last IPv6 header.
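Step 1 of the mechanism, deriving the outer IPv4 addresses from the low-order 32 bits and marking protocol 41, can be sketched as follows. This is a simplified illustration: identification, header checksum, and fragmentation are omitted.

```python
import struct
from ipaddress import IPv4Address, IPv6Address

IPPROTO_IPV6 = 41  # protocol value for IPv6 encapsulated in IPv4

def encapsulate(ipv6_packet: bytes, src6: IPv6Address, dst6: IPv6Address) -> bytes:
    """Wrap an IPv6 packet in a minimal IPv4 header for automatic tunneling."""
    # The outer addresses are the low-order 32 bits of the
    # IPv4-compatible IPv6 addresses.
    src4 = IPv4Address(int(src6) & 0xFFFFFFFF)
    dst4 = IPv4Address(int(dst6) & 0xFFFFFFFF)
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(ipv6_packet),  # version/IHL, TOS, total length
        0, 0,                            # identification, flags/fragment offset
        64, IPPROTO_IPV6, 0,             # TTL, protocol = 41, checksum (omitted)
        src4.packed, dst4.packed)
    return header + ipv6_packet
```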
Figure 9-26 on page 384 shows two IPv6/IPv4 nodes separated by an IPv4
network. Both workstations have IPv4-compatible IPv6 addresses. Workstation A
sends a packet to workstation B, as follows:
1. Workstation A has received router solicitation messages from an
IPv6-capable router (X) on its local link. It forwards the packet to this router.
2. Router X adds an IPv4 header to the packet, using the IPv4 source and
destination addresses derived from the IPv4-compatible addresses. The
packet is then forwarded across the IPv4 network, all the way to workstation
B. This is router-to-host automatic tunneling.
3. The IPv4 datagram is received by the IPv4 stack of workstation B. Because
the Protocol field shows that the next header is 41 (IPv6), the IPv4 header is
stripped from the datagram and the remaining IPv6 packet is then handled by
the IPv6 stack.
Figure 9-26 Router-to-host automatic tunneling
Figure 9-27 on page 385 shows the host-to-host tunneling scenario. Here
workstation B responds as follows:
1. Workstation B has no IPv6-capable router on its local link. It therefore adds
an IPv4 header to its own IPv6 frame and forwards the resulting IPv4
datagram directly to the IPv4 address of workstation A through the IPv4
network. This is host-to-host automatic tunneling.
2. The IPv4 datagram is received by the IPv4 stack of workstation A. Because
the Protocol field shows that the next header is 41 (IPv6), the IPv4 header is
stripped from the datagram and the remaining IPv6 packet is then handled by
the IPv6 stack.
Figure 9-27 Host-to-host automatic tunneling
Configured tunneling
Configured tunneling is used for host-router or router-router tunneling of
IPv6-over-IPv4. The sending host or the forwarding router is configured so that
the route, as well as having a next hop, also has a tunnel end address (which is
always an IPv4-compatible address). The process of encapsulation is the same
as for automatic tunneling, except that the IPv4 destination address is not
derived from the low-order 32 bits of the IPv6 destination address, but from the
low-order 32 bits of the tunnel end. The IPv6 destination and source addresses
do not need to be IPv4-compatible addresses in this case.
When the router at the end of the tunnel receives the IPv4 datagram, it
processes it in exactly the same way as a node at the end of an automatic tunnel.
When the original IPv6 packet is passed to the IPv6 layer in the router, it
recognizes that it is not the destination, and the router forwards the packet on to
the final destination as it would for any other IPv6 packet.
It is, of course, possible that after emerging from the tunnel, the IPv6 packet is
tunnelled again by another router.
Figure 9-28 on page 387 shows two IPv6-only nodes separated by an IPv4
network. A router-to-router tunnel is configured between the two IPv6/IPv4
routers X and Y.
1. Workstation A constructs an IPv6 packet to send to workstation B. It forwards
the packet to the IPv6 router advertising on its local link (X).
2. Router X receives the packet, but has no direct IPv6 connection to the
destination subnet. However, a tunnel has been configured for this subnet.
The router therefore adds an IPv4 header to the packet, with a destination
address of the tunnel end (router Y), and forwards the datagram over the IPv4
network.
3. The IPv4 stack of router Y receives the frame. Seeing the Protocol field value
of 41, it removes the IPv4 header, and passes the remaining IPv6 packet to
its IPv6 stack. The IPv6 stack reads the destination IPv6 address, and
forwards the packet.
4. Workstation B receives the IPv6 packet.
Figure 9-28 Router-to-router configured tunnel
Header translation
Installing IPv6/IPv4 nodes allows for backward compatibility with existing IPv4
systems. However, when migration of networks to IPv6 reaches an advanced
stage, it is likely that new systems being installed will be IPv6 only. Therefore,
there will be a requirement for IPv6-only systems to communicate with the
remaining IPv4-only systems. Header translation is required for IPv6-only nodes
to interoperate with IPv4-only nodes. Header translation is performed by
IPv6/IPv4 routers on the boundaries between IPv6 routed areas and IPv4 routed
areas.
The translating router strips the header completely from IPv6 packets and
replaces it with an equivalent IPv4 header (or the reverse). In addition to
correctly mapping between the fields in the two headers, the router must convert
source and destination addresses from IPv4-mapped addresses to real IPv4
addresses (by taking the low-order 32 bits of the IP address). In the reverse
direction, the router adds the ::FFFF:0:0/96 prefix to the IPv4 address to form the
IPv4-mapped address. If either the source or the destination IPv6 address is
IPv6-only, the header cannot be translated.
Note that for a site with even just one IPv4 host, every IPv6 node with which it
needs to communicate must have an IPv4-mapped address.
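The address manipulations a translating router performs in each direction can be sketched with the standard ipaddress module (to_mapped and to_ipv4 are hypothetical helper names):

```python
from ipaddress import IPv4Address, IPv6Address

MAPPED_PREFIX = int(IPv6Address("::ffff:0:0"))  # the ::FFFF:0:0/96 prefix

def to_mapped(v4: IPv4Address) -> IPv6Address:
    """IPv4 address -> IPv4-mapped IPv6 address (IPv4-to-IPv6 direction)."""
    return IPv6Address(MAPPED_PREFIX | int(v4))

def to_ipv4(v6: IPv6Address) -> IPv4Address:
    """IPv4-mapped IPv6 address -> real IPv4 address (low-order 32 bits)."""
    if v6.ipv4_mapped is None:
        raise ValueError("IPv6-only address: header cannot be translated")
    return v6.ipv4_mapped
```

The ValueError branch corresponds to the untranslatable case noted above: an IPv6-only source or destination address has no IPv4 equivalent.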
9.8.3 Interoperability summary
Whether two nodes can interoperate depends on their capabilities and their
addresses.
An IPv4 node can communicate with:
򐂰 Any IPv4 node on the local link
򐂰 Any IPv4 node through an IPv4 router
򐂰 Any IPv6 node with IPv4-mapped address through a header translator
An IPv6 node (IPv6-only address) can communicate with:
򐂰 Any IPv6 node on the local link
򐂰 Any IPv6 node through an IPv6 router on the local link (might require
tunneling through the IPv4 network from the router)
An IPv6 node (IPv4-mapped address) can communicate with:
򐂰 Any IPv6 node on the local link
򐂰 Any IPv6 node through an IPv6 router on the local link (might require
tunneling through the IPv4 network from the router)
򐂰 Any IPv4 node through a header translator
An IPv6/IPv4 node (IPv4-compatible address) can communicate with:
򐂰 Any IPv4 node on the local link
򐂰 Any IPv4 node through an IPv4 router on the local link
򐂰 Any IPv6 node on the local link
򐂰 Any IPv6 node through an IPv6 router on the local link (might require
tunneling through the IPv4 network from the router)
򐂰 Any IPv6/IPv4 node (IPv4-compatible address) through a host-to-host tunnel
9.9 RFCs relevant to this chapter
The following RFCs contain detailed information about IPv6:
򐂰 RFC 3041 – Privacy Extensions for Stateless Address Autoconfiguration in
IPv6 (January 2001)
򐂰 RFC 3056 – Connection of IPv6 Domains via IPv4 Clouds (February 2001)
򐂰 RFC 3307 – Allocation Guidelines for IPv6 Multicast Addresses
(August 2002)
򐂰 RFC 3315 – Dynamic Host Configuration Protocol for IPv6 (DHCPv6)
(July 2003)
򐂰 RFC 3484 – Default Address Selection for Internet Protocol version 6 (IPv6)
(February 2003)
򐂰 RFC 3596 – DNS Extensions to Support IP Version 6 (October 2003)
(Obsoletes RFC3152, RFC1886)
򐂰 RFC 3633 – IPv6 Prefix Options for Dynamic Host Configuration Protocol
(DHCP) version 6 (December 2003)
򐂰 RFC 3646 – DNS Configuration options for Dynamic Host Configuration
Protocol for IPv6 (DHCPv6) (December 2003)
򐂰 RFC 3697 – IPv6 Flow Label Specification (March 2004)
򐂰 RFC 3736 – Stateless Dynamic Host Configuration Protocol (DHCP) Service
for IPv6 (April 2004)
򐂰 RFC 3775 – Mobility Support in IPv6 (June 2004)
򐂰 RFC 3776 – Using IPSec to Protect Mobile IPv6 Signaling Between Mobile
Nodes and Home Agents (June 2004)
򐂰 RFC 3956 – Embedding the Rendezvous Point (RP) Address in an IPv6
Multicast Address (November 2004)
򐂰 RFC 4007 – IPv6 Scoped Address Architecture (March 2005)
򐂰 RFC 4038 – Application Aspects of IPv6 Transition (March 2005)
򐂰 RFC 4057 – IPv6 Enterprise Network Scenarios (June 2005)
򐂰 RFC 4241 – A Model of IPv6/IPv4 Dual Stack Internet Access Service
(December 2005)
򐂰 RFC 4443 – Internet Control Message Protocol (ICMPv6) for the Internet
Protocol Version 6 (IPv6) Specification (March 2006)
򐂰 RFC 4302 – IP Authentication Header (December 2005)
򐂰 RFC 4303 – IP Encapsulating Security Payload (ESP) (for v6 and v4)
(December 2005)
򐂰 RFC 2675 – IPv6 Jumbograms (August 1999)
򐂰 RFC 2460 – Internet Protocol, Version 6 (IPv6) (December 1998)
򐂰 RFC 4291 – IP Version 6 Addressing Architecture (February 2006)
򐂰 RFC 3587 – IPv6 Global Unicast Address Format (August 2003)
򐂰 RFC 2461 – Neighbor Discovery for IP Version 6 (IPv6) (December 1998)
򐂰 RFC 2462 – IPv6 Stateless Address Autoconfiguration (December 1998)
򐂰 RFC 2893 – Transition Mechanisms for IPv6 Hosts and Routers
(August 2000)
For more information about any of these topics, see:
򐂰 IANA Assignment Documentation: INTERNET PROTOCOL VERSION 6
򐂰 Global IPv6 Summit 2006
򐂰 6NET
򐂰 IPv6 Working Group
Chapter 10.
Wireless IP
In an increasingly mobile society, the need for wireless connectivity is growing
steadily. As a result, technology is rapidly advancing to provide
wireless support for business and personal use. This chapter discusses some of
the fundamental concepts behind wireless IP and the technology that supports it.
© Copyright IBM Corp. 1989-2006. All rights reserved.
10.1 Wireless concepts
Given the diverse nature of wireless implementations, there are a number of
terms and concepts relating to wireless networking. This section reviews some
of the more common of these.
Radio propagation
Radio propagation refers to the behavior exhibited by radio waves as they are
transmitted to and from points around the earth, and includes aspects such as
aurora, backscatter, and tropospheric scatter.
The decibel (dB)
Signal strength of radio waves is measured in decibels (dB), specifically by
quantifying the amount of signal lost between two points in a wireless network.
This measurement is calculated as the difference between a signal's strength at
an originating point and at a destination point. Changes in signal strength are
measured in terms of positive or negative dB gain.
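The underlying arithmetic is a base-10 logarithm of a power ratio; doubling the power, for example, corresponds to a gain of roughly 3 dB. A small sketch (db_gain is an illustrative helper name):

```python
import math

def db_gain(p_out_watts: float, p_in_watts: float) -> float:
    """Relative power level in decibels; negative values indicate loss."""
    return 10 * math.log10(p_out_watts / p_in_watts)
```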
Path loss
Path loss refers to a signal's loss in electromagnetic radiation as it propagates
from one point to another. Though this reduction can be directly affected by
factors such as terrain and the environment, the loss generally increases with
the distance travelled by the signal and decreases as the wavelength of the
signal increases (in free space, it is proportional to the square of the distance
and inversely proportional to the square of the wavelength).
Effective isotropic radiated power
Effective isotropic radiated power (EIRP) is used to quantify the signal strength
produced by an antenna. It accounts for both the gain of the antenna and the
power that feeds the antenna.
For example, if an antenna has -13 dB gain and is fed by 100 dB, its EIRP is 87
dB, as illustrated in Figure 10-1.
EIRP = 100 dB - 13 dB = 87 dB
Figure 10-1 EIRP example
Fixed versus mobile wireless
There are two types of wireless devices: fixed and mobile. Fixed devices are
stationary and draw their power from a utility main. An example of such a device
is a wireless router plugged into a wall outlet. Conversely, mobile devices are
those that have the capability of movement. Naturally, these are powered from
batteries. An example of this is a mobile computer.
Effects of multipath
Similar to a wired IP network, it is possible for radio signals to traverse different
paths between a source and destination. This can occur when one signal
encounters an obstruction. This can introduce delays into the traversal of signals
and is called multipath distortion.
System operating margin
The system operating margin defines the range in which error free reception is
achieved. This is calculated in dB as the difference between the received signal
level and the receiver’s sensitivity. For example, if the received signal level is
-10 dB and the sensitivity of the receiver is -15 dB, the system operating
margin is 5 dB.
Free space loss
Free space loss is similar to path loss, except that path loss is experienced
between any two radio points and thus incorporates signal loss through various
types of media. Conversely, free space loss is specific to the lessening of a
signal as it traverses free space.
Decibel over isotropic (dBi)
Before decibel isotropic (dBi) units can be understood, the concept of an
isotropic antenna must first be explained. An isotropic antenna is theoretical and
produces uniform signal strength in every direction, called isotropic radiation,
so its radiation pattern forms a perfect sphere.
This sphere can then be used as a point of reference when measuring an actual
antenna’s strength. This measurement is made in units of dBi, and compares the
antenna’s strength relative to the isotropic radiation that would be created by an
isotropic antenna of the same strength. This is illustrated in Figure 10-2 on
page 394.
Chapter 10. Wireless IP
Figure 10-2 Decibel over isotropic (isotropic radiation compared with an actual antenna’s radiation pattern)
Fresnel zone clearance
When obstructions exist within the path of a signal, diffraction of the signal
creates a series of concentric elliptical zones, each zone varying in signal
strength. Each of these zones represents a different Fresnel zone within the
signal. Fresnel zones are numbered outward from the center, and referred to as
the nth zone. This is illustrated in Figure 10-3. Note that the first zone has no
obstructions, providing the strongest signal to the house. The second zone was
created by tree obstructions and carries a signal weaker than the first zone, but
stronger than the third. The third zone, with the weakest signal, was the result of
an obstructing building.
Figure 10-3 An example of Fresnel zones
Line of sight (LOS) and non-line of sight (NLOS) service
Line of sight (LOS) and non-line of sight (NLOS) are used to define a link by its
position relative to a signal’s transmitter. An LOS link is one that must have an
unobstructed path between it and the signal’s source, literally meaning that the
link has a line of sight to the source. This usually indicates that the link is within
the first Fresnel zone. If a link that requires LOS service moves into the second or
third zone (for example, where the person in Figure 10-3 on page 394 is
standing), it would no longer have LOS, and might not operate. However, a link
that can use NLOS would still operate correctly.
Wireless access point
Wireless access points typically relay data between wireless devices and a wired
network. However, multiple access points can be chained together, creating a
larger network to allow roaming of mobile devices.
Wireless router
A wireless router acts as a wireless access point combined with an Ethernet hub,
forwarding packets between a wireless subnet and any other subnet.
Wireless Ethernet bridge
Wireless Ethernet bridges connect two separate wireless networks without
requiring the services of a router.
10.2 Why wireless?
Though the immediate benefit of implementing a wireless network (mobility)
might seem obvious, there are other benefits that might not be as readily evident.
10.2.1 Deployment and cost effectiveness
When creating a traditional, wired network, much of the construction centers
around laying cable. Though this is not as difficult a task when the network is
built in parallel with a structure, installing wired networks into existing structures
can be quite difficult because the wires must often be installed behind or above
solid walls or ceilings. This can incur substantial costs, both in purchasing the
wire as well as in paying for the construction to install the wire. When installed,
there is also the cost of maintaining the wires, which can degrade over time.
Conversely, creating a wireless network requires minimum construction, if any at
all. When building a large-scale network, there might be some initial cost and
construction to build antennas, access points, and so on. However, once built,
the maintenance required by such structures is minimal. Additionally, there is no
cost for laying cable, which is significant on a large-scale network.
For small-scale networks (such as office buildings), the cost is relatively minimal.
Only access points (such as wireless routers) need to be purchased; these can
create their own network or be hooked into an existing network. There is no
construction cost and no cost for purchasing or installing wiring.
Additionally, such a network can be set up and configured in as little as a day,
depending on the complexity of the organization’s needs.
10.2.2 Reachability
Wired networks do not lend themselves to certain geographies. For example,
imagine laying cable to provide connectivity between research stations in the
Amazon, or to interconnect remote communities in sparsely populated regions of
Wyoming. Not only would the wiring be costly, but the terrain through which the
cable must be laid might be prohibitive. For example, wet or hot climates (such
as the Amazon) might cause cabling to deteriorate too quickly, and burying cable
in rocky terrain might not be cost effective. Additionally, when the distance between
connected points is too great, the signal might degrade before the distance is
spanned. This, of course, can be resolved using repeaters, but this adds
additional costs.
Implementation of a wireless network can overcome these challenges simply
because it nullifies the need for wiring. Distances between nodes can be
spanned easily and the nuances of a terrain can be overcome. Additionally, if a
wired network is desired, wireless can be used to interconnect remote wired
networks.
10.2.3 Scalability
A common challenge faced by growing businesses is outgrowing their network.
When first constructing a network, a young business might not have an accurate
forecast of the network size needed to accommodate the organization. Then, as
the business needs grow, the network is no longer capable of supporting its
needs. As described previously, adding additional wiring might be cost
prohibitive and might compromise the success of the business.
In such a scenario, wireless networks can offer two solutions. First, wireless
capability can be added to an existing wired network. This allows the network to
grow as needed, and additions can continue to be made if the needs continue to
grow. Second, if the business initially builds a wireless network, the problematic
scenario will never occur because the organization can continue to add wireless
capability to address growing needs.
10.2.4 Security
One concern over any network is the question of security. As data becomes
more sensitive, and more readily available online, the need to protect this data
increases rapidly. A common misconception is that hackers or malicious users
are aided by the growing use of wireless because it allows them to steal data
while needing only proximity to a network.
However, with such a concern in mind, the wireless architectures and
technologies were designed specifically with security in mind. As such, wireless
networks are often more secure, through the use of advanced authentication and
encryption methods, than their wired counterparts.
10.2.5 Connectivity and reliability
Depending on the design and configuration of a wireless network, it is possible
that such a network might be prone to the same connectivity outages as a wired
network. However, this is a limitation of the design of a particular network and not
of the wireless architecture itself. For example, wireless networking lends itself to
the concept of mesh networking, described in 10.5.3, “Mesh networking” on
page 402. Through such an implementation, as nodes become available or are
removed from a network, the overall wireless network can “heal” itself, and still
provide connectivity to all of the other nodes.
10.3 WiFi
The term WiFi is short for Wireless Fidelity and is meant to be used generically
when referring to any type of 802.11 network, whether 802.11b, 802.11a,
dual-band, and so on. The term originated from the Wi-Fi Alliance.
The 802.11 standard refers to a family of specifications developed by the IEEE
for wireless LAN technology. The 802.11 standard specifies an over-the-air
interface between a wireless client and a base station or between two wireless
clients. The IEEE accepted the specification in 1997.
802.11 family of standards
There are several specifications in the 802.11 family of standards:
802.11	Applies to wireless LANs and provides 1 or 2 Mbps transmission in
the 2.4 GHz band using either frequency hopping spread spectrum
(FHSS) or direct sequence spread spectrum (DSSS).
802.11a	An extension to 802.11 that applies to wireless LANs and provides
up to 54 Mbps in the 5 GHz band. 802.11a uses an orthogonal
frequency division multiplexing (OFDM) encoding scheme rather
than FHSS or DSSS.
802.11b	Also known as 802.11 High Rate or WiFi. An extension to 802.11 that
applies to wireless LANs and provides 11 Mbps transmission with
fallbacks to 5.5, 2, and 1 Mbps in the 2.4 GHz band. 802.11b uses
only DSSS. 802.11b was a 1999 ratification to the original 802.11
standard, allowing wireless functionality comparable to Ethernet.
802.11g	Applies to wireless LANs and provides 20+ Mbps in the 2.4 GHz band.
For additional information about the 802.11 family of standards, see:
WiFi operates as a non-switched Ethernet network. Every 100 ms, wireless
access points (WAPs) broadcast service set identifiers (SSIDs) using
beacon packets. Clients who receive these beacons can opt to wirelessly
connect to the WAP. This determination is usually established by some
combination of the following factors:
򐂰 Whether or not the client has been configured to connect to the broadcast SSID.
򐂰 The signal strength of the WAP. In particular, a client might receive two
beacons from two different WAPs, each one broadcasting the same SSID. In
this instance, the client should opt to connect to the WAP demonstrating the
stronger signal.
򐂰 The level of encryption offered by a WAP.
Each beacon is broadcast at 1 Mbps, ensuring that any client receiving the
beacon supports, at a minimum, communication at this speed. The entire area in
which a WAP’s beacon can be received is referred to as a hotspot. Though WiFi
hotspots can span several miles, such an implementation requires multiple
WAPs to overlap their individual hotspots using the same SSID.
WiFi can also be used in peer-to-peer mode, allowing mobile devices to
communicate with one another in the absence of a wireless network. Although
this method of operation does not provide any sort of connectivity to the Internet,
it does lend itself to other applications such as backing up data or gaming.
The airborne nature of WiFi inherently makes it susceptible to security risks. No
longer hindered by the need to gain access to a wire, malicious users attempting
to capture data transfers must only gain proximity to the intended victim. As
such, several encryption protocols have been coupled with WiFi in order to
secure the data transferred using WiFi.
Wired Equivalent Privacy (WEP)
Initially, WEP was used to secure WiFi communications. It uses the RC4, or
ARCFOUR, stream cipher to provide confidentiality. Additionally, WEP employs
a 32-bit cyclic redundancy check (CRC-32) to ensure data integrity. However,
WEP uses a shared encryption key to which all users must have access in order
to authenticate with the WAP. This compromises the security of the network
because current hacking technology can decode the key using freely distributed
programs. Additionally, because WEP employs a stream cipher, it is susceptible
to stream cipher attacks. Due to these and other shortcomings, WEP has been
superseded by WiFi Protected Access (WPA and WPA2).
WiFi Protected Access (WPA)
Created by the Wi-Fi Alliance, WPA also employs a pass phrase concept similar
to that of the WEP implementation. However, WPA uses distributed private keys
administered by an 802.1X authentication server.
Note: A public-shared key (PSK) mode can be used, but it is less secure.
Data encryption is again provided through the RC4 stream cipher, which uses a
128-bit key and a 48-bit initialization vector. Security is increased by inserting
dynamic key changes using the Temporal Key Integrity Protocol (TKIP). Data
integrity is guaranteed using the Message Integrity Code (MIC) algorithm, also
called Michael’s algorithm.
While this increased security implementation compensates for the faults found
previously with WEP, cryptanalysts have still found weaknesses in the WPA
architecture. Specifically, Michael’s algorithm was chosen because it still allowed
mobile devices using WPA to communicate with access points still using WEP,
and vice versa. However, the algorithm is still susceptible to packet forgery
attacks. To combat this, WPA was enhanced and expanded into WPA2.
WiFi Protected Access (WPA2)
In WPA2, Michael’s algorithm is replaced by the Counter Mode with Cipher Block
Chaining Message Authentication Code Protocol (CCMP). Because CCMP provides
both data integrity and key management using the Advanced Encryption
Standard (AES, also known as Rijndael), it combines both the data integrity and
confidentiality functions of WPA into one protocol. CCMP is considered fully
secure.
10.4 WiMax
Also known as WirelessMAN, the Worldwide Interoperability for Microwave
Access (WiMAX) is a digital communications system defined by the IEEE
standard 802.16 (most recently approved in 2004). Much like the Wi-Fi Alliance,
WiMAX is monitored by the WiMAX Forum, which strives to ensure
product compliance with the 802.16 standard and device interoperability.
Similar to the client/server model (see 11.1.1, “The client/server model” on
page 408), WiMAX uses the notion of subscriber stations and base stations.
Base stations provide the wireless access and perform the same functions as
WAPs. Subscriber stations are the clients using the wireless access provided by
the base station.
802.16 family of standards
There are several specifications in the 802.16 family of standards:
802.16	This applies to enabling last mile wireless broadband access and can
be used as an alternative to DSL and cable. This specification is also
known as WirelessMAN.
802.16a	This specification addresses issues of radio spectrum use. It
specifies added support for the 2 to 11 GHz range that provides
support for low latency applications such as video and voice. It
enables the provision of broadband connectivity without the
requirement of direct line of sight (LOS) between the subscriber
terminals and the base station (BTS).
802.16b	This extends 802.16 by increasing the spectrum to 5 and 6 GHz. This
provides quality of service (QoS) for voice and video services.
802.16c	This extends 802.16 by representing the 10 to 66 GHz range. This
extension also addresses issues such as interoperability,
performance evaluation, testing, and system profiling.
802.16e	Also known as Mobile WiMAX. This extends and improves the
modulation schemes described in the original/fixed WiMAX standard.
This allows for fixed wireless and mobile NLOS applications by
improving upon Orthogonal Frequency Division Multiple Access
(OFDMA). This should not be confused with 802.20.
For additional information about the 802.16 family of standards, see:
Security over WiMax
Similar to WiFi, WiMAX uses WPA2, CCMP, and AES. Additionally, WiMAX
provides end-to-end authentication through Privacy Key Management -
Extensible Authentication Protocol (PKM-EAP). This relies on Transport Layer
Security (TLS) to provide authentication and confidentiality.
Advantages of WiMAX over WiFi
Like WiFi, WiMAX provides wireless access to mobile devices. However, WiMAX
has advantages over WiFi in specific applications. WiFi access points are usually
unable to guarantee any quality of service (QoS, see Chapter 8, “Quality of
service” on page 287), and as such, QoS-dependent applications, such as VoIP
(see 20.1, “Voice over IP (VoIP) introduction” on page 724) and IP Television
(IPTV, see Chapter 21, “Internet Protocol Television” on page 745), are not
suitable for such a network infrastructure. This is because WiFi clients using the
same WAP must compete with each other for both bandwidth and attention from
the WAP.
Conversely, WiMAX uses a scheduling algorithm that does guarantee QoS.
Unlike the WiFi model, WiMAX clients must compete only for the initial entry into
the network. After a client is granted entry, that client is guaranteed a time slot
with the access point. Though the time slot might be expanded by the client
based on need and availability, this initial guarantee lends itself to client
applications that require a minimum QoS.
Other advantages WiMAX holds over WiFi include increased bandwidth (up to 70
Mbps), stronger encryption, and the ability to connect nodes that lack a
line-of-sight association. Additionally, as noted earlier, creating large WiFi
hotspots requires the construction of multiple WAPs with overlapping smaller
hotspots. WiMAX, however, is capable of providing a service range of up to 30
miles (50 km). This makes WiMAX very suitable for rural areas, or remote areas
in which installing the wiring to support any wired networks is cost-prohibitive.
Another application of WiMAX is to connect remote networks. Scenarios can
exist when wired LANs or WiFi hotspots are preferred for a particular area.
However, that area might be remote to other areas, and it is not cost-effective to
connect the areas by WiFi or wires. Instead, these sites can be connected using
WiMAX, thus bridging the distance between sites while still using the preferred
network locally.
10.5 Applications of wireless networking
Given the benefits of wireless networking, there are several scenarios and
problems to which wireless can be applied.
10.5.1 Last mile connectivity in broadband services
Last mile connectivity, sometimes called last kilometer connectivity, is a term
commonly used by broadband providers (such as DSL or cable) to describe the
final portion of the physical network used to provide network services. For
example, this might be the wiring used to connect an individual home to a main
cable. Installing the last mile often requires significant labor, high costs, and a lot
of time. This is meaningful with respect to wireless because wireless presents a
potential resolution to the last mile problem. The primary installation of the
physical network can be attached to wireless radios, allowing subscribers to
access network services without the installation of wiring.
10.5.2 Hotspots
A hotspot is any public location in which a wireless signal is present. These are
often made available by businesses, such as coffee shops or restaurants, to
provide Internet access to patrons. Note that some hotspots can be very large,
such as those that span a university campus or an entire shopping mall.
However, these are typically implementations of multiple overlapping hotspots
that all broadcast the same SSID.
Hotspots can provide unlimited access to the Internet, or they can be restricted
by the provider. Additionally, some commercial hotspots charge a fee before
access to the Internet is granted. Many commercial hotspots include:
򐂰 A portal to which users are directed, allowing them to authenticate
themselves or to pay a fee for Internet access.
򐂰 Some type of payment option, either directly to the establishment that
maintains the hotspot, or through an Internet payment service.
򐂰 Free access to the Internet, or limited access to prevent patrons from
participating in illegal or questionable activities through the provider’s hotspot.
10.5.3 Mesh networking
Mesh networking is a method of designing a network such that clients can act as
repeaters, and repeaters can sometimes act as clients. In theory, this allows
each node within a mesh network to be connected to every other node. Blocked
routes can easily be bypassed, because a datagram can hop from node to node
until a new path is achieved. This essentially allows a meshed network to be
self-healing.
This topology lends itself well to a wireless environment, because many nodes
will be mobile, and therefore they will enter and leave the network continually.
10.6 IEEE standards relevant to this chapter
The following IEEE standards provide detailed information about the architecture
and concepts presented in this chapter:
򐂰 802.11 – Working Group for Wireless Local Area Networks. Reference:
– 802.11a – Wireless LANs
– 802.11b – Wireless Fidelity
– 802.11g – 20+ Mbps Wireless connectivity
򐂰 802.16 – Working Group for Wireless Metropolitan Area Networks.
– 802.16a – Radio Spectrum Use
– 802.16b – Five to six GHz Spectrum Use, Quality of Service
– 802.16c – Ten to sixty-six GHz Spectrum Use
– 802.16e – Mobile WiMax
Part 2
TCP/IP application protocols
Included in the TCP/IP suite of protocols is an extensive list of applications
designed to make use of the suite's services. It is through these entities that
resources can be made available, data can be moved between hosts, and
remote users can communicate. Examples of applications architected within the
TCP/IP suite include the File Transfer Protocol (FTP) and the Simple Mail
Transport Protocol (SMTP). Other applications have been architected to manage
networks and provide seamless access to resources. These include applications
such as the Domain Name System (DNS) and the Simple Network Management
Protocol (SNMP).
However, applications that make use of TCP/IP services are not limited to RFC
architected protocols defined in parallel to TCP/IP. Other proprietary and
open-source applications exist, defined either by industry standards or by
open-organization specifications. Some of these applications, such as sendmail
and the Common Internet File System (CIFS), mimic the services offered by RFC
architected protocols. Others, however, fulfill specific needs not specifically
addressed by RFCs. An example of the latter is the Wireless Application
© Copyright IBM Corp. 1989-2006. All rights reserved.
Protocol, which is defined by the Open Mobile Alliance (OMA) and is defined in
specifications created by that organization. These OMA specifications are
available at:
Chapter 11.
Application structure and
programming interfaces
Application protocols consist of the highest level of protocols in the OSI model.
These protocols act as user interfaces to the TCP/IP protocol suite. In this
chapter, we discuss the following topics:
򐂰 Characteristics of applications
– The client/server model
򐂰 Application programming interfaces (APIs)
– The socket API
– Remote Procedure Call (RPC)
– The SNMP distributed programming interface (SNMP DPI)
– REXX sockets
򐂰 Related RFCs
11.1 Characteristics of applications
Each of the application protocols shares some common characteristics:
򐂰 They can be user-written applications or applications standardized and
shipped with the TCP/IP product. Examples of applications native to the
TCP/IP protocol suite include:
– Telnet, which provides interactive terminal access to remote hosts
– The File Transfer Protocol (FTP), which provides the ability to transfer files
between remote hosts
– The Simple Mail Transfer Protocol (SMTP), which provides an Internet
mailing system
While these are widely implemented application protocols, many others exist.
򐂰 They use either UDP or TCP as a transport mechanism. Remember that UDP
(see 4.2, “User Datagram Protocol (UDP)” on page 146) is unreliable and
offers no flow control. In this case, the application must provide its own error
recovery and flow control routines. For this reason, it is often easier to build
applications that use TCP (see 4.3, “Transmission Control Protocol (TCP)” on
page 149), a reliable, connection-oriented protocol.
򐂰 Most applications implement the client/server model of interaction.
11.1.1 The client/server model
TCP is a peer-to-peer, connection-oriented protocol. There are no
master/subordinate relationships, in which one instance of the application
protocol controls or is controlled by another instance. Instead, the applications
use a client/server model for communications. In such a model, the server offers
a service to users. The client is the interface by which the user accesses the
offered service. Both a client instance and a server instance must be active for
the application protocol to operate. Note that both instances can reside on
the same host or on different hosts (see Figure 11-1 on page 409).
Figure 11-1 The client/server model of applications
In the previous figure, client A and client B represent client instances on remote
hosts. Client C represents a client instance on the same system as the server
instance. Through the client, a user can generate a request for the service
provided by the server. The request is then delivered to the server using TCP/IP
as the transport vehicle.
Upon receiving the request, the server performs the desired service, and then
sends a reply back to the client. A server typically can accept and process
multiple requests (multiple clients) at the same time.
Common servers, such as Telnet, FTP, and SMTP, listen for requests on
well-known ports (see 4.1.1, “Ports” on page 144). This allows a client to connect
to the server without having to determine on what port the server is listening.
Clients that need to connect to a nonstandard server application, or to a standard
server application that has been configured to listen on a port other than the
well-known port, must implement another mechanism to determine on which port
a server is listening. This mechanism might employ a registration service, such
as portmap or Remote Procedure Call Bind (RPCBIND), to identify the port to
which a request should be sent. Both portmap and RPCBIND are defined by
RFC 1833.
Chapter 11. Application structure and programming interfaces
11.2 Application programming interfaces (APIs)
An application programming interface (API) enables developers to write
applications that can make use of TCP/IP services. The following sections
provide an overview of the most common APIs for TCP/IP applications.
11.2.1 The socket API
The socket interface is one of several APIs to the communication protocols.
Designed to be a generic communication programming interface, it was first
introduced by the 4.2BSD UNIX-based system. Although the socket API for IPv4
was never standardized, it has become a de facto industry standard, and RFC
3493 was created to update the API for IPv6. More advanced IPv6 socket
programming can be found in RFC 3542.
The socket interface is differentiated by the following services provided to
applications:
򐂰 Stream sockets services
Stream sockets provide a reliable connection-oriented service such as TCP.
Data is sent without errors or duplication, and is received in the same order as
it is sent. Flow control is built in to avoid data overruns. No boundaries are
imposed on the exchanged data, which is considered a stream of bytes. An
example of an application that uses stream sockets is the File Transfer
Protocol (FTP).
򐂰 Datagram sockets services
Datagram sockets define a connectionless service such as UDP. Datagrams
are sent as independent packets. The service does not guarantee successful
delivery of the packets; data can be lost or duplicated, and datagrams can
arrive out of order. No disassembly and reassembly of packets is performed.
An example of an application that uses datagram sockets is the Network File
System (NFS).
򐂰 Raw sockets services
Raw sockets allow direct access to lower layer protocols, such as IP and
ICMP. This interface is often used for testing new protocol implementations.
An example of an application that uses raw sockets is the ping command.
Additional information about sockets is in 4.1.2, “Sockets” on page 145. Socket
APIs provide functions that enable applications to perform the following actions:
򐂰 Initialize a socket
򐂰 Bind (register) a socket to a port address
򐂰 Listen on a socket for inbound connections
򐂰 Accept an inbound connection
򐂰 Connect outbound to a server
򐂰 Send and receive data on a socket
򐂰 Close a socket
Though the specific details of the previous functions will vary from platform to
platform, the industry standard is based on Berkeley sockets, also known as the
BSD socket API, released in 1989. Additionally, RFC 3493 was created to define
the extensions needed for socket APIs to incorporate IPv6. The core functions
made available by industry standard APIs are as follows:
򐂰 Initialize a socket
socket(domain, type, protocol)
Definitions of fields:
domain	This is the protocol family of the socket to be created. Valid
values include PF_INET (IPv4) and PF_INET6 (IPv6).
Additional platform-specific values can also be used.
type	This is the type of socket to be opened. Valid values typically
include stream, datagram, and raw.
protocol	This is the protocol that will be used on the socket. Values
typically include UDP, TCP, and ICMP.
򐂰 Bind a socket to a port address
bind(sockfd, localaddress, addresslength)
Definition of fields:
sockfd	This is the socket that is to be bound to the port address.
This is the value obtained previously from the socket
function.
localaddress	This is the socket address structure to which the socket is
to be bound.
addresslength	This is the length of the socket address structure.
򐂰 Listen on a socket for inbound connections
listen(sockfd, queuesize)
Definition of fields:
sockfd	This is the socket on which the application is to listen. This is
the value obtained previously from the socket function.
queuesize	This is the number of inbound requests that can be queued
by the system at any single time.
Note: The listen() function is typically invoked by server applications.
The function is called to await inbound connections from clients.
򐂰 Accept an inbound connection
accept(sockfd, remoteaddress, addresslength)
Definition of fields:
sockfd	This is the socket on which the connection is to be accepted.
This is the value obtained previously from the socket
function.
remoteaddress	This is the remote socket address structure from which the
connection was initiated.
addresslength	This is the length of the socket address structure.
Note: The accept() function is typically invoked by server applications to
accept connections from clients. The remote address is a place holder in
which the remote address structure will be stored.
򐂰 Connect outbound to a server
connect(sockfd, remoteaddress, addresslength)
Definition of fields:
sockfd	This is the socket from which the connection is to be opened.
This is the value obtained previously from the socket
function.
remoteaddress	This is the remote socket address structure to which the
connection is to be opened.
addresslength	This is the length of the socket address structure.
Note: The connect function is typically invoked by client applications.
򐂰 Send and receive data on a socket
sendmsg(sockfd, data, datalength, flags)
recvmsg(sockfd, data, datalength, flags)
Definition of fields:
sockfd	This is the socket across which the data will be sent or read.
data	This is the data to be sent, or the buffer into which the read
data will be placed.
datalength	When writing data, this is the length of the data to be written.
When reading data, this is the amount of data to be read
from the socket.
flags	This field, which in many implementations is optional,
provides any specific information to TCP/IP regarding any
special actions to be taken on the socket when sending or
receiving the data.
Note: Other variations of sendmsg() and recvmsg() can be as follows:
sendmsg(): send(), sendto(), write()
recvmsg(): recv(), recvfrom(), read()
RFC 3493 does not specifically discuss the fields passed on the sendmsg()
function. The fields discussed earlier are drawn from those typically used
by most implementations.
򐂰 Close a socket
close(sockfd)
Definition of fields:
sockfd This is the socket that is to be closed.
An example of a client/server scenario
Figure 11-2 illustrates the appropriate order of socket API functions to implement
a connection-oriented interaction.
Figure 11-2 Socket system calls for a connection-oriented protocol. (In the
figure, the server opens a communication endpoint, registers its address with
the system, establishes a listen on the socket, and awaits inbound client
connections. The accept() call blocks until a connection is opened by a client;
upon receiving a connection, accept() creates a new socket to service the
request, over which the request and reply data then flow.)
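The sequence shown in Figure 11-2 can be sketched in Python, whose socket module mirrors the calls described above. This is an illustrative sketch only: the loopback address, port selection, payload, and buffer sizes are assumptions, not part of the original text.

```python
import socket

def tcp_echo_demo(payload: bytes = b"request") -> bytes:
    # Server side: open endpoint, register address, establish a listen.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("", 0))          # port 0: let the system pick a free port
    server.listen(5)                       # queuesize of 5 pending connections
    port = server.getsockname()[1]

    # Client side: open endpoint and connect outbound to the server.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("", port))

    # accept() returns a NEW socket that services this connection.
    conn, _peer = server.accept()
    client.sendall(payload)                # client sends the request
    data = conn.recv(1024)                 # server reads the request
    conn.sendall(data)                     # ... and sends the reply
    reply = client.recv(1024)

    for s in (conn, client, server):
        s.close()
    return reply
```

Running `tcp_echo_demo()` exercises the full connection-oriented call order in a single process over the loopback interface.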
The connectionless scenario is simpler in that the listen(), accept(), and
connect() functions are not invoked. Table 11-1 compares the socket API
functions that are used for connection-oriented and connectionless clients and
servers.
Table 11-1 Socket API function comparison
Client/server connection    Socket API functions used
Connection-oriented server  socket(), bind(), listen(), accept(), sendmsg(), recvmsg(), close()
Connection-oriented client  socket(), connect(), sendmsg(), recvmsg(), close()
Connectionless server       socket(), bind(), recvmsg(), sendmsg(), close()
Connectionless client       socket(), sendmsg(), recvmsg(), close()
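The connectionless case can be sketched the same way: the server issues only socket() and bind(), and every datagram carries an explicit destination address. As before, the loopback address and buffer sizes are illustrative assumptions.

```python
import socket

def udp_exchange_demo(payload: bytes = b"ping") -> bytes:
    # Connectionless server: socket() and bind() only; no listen()/accept().
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("", 0))
    port = server.getsockname()[1]

    # Connectionless client: no connect(); address each datagram explicitly.
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.sendto(payload, ("", port))

    data, peer = server.recvfrom(1024)   # request arrives with sender address
    server.sendto(data, peer)            # reply goes back to that address
    reply, _ = client.recvfrom(1024)

    server.close()
    client.close()
    return reply
```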
11.2.2 Remote Procedure Call (RPC)
Remote Procedure Call (RPC), originally developed by Sun Microsystems and
currently used by many UNIX-based systems, is an application programming
interface (API) available for developing distributed applications. It allows
programs to execute subroutines on a remote system. The caller program, which
represents the client instance in the client/server model (see Figure 11-1 on
page 409), sends a call message to the server process and waits for a reply
message. The call message includes the subroutine’s parameters, and the reply
message contains the results of executing the subroutine. RPC also provides a
standard way of encoding data passed between the client and server in a
portable fashion, called External Data Representation (XDR), defined by
RFC 4506/STD 0067.
The RPC concept
The concept of RPC is very similar to that of an application program issuing a
procedure call:
1. The caller process, which represents the client instance in the client/server
model (see Figure 11-1 on page 409), sends a call message and waits for the
reply.
2. The server awaits the arrival of call messages. When a call message arrives,
the server process extracts the procedure parameters, computes the results,
and sends them back in a reply message.
Figure 11-3 shows the conceptual model of RPC.
Figure 11-3 Remote Procedure Call (RPC) conceptual model
Figure 11-3 represents only one possible model. In this figure, the caller's
execution blocks until a reply message is received. Other models are possible;
for example, the caller can continue processing while waiting for a reply, or the
server can dispatch a separate task for each incoming call so that it remains free
to receive other messages.
Note: The client and server stubs seen in Figure 11-3 are created by the
RPCGEN command, discussed later in this chapter.
Remote procedure calls differ from local procedure calls in the following ways:
򐂰 The use of global variables is not possible because the server has no access
to the caller program's address space.
򐂰 Performance might be affected by the transmission times.
򐂰 User authentication might be necessary.
򐂰 The location of the server must be known.
Call message and reply transport
The RPC protocol can be implemented on any transport protocol. In the case of
TCP/IP, it can use either TCP or UDP as the transport vehicle. When using UDP,
remember that this does not provide reliability. Therefore, it is the responsibility
of the caller program to employ any needed reliability (using timeouts and
retransmissions, usually implemented in RPC library routines). Note that even
with TCP, the caller program still needs a timeout routine to deal with exceptional
situations, such as a server crash or poor network performance.
RPC call message structure
The RPC call message consists of several fields:
򐂰 Program and procedure numbers
Each call message contains three unsigned integer fields that uniquely
identify the procedure to be executed:
– Remote program number
– Remote program version number
– Remote procedure number
The remote program number identifies a functional group of procedures, for
example, a file system, that includes individual procedures such as read and
write. These individual procedures are identified by a unique procedure
number within the remote program. As the remote program evolves, a version
number is assigned to the different releases.
Each remote program is attached to an internet port. The number of this port
can be freely chosen, except for the reserved well-known-services port
numbers. It is evident that the caller will have to know the port number used
by this remote program.
򐂰 Authentication fields
Two fields, credentials and verifier, are provided for the authentication of the
caller to the service. It is up to the server to use this information for user
authentication. Also, each implementation is free to choose the varieties of
supported authentication protocols. These authentication protocols include:
– Null authentication.
– UNIX authentication: The callers of a remote procedure can identify
themselves as they are identified on the UNIX system.
– DES authentication: In addition to a user ID, a time stamp field is sent to
the server. This time stamp is the current time, enciphered using a key
known to the caller machine and server machine only (based on the secret
key and public key concept of DES).
򐂰 Procedure parameters
Procedure parameters consist of the data passed to the remote procedure.
RPC message reply structure
Several replies exist, depending on the action taken:
򐂰 SUCCESS: Procedure results are sent back to the client.
򐂰 RPC_MISMATCH: Server is running another version of RPC than the caller.
򐂰 AUTH_ERROR: Caller authentication failed.
򐂰 PROG_MISMATCH: Sent if the program is unavailable, if the version asked
for does not exist, or if the procedure is unavailable.
For a detailed description of the call and reply messages, see RFC 1057 – RPC:
Remote Procedure Call Protocol Specification Version 2, which also contains the
type definitions (typedef) for the messages in XDR language.
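The uniform encoding can be illustrated with a small sketch of two XDR primitives from RFC 4506: integers are four big-endian octets, and strings carry a length word followed by the bytes, zero-padded to a four-octet boundary. The program, version, and procedure numbers below are hypothetical, chosen only for illustration.

```python
import struct

def xdr_uint(n: int) -> bytes:
    # XDR unsigned integer: 4 octets, big-endian (RFC 4506).
    return struct.pack(">I", n)

def xdr_string(s: bytes) -> bytes:
    # XDR string: length word, then the bytes, zero-padded
    # to a multiple of four octets (RFC 4506).
    pad = (4 - len(s) % 4) % 4
    return xdr_uint(len(s)) + s + b"\x00" * pad

# The three identification fields of an RPC call message, encoded with
# the same portable representation (hypothetical example numbers):
call_id = xdr_uint(200000) + xdr_uint(1) + xdr_uint(3)
```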
Using portmap to register a program’s port number
The portmap (or portmapper) is a server application that maps a program and its
version number to the Internet port number used by the program. Portmap is
assigned the reserved (well-known service) port number 111.
Portmap only knows about RPC programs for the host on which portmap runs. In
order for portmap to know about these programs, each program should, when
initializing, register itself with the local portmap application.
The RPC client (caller) has to ask the portmap service on the remote host about
the port used by the desired server program. Normally, the calling application
contacts portmap on the destination host to obtain the correct port number for a
particular remote program, and then sends the call message to this particular
port. A variation can exist such that the caller can also send the procedure data
along to portmap, and then portmap directly invokes the desired procedure using
the received data. Figure 11-4 on page 419 illustrates this process.
Figure 11-4 RPC using portmap
RPCGEN
RPCGEN is a tool that generates C code to implement an RPC protocol. The
input to RPCGEN is a file written in a language similar to C, known as the RPC
language. The output of the tool is four files used to implement the RPC program.
For example, assume that RPCGEN is used with an input file named proto.x.
The output will consist of the following files:
򐂰 proto.h: A header file containing common definitions of constants and macros
used by the RPC program
򐂰 protoc.c: The client stub source file, shown in Figure 11-3 on page 416
򐂰 protos.c: The server stub source file, also shown in Figure 11-3 on page 416
򐂰 protox.c: The XDR routines source file
11.2.3 The SNMP distributed programming interface (SNMP DPI)
The Simple Network Management Protocol (SNMP) defines a protocol that
permits operations on a collection of system-specific variables, also called
objects. This set of variables is called the Management Information Base (MIB),
and a core set of variables has previously been defined. The design of the MIB
makes provision for extension of this core set. However, conventional
SNMP agent implementations provide no means for a user to make new objects
available. The SNMP DPI addresses this issue by providing a lightweight
mechanism that permits users to dynamically add, delete, or replace
application-specific or user-specific management variables in the local MIB
without requiring recompilation of the SNMP agent. This is achieved by writing a
subagent that communicates with the agent through the SNMP distributed
programming interface (DPI), described in RFC 1592.
The SNMP DPI allows a process to register the existence of a MIB object with
the SNMP agent. When requests for this object are received by the SNMP agent,
it will pass the query on to the process acting as a subagent. This subagent then
returns an appropriate answer to the SNMP agent, which then packages an
SNMP response packet and sends it to the remote network management station
that initiated the request.
The DPI communication between the SNMP agent and the subagent is
transparent to the remote network management stations. This communication
can occur over a variety of transports, including stream, datagram, and UNIX
sockets.
Using the SNMP DPI, a subagent can:
򐂰 Create and delete subtrees in the application-specific or user-specific MIB
򐂰 Create a register request packet to inform the SNMP agent of the MIB
supported by the subagent
򐂰 Create a RESPONSE packet to answer requests received from the SNMP agent
򐂰 Create a TRAP request packet to be delivered by the SNMP agent to remote
management stations
The interaction between a subagent and the SNMP agent is illustrated in
Figure 11-5.
Figure 11-5 SNMP DPI overview
An SNMP subagent, running as a separate process (potentially on another
machine), can set up a connection with the agent. The subagent has an option to
communicate with the SNMP agent through UDP, TCP sockets, or other
transport mechanisms.
After the connection is established, the subagent issues a DPI OPEN and one or
more REGISTER requests to register one or more MIB subtrees with the SNMP
agent. The SNMP agent responds to DPI OPEN and REGISTER requests with a
RESPONSE packet, indicating success or failure.
As the SNMP agent receives queries from remote management stations, any
requests for an object in a subtree registered by a subagent are forwarded
through the DPI to that subagent. These requests are in the form of GET,
GETNEXT, or SET requests.
Note: SET requests do not conform to the request/response sequence used
by GET or GETNEXT requests. Instead, SET requests occur in a sequence of
SET/COMMIT, SET/UNDO, or SET/COMMIT/UNDO. We discuss this in
greater detail in Chapter 17, “Network management” on page 623.
The subagent sends responses back to the SNMP agent through a RESPONSE
packet, which the SNMP agent then encodes into an SNMP packet sent back to
the requesting remote management station.
If the subagent wants to report an important state change, it can send a DPI
TRAP packet to the SNMP agent, which the agent will encode into an SNMP trap
packet and send to designated remote management stations.
A subagent can send an ARE_YOU_THERE to verify that the connection is still
open. If so, the agent sends a RESPONSE with no error; otherwise, it sends a
RESPONSE with an error.
If the subagent wants to stop operations, it sends a DPI UNREGISTER and a
DPI CLOSE packet to the agent. The agent sends a response to an
UNREGISTER request, but there is no RESPONSE to a CLOSE. A CLOSE
implies an UNREGISTER for all registrations that exist for the DPI connection
being CLOSED, even if no UNREGISTER has been sent.
An agent can send a DPI UNREGISTER if a higher priority registration arrives (or
for other reasons) to the subagent. The subagent then responds with a DPI
RESPONSE packet. An agent can also send a DPI CLOSE to indicate that it is
terminating the DPI connection. This might occur when the agent terminates, or if
the agent has timed out while awaiting a response from a subagent.
11.2.4 REXX sockets
The Restructured Extended Executor (REXX) programming language was
originally developed by IBM in 1982, and implementations have since been
created for a wide variety of platforms. To enable REXX applications to
communicate over a TCP/IP network, most platforms provide a REXX socket
API to allow:
򐂰 Socket initialization
򐂰 Data exchange through sockets
򐂰 Management activity through sockets
򐂰 Socket termination
However, currently there is no standard to define the REXX socket API. The
specifics of the functions provided vary from platform to platform.
11.3 RFCs relevant to this chapter
The following RFCs contain detailed information about application structure and
programming interfaces:
򐂰 RFC 1057 – Remote Procedure Call Protocol specification: Version 2
(June 1988)
򐂰 RFC 1833 – Binding Protocols for ONC™ RPC Version 2 (August 1995)
򐂰 RFC 1592 – Simple Network Management Protocol Distributed Protocol
Interface Version 2.0 (March 1994)
򐂰 RFC 3493 – Basic Socket Interface Extensions for IPv6 (February 2003)
򐂰 RFC 3542 – Advanced Sockets Application Program Interface (API) for IPv6
(May 2003)
򐂰 RFC 4506 – XDR: External Data Representation Standard. (Also STD0067)
(May 2006)
Chapter 12. Directory and naming protocols
An inherent problem in making resources available through a network is creating
an easy way of accessing these resources. On small networks, it might be easy
enough to simply remember or write down the information needed to remotely
access resources. However, this solution does not scale as networks, and the
number of available resources, continue to grow. This becomes increasingly
complex as resources are made available outside of individual networks, to
multiple networks, or even across the Internet, on multiple platforms and across
a variety of differing hardware. To overcome this, directory and naming methods
were devised to provide a uniform method of obtaining the information needed to
access a networked resource. This chapter reviews four of these methods:
򐂰 Domain Name System (DNS)
򐂰 Dynamic Domain Name System (DDNS)
򐂰 Network Information System (NIS)
򐂰 Lightweight Directory Access Protocol (LDAP)
12.1 Domain Name System (DNS)
The Domain Name System is a standard protocol with STD number 13, and its
status is recommended. It is described in RFC 1034 and RFC 1035. This section
explains the implementation of the Domain Name System and the
implementation of name servers.
The early Internet configurations required the use of only numeric IP addresses.
Because this was burdensome and much harder to remember than just the name
of a system, this evolved into the use of symbolic host names. For example,
instead of typing:
You can type:
MyHost is then translated in some way to the corresponding IP address. Though
using host names makes the process of accessing a resource easier, it also
introduces the problem of maintaining the mappings between IP addresses and
high-level machine names in a coordinated and centralized way.
Initially, host names to address mappings were maintained by the Network
Information Center (NIC) in a single file (HOSTS.TXT), which was fetched by all
hosts using FTP. This is called a flat namespace. But due to the explosive growth
in the number of hosts, this mechanism became too cumbersome (consider the
work involved in the addition of just one host to the Internet) and was replaced by
a new concept: Domain Name System. Hosts on smaller networks can continue
to use a local flat namespace (the HOSTS.LOCAL file) instead of or in addition to
the Domain Name System. Outside of small networks, however, the Domain
Name System is essential. This system allows a program running on a host to
perform the mapping of a high-level symbolic name to an IP address for any
other host without requiring every host to have a complete database of host
names.
12.1.1 The hierarchical namespace
Consider the typical internal structure of a large organization. Because the chief
executive cannot do everything, the organization will probably be partitioned into
divisions, each of them having autonomy within certain limits. Specifically, the
executive in charge of a division has authority to make direct decisions, without
permission from the chief executive.
Domain names are formed in a similar way, and will often reflect the hierarchical
delegation of authority used to assign them. For example, consider the name:
In this example, we know that there is a single host name myHost, which exists
within the myDept.myDiv.myCorp subdomain. The myDept.myDiv.myCorp
subdomain is one of the subdomains of subdomain, which is
in turn one of the subdomains of Finally, is a
subdomain of com. This hierarchy is better illustrated in Figure 12-1.
Figure 12-1 DNS hierarchical namespace
We discuss this hierarchical structure at greater length in the following sections.
12.1.2 Fully qualified domain names (FQDNs)
When using the Domain Name System, it is common to work with only a part of
the domain hierarchy, such as the domain. The Domain
Name System provides a simple method of minimizing the typing necessary in
this circumstance. If a domain name ends in a dot, it is assumed to be complete.
This is called a fully qualified domain name (FQDN) or an absolute domain
name. However, if it does not end in a dot (for example, myDept.myDiv), it is
incomplete and the DNS resolver may complete it by appending a suffix such as to the domain name. The rules for doing this are
implementation dependent and locally configurable.
12.1.3 Generic domains
The top-level names are called the generic top-level domains (gTLDs), and can
be three characters or more in length. Table 12-1 shows some of the top-level
domains of today's Internet domain namespace.
Table 12-1 Current generic domains
Domain name  Meaning
aero         The air transport industry
biz          Business use
cat          The Catalan culture
com          Commercial organizations
edu          Educational organizations
gov          U.S. governmental agencies
info         Informational sites
int          International organizations
jobs         Employment-related sites
mil          The U.S. military
mobi         Mobile devices sites
name         Family and individual sites
net          Network infrastructures
org          Non-commercial organizations
pro          Professional sites
travel       The travel industry
These names are registered with and maintained by the Internet Corporation for
Assigned Names and Numbers (ICANN). For current information, see the ICANN
Web site at:
12.1.4 Country domains
There are also top-level domains named for each of the ISO 3166
international 2-character country codes (from ae for the United Arab Emirates to
zw for Zimbabwe). These are called the country domains or the geographical
domains. Many countries have their own second-level domains underneath,
which parallel the generic top-level domains. For example, in the United
Kingdom, the domains equivalent to the generic domains .com and .edu are and (ac is an abbreviation for academic). There is a .us top-level
domain, which is organized geographically by state (for example, ny refers to
the state of New York). See RFC 1480 for a detailed description of the .us
domain.
The mapping of names to addresses consists of independent, cooperative
systems called name servers. A name server is a server program that holds a
master or a copy of a name-to-address mapping database, or otherwise points to
a server that does, and that answers requests from the client software, called a
name resolver.
Conceptually, all Internet domain servers are arranged in a tree structure that
corresponds to the naming hierarchy in Figure 12-1 on page 427. Each leaf
represents a name server that handles names for a single subdomain. Links in
the conceptual tree do not indicate physical connections. Instead, they show
which other name server a given server can contact.
12.1.6 Mapping IP addresses to domain names: Pointer queries
The Domain Name System provides for a mapping of symbolic names to IP
addresses and vice versa. While the hierarchical structure makes it easy in
principle to search the database for an IP address using its symbolic name, the
process of mapping an IP address to a symbolic name cannot use the same
process. Therefore, there is another namespace that facilitates the reverse
mapping of IP address to symbolic name. It is found in the domain
(arpa is used because the Internet was originally the ARPAnet).
Not including IPv6, IP addresses are normally written in dotted decimal format,
and there is one level of the domain hierarchy for each octet. Contrary to domain names,
which have the least-significant parts of the name first, the dotted decimal format
has the most significant bytes first. Therefore, in the Domain Name System, the
dotted decimal address is shown in reverse order.
For example, consider the following IPv4 address:
The address for this is:
This is handled slightly differently for IPv6 addresses. Because of the IPv6
address structure, the reverse ordering is done in nibbles instead of octets. Also,
the domain does not include IPv6. Instead, the domain used is
IP6.ARPA. For example, consider the following IPv6 address:
Breaking this into nibbles, reversing the order, and appending the domain yields:
Given an IP address, the Domain Name System can be used to find the
matching host name. A domain name query to do this is called a pointer query.
12.1.7 The distributed name space
The Domain Name System uses the concept of a distributed name space.
Symbolic names are grouped into zones of authority, more commonly referred to
as zones. In each of these zones, one or more hosts has the task of maintaining
a database of symbolic names and IP addresses within that zone, and provides a
server function for clients who want to translate between symbolic names and IP
addresses. These local name servers are then (through the internetwork on
which they are connected) logically interconnected into a hierarchical tree of
domains. Each zone contains a part or a subtree of the hierarchical tree, and the
names within the zone are administered independently of names in other zones.
Authority over zones is vested in the name servers.
Normally, the name servers that have authority for a zone will have domain
names belonging to that zone, but this is not required. Where a domain contains
a subtree that falls in a different zone, the name server or servers with authority
over the superior domain are said to delegate authority to the name server or
servers with authority over the subdomain. Name servers can also delegate
authority to themselves; in this case, the domain name space is still divided into
zones moving down the domain name tree, but authority for two zones is held by
the same server. The division of the domain name space into zones is
accomplished using resource records stored in the Domain Name System.
At the top-level root domain there is an exception to this. There is no higher
system to which authority can be delegated, but it is not desirable to have all
queries for fully qualified domain names to be directed to just one system.
Therefore, authority for the top-level zones is shared among a set of root name
servers coordinated by ICANN.
To better illustrate the process of resolving a symbolic name to an IP address,
consider a query for, and let us assume that our
name server does not have the answer already in its cache. The query goes to
the .com root name server, which in turn forwards the query to a server with an
NS record for At this stage, it is likely that a name server has been
reached that has cached the needed answer. However, the query could be
further delegated to a name server for
As a result of this scheme:
򐂰 Rather than having a central server for the database, the work that is involved
in maintaining this database is off-loaded to hosts throughout the
namespace.
򐂰 Authority for creating and changing symbolic host names and responsibility
for maintaining a database for them is delegated to the organization owning
the zone (within the name space) containing those host names.
򐂰 From the user's standpoint, there is a single database that deals with these
address resolutions. The user might be aware that the database is
distributed, but generally need not be concerned about this.
(At the time of writing, there were 13 root servers.)
Note: Although domains within the namespace will frequently map in a logical
fashion to networks and subnets within the IP addressing scheme, this is not a
requirement of the Domain Name System. Consider a router between two
subnets. It has two IP addresses, one for each network adapter, but it would
not normally have two symbolic names.
12.1.8 Domain name resolution
The domain name resolution process can be summarized in the following steps:
1. A user program issues a request such as the gethostbyname() system call
(this particular call asks for the IP address of a host by passing the host
name) or the gethostbyaddr() system call (which asks for the host name of a
host by passing the IP address).
2. The resolver formulates a query to the name server. (Full resolvers have a
local name cache to consult first; stub resolvers do not. See “Domain name
full resolver” and “Domain name stub resolver” on page 434.)
3. The name server checks to see if the answer is in its local authoritative
database or cache, and if so, returns it to the client. Otherwise, it queries
other available name servers, starting down from the root of the DNS tree or
as high up the tree as possible.
4. The user program is finally given a corresponding IP address (or host name,
depending on the query) or an error if the query could not be answered.
Normally, the program will not be given a list of all the name servers that have
been consulted to process the query.
Domain name resolution is a client/server process (see 11.1.1, “The client/server
model” on page 408). The client function (called the resolver or name resolver) is
transparent to the user and is called by an application to resolve symbolic
high-level names into real IP addresses or vice versa. The name server (also
called a domain name server) is the server application providing the translation
between high-level machine names and the IP addresses. The query/reply
messages can be transported by either UDP or TCP.
Domain name full resolver
Figure 12-2 shows a program called a full resolver, which is distinct from the user
program, that forwards all queries to a name server for processing. Responses
are cached by the name server for future use.
Figure 12-2 DNS: Using a full resolver for domain name resolution
Domain name stub resolver
Figure 12-3 shows a stub resolver, a routine linked with the user program, that
forwards the queries to a name server for processing. Responses are cached by
the name server, but not usually by the resolver, although this is implementation
dependent. On most platforms, the stub resolver is implemented by two library
routines (or by some variation of these routines): gethostbyname() and
gethostbyaddr(). These are used for converting host names to IP addresses
and vice versa. Stub resolvers are much more common than full resolvers.
Figure 12-3 DNS: Using a stub resolver for domain name resolution
Domain name resolver operation
Domain name queries can be one of two types: recursive or iterative (also called
non-recursive). A flag bit in the domain name query specifies whether the client
desires a recursive query, and a flag bit in the response specifies whether the
server supports recursive queries. The difference between a recursive and an
iterative query arises when the server receives a request for which it cannot
supply a complete answer by itself. A recursive query requests that the server
issues a query itself to determine the requested information and returns the
complete answer to the client. An iterative query means that the name server
returns what information it has available and also a list of additional servers for
the client to contact to complete the query.
Domain name responses can be one of two types: authoritative and
non-authoritative. A flag bit in the response indicates which type a response is.
When a name server receives a query for a domain in a zone over which it has
authority, it returns all of the requested information in a response with the
authoritative answer flag set. When it receives a query for a domain over which it
does not have authority, its actions depend on the setting of the recursion
desired flag in the query:
򐂰 If the recursion desired flag is set and the server supports recursive queries, it
will direct its query to another name server. This will either be a name server
with authority for the domain given in the query, or it will be one of the root
name servers. If the second server does not return an authoritative answer
(for example, if it has delegated authority to another server), the process is
repeated.
򐂰 When a server (or a full resolver program) receives a response, it will cache it
to improve the performance of repeat queries. The cache entry is stored for a
maximum length of time specified by the originator in a 32-bit time-to-live
(TTL) field contained in the response. A typical TTL value is 86,400 seconds
(one day).
򐂰 If the recursion desired flag is not set or the server does not support recursive
queries, it will return whatever information it has in its cache and also a list of
additional name servers to be contacted for authoritative information.
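The response caching in the second bullet can be sketched with a simple TTL cache. The injectable clock is an assumption added to make expiry testable, and the default TTL of 86,400 seconds follows the text.

```python
import time

class TTLCache:
    """Minimal sketch of response caching with a time-to-live, as a
    full resolver or name server performs it."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}                   # name -> (answer, expiry time)

    def put(self, name, answer, ttl=86400):
        # Store the answer for at most the TTL given by the originator.
        self.entries[name] = (answer, self.clock() + ttl)

    def get(self, name):
        entry = self.entries.get(name)
        if entry is None:
            return None
        answer, expires = entry
        if self.clock() >= expires:         # TTL elapsed: discard stale entry
            del self.entries[name]
            return None
        return answer
```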
Domain name server operation
Each name server has authority for zero or more zones. There are three types of
name servers:
A primary name server loads a zone's information from
disk and has authority over the zone.
A secondary name server has authority for a zone, but
obtains its zone information from a primary server using a
process called zone transfer. To remain synchronized, the
secondary name servers query the primary on a regular
basis (typically three hours) and re-execute the zone
transfer if the primary has been updated. A name server
can operate as a primary or a secondary name server for
multiple domains, or a primary for some domains and as a
secondary for others. A primary or secondary name
server performs all of the functions of a caching-only
name server.
A name server that does not have authority for any zone
is called a caching-only name server. A caching-only
name server obtains all of its data from primary or
secondary name servers as required. It requires at least
one NS record to point to a name server from which it can
initially obtain information.
Chapter 12. Directory and naming protocols
When a domain is registered with the root and a separate zone of authority
established, the following rules apply:
򐂰 The domain must be registered with the root administrator.
򐂰 There must be an identified administrator for the domain.
򐂰 There must be at least two name servers with authority for the zone that are
accessible from outside and inside the domain to ensure no single point of failure.
We also recommend that name servers that delegate authority apply these rules,
because the delegating name servers are responsible for the behavior of name
servers under their authority.
12.1.9 Domain Name System resource records
The Domain Name System's distributed database is composed of resource
records (RRs), which are divided into classes for different kinds of networks. We
only discuss the Internet class of records. Resource records provide a mapping
between domain names and network objects. The most common network objects
are the addresses of Internet hosts, but the Domain Name System is designed to
accommodate a wide range of different objects.
A zone consists of a group of resource records, beginning with a Start of
Authority (SOA) record. The SOA record identifies the domain name of the zone.
There will be a name server (NS) record for the primary name server for this
zone. There might also be NS records for the secondary name servers. The NS
records are used to identify which of the name servers are authoritative (see
“Domain name resolver operation” on page 434). Following these records are the
resource records, which might map names to IP addresses or aliases to names.
The following figure shows the general format of a resource record (Figure 12-4).
Figure 12-4 DNS general resource record format
Name
The domain name to be defined. The Domain Name System is very general in its rules for the composition of domain names. However, it recommends a syntax for domain names that minimizes the likelihood of applications that use a DNS resolver (that is, nearly all TCP/IP applications) misinterpreting a domain name. A domain name adhering to this recommended syntax consists of a series of labels made up of alphanumeric characters or hyphens, each label having a length of between 1 and 63 characters and starting with an alphabetic character. Each pair of labels is separated by a dot (period) in human-readable form, but not in the form used within DNS messages. Domain names are not case-sensitive.
Type
Identifies the type of the resource in this record. There are numerous possible values, but some of the more common ones, along with the RFCs that define them, are listed in Table 12-2 on page 438.
Class
Identifies the protocol family. The only commonly used value is IN (the Internet system), though other values are defined by RFC 1035 and include:
CS (value 2): The CSNET class. This class is obsolete.
CH (value 3): The CHAOS class.
HS (value 4): The Hesiod class.
TTL
The time, in seconds, for which this resource record will be valid in a name server cache. This is stored in the DNS as an unsigned 32-bit value. A typical value for records pointing to IP addresses is 86400 (one day).
RDlength
An unsigned 16-bit integer that specifies the length, in octets, of the RData field.
RData
A variable-length string of octets that describes the resource. The format of this information varies according to the Type and Class of the resource record.
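The recommended label syntax described for the Name field can be checked mechanically. The sketch below (is_recommended_name is our own name) enforces only the rules quoted above; note that some names that are legal in DNS, such as labels beginning with a digit, deliberately fail this stricter check:

```python
import re

# One label per the recommended syntax: 1-63 characters, alphanumerics or
# hyphens only, and the first character alphabetic (either case).
LABEL = re.compile(r"^[A-Za-z][A-Za-z0-9-]{0,62}$")

def is_recommended_name(name):
    """True if every dot-separated label follows the recommended syntax.
    Domain names are not case-sensitive, so both cases are accepted."""
    name = name.rstrip(".")            # a trailing dot marks the root label
    return all(LABEL.match(label) for label in name.split("."))

assert is_recommended_name("www.example.org")
assert is_recommended_name("my-host.example.")       # trailing root dot is fine
assert not is_recommended_name("3com.example")       # label starts with a digit
assert not is_recommended_name("a." + "x" * 64 + ".org")  # label too long
```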
Table 12-2 Some of the possible resource record types

Type       RFC def   Description
A          1035      A host address
NS         1035      An authoritative name server
CNAME      1035      The canonical name for an alias
SOA        1035      Marks the start of a zone of authority
MB         1035      A mailbox domain name (experimental)
MG         1035      A mail group member (experimental)
MR         1035      A mail rename domain name (experimental)
NULL       1035      A NULL resource record (experimental)
WKS        1035      A well-known service description
PTR        1035      A domain name pointer
HINFO      1035      Host information
MINFO      1035      Mailbox or mail list information
MX         1035      Mail exchange (a)
TXT        1035      Text strings
RP         1183      Responsible person record
AFSDB      1183      Andrew File System database
X25        1183      X.25 resource record
ISDN       1183      ISDN resource record
RT         1183      Route Through resource record
NSAP       1706      Network Service Access Protocol record
NSAP-PTR   1706      NSAP Pointer record
KEY        2535      The public key associated with a DNS name
AAAA       3596      An IPv6 address record
LOC        1876      GPS resource record
SRV        2782      Defines the services available in a zone
CERT       4398      Certificate resource records
A6         2874      Forward mapping of an IPv6 address
DNAME      2874      Delegation of IPv6 reverse addresses
DS         4034      Delegated Signer record (DNS security)
RRSIG      4034      Resource record digital signature
NSEC       4034      Next Secure record (DNS security)
DNSKEY     4034      Public Key record (DNS security)

a. The MX Type obsoletes Types MD (value 3, Mail destination) and MF (value 4, Mail forwarder).
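A record in the general format of Figure 12-4 can be decoded with a few lines of code. This is a sketch, not a full parser: parse_rr is our own name, and it assumes an uncompressed domain name, ignoring the message-compression pointers described later in this section:

```python
import struct

def parse_rr(buf, offset=0):
    """Parse one resource record with an uncompressed name.
    Layout per the general format: Name, Type, Class, TTL, RDlength, RData."""
    labels = []
    while True:
        length = buf[offset]
        offset += 1
        if length == 0:                 # X'00' ends the name (root label)
            break
        labels.append(buf[offset:offset + length].decode("ascii"))
        offset += length
    rtype, rclass, ttl, rdlength = struct.unpack_from("!HHIH", buf, offset)
    offset += 10                        # 2 + 2 + 4 + 2 bytes of fixed fields
    rdata = buf[offset:offset + rdlength]
    return ".".join(labels), rtype, rclass, ttl, rdata

# An A record (Type 1, Class IN=1) for www.example.org with a one-day TTL.
record = (b"\x03www\x07example\x03org\x00"          # length-prefixed labels
          + struct.pack("!HHIH", 1, 1, 86400, 4)    # Type, Class, TTL, RDlength
          + bytes([192, 0, 2, 1]))                  # RData: the IPv4 address
name, rtype, rclass, ttl, rdata = parse_rr(record)
assert name == "www.example.org"
assert (rtype, rclass, ttl) == (1, 1, 86400)
assert ".".join(map(str, rdata)) == "192.0.2.1"
```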
12.1.10 Domain Name System messages
All messages in the Domain Name System protocol use a single format. This
format is shown in Figure 12-5 on page 440. This frame is sent by the resolver to
the name server. Only the header and the question section are used to form the
query. Replies and forwarding of the query use the same frame, but with more
sections filled in (the answer/authority/additional sections).
Figure 12-5 DNS message format
Header format
The header section is always present and has a fixed length of 12 bytes. The
other sections are of variable length.
ID
A 16-bit identifier assigned by the program. This identifier is copied in the corresponding reply from the name server and can be used for differentiation of responses when multiple queries are outstanding at the same time.
Parameters
A 16-bit value in the following format (Table 12-3).
Table 12-3 Parameters
QR
Flag identifying a query (0) or a response (1).
Op code
4-bit field specifying the kind of query:
0: Standard query (QUERY)
1: Inverse query (IQUERY)
2: Server status request (STATUS)
Other values are reserved for future use.
AA
Authoritative answer flag. If set in a response, this flag specifies that the responding name server is an authority for the domain name sent in the query.
TC
Truncation flag. Set if the message was longer than permitted on the physical channel.
RD
Recursion desired flag. This bit signals to the name server that recursive resolution is asked for. The bit is copied to the response.
RA
Recursion available flag. Indicates whether the name server supports recursive resolution.
Z
3 bits reserved for future use. Must be zero.
Rcode
4-bit response code. Possible values are:
0: No error.
1: Format error. The server was unable to interpret the message.
2: Server failure. The message was not processed because of a problem with the server.
3: Name error. The domain name in the query does not exist. This is only valid if the AA bit is set in the response.
4: Not implemented. The requested type of query is not implemented by the name server.
5: Refused. The server refuses to respond for policy reasons.
Other values are reserved for future use.
QDcount
An unsigned 16-bit integer specifying the number of entries in the question section.
ANcount
An unsigned 16-bit integer specifying the number of RRs in the answer section.
NScount
An unsigned 16-bit integer specifying the number of name server RRs in the authority section.
ARcount
An unsigned 16-bit integer specifying the number of RRs in the additional records section.
Question section
The next section contains the queries for the name server. It contains QDcount
(usually 1) entries, each in the format shown in Figure 12-6.
Figure 12-6 DNS question format
Length
A single byte giving the length of the next label.
Label
One element of the domain name (for example, ibm). The domain name referred to by the question is stored as a series of these variable-length labels, each preceded by a 1-byte length.
00
X'00' indicates the end of the domain name and represents the null label of the root domain.
Qtype
2 bytes specifying the type of query. It can have any value from the Type field in a resource record.
Qclass
2 bytes specifying the class of the query. For Internet queries, this will be IN.
Note that all of the fields are byte-aligned. The alignment of the Type field on a 4-byte boundary is
for example purposes and is not required by the format.
For example, a domain name is encoded as a series of such length-prefixed labels terminated by the null label. An entry in the question section whose encoded domain name occupies 18 bytes therefore requires 22 bytes: 18 to store the domain name and 2 each for the Qtype and Qclass fields.
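The question encoding can be sketched as follows (encode_qname and question_entry are our own names, and the example name is illustrative):

```python
import struct

def encode_qname(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        out.append(len(label))              # 1-byte length, then the characters
        out += label.encode("ascii")
    out.append(0)                           # null label of the root domain
    return bytes(out)

def question_entry(name, qtype=1, qclass=1):
    """A full question entry: Qname + 2-byte Qtype + 2-byte Qclass."""
    return encode_qname(name) + struct.pack("!HH", qtype, qclass)

# "www.example.org" encodes to (1+3) + (1+7) + (1+3) + 1 = 17 bytes,
# so the whole question entry is 17 + 2 + 2 = 21 bytes.
assert len(encode_qname("www.example.org")) == 17
assert len(question_entry("www.example.org")) == 21
```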
Answer, authority, and additional resource sections
These three sections contain a variable number of resource records. The
number is specified in the corresponding field of the header. The resource
records are in the format shown in Figure 12-7.
Figure 12-7 DNS: Answer Record Entry format
Note that all of the fields are byte-aligned. The alignment of the Type field on a 4-byte boundary is
for example purposes and is not required by the format.
Where the fields before the TTL field have the same meanings as for a question entry and:
TTL
A 32-bit time-to-live value in seconds for the record. This defines how long it can be regarded as valid.
RDlength
A 16-bit length for the Rdata field.
Rdata
A variable-length string whose interpretation depends on the Type field.
Message compression
In order to reduce the message size, a compression scheme is used to eliminate
the repetition of domain names in the various RRs. Any duplicate domain name
or list of labels is replaced with a pointer to the previous occurrence. The pointer
has the form of a 2-byte field as shown in Figure 12-8.
Figure 12-8 DNS message compression
򐂰 The first 2 bits distinguish the pointer from a normal label, which is restricted
to a 63-byte length plus the length byte ahead of it (which has a value of <64).
򐂰 The offset field specifies an offset from the start of the message. A zero offset
specifies the first byte of the ID field in the header.
򐂰 If compression is used in an Rdata field of an answer, authority, or additional
section of the message, the preceding RDlength field contains the real length
after compression is done.
Refer to 12.2, “Dynamic Domain Name System” on page 453 for additional
message formats.
Using DNS Uniform Resource Identifiers (URIs)
A DNS can also be queried using a Uniform Resource Identifier (URI), as defined in RFC 4501. Strings are not case sensitive, and adhere to the following format:
“dns:” + [ “//” + dnsauthority + “:” + port + ”/” ] + dnsname +
[ “?” + dnsquery ]
dnsauthority
The DNS server to which the query should be sent. If this is left blank, the query is sent to the default DNS server.
dnsname
The name or IP address to be queried.
dnsquery
The type of the query to be performed. This can be any combination, separated by a semicolon (;), of:
CLASS: The class of the query, usually IN for the Internet.
TYPE: The type of resource record desired.
For example, a request using the URI scheme to resolve a name such as www.example.org to an IP address might appear as follows:
dns:www.example.org
Additionally, the same request can be sent to a name server (here, ns.example) listening on port 5353 using the following:
dns://ns.example:5353/www.example.org
Finally, this same query can be made specifying a CLASS of IN and a TYPE of A:
dns:www.example.org?CLASS=IN;TYPE=A
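A small helper can assemble URIs following the RFC 4501 grammar above. This is a sketch only: dns_uri is our own name, no percent-encoding is attempted, and the server name and port are illustrative:

```python
def dns_uri(name, server=None, port=None, rrtype=None, rrclass=None):
    """Build a dns: URI per the RFC 4501 grammar quoted above.
    The authority part is optional; query parts are joined with semicolons."""
    uri = "dns:"
    if server:
        uri += "//" + server + (":%d" % port if port else "") + "/"
    uri += name
    query = ";".join(part for part in
                     (("CLASS=" + rrclass) if rrclass else None,
                      ("TYPE=" + rrtype) if rrtype else None) if part)
    if query:
        uri += "?" + query
    return uri

assert dns_uri("www.example.org") == "dns:www.example.org"
assert (dns_uri("www.example.org", server="ns.example", port=5353, rrtype="A")
        == "dns://ns.example:5353/www.example.org?TYPE=A")
assert (dns_uri("www.example.org", rrclass="IN", rrtype="A")
        == "dns:www.example.org?CLASS=IN;TYPE=A")
```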
12.1.11 A simple scenario
Consider a stand-alone network (no outside connections), consisting of two
physical networks:
򐂰 One has an Internet network address of 129.112.
򐂰 One has a network address of 194.33.7.
They are interconnected by an IP gateway (VM2). See Figure 12-9 for more details.
Figure 12-9 DNS: A simple configuration: Two networks connected by an IP gateway
Assume the name server function has been assigned to VM1. Remember that
the domain hierarchical tree forms a logical tree, completely independent of the
physical configuration. In this simple scenario, there is only one level in the
domain tree, which will be referred to as test.example.
The zone data for the name server appears as shown in Figure 12-10 and
continued in Figure 12-11 on page 448.
;note: an SOA record has no TTL field
$origin test.example.
;note 1
IN SOA VM1.test.example. ADM.VM1.test.example.
;serial number for data
;secondary refreshes every 30 min
;secondary retries every 5 min
;data expire after 1 week
;minimum TTL for data is 1 week
99999 IN NS VM1.test.example.
;note 2
99999 IN A
;note 3
;note 4
99999 IN A
;note 5
99999 IN A
99999 IN A
99999 IN A
99999 IN A
;VM2 is an IP gateway and has 2 different IP addresses
99999 IN A
99999 IN A
Figure 12-10 Zone data for the name server, continued in Figure 12-11 on page 448
;note 6
;Some mailboxes
central 10
IN MX VM2.test.example.
;notes 7 and 8
;a second definition for the same mailbox, in case VM2 is down
central 20
IN MX VM1.test.example.
IN MX VM2.test.example.
Figure 12-11 Zone data for the name server, continued from Figure 12-10 on page 447
Notes for Figure 12-10 on page 447 and Figure 12-11:
1. The $origin statement sets the @ variable to the zone name
(test.example.). Domain names that do not end with a period are suffixed
with the zone name. Fully qualified domain names (those ending with a
period) are unaffected by the zone name.
2. Defines the name server for this zone.
3. Defines the Internet address of the name server for this zone.
4. Specifies well-known services for this host. These are expected to always
be available.
5. Gives information about the host.
6. Used for inverse mapping queries (see 12.1.6, “Mapping IP addresses to
domain names: Pointer queries” on page 430).
7. Will allow mail to be addressed to user@central.test.example.
8. See 15.1.2, “SMTP and the Domain Name System” on page 565 for the
use of these definitions.
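The $origin rule in note 1 can be sketched in a few lines (qualify is our own name; the behavior shown is only the suffixing rule described above):

```python
def qualify(name, origin):
    """Apply the $origin rule from note 1: names without a trailing period
    are suffixed with the zone name; fully qualified names pass through."""
    if name == "@":                       # @ stands for the zone name itself
        return origin
    if name.endswith("."):
        return name                       # already fully qualified
    return name + "." + origin

assert qualify("VM1", "test.example.") == "VM1.test.example."
assert qualify("VM1.test.example.", "test.example.") == "VM1.test.example."
assert qualify("@", "test.example.") == "test.example."
```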
12.1.12 Extended scenario
Consider the case where a connection is made to a third network (129.113),
which has an existing name server with authority for that zone (see
Figure 12-12).
Figure 12-12 DNS: Extended configuration - Third network connected to the existing configuration
Let us suppose that the other network has its own domain and that its name server is located in VM9. All we have to do is add two records to our own name server database (in the initial cache file): an NS record delegating the new domain to VM9, and an A record giving the address of VM9.
This simply indicates that VM9 is the authority for the new network and that all queries for that network will be directed to that name server.
12.1.13 Transport
Domain Name System messages are transmitted either as datagrams (UDP) or
through stream connection (TCP):
򐂰 UDP usage: Server port 53 (decimal).
Messages carried by UDP are restricted to 512 bytes. Longer messages are
truncated and the truncation (TC) bit is set in the header. Because UDP
frames can be lost, a retransmission strategy is required.
򐂰 TCP usage: Server port 53 (decimal).
In this case, the message is preceded by a 2-byte field indicating the total
message frame length.
򐂰 STD 3 – Host Requirements requires that:
– A Domain Name System resolver or server that is sending a
non-zone-transfer query must send a UDP query first. If the answer
section of the response is truncated and if the requester supports TCP, it
tries the query again using TCP. UDP is preferred over TCP for queries
because UDP queries have much lower overall processing cost, and the
use of UDP is essential for a heavily loaded server. Truncation of
messages is rarely a problem given the current contents of the Domain
Name System database, because typically 15 response records can be
accommodated in the datagram, but this might change as new record
types continue to be added to the Domain Name System.
– TCP must be used for zone transfer activities because the 512-byte limit
for a UDP datagram will always be inadequate for a zone transfer.
– Name servers must support both types of transport.
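Two pieces of this transport policy are easy to show in code: the 2-byte length prefix used over TCP, and the decision to retry over TCP when the TC (truncation) bit is set in a UDP response. This is an illustrative sketch; the function names are our own:

```python
import struct

MAX_UDP_PAYLOAD = 512          # classic DNS limit for UDP messages

def frame_for_tcp(message):
    """Prefix a DNS message with the 2-byte length field required over TCP."""
    return struct.pack("!H", len(message)) + message

def needs_tcp_retry(response_flags):
    """UDP-first policy: retry over TCP only when the TC (truncation) bit,
    bit 9 of the 16-bit parameter field, is set in the UDP response."""
    return bool(response_flags & 0x0200)

query = b"\x12\x34" + b"\x01\x00" + b"\x00\x01" + b"\x00" * 6  # header stub
assert frame_for_tcp(query)[:2] == b"\x00\x0c"                 # 12-byte message
assert needs_tcp_retry(0x8380)        # response with TC set: retry over TCP
assert not needs_tcp_retry(0x8180)    # complete response: no retry needed
```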
As IPv6 becomes more pervasive throughout the Internet community, some
problems are forecasted as a result of mixed IPv4/IPv6 network segments.
Primarily, if a resolver that can only use IPv4 is referred to a name server across a network segment that supports only IPv6, the resolver and the name server will be unable to communicate. As a result, the hierarchical namespace becomes fragmented into two sets of segments: those that support IPv4, and those that support only IPv6. This impending issue has been named the Problem of Name Space Fragmentation, and is documented in RFC 3901. In order to preserve
namespace continuity, RFC 3901 recommends the following:
򐂰 Every recursive name server should be either IPv4 only, or dual IPv4 and IPv6.
򐂰 Every DNS zone should be served by at least one IPv4-reachable
authoritative name server.
Additional suggestions about configuring IPv6 DNS servers are in RFC 4339.
Dynamic DNS (DDNS)
The Dynamic Domain Name System (DDNS) is a protocol that defines
extensions to the Domain Name System to enable DNS servers to accept
requests to add, update, and delete entries in the DNS database dynamically.
Because DDNS offers a functional superset to existing DNS servers, a DDNS
server can serve both static and dynamic domains at the same time, a welcome
feature for migration and overall DNS design.
DDNS is currently available in a non-secure and a secure flavor, defined in RFC
2136 and RFC 3007, respectively. Rather than allowing any host to update its
DNS records, the secure version of DDNS uses public key security and digital
signatures to authenticate update requests from DDNS hosts.
Without client authentication, another host could impersonate an unsuspecting
host by remapping the address entry for the unsuspecting host to that of its own.
After the remapping occurs, important data, such as logon passwords and mail intended for the host, would be sent to the impersonating host instead.
See 12.2, “Dynamic Domain Name System” on page 453 for more information
about how DDNS works together with DHCP to perform seamless updates of
reverse DNS mapping entries, and see 9.4, “DNS in IPv6” on page 367 for more
information about DNS with IPv6.
12.1.14 DNS applications
Three common utilities for querying name servers are provided with many DNS
implementations: host, nslookup, and dig. Though specific details about each utility vary slightly by platform, most platforms provide a common set of options.
The host command obtains an IP address associated with a host name, or a host
name associated with an IP address. The typical syntax for the host command is:
host [options] name [server]
Valid options typically include:
-c class
The query class. By default, this is IN (Internet), but other valid class names include CS (CSNET), CH (CHAOS), HS (Hesiod), and ANY (a wildcard encompassing all four classes).
-r
Disables recursive processing.
-t type
The type of query required. This can be any of the standard resource record types (see Table 12-2 on page 438).
-w
Instructs the host command to wait forever for a reply.
name
The name of the host or the address to be resolved.
server
The name server to query.
The nslookup command enables you to locate information about network nodes,
examine the contents of a name server database, and establish the accessibility
of name servers. The typical syntax for the nslookup command is:
nslookup [options] [host] [-nameserver]
options
These options vary widely by platform. Refer to the documentation for a specific implementation for information about what options are available.
host
The host name or IP address to be located.
nameserver
The name server to which the query is to be directed.
dig stands for Domain Internet Groper, and enables you to exercise name
servers, gather large volumes of domain name information, and execute simple
domain name queries. The typical syntax for the dig command is:
dig @server [options] [name] [type] [class] [queryopt]
@server
The DNS name server to be queried.
Valid options typically include:
-b address
The source IP address from which the query is sent.
-c class
The query class. By default, this is IN (Internet), but other valid class names include CS (CSNET), CH (CHAOS), and HS (Hesiod).
-f filename
Causes dig to operate in batch mode, and specifies the file in which the batch commands can be found.
-p port
Specifies that dig should send the query to a port other than the well-known DNS port 53.
-x address
Instructs dig to do a reverse lookup on the specified address.
name
The name of the resource record to be looked up.
type
The type of query required. This can be any of the standard resource record types (see Table 12-2 on page 438).
12.2 Dynamic Domain Name System
The Domain Name System described in 12.1, “Domain Name System (DNS)” on
page 426 is a static implementation without recommendations with regard to
security. In order to implement DNS dynamically, take advantage of DHCP, and
still be able to locate any specific host by means of a meaningful label (such as
its host name), the following extensions to DNS are required:
򐂰 A method for the host name-to-address mapping entry for a client in the domain name server to be updated after the client has obtained an address from a DHCP server
򐂰 A method for the reverse address-to-host name mapping to take place after the client obtains its address
򐂰 Updates to the DNS to take effect immediately, without the need for
intervention by an administrator
򐂰 Updates to the DNS to be authenticated, to prevent unauthorized hosts from accessing the network and to stop imposters from using an existing host name and remapping the address entry for the unsuspecting host to that of its own
򐂰 A method for primary and secondary DNS servers to quickly forward and
receive changes as entries are being updated dynamically by clients
However, implementation of a Dynamic Domain Name System (DDNS) can
introduce problems if the environment is not secure. One method of security
employed by DNS is the use of Secret Key Transaction Authentication (TSIG),
defined in RFC 2845. This can be used to authenticate dynamic updates from
clients, or authenticate responses coming from a recursive server. Additionally, these messages can now be protected for integrity and confidentiality by using TSIG over the Generic Security Service (GSS-TSIG). This extension, and the associated algorithms needed to implement GSS-TSIG, are defined in RFC 3645.
In addition to TSIG, and GSS-TSIG, several RFCs extended the functionality of
DNS such that it incorporated additional security methods. These additions,
defined in RFC 4033 and referred to as the DNS Security Extensions (DNSSEC),
allow DNS to authenticate the origin of data as well as negative responses to
DNS queries. However, they do not provide confidentiality, access control lists, or protection against denial-of-service attacks. New resource records relating to
security were added by RFCs 4034 and 4398, and include:
򐂰 DNSKEY (public key)
򐂰 DS (delegation signer)
򐂰 RRSIG (resource record digital signature)
򐂰 NSEC (authenticated denial of existence)
򐂰 CERT (public key certificates)
Note that these RRs are also listed in Table 12-2 on page 438. Specific details about how the DNS protocol was modified to take advantage of these additions are in RFC 4035.
12.2.1 Dynamic updates in the DDNS
The DNS message format (shown in Figure 12-5 on page 440) was designed for
the querying of a static DNS database. RFC 2136 defines a modified DNS
message for updates, called the UPDATE DNS message, illustrated in
Figure 12-13 on page 455. This message adds or deletes resource records in the DNS, and allows updates to take effect without the DNS server having to be reloaded.
Figure 12-13 DDNS UPDATE message format
The header section is always present and has a fixed length of 12 bytes. The
other sections are of variable length. They are:
ID
A 16-bit identifier assigned by the program. This identifier is copied in the corresponding reply from the name server and can be used for differentiation when multiple queries/updates are outstanding at the same time.
QR
Flag identifying an update request (0) or a response (1).
Opcode
The value 5 indicates an UPDATE message.
Z
7-bit field set to 0 and reserved for future use.
Rcode
Response code (undefined in update requests). Possible values are:
0: No error.
1: Format error. The server was unable to interpret the request.
2: Server failure. The message was not processed due to a problem with the server.
3: Name error. A name specified does not exist.
4: Not implemented. The type of message specified in Opcode is not supported by this server.
5: Refused. The server refuses to perform the UPDATE requested for security or policy reasons.
6: Name error. A name exists when it should not.
7: RRset error. A resource record set exists when it should not.
8: RRset error. A resource record set specified does not exist.
9: Zone Authority error. The server is not authoritative for the zone specified.
10: Zone error. A name specified in the Prerequisite or Update sections is not in the zone specified.
ZO count
The number of RRs in the Zone section.
PR count
The number of RRs in the Prerequisite section.
UP count
The number of RRs in the Update section.
AD count
The number of RRs in the Additional information section.
Zone section
This section is used to indicate the zone of the records that are
to be updated. As all records to be updated must belong to the
same zone, the zone section has a single entry specifying the
zone name, zone type (which must be SOA), and zone class.
Prerequisite section
This section contains RRs or RRsets that either must, or must
not, exist, depending on the type of update.
Update section
This section contains the RRs, RRsets, or both that are to be added to or deleted from the zone.
Additional information section
This section can be used to pass additional RRs that relate to
the update operation in process.
For further information about the UPDATE message format, refer to RFC 2136.
12.2.2 Incremental zone transfers in DDNS
RFC 1995 introduces the IXFR DNS message type, which allows incremental
transfers of DNS zone data between primary and secondary DNS servers. In
other words, when an update has been made to the zone data, only the change
has to be copied to the other DNS servers that maintain a copy of the zone data,
rather than the whole DNS database (as is the case with the AXFR DNS
message type).
The format of an IXFR query is exactly that of a normal DNS query, but with a
query type of IXFR. The response’s answer section, however, is made up of
difference sequences.
The list of difference sequences is preceded by the server’s current version of the SOA record. Each difference sequence represents one update to the zone and is itself preceded by the SOA record of the version to which the change applies; the difference sequences are ordered oldest to newest. Upon receiving this message, a server can update its zone by tracking the version history listed in the IXFR answer section.
For example, assume a server has the following zone:
1 600 600 3600000 614800)
In version 2, otherhost.mydiv.mycorp is removed and thishost.mydiv.mycorp is added, leaving the zone as:
2 600 600 3600000 614800)
If the server receives an IXFR query, it sends back the following answer section:
SOA serial=2
SOA serial=1
SOA serial=2
SOA serial=2
Note: If a server received an IXFR query, but incremental zone transfers are
not available, it will send back the entire zone in the reply.
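The application of difference sequences can be modeled with sets. This sketch (apply_ixfr is our own name) uses a deliberately simplified representation in which each sequence carries its new serial number, its deletions, and its additions, rather than the SOA-delimited wire format:

```python
def apply_ixfr(zone, serial, answer):
    """Apply IXFR difference sequences to a zone held as a set of records.
    `answer` is [(new_serial, deletions, additions), ...], a simplified
    stand-in for the SOA-delimited sequences of a real IXFR answer section."""
    for new_serial, deletions, additions in answer:
        zone -= set(deletions)      # records removed in this version
        zone |= set(additions)      # records added in this version
        serial = new_serial
    return zone, serial

# Version 1 -> 2: otherhost is removed and thishost is added.
zone = {"ns.mydiv.mycorp", "otherhost.mydiv.mycorp"}
answer = [(2, ["otherhost.mydiv.mycorp"], ["thishost.mydiv.mycorp"])]
zone, serial = apply_ixfr(zone, 1, answer)
assert serial == 2
assert zone == {"ns.mydiv.mycorp", "thishost.mydiv.mycorp"}
```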
12.2.3 Prompt notification of zone transfer
RFC 1996 introduces the NOTIFY DNS message type, which is used by a
master server to inform subordinate servers that an update has taken place and
that they should initiate a query to discover the new data. The NOTIFY message
uses the DNS message format, but only a subset of the available fields (unused
fields are filled with binary zeros). The message is similar to a QUERY message,
and can contain the name of the RRs that have been updated. Upon receipt of a
NOTIFY message, the subordinate returns a response. The response contains
no useful information, and only serves to alert the master server of receipt of the
NOTIFY. Based on the RRs contained in the notify, subordinate servers might
then send an update query to the server to obtain the new changes.
12.3 Network Information System (NIS)
The Network Information System (NIS) is not an Internet standard. It was
developed by Sun Microsystems, Inc. It was originally known as the Yellow
Pages (YP) and many implementations continue to use this name.
NIS is a distributed database system that allows the sharing of system information in a UNIX-based environment. Examples of system information that
can be shared include the /etc/passwd, /etc/group, and /etc/hosts files. NIS has
the following advantages:
򐂰 Provides a consistent user ID and group ID name space across a large
number of systems
򐂰 Reduces the time and effort spent by users in managing their user IDs, group IDs, and NFS file system ownerships
򐂰 Reduces the time and effort spent by system administrators in managing user IDs, group IDs, and NFS ownerships
NIS is built on RPC and employs the client/server model. Most NIS
implementations use UDP. However, because it uses RPC, it is also possible for
it to be implemented over TCP. A NIS domain is a collection of systems
consisting of:
NIS master server
Maintains maps, or databases, containing the
system information, such as passwords and
host names. These are also referred to as
Database Maps (DBMs).
NIS subordinate server(s)
Can be defined to offload the processing from
the master NIS server or when the NIS master
server is unavailable.
NIS client(s)
The remaining systems that are served by the
NIS servers.
The NIS clients do not maintain NIS maps; they query NIS servers for system information. Any changes to an NIS map are made only on the NIS master server (through RPC). The master server then propagates the changes to the NIS subordinate servers.
Note that the speed of a network determines the performance and availability of
the NIS maps. When using NIS, the number of subordinate servers should be
tuned in order to achieve these goals.
Because NIS is not standardized by the IETF, implementations vary by platform.
However, most platforms make available the following common NIS commands:
makedbm
Generate a DBM file from an input file.
ypcat
Display the contents of a DBM file.
ypinit
Set up an NIS master or subordinate server.
ypmake
Performs the same function as makedbm, but provides the option to push the resulting DBMs to subordinate servers.
ypmatch
Prints the values associated with one or more keys in a DBM.
yppasswd
Change a login password stored in a DBM.
yppush
Pushes DBMs to subordinate servers.
ypwhich
Indicates what NIS server a client is using.
ypxfr
Pulls a DBM from the master server.
12.4 Lightweight Directory Access Protocol (LDAP)
When implementing a Distributed Computing Environment (DCE), directory
services are automatically included because they are an intrinsic part of the DCE
architecture. However, though widely used, implementation of a DCE is not a
practical solution for every company needing directory services because it is an
“all-or-nothing” architecture. As such, if the other services provided by a DCE are
not required, or if implementation of the DCE model is not feasible (for example,
if it is not feasible to install the client software on every workstation within the
network), other directory service alternatives must be identified.
One such alternative is the Lightweight Directory Access Protocol (LDAP), which
is an open industry standard that has evolved to meet these needs. LDAP
defines a standard method for accessing and updating information in a directory,
and is gaining wide acceptance as the directory access method of the Internet. It
is supported by a growing number of software vendors and is being incorporated
into a growing number of applications.
For further information about LDAP, refer to the IBM Redbook Understanding
LDAP - Design and Implementation, SG24-4986.
Chapter 12. Directory and naming protocols
12.4.1 LDAP: Lightweight access to X.500
The OSI directory standard, X.500, specifies that communication between the
directory client and the directory server uses the Directory Access Protocol
(DAP). However, as an application layer protocol, DAP requires the entire OSI
protocol stack to operate, which requires more resources than are available in
many small environments. Therefore, an interface to an X.500 directory server
using a less resource-intensive or lightweight protocol was desired.
LDAP was developed as a lightweight alternative to DAP because it requires
only the more popular TCP/IP protocol stack rather than the full OSI protocol
stack. LDAP also simplifies some X.500 operations and omits some esoteric
features. Two
precursors to LDAP appeared as RFCs issued by the IETF, RFC 1202 –
Directory Assistance Service and RFC 1249 – DIXIE Protocol Specification.
These were both informational RFCs which were not proposed as standards.
The directory assistance service (DAS) defined a method by which a directory
client communicates to a proxy on an OSI-capable host, which issues X.500
requests on the client's behalf. DIXIE is similar to DAS, but provides a more
direct translation of the DAP.
The first version of LDAP was defined in RFC 1487 – X.500 Lightweight Access,
which was replaced by RFC 1777 – Lightweight Directory Access Protocol. Much
of the work on DIXIE and LDAP was carried out at the University of Michigan,
which provides reference implementations of LDAP and maintains LDAP-related
Web pages and mailing lists. Since then, LDAPv2 has been replaced by LDAP
Version 3. LDAPv3 is summarized in RFC 4510, but the technical specifications
are divided into multiple subsequent RFCs listed in Table 12-4.
Table 12-4 LDAP-related RFCs
RFC 4510 – Technical Specification Road Map
RFC 4511 – The Protocol
RFC 4512 – Directory Information Models
RFC 4513 – Authentication Methods and Security Mechanisms
RFC 4514 – String Representation of Distinguished Names
RFC 4515 – String Representation of Search Filters
RFC 4516 – Uniform Resource Locator
RFC 4517 – Syntaxes and Matching Rules
RFC 4518 – Internationalized String Preparation
RFC 4519 – Schema for User Applications
RFC 4520 – Internet Assigned Numbers Authority (IANA) Considerations for LDAP
RFC 4521 – Considerations for LDAP Extensions
RFC 4522 – The Binary Encoding Option
RFC 4523 – Schema Definitions for X.509 Certificates
RFC 4524 – COSINE LDAP/X.500 Schema
RFC 4525 – Modify-Increment Extension
RFC 4526 – Absolute True and False Filters
RFC 4527 – Read Entry Controls
RFC 4528 – Assertion Control
RFC 4529 – Requesting Attributes by Object Class in LDAP
RFC 4530 – entryUUID Operational Attribute
RFC 4531 – Turn Operation
RFC 4532 – “Who Am I” Operation
RFC 4533 – Content Synchronization Operation
Though an application program interface (API) for previous versions of LDAP
was limited to the specifications in RFC 1823, LDAPv3 provides both a C API
and a Java™ Naming and Directory Interface (JNDI).
12.4.2 The LDAP directory server
LDAP defines a communication protocol. That is, it defines the transport and
format of messages used by a client to access data in an X.500-like directory.
LDAP does not define the directory service itself. An application client program
initiates an LDAP message by calling an LDAP API. But an X.500 directory
server does not understand LDAP messages. In fact, the LDAP client and X.500
server even use different communication protocols (TCP/IP versus OSI). The
LDAP client actually communicates with a gateway process (also called a proxy
or front end) known as an LDAP server (see Figure 12-14 on page 462). The
LDAP server fulfils requests from the LDAP client by becoming a client of the
X.500 server and forwarding the requests to it. The
LDAP server must communicate using both TCP/IP (with the client) and OSI
(with the X.500 server).
Figure 12-14 LDAP server acting as a gateway to an X.500 directory server
As the use of LDAP grew and its benefits became apparent, people who did not
have X.500 servers or the environments to support them wanted to build
directories that could be accessed by LDAP clients. This requires that the LDAP
server store and access the directory itself instead of only acting as a gateway to
X.500 servers (see Figure 12-15). This eliminates any need for the OSI protocol
stack but, of course, makes the LDAP server much more complicated, because it
must store and retrieve directory entries. These LDAP servers are often called
stand-alone LDAP servers because they do not depend on an X.500 directory
server. Because LDAP does not support all X.500 capabilities, a stand-alone
LDAP server only needs to support the capabilities required by LDAP.
Figure 12-15 Stand-alone LDAP server
The concept of the LDAP server being able to provide access to local directories
supporting the X.500 model, rather than acting only as a gateway to an X.500
server, is discussed in RFC 4511 (see Table 12-4 on page 460). From the client's
point of view, any server that implements the LDAP protocol is an LDAP directory
server, whether the server actually implements the directory or is a gateway to an
X.500 server. The directory that is accessed can be called an LDAP directory,
whether the directory is implemented by a stand-alone LDAP server or by an
X.500 server.
12.4.3 Overview of LDAP architecture
LDAP defines the content and format of messages exchanged between an LDAP
client and an LDAP server. The messages specify the operations requested by
the client (search, modify, delete, and so on), the responses from the server, and
the format of data carried in the messages. LDAP messages are carried over
TCP/IP, a connection-oriented protocol, so there are also operations to establish
and disconnect a session between the client and server.
The general interaction between an LDAP client and an LDAP server takes the
following form:
1. The client establishes a session with an LDAP server. This is known as
binding to the server. The client specifies the host name or IP address and
TCP/IP port number where the LDAP server is listening. The client can
provide a user name and a password to properly authenticate with the server,
or the client can establish an anonymous session with default access rights.
The client and server can also establish a session that uses stronger security
methods, such as encryption of data (see 12.4.5, “LDAP security” on
page 471).
2. The client then performs operations on directory data. LDAP offers both read
and update capabilities. This allows directory information to be managed as
well as queried. LDAP supports searching the directory for data meeting
arbitrary user-specified criteria. Searching is the most common operation in
LDAP. A user can specify what part of the directory to search and what
information to return. A search filter that uses Boolean conditions specifies
which directory data matches the search.
3. When the client has finished making requests, it closes the session with the
server. This is also known as unbinding.
Because LDAP was originally intended as a lightweight alternative to DAP for
accessing X.500 directories, the LDAP server follows an X.500 model. The
directory stores and organizes data structures known as entries. A directory
entry usually describes an object such as a person, a printer, a server, and so
on. Each entry has a name called a distinguished name (DN) that uniquely
identifies it. The DN consists of a sequence of parts called relative distinguished
names (RDNs), much like a file name can consist of a path of directory names.
The entries can be arranged into a hierarchical tree-like structure based on their
distinguished names. This tree of directory entries is called the directory
information tree (DIT).
LDAP defines operations for accessing and modifying directory entries, such as:
򐂰 Searching for entries meeting user-specified criteria
򐂰 Adding an entry
򐂰 Deleting an entry
򐂰 Modifying an entry
򐂰 Modifying the distinguished name or relative distinguished name of an entry
򐂰 Comparing an entry
12.4.4 LDAP models
LDAP can be better understood by considering the four models upon which it is
based:
Information model    Describes the structure of information stored in an LDAP
                     directory.
Naming model         Describes how information in an LDAP directory is organized
                     and identified.
Functional model     Describes the operations that can be performed on the
                     information stored in an LDAP directory.
Security model       Describes how the information in an LDAP directory can be
                     protected from unauthorized access.
The following sections discuss the first three LDAP models. We describe LDAP
security in 12.4.5, “LDAP security” on page 471.
The information model
The basic unit of information stored in the directory is an entry, which represents
an object of interest in the real world such as a person, server, or organization.
Each entry contains one or more attributes that describe the entry. Each attribute
has a type and one or more values. For example, the directory entry for a person
might have an attribute called telephoneNumber. The syntax of the
telephoneNumber attribute specifies that a telephone number must be a string of
numbers that can contain spaces and hyphens. The value of the attribute is the
person's telephone number, such as 123-456-7890 (a person might have
multiple telephone numbers, in which case this attribute would have multiple
values).
In addition to defining what data can be stored as the value of an attribute, an
attribute syntax also defines how those values behave during searches and other
directory operations. This is done using syntax and matching rules. The attribute
telephoneNumber, for example, might have a syntax that specifies:
򐂰 Lexicographic ordering.
򐂰 Case, spaces, and dashes are ignored during the comparisons.
򐂰 Values must be character strings.
For example, using the correct definitions, the telephone numbers
123-456-7890, 123456-7890, and 1234567890 are considered to be the same. A
few of the common syntaxes and matching rules, defined in RFC 4517, are listed
in Table 12-5.
Table 12-5 Examples of LDAP syntaxes and matching rules
Bit String         A sequence of binary digits.
Postal Address     A sequence of strings that form an address in a physical
                   mail system.
caseExactMatch     A matching rule requiring that string comparisons are
                   case-sensitive.
caseIgnoreMatch    A matching rule that does not require case-sensitive
                   string comparisons.
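The effect of a matching rule that ignores case, spaces, and dashes can be sketched as follows. The telephone_match function here is purely an illustration of the idea; it is not an implementation of the actual RFC 4517 rules:

```python
def telephone_match(a: str, b: str) -> bool:
    """Compare two telephoneNumber values the way a matching rule that
    ignores case, spaces, and hyphens would (simplified sketch only)."""
    def canon(s: str) -> str:
        # Strip the characters the rule declares insignificant.
        return s.replace(" ", "").replace("-", "").lower()
    return canon(a) == canon(b)

# All three forms from the text compare as equal:
print(telephone_match("123-456-7890", "123456-7890"))   # True
print(telephone_match("123-456-7890", "1234567890"))    # True
```

The important point is that the *stored* value is unchanged; only the comparison applies the normalization.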
Table 12-6 lists some common attributes defined by RFC 4519. Some attributes
have alias names that can be used wherever the full attribute name is used.
Table 12-6 Examples of LDAP attributes
commonName, cn     Common name of an entry (for example, John Smith).
surname, sn        A person's last name.
initials           A person's initials.
telephoneNumber    A person's telephone number.
An object class is a general description, sometimes called a template, of an
object type, as opposed to the description of a specific object of that type. For
example, the object class person has a surname attribute, while the object
describing John Smith has a surname attribute with the value Smith. The object
classes that a directory server can store and the attributes they contain are
described by a schema. A schema defines where object classes are allowed in
the directory, which attributes they must contain, which attributes are optional,
and the syntax of each attribute. For example, a schema might define a person
object class that requires that a person has a character-string surname attribute,
can optionally have a number-string telephoneNumber attribute, and so on.
Schema-checking ensures that all required attributes for an entry are present
before an entry is stored. Schemas also define the inheritance and subclassing
of objects, and where in the DIT structure (hierarchy) objects can appear.
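Schema checking can be sketched as a simple lookup of required and optional attributes. The schema dictionary below is an illustrative stand-in for a real server schema; the person class follows RFC 4519, where cn and sn are mandatory:

```python
# A minimal sketch of schema checking: before an entry is stored, verify
# that every attribute the object class requires is present and that no
# attribute falls outside the schema. SCHEMA is invented for illustration.
SCHEMA = {
    "person": {"must": {"cn", "sn"},
               "may": {"telephoneNumber", "description"}},
}

def check_entry(object_class: str, entry: dict) -> list:
    rules = SCHEMA[object_class]
    missing = sorted(rules["must"] - entry.keys())
    unknown = sorted(entry.keys() - rules["must"] - rules["may"])
    problems = [f"missing required attribute: {a}" for a in missing]
    problems += [f"attribute not in schema: {a}" for a in unknown]
    return problems          # an empty list means the entry passes

print(check_entry("person", {"cn": "John Smith"}))   # sn is missing
```

A real server performs the same test against its configured schema before accepting an add or modify operation.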
Though an implementation can define any schema to meet its needs, RFC 4519
defines a few standard schemas. Table 12-7 lists a few of the common schema
(object classes and their required attributes). In many cases, an entry can consist
of more than one object class.
Table 12-7 Examples of object classes and required attributes
country     Defines a country. Required attribute: c.
locality    Defines a place in the physical world.
person      Defines a person. Required attributes: cn, sn.
Because servers can define their own schema, LDAP allows a client to query a
server for the contents of the schema it supports.
The naming model
The LDAP naming model defines how entries are identified and organized.
Entries are organized in a tree-like structure called the directory information tree
(DIT). Entries are arranged within the DIT based on their distinguished name
(DN). A DN is a unique name that unambiguously identifies a single entry. DNs
are made up of a sequence of relative distinguished names (RDNs). Each RDN
in a DN corresponds to a branch in the DIT leading from the root of the DIT to the
directory entry.
Each RDN is derived from the attributes of the directory entry. In the simple and
common case, an RDN has the form <attribute-name>=<value>. A DN is
composed of a sequence of RDNs separated by commas. These relationships
are defined in RFC 4514.
An example of a DIT is shown in Figure 12-16. The example is very simple, but
can be used to illustrate some basic concepts. Each box represents a directory
entry. The root directory entry is conceptual and does not actually exist.
Attributes are listed inside each entry. The list of attributes shown is not
complete. For example, the entry for the country UK (c=UK) could have an
attribute called description with the value United Kingdom.
Figure 12-16 Example of a directory information tree (DIT)
It is usual to follow either a geographical or an organizational scheme to position
entries in the DIT. For example, entries that represent countries would be at the
top of the DIT. Below the countries would be national organizations, states, and
provinces, and so on. Below this level, entries might represent people within
those organizations or further subdivisions of the organization. The lowest layers
of the DIT entries can represent any object, such as people, printers, application
servers, and so on. The depth or breadth of the DIT is not restricted and can be
designed to suit application requirements.
Entries are named according to their position in the DIT. The directory entry in
the lower-right corner of Figure 12-16 on page 467 has the DN cn=John
Smith,o=IBM,c=UK.
Note: DNs read from leaf-to-root, as opposed to names in a file system
directory, which usually read from root-to-leaf.
The DN is made up of a sequence of RDNs. Each RDN is constructed from an
attribute (or attributes) of the entry it names. For example, the DN cn=John
Smith,o=IBM,c=UK is constructed by adding the RDN cn=John Smith to the DN of
the ancestor entry o=IBM,c=UK.
The DIT is described as tree-like because, strictly speaking, it is not a tree: aliases
allow the tree structure to be circumvented. This can
be useful if an entry belongs to more than one organization or if a commonly
used DN is too complex. Another common use of aliases is when entries are
moved within the DIT and you want access to continue to work as before. In
Figure 12-16 on page 467, cn=John,ou=LDAP Team,o=IBM,c=US is an alias for
cn=John Smith,o=IBM,c=UK.
Because an LDAP directory can be distributed, an individual LDAP server might
not store the entire DIT. Instead, it might store the entries for a particular
department but not the entries for the ancestors of the department. For example,
a server might store the entries for the Accounting department at Yredbookscorp.
The highest node in the DIT stored by the server would be
ou=Accounting,o=Yredbookscorp,c=US. The server would store entries
ou=Accounting,o=Yredbookscorp,c=US but not for c=US or for
o=Yredbookscorp,c=US. The highest entry stored by a server is called a suffix.
Each entry stored by the server ends with this suffix, so in this case, the suffix is
the entire ou=Accounting,o=Yredbookscorp,c=US.
A single server can support multiple suffixes. For example, in addition to storing
information about the Accounting department, the same server can store
information about the Sales department at MyCorp. The server then has the
suffixes ou=Accounting,o=Yredbookscorp,c=US and ou=Sales,o=MyCorp,c=US.
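Deciding whether a server holds a given entry then reduces to a suffix test on the DN. A minimal sketch, using the example suffixes above (case-insensitive comparison; DN escaping is ignored for brevity):

```python
def served_by(dn: str, suffixes: list) -> bool:
    """True when this server holds the entry: the DN equals, or ends
    with, one of the server's suffixes (simplified sketch)."""
    dn = dn.lower()
    return any(dn == s.lower() or dn.endswith("," + s.lower())
               for s in suffixes)

suffixes = ["ou=accounting,o=yredbookscorp,c=us",
            "ou=sales,o=mycorp,c=us"]

print(served_by("cn=Pat,ou=Accounting,o=Yredbookscorp,c=US", suffixes))  # True
print(served_by("o=Yredbookscorp,c=US", suffixes))                       # False
```

The second lookup fails because o=Yredbookscorp,c=US is an ancestor of the suffix, not below it, which is exactly the case the text describes: the server stores the department but not its ancestors.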
Because a server might not store the entire DIT, servers need to be linked
together in some way in order to form a distributed directory that contains the
entire DIT. This is accomplished with referrals. A referral acts as a pointer to an
entry on another LDAP server where requested information is stored. A referral is
an entry of objectClass referral. It has an attribute, ref, whose value is the LDAP
URL of the referred entry on another LDAP server. See 12.4.6, “LDAP URLs” on
page 474 for further information. Referrals allow a DIT to be partitioned and
distributed across multiple servers. Portions of the DIT can also be replicated.
This can improve performance and availability.
Note: When an application uses LDAP to request directory information from a
server, but the server only has a referral for that information, the LDAP URL
for that information is passed to the client. It is then the responsibility of that
client to contact the new server to obtain the information. This is unlike the
standard mechanisms of both DCE and X.500, where a directory server, if it
does not contain the requested information locally, will always obtain the
information from another server and pass it back to the client.
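Client-side referral chasing, as described in the note above, can be sketched as follows. The dict-of-dicts below stands in for real LDAP servers, and the server URLs are invented for illustration:

```python
# Sketch of client-side referral chasing: when a server answers with a
# referral instead of an entry, the *client* must contact the server
# named in the referral. SERVERS is a stand-in for real LDAP servers.
SERVERS = {
    "ldap://a.example.com": {
        "ou=Sales,o=MyCorp,c=US": {"referral": "ldap://b.example.com"},
    },
    "ldap://b.example.com": {
        "ou=Sales,o=MyCorp,c=US": {"ou": "Sales"},
    },
}

def lookup(server: str, dn: str, hops: int = 5) -> dict:
    entry = SERVERS[server][dn]
    if "referral" in entry:
        if hops == 0:                      # guard against referral loops
            raise RuntimeError("too many referrals")
        return lookup(entry["referral"], dn, hops - 1)
    return entry

print(lookup("ldap://a.example.com", "ou=Sales,o=MyCorp,c=US"))
```

The hop limit matters in practice: nothing prevents two misconfigured servers from referring to each other indefinitely.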
The functional model
LDAP defines operations for accessing and modifying directory entries. LDAP
operations can be divided into the following three categories:
Query           Includes the search and compare operations used to
                retrieve information from a directory.
Update          Includes the add, delete, modify, modify RDN, and
                unsolicited notification operations used to update stored
                information in a directory. These operations will normally
                be carried out by an administrator.
Authentication and control
                Includes the bind, unbind, abandon, and startTLS
                operations used to connect and disconnect to and from an
                LDAP server, establish access rights, and protect
                information. For further information, see 12.4.5, “LDAP
                security” on page 471.
The search operation
The most common operation is the search. This operation is very flexible and
therefore has some of the most complex options. The search operation allows a
client to request that an LDAP server search through some portion of the DIT for
information meeting user-specified criteria in order to read and list the results.
The search can be very general or very specific. The search operation allows the
specification of the starting point within the DIT, how deep within the DIT to
search, the attributes an entry must have to be considered a match, and the
attributes to return for matched entries.
Some example searches expressed informally are:
򐂰 Find the postal address for cn=John Smith,o=IBM,c=UK.
򐂰 Find all the entries that are children of the entry ou=ITSO,o=IBM,c=US.
򐂰 Find the e-mail address and phone number of anyone in an organization
whose last name contains the characters “miller” and who also has a fax
number.
To perform a search, the following parameters must be specified:
Base object          A DN that defines the starting point, called the base
                     object, of the search. The base object is a node within the
                     DIT.
Scope                Specifies how deep within the DIT to search from the
                     base object. There are three choices:
                     baseObject     Only the base object is examined.
                     singleLevel    Only the immediate children of the base
                                    object are examined; the base object
                                    itself is not examined.
                     wholeSubtree   The base object and all of its
                                    descendants are examined.
Alias dereferencing  Specifies if aliases are dereferenced. That is, the actual
                     object of interest, pointed to by an alias entry, is
                     examined. Not dereferencing aliases allows the alias
                     entries themselves to be examined. This parameter must
                     be one of the following:
                     neverDerefAliases    Do not dereference aliases.
                     derefInSearching     Dereference aliases only when
                                          searching subordinates of the base
                                          object.
                     derefFindingBaseObj  Dereference aliases only when
                                          searching for the base object, but not
                                          when searching subordinates of the
                                          base object.
                     derefAlways          Always dereference aliases.
Size Limit           The maximum number of entries that should be returned
                     as a result of the search.
Time Limit           The maximum number of seconds allowed to perform the
                     search. Specifying zero indicates that there is no time
                     limit.
Types Only           This parameter has two possible values:
                     true     Only attribute descriptions are returned.
                     false    Attribute descriptions and values are
                              returned.
Search filter        Specifies the criteria an entry must match to be returned
                     from a search. The search filter is a Boolean combination
                     of attribute value assertions. An attribute value assertion
                     tests the value of an attribute for equality, less than or
                     equal, and so on.
Attributes to return Specifies which attributes to retrieve from entries that
                     match the search criteria. Because an entry can have
                     many attributes, this allows the user to only see the
                     attributes in which they are interested.
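The search filter parameter, a Boolean combination of attribute value assertions, can be sketched with a toy evaluator. Filters are expressed here as nested tuples rather than RFC 4515 filter strings, simply to keep the sketch short; only equality assertions are supported:

```python
# Toy evaluator for Boolean combinations of attribute value assertions.
# Supported operators: "=" (equality, case-insensitive here), "&", "|", "!".
def matches(entry: dict, f) -> bool:
    op = f[0]
    if op == "=":                      # attribute value assertion
        _, attr, value = f
        return entry.get(attr, "").lower() == value.lower()
    if op == "&":
        return all(matches(entry, sub) for sub in f[1:])
    if op == "|":
        return any(matches(entry, sub) for sub in f[1:])
    if op == "!":
        return not matches(entry, f[1])
    raise ValueError(f"unknown operator: {op}")

entry = {"cn": "John Smith", "sn": "Smith", "mail": "jsmith@example.com"}

# Equivalent of the RFC 4515 filter (&(sn=smith)(!(cn=Jane Smith))):
print(matches(entry, ("&", ("=", "sn", "smith"),
                           ("!", ("=", "cn", "Jane Smith")))))   # True
```

A real server evaluates the same kind of expression against every entry within the requested scope, returning only the matching entries.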
12.4.5 LDAP security
Security is of great importance in the networked world of computers, and this is
true for LDAP as well. When sending data over insecure networks, internally or
externally, sensitive information might need to be protected during
transportation. There is also a need to know who is requesting the information
and who is sending it. This is especially important when it comes to the update
operations on a directory. RFC 4513 discusses the authentication methods and
security mechanisms available in LDAPv3, which address the following areas:
Authentication    Assurance that the opposite party (machine or person)
                  really is who he/she/it claims to be.
Integrity         Assurance that the information that arrives is really the
                  same as what was sent.
Confidentiality   Protection against information disclosure, by means of
                  data encryption, to those who are not intended to receive
                  it.
Authorization     Assurance that a party is really allowed to do what it is
                  requesting to do, usually checked after user
                  authentication. Authorization is achieved by assigning
                  access controls, such as read, write, or delete, for user
                  IDs or common names to the resources being accessed.
                  Because these attributes are usually platform-specific,
                  LDAP does not provide specific controls. Instead, it has
                  built-in mechanisms to allow the use of the
                  platform-specific controls.
Because the use of authorization controls is platform-specific, the following
sections describe only the authentication, integrity, and confidentiality. There are
several mechanisms that can be used for this purpose; the most important ones
are discussed here. These are:
򐂰 No authentication
򐂰 Basic authentication
򐂰 Simple Authentication and Security Layer (SASL)
򐂰 Transport Layer Security (TLS)
The security mechanism to be used in LDAP is negotiated when the connection
between the client and the server is established.
No authentication
No authentication should only be used when data security is not an issue and
when no special access control permissions are involved. This might be the
case, for example, when your directory is an address book browsable by
anybody. No authentication is assumed when you leave the password and DN
field empty in the bind API call. The LDAP server then automatically assumes an
anonymous user session and grants access with the appropriate access controls
defined for this kind of access (not to be confused with the SASL anonymous
user discussed in “Simple Authentication and Security Layer (SASL)”).
Basic authentication
Basic authentication is also used in several other Web-related protocols, such as
HTTP. When using basic authentication with LDAP, the client identifies itself to
the server by means of a DN and a password, which are sent in the clear over
the network (some implementations might use Base64 encoding instead). The
server considers the client authenticated if the DN and password sent by the
client match the password for that DN stored in the directory. Base64 encoding
is defined in the Multipurpose Internet Mail Extensions, or MIME (see 15.3,
“Multipurpose Internet Mail Extensions (MIME)” on page 571). Base64 is an
encoding, not encryption, and it is trivial to reverse after the data has been
captured on the network.
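A few lines of Python make this concrete. The DN and password below are invented for illustration; the point is only that anyone capturing the traffic can recover the credentials:

```python
import base64

# Basic authentication offers no real protection: Base64 is a reversible
# encoding, not encryption, so captured traffic reveals the credentials.
credentials = "cn=John Smith,o=IBM,c=UK:secret"

wire = base64.b64encode(credentials.encode())   # what crosses the network
print(wire.decode())                            # looks scrambled, but...
print(base64.b64decode(wire).decode())          # ...is trivially recovered
```

This is why basic authentication should only be used over a connection that is itself protected, for example by TLS.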
Simple Authentication and Security Layer (SASL)
SASL is a framework for adding additional authentication mechanisms to
connection-oriented protocols, and is defined in RFC 4422. It has been added to
LDAPv3 to overcome the authentication shortcomings of Version 2. SASL was
originally devised to add stronger authentication to the IMAP protocol (see 15.5,
“Internet Message Access Protocol (IMAP4)” on page 591), but has since
evolved into a more general system for mediating between protocols and
authentication systems.
In SASL, connection protocols, such as LDAP, IMAP, and so on, are represented
by profiles; each profile is considered a protocol extension that allows the
protocol and SASL to work together. A complete list of SASL profiles can be
obtained from the Information Sciences Institute (ISI). Among these are IMAP,
SMTP, POP, and LDAP. Each protocol that intends to use SASL needs to be
extended with a command to identify an authentication mechanism and to carry
out an authentication exchange. Optionally, a security layer can be negotiated to
encrypt the data after authentication and ensure confidentiality. LDAPv3 includes
such a command (ldap_sasl_bind() or ldap_sasl_bind_s()). The key
parameters that influence the security method used are:
dn           This is the distinguished name of the entry that is to bind.
             This can be thought of as the user ID in a normal user ID and
             password authentication.
mechanism    This is the name of the security method to use. Valid security
             mechanisms are, currently:
             OTP         The One Time Password mechanism
                         (defined in RFC 2444).
             GSSAPI      The Generic Security Services Application
                         Program Interface (defined in RFC 2743).
             CRAM-MD5    The Challenge/Response Authentication
                         Mechanism (defined in RFC 2195), based
                         on the HMAC-MD5 MAC algorithm.
             DIGEST-MD5  An HTTP Digest-compatible CRAM based
                         on the HMAC-MD5 MAC algorithm.
             EXTERNAL    An external mechanism. Usually this is
                         TLS, discussed in “Transport Layer
                         Security (TLS)” on page 474.
             ANONYMOUS   Unauthenticated access.
credentials  This contains the arbitrary data that identifies the DN. The
             format and content of the parameter depend on the
             mechanism chosen. If it is, for example, the ANONYMOUS
             mechanism, it can be an arbitrary string or an e-mail address
             that identifies the user.
SASL provides a high-level framework that lets the involved parties decide on the
particular security mechanism to use. The SASL security mechanism negotiation
between client and server is done in the clear. After the client and the server
agree on a common mechanism, the connection is secure against modifying the
authentication identities. However, an attacker might try to eavesdrop on the
mechanism negotiation and cause a party to use the least secure mechanism. In
order to prevent this from happening, configure clients and servers to use a
minimum security mechanism, provided they support such a configuration
option. As stated earlier, SSL and its successor, TLS, are the mechanisms
commonly used in SASL for LDAP. For details about these protocols, refer to
22.7, “Secure Sockets Layer (SSL)” on page 854.
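Guarding the negotiation with a minimum-strength floor can be sketched as follows. The ranking of mechanisms below is purely illustrative, not normative; real deployments would rank mechanisms according to their own policy:

```python
# Sketch of SASL mechanism negotiation with a configured minimum:
# intersect the client's and server's mechanism lists, pick the strongest
# common one, and refuse to proceed if it falls below the floor.
# RANK orders mechanisms weakest-first (illustrative ordering only).
RANK = ["ANONYMOUS", "CRAM-MD5", "DIGEST-MD5", "OTP", "GSSAPI", "EXTERNAL"]

def negotiate(client: list, server: list, minimum: str) -> str:
    common = set(client) & set(server)
    if not common:
        raise RuntimeError("no common SASL mechanism")
    best = max(common, key=RANK.index)
    if RANK.index(best) < RANK.index(minimum):
        raise RuntimeError(f"best common mechanism {best} is below minimum")
    return best

print(negotiate(["EXTERNAL", "DIGEST-MD5", "ANONYMOUS"],
                ["DIGEST-MD5", "ANONYMOUS"],
                "DIGEST-MD5"))           # DIGEST-MD5
```

With the floor set to DIGEST-MD5, a downgrade attack that strips the stronger mechanisms from the negotiation causes the bind to fail rather than silently fall back to ANONYMOUS.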
Because no data encryption method was specified in LDAPv2, some vendors
added their own SSL calls to the LDAP API. A potential drawback of such an
approach is that the API calls might not be compatible among different vendor
implementations. The use of SASL, as specified in LDAPv3, assures
compatibility, although it is likely that vendor products will support only a subset
of the possible range of mechanisms (possibly only SSL).
Transport Layer Security (TLS)
Transport Layer Security (TLS) is available through the SASL EXTERNAL
method, described earlier. An LDAP client can opt to secure a session using TLS
at any point during a transaction with an LDAP server, except when:
򐂰 The session is already TLS protected.
򐂰 A multi-stage SASL negotiation is in progress.
򐂰 The client is awaiting a response from an operation request.
To request that TLS be set up on the session, the client sends a StartTLS
message to the server. This enables the client and server to exchange
certificates. RFC 4513 requires that, in addition to this, the client verifies the
server’s identity using the DNS or IP address presented in the server’s
certificate. This prevents a client’s attempt to connect to a server from being
intercepted by a malicious user, who might then stage a man-in-the-middle attack.
After this has occurred, the client and server can negotiate a ciphersuite.
12.4.6 LDAP URLs
Because LDAP has become an important protocol on the Internet, a URL format
for LDAP resources has been defined in RFC 4516. LDAP URLs begin with
ldap:// or ldaps://, if the LDAP server communicates using SSL. LDAP URLs can
simply name an LDAP server, or can specify a complex directory search.
Some examples help make the format of LDAP URLs clear. The following
example refers to the LDAP server on the host (using
the well-known port 389):
Additionally, search options can be specified in the URL. The following example
retrieves all the attributes for the DN ou=Accounting,c=US from the LDAP server
on the specified host. In this case, the nonstandard port 4389 is
explicitly specified as an example.
The following example retrieves all the attributes for the DN
cn=JohnSmith,ou=Sales,o=myCorp,c=US. Note that some characters are
considered unsafe in URLs because they can be removed or treated as
delimiters by some programs. Unsafe characters such as space, comma,
brackets, and so forth need to be represented by their hexadecimal value
preceded by the percent sign:
In this example, %20 is a space. More information about unsafe characters and
URLs in general is in RFC 4516.
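Percent-encoding a DN for use in an LDAP URL can be done with the Python standard library. The host name below is invented for illustration:

```python
from urllib.parse import quote, unquote

# Unsafe characters in the DN portion of an LDAP URL must be
# percent-encoded; the space in "John Smith" becomes %20.
dn = "cn=John Smith,ou=Sales,o=myCorp,c=US"
encoded = quote(dn, safe="=,")                 # keep '=' and ',' readable

url = "ldap://ldap.example.com/" + encoded     # hypothetical host
print(url)

print(unquote(encoded) == dn)                  # round-trips: True
```

Decoding with unquote recovers the original DN exactly, which is what allows LDAP clients and servers to pass DNs through URLs without loss.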
In addition to options, the URL can specify what values attributes are to be
returned using the ? symbol. For example, assume we want to find the U.S.
address of myCorp. We use the following URL:
12.4.7 LDAP and DCE
DCE has its own Cell Directory Service, or CDS (see 13.3.1, “DCE directory
service” on page 498). If applications never access resources outside of their
DCE cell, only CDS is required. However, if an application needs to communicate
with resources in other DCE cells, the Global Directory Agent (GDA) is required.
The GDA accesses a global (that is, non-CDS) directory where the names of
DCE cells can be registered. This global directory (GDS) can be either a Domain
Name System (DNS) directory or an X.500 directory. The GDA retrieves the
address of a CDS server in the remote cell. The remote CDS can then be
contacted to find DCE resources in that cell. Using the GDA enables an
organization to link multiple DCE cells together using either a private directory on
an intranet or a public directory on the Internet.
In view of LDAP's strong presence in the Internet, two LDAP projects have been
sponsored by The Open Group to investigate LDAP integration with DCE
LDAP interface for the GDA
One way LDAP is being integrated into DCE is to allow DCE cells to be
registered in LDAP directories. The GDA in a cell that wants to connect to remote
cells is configured to enable access to the LDAP directory (see Figure 12-17).
Figure 12-17 The LDAP interface for GDA
DCE originally only supported X.500 and DNS name syntax for cell names.
LDAP and X.500 names both follow the same hierarchical naming model, but
their syntax is slightly different. X.500 names are written in reverse order and use
a slash (/) rather than a comma (,) to separate relative distinguished names.
When the GDA is configured to use LDAP, it converts cell names in X.500 format
into the LDAP format.
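That conversion amounts to reversing the order of the relative distinguished names and swapping the separator. A minimal sketch (illustrative only; it ignores the character escaping and attribute abbreviations a real GDA must handle):

```python
def x500_to_ldap(cell_name: str) -> str:
    """Convert a slash-separated X.500-style cell name, written root
    first, into LDAP syntax: comma-separated RDNs, least significant
    RDN first."""
    rdns = [rdn for rdn in cell_name.split("/") if rdn]
    return ",".join(reversed(rdns))

# Hypothetical cell name:
print(x500_to_ldap("/C=US/O=myCorp/OU=cellA"))  # OU=cellA,O=myCorp,C=US
```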
LDAP interface for the CDS
DCE provides two programming interfaces to the Directory Service; Name
Service Interface (NSI) and the X/Open Directory Service (XDS). XDS is an
X.500-compatible interface used to access information in the GDS, and it can
also be used to access information in the CDS. However, the use of NSI is much
more common in DCE applications. The NSI API provides functionality that is
specifically tailored for use with DCE client and server programs that use RPC.
NSI allows servers to register their address and the type of RPC interface they
support. This address/interface information is called an RPC binding, and is
needed by clients that want to contact the server. NSI allows clients to search the
CDS for RPC binding information.
NSI was designed to be independent of the directory where the RPC bindings
are stored. However, the only supported directory to date has been CDS. NSI will
be extended to also support adding and retrieving RPC bindings from an LDAP
directory. This will allow servers to advertise their RPC binding information in
either CDS or an LDAP directory. Application programs can use either the NSI or
the LDAP API when an LDAP directory is used (see Figure 12-18).
Figure 12-18 The LDAP interface for NSI
12.4.8 The Directory-Enabled Networks (DEN) initiative
In September 1997, Cisco Systems Inc. and Microsoft® Corp. announced the
so-called Directory-Enabled Networks (DEN) initiative as a result of
collaborative work. Many companies, such as IBM, either support this initiative or
actively participate in ad hoc working groups (ADWGs). DEN represents an
information model specification for an integrated directory that stores information
about people, network devices, and applications. The DEN schema defines the
object classes and their related attributes for those objects. As such, DEN is a
key piece to building intelligent networks, where products from multiple vendors
can store and retrieve topology and configuration-related data.
Of special interest is that the DEN specification defines LDAPv3 as the core
protocol for accessing DEN information, which makes information available to
LDAP-enabled clients or network devices, or both.
DEN makes use of the Common Information Model (CIM). CIM details a way of
integrating different management models such as SNMP MIBs and DMTF MIFs.
At the time of writing, the most current CIM schema was version 2.12, released in
April of 2006.
More information about the DEN initiative can be found on the founders' Web sites at:
12.4.9 Web-Based Enterprise Management (WBEM)
WBEM is a set of standards designed to deliver an integrated set of
management tools for the enterprise. By making use of XML and CIM, it
becomes possible to manage network devices, desktop systems, telecom
systems, and application systems, all from a Web browser. For further
information, see:
12.5 RFCs relevant to this chapter
The following RFCs provide detailed information about the directory and naming
protocols and architectures presented throughout this chapter:
򐂰 RFC 1032 – Domain administrators guide (November 1987)
򐂰 RFC 1033 – Domain administrators operations guide (November 1987)
򐂰 RFC 1034 – Domain names - concepts and facilities (November 1987)
򐂰 RFC 1035 – Domain names - implementation and specifications
(November 1987)
򐂰 RFC 1101 – DNS encoding of network names and other types (April 1989)
򐂰 RFC 1183 – New DNS RR Definitions (October 1990)
򐂰 RFC 1202 – Directory Assistance service (February 1991)
򐂰 RFC 1249 – DIXIE Protocol Specification (August 1991)
򐂰 RFC 1348 – DNS NSAP RRs (July 1992)
򐂰 RFC 1480 – The US Domain (June 1993)
򐂰 RFC 1706 – DNS NSAP Resource Records (October 1994)
򐂰 RFC 1823 – The LDAP Application Programming Interface (August 1995)
򐂰 RFC 1876 – A Means for Expressing Location Information in the Domain
Name System (January 1996)
򐂰 RFC 1995 – Incremental Zone Transfer in DNS (August 1996)
򐂰 RFC 1996 – A Mechanism for Prompt Notification of Zone Changes (DNS
NOTIFY) (August 1996)
򐂰 RFC 2136 – Dynamic Updates in the Domain Name System (DNS UPDATE)
(April 1997)
򐂰 RFC 2444 – The One-time-Password SASL Mechanism (October 1998)
򐂰 RFC 4592 – The Role of Wildcards in the Domain Name System (July 2006)
򐂰 RFC 2743 – Generic Security Service Application Program Interface Version
2, Update 1 (January 2000)
򐂰 RFC 2874 – DNS Extensions to Support IPv6 Address Aggregation and
Renumbering (July 2000)
򐂰 RFC 3007 – Secure Domain Name Systems (DNS) Dynamic Update
(November 2000)
򐂰 RFC 3494 – Lightweight Directory Access protocol version 2 (LDAPv2)
(March 2003)
򐂰 RFC 3596 – DNS Extensions to Support IP Version 6 (October 2003)
򐂰 RFC 3645 – Generic Security Service Algorithm for Secret Key Transaction
Authentication for DNS (GSS-TSIG) (October 2003)
򐂰 RFC 3901 – DNS IPv6 Transport Operational Guidelines (September 2004)
򐂰 RFC 4033 – DNS Security Introduction and Requirements (March 2005)
򐂰 RFC 4034 – Resource Records for the DNS Security Extensions
(March 2005)
򐂰 RFC 4035 – Protocol Modifications for the DNS Security Extensions
(March 2005)
򐂰 RFC 4339 – IPv6 Host Configuration of DNS Server Information Approaches
(February 2006)
򐂰 RFC 4398 – Storing Certificates in the Domain Name System (DNS)
(March 2006)
򐂰 RFC 4422 – Simple Authentication and Security Layer (SASL) (June 2006)
򐂰 RFC 4501 – Domain Name System Uniform Resource Identifiers (May 2006)
򐂰 RFC 4505 – Anonymous Simple Authentication and Security Layer (SASL)
(June 2006)
򐂰 RFC 4510 – Lightweight Directory Access Protocol (LDAP): Technical
Specification Road Map (June 2006)
򐂰 RFC 4511 – Lightweight Directory Access Protocol (LDAP): The Protocol
(June 2006)
򐂰 RFC 4512 – Lightweight Directory Access Protocol (LDAP): Directory
Information Models (June 2006)
򐂰 RFC 4513 – Lightweight Directory Access Protocol (LDAP): Authentication
Methods and Security Mechanisms (June 2006)
򐂰 RFC 4514 – Lightweight Directory Access Protocol (LDAP): String
Representation of Distinguished Names (June 2006)
򐂰 RFC 4515 – Lightweight Directory Access Protocol (LDAP): String
Representation of Search Filters (June 2006)
򐂰 RFC 4516 – Lightweight Directory Access Protocol (LDAP): Uniform
Resource Locator (June 2006)
򐂰 RFC 4517 – Lightweight Directory Access Protocol (LDAP): Syntaxes and
Matching Rules (June 2006)
򐂰 RFC 4518 – Lightweight Directory Access Protocol (LDAP): Internationalized
String Preparation (June 2006)
򐂰 RFC 4519 – Lightweight Directory Access Protocol (LDAP): Schema for User
Applications (June 2006)
򐂰 RFC 4520 – Internet Assigned Numbers Authority (IANA) Considerations for
the Lightweight Directory Access Protocol (LDAP) (June 2006)
򐂰 RFC 4521 – Considerations for Lightweight Directory Access Protocol
(LDAP) (June 2006)
򐂰 RFC 4522 – Lightweight Directory Access Protocol (LDAP): The Binary
Encoding Option (June 2006)
򐂰 RFC 4523 – Lightweight Directory Access Protocol (LDAP): Schema
Definitions for X.509 Certificates (June 2006)
򐂰 RFC 4524 – Lightweight Directory Access Protocol (LDAP): COSINE/LDAP
X.500 Schema (June 2006)
򐂰 RFC 4525 – Lightweight Directory Access Protocol (LDAP): Modify-Increment
Extension (June 2006)
򐂰 RFC 4526 – Lightweight Directory Access Protocol (LDAP): Absolute True
and False Filters (June 2006)
򐂰 RFC 4527 – Lightweight Directory Access Protocol (LDAP): Read Entry
Controls (June 2006)
򐂰 RFC 4528 – Lightweight Directory Access Protocol (LDAP): Assertion Control
(June 2006)
򐂰 RFC 4529 – Requesting Attributes by Object Class in the Lightweight
Directory Access Protocol (LDAP) (June 2006)
򐂰 RFC 4530 – Lightweight Directory Access Protocol (LDAP): entryUUID
Operational Attribute (June 2006)
򐂰 RFC 4531 – Lightweight Directory Access Protocol (LDAP): Turn Operation
(June 2006)
򐂰 RFC 4532 – Lightweight Directory Access Protocol (LDAP): “Who Am I?”
Operation (June 2006)
򐂰 RFC 4533 – Lightweight Directory Access Protocol (LDAP): Content
Synchronization Operation (June 2006)
Chapter 13.
Remote execution and
distributed computing
One of the most fundamental mechanisms employed on networked computers is
the ability to execute commands on remote systems. That is, a user wants to
invoke an application on a remote machine. A number of application protocols
exist to allow
this remote execution capability, most notably the Telnet protocol. This chapter
discusses some of these protocols. In addition, we discuss the concept of
distributed computing.
13.1 Telnet
Telnet is a standard protocol with STD number 8. Its status is recommended. It is
described in RFC 854 – Telnet Protocol Specification and RFC 855 – Telnet
Option Specifications.
The Telnet protocol provides a standardized interface, through which a program
on one host (the Telnet client) can access the resources of another host (the
Telnet server) as though the client were a local terminal connected to the server.
See Figure 13-1 for more details.
For example, a user on a workstation on a LAN can connect to a host attached to
the LAN as though the workstation were a terminal attached directly to the host.
Of course, Telnet can be used across WANs as well as LANs.
Figure 13-1 Telnet operation
Most Telnet implementations do not provide you with graphics capabilities.
13.1.1 Telnet operation
Telnet protocol is based on three ideas:
򐂰 The Network Virtual Terminal (NVT) concept. An NVT is an imaginary device
with a basic structure common to a wide range of real terminals. Each host
maps its own terminal characteristics to those of an NVT and assumes that
every other host will do the same.
򐂰 A symmetric view of terminals and processes.
򐂰 Negotiation of terminal options. The principle of negotiated options is used by
the Telnet protocol, because many hosts want to provide additional services,
beyond those available with the NVT. Various options can be negotiated. The
server and client use a set of conventions to establish the operational
characteristics of their Telnet connection through the “DO, DONT, WILL,
WONT” mechanism discussed later in this chapter.
The two hosts begin by verifying their mutual understanding. After this initial
negotiation is complete, they are capable of working on the minimum level
implemented by the NVT. After this minimum understanding is achieved, they
can negotiate additional options to extend the capabilities of the NVT to reflect
more accurately the capabilities of the real hardware in use. Because of the
symmetric model used by Telnet (see Figure 13-2), both the host and the client
can propose additional options to be used.
Figure 13-2 Telnet negotiations
13.1.2 Network Virtual Terminal
The NVT has a printer (or display) and a keyboard. The keyboard produces
outgoing data, which is sent over the Telnet connection. The printer receives the
incoming data. The basic characteristics of an NVT, unless they are modified by
mutually agreed options, are:
򐂰 The data representation is 7-bit ASCII transmitted in 8-bit bytes.
򐂰 The NVT is a half-duplex device operating in a line-buffered mode.
򐂰 The NVT provides a local echo function.
All of these can be negotiated by the two hosts. For example, a local echo is
preferred because of the lower network load and superior performance, but there
is an option for using a remote echo (see Figure 13-3), although no host is
required to use it.
Figure 13-3 Telnet: Echo option
An NVT printer has an unspecified carriage width and page length. It can handle
printable ASCII characters (ASCII code 32 to 126) and understands some ASCII
control characters, such as those shown in Table 13-1.
Table 13-1 ASCII control characters

Character              Action
Null (NUL)             No operation.
Line feed (LF)         Moves printer to next line, keeping the same horizontal position.
Carriage return (CR)   Moves the printer to the left margin of the current line.
Bell (BEL)             Produces an audible or visible signal.
Backspace (BS)         Moves print head one character position towards the left margin.
Horizontal tab (HT)    Moves print head to the next horizontal tab stop.
Vertical tab (VT)      Moves printer to next vertical tab stop.
Form feed (FF)         Moves to top of next page, keeping the same horizontal position.
13.1.3 Telnet options
There is an extensive set of Telnet options, and the reader should consult
STD 1 – Official Internet Protocol Standards for the standardization state and
status for each of them. At the time of writing, the following options were defined
(Table 13-2).
Table 13-2 Telnet options

Number   Option name
0        Binary transmission
3        Suppress Go Ahead
4        Approximate message size negotiation
6        Timing mark
7        Remote controlled trans and echo
8        Output line width
9        Output page size
10       Output carriage-return disposition
11       Output horizontal tab stops
12       Output horizontal tab disposition
13       Output form feed disposition
14       Output vertical tab stops
15       Output vertical tab disposition
16       Output line feed disposition
17       Extended ASCII
19       Byte macro
20       Data entry terminal
22       SUPDUP output
23       Send location
24       Terminal type
25       End of record
26       TACACS user identification
27       Output marking
28       Terminal location number
29       Telnet 3270 regime
31       Negotiate window size
32       Terminal speed
33       Remote flow control
35       X Display location
37       Telnet authentication option
39       Telnet environment option
40       TN3270 enhancements
41       Telnet xauth
42       Telnet charset
43       Telnet remote serial port
44       Telnet com port control
255      Extended options list
All of the standard options have a status of recommended and the remainder
have a status of elective. There is a historic version of the Telnet Environment
option which is not recommended; it is Telnet option 36 and was defined in RFC
Full-screen capability
Full-screen Telnet is possible provided the client and server have compatible
full-screen capabilities. For example, VM and MVS provide a TN3270-capable
server. To use this facility, a Telnet client must support TN3270.
13.1.4 Telnet command structure
The communication between client and server is handled with internal
commands, which are not accessible by users. All internal Telnet commands
consist of 2- or 3-byte sequences, depending on the command type.
The Interpret As Command (IAC) character is followed by a command code. If
this command deals with option negotiation, the command will have a third byte
to show the code for the referenced option. See Figure 13-4 for more details.
Figure 13-4 Telnet: Internal Telnet command proposes negotiation about terminal type
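The command in Figure 13-4 can be written out as raw bytes. The sketch below uses the command and option codes from RFC 854 and the Telnet options registry (IAC = 255, DO = 253, terminal type = 24):

```python
IAC, DO, TERMINAL_TYPE = 255, 253, 24
NOP = 241

# A 3-byte option-negotiation command (byte 1, byte 2, byte 3):
proposal = bytes([IAC, DO, TERMINAL_TYPE])
assert proposal == b"\xff\xfd\x18"

# Commands that do not negotiate an option are only 2 bytes, e.g. IAC NOP:
assert bytes([IAC, NOP]) == b"\xff\xf1"
```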
Table 13-3 shows some of the possible command codes.
Table 13-3 Command codes
Command     Code   Comment
SE          240    End of sub-negotiation parameters.
NOP         241    No operation.
Data Mark   242    The data stream portion of a Synch. This must always be accompanied by a TCP urgent notification.
Break       243    NVT character BRK.
Go Ahead    249    The GA signal.
SB          250    Start of sub-negotiation of the option indicated by the immediately following code.
WILL        251    Shows the desire to use, or confirmation of using, the option indicated by the code immediately following.
WONT        252    Shows refusal to use, or to continue to use, the option indicated by the code immediately following.
DO          253    Requests that the other party use, or confirms that you are expecting the other party to use, the option indicated by the code immediately following.
DONT        254    Demands that the other party stop using, or confirms that you are no longer expecting the other party to use, the option indicated by the code immediately following.
IAC         255    Interpret As Command. Indicates that what follows is a Telnet command, not data.
13.1.5 Option negotiation
Using internal commands, Telnet is able to negotiate options in each host. The
starting base of negotiation is the NVT capability: Each host to be connected
must agree to this minimum. Every option can be negotiated by the use of the
four command codes WILL, WONT, DO, and DONT. In addition, some options
have suboptions: If both parties agree to the option, they use the SB and SE
commands to manage the sub-negotiation. Table 13-4 shows a simplified
example of how option negotiation works.
Table 13-4 Option negotiation

Request                        Response                            Meaning
DO transmit binary             WILL transmit binary
DO window size                 WILL window size                    Can we negotiate window size?
SB window size 0 80 0 24 SE                                        Specify window size: 80 columns by 24 rows.
DO terminal type               WILL terminal type                  Can we negotiate terminal type?
SB terminal type SE            SB terminal type IBM-3278-2 SE      Send me your terminal type. / My terminal is a 3278-2.
DO echo                        WONT echo
The terminal types are defined in STD 2 – Assigned Numbers.
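The exchange in Table 13-4 can also be expressed as the byte sequences that actually cross the wire. This is a sketch, not a working Telnet client; the constants come from RFC 854 and the terminal-type option (RFC 1091):

```python
IAC, SB, SE = 255, 250, 240   # command introducer, sub-negotiation begin/end
WILL, DO = 251, 253
TERMINAL_TYPE = 24
IS, SEND = 0, 1               # sub-negotiation verbs for terminal type

do_ttype   = bytes([IAC, DO, TERMINAL_TYPE])                 # Can we negotiate terminal type?
will_ttype = bytes([IAC, WILL, TERMINAL_TYPE])               # Yes, we can.
ask_type   = bytes([IAC, SB, TERMINAL_TYPE, SEND, IAC, SE])  # Send me your terminal type.
answer     = (bytes([IAC, SB, TERMINAL_TYPE, IS])
              + b"IBM-3278-2"
              + bytes([IAC, SE]))                            # My terminal is a 3278-2.

assert ask_type == b"\xff\xfa\x18\x01\xff\xf0"
```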
13.1.6 Telnet basic commands
The primary goal of the Telnet protocol is the provision of a standard interface for
hosts over a network. To allow the connection to start, the Telnet protocol
defines a standard representation for some functions:
򐂰 Interrupt Process (IP)
򐂰 Abort Output (AO)
򐂰 Are You There (AYT)
򐂰 Erase Character (EC)
򐂰 Erase Line (EL)
13.1.7 Terminal emulation (Telnet 3270)
Telnet can be used to make a TCP/IP connection to an SNA host. However,
Telnet 3270 is used to provide 3270 Telnet emulation (TN3270). The following
differences between traditional Telnet and 3270 terminal emulation make it
necessary for additional Telnet options specifically for TN3270 to be defined:
򐂰 3270 terminal emulation uses block mode rather than line mode.
򐂰 3270 terminal emulation uses the EBCDIC character set rather than the
ASCII character set.
򐂰 3270 terminal emulation uses special key functions, such as ATTN and
SYSREQ.
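The character-set difference is easy to see with Python's cp037 codec, one common EBCDIC code page (a given 3270 host may use a different one; this is only an illustration):

```python
text = "HELLO"

ascii_bytes = text.encode("ascii")    # 7-bit ASCII, as an NVT would send it
ebcdic_bytes = text.encode("cp037")   # EBCDIC, as a 3270 data stream carries it

assert ascii_bytes == b"HELLO"
assert ebcdic_bytes == b"\xc8\xc5\xd3\xd3\xd6"
```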
The TN3270 connection over Telnet is accomplished by the negotiation of the
following three different Telnet options:
򐂰 Terminal type
򐂰 Binary transmission
򐂰 End of record
A TN3270 server must support these characteristics during initial client/server
session negotiations. Binary transmission and end of record options can be sent
in any order during the TN3270 negotiation. Note that TN3270 does not use any
additional options during the TN3270 negotiation; it uses normal Telnet options.
After a TN3270 connection is established, additional options can be used.
The terminal type option is a string that specifies the terminal type for the host, such
as IBM 3278-3. The -3 following 3278 indicates the use of an alternate screen
size other than the standard size of 24x80. The binary transmission Telnet option
states that the connection will be other than the initial NVT mode. If the client or
server wants to switch to NVT mode, it sends a command that disables the
binary option. A 3270 data stream consists of a command and related data.
Because the length of the data associated with the command can vary, every
command and its related data must be separated with the IAC EOR sequence.
For this purpose, the EOR Telnet option is used during the negotiation.
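A minimal sketch of that framing rule, assuming the usual Telnet convention that a literal IAC (255) byte inside the data is sent doubled:

```python
IAC, EOR = 255, 239  # EOR command code used with the end-of-record option

def frame_3270_record(record: bytes) -> bytes:
    """Escape literal IAC bytes, then terminate the record with IAC EOR."""
    escaped = record.replace(bytes([IAC]), bytes([IAC, IAC]))
    return escaped + bytes([IAC, EOR])

# A record containing a literal 0xff data byte:
assert frame_3270_record(b"\x7d\xff\x40") == b"\x7d\xff\xff\x40\xff\xef"
```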
Other important issues for a TN3270 connection are the correct handling of the
ATTN and SYSREQ functions. The 3270 ATTN key is used in SNA environments
to interrupt the current process. The 3270 SYSREQ key is used in SNA
environments to terminate the session without closing the connection. However,
SYSREQ and ATTN commands cannot be sent directly to the TN3270 server
over a Telnet connection. Most of the TN3270 server implementations convert
the BREAK command to an ATTN request to the host through the SNA network.
On the client side, a key or combination of keys is mapped to BREAK for this
purpose. For the SYSREQ key, either a Telnet Interrupt Process command can
be sent, or a SYSREQ command can be sent embedded in a TN3270 data
stream. Similarly, on the client side, a key or combination of keys is mapped for
this purpose.
There are some functions that cannot be handled by traditional TN3270. Some of
these issues include:
򐂰 TN3270 does not support 328x types of printers.
򐂰 TN3270 cannot handle SNA BIND information.
򐂰 There is no support for the SNA positive/negative response process.
򐂰 TN3270 cannot map Telnet sessions into SNA device names.
13.1.8 TN3270 enhancements (TN3270E)
The 3270 structured field allows non-3270 data to be carried in 3270 data.
Therefore, it is possible to send graphics, IPDS™ printer data streams, and so
on. The structured field consists of a structured field command and one or more
blocks following the command. However, not every TN3270 client can support all
types of data. In order for clients to be able to support any of these functions, the
supported range of data types needs to be determined when the Telnet
connection is established. This process requires additions to TN3270. To
overcome the shortcomings of traditional TN3270, TN3270 extended attributes
are defined. Refer to RFC 2355 for detailed information about TN3270
enhancements (TN3270E).
In order to use the extended attributes of TN3270E, both the client and server
must support TN3270E. If neither side supports TN3270E, traditional TN3270
can be used. After both sides agree to use TN3270E, they begin to negotiate the
subset of TN3270E options. These options are the device-type and a set of
supported 3270 functions, which are:
򐂰 Printer data stream type
򐂰 Device status information
򐂰 The passing of BIND information from server to client
򐂰 Positive/negative response exchanges
13.1.9 Device-type negotiation
Device-type names are NVT ASCII strings and all uppercase. When the
TN3270E server issues the DEVICE-TYPE SEND command to the client, the
client replies with a device type, a device name, or a resource name, using the
DEVICE-TYPE REQUEST command. Table 13-5 and Table 13-6 show the
supported device-types.
Table 13-5 TN3270 device-types: Terminals

Device-type    Screen size
IBM-3278-2     24 row x 80 col display
IBM-3278-3     32 row x 80 col display
IBM-3278-4     43 row x 80 col display
IBM-3278-5     27 row x 132 col display

Table 13-6 TN3270E device-type: Printer

Device-type
IBM-3287-1
Because the 3278 and 3287 are commonly used devices, device-types are
restricted to 3278 and 3287 terminal and printer types to simplify the negotiation.
This does not mean that other types of devices cannot be used. Simply put, the
device-type negotiation determines the generic characteristics of the 3270 device
that will be used. More advanced functions of 3270 data stream supported by the
client are determined by the combination of read partition query and query reply.
The -E suffix indicates the use of extended attributes, such as partition, graphics,
extended colors, and alternate character sets. If the client and the server have
agreed to use extended attributes and negotiated on a device with the -E suffix,
such as an IBM-DYNAMIC device or printer, both sides must be able to handle
the 3270 structured field. The structured field also allows 3270 Telnet clients to
issue specific 3270 data streams to host applications that the client is capable of
supporting.
From the point of view of the TN3270E client, it is not always possible or easy to
know which device names are available in the network. The TN3270E server
must assign the proper device to the client. This is accomplished by using a
device pool that is
defined on the TN3270E server. Basically, these device pools contain SNA
network devices, such as terminals and printers. In other words, the TN3270E
implementation maps TN3270 sessions to specific SNA logical unit (LU) names,
thus effectively turning them into SNA devices. The device pool not only defines
SNA network devices but also provides some other important functions for a
TN3270E session. Some of these are:
򐂰 It is possible to assign one or more printers to a specific terminal device.
򐂰 It is possible to assign a group of devices to a specific organization.
򐂰 A pool can be defined that has access to only certain types of applications on
the host.
The TN3270E client can issue CONNECT or ASSOCIATE commands to connect
or associate the sessions to certain types of resources. However, this resource
must not conflict with the definition on the server and the device-type determined
during the negotiation.
13.2 Remote Execution Command protocol (REXEC and RSH)
Remote Execution Command Daemon (REXECD) is a server that allows the
execution of jobs submitted from a remote host over the TCP/IP network. The
client uses the REXEC or Remote Shell Protocol (RSH) command to transfer the
job across to the server. Any standard output or error output is sent back to the
client for display or further processing.
Principle of operation
REXECD is a server (or daemon). It handles commands issued by foreign hosts
and transfers orders to subordinate virtual machines for job execution. The
daemon performs automatic login and user authentication when a user ID and
password are entered.
The REXEC command is used to define the user ID, password, host address,
and the process to be started on the remote host. However, RSH does not
require you to send a user name and password; it uses a host access file
instead. Both server and client are linked over the TCP/IP network. REXEC uses
TCP port 512 and RSH uses TCP port 514. See Figure 13-5 for more details.
Figure 13-5 REXEC: REXECD principle
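The REXEC request format itself is simple: four NUL-terminated ASCII fields. The sketch below only builds the request bytes; opening the TCP connection to port 512 and reading the reply are omitted:

```python
def build_rexec_request(user: str, password: str, command: str,
                        stderr_port: int = 0) -> bytes:
    """Build a REXEC request: the ASCII port number for a separate
    stderr connection (0 means none), then the user ID, the password,
    and the command, each terminated by a NUL byte."""
    fields = [str(stderr_port), user, password, command]
    return b"".join(f.encode("ascii") + b"\0" for f in fields)

assert build_rexec_request("myuser", "mypass", "ls -l") == \
       b"0\x00myuser\x00mypass\x00ls -l\x00"
```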
13.3 Introduction to the Distributed Computing
Environment (DCE)
Distributed Computing Environment (DCE) is an architecture, a set of open
standard services and associated APIs, used to support the development and
administration of distributed applications in a multiplatform, multivendor
environment.
DCE is the result of work from the Open Software Foundation, or OSF (now
called The Open Group), a collaboration of many hardware vendors, software
vendors, clients, and consulting firms. The OSF began in 1988 with the purpose
of supporting the research, development, and delivery of vendor-neutral
technology and industry standards. One such standard developed was DCE.
DCE Version 1.0 was released in January 1992.
As shown in Figure 13-6, DCE includes the following major services:
򐂰 Directory service
򐂰 Security service
򐂰 Distributed Time Service
򐂰 Distributed File Service
򐂰 Threads
򐂰 Remote Procedure Call
Figure 13-6 DCE architectural components
All these services have application program interfaces (APIs) that allow the
programmer to use these functions. We describe these services in more detail in
the following sections.
The DCE architecture does not specifically require that TCP/IP be used for
transport services, but few other protocols today meet the open and multivendor
requirements of the DCE design goals. In practice, the vast majority, if not all,
implementations of DCE are based on TCP/IP networks.
13.3.1 DCE directory service
When working in a large, complex network environment, it is important to keep
track of the locations, names, and services (and many other details) of the
participants and resources in that network. It is also important to be able to
access this information easily. To enable this, information needs to be stored in a
logical, central location and have standard interfaces for accessing the
information. The DCE Cell Directory Service does exactly this.
The DCE directory service has the following major components:
򐂰 Cell Directory Service (CDS)
򐂰 Global Directory Service (GDS)
򐂰 Global Directory Agent (GDA)
򐂰 Application program interface (API)
Cell Directory Service
The Cell Directory Service manages a database of information about the
resources in a group of closely cooperating hosts, which is called a cell. A DCE
cell is very scalable and can contain many thousands of entities. Typically, even
fairly large corporations will be organized within a single cell, which can
cover several countries. The directory service database contains a hierarchical
set of names, which represent a logical view of the machines, applications,
users, and resources within the cell. These names are usually directory entries
within a directory unit. Often, this hierarchical set of names is also called the
namespace. Every cell requires at least one DCE server configured with the Cell
Directory Service (a directory server).
The CDS has two very important characteristics: It can be distributed, and it can
be replicated. Distributed means that the entire database does not have to reside
on one physical machine in the cell. The database can logically be partitioned
into multiple sections (called replicas), and each replica can reside on a separate
machine. The first instance of that replica is the master replica, which has
read/write access. The ability of the cell directory to be split into several master
replicas allows the option of distributing the management responsibility for
resources in different parts of the cell. This might be particularly important if the
cell covers, say, several countries.
Each master replica can be replicated. That is, a copy of this replica can be
made on a different machine (which is also a directory server). This is called a
read-only replica. Read-only replicas provide both resilience and performance
enhancement by allowing a host machine to perform lookups to the nearest
available replica.
Replicas are stored in a clearinghouse. A clearinghouse is a collection of
directory replicas at a particular server. All directory replicas must be part of a
clearinghouse (although not necessarily the same one).
The Cell Directory Service makes use of the DCE security service. When the
CDS initializes, it must authenticate itself to the DCE security service. This
prevents a fraudulent CDS from participating in the existing cell.
Figure 13-7 shows the directory structure of the CDS namespace. As you can
see, the namespace is organized in a hierarchical manner.
Figure 13-7 DCE: CDS namespace directory structure (the namespace descends from the global root, /..., and the local cell root, /.:)
Not all DCE names are stored directly in the DCE directory service. Resource
entries managed by some services, such as the security service (sec) and the
distributed file system (fs), connect into the namespace by means of specialized
CDS entries called junctions. A junction entry contains binding information that
enables a client to connect to a directory server outside of the directory service.
The security namespace is managed by the registry service of the DCE security
component, and the DFS namespace is managed by the file set location
database (FLDB) service of DFS.
Global Directory Service and Agent
The Cell Directory Service is responsible for knowing where resources are within
the cell. However, in a multicell network, each cell is part of a larger hierarchical
namespace, called the global directory namespace. The Global Directory
Service (GDS) enables us to resolve the location of resources in foreign cells.
This is the case when a company wants to connect their cells together or to the
cells of other organizations.
In order to find a resource in another cell, a communication path needs to exist
between the two cells. This communication path can currently be one of two
types:
򐂰 CCITT X.500
򐂰 Internet Domain Name System (DNS)
In order for intercell communications to be accomplished, another component,
the Global Directory Agent, is required. The Global Directory Agent (GDA) is the
intermediary between the local cell and the Global Directory Service. In
Figure 13-8, if the CDS does not know the location of a resource, it tells the client
to ask the GDA for assistance. The GDA knows to which global namespace it is
connected and queries the GDS (either DNS or X.500) for the name of the
foreign cell directory server with which to communicate. When in direct
communication with the foreign cell directory server, the network name of the
resource requested can be found. The Global Directory Agent is the component
that provides communications support for either DNS or X.500 environments.
Figure 13-8 DCE: Global Directory Agent
DCE security service
Security is always a concern in a networked environment. In a large, distributed
environment, it is even more crucial to ensure that all participants are valid users
who access only the data with which they are permitted to work. The two primary
concerns are authentication and authorization. Authentication is the process of
proving or confirming the identity of a user or service. Authorization is the
process of checking a user's level of authority when an access attempt is made.
For example, if a user tries to make a change when read-only access has been
granted, the update attempt will fail.
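The distinction between the two concerns can be illustrated with a short Python sketch (purely illustrative; DCE ACLs are considerably richer, supporting groups, masks, and several permission types):

```python
# Toy illustration, not DCE code: authorization checks what an already
# authenticated principal is allowed to do.
ACL = {"alice": {"read"}, "bob": {"read", "write"}}

def authorize(principal, operation):
    """Return True only if the principal's ACL entry grants the operation."""
    return operation in ACL.get(principal, set())

# alice has read-only access, so her update attempt fails:
print(authorize("alice", "write"))  # False
print(authorize("bob", "write"))    # True
```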
TCP/IP Tutorial and Technical Overview
The DCE security service ensures secure communications and controlled
access to resources in this distributed environment. It is based on the
Massachusetts Institute of Technology's Project Athena, which produced
Kerberos. Kerberos is an authentication service that validates a user or service.
The current DCE security service (DCE 1.2.2) is based on Kerberos Version 5.
Because the DCE security service must be able to validate users and services, it
must also have a database to hold this information. This is indeed the case. The
DCE security service maintains a database of principals, accounts, groups,
organizations, policies, properties, and attributes. This database is called the
registry. Figure 13-9 shows a pictorial representation of the registry tree. The
registry is actually part of the cell directory namespace, although it is stored on a
separate server.
Figure 13-9 DCE: Registry directory structure
The DCE security service consists of several components:
Authentication service
Handles the process of verifying that principals are
correctly identified. This also contains a ticket
granting service, which allows the engagement of
secure communications.
Privilege service
Supplies a user's privilege attributes to enable
them to be forwarded to DCE servers.
Registry service
Maintains the registry database, which contains
accounts, groups, principals, organizations, and
policies.
Access control list facility
Provides a mechanism to match a principal's
access request against the access controls for the
requested object.
Login facility
Provides the environment for a user to log in and
initialize the security environment and credentials.
These services enable user authentication, secure communication, authorized
access to resources, and proper enforcement of security.
The DCE security service communicates with the Cell Directory Service to
advertise its existence to the other systems that are part of the cell. The DCE
security service also uses the Distributed Time Service to obtain time stamps for
use in many of its processes.
13.3.2 Authentication service
The role of the authentication service is to allow principals to positively identify
themselves and participate in a DCE network. Both users and servers
authenticate themselves in a DCE environment, unlike security in most other
client/server systems, where only users are authenticated. There are two distinct
steps to authentication. At initial logon time, the Kerberos third-party protocol is
used within DCE to verify the identity of a client requesting to participate in a
DCE network. This process results in the client obtaining credentials, which form
the basis for setting up secure sessions with DCE servers when the user tries to
access resources.
In DCE Version 1.1, the idea of preauthentication was introduced, which is not
present in the Kerberos authentication protocols. Preauthentication protects the
security server from a rogue client trying to guess valid user IDs in order to hack
into the system. In DCE 1.1, there are three protocols for preauthentication:
No preauthentication
This is provided to support DCE clients earlier than
Version 1.1.
Timestamps
This is used by DCE Version 1.1 clients that are
unable to use the third-party protocol. An encrypted
time stamp is sent to the security server. The time
stamp is decrypted, and if the time is within five
minutes, the user is considered preauthenticated. This
option needs to be specified for cell administrators and
non-interactive principals.
Third-party
This is the default used by DCE Version 1.1 (and later)
clients. It is similar to the time stamps protocol, but
additional information about the client is also
encrypted in various keys.
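The time stamp check can be sketched as follows (an illustrative Python sketch; an HMAC stands in for the encryption the protocol actually uses, and the five-minute window mirrors the description above):

```python
import hashlib
import hmac
import time

WINDOW = 5 * 60  # the five-minute acceptance window described above

def make_preauth(password, now):
    """Client side: prove knowledge of the password by authenticating a
    fresh timestamp (HMAC stands in for encryption here)."""
    key = hashlib.sha256(password.encode()).digest()
    stamp = str(int(now)).encode()
    return stamp, hmac.new(key, stamp, hashlib.sha256).hexdigest()

def check_preauth(password, stamp, tag, now):
    """Server side: recompute the tag from its own copy of the password
    and accept only if the timestamp falls inside the window."""
    key = hashlib.sha256(password.encode()).digest()
    expected = hmac.new(key, stamp, hashlib.sha256).hexdigest()
    fresh = abs(now - int(stamp)) <= WINDOW
    return hmac.compare_digest(expected, tag) and fresh

stamp, tag = make_preauth("s3cret", time.time())
print(check_preauth("s3cret", stamp, tag, time.time()))        # True
print(check_preauth("wrong", stamp, tag, time.time()))         # False
print(check_preauth("s3cret", stamp, tag, time.time() + 600))  # False: stale
```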
The login and authentication process using the third-party preauthentication
protocol is shown in Figure 13-10.
Figure 13-10 DCE: Authentication and login process using third-party protocol
This detailed process is as follows:
1. The user issues a request to log in to the cell. However, the user must first be
authenticated. The client creates two random conversation keys, one of them
based on the machine session key. The login facility on the client then uses
these keys and the supplied password to encrypt a request for an
authentication ticket (ticket granting ticket, or TGT) from the security server.
2. The authentication service (AS) on a security server receives the request.
The AS looks up the machine session key from the registry to decrypt the
request. (Note that by knowing the machine session key, the security server
proves to be valid for the cell. A false server would not know the machine
session key.) If the decryption is successful and the time stamp is within five
minutes, the AS encrypts a TGT using one of the client conversation keys
provided. When successful, this encrypted TGT is returned to the client.
3. The client receives the TGT envelope and decrypts it using one of the client
conversation keys it provided. Also included is a conversation key for the AS.
Note that the TGT itself is encrypted with a conversation key of the AS. This
valid TGT is proof that the user is now authenticated.
4. Now the user needs the authorization credentials, known as extended
privilege attribute certificate (EPAC), from the privilege service (PS).
Therefore, it must construct a privilege ticket granting ticket (PTGT) request
to retrieve this from the PS. To communicate with the PS, the client sends a
request to the AS to contact the PS. This request is encrypted with the
conversation key of the AS.
5. The AS receives this request. Using the secret key of the PS, the AS
generates a conversation key for the client to use when contacting the PS.
This is returned to the client and encrypted again with the AS conversation
key. The client receives the envelope and decrypts it (using the conversation
key) and discovers the conversation key for the PS. The client can now send
a privilege service ticket to the PS.
6. The PS receives the request and decrypts it with its secret key successfully.
This proves that the service ticket is legitimate, which also implies that the AS
involved is also legitimate. From this, the PS knows that the client and the AS
are valid. The PS constructs the EPAC, which lists the user's standard and
extended registry attributes, including group membership. The PS creates
more conversation keys and sends the EPAC and other information in an
encrypted PTGT envelope to the client.
7. The client decrypts the PTGT envelope using the PS conversation key. Also,
the client has the conversation key information and an encrypted PTGT
(which the client cannot decrypt, because it is encrypted using the AS secret
key).
8. Now, the client wants to contact an application server. To do so, it sends the
PTGT to the AS and requests a service ticket for the application server. The
AS receives the PTGT and decrypts it to obtain the EPAC information. It
encrypts the EPAC information with the secret key of the application server
and also provides a conversation key for the application server. This
information is encrypted with the conversation key of the AS (which the client
knows) and is returned to the client.
9. The client decrypts the envelope and discovers the application server's secret
conversation key. Using this key, it can now contact the application server. By
correctly decrypting the request from the client, the application server is able
to determine that the client has been authenticated, and by responding to the
client, the client knows that it was, indeed, the real application server that it
has contacted. The two will then establish a mutually authenticated session.
In addition to the extensive use of secret keys during the logon process,
third-party authentication makes use of time stamps to ensure that the
conversation is protected against intruders and eavesdropping. Time stamps
make impersonation techniques, such as record and playback, ineffective. Also,
the actual user password entered at logon time does not flow to the server as
such. Instead, it is used as an encryption key for the initial logon messages that
are then decrypted by the security server using its own copy of the password
stored in the registry database.
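The idea that the password acts as an encryption key rather than flowing over the network can be sketched as follows (a toy Python sketch; the key derivation, salt, and XOR keystream are illustrative stand-ins for the scheme DCE actually uses):

```python
import hashlib

def derive_key(password, salt):
    # Both sides derive the same key from the password; the password
    # itself never crosses the network.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 10_000)

def xor_stream(key, data):
    # Toy XOR "cipher" for illustration only.
    stream = hashlib.sha256(key).digest()
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

salt = b"cell-registry-salt"          # assumed shared parameter
client_key = derive_key("userpass", salt)
request = xor_stream(client_key, b"TGT request")   # what actually flows

server_key = derive_key("userpass", salt)          # from the registry copy
print(xor_stream(server_key, request))             # b'TGT request'
```

Only the server's independent derivation from its registry copy of the password recovers the message; a wrong password yields garbage, which is how the server detects an invalid login.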
If the security server is not able to authenticate the client for some reason, such
as the entry of an invalid password, an error is returned and the logon is
terminated. However, if the exchange completes with the client being
successfully authenticated, the security server returns credentials that are then
used by the client to establish sessions with other DCE servers, such as
resource and directory servers. These credentials contain information in the form
of a privilege ticket granting ticket (PTGT) and extended privilege attribute
certificate (EPAC):
EPAC
This is a validated list supplied by the security server
containing the client's name, groups the client belongs to, and
the extended registry attributes for the authenticated client (if
any were defined and associated with their account). A client
must present its EPAC (acquired during third-party
authentication) to any server the client wants to connect to in
order to access its resources.
PTGT
This is a privilege ticket granting ticket. It contains the EPAC,
which has all the relevant information about a user (UUID,
group membership, ERAs, and so on). The PTGT is what is
actually passed from a DCE client to a DCE server when it
needs to access resources.
Public key support
The latest version of DCE (DCE Version 1.2.2) introduces the option of using
public key technology (such as that from RSA or smart cards) to support the login
process. Using this technology, the long-term key (or password) for a user (or
other DCE object) does not need to be stored at the security server, providing
enhanced security in the event of a compromise of the security server.
Administrators can specify that some principals can use the pre-DCE 1.2
mechanisms while others have access to the public key mechanism. DCE 1.2.2
retains full interoperability with previous DCE releases. During the login process,
public key users receive credentials that allow them to use the current
(Kerberos-based) DCE authentication mechanism. A new pre-authentication
protocol is used. The login client does not have to determine whether a given
user is public key-capable prior to requesting credentials.
13.3.3 DCE threads
Traditional applications (written in languages such as C, COBOL, and so on)
have many lines of programming code that usually execute in a sequential
manner. At any time, there is one point in the program that is executing. This can
be defined as single threading. A thread is a single unit of execution flow within a
process. Better application performance can often be obtained when a program
is structured so that several areas can be executed concurrently. This is called
multithreading. The capability of executing multiple threads also depends on the
operating system.
In a distributed computing environment based on the client/server model, threads
provide the ability to perform many procedures at the same time. Work can
continue in another thread while the thread waiting for a specific response is
blocked (for example, waiting for a response from the network). A server can
process procedure calls concurrently: while one server thread is waiting for an
I/O operation to finish, another server thread can continue working on a different
request.
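The benefit can be illustrated with a small Python sketch (illustrative only; DCE threads use the POSIX draft C API, not Python):

```python
import threading
import time

results = []
lock = threading.Lock()

def handle_request(name, delay):
    # Simulate a blocking I/O wait; other threads keep running meanwhile.
    time.sleep(delay)
    with lock:
        results.append(name)

# A single-threaded server would serve these strictly in order; with
# threads, the fast request finishes while the slow one is still blocked.
slow = threading.Thread(target=handle_request, args=("slow", 0.2))
fast = threading.Thread(target=handle_request, args=("fast", 0.01))
slow.start()
fast.start()
slow.join()
fast.join()
print(results)  # ['fast', 'slow']
```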
To function well, thread support needs to be integrated into the operating system.
If threads are implemented at the application software level instead of within the
operating system, performance of multithreaded applications might seem slow.
The DCE thread APIs are either user-level (in operating systems that do not
support threads, such as Microsoft Windows® 3.x) or kernel threads (such as
IBM AIX® 5L™ and OS/2®). They are based on the POSIX 1003.4a Draft 4
standard. Because OS/2 also has threads, the programmer can use DCE
threads or OS/2 threads.
DCE threads can be mapped onto OS/2 threads through special programming
constructs. However, in order to write portable applications that can run on
different platforms, only DCE threads should be used. In many cases, there is little
performance difference resulting from this mapping.
DCE Remote Procedure Call
The DCE Remote Procedure Call (RPC) architecture is the foundation of
communication between the client and server in the DCE environment.
Note: The DCE RPC is conceptually similar to the ONC RPC (see 11.2.2,
“Remote Procedure Call (RPC)” on page 415), but the protocol used on the
wire is not compatible.
RPCs provide the ability for an application program's code to be distributed
across multiple systems, which can be anywhere in the network.
An application written using DCE RPCs has a client portion, which usually issues
RPC requests, and a server portion, which receives RPC requests, processes
them, and returns the results to the client. RPCs have three main components:
򐂰 The Interface Definition Language (IDL) and its associated compiler. From
the specification file, it generates the header file, the client stub, and the
server stub. This allows an application to issue a Remote Procedure Call in
the same manner as it issues a local procedure call.
򐂰 The network data representation, which defines the format for passing data,
such as input and output parameters. This ensures that the bit-ordering and
platform-specific data representation can be converted properly after it arrives
at the target system. This process of preparing data for an RPC call is called
marshalling.
򐂰 The runtime library, which shields the application from the details of network
communications between client and server nodes.
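The role of a fixed network data representation can be sketched with Python's struct module (an illustrative sketch; the format string and field layout are invented for the example and are not DCE's actual wire format):

```python
import struct

# A fixed, network byte order ("!") plays the role of the network data
# representation: both stubs agree on it regardless of host endianness.
CALL_FORMAT = "!HI8s"   # procedure number, an integer arg, a fixed-width string

def marshal_call(proc, count, name):
    """Pack the call's parameters into an architecture-neutral byte string."""
    return struct.pack(CALL_FORMAT, proc, count, name.ljust(8).encode())

def unmarshal_call(buf):
    """Unpack the byte string back into native values on the receiving side."""
    proc, count, name = struct.unpack(CALL_FORMAT, buf)
    return proc, count, name.decode().rstrip()

wire = marshal_call(7, 42, "reply")
print(len(wire))             # 14 bytes: 2 + 4 + 8
print(unmarshal_call(wire))  # (7, 42, 'reply')
```

In DCE, the IDL compiler generates this packing and unpacking code in the client and server stubs, so the application never sees it.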
The application programmer can choose to use multiple threads when making
RPC calls. This is because an RPC is synchronous; that is, when an RPC call is
made, the thread that issued the call is blocked from further processing until a
response is received.
Remote Procedure Calls can be used to build applications that make use of other
DCE facilities, such as the Cell Directory Service (CDS) and the security service.
The CDS can be used to find servers or to advertise a server's address for client
access. The security service might be used to make authenticated RPCs that
enable various levels of data integrity and encryption using the commercial data
masking facility (CDMF), data encryption standard (DES), and other functions
such as authorization.
13.3.4 Distributed Time Service
Keeping the clocks on different hosts synchronized is a difficult task because the
hardware clocks do not typically run at the same rates. This presents problems
for distributed applications that depend on the ordering of events that happen
during their execution. For example, let us say that a programmer is compiling
some code on a workstation and some files are also located on a server. If the
workstation and the server do not have their time synchronized, it is possible that
the compiler might not process a file, because the date is older than an existing
one on the server. In reality, the file is newer, but the clock on the workstation is
slow. As a result, the compiled code will not reflect the latest source code
changes. This problem becomes more acute in a large cell where servers are
distributed across multiple time zones.
The DCE Distributed Time Service (DTS) provides standard software
mechanisms to synchronize clocks on the different hosts in a distributed
environment. It also provides a way of keeping a host's time close to the absolute
time. DTS is optional. It is not a required core service for the DCE cell. However,
if DTS is not implemented, the administrator must use some other means of
keeping clocks synchronized for all the systems in the cell.
The Distributed Time Service has several components. They are:
򐂰 Local time server
򐂰 Global time server
򐂰 Courier and backup courier time server
Local time server
The local time server is responsible for answering time queries from time clerks
on the LAN. Local time servers also query each other to maintain
synchronization on the LAN. If a time clerk cannot contact the required number of
local time servers (as specified by the minservers attribute), it must contact
global time servers through a CDS lookup.
We recommend that there are at least three local time servers per LAN. This
ensures that the time on the LAN is synchronized. The task of synchronization
across multiple LANs in the cell is performed by global and courier time servers.
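A time clerk's behavior can be sketched as follows (a simplified Python sketch; real DTS computes an intersection of time intervals rather than a simple average, and the minservers default here is illustrative):

```python
def corrected_time(server_times, minservers=3):
    """Toy clerk: with enough local time servers answering, compute a
    corrected time from the reported values; otherwise the clerk must
    fall back to a global time server found through CDS."""
    if len(server_times) < minservers:
        raise RuntimeError("too few local time servers; query a GTS via CDS")
    return sum(server_times) / len(server_times)

print(corrected_time([101.0, 102.0, 103.0]))  # 102.0
```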
Global time server
A global time server (GTS) advertises itself in the Cell Directory Service
namespace so that all systems can find it easily. A GTS participates in the local
LAN in the same way that local time servers do, but it has an additional
responsibility. It also gives its time to a courier time server, which is located in a
different LAN.
Courier roles
Local and global time servers can also have a courier role. They can be couriers,
backup couriers, or non-couriers. The courier behaves similarly to other local
time servers, participating in the time synchronization process. However, the
courier does not look at its own clock. It requests the time from a global time
server located in another LAN or in another part of the cell. Because the time is
imported from another part of the network, this enables many remote LAN
segments in all parts of the cell to have a very closely synchronized time value.
The backup courier role provides support in the event that the primary courier for
that LAN is not available. The backup couriers will negotiate to elect a new
courier and thus maintain the proper time synchronization with the global time
servers. Note that even if a courier time server is not defined, local time servers
and clerks will try to contact a global time server if they cannot contact the
minimum number of servers from the local segment.
The default for time servers is the non-courier role. As long as enough local time
servers can be contacted, they will not contact a global time server.
In a large or distributed network, local time servers, global time servers, and
courier time servers together automate the synchronization process and keep it
accurate.
13.3.5 Additional information
For additional information about DCE, refer to the IBM Redbook DCE Cell
Design Considerations, SG24-4746.
For information about the most current release of DCE (Version 1.2.2), view the
Open Group Web site at:
13.4 Distributed File Service (DFS)
The Distributed File Service is not really a core component of DCE, but it is an
application that is integrated with, and uses, the other DCE services. DFS
provides global file sharing. Access to files located anywhere in interconnected
DCE cells is transparent to the user. To the user, it appears as though the files
were located on a local drive. DFS servers and clients can be heterogeneous
computers running different operating systems.
The origin of DFS is Transarc Corporation's implementation of the Andrew File
System (AFS) from Carnegie-Mellon University. DFS conforms to POSIX 1003.1
for file system semantics and POSIX 1003.6 for access control security. DFS is
built onto, and integrated with, all of the other DCE services, and was developed
to address identified distributed file system needs, such as:
򐂰 Location transparency
򐂰 Uniform naming
򐂰 Good performance
򐂰 Security
򐂰 High availability
򐂰 File consistency control
򐂰 NFS interoperability
DFS follows the client/server model, and it extends the concept of DCE cells by
providing DFS administrative domains, which are an administratively
independent collection of DFS server and client systems within a DCE cell.
There can be many DFS file servers in a cell. Each DFS file server runs the file
exporter service that makes files available to DFS clients. The file exporter is
also known as the protocol exporter. DFS clients run the cache manager, which
acts as an intermediary between applications and the DFS servers from which
they request files. The
cache manager translates file requests into RPCs to the file exporter on the file
server system and stores (caches) file data on disk or in memory to minimize
server accesses. It also ensures that the client always has an up-to-date copy of
a file.
The DFS file server can serve two different types of file systems:
򐂰 Local file system (LFS), also known as the Episode file system
򐂰 Some other file system, such as the UNIX File System (UFS)
Full DFS functionality is only available with LFS and includes:
򐂰 High performance
򐂰 Log-based, fast restarting file sets for quick recovery from failure
򐂰 High availability with replication, automatic updates, and automatic bypassing
of failed file servers
򐂰 Strong security with integration to the DCE security service providing ACL
authorization control
13.4.1 File naming
DFS uses the Cell Directory Service (CDS) name /.:/fs as a junction to its
self-administered namespace. DFS objects of a cell (files and directories) build a
file system tree rooted in /.:/fs of every cell. Directories and files can be accessed
by users anywhere in the network using the same file or directory names, no
matter where they are physically located, because all DCE resources are part of
a global namespace.
As an example of DFS file naming, to access a particular file from within a cell, a
user might use the following name:
From outside the cell, using GDS (X.500) format, the following name is used:
Or, in DNS format:
13.4.2 DFS performance
Performance is one of the main goals of DFS, which it achieves through features
such as:
Cache manager
Files requested from the server are stored in cache at
the client so that the client does not need to send
requests for data across the network every time the
user needs a file. This reduces load on the server file
systems and minimizes network traffic, thereby
improving performance.
Multithreaded servers
DFS servers make use of DCE threads support to
efficiently handle multiple file requests from clients.
RPC pipes
The RPC pipe facility is extensively used to transport
large amounts of data efficiently.
Replication
Replication support allows efficient load balancing by
spreading out the requests for files across multiple
servers.
File consistency
Using copies of files cached in memory at the client side can potentially cause
problems when the file is being used by multiple clients in different locations.
DFS uses a token mechanism to synchronize concurrent file accesses by
multiple users and ensure that each user is always working with the latest
version of a file. The whole process is transparent to the user.
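The token idea can be sketched in Python (a toy model; real DFS tokens distinguish several access types, and the actual token machinery is considerably richer):

```python
class FileExporter:
    """Toy server: tracks which client caches hold a token for each file
    and revokes them when the file changes."""
    def __init__(self):
        self.files = {}
        self.holders = {}

    def fetch(self, path, cache):
        self.holders.setdefault(path, set()).add(cache)
        return self.files.get(path, "")

    def write(self, path, data):
        self.files[path] = data
        for cache in self.holders.pop(path, set()):
            cache.invalidate(path)  # revoke every outstanding token

class CacheManager:
    """Toy client cache: serves hits locally until its token is revoked."""
    def __init__(self, server):
        self.server = server
        self.cached = {}

    def read(self, path):
        if path not in self.cached:             # miss: one RPC to the server
            self.cached[path] = self.server.fetch(path, self)
        return self.cached[path]

    def invalidate(self, path):
        self.cached.pop(path, None)

server = FileExporter()
server.files["/.:/fs/readme"] = "v1"
client = CacheManager(server)
print(client.read("/.:/fs/readme"))   # "v1" (fetched, now cached)
server.write("/.:/fs/readme", "v2")   # revokes the client's token
print(client.read("/.:/fs/readme"))   # "v2" (refetched after invalidation)
```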
LFS file sets can be replicated on multiple servers for better availability. Every file
set has a single read/write version and multiple read-only replicas. The
read/write version is the only one that can be modified. Every change in the
read/write file set is reflected in the replicated file sets. If there is a crash of a
server system housing a replicated file set, the work is not interrupted, and the
client is automatically switched to another replica.
DFS security
DCE security provides DFS with authentication of user identities, verification of
user privileges, and authorization control. Using the DCE security's ACL
mechanism, DFS provides more flexible and powerful access control than that
typically provided by an operating system (for example, UNIX read, write, and
execute permissions).
DFS/NFS interoperability
DFS files can be exported to NFS so that NFS clients can access them as
unauthenticated users. This requires an NFS/DFS authenticating gateway
facility, which might not be available in every implementation.
13.5 RFCs relevant to this chapter
The following RFCs provide detailed information about the connection protocols
and architectures presented throughout this chapter:
򐂰 RFC 854 – TELNET Protocol Specification (May 1983)
򐂰 RFC 855 – TELNET Option Specifications (May 1983)
򐂰 RFC 2355 – TN3270 Enhancements (June 1998)
Chapter 14. File-related protocols
The TCP/IP protocol suite provides a number of protocols for the manipulation of
files. In general, there are two different mechanisms for accessing remote files.
The simplest mechanism is to transfer the particular file to the local
machine. In this case, multiple copies of the same file are likely to exist. File
Transfer Protocol (FTP), Trivial File Transfer Protocol (TFTP), Secure Copy
(SCP), and SSH FTP (SFTP) employ this mechanism of file sharing.
An alternate approach to accessing files is through the use of a file system. In
this case, the operating system on the local host provides the necessary
functionality to access the file on the remote machine. The user and application
on the local machine are not aware that the file actually resides on the remote
machine; they just read and write the file through the file system as though it
were on the local machine. In this case, only one copy of the file exists, and the
file system is responsible for coordinating updates. The Network File System
(NFS), Andrew File System (AFS), and the Common Internet File System (CIFS,
previously called Server Message Block, or SMB) provide this type of file access.
© Copyright IBM Corp. 1989-2006. All rights reserved.
14.1 File Transfer Protocol (FTP)
FTP is a standard protocol with STD Number 9. Its status is recommended. It is
described in RFC 959 – File Transfer Protocol (FTP) and updated by RFCs 2228
(FTP Security Extensions), 2428 (FTP Extensions for IPv6 and NATs), and 4217
(Securing FTP with TLS). Additional information is in RFC 2577 (FTP Security
Considerations).
Transferring data from one host to another is one of the most frequently used
operations. Both the need to upload data (transfer data from a client to a server)
and download data (retrieve data from a server to a client) are addressed by
FTP. Additionally, FTP provides security and authentication measures to prevent
unauthorized access to data.
14.1.1 An overview of FTP
FTP uses TCP as a transport protocol to provide reliable end-to-end connections
and implements two types of connections in managing data transfers. The FTP
client initiates the first connection, referred to as the control connection, to
well-known port 21 (the client’s port is typically ephemeral). It is on this port that
an FTP server listens for and accepts new connections. The control connection
is used for all of the control commands a client user uses to log on to the server,
manipulate files, and terminate a session. This is also the connection across
which the FTP server will send messages to the client in response to these
control commands.
Note: Communication between the client and the server follows the Telnet
protocol defined in RFC 854.
The second connection used by FTP is referred to as the data connection.
Typically, the data connection is established on server port 20. However,
depending on how the data connection is established, both the client and server
might use ephemeral ports. It is across this connection that FTP transfers the
data. FTP only opens a data connection when a client issues a command
requiring a data transfer, such as a request to retrieve a file, or to view a list of
the files available. Therefore, it is possible for an entire FTP session to open and
close without a data connection ever having been opened. Unlike the control
connection, in which commands and replies can flow both from the client to the
server and from the server to the client, the data connection is unidirectional.
FTP can transfer data only from the client to the server, or from the server to the
client, but not both. Also, unlike the control connection, the data connection can
be initiated from either the client or the server. Data connections initiated by the
server are active, while those initiated by the client are passive.
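In passive mode, the client learns where to open its data connection from the server's 227 reply, which encodes the address as six decimal numbers. Parsing it can be sketched as follows (the reply text shown is the conventional format; the port is computed as p1 * 256 + p2):

```python
import re

def parse_pasv(reply):
    """Extract the data-connection host and port from a 227 reply of the
    form "227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)"."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if not match:
        raise ValueError("not a valid 227 reply")
    h1, h2, h3, h4, p1, p2 = map(int, match.groups())
    return f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2

host, port = parse_pasv("227 Entering Passive Mode (192,0,2,10,19,136)")
print(host, port)  # 192.0.2.10 5000
```

The client then connects to that host and port to receive or send the file data, while commands and replies continue to flow on the control connection.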
The client FTP application is built with a protocol interpreter (PI), a data transfer
process (DTP), and a user interface. The server FTP application typically only
consists of a PI and DTP (see Figure 14-1).
Client System
Server System
Figure 14-1 The FTP model
The FTP client’s user interface communicates with the protocol interpreter, which
manages the control connection. This protocol interpreter translates any
application-specific commands to the RFC architected FTP commands, and then
communicates these control commands to the FTP server.
The FTP server’s PI receives these commands, and then initiates the
appropriate processes to service the client’s requests. If the requests require the
transfer of data, data management is performed by the DTPs on both the client
and server applications. Following the completion of the data transfer, the data
connection is closed, and control is returned to the PIs of the client and server
applications.
Note: Only one data transfer can occur for each data connection. If multiple
data transfers are required for a single FTP session, one distinct data
connection will be opened for each transfer.
14.1.2 FTP operations
When using FTP, the user performs some or all of the following operations:
򐂰 Connect to a remote host.
򐂰 Navigate and manipulate the directory structure.
򐂰 List files available for transfer.
򐂰 Define the transfer mode, transfer type, and data structure.
򐂰 Transfer data to or from the remote host.
򐂰 Disconnect from the remote host.
Connecting to a remote host
To execute a file transfer, the user begins by logging in to the remote host. This
is the primary method of implementing security within the FTP model.
Additional security can be provided using SSL and TLS, which we discuss later
in this chapter (see 14.1.9, “Securing FTP sessions” on page 527). Conversely,
this authentication can be bypassed using anonymous FTP (see 14.1.7,
“Anonymous FTP” on page 525).
There are four commands that are used:
open
Selects the remote host and initiates the login session.
user
Identifies the remote user ID.
pass
Authenticates the user.
site
Sends information to the foreign host that is used to provide
services specific to that host.
Navigating the directory structure
After a user has been authenticated and logged on to the server, that user can
navigate through the directory structure of the remote host in order to locate the
file desired for retrieval, or locate the directory into which a local file will be
transferred. The user can also navigate the directory structure of the client’s host.
After the correct local and remote directories have been accessed, users can
display the contents of the remote directory. The subcommands that perform
these functions are as follows:
cd Changes the directory on the remote host. A path name can be
specified, but must conform to the directory structure of the
remote host. In most implementations, cd .. will move one
directory up within the directory structure.
lcd Changes the directory on the local host. Similar to the cd
command, a path name can be specified but must conform to the
directory structure of the local host.
TCP/IP Tutorial and Technical Overview
ls Lists the contents of the remote directory. The list generated by
this command is treated as data, and therefore, this command
requires the use of a data connection. This command is intended
to create output readable by programs.
dir Lists the contents of the remote directory. Similar to the ls
command, the list generated by dir is treated as data and
requires the use of a data connection. This command is intended
to create output readable by human users.
Controlling how the data is transferred
Transferring data between dissimilar systems often requires transformations of
the data as part of the transfer process. The user has to decide on three aspects
of the data handling:
򐂰 The way the bits will be moved from one place to another
򐂰 The different representations of data on the system's architecture
򐂰 The file structure in which the data is to be stored
Each of these is controlled by a subcommand:
mode Specifies whether the file is treated as having a record structure
or as a byte stream:
b Specifies that block mode is to be used. This indicates
that the logical record boundaries of the file are
preserved during the transfer.
s Specifies that stream mode is to be used, meaning
that the file is treated as a byte stream. This is the
default and provides more efficient transfer, but might
not produce the desired results when working with a
record-based file system.
type Specifies the character sets used in translating and representing
the data:
a Indicates that both hosts are ASCII-based, or that if one
is ASCII-based and the other is EBCDIC-based,
ASCII-EBCDIC translation must be performed. On many
implementations, this can be invoked by issuing the
ASCII command, which the PI translates into the type a
command.
e Indicates that both hosts use an EBCDIC data
representation. On many implementations, this can be
invoked by issuing the EBCDIC command, which the PI
translates into the type E command.
i Indicates that no translation is to be done on the data.
On many implementations, this can be invoked by using
the BINARY command, which the PI translates into the
type I command.
structure Specifies the structure of the file to be transferred:
file Indicates that the file has no internal structure, and is
considered to be a continuous sequence of data bytes.
record Indicates that the file is made up of sequential records.
page Indicates that the file is made up of independent indexed
pages.
Note: The RFC 959 minimum FTP implementation includes only the
file and record options for the structure command. Therefore,
not all implementations will include the page option.
Because these subcommands do not cover all possible differences between
systems, the site subcommand is available to issue implementation-dependent
and platform-specific commands. The syntax of this command varies by
platform.
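As a hypothetical sketch (not part of the FTP specification itself), a client PI might map the user's type subcommands onto the protocol commands sent on the control connection like this:

```python
# Hypothetical sketch: mapping user type subcommands to the TYPE
# commands the PI sends on the control connection (per RFC 959).
TYPE_COMMANDS = {
    "ascii": "TYPE A",   # ASCII, with ASCII-EBCDIC translation if required
    "ebcdic": "TYPE E",  # EBCDIC data representation on both hosts
    "binary": "TYPE I",  # image: no translation of the data
}

def translate_type(subcommand):
    """Return the protocol command for a user-issued type subcommand."""
    return TYPE_COMMANDS[subcommand.lower()]
```

For example, translate_type("BINARY") yields the protocol command "TYPE I".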
Transferring files
The following commands can be used to copy files between FTP clients and
servers:
get Copies a file from the remote host to the local host. The PI
translates get into a RETR command.
mget Copies multiple files from the remote to the local host. The PI
translates mget into a series of RETR commands.
put Copies a file from the local host to the remote host. The PI
translates put into a STOR command.
mput Copies multiple files from the local host to the remote host. The
PI translates mput into a series of STOR commands.
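The PI's translation described above can be sketched as follows (a hypothetical illustration, not an implementation from the text):

```python
# Hypothetical sketch: get/put become a single RETR/STOR command,
# and mget/mput become one RETR/STOR per file.
PROTOCOL_COMMANDS = {"get": "RETR", "mget": "RETR",
                     "put": "STOR", "mput": "STOR"}

def translate_transfer(subcommand, filenames):
    """Return the protocol commands sent for one user transfer subcommand."""
    verb = PROTOCOL_COMMANDS[subcommand]
    return ["%s %s" % (verb, name) for name in filenames]
```

For example, translate_transfer("mget", ["file1", "file2"]) produces the series ["RETR file1", "RETR file2"].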
Terminating the FTP session
The following commands are used to end an FTP session:
quit Disconnects from the remote host and terminates FTP. Some
implementations use the BYE subcommand.
close Disconnects from the remote host but leaves the FTP client
running. An open command can be issued to establish a new
control connection.
An example of an FTP scenario
A LAN user has to transfer data from a workstation to a remote host. The data
is binary, and exists in a file named mydata in the workstation's /localfolder
directory. The user wants to transfer the data, using stream mode, to the remote
host's /tmp directory, and would like the new file to be named yourdata. The
process for doing this is illustrated in Figure 14-2.
[Figure content: across the LAN, the FTP client logs on to the FTP server on
the remote host, navigates to the correct remote and local folders (cd /tmp,
lcd /localfolder), specifies the file attributes (binary, stream mode), sends the
file (put mydata yourdata), and terminates the session.]
Figure 14-2 An example of an FTP transfer
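The Figure 14-2 scenario can be sketched with Python's standard ftplib module; the host name, user ID, and password here are hypothetical placeholders:

```python
from ftplib import FTP

def transfer_mydata(host, userid, password):
    """Sketch of the Figure 14-2 scenario: a binary, stream-mode upload."""
    ftp = FTP(host)                 # open the control connection
    ftp.login(userid, password)     # USER/PASS logon
    ftp.cwd("/tmp")                 # navigate to the remote directory
    ftp.voidcmd("TYPE I")           # binary transfer: no data translation
    with open("/localfolder/mydata", "rb") as f:
        ftp.storbinary("STOR yourdata", f)  # the put becomes a STOR
    ftp.quit()                      # terminate the session
```

Each step corresponds to one of the FTP operations listed in 14.1.2; ftplib opens one data connection per transfer, matching the FTP model.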
14.1.3 The active data transfer
When issuing the subcommands get, put, mget, mput, ls, or dir, FTP opens a
data connection across which the data will be transferred. Most FTP
implementations default to an active data transfer unless a passive transfer has
been specifically requested. In an active transfer, the FTP client sends a PORT
command to the FTP server, indicating the IP address and port number on which
the client will listen for a connection. Upon accepting the PORT command, the
FTP server initiates a connection back to the client on the indicated IP address
and port. After this connection has been established, data will begin flowing,
either from the client to the server (for put and mput commands) or from the
server to the client (for get, mget, ls, and dir commands). An example of this
sequence is illustrated in Figure 14-3.
[Figure content: on the control connection, the FTP client sends PORT
10,1,2,3,10,11 and the server replies 200 Port request OK. The server then
initiates the data connection to the client's indicated address and port. The
server replies 125 List started OK on the control connection, sends the list
(file1, file2, file3) across the data connection, and finally replies 250 List
completed successfully.]
Figure 14-3 The active data connection
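The PORT arguments shown in Figure 14-3 encode the client's IP address and listening port; a small sketch (an illustration, not from the original text) of that encoding:

```python
def port_arguments(ip, port):
    """Encode an IPv4 address and port as the PORT h1,h2,h3,h4,p1,p2 form."""
    p1, p2 = divmod(port, 256)        # port = (p1 * 256) + p2
    return ",".join(ip.split(".") + [str(p1), str(p2)])
```

For example, port_arguments("10.1.2.3", 2571) yields "10,1,2,3,10,11", the argument string used in the figure (2571 = 10 x 256 + 11).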
14.1.4 The passive data transfer
Contrary to the use of an active data connection, the passive data transfer
reverses the direction of establishment of the data connection. Instead of issuing
a PORT command, the client issues a PASV command, which uses no
parameters. Upon accepting this command, the FTP server sends back a reply
containing an IP address and port number. The client initiates a connection back
to the server on the indicated IP address and port. An example of this sequence
is illustrated in Figure 14-4.
[Figure content: on the control connection, the FTP client sends PASV and the
server replies 227 Entering Passive Mode (10,4,5,6,8,9). The client then
initiates the data connection to the server's indicated address and port. The
server replies 125 List started OK on the control connection, sends the list
(file1, file2, file3) across the data connection, and finally replies 250 List
completed successfully.]
Figure 14-4 The passive data connection
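A client must decode the 227 reply shown in Figure 14-4 to learn where to connect; a minimal sketch of that parsing (an illustration, not part of the original text):

```python
import re

def parse_pasv_reply(reply):
    """Extract (ip, port) from a 227 reply on the control connection."""
    h1, h2, h3, h4, p1, p2 = map(int, re.search(
        r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply).groups())
    return ("%d.%d.%d.%d" % (h1, h2, h3, h4), p1 * 256 + p2)
```

Applied to the figure's reply, parse_pasv_reply("227 Entering Passive Mode (10,4,5,6,8,9)") yields ("10.4.5.6", 2057), because 8 x 256 + 9 = 2057.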
One of the reasons to use a passive data transfer is to bypass firewall
configurations that block active data connections. For this reason, passive mode
is often referred to as “firewall friendly mode.” An example of such a scenario is a
firewall that has been configured to block any inbound attempts to open a
connection. In this example, an FTP server responding to a client’s PORT
command would receive an error when trying to open a connection to the
indicated IP address and port. However, by using passive mode, the client
initiates the connection from within the network, and the firewall allows the data
transfer to proceed. Other methods exist to resolve problems when FTPing
through a firewall, including proxy transfers (see 14.1.5, “Using proxy transfer” on
page 522) and the use of EPSV (see 14.1.8, “Using FTP with IPv6” on
page 525).
14.1.5 Using proxy transfer
FTP provides the ability for a client to have data transferred from one FTP server
to another FTP server. Several justifications for such a transfer exist, including:
򐂰 To transfer data from one host to another when direct access to the two hosts
are not possible
򐂰 To bypass a slow client connection
򐂰 To bypass a firewall restriction
򐂰 To reduce the amount of traffic within the client’s network
The process of setting up a proxy transfer begins with the use of a proxy open
command. Any FTP command can then be sent to the proxy server by preceding
the command with proxy. For example, executing the dir command lists the files
on the primary FTP server. Executing the proxy dir command lists the files on
the proxy server. The proxy get and proxy put commands can then be used to
transfer data between the two hosts. This process is illustrated in Figure 14-5 on
page 523.
[Figure content: the FTP client holds control connections to both server A and
server B through a firewall; the data connection flows directly between the two
servers.]
Figure 14-5 An FTP proxy transfer through a firewall
In Figure 14-5:
1. The FTP client opens a connection and logs on to the FTP server A.
2. The FTP client issues a proxy open command, and a new control connection
is established with FTP server B.
3. The FTP client then issues a proxy get command (though this can also be a
proxy put).
4. A data connection is established between server A and server B. Following
data connection establishment, the data flows from server B to server A.
14.1.6 Reply codes
In order to manage these operations, the client and server conduct a dialog using
the Telnet convention. The client issues commands, and the server responds
with reply codes. The responses also include comments for the benefit of the
user, but the client application uses only the codes. Reply codes are three digits
long, with the first two digits having specific meanings. See Table 14-1 and
Table 14-2 for descriptions of these codes.
Table 14-1 First digit reply codes and descriptions
First digit Description
1 Positive preliminary reply. These messages are usually informational,
and typically are followed by a message starting with a different reply
code.
2 Positive completion reply. This indicates that the last client command
has completed successfully.
3 Positive intermediate reply. This indicates that the client's command
has been accepted, but that additional information from the client is
needed before the command can complete.
4 Transient negative completion reply. This indicates that the client's
command has failed, but that retrying the command might yield a
positive result.
5 Permanent negative completion reply. This indicates that the
command has failed, and that retrying the command will not yield a
positive result.
6 Protected reply. This indicates that the reply contains security-related
information.
Table 14-2 Second digit reply codes and descriptions
Second digit Description
0 Syntax error. This indicates that a syntax error was encountered when
processing the client's command.
1 Requested information. This indicates that the message contains
information requested by the client.
2 Connection information. This indicates that the reply contains
information about the control or data connection.
3 Authentication. This indicates that the reply is part of the logon
sequence, or some other authentication sequence (such as TLS
negotiation).
4 Not in use.
5 File system information. This indicates that the reply contains file
system information relevant to the last client command received.
For each command issued by the user, the FTP server responds with a 3-digit
reply code and message, as in the following exchange:
FTP remotehost
220 service ready
USER myid
331 user name okay
PASS mypassword
230 user logged in
TYPE Image
200 command okay
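A client application uses only the code portion of each reply; a sketch (a hypothetical illustration following Table 14-1) of how the first digit might be interpreted:

```python
# Hypothetical sketch: interpreting the first reply digit per Table 14-1.
FIRST_DIGIT = {
    "1": "positive preliminary",
    "2": "positive completion",
    "3": "positive intermediate",
    "4": "transient negative completion",
    "5": "permanent negative completion",
    "6": "protected",
}

def classify_reply(reply):
    """Return the Table 14-1 category of a server reply such as '230 ...'."""
    return FIRST_DIGIT[reply[0]]

def worth_retrying(reply):
    """Only a transient negative (4yz) reply may succeed on retry."""
    return reply.startswith("4")
```

For example, a 230 reply is a positive completion, while only replies beginning with 4 justify resubmitting the command.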
14.1.7 Anonymous FTP
Many TCP/IP sites implement what is known as anonymous FTP, which means
that these sites allow public access to some file directories. The remote user only
needs to use the login name anonymous and password guest or some other
common password conventions (for example, the user's Internet e-mail ID). The
password convention used on a system is explained to the user during the login
process. Because this method of logon is available to anyone with Internet
access to the system, most hosts restrict the folders accessible to anonymous
users.
14.1.8 Using FTP with IPv6
The increasing use of IP version 6 (IPv6) in networks has little effect on FTP
implementations, because IPv6 addressing is managed by the IP layer (see
Chapter 9, “IP version 6” on page 327). However, a problem does arise with the
previously architected PORT command. To understand this, we must discuss the
PORT command in greater detail.
As discussed previously (see 14.1.3, “The active data transfer” on page 520), the
FTP client uses the PORT command to inform the server of what port and IP
address on which the client will listen for the data connection to be opened. The
syntax of this command is as follows:
PORT h1,h2,h3,h4,p1,p2
In this syntax, h1 through h4 are the four octets of an IP version 4 (IPv4) address.
p1 represents a number to be multiplied by 256, and then added to p2 to obtain
the port number. For example, assume that an FTP client issued the following
PORT command to an FTP server:
PORT 10,1,200,201,9,8
This command instructs the FTP server to open a connection to IP address
10.1.200.201. It also tells the server that it should initiate the connection to port
(9 x 256) + 8 = 2312.
By this definition of the PORT command’s syntax, it becomes clear that the
command was designed only to be suitable for an IPv4 address, because there
are too few fields to accommodate an IPv6 address. A similar problem occurs
when using the PASV command. Though the PASV command is issued with no
arguments, the FTP server's RFC architected reply is as follows:
227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)
Again, h1 through h4 denote the IP address to which the client must initiate a
connection, and (p1 x 256) + p2 denotes the port on which the server is listening.
Therefore, the same IPv4 restrictions that exist for the PORT command also
exist for PASV processing.
To overcome this limitation, RFC 2428 was created to replace the PORT and
PASV commands with EPRT and EPSV, respectively. The EPRT command is
processed similar to the PORT command, but the format is now as follows:
EPRT <d>protocol<d>address<d>port<d>
<d> Denotes a delimiter from the ASCII range of 33 to 126, though
the ASCII 124 character (|) is recommended.
protocol Is the address family by which the client requests a connection. A
value of 1 is a request for IPv4, and a value of 2 is a request for
IPv6.
address Is the corresponding IPv4 or IPv6 address to which the
connection must be opened.
port Is simply the port to which the connection must be opened.
Two examples of the EPRT command in use are as follows (note that the IPv6
address is simply the IPv4 address converted to IPv6 notation):
򐂰 IPv4: EPRT |1|10.1.200.201|2312|
򐂰 IPv6: EPRT |2|::0A01:C8C9|2312|
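The EPRT format and the EPSV reply described above can be sketched as small helpers (a hypothetical illustration, assuming the recommended | delimiter):

```python
import re

def build_eprt(address, port, delimiter="|"):
    """Format an EPRT command; protocol 1 requests IPv4, 2 requests IPv6."""
    protocol = "2" if ":" in address else "1"
    return "EPRT " + delimiter.join(["", protocol, address, str(port), ""])

def parse_epsv_reply(reply):
    """Extract the data port from a 229 reply; the IP address is the one
    already used by the control connection, so only the port is sent."""
    return int(re.search(r"\(\|\|\|(\d+)\|\)", reply).group(1))
```

For example, build_eprt("10.1.200.201", 2312) produces "EPRT |1|10.1.200.201|2312|", and parse_epsv_reply("229 Entering Passive Mode (|||2312|)") yields 2312.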
Similar to PASV, EPSV can be issued without an argument. The server’s reply to
the EPSV command has been designed as follows:
229 <text> (<d><d><d>port<d>)
The text is optional, and simply indicates to users that the passive mode will be
implemented. The fields between the delimiters must be blank, because the
EPSV command assumes that the FTP client will initiate a data connection to the
same IP address on which the control connection is established. An example of a
possible server reply to an EPSV command is:
229 Entering Passive Mode (|||2312|)
Upon receiving this, the FTP client opens a data connection to port 2312 on the
same IP address to which the control connection is active. This is also useful when
establishing a secure FTP session (see 14.1.9, “Securing FTP sessions” on
page 527) across a firewall that implements Network Address Translation (NAT).
NAT implementation during an FTP transfer requires the firewall to translate the
IP address specified on the client’s PORT command or server’s PASV response.
This allows local IP addresses to be routable from outside of the client or server’s
subnet. However, if the control connection has been encrypted, the NAT
implementation can no longer access these addresses, and the FTP applications
might not be able to establish a data connection. By using the EPSV
command, however, the FTP client already knows what IP address to use, and
the data connection establishment can proceed without the intervention of the
NAT implementation.
14.1.9 Securing FTP sessions
Although FTP provides security by requiring that users log on with a
platform-specific user ID and password, this only prevents unauthorized access
to the system itself. When transferring data from one host to another, the data
within the packets (both on the control connection and the data connection) is
sent in clear text. Therefore, network tools such as packet traces and sniffer
devices can capture the packets and gain access to the transferred data.
Additionally, the user ID and password used to log on to the server can be
captured in these traces, giving a malicious user access to the system.
To avoid this problem, the design of FTP has been enhanced to make use of
Transport Layer Security (TLS). TLS is defined in RFC 4346, and defines a
standard of data encryption between two hosts. As denoted by its name, TLS is
implemented on the transport layer, and thus applications using TLS do not need
to know the specifics of RFC 4346. Instead, such applications only need to know
how to invoke TLS, and in the case of FTP, this process is defined by RFCs 2228
and 4217.
The configuration and implementation of TLS varies by platform. However, RFCs
2228 and 4217 add to the FTP architecture the following TLS-related commands,
which invoke TLS regardless of the platform’s specific configuration process:
AUTH Specifies the authentication method to be used in securing the
FTP connection.
ADAT Passes Base64-encoded security data, specific to the
mechanism specified on the AUTH command, used in the
security negotiation between the client and server.
PBSZ Specifies the largest buffer size in which encrypted data can be
passed between the client and server.
PROT Specifies the data channel protection level. The argument
passed must be one of the following:
C Clear: Neither authentication nor encryption is used.
S Safe: Authentication is performed, but no encryption is
used.
E Confidential: Encryption is performed, but no
authentication is implemented.
P Private: Both encryption and authentication are
performed.
MIC Provides a Base64-encoded message used to integrity protect
commands sent on the control connection.
CONF Provides a Base64-encoded message used to confidentially
protect commands sent on the control connection.
ENC Provides a Base64-encoded message used to privately protect
commands sent on the control connection.
CCC Instructs the FTP server that integrity-protected commands are
no longer required for the FTP session.
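Python's standard ftplib module exposes this negotiation through its FTP_TLS class; a minimal sketch follows (the host name and credentials are placeholders):

```python
from ftplib import FTP_TLS

def secure_session(host, userid, password):
    """Sketch of a TLS-protected FTP logon."""
    ftps = FTP_TLS(host)
    ftps.auth()                    # sends AUTH TLS; TLS negotiation follows
    ftps.login(userid, password)   # USER/PASS now travel encrypted
    ftps.prot_p()                  # sends PBSZ 0 and PROT P: private data channel
    return ftps
```

This mirrors the RFC 4217 flow: the control connection is secured first, so the user ID and password are never sent in clear text.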
Figure 14-6 shows an example of a common security negotiation using these
commands.
[Figure content: after the control connection is established, the client sends
AUTH TLS and the server replies 234 Security environment established - ready
for negotiations; the TLS negotiation then takes place. The client sends PBSZ 0
(200 PBSZ=0 is the protection buffer size) and PROT P (200 Data connection
set to private). The client then logs on with USER userid (331 Send password
please) and PASS password (230 userid is logged on. Working directory
is /tmp).]
Figure 14-6 An example of FTP TLS processing
Most FTP client and server implementations provide configuration options to
automate the security negotiation using the previous commands. This makes the
implementation of secure FTP easier, because users do not need to understand
or implement the details of the security during the FTP session itself, and are
only prompted for a user ID and possibly a password.
14.2 Trivial File Transfer Protocol (TFTP)
The Trivial File Transfer Protocol (TFTP) is a standard protocol with STD number
33. Its status is elective, and it is described in RFC 1350 – The TFTP Protocol
(Revision 2). Updates to TFTP are in RFCs 1785, 2347, 2348, and 2349.
TFTP file transfer is a disk-to-disk data transfer, and is a simple protocol used to
transfer files. The simplicity of the architecture is deliberate in order to facilitate
ease of implementation. This simplistic approach has many benefits over
traditional FTP, including:
򐂰 Use by diskless devices to download firmware at boot time
򐂰 Use by any automated process for which the assignment of a user ID or
password is not feasible
򐂰 Small application size, allowing it to be implemented inexpensively and in
environments where resources are constricted
TFTP is implemented on top of the User Datagram Protocol (UDP, see 4.2, “User
Datagram Protocol (UDP)” on page 146). The TFTP client initially sends a
read/write request through well-known port 69. The server and the client then
determine the port that they will use for the rest of the connection. TFTP lacks
most of the features of FTP (see 14.1, “File Transfer Protocol (FTP)” on
page 514), and instead is limited to only reading a file from a server or writing a
file to a server.
Note: TFTP has no provisions for user authentication; in that respect, it is an
insecure protocol.
14.2.1 TFTP usage
The commands used by TFTP implementations are not architected by an RFC.
Instead, only the direct interaction between a TFTP server and client are defined.
Therefore, the commands used to invoke this interaction vary between different
implementations of this protocol. However, each implementation has some
variation of the following commands:
Connect <host>
Specifies the destination host ID.
Mode <ascii|binary>
Specifies the type of transfer mode.
Get <remote filename> [<local filename>]
Retrieves a file.
Put <remote filename> [<local filename>]
Stores a file.
Verbose
Toggles verbose mode, which displays
additional information during file transfer, on or
off.
Quit
Exits TFTP.
For a full list of these commands, see the user's guide of your particular TFTP
implementation.
Every TFTP transfer begins with a request to read or write a file. If the server
grants the request, the connection is opened and the file is sent in blocks of 512
bytes (fixed length). Blocks of the file are numbered consecutively, starting at 1,
and each packet carries exactly one block of data.
Each data packet must be answered by an acknowledgment packet before the
next one can be sent. Termination of the transfer is assumed on a data packet of
less than 512 bytes. Although almost all errors cause termination of the
connection due to lack of reliability, TFTP can recover from packet loss. If a
packet is lost in the network, a timeout occurs, initiating a retransmission of the
last packet. This retransmission occurs both for lost data blocks and for lost
acknowledgments.
The requirement that every packet be acknowledged—including
retransmissions—uncovered a design flaw in TFTP known as the Sorcerer's
Apprentice Syndrome (SAS), described in RFC 783. On networks that
experience latency or other delays, this flaw might cause excessive
retransmission by both sides of the TFTP implementation. SAS was further
documented in RFC 1123 and later corrected by adding OACK packets as a
TFTP extension, described in RFC 2347. For additional details, refer to the
appropriate RFCs.
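The block numbering and termination rules described above can be sketched as follows (a hypothetical illustration; note that a file whose length is an exact multiple of 512 requires a trailing empty block so the receiver sees a short final block):

```python
BLOCK_SIZE = 512

def split_into_blocks(data):
    """Split file data into numbered TFTP blocks, starting at block 1.
    A final block shorter than 512 bytes signals end of transfer, so a
    file that is an exact multiple of 512 gets a trailing empty block."""
    blocks = [(i // BLOCK_SIZE + 1, data[i:i + BLOCK_SIZE])
              for i in range(0, len(data), BLOCK_SIZE)]
    if len(data) % BLOCK_SIZE == 0:
        blocks.append((len(blocks) + 1, b""))
    return blocks
```

A 1000-byte file therefore becomes block 1 (512 bytes) and block 2 (488 bytes), where the short block 2 terminates the transfer.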
14.2.3 TFTP packets
TFTP uses six types of packets, described in Table 14-3.
Table 14-3 TFTP packet types
Opcode Packet type
1 Read Request (RRQ)
2 Write Request (WRQ)
3 Data (DATA)
4 Acknowledgement (ACK)
5 Error (ERROR)
6 Option Acknowledgement (OACK)
The TFTP header contains the opcode associated with the packet. See
Figure 14-7 for more details.
[Figure content: packet header formats. Read/write request packet: opcode =
1/2 (2 bytes), filename (string), 0 (1 byte), mode (string), 0 (1 byte). Data
packet: opcode = 3 (2 bytes), block # (2 bytes), data (up to 512 bytes). ACK
packet: opcode = 4 (2 bytes), block # (2 bytes). Error packet: opcode = 5
(2 bytes), error code (2 bytes), error message (string), 0 (1 byte). OACK
packet: opcode = 6 (2 bytes), followed by option name and value strings.]
Figure 14-7 TFTP packet headers
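The header layouts in Figure 14-7 can be built with a few lines of Python (a sketch for illustration; opcodes and field order follow RFC 1350):

```python
import struct

RRQ, WRQ, DATA, ACK, ERROR, OACK = 1, 2, 3, 4, 5, 6

def build_rrq(filename, mode="octet"):
    """Opcode 1, then filename and mode as zero-terminated strings."""
    return (struct.pack("!H", RRQ) + filename.encode() + b"\x00"
            + mode.encode() + b"\x00")

def build_data(block, payload):
    """Opcode 3, a 2-byte block number, and up to 512 bytes of data."""
    return struct.pack("!HH", DATA, block) + payload

def build_ack(block):
    """Opcode 4 and the 2-byte block number being acknowledged."""
    return struct.pack("!HH", ACK, block)
```

For example, build_ack(1) yields the 4-byte packet 00 04 00 01, acknowledging block 1.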
14.2.4 Data modes
Two modes of transfer are currently defined in RFC 1350:
netascii US-ASCII, as defined in the USA Standard Code for Information
Interchange with modifications specified in RFC 854 – Telnet
Protocol Specification, and extended to use the high order bit.
(That is, it is an 8-bit character set, unlike US-ASCII, which is
7-bit.)
octet Raw 8-bit bytes, also called binary.
The mode used is indicated in the TFTP header for the Request for Read/Write
packets (RRQ/WRQ).
14.2.5 TFTP multicast option
The TFTP multicast option enables multiple clients to get files simultaneously
from the server using multicast packets. For example, when two similar
machines are remotely booted, they can retrieve the same configuration file
simultaneously by adding the multicast option to the TFTP option set. The TFTP
multicast option is described in RFC 2090. Figure 14-8 is an example of a TFTP
read request packet modified to include the multicast option.
Figure 14-8 The TFTP Multicast header
If the server accepts the multicast, it sends an option acknowledgment (OACK)
packet to the client including the multicast option. This packet consists of the
multicast address and a flag that specifies whether the client should send
acknowledgments (ACK).
14.2.6 Security issues
Because TFTP does not have any authentication mechanism, the server is
responsible for protecting the host files. Generally, TFTP servers do not allow
write access and only allow read access to public directories. Some server
implementations also might employ host access lists to restrict access to only a
subset of hosts.
14.3 Secure Copy Protocol (SCP) and SSH FTP (SFTP)
Secure Copy Protocol (SCP) and Secure Shell File Transfer Protocol (SFTP) are
two methods of implementing secure transfer of files from one host to another.
Neither protocol has been architected to perform authentication; instead, both
rely on the underlying SSH protocol on which they were built. This not only simplifies
the implementation of the protocol, but avoids the pitfalls of traditional FTP (see
14.1, “File Transfer Protocol (FTP)” on page 514) by encrypting all data that
passes between the two hosts.
Neither protocol is RFC architected, though an Internet draft currently exists for
SFTP. Despite this lack of standardization, the protocols are implemented widely
enough to be considered de facto industry standards, and thus warrant
discussion here.
14.3.1 SCP syntax and usage
SCP functions much like the copy (cp) command on UNIX-based systems, and
takes the following format:
scp flags sourceFile destinationFile
Both the user ID needed to gain access to the remote host and the location of the
remote host are specified as part of the file name. For example, assume a user
wants to copy a file from a remote host. The file is in the /tmp directory and
is named yourdata. Additionally, this user has a user ID on the host of myId, and
wants to copy the file to the /localfolder directory as mydata. In order to achieve
this using SCP, the user issues the following command:
scp myId@remotehost:/tmp/yourdata /localfolder/mydata
Similarly, this user can upload a copy of mydata to the remote host, using the
following command:
scp /localfolder/mydata myId@remotehost:/tmp/yourdata
Many implementations of SCP also include additional flags that allow further
configuration of the SCP command. Although there are currently no
RFC-architected standards, the open source community has established the
following flags for use with the protocol:
-c cipher
Selects the cipher to use for encrypting the data transfer.
-i identity_file Selects the file from which the identity (private key) for RSA
authentication is read.
-p Preserves modification times, access times, and modes from
the original file.
-r Recursively copies entire directories.
-v Verbose mode. Causes scp and ssh(1) to print debugging
messages about their progress. This is helpful in debugging
connection, authentication, and configuration problems.
-B Selects batch mode.
-q Disables the progress meter.
-C Passes the -C flag to ssh(1) to enable compression.
-F ssh_config
Specifies an alternative per-user configuration file for SSH.
-P port
Specifies the port to connect to on the remote host.
-S program
Specifies the name of program to use for the encrypted
connection. The program must understand ssh(1) options.
-o ssh_option
Can be used to pass options to SSH in the format used in
ssh_config(5). This is useful for specifying options for which
there is no separate scp command-line flag. For example,
forcing the use of protocol version 1 is specified using scp
-oProtocol=1.
-4 Forces scp to use IPv4 addresses only.
-6 Forces scp to use IPv6 addresses only.
14.3.2 SFTP syntax and usage
SFTP allows a greater range of action than does SCP, but is still founded on the
same principle of using SSH to perform secure transfers of data between
systems. Unlike SCP, which only provides the capability of transferring a file,
SFTP offers many of the same features found in traditional FTP. For this reason,
a common mistake is to assume that SFTP is the RFC-architected FTP
implemented using SSH. This is not the case, because FTP has a separate
method of establishing data security (see 14.1.9, “Securing FTP sessions” on
page 527). Instead, SFTP was designed independently of FTP.
Some of the additional functionality provided by SFTP over SCP include the
ability to browse and manipulate the directory structure, rename files, and alter
file permissions. There are three ways to invoke SFTP:
򐂰 sftp flags host
This method of invoking SFTP enables a user to log on to the remote host. If
SSH cannot establish the authenticity of the remote host, the user might be
prompted for a password.
򐂰 sftp flags user@host:file
This method of invoking SFTP allows SFTP to automatically retrieve the
specified file.
򐂰 sftp flags user@host:directory
This method of invoking SFTP logs the user onto the remote host and
immediately places the user in the specified directory.
Similar to SCP, SFTP has a set of industry standard flags that can be configured
when invoking SFTP:
-b batchfile
Batch mode allows the SFTP commands to be entered
into a file instead of requiring a user to enter each
command sequentially. Because this mode does not
allow user interaction, non-interactive authentication
must be used. Additionally, in batch mode, SFTP
aborts if any of the following commands experience a
failure: get, put, rename, ln, rm, mkdir, chdir, lchdir,
and lmkdir.
-o ssh_option
Can be used to pass options to SSH.
-s subsystem | sftp_server
Specifies the SSH2 subsystem or the path for an
SFTP server on the remote host.
-v Raises the logging level. This option is also passed to
ssh.
-B buffer_size
Specifies the size of the buffer that SFTP uses when
transferring files.
-C Enables compression.
-F ssh_config
Specifies an alternative per-user configuration file for
ssh.
-P sftp_server_path
Connects directly to a local SFTP server.
-R num_requests
Specifies how many requests can be outstanding at
any one time. Increasing this might slightly improve file
transfer speed but will increase memory usage. The
default is 16 outstanding requests.
-S program
Specifies the name of the program to use for the
encrypted connection.
-1 Specifies the use of protocol version 1.
14.3.3 SFTP interactive commands
Many of the interactive commands offered by SFTP closely resemble those
offered by a traditional FTP client (see 14.1.2, “FTP operations” on page 515).
Indeed, the functionality afforded by both protocols is in many ways equivalent.
Directory navigation and manipulation
The following commands are used to not only navigate through a directory
structure, but also alter paths within that structure. Note that fields enclosed in
brackets [ ] are optional.
cd path
Changes remote directory to path.
lcd path
Changes local directory to path.
lmkdir path
Creates the local directory specified by path.
mkdir path
Creates the remote directory specified by path.
rmdir path
Removes the remote directory specified by path.
File manipulation
The following commands provide the ability to not only transfer files, but also to
delete and rename files:
get [-P] remote-path [local-path]
Retrieves the file specified by remote-path and stores it
on the local machine. If the local-path name is not
specified, it is given the same name it has on the remote
machine. If the -P flag is specified, the file's full
permission and access time are copied, too.
put [-P] local-path [remote-path]
Uploads the file specified by local-path and stores it on
the remote host using the name specified by remote-path.
If the remote-path name is not specified, it is given the
same name it has on the local machine. If the -P flag is
specified, the file's full permission and access time are
copied, too.
rename oldpath newpath
Renames the remote file from oldpath to newpath.
rm path
Deletes the remote file specified by path.
Obtaining information
Use the following commands to obtain information in the SFTP session:
help
Displays help text.
lls [path]
Displays the local directory listing of either the specified
path, or current directory if path is not specified.
lpwd
Prints the local working directory.
ls [-l] [path]
Displays the remote directory listing of either the specified
path, or the current directory if path is not specified. If the
-l flag is specified, it displays additional details including
permissions and ownership information.
pwd
Displays the remote working directory.
?
Synonym for help.
Shell manipulation
Because SFTP is built on the secure shell environment, users might need to
obtain information about the shell, or issue shell commands without terminating
SFTP. For this, use the following commands:
chgrp grp path
Changes the group of the file specified by path to grp. grp
must be a numeric GID.
chmod mode path
Changes the permissions of the file specified by path to
mode.
chown own path
Changes the owner of the file specified by path to own.
own must be a numeric UID.
ln oldpath newpath
Creates a symbolic link from oldpath to newpath.
lumask umask
Sets the local umask to umask.
symlink oldpath newpath
Creates a symbolic link from oldpath to newpath.
! command
Executes the command in the local shell.
!
Escapes to the local shell.
Exiting SFTP
The following three commands terminate the SFTP session:
򐂰 bye
򐂰 exit
򐂰 quit
14.4 Network File System (NFS)
Designed by Sun Microsystems, the Network File System (NFS) protocol
enables machines to share file systems across a network. The NFS protocol is
designed to be machine-independent, operating system-independent, and
transport protocol-independent. This is achieved through implementation on top
of Remote Procedure Call (see 11.2.2, “Remote Procedure Call (RPC)” on
page 415), which establishes machine independence by using the External Data
Representation (XDR) convention.
Sun NFS is a proposed standard protocol with an elective status. The current
NFS specifications are in RFC 1813 – NFS: NFS Version 3 Protocol
Specification and RFC 3530 – NFS version 4 Protocol.
14.4.1 NFS concept
The NFS model allows authorized users to access files located on remote
systems as though they were local. Two protocols in the model serve this purpose:
򐂰 The Mount protocol specifies the remote host and file system to be accessed
and where to locate them in the local file hierarchy.
򐂰 The NFS protocol performs the file I/O to the remote file system.
Both the Mount and NFS protocols are RPC applications that implement the
client/server model (see Figure 11-1 on page 409) and are transported by both
TCP and UDP.
Mount protocol
The Mount protocol is an RPC application shipped with NFS and uses program
number 100005. The MOUNT command acts as an RPC server program and
provides a total of six procedures when accessing remote systems:
NULL
Does nothing. Useful for server response testing.
MNT
MOUNT function. Returns a file handle pointing to the directory.
DUMP
Returns the list of all mounted file systems.
UMNT
Removes a mount list entry.
UMNTALL
Removes all mount list entries for this client.
EXPORT
Returns information about the available file systems.
A file handle is a variable-length array with a maximum length of 64 bytes and is
used by clients to access remote files. File handles are a fundamental part of
NFS, because each directory and file is referenced only through a file handle. For this
reason, some implementations increase the security of the protocol by
encrypting the handles. The file handles are obtained by executing the MOUNT
application’s MNT procedure, which locates the remote file system within the file
hierarchy, and returns the file handle. Following the MOUNT command, a user
can access the remote file system as though it were a part of the local file
system.
For example, consider two hosts:
򐂰 HostA implements a hierarchical file system consisting of numerous
directories and subdirectories.
򐂰 HostB does not implement a hierarchical file system, but instead implements
a series of minidisks, with each minidisk appearing as a separate folder.
A user on HostA issues the MOUNT command to locally mount a minidisk from
HostB. However, the user does not see the mounted volume as a minidisk, but
instead sees it in the context of the hierarchical file system of HostA.
Although specific implementations can provide additional features, the generic
syntax of the MOUNT command is as follows:
MOUNT -o options host:rvolume lvolume
options
System-specific options, such as message size.
host
The TCP/IP name of the remote host.
rvolume
The remote volume to be mounted.
lvolume
The location to which the remote volume will be mounted. This is
also called the mount point.
In the case of our HostA and HostB example, assume the user wants to mount
minidisk1 on HostB as /usr/lpp/tmp/md1. The user might issue the following
command:
MOUNT HostB:minidisk1 /usr/lpp/tmp/md1
The user can access minidisk1 by simply navigating on the local file system to
/usr/lpp/tmp/md1. This is illustrated in Figure 14-9.
Figure 14-9 The NFS model (Host A issues mount HostB:minidisk1 to the NFS server RPC program on Host B)
To better understand the MOUNT command, the syntax can be divided into three
parts:
-o options
The client portion. It is intended to be understood by the NFS
client only. Therefore, valid options are client
implementation-specific, and additional information must be
sought from each implementation’s documentation.
host:rvolume The server portion. The syntax of this argument depends on the
server's file system. In our example, we used a generic
HostB:minidisk1 argument. However, modify the argument to
conform with the file system of the remote server. Additionally,
the remote server can include additional parameters within this
argument, such as user authentication or file system attributes.
Refer to the documentation of the specific NFS server
implementation to determine what parameters it will accept.
lvolume
A client part. This is called the mount point. This is where the
remote file system will be hooked into the local file system.
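As a small illustration of the server portion, here is a hypothetical helper that splits a host:rvolume argument on the first colon, much as an NFS client must do before contacting the server. The function name is an assumption for this sketch:

```python
def split_server_arg(arg):
    """Split a MOUNT server argument of the form host:rvolume."""
    host, sep, rvolume = arg.partition(":")
    if not sep or not host or not rvolume:
        raise ValueError("expected host:rvolume, got %r" % arg)
    return host, rvolume

# The generic argument from the example above:
host, rvolume = split_server_arg("HostB:minidisk1")
```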
The UMOUNT command removes the remote file system from the local file
hierarchy. Continuing the previous example, the following command removes the
/usr/lpp/tmp/md1 directory:
UMOUNT /usr/lpp/tmp/md1
NFS protocol
NFS is the RPC application program providing file I/O functions to a remote host
after it has been requested through a MOUNT command. It has program number
100003 and sometimes uses IP port 2049. Because this is not an officially
assigned port and several versions of NFS (and Mount) already exist, port
numbers can change. We advise you to use portmap or RPCBIND (see 11.2.2,
“Remote Procedure Call (RPC)” on page 415) to obtain the port numbers for both
the Mount and NFS protocols. The NFS protocol is transported by both TCP and
UDP.
The NFS program supports 22 procedures, providing for all basic I/O operations,
such as:
ACCESS
Resolves the access rights, according to the set of
permissions of the file for that user.
LOOKUP
Searches for a file in the current directory and, if found,
returns a file handle pointing to it plus information about the
file's attributes.
READ and WRITE
Basic read/write primitives to access the file.
RENAME
Renames a file.
REMOVE
Deletes a file.
MKDIR and RMDIR
Creates and deletes subdirectories.
GETATTR and SETATTR
Gets or sets file attributes.
Other functions are also provided.
These correspond to most of the file I/O primitives used in the local operating
system to access local files. In fact, after the remote directory is mounted, the
local operating system has only to re-route the file I/O primitives to the remote
host. This makes all file I/Os look alike, regardless of whether the file is located
locally or remotely. Users can run their normal commands and programs on both
kinds of files; in other words, the NFS protocol is completely transparent to the
user (see Figure 14-10).
Figure 14-10 NFS: File I/O (the operating system re-routes the NFS application's I/O primitives)
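Earlier, we advised using portmap or RPCBIND to discover the ports of the Mount and NFS programs. As a minimal sketch, the following builds (but does not send) an ONC RPC GETPORT request; the field layout follows RFC 1831 and RFC 1833, and the transaction ID here is arbitrary:

```python
import struct

def portmap_getport_call(xid, prog, vers, proto):
    """Build an ONC RPC GETPORT call to the portmapper (program 100000).
    prog is the target program number: 100003 for NFS, 100005 for Mount."""
    msg = struct.pack(
        ">IIIIII",
        xid,        # transaction ID, chosen by the caller
        0,          # message type 0: CALL
        2,          # RPC protocol version 2
        100000,     # portmapper program number
        2,          # portmapper version
        3,          # procedure 3: GETPORT
    )
    msg += struct.pack(">IIII", 0, 0, 0, 0)            # AUTH_NONE cred + verifier
    msg += struct.pack(">IIII", prog, vers, proto, 0)  # GETPORT arguments
    return msg

# Ask where NFS version 3 over UDP (IP protocol 17) is listening:
request = portmap_getport_call(xid=1, prog=100003, vers=3, proto=17)
```

Sending this datagram to UDP port 111 on the server would return the port number in the reply; the sketch stops at message construction.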
14.4.2 File integrity
Because NFS is a stateless protocol, there is a need to protect the file integrity of
the NFS-mounted files. Many implementations have the Lock Manager protocol
for this purpose. Sometimes, multiple processes open the same file
simultaneously. If the file is opened only for read access, every process gets
the most current data. If more than one process writes to the file at one time,
the changes made by the writers must be coordinated, and the most recent
changes must be made visible to the readers. If part of the file is cached on
each client's local system, synchronizing the writers and readers becomes even
more complicated. Although such situations might not occur frequently, they
must be taken into consideration.
14.4.3 Lock Manager protocol
This protocol allows client processes to exclude other processes while they are
writing to a file. When a process locks a file for exclusive access, no other
process can access the file. When a process locks a file for shared access, other
processes can share the file but cannot obtain an exclusive lock on it. If a
process issues a request that conflicts with the current locking state, it either
waits until the lock is removed or receives an error.
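The exclusive/shared distinction can be illustrated with POSIX advisory locks, which NFS client implementations commonly map onto Lock Manager requests. This is a local sketch, not NFS-specific code, and the file name is hypothetical:

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shared.dat")

with open(path, "w") as writer:
    fcntl.flock(writer, fcntl.LOCK_EX)   # exclusive lock: excludes all others
    writer.write("update in progress")
    fcntl.flock(writer, fcntl.LOCK_UN)   # release so readers can proceed

with open(path) as reader:
    fcntl.flock(reader, fcntl.LOCK_SH)   # shared lock: other readers may join,
    data = reader.read()                 # but no exclusive lock can be taken
    fcntl.flock(reader, fcntl.LOCK_UN)
```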
14.4.4 NFS file system
NFS assumes a hierarchical, or directory-based, file system. In such a system,
files are unstructured streams of uninterpreted bytes; that is, files are seen as a
contiguous byte stream, without any record-level structure.
With NFS, all file operations are synchronous. This means that the file operation
call only returns when the server has completed all work for this operation. In
case of a write request, the server will physically write the data to disk and if
necessary, update any directory structure before returning a response to the
client. This ensures file integrity.
NFS also specifies that servers should be stateless. That is, a server does not
need to maintain any extra information about any of its clients in order to function
correctly. In case of a server failure, clients only have to retry a request until the
server responds without having to reiterate a MOUNT operation.
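The retry behavior that statelessness permits can be sketched as a generic loop; the function and parameter names here are assumptions for illustration:

```python
import time

def call_with_retry(send_request, retries=5, delay=0.01):
    """Resend an idempotent request until the server answers.

    Because a stateless NFS server keeps no per-client information, the
    client can simply retry the same request after a server crash without
    redoing the MOUNT operation."""
    for _ in range(retries):
        try:
            return send_request()
        except TimeoutError:
            time.sleep(delay)   # server may be restarting; no state was lost
    raise TimeoutError("server did not respond")

# Simulated request that fails twice, then answers:
attempts = {"n": 0}
def fake_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return "reply"
```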
14.4.5 NFS version 4
NFS in version 4 (NFSv4) is defined in RFC 3530. The NFS model and concepts
are essentially unchanged from version 3 (RFC 1813), and the new version
focuses on improving performance, increasing security, augmenting
cross-platform interoperability, and providing a design for protocol extensions.
Removal of ancillary protocols
As noted previously, the NFS model incorporated numerous other protocols,
such as the Mount protocol and the Lock Management protocol, in order to
address gaps in the basic NFS protocol. The use of the Mount protocol in
previous versions of NFS was solely for the purpose of obtaining the initial file
handle for individual servers. This has been replaced by the combination of
public file handles and the LOOKUP operation, allowing the NFSv4 protocol to mount file
systems without invoking additional protocols. Additionally, NFSv4 integrates
locking schemes into the base NFS protocol, and therefore no longer relies upon
the Lock Manager protocol. These integrations are transparent to users.
Introduction of stateful operations
Previous versions of NFS were considered stateless. However, with NFSv4,
OPEN and CLOSE operations were added, creating a concept of statefulness on
the NFS server. This addition provides numerous advantages over the stateless
model, including:
򐂰 The ability to integrate enhanced locking functionality
򐂰 The ability for a server to delegate authority to a client
򐂰 Allowing aggressive caching of file data
The total number of operations in NFSv4 has expanded to 38. Many of these
perform functions similar to those provided in previous releases of NFS, but new
operations have been added to integrate locking, security, and client access.
Examples of some of these include:
LOCK
Creates a lock on a file.
LOCKT
Tests for the existence of a lock on a file.
LOCKU
Unlocks a file.
SECINFO
Retrieves NFS security information.
Security enhancements
In version 4, NFS now takes advantage of available security methods, including:
򐂰 Support for RPCSEC_GSS, described in RFC 2203
򐂰 Support for Kerberos 5, described in RFC 4120
򐂰 Support for the Low Infrastructure Public Key Mechanism (LIPKEY), defined
in RFC 2847
򐂰 Support for the Simple Public-Key GSS-API Mechanism (SPKM-3), defined in
RFC 2025
Additional changes
The following additional changes, which do not fit into the previous discussions of
NFSv4, are worth noting:
򐂰 New volatile style file handles, which help server implementations cope with
file system reorganizations
򐂰 Compound commands for performance, which execute lookup and read
commands in one operation
򐂰 Server delegation of status and locks to a given client, improving caching
򐂰 Internationalization in the form of 16/32-bit character support for file names by
means of the UTF-8 encoding
14.4.6 Cache File System
The Cache File System (CacheFS™) provides the ability to cache one file
system on another. CacheFS accomplishes caching by mounting remote
directories on the local system. Whenever the client needs a mounted file, it first