Network Programming Guide

Part Number: 800-3850-10
Revision A of 27 March, 1990
Trademarks
SunOS™, Sun Workstation®, as well as the word "Sun" followed by a numerical suffix, are trademarks
of Sun Microsystems, Incorporated.
ONC is a trademark of Sun Microsystems, Incorporated.
UNIX® and UNIX System V® are trademarks of Bell Laboratories.
All other products or services mentioned in this document are identified by the trademarks or service
marks of their respective companies or organizations.
Legal Notice to Users
The Network Information Service (NIS) was formerly known as Sun Yellow Pages. The functionality of
the two remains the same; only the name has changed. The name Yellow Pages™ is a registered trademark in the United Kingdom of British Telecommunications plc and may not be used without permission.
Copyright © 1990 Sun Microsystems, Inc. - Printed in U.S.A.
All rights reserved. No part of this work covered by copyright hereon may be reproduced in any form or by any
means - graphic, electronic, or mechanical - including photocopying, recording, taping, or storage in an information
retrieval system, without the prior written permission of the copyright owner.
Restricted rights legend: use, duplication, or disclosure by the U.S. government is subject to restrictions set forth in
subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 52.227-7013 and in
similar clauses in the FAR and NASA FAR Supplement.
The Sun Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface,
which license also covers Sun's licensees.
This product is protected by one or more of the following U.S. patents: 4,777,485; 4,688,190; 4,527,232; 4,745,407;
4,679,014; 4,435,792; 4,719,569; 4,550,368, in addition to foreign patents and applications pending.
This software and documentation is based in part on the Fourth Berkeley Software Distribution under license from the
Regents of the University of California. We acknowledge the following individuals and institutions for their role in
its development: The Regents of the University of California, the Electrical Engineering and Computer Sciences
Department at the Berkeley Campus of the University of California, and Other Contributors.
Contents
Chapter 1 Network Services .......... 1
  1.1. The Major Network Services .......... 2
  1.2. Network Programming Manual Overview .......... 3
  1.3. The Network File System (NFS) .......... 4
    Computing Environments .......... 5
    Example NFS usage .......... 6
      Example 1: Mounting a Remote Filesystem .......... 6
      Example 2: Exporting a Filesystem .......... 7
      Example 3: Administering a Server Machine .......... 8
    NFS Architecture .......... 8
      Transparent Information Access .......... 8
      Different Machines and Operating Systems .......... 8
      Easily Extensible .......... 8
      Ease of Network Administration .......... 9
      Reliability .......... 9
      High Performance .......... 9
    The Sun NFS Implementation
      The NFS Interface
      The NFS and the Mount Protocol
      Pathname Parsing
      Export and Mount Lists
      UNIX Mount Protocol Procedures
    A Stateless Protocol
    Note: Miscellaneous Network Operations
  1.4. Remote File Sharing (RFS) .......... 18
    Advertise .......... 18
    Unadvertise .......... 18
    Remote Mounts .......... 18
    Resource Naming .......... 19
    RFS Security Features .......... 19
      Client Authentication .......... 19
      Client Authorization .......... 19
      User and Group Id Mapping .......... 20
  1.5. The Portmapper .......... 21
    Port Registration .......... 21
  1.6. The Network Information Service Database Service .......... 22
    What Is The Network Information Service? .......... 23
    Network Information Service Maps .......... 23
    Network Information Service Domains .......... 23
    Masters and Slaves .......... 24
      Naming .......... 24
      Data Storage .......... 25
      Servers .......... 25
      Clients .......... 25
    Default NIS Files .......... 25
      Hosts .......... 26
      Passwd .......... 26
      Others .......... 26
      Changing your passwd .......... 26
  1.7. The Network Lock Manager .......... 26
    The Locking Protocol .......... 29
  1.8. The Network Status Monitor .......... 30
PART ONE: Network Programming .......... 31

Chapter 2 Introduction to Remote Procedure Calls .......... 33
  2.1. Overview .......... 33
    How it is useful .......... 33
    Terminology .......... 33
    The RPC Model .......... 34
  2.2. Versions and Numbers .......... 35
  2.3. Portmap .......... 36
  2.4. Transports and Semantics .......... 36
    Transport Selection .......... 36
  2.5. External Data Representation .......... 37
  2.6. rpcinfo .......... 37
  2.7. Assigning Program Numbers .......... 37
Chapter 3 rpcgen Programming Guide .......... 41
  3.1. The rpcgen Protocol Compiler .......... 41
    Converting Local Procedures into Remote Procedures .......... 42
    An Advanced Example .......... 47
    Debugging Applications .......... 52
    The C-Preprocessor .......... 52
    rpcgen Programming Notes .......... 53
      Network Types .......... 53
      User-Provided Define Statements .......... 53
      Inetd Support .......... 54
      Dispatch Tables .......... 54
    Client Programming Notes .......... 55
      Timeout Changes .......... 55
      Client Authentication .......... 56
    Server Programming Notes .......... 56
      Handling Broadcast on the Server Side .......... 56
      Other Information Passed to Server Procedures .......... 57
    RPC Language .......... 58
      Definitions .......... 58
      Enumerations .......... 58
      Typedefs .......... 59
      Constants .......... 59
      Declarations .......... 59
      Structures .......... 60
      Unions .......... 60
      Programs .......... 61
      Special Cases .......... 62

Chapter 4 Remote Procedure Call Programming Guide .......... 65
  4.1. Layers of RPC .......... 65
    Higher Layers of RPC .......... 66
    Middle Layers of RPC .......... 67
      Passing Arbitrary Data Types .......... 69
    Lower Layers of RPC .......... 72
      More on the Server Side .......... 73
      More on the Client Side .......... 75
      Memory Allocation with XDR .......... 77
  4.2. Raw RPC .......... 78
  4.3. Other RPC Features .......... 80
    Select on the Server Side .......... 80
    Broadcast RPC .......... 81
      Broadcast RPC Synopsis .......... 82
    Batching .......... 82
    Authentication .......... 86
      UNIX Authentication .......... 86
      DES Authentication .......... 89
    Using Inetd .......... 92
  4.4. More Examples .......... 92
    Versions on Server Side .......... 92
    Versions on Client Side .......... 94
    TCP .......... 95
    Callback Procedures .......... 98
  4.5. Futures .......... 101

Chapter 5 External Data Representation: Sun Technical Notes .......... 103
  Justification .......... 104
    A Canonical Standard .......... 106
  The XDR Library .......... 107
  5.1. XDR Library Primitives .......... 109
    Number Filters .......... 109
    Floating Point Filters .......... 110
    Enumeration Filters .......... 111
    No Data .......... 111
    Constructed Data Type Filters .......... 111
      Strings .......... 111
      Byte Arrays .......... 112
      Arrays .......... 113
      Opaque Data .......... 115
      Fixed Sized Arrays .......... 116
      Discriminated Unions .......... 116
      Pointers .......... 118
    Non-filter Primitives .......... 120
    XDR Operation Directions .......... 120
    XDR Stream Access .......... 120
      Standard I/O Streams .......... 120
      Memory Streams .......... 121
      Record (TCP/IP) Streams .......... 121
    XDR Stream Implementation .......... 123
      The XDR Object .......... 123
  5.2. Advanced Topics .......... 124
    Linked Lists .......... 124

PART TWO: Protocol Specifications .......... 129
Chapter 6 External Data Representation Standard: Protocol Specification .......... 131
  6.1. Status of this Standard .......... 131
  6.2. Introduction .......... 131
    Basic Block Size .......... 131
  6.3. XDR Data Types .......... 132
    Integer .......... 132
    Unsigned Integer .......... 132
    Enumeration .......... 133
    Boolean .......... 133
    Hyper Integer and Unsigned Hyper Integer .......... 133
    Floating-point .......... 133
    Double-precision Floating-point .......... 134
    Fixed-length Opaque Data .......... 135
    Variable-length Opaque Data .......... 135
    String .......... 136
    Fixed-length Array .......... 136
    Variable-length Array .......... 137
    Structure .......... 137
    Discriminated Union .......... 138
    Void .......... 138
    Constant .......... 139
    Typedef .......... 139
    Optional-data .......... 140
    Areas for Future Enhancement .......... 141
  6.4. Discussion .......... 141
    Why a Language for Describing Data? .......... 141
    Why Only one Byte-Order for an XDR Unit? .......... 141
    Why does XDR use Big-Endian Byte-Order? .......... 141
    Why is the XDR Unit Four Bytes Wide? .......... 141
    Why must Variable-Length Data be Padded with Zeros? .......... 142
    Why is there No Explicit Data-Typing? .......... 142
  6.5. The XDR Language Specification .......... 142
    Notational Conventions .......... 142
    Lexical Notes .......... 142
    Syntax Information .......... 143
      Syntax Notes .......... 144
  6.6. An Example of an XDR Data Description .......... 145
  6.7. References .......... 146
Chapter 7 Remote Procedure Calls: Protocol Specification .......... 147
  7.1. Status of this Memo .......... 147
  7.2. Introduction .......... 147
    Terminology .......... 147
    The RPC Model .......... 147
    Transports and Semantics .......... 148
    Binding and Rendezvous Independence .......... 149
    Authentication .......... 149
  7.3. RPC Protocol Requirements .......... 149
    Programs and Procedures .......... 150
    Authentication .......... 150
    Program Number Assignment .......... 151
    Other Uses of the RPC Protocol .......... 151
      Batching .......... 152
      Broadcast RPC .......... 152
  7.4. The RPC Message Protocol .......... 152
  7.5. Authentication Protocols .......... 155
    Null Authentication .......... 155
    UNIX Authentication .......... 155
    DES Authentication .......... 156
      Naming .......... 156
      DES Authentication Verifiers .......... 156
      Nicknames and Clock Synchronization .......... 157
      DES Authentication Protocol (in XDR language) .......... 158
      Diffie-Hellman Encryption .......... 159
  7.6. Record Marking Standard .......... 160
  7.7. The RPC Language .......... 161
    An Example Service Described in the RPC Language .......... 161
    The RPC Language Specification .......... 162
    Syntax Notes .......... 162
  7.8. Port Mapper Program Protocol .......... 162
    Port Mapper Protocol Specification (in RPC Language) .......... 163
    Port Mapper Operation .......... 164
  7.9. References .......... 165
Chapter 8 Network File System: Version 2 Protocol
Specification ........................................................................................................... 167
8.1. Status oftllis Standard ..................................................................................................
167
8.2. Introduction .........................................................................................................................
167
Remote Procedure Call ..............................................................................................
167
External Data Representation ................................................................................
167
Stateless Servers ............................................................................................................
169
8.3. NFS Protocol Definition .............................................................................................. 169
File System Model ....................................................................................................... 169
RPC Information ........................................................................................................... 170
Sizes of XDR Structures ........................................................................................... 170
Basic Data Types .......................................................................................................... 170
stat ............................................................................................................................ 171
ftype ......................................................................................................................... 172
fhandle ................................................................................................................... 172
timeval ................................................................................................................... 173
fattr ......................................................................................................................... 173
sattr ......................................................................................................................... 174
filename ............................................................................................................... 174
path ............................................................................................................................ 174
attrstat ............................................................................................................... 175
diropargs ............................................................................................................ 175
diropres ............................................................................................................... 175
Server Procedures ......................................................................................................... 175
Do Nothing .................................................................................................................. 176
Get File Attributes .................................................................................................. 176
Set File Attributes ................................................................................................... 176
Get Filesystem Root .............................................................................................. 177
Look Up File Name ............................................................................................... 177
Read From Symbolic Link ................................................................................. 177
Read From File ......................................................................................................... 177
Contents - Continued
Write to Cache .......................................................................................................... 178
Write to File ............................................................................................................... 178
Create File ................................................................................................................... 178
Remove File ............................................................................................................... 178
Rename File ............................................................................................................... 179
Create Link to File .................................................................................................. 179
Create Symbolic Link ........................................................................................... 179
Create Directory ....................................................................................................... 180
Remove Directory ................................................................................................... 180
Read From Directory ............................................................................................. 180
Get Filesystem Attributes ................................................................................... 181
8.4. NFS Implementation Issues ....................................................................................... 181
Server/Client Relationship ...................................................................................... 182
Pathname Interpretation ........................................................................................... 182
Permission Issues .......................................................................................................... 182
Setting RPC Parameters ............................................................................................ 183
8.5. Mount Protocol Definition ......................................................................................... 183
Introduction ...................................................................................................................... 183
RPC Information ........................................................................................................... 184
Sizes of XDR Structures ............................................................................................ 184
Basic Data Types .......................................................................................................... 184
fhandle ................................................................................................................... 184
fhstatus ............................................................................................................... 184
dirpath ................................................................................................................... 184
name ............................................................................................................................ 185
Server Procedures ......................................................................................................... 185
Do Nothing .................................................................................................................. 185
Add Mount Entry ..................................................................................................... 185
Return Mount Entries ............................................................................................ 185
Remove Mount Entry ............................................................................................ 186
Remove All Mount Entries ................................................................................ 186
Return Export List .................................................................................................. 186
PART THREE: Transport-Level Programming .................................................... 187
Chapter 9 Transport Level Interface Programming ....................................... 189
9.1. Background ......................................................................................................................... 189
9.2. Document Organization ............................................................................................... 191
9.3. Overview of the Transport Interface ..................................................................... 192
Modes of Service .......................................................................................................... 192
Connection-Mode Service .................................................................................. 193
Local Management ................................................................................................ 193
Connection Establishment ................................................................................ 194
Data Transfer ............................................................................................................ 196
Connection Release ............................................................................................... 196
Connectionless-Mode Service ......................................................................... 197
State Transitions ............................................................................................................ 197
9.4. Introduction to Connection-Mode Services ...................................................... 197
Local Management ...................................................................................................... 198
The Client .................................................................................................................... 199
The Server ................................................................................................................... 201
Connection Establishment ....................................................................................... 204
The Client .................................................................................................................... 204
Event Handling ......................................................................................................... 205
The Server ................................................................................................................... 206
Data Transfer ................................................................................................................... 209
The Client .................................................................................................................... 210
The Server ................................................................................................................... 211
Connection Release ..................................................................................................... 213
The Server ................................................................................................................... 213
The Client .................................................................................................................... 214
9.5. Introduction to Connectionless-Mode Service ............................................... 215
Local Management ...................................................................................................... 215
Data Transfer ................................................................................................................... 217
Datagram Errors ............................................................................................................ 219
9.6. A Read/Write Interface ................................................................................................ 219
write ............................................................................................................................... 221
read .................................................................................................................................. 221
close ............................................................................................................................... 221
9.7. Advanced Topics ............................................................................................................. 222
Asynchronous Execution Mode ............................................................................ 222
Advanced Programming Example ...................................................................... 223
9.8. State Transitions ............................................................................................................... 229
Transport Interface States ........................................................................................ 229
Outgoing Events ............................................................................................................ 229
Incoming Events ............................................................................................................ 230
Transport User Actions .............................................................................................. 231
State Tables ...................................................................................................................... 231
9.9. Guidelines for Protocol Independence ................................................................ 233
9.10. Some Examples ............................................................................................................. 234
Connection-Mode Client .......................................................................................... 235
Connection-Mode Server ......................................................................................... 236
Connectionless-Mode Transaction Server ...................................................... 239
Read/Write Client ......................................................................................................... 241
Event-Driven Server .................................................................................................... 243
9.11. Glossary .............................................................................................................................. 248
Chapter 10 A Socket-Based Interprocess Communications
Tutorial .................................................................................................................... 251
10.1. Goals .................................................................................................................................... 251
10.2. Processes ............................................................................................................................ 252
10.3. Pipes ..................................................................................................................................... 253
10.4. Socketpairs ....................................................................................................................... 256
10.5. Domains and Protocols .............................................................................................. 258
10.6. Datagrams in the UNIX Domain ......................................................................... 260
10.7. Datagrams in the Internet Domain ...................................................................... 263
10.8. Connections ...................................................................................................................... 266
10.9. Reads, Writes, Recvs, etc. ........................................................................................ 275
10.10. Choices ............................................................................................................................. 278
10.11. What to do Next .......................................................................................................... 278
Chapter 11 An Advanced Socket-Based Interprocess
Communications Tutorial ......................................................................... 279
11.1. Basics ................................................................................................................................... 280
Socket Types ................................................................................................................... 280
Socket Creation .............................................................................................................. 281
Binding Local Names ................................................................................................. 282
Connection Establishment ....................................................................................... 283
Data Transfer ................................................................................................................... 285
Discarding Sockets ....................................................................................................... 286
Connectionless Sockets ............................................................................................. 286
Input/Output Multiplexing ...................................................................................... 288
11.2. Library Routines ............................................................................................................ 290
Host Names ...................................................................................................................... 291
Network Names ............................................................................................................. 291
Protocol Names .............................................................................................................. 292
Service Names ................................................................................................................ 292
Miscellaneous ................................................................................................................. 293
11.3. Client/Server Model .................................................................................................... 295
Servers ................................................................................................................................. 295
Clients ................................................................................................................................. 298
Connectionless Servers .............................................................................................. 299
11.4. Advanced Topics ........................................................................................................... 302
Out Of Band Data ......................................................................................................... 302
Non-Blocking Sockets ............................................................................................... 304
Interrupt Driven Socket I/O .................................................................................... 304
Signals and Process Groups .................................................................................... 305
Pseudo Terminals .......................................................................................................... 306
Selecting Specific Protocols ................................................................................... 308
Address Binding ............................................................................................................ 309
Broadcasting and Determining Network Configuration ......................... 311
Socket Options ............................................................................................................... 314
inetd ............................................................................................................................... 315
Chapter 12 Socket-Based IPC Implementation Notes .................................. 317
Overview ............................................................................................................................ 317
Goals .................................................................................................................................... 318
12.1. Memory, Addressing .................................................................................................. 318
Address Representation ............................................................................................. 318
Memory Management ................................................................................................ 319
12.2. Internal Layering ........................................................................................................... 322
Socket Layer .................................................................................................................... 323
Socket State ...................................................................................................................... 324
Socket Data Queues ..................................................................................................... 325
Socket Connection Queuing ................................................................................... 326
Protocol Layer(s) ........................................................................................................... 326
Network-Interface Layer .......................................................................................... 328
12.3. Socket/Protocol Interface ......................................................................................... 331
12.4. Protocol to Protocol Interface ................................................................................ 334
pr_output() ........................................................................................................... 335
pr_input() .............................................................................................................. 335
pr_ctlinput() .................................................................................................... 336
pr_ctloutput() ................................................................................................. 336
12.5. Protocol/Network-Interface Interface ................................................................ 337
Packet Transmission ................................................................................................... 337
Packet Reception ........................................................................................................... 337
12.6. Gateways and Routing Issues ................................................................................ 338
Routing Tables ............................................................................................................... 338
Routing Table Interface ............................................................................................. 340
User Level Routing Policies ................................................................................... 341
12.7. Raw Sockets ..................................................................................................................... 341
Control Blocks ................................................................................................................ 341
Input Processing ............................................................................................................ 342
Output Processing ......................................................................................................... 343
12.8. Buffering, Congestion Control .............................................................................. 343
Memory Management ................................................................................................ 343
Protocol Buffering Policies ..................................................................................... 343
Queue Limiting .............................................................................................................. 344
Packet Forwarding ....................................................................................................... 344
12.9. Out of Band Data .......................................................................................................... 344
12.10. Acknowledgements ................................................................................................... 345
12.11. References ...................................................................................................................... 345
Index ....................................................................................................................................................... 347
Tables
Table 1-1 MOUNT: Remote Procedures, Version 1 .................................................. 15
Table 2-1 Registered RPC Program Numbers ............................................................... 38
Table 4-1 RPC Service Library Routines ......................................................................... 67
Table 9-1 Local Management Routines ............................................................................. 194
Table 9-2 Connection Establishment Routines ............................................................. 195
Table 9-3 Connection Mode Data Transfer Routines ................................................ 196
Table 9-4 Connection Release Routines ........................................................................... 196
Table 9-5 Connectionless-mode Data Transfer Routines ........................................ 197
Table 9-6 Transport Interface States .................................................................................... 229
Table 9-7 Transport Interface Outgoing Events ............................................................ 230
Table 9-8 Transport Interface Incoming Events ........................................................... 231
Table 11-1 C Run-time Routines ........................................................................................... 293
Table 11-2 ruptime Output .................................................................................................. 299
Figures
Figure 1-1 An Example NFS Filesystem Hierarchy ................................................... 7
Figure 1-2 Mount and NFS Servers ..................................................................................... 13
Figure 1-3 Typical Portmapping Sequence ..................................................................... 22
Figure 1-4 Architecture of the NFS Locking Service ................................................ 28
Figure 2-1 Network Communication with the Remote Procedure Call .......... 35
Figure 9-1 OSI Reference Model .......................................................................................... 189
Figure 9-2 Transport Interface ................................................................................................ 192
Figure 9-3 Channel Between User and Provider .......................................................... 193
Figure 9-4 Transport Connection .......................................................................................... 195
Figure 9-5 Listening and Responding Transport Endpoints .................................. 209
Figure 9-6 Common Local Management State Table ................................................ 232
Figure 9-7 Connectionless-Mode State Table ................................................................ 232
Figure 9-8 Connection-Mode State Table ........................................................................ 233
Figure 10-1 Use of a Pipe ......................................................................................................... 253
Figure 10-2 Sharing a Pipe between Parent and Child .............................................. 255
Figure 10-3 Use of a Socketpair ............................................................................................
Figure 10-4 Sharing a Socketpair between Parent and Child ................................
Figure 10-5 Reading UNIX Domain Datagrams ..........................................................
Figure 10-6 Sending a UNIX Domain Datagram .........................................................
Figure 10-7 Reading Internet Domain Datagrams .......................................................
Figure 10-8 Sending an Internet Domain Datagram ................................................... 264
Figure 10-9 Initiating an Internet Domain Stream Connection ............................ 266
Figure 10-10 Accepting an Internet Domain Stream Connection ...................... 268
Figure 10-11 Using select() to Check for Pending Connections ................ 270
Figure 10-12 Establishing a Stream Connection .......................................................... 272
Figure 10-13 Initiating a UNIX Domain Stream Connection ............................... 272
Figure 10-14 Accepting a UNIX Domain Stream Connection ............................. 273
Figure 10-15 Varieties of Read and Write Commands ............................................. 276
Figure 11-1 Remote Login Client Code ............................................................................ 294
Figure 11-2 Remote Login Server ........................................................................................ 295
Figure 11-3 rwho Server ........................................................................................................... 300
Figure 11-4 Flushing Terminal I/O on Receipt of Out Of Band Data .............. 303
Figure 11-5 Use of Asynchronous Notification of I/O Requests ......................... 305
Figure 11-6 Use of the SIGCHLD Signal ......................................................................... 306
Figure 11-7 Creation and Use of a Pseudo Terminal ................................................. 307
1
Network Services
This guide gives an overview of the network services available in the SunOS 4.1
release. To appreciate the design of these services, it's necessary to see that
SunOS is structurally a network UNIX system, designed to evolve as network technology changes.
SunOS originally diverged from the 4.2BSD UNIX system, a system that already
strained at the limits of the UNIX system's original simplicity of design. It was
with 4.2BSD that many of the network services found in SunOS were first introduced. Fortunately, the Berkeley designers found alternatives to wedging everything into the kernel. They implemented network services by offloading certain
jobs to specialized daemons (server processes) working in close cooperation with
the kernel, rather than by adding all new code to the kernel itself. Though NFS is
primarily kernel based (using a daemon only to make system calls), SunOS has
continued this line of development. Its expanding domain of network services is
uniformly built upon a daemon (server) based architecture. Examples of server
daemons are the portmapper, the network naming service (NIS), the Remote
Execution Facility (REX), the Network Lock Manager, and the Status Monitor.
Terminology
A machine that provides resources to the network is called a "server", while a
machine that employs these resources is called a "client". A machine may be
both a server and a client, and when NFS resources (files and directories) are at
issue, often is. A person logged in on a client machine is a "user", while a program
or set of programs that run on a client is an "application". There is a distinction
between the code implementing the operations of a filesystem (called
"filesystem operations") and the data making up the filesystem's structure and
contents (called "filesystem data").
Network services are added to SunOS by means of server processes that are
based upon Sun's RPC (Remote Procedure Call) mechanism. These servers are
executed on all machines that provide the service. Sun daemons differ
significantly from those that were inherited from Berkeley in that most of them
are based on RPC. As a consequence, they automatically benefit from the services provided by RPC, and the External Data Representation (XDR) that it is
built upon - for example, the data portability provided by XDR and RPC's
authentication system.
Anything built with RPC/XDR is automatically a network application, as is anything that stores data in NFS files, even if it doesn't use RPC directly. Furthermore, insofar as network applications can presume the functionality of other
network applications and call upon their services, all network applications are
network services as well. The RPC/XDR environment, then, is inherently extensible. New network services can easily be added by building upon the foundation already in place. In SunOS, then, network services are analogous to UNIX
commands - anyone can add one, and when they do, they are effectively
extending the "system".
1.1. The Major Network
Services
NOTE: The term Open Network Computing (ONC) covers the RPC-based utilities
only (such as REX, NIS, the Lock Manager, and the Status Monitor). The other
network utilities described here are not considered part of ONC.
The Remote Procedure Call (RPC) facility is a library of procedures that provide
a means whereby one process (the caller process) can have another process (the
server process) execute a procedure call, as if the caller process had executed the
procedure call in its own address space (as in the local model of a procedure
call). Because the caller and the server are now two separate processes, they no
longer have to live on the same physical machine.
The External Data Representation (XDR) is a specification for a portable data
representation standard. RPC uses XDR to ensure that data is represented identically
across different computers, operating systems, and programming languages. In
SunOS 4.1, XDR is implemented on top of the socket interface, giving programmers
standardized access to sockets without concern for the low-level details of
socket-based IPC.
The Network File System (NFS) is an operating system-independent service
that allows users to mount directories, even root directories, across the network,
and then to treat those directories as if they were local. There is also an
option for a secure mount involving DES authentication of user and host - for
more information about it, see the Secure Networking Features chapter of the
Security Features Guide.
The portmapper is a system service upon which all other RPC-based services rely. It is a kind of registrar that keeps track of the correspondence between ports (logical communications channels) and services on a machine, and provides a standard way for a client to look up the port number of any RPC program supported by the server. In practice, only RPC programs use it.
Sun's Network Information Service (NIS) is a network service designed to ease the job of administering large networks. NIS is a replicated, read-only, distributed database service. Network file-system clients use it to access network-wide data in a manner that is entirely independent of the relative locations of the client and the server. The NIS database typically provides password, group, network, and host information.
As part of its System V compatibility program, Sun now supports System V (SVID) compatible advisory file and record locking for both local and NFS-mounted files. User programs simply issue lockf() and fcntl() system calls to set and test file locks - these calls are then processed by Network Lock Manager daemons, which maintain locks at the network level, even in the face of multiple machine crashes.
The lock-manager daemons are able to manage machine crashes because they are based upon a general-purpose Network Status Monitor. This monitor provides a mechanism by which network applications can detect machine reboots and trigger application-specific recovery mechanisms. The Lock Manager is therefore equipped with a flexible, fault-tolerant recovery capability.
There are other network services - NIS and REX¹ are two obvious examples - and there are many others that are certainly services in the broad sense. This section, however, is intended as an introduction, and it covers only the fundamental services noted above.
1.2. Network Programming Manual Overview
This Network Programming manual contains this Network Services overview and then three major sections. In this overview the fundamental network services are introduced without dealing with any protocol- or implementation-related issues.
PART ONE focuses on Sun's network programming mechanisms. It includes:
o  The rpcgen Programming Guide, which introduces the rpcgen protocol compiler and the C-like language that it uses to specify RPC applications and define network data. In almost all cases, rpcgen will allow network applications developers to avoid the use of lower-level RPC mechanisms.
o  The Remote Procedure Call Programming Guide, which is intended for programmers who wish to understand the lower-level RPC mechanisms. Readers are assumed to be familiar with the C language and to have a working knowledge of network theory.
o  The External Data Representation: Sun Technical Notes, which introduces XDR and explains the justification for its "canonical" approach to network data interchange. This section also gives Sun implementation information and a few examples of advanced XDR usage.
PART TWO includes a number of protocol specifications. Both the External Data Representation Protocol Specification and the Remote Procedure Call Protocol Specification have been published as DARPA RFCs (Requests for Comments).
These protocol specifications include:
o  The External Data Representation Protocol Specification, which includes a complete specification of XDR data types, a discussion of the XDR approach, and a number of examples of XDR usage. This specification is published as DARPA RFC 1014.
o  The Remote Procedure Call Protocol Specification, which includes a discussion of the RPC model, a detailed treatment of the RPC authentication facilities, and a complete specification of the portmapper Protocol. This specification is published as DARPA RFC 1057.
¹ These, however, are not fundamental network services in the same sense as NFS. REX, for example, cannot be guaranteed to be portable to a non-UNIX environment. This is true because the executability of a program depends on many environmental factors - from machine architecture to operating-system services - that are not universally available.
o  The Network File System: Version 2 Protocol Specification, which includes a complete specification of the Mount Protocol, as well as the NFS specification itself. This specification is published as DARPA RFC 1094.
PART THREE documents Transport-Level Network Programming.
o  The first chapter, Transport Level Interface (TLI) Programming, describes the TLI system interface for direct access to network mechanisms.
The rest of the chapters in this part document the Berkeley-style, socket-based Inter-Process Communications mechanisms.
o  A Socket-Based Interprocess Communications Tutorial, which introduces socket-based IPC. It assumes little more than basic networking concepts on the part of its reader, and includes many examples.
o  An Advanced Socket-Based Interprocess Communications Tutorial, which takes up where the Tutorial leaves off.
o  Berkeley-Style IPC Implementation Notes, which describes the low-level networking primitives (e.g. accept(), bind(), and select()) which originated with the 4.2BSD UNIX system. This document is of interest primarily to system programmers and aspiring UNIX gurus.
1.3. The Network File System (NFS)
The Network File System is a facility for sharing files in a heterogeneous
environment of machines, operating systems, and networks. Sharing is accomplished by mounting a remote filesystem, then reading or writing files in place.
NFS was not designed by extending SunOS onto the network - such an
approach was considered unacceptable because it would mean that every computer on the network would have to run SunOS. Instead, operating-system
independence was taken as an NFS design goal, along with machine independence, crash recovery, transparent access and high performance. NFS was thus
designed as a collection of network services, and not as a distributed operating
system. As such, it is able to support distributed applications without restricting
the network to a single operating system.
Sun's implementation of NFS is integrated with the SunOS kernel for reasons of efficiency, although such close integration is not strictly necessary. Other vendors will make different choices, as dictated by their operating environments and applications. And because of NFS's open design, all of these applications will be able to work together on a single network.
Computing Environments
The traditional timesharing environment looks like this:
[Figure: a mainframe connected to terminal1 through terminal4]
The major problem with this environment is competition for CPU cycles. The workstation environment solves that problem, but requires more disk drives. A network environment looks like this:
[Figure: workstation1 through workstation4, a server, and a printer, all attached to a network]
The goal of the NFS design was to make all disks available as needed. Individual workstations have access to all information residing anywhere on the network. Printers and supercomputers may also be available somewhere on the network.
Example NFS usage
This section gives three examples of NFS usage.
Example 1: Mounting a Remote Filesystem
Suppose your machine name is client, that you want to read some on-line manual pages, and that these pages are not available on your server machine, named server, but are available on another machine named docserv. Mount the directory containing the manuals as follows:
client# /usr/etc/mount docserv:/usr/man /usr/man
Note that you have to be superuser in order to do this. Now you can use the man
command whenever you want. Try running the mount -p command (on
client) after you've mounted the remote filesystem. Its output will look
something like this:
server:/roots/client   /              nfs   rw,hard      0  0
server:/usr            /usr           nfs   ro           0  0
server:/home/server    /home/server   nfs   rw,bg        0  0
server:/usr/local      /usr/local     nfs   ro,soft,bg   0  0
docserv:/usr/man       /usr/man       nfs   ro,soft,bg   0  0
You can remote mount not only filesystems, but also directory hierarchies inside filesystems. In this example, /usr/man is not a filesystem mount point - it's just a subdirectory within the /usr filesystem. Here's a diagram showing a few key directories of the three machines involved in this example. Ellipses represent machines, and NFS-mounted filesystems are shown boxed. There are five such boxed directories, corresponding to the five lines shown in the mount -p output above. The docserv:/usr/man directory is shown mounted as the /usr/man directory on client, as it would be by the mount command given above.
Figure 1-1   An Example NFS Filesystem Hierarchy
[Figure: the directory trees of server, client, and docserv, showing /etc, /lib, /usr, /usr/bin, /usr/man, and /usr/local; the five NFS-mounted directories are shown boxed]
Example 2: Exporting a Filesystem
Suppose that you and a colleague need to work together on a programming project. The source code is on your machine, in the directory /usr/proj. It
doesn't matter whether your workstation is a diskless node or has a local disk.
Suppose that after creating the proper directory your colleague tried to remote
mount your directory. Unless you have explicitly exported the directory, your
colleague's remote mount will fail with a "permission denied" message.
To export a directory, first become superuser and then edit the /etc/exports file. If your colleague is on a machine named cohort, then you need to run exportfs(8) (after putting this line in /etc/exports):
/usr/proj   -access=cohort
If no explicit access is given for a directory, then the system allows anyone on the network to remote mount your directory. By giving explicit access to cohort, you have denied access to others. (For more details about /etc/exports, see the exports(5) man page). mountd, the NFS mount-request server (see The NFS Interface, below), reads the file /etc/xtab whenever it receives a request for a remote mount. The file /etc/xtab contains the
entries for directories that are currently exported. Now your cohort can remote
mount the source directory by issuing this command:
cohort# /etc/mount client:/usr/proj /usr/proj
This, however, isn't the end of the story, since NFS requests are also checked at request time. If you do nothing, the accesses that you've established in your /etc/exports file will stay in effect, but you (and your programs) are free to change them at any time with the exportfs command and system call.
Since both you and your colleague will be able to edit files in /usr/proj, it would be best to use the sccs source code control system for concurrency control.
Example 3: Administering a Server Machine
System administrators must know how to set up the NFS server machine so that
client workstations can mount all the necessary filesystems. You export filesystems (that is, make them available) by placing appropriate lines in the
/etc/exports file. Here is a sample /etc/exports file for a typical
server machine:
/                  -access=systems
/exec              -access=engineering:joebob:shilling
/usr               -access=engineering
/home/server       -access=engineering
/home/local.sun2   -access=engineering:athena
/home/local.sun3   -access=engineering
Machine names or netgroups, such as staff (see netgroup(5)), may be specified after the filesystem, in which case remote mounts are limited to machines that are members of the netgroup. For the complete syntax of the /etc/exports file, see exports(5). At any time, the system administrator can see which filesystems are remote mounted by executing the showmount command.
NFS Architecture
Transparent Information Access
Users are able to get directly to the files they want without knowing the network address of the data. To the user, all NFS-mounted filesystems look just like private disks. There's no apparent difference between reading or writing a file on a local disk, and reading or writing a file on a disk in the next building. Information on the network is truly distributed.
Different Machines and Operating Systems
No single vendor can supply tools for all the work that needs to get done, so
appropriate services must be integrated on a network. NFS provides a flexible,
operating system-independent platform for such integration.
Easily Extensible
A distributed system must have an architecture that allows integration of new
software technologies without disturbing the extant software environment. Since
the NFS network-services approach does not depend on pushing the operating
system onto the network, but instead offers an extensible set of protocols for data
exchange, it supports the flexible integration of new software.
Ease of Network Administration
The administration of large networks can be complicated and time-consuming, yet they should (ideally) be at least as easy to administer as a set of local filesystems on a timesharing system. The UNIX system has a convenient set of maintenance commands developed over the years, and the Network Information Service, an NFS-based network database service, has allowed them to be adapted and extended for the purpose of administering a network of machines. The NIS also allows certain aspects of network administration to be centralized onto a small number of file servers, e.g. only server disks must be backed up in networks of diskless clients. An overview of the NIS facility is presented in The Network Information Service Database Service section of this manual.
The NIS interface is implemented using RPC and XDR, so it is available to non-UNIX operating systems and non-Sun machines. NIS servers do not interpret data, so it is easy for new databases to be added to the NIS service without modifying the servers.
Reliability
NFS's reliability derives from the robustness of the 4.2BSD filesystem, from the stateless NFS protocol², and from the daemon-based methodology by which network services like file and record locking are provided. See The Network Lock Manager for more details on locking. In addition, the file server protocol is designed so that client workstations can continue to operate even when the server crashes and reboots.
The major advantage of a stateless server is robustness in the face of client, server, or network failures. Should a client fail, it is not necessary for a server (or human administrator) to take any action to continue normal operation. Should a server or the network fail, it is only necessary that clients continue to attempt to complete NFS operations until the server or network gets fixed. This robustness is especially important in a complex network of heterogeneous systems, many of which are not under the control of a professional operations staff, and which may be running untested systems that are often rebooted without warning.
High Performance
The flexibility of NFS allows configuration for a variety of cost and performance trade-offs. For example, configuring servers with large, high-performance disks, and clients with no disks, may yield better performance at lower cost than having many machines with small, inexpensive disks. Furthermore, it is possible to distribute the filesystem data across many servers and get the added benefit of multiprocessing without losing transparency. In the case of read-only files, copies can be kept on several servers to avoid bottlenecks.
Sun has also added several performance enhancements to NFS, such as "fast paths" for key operations, asynchronous service of multiple requests, disk-block caching, and asynchronous read-ahead and write-behind. The fact that caching and read-ahead occur on both client and server effectively increases the cache size and read-ahead distance. Caching and read-ahead do not add state to the server; nothing (except performance) is lost if cached information is thrown away. In the case of write-behind, both the client and server attempt to flush critical information to disk whenever necessary, to reduce the impact of an unanticipated failure; clients do not free write-behind blocks until the server confirms that the data is written.
² The NFS protocol is stateless because each transaction stands on its own. The server doesn't have to remember anything - about clients or files - between transactions.
The Sun NFS Implementation
In the Sun NFS implementation, there are three entities to be considered: the operating system interface, the virtual file system (VFS) interface, and the network file system (NFS) interface. The UNIX operating system interface has been preserved in the Sun implementation of NFS, thereby ensuring compatibility for existing applications. Applications use read(2) and write(2) to access NFS files just as they do to access local files.
The VFS is best seen as a layer that Sun has wrapped around the traditional UNIX filesystem. This traditional filesystem is composed of directories and files, each of which has a corresponding inode (index node) containing administrative information about the file, such as location, size, ownership, permissions, and access times. Inodes are assigned unique numbers within a filesystem, but a file on one filesystem could have the same number as a file on another filesystem.
This is a problem in a network environment, because remote filesystems need to
be mounted dynamically, and numbering conflicts would cause havoc. To solve
this problem, Sun designed the VFS, which is based on a data structure called a
vnode. In the VFS, files are guaranteed to have unique numerical designators,
even within a network. Vnodes cleanly separate file system operations from the
semantics of their implementation. Above the VFS interface, the operating system deals in vnodes; below this interface, the filesystem may or may not implement inodes. The VFS interface can connect the operating system to a variety
of filesystems (for example, 4.2 BSD or MS-DOS). A local VFS connects to
file system data on a local device.
The remote VFS defines and implements the NFS interface on the basis of the
RPC and XDR mechanisms. The figure below shows the flow of a request from
a client (at the top left) to a collection of filesystems.
[Figure: system calls enter the VFS interface; beneath it sit the 4.2BSD UFS local VFS, other VFS types (like PC-FS), and the NFS client VFS, which passes requests through the RPC/XDR layers and across the network to the server's RPC/XDR layers]
In the case of access through a local VFS, requests are directed to filesystem data on devices connected to the client machine. In the case of access through a remote VFS, the request is passed through the RPC and XDR layers onto the net. In the current implementation, Sun uses the UDP/IP protocols and the Ethernet. On the server side, requests are passed through the RPC and XDR layers to an NFS server; the server uses vnodes to access one of its local VFSs and service the request. This path is retraced to return results.
Sun's implementation of NFS provides five types of transparency:
1. Filesystem Type: The vnode, in conjunction with one or more local VFSs (and possibly remote VFSs), permits an operating system (hence client and application) to interface transparently to a variety of filesystem types.
2. Filesystem Location: Since there is no differentiation between a local and a remote VFS, the location of filesystem data is transparent.
3. Operating System Type: The RPC mechanism allows interconnection of a variety of operating systems on the network, and makes the operating-system type of a remote server transparent.
4. Machine Type: The XDR definition facility allows a variety of machines to communicate on the network and makes the machine type of a remote server transparent.
5. Network Type: RPC and XDR can be implemented for a variety of transport protocols, thereby making the network type transparent.
Simpler NFS implementations are possible at the expense of some advantages of the Sun version. In particular, a client (or server) may be added to the network by implementing one side of the NFS interface. An advantage of the Sun implementation is that the client and server sides can be symmetrical; thus, it is possible for any machine to be client, server, or both. Users at client machines with disks can arrange to share them over NFS without having to appeal to a system administrator or configure a different system on their workstation.
The NFS Interface
As mentioned in the preceding section, a major advantage of NFS is the ability to mix filesystems. In keeping with this, Sun encourages other vendors to develop products to interface with Sun network services. The specifications for RPC and XDR have been placed in the public domain, and Sun's implementation of RPC and XDR is freely licensed, which serves as a standard for anyone wishing to develop applications for the network. Furthermore, the NFS interface itself is open, and its protocol specifications can be used by anyone wishing to implement an NFS client or server for the network.
The NFS and the Mount Protocol
The NFS interface defines traditional filesystem operations for reading directories, creating and destroying files, reading and writing files, and reading and setting file attributes. The interface is designed so that file operations address files with an uninterpreted identifier called a filehandle, a starting byte address, and a length in bytes. NFS never deals with pathnames, only with filehandles. More precisely, NFS never interprets pathnames. Some NFS procedures take pathname arguments, but they are just strings to NFS.
Given a filehandle for a directory, a client program can use NFS procedures to get other filehandles and thereby navigate throughout the directories and files of a filesystem. A client must, however, get its first filehandle for a filesystem by using RPC to call the mount server. Mount will return a filehandle that grants access to the filesystem. Figure 1-2 shows the interaction between a client program, a mount server, and an NFS server. Note that the only interface between a mount server and an NFS server is a common filehandle.
Figure 1-2   Mount and NFS Servers
[Figure: an application calls a mount server (which does the pathname parsing) and an NFS server; the common filehandle is the only interface between them. Legend: 1. Client sends pathname to mount server. 2. Mount server returns corresponding filehandle. 3. Client sends filehandle to NFS server.]
Although many operating systems have analogs to the hierarchical NFS directory and file structure, the conventions used by operating systems to formulate pathnames vary considerably. To accommodate the many possible path naming conventions, the mount procedure is not defined in the NFS protocol but in a separate mount protocol. Actually, the mount protocol is the same for any operating system; it is only the implementation that differs between systems. The mount procedure in the UNIX mount protocol converts a UNIX pathname into a filehandle. If local pathnames can be reasonably mapped to UNIX pathnames, an NFS server developer may wish to implement the UNIX mount protocol, even though the server runs on a different operating system. This approach makes the server immediately usable by clients that use the UNIX protocol and eliminates the need to develop a new mount command for UNIX-based clients.
Alternatively, a server developer can obtain a new remote program number from Sun and define a new mount protocol. For example, the mount procedure in a VMS mount protocol would take a VMS file specification rather than a UNIX pathname. Mount protocols are not mutually exclusive; a server could, for example, support the UNIX protocol for UNIX clients and a Multics protocol for Multics clients. Both protocols would return filehandles defined by the NFS implementation on their server.
The mount protocols remove pathname parsing from the NFS protocol, so that a single NFS protocol can work with multiple operating systems. This means that
users and client programs need to know the details of a server's path naming conventions only when mounting a filesystem. Different server path naming conventions therefore typically have little impact on users.
Because mounts are relatively infrequent operations, mount servers can be implemented outside of operating system kernels without materially affecting overall filesystem performance. Because user-level code is easier to write and far easier to debug than kernel code, mount servers are fairly simple to put together.
Export and Mount Lists
Technically, a mount protocol needs to define only a mount procedure that bootstraps the first filehandle for a filesystem. (By convention, a mount protocol should also define a NULL procedure.) However, adding other procedures can simplify network management. As a convenience to clients, a mount protocol might provide a procedure that returns a list of filesystems exported by a server. Another useful item is a mount list, a list of clients and the pathnames they have mounted from the server. The UNIX mount protocol defines a mount list and a procedure called readmount() that returns the list. With the help of readmount(), an administrator can notify the clients of a server that is about to be shut down.
Note that a mount list makes a mount server stateful. Recall, however, that the business of a mount server is to translate pathnames into filehandles; the state represented by a mount list does not affect a server's ability to operate correctly. Neither servers nor clients need take any action to update or rebuild a mount list after a crash. Mount server users should regard the mount and export lists provided by a mount server as "accessories" that are usually, but not necessarily, accurate.
UNIX Mount Protocol Procedures
The mount protocol consists of the six remote procedures listed in Table 1-1. The mount() procedure transforms a UNIX pathname into a filehandle, which the client can then pass to the associated NFS server. The pathname passed to the mount procedure usually refers to a directory, often the root directory of a filesystem, but it can name a file instead. In addition to returning the filehandle, mount adds the client's host name and the pathname to its mount list. The readmount() procedure returns the server's mount list. unmount() removes an entry from the server's mount list and unmountall() removes all of a client's mount list entries. The readexport() procedure returns the server's export list.
Table 1-1   MOUNT: Remote Procedures, Version 1

Number   Name         Description
0        null         Do nothing
1        mount        Return filehandle for pathname
2        readmount    Return mount list
3        unmount      Remove mount list entry
4        unmountall   Clear mount list
5        readexport   Return export list

A Stateless Protocol
The NFS interface is defined so that a server can be stateless. This means that a server does not have to remember from one transaction to the next anything about its clients, transactions completed, or files operated on. For example, there is no open() operation, as this would imply state in the server; of course, the UNIX interface uses an open() operation, but the information in the UNIX operation is remembered by the client for use in later NFS operations.
An interesting problem occurs when a UNIX application unlinks an open file. This is done to achieve the effect of a temporary file that is automatically removed when the application terminates. If the file in question is served by NFS, the call to unlink() will remove the file, since the server does not remember that the file is open. Thus, subsequent operations on the file will fail. In order to avoid state on the server, the client operating system detects the situation, renames the file rather than unlinking it, and unlinks the file when the application terminates. In certain failure cases, this leaves unwanted "temporary" files on the server; these files are removed as a part of periodic filesystem maintenance.
Another example of the advantages gained by having the NFS interface to the UNIX system without introducing state is the mount command. A UNIX client of NFS "builds" its view of the filesystem on its local devices using the mount command or via automount; thus, it is natural for the UNIX client to initiate its contact with NFS and build its view of the filesystem on the network with an extended mount command. This mount command does not imply state in the server, since it only acquires information for the client to establish contact with a server. The mount command may be issued at any time, but is typically executed as a part of client initialization. The corresponding umount command is only an informative message to the server, but it does change state in the client by modifying its view of the filesystem on the network.
The major advantage of a stateless server is robustness in the face of client, server, or network failures. Should a client fail, it is not necessary for a server (or human administrator) to take any action to continue normal operation. Should a server or the network fail, it is only necessary that clients continue to attempt to complete NFS operations until the server or network is fixed. This robustness is especially important in a complex network of heterogeneous systems, many of which are not under the control of a professional operations staff and may be
running untested systems and/or may be rebooted without warning.
An NFS server can be a client of another NFS server. However, a Sun server will not act as an intermediary between a client and another server. Instead, a client may ask what remote mounts the server has and then attempt to make similar remote mounts. The decision to disallow intermediary servers is based on several factors. First, the existence of an intermediary will impact the performance characteristics of the system; the potential performance implications are so complex that it seems best to require direct communication between a client and server. Second, the existence of an intermediary complicates access control; it is much simpler to require a client and server to establish direct agreements for service. Finally, disallowing intermediaries prevents cycles in the service arrangements; Sun prefers this to detection or avoidance schemes.
NFS currently implements UNIX file protection by making use of the authentication mechanisms built into RPC. This retains transparency for clients and applications that make use of UNIX file protection. Although the RPC definition allows other authentication schemes, their use may have adverse effects on transparency.
Note that NFS, although very UNIX-like, is not a UNIX filesystem per se - there are cases in which its behavior differs from that which would be expected of the UNIX system proper:
Note: Network access to devices such as tape drives is a good idea, but it is best implemented as a separate network service whose requirement for stateful operation is kept separate from network access to files.
o  The guaranteed APPEND_MODE is the most striking of these differences, for it simply is not supported by NFS.
o  NFS does not support device operations over the network. Support of special files is not stateful because the device operations are carried out locally.
o  There are also minor incompatibilities between NFS and UNIX file-system interfaces that are dictated by the very nature of remote NFS mounts. For example, a local NFS daemon simply can't tell that a remote disk partition is full until the remote NFS daemon tells it so. Rather than wait for a positive confirm on every write - a strategy that would impose unacceptable performance problems - the local NFS code caches writes and returns to its caller. If a remote error occurs, it gets reported back as soon as possible, but not as immediately as would a local disk.
File locking and other inherently stateful functionality has been omitted from the
base NFS definition. In this way, Sun has been able to preserve a simple, general
interface that can be implemented by a wide variety of customers. File locking
has been provided as an NFS-compatible network service, and Sun is considering
doing the same for other features that inherently imply state and/or distributed
synchronization. These features, too, will be kept separate from the base NFS
definition. In any case, the open nature of the RPC and NFS interfaces means
that customers and users who need stateful or complex features can implement
them "beside" NFS.
Revision A, of 27 March 1990
Chapter 1 - Network Services
Note: Miscellaneous Network Operations
Sun supports a small number of miscellaneous networking operations that are
useful for temporary inter-host connections, isolated file transfers, and access to
non-UNIX systems (e.g. VMS machines on the Internet). These operations
include rcp, rlogin, rsh, ftp, telnet, and tftp.
o  rcp is a remote copy utility program that uses "BSD networking facilities"
   to copy files from one machine to another. The rcp user supplies the path
   name of a file on a remote machine, and receives a stream of bytes in return.
   Access control is based on the client's login name and host name.

   The major problem with rcp is that it's not transparent to the user, who
   winds up with a redundant copy of the transferred file. With NFS, by
   contrast, only one copy of the file is necessary. Another problem is that
   rcp does nothing but copy files. To use it as a model for additional network
   services would be to introduce a remote command for every regular command:
   for example, rdiff to perform differential file comparisons across
   machines. By providing for the sharing of filesystems, NFS makes this
   unnecessary.
   rcp is useful for NFS servers to which you have login access but not NFS
   access. Files can be copied back and forth, yet you don't need any file
   system mounted.
o  rlogin allows the user to log into a remote machine, directly accessing
   both its processor and its mounted file systems. It remains useful in
   NFS-based networks because, with it, users can directly execute commands on
   remote machines over the network.
o  rsh allows the user to execute a command on a remote machine. If no
   command is specified, rsh is equivalent to rlogin. Unlike the REX-based
   on command, rsh does not copy the user's local environment to the remote
   machine before executing the command. This can be a benefit in situations
   where exporting your local environment might cause problems.
o  ftp is very much like rcp, in that it supports file copying between
   machines. However, ftp is more general than rcp, and is not restricted to
   copies between two UNIX systems.
o  telnet communicates with another host using the TELNET protocol. It
   isn't used much because rlogin is the standard mechanism for local
   inter-host communication. But like ftp, telnet is useful for non-UNIX
   systems.
o  tftp is like ftp, except that it is simpler and less reliable. This is
   because tftp's transfer protocol is very simple; it is less robust than
   ftp's protocol, and offers fewer options. tftp is also used as part of the
   diskless NFS booting procedure (i.e. netdisk).
1.4. Remote File Sharing (RFS)
Remote File Sharing (RFS) provides a means of viewing files that physically
reside on remote machines as if they were on the local machine. Remote files are
named using the same conventions as for local files, and all operations on remote
files work the same as they do on local files. Like NFS, RFS allows application
programs to transparently share files across the network.
NFS, however, is stateless: transactions are independent of each other, and thus
no recovery is required when a server or client goes down. RFS, in contrast,
supports all UNIX semantics as defined by AT&T. Consequently, it saves state
across transactions, and must recover when a server or client goes down.
RFS is used in much the same way as NFS. For both, the user accesses remote
files by mounting directories which are made available across the network by
server processes running on remote machines. The details do vary, though.
Machines using RFS make selected directories available for sharing by
advertising them. Correspondingly, machines are able to augment their own file
trees with the advertised files from other machines. This augmentation is
performed by means of a remote mount, which is a direct extension of the
standard mount operation. Once remote directories have been mounted on the
local filesystem, they are functionally part of that filesystem and are accessed
in the same way as local directories.
Advertise
To allow other machines to access a directory, its owner must advertise it by
using the adv(8) command. Once advertised, the directory and all files
contained in its subtree are available for sharing by any authorized machine.
Unadvertise
A directory can be unadvertised at any time with the unadv(8) command.
Unadvertising a directory has no effect on existing mounts of that directory, but
future mount requests will fail.
Remote Mounts
RFS extends the mount(8) operation to include a remote mount. After a
machine has advertised a resource, another machine may remotely mount that
resource in its own file tree. For example, to advertise a directory named /fs1,
the administrator of a server machine would type:

example% adv DATA /fs1

This makes the /fs1 subtree available for sharing, and specifies that other
machines will use the name DATA to refer to it when they mount it. The name
DATA can be almost any name that would work as a file name, as long as it does
not contain a period ("."). See below for the special meaning of the period.
Another machine (a client) gains access to the advertised subtree by mounting
the remote subtree on the local directory. The remote /fs1 is mounted on the
local /fs1 with the command

example% mount -d DATA /fs1

The -d option tells the mount(8) command that the resource being mounted is
remote.
There is no need for the structures of the client and server file trees to match in
any way, or for advertised subtrees to be mounted at the same level on the client
as they occupy on the server. If the client had done the remote mount onto its
/usr directory, then its references to files under /usr would yield files in the
server subtree under /fs1. A client cannot get to parts of the server file tree
that are not within an advertised directory.
Resource Naming
Resource naming is modeled after the DARPA domain naming convention,
which has a hierarchically structured name space. A domain in this usage is a
name space that may encompass a group of machines and a set of resources
advertised by that group of machines.
Resource names are made up of two components separated by a period ("."). For
example, isl.payroll might represent a resource called payroll in domain isl, and
isl.acctp might represent the machine acctp within the same domain. Whether a
name specifies a resource or a machine is determined by context; there is no
syntactic distinction. If a name is unqualified (i.e., if it contains no
periods), the associated domain may (in some cases) be inferred from the context.
A domain's name space is maintained by a domain name server, which ensures
uniqueness of names within the domain and provides a central location for
storing information about the machines and advertised resources in the domain.
The adv(8), unadv(8), mount(8), umount(8), and nsquery(8) commands use the
domain name server as a database for information about advertised resources,
such as their names and the servers that own them.
As described above, each resource is assigned a symbolic name when it is
advertised, and the resource is subsequently identified (e.g. with a mount(8)
command issued on a client) using just the domain name and that symbolic name.
Because of this symbolic naming of resources, remote users of resources need
not know the actual position of the resources within the server's file tree, nor
even what server within the domain is offering the resource. This location
independence simplifies references to resources, and allows for the transparent
migration of resources among the machines within a domain (for example, for
balancing the load among a set of server machines).
RFS Security Features
RFS contains three security features: client authentication, client
authorization, and user and group id mapping.
Client Authentication
This feature associates a password with a client machine so that the identity of a
prospective client can be checked before a mount request is serviced. Entry and
update of passwords is discussed in the rfadmin(8), rfstart(8), and
rfpasswd(8) commands.
Client Authorization
RFS provides a means of selectively advertising directories through the adv(8)
command. For example, if you want to advertise /usr/private, but only
want to authorize machines mach1 and mach2 to mount it, you would issue the
command:

example% adv PRIVATE /usr/private mach1 mach2

Without such a list of machines, the adv(8) command puts no restrictions on
availability.
One may also choose to advertise a directory read-only by using the -r option.
Here, a remote mount will only succeed if the mount command also includes the
-r option.
User and Group Id Mapping
Whenever a user accesses a remote file, that user's permissions must be checked
as part of the normal processing of the request (for example, an "open to write"
is only valid if the user making the request has write permissions on the file).
When accessing a file across two machines, there is no guarantee that the user
and group ids on the local machine have the same meaning on the other machine.
Some machines handle this problem by requiring the same numeric ids across
machines and expecting the administrators to make sure that the /etc/passwd
and /etc/group files are identical across all machines (at least the entries for
all users that access remote files). This approach is conceptually simple, but it
is not always feasible in practice, especially in large or already established
environments.
RFS, therefore, provides a range of id mapping options through the idload(8)
command. Id mapping is done by a server machine on all incoming requests, as
well as in reporting file ownership ids in response to a request from a client
machine (e.g. a stat(2) or fstat(2)). A client machine maps ids in order
to determine the effective user or group id to use in executing a program that is
stored on a server and is "set user id" or "set group id".
On each machine, mapping can be set globally, for all remote machines, or on a
per-machine basis. All mapping is based on one of two default cases:
Id  This case maps all incoming ids to id, which means that remote users will
    have the permissions associated with id in accessing a server's files. This
    mapping is the default if no other mapping is specified.
Transparent
This is a null mapping; remote user and group ids are used locally without
change.
These base mappings are augmented by two additional capabilities:
Exclude
    This capability excludes selected ids from the default mapping by mapping
    them to an otherwise unused id. This capability can be used together with
    the transparent mapping capability to handle a network where the
    /etc/passwd and /etc/group files were identical, but certain permissions
    (e.g. root) are to be disallowed from remote machines.
Map
    This capability provides arbitrary mapping between remote and local ids that
    have different names or different numeric values. It can be used with the
    transparent mapping to handle exceptions to "nearly" identical
    /etc/passwd files.
1.5. The Portmapper
Client programs need a way to find server programs; that is, they need a way to
look up and find the port numbers of server programs.3 Network transport
services do not provide such a service; they merely provide process-to-process
message transfer across a network. A message typically contains a transport
address which contains a network number, a host number, and a port number. (A
port is a logical communications channel in a host; by waiting on a port, a
process receives messages from the network.)
How a process waits on a port varies from one operating system to the next, but
all provide mechanisms that suspend a process until a message arrives at a port.
Thus, messages are not sent across networks to receiving processes, but rather to
the ports at which receiving processes wait for messages. The portmapper protocol defines a network service that provides a standard way for clients to look up
the port number of any remote program supported by a server.
The portmapper on every host is associated with port number 111. The
portmapper is one of the few network services that must have such a well-known
and dedicated port. Other network services can be assigned port numbers
statically or dynamically, so long as they register their ports with their host's
portmapper. For example, a server program based on Sun's RPC library typically
gets a port number at run time by calling an RPC library procedure. Note that a
given network service can be associated with port number 1256 on one server and
with port number 885 on another; on a given host, a service can be associated
with a different port every time its server program is started. Delegating
port-to-remote-program mapping to portmappers also automates port number
administration.
Port Registration
The portmapper is started automatically whenever a machine is booted. As
shown in the Typical Portmapping Sequence figure, below, both server programs
and client programs call portmapper procedures.4 To find a remote program's
port, a client sends an RPC call message to a server's portmapper; if the remote
program is registered with the portmapper, it returns the relevant port number
in an RPC reply message. The client program can then send RPC call messages to
the remote program's port.
NOTE
The portmapper provides an inherently stateful service because a portmap is a
set of associations between registrants and ports. Hence, all the RPC services
need to be reregistered if the portmap is restarted.
3 The naming of services by way of the port-number segment of their IP address is mandated by the Internet
protocols. Given this, clients face the problem of determining which ports are associated with the services they
wish to use.
4 Although client and server programs and client and server machines are usually distinct, they need not be.
A server program can also be a client program, as when an NFS server calls a portmapper server. Likewise,
when a client program directs a "remote" procedure call to its own machine, the machine acts as both client and
server.
Figure 1-3    Typical Portmapping Sequence

[Figure: a client machine and a server machine connected by a network; the
server machine runs a portmapper and a server program, each waiting on its own
port.]

Legend:
1. Server registers with portmapper
2. Client gets server's port from portmapper
3. Client calls server
Note that, because every instance of a remote program can be mapped to a
different port on every server, a client has no way to broadcast a remote
procedure call directly. However, the portmapper PMAPPROC_CALLIT procedure can
be used to broadcast a remote procedure call indirectly, since all portmappers
are associated with port number 111. One way for a client to find a server
running a remote program is to broadcast a call to PMAPPROC_CALLIT, asking it to
call procedure 0 of the desired remote program.
The Sun RPC library provides an interface to all portmapper procedures. Some
of the RPC library procedures also call portmappers automatically on behalf of
client and server programs.
1.6. The Network Information Service Database Service
This chapter explains Sun's network database mechanism, the Network
Information Service. NIS was previously known as "Yellow Pages", which is now a
trademark of British Telecom (refer to the trademark page at the front of this
manual). Although it is not intended exclusively for system administrators, it
leans towards their concerns. The Network Information Service permits password
information and host addresses for an entire network to be held in a single
database, and, by so doing, greatly eases system and network administration.
What Is The Network Information Service?
The Network Information Service constitutes a distributed network lookup
service:

o  NIS is a lookup service: it maintains a set of databases for querying.
   Programs can ask for the value associated with a particular key, or all the
   keys, in a database.

o  NIS is a network service: programs need not know the location of data, or
   how it is stored. Instead, they use a network protocol to communicate with a
   database server that knows those details.

o  The Network Information Service is distributed: databases are fully
   replicated on several machines, known as NIS servers. Servers propagate
   updated databases among themselves, ensuring consistency. At steady state,
   it doesn't matter which server answers a request; the answer is the same
   everywhere.
Network Information Service Maps

The Network Information Service serves information stored in NIS maps. Each
map contains a set of keys and associated values. For example, the hosts map
contains (as keys) all host names on a network, and (as values) the
corresponding Internet addresses. Each NIS map has a mapname, used by programs
to access data in the map. Programs must know the format of the data in the
map. Most maps are derived from ASCII files formerly found in /etc/passwd,
/etc/group, /etc/hosts, /etc/networks, and other files in /etc. The format of
data in the NIS map is in most cases identical to the format of the ASCII file.
Maps are implemented by dbm(3X) files located in subdirectories of /etc/yp on
NIS server machines.
The relationship between a NIS map and the standard UNIX /etc file to which it
relates varies from map to map. Some files (e.g. /etc/hosts) are replaced
by their corresponding NIS maps, while some (e.g. /etc/passwd) are merely
augmented.
Maps sometimes have nicknames. Although the ypcat command is a general
NIS database print program, it knows about the standard files in the NIS. Thus
ypcat hosts is translated into ypcat hosts.byaddr, since there is no
file called hosts in the NIS. The command ypcat -x furnishes a list of
expanded nicknames.
Network Information Service Domains

A NIS domain is a named set of NIS maps. Taken together, these maps define a
distinct network namespace and locate a distinct area of administrative control.
NIS domains differ from both Internet domains and sendmail domains, which
define similar kinds of administrative loci in their respective (IP and
electronic mail) networks. A given host will typically fall within all three
domains, but these domains will not typically coincide. A NIS domain is
implemented as a directory in /etc/yp containing a set of maps.
You can determine your NIS domain by executing the domainname command.
A domain name is required for retrieving data from a NIS database. For
instance, if your NIS domain is sys1 and you want to find the Internet address
of host dbserver, you must ask NIS for the value associated with the key
dbserver in the map hosts.byname within the NIS domain sys1. Each machine on
the network belongs to a default domain, which is set at boot time. Diskfull
machines have their default domains set by a call to the domainname command
made from /etc/rc.local. Diskless clients have it set as the result of a
consultation with the bootparams(5) server.
A NIS server holds all the maps of a NIS domain in a subdirectory of /etc/yp,
named after the domain. In the example above, maps for the sys1 domain
would be held in /etc/yp/sys1. A given host can contain maps for more
than one NIS domain.
Masters and Slaves
NIS servers containing copies of the same databases can be spread throughout a
network. When an arbitrary machine wants information in one of the NIS
databases, it makes an RPC call to one of the NIS servers to get it. For any
NIS map, one NIS server is designated as the master: the only one whose
database may be modified. The other NIS servers are slaves, and they are
automatically updated from time to time to keep their information in sync with
that of the master.

All changes to a NIS map must be made on the machine which is the master NIS
server for that map. The changes will then propagate to the slaves. A newly
built map is timestamped internally when it's created by makedbm. If you build
a NIS map on a slave server, you will temporarily break the NIS update
algorithm, and will have to get all versions in sync manually. Moral: after you
decide which server is the master, do all database updates and builds there, not
on slaves.
A given server may even be master with regard to one map, and slave with regard
to another. This can get confusing quickly. Thus, it's recommended that a
single server be master for all maps created by ypinit in a single domain.
Here we are assuming this simple case, in which one server is the master for
all maps in a database.
Naming
Imagine a company with two different networks, each of which has its own
separate list of hosts and passwords. Within each network, user names,
numerical user IDs, and host names are unique. However, there is duplication
between the two networks. If these two networks are ever connected, chaos could
result. The host name, returned by the hostname command and the
gethostname() system call, may no longer uniquely identify a machine. Thus a
new command and system call, domainname and getdomainname(), have been
added. In the example above, each of the two networks could be given a
different domain name. However, it is always simpler to use a single domain
whenever possible.

The relevance of domains to NIS is that data is stored in
/etc/yp/domainname. In particular, a machine can contain data for several
different domains.
Data Storage

The data in NIS maps is stored as dbm format databases (see dbm(3X)). Thus
the database hosts.byname for the domain sys1 is stored as
/etc/yp/sys1/hosts.byname.pag and /etc/yp/sys1/hosts.byname.dir. The command
makedbm takes an ASCII file such as /etc/hosts and converts it into a dbm file
suitable for use by the NIS. However, system administrators normally use the
makefile in /etc/yp to create new dbm files (read on for details). This
makefile in turn calls makedbm.
Servers

To become a server, a machine must contain the NIS databases, and must also be
running the NIS daemon ypserv. The ypinit command invokes this daemon
automatically. It also takes a flag saying whether you are creating a master or
a slave. When updating the master copy of a database, you can force the change
to be propagated to all the slaves with the yppush command. This pushes the
information out to all the slaves. Conversely, from a slave, the ypxfr command
gets the latest information from the master. The makefile in /etc/yp first
executes makedbm to make a new database, and then calls yppush to propagate
the change throughout the network.
Clients

Remember that a client machine does not access local copies of /etc files, but
rather makes an RPC call to a NIS server each time it needs information from a
NIS database. NIS clients on NIS servers also don't access local copies of /etc
files. The ypbind daemon remembers the name of a server. When a client
boots, ypbind broadcasts asking for the name of the NIS server. Similarly,
ypbind broadcasts asking for the name of a new NIS server if the old server
crashes. The ypwhich command gives the name of the server that ypbind
currently points at.

Since client machines don't have entire copies of files in the NIS, the commands
ypcat and ypmatch have been provided. As you might guess, ypcat
passwd is equivalent to cat /etc/passwd. To look for someone's password
entry, searching through the password file no longer suffices; you have to
issue one of the following commands:

example% ypcat passwd | grep username
example% ypmatch username passwd

where you replace username with the login name you're searching for.
Default NIS Files

By default, Sun workstations have a number of files from /etc in their NIS:
/etc/passwd, /etc/group, /etc/hosts, /etc/networks,
/etc/services, /etc/protocols, and /etc/ethers. In addition,
there is the netgroup(5) file, which defines network-wide groups used
for permission checking when doing remote mounts, remote logins, and remote
shells.
In SunOS 4.0, the library routines getpwent(), getgrent(), and
gethostent() were rewritten to take advantage of the NIS. Thus, C programs
that call these library routines may have to be relinked in order to function
correctly.
Hosts

The hosts file is stored as two different NIS maps. The first, hosts.byname,
is indexed by hostname. The second, hosts.byaddr, is indexed by Internet
address. Remember that this actually expands into four files, with suffixes
.pag and .dir. When a user program calls the library routine
gethostbyname(), a single RPC call to a server retrieves the entry from the
hosts.byname file. Similarly, gethostbyaddr() retrieves the entry from
the hosts.byaddr file. If the NIS is not running (which is caused by
commenting ypbind out of the /etc/rc file), then gethostbyname() will
read the /etc/hosts file, just as it always has.
Normally, the hosts file for the NIS will be the same as the /etc/hosts file on
the machine serving as a NIS master. In this case, the makefile in /etc/yp will
check to see if /etc/hosts is newer than the dbm file. If it is, it will use a
simple sed script to recreate hosts.byname and hosts.byaddr, run them
through makedbm, and then call yppush. See ypmake for details.
Passwd

The passwd file is similar to the hosts file. It exists as two separate files,
passwd.byname and passwd.byuid. The ypcat program prints it, and
ypmake updates it. However, if getpwent always went directly to the NIS as
does gethostent, then everyone would be forced to have an identical password
file. Consequently, getpwent reads the local /etc/passwd file, just as
it always did. But now it interprets "+" entries in the password file to mean,
interpolate entries from the NIS database. If you wrote a simple program using
getpwent to print out all the entries from your password file, it would print
out a virtual password file: rather than printing out + signs, it would print
out whatever entries the local password file included from the NIS database.
Others

Of the other files in /etc, /etc/group is treated like /etc/passwd, in that
getgrent() will only consult the NIS if explicitly told to do so by the
/etc/group file. The files /etc/networks, /etc/services,
/etc/protocols, /etc/ethers, and /etc/netgroup are treated like
/etc/hosts: for these files, the library routines go directly to the NIS,
without consulting the local files.
Changing your passwd

To change data in the NIS, the system administrator must log into the master
machine, and edit databases there; ypwhich -m tells where the master server
is. However, since changing a password is so commonly done, the yppasswd
command has been provided to change your NIS password. It has the same user
interface as the passwd command. This command will only work if the
yppasswdd server has been started up on the NIS master server machine.
1.7. The Network Lock Manager

SunOS includes an NFS-compatible Network Lock Manager (see the lockd(8C)
man page for more details) that supports the lockf()/fcntl(), System V
style of advisory file and record locking over the network. System V locks are
generally considered superior to 4.3BSD locks, implemented with the flock()
system call, for they provide record-level, and not merely file-level, locking.
Record-level locking is essential for database systems. Sun does support
flock() for use on individual machines, but flock() is not intended to be
used across the network. flock() locks exclude only other processes on the
same machine. There is no interaction between flock() and lockf().
Locking prevents multiple processes from modifying the same file at the same
time, and allows cooperating processes to synchronize access to shared files.
The user interfaces with Sun's network locking service by way of the standard
lockf() system-call interface, and rarely requires any detailed knowledge of
how it works. The kernel maps user calls to flock() and fcntl() into
RPC-based messages to the local lock manager (or, if the files in question are
on RFS-mounted filesystems, into calls to RFS). The fact that the file system
may be spread across multiple machines is really not a complication, until a
crash occurs.
All computers crash from time to time, and in an NFS environment, where
multiple machines can have access to the same file at the same time, the
process of recovering from a crash is necessarily more complex than in a
non-network environment. Furthermore, locking is inherently stateful. If a
server crashes, clients with locked files must be able to recover their locks.
If a client crashes, its servers must have the sense to hold the client's locks
while it recovers. And, to preserve NFS's overall transparency, the recovery of
lost locks must not require the intervention of the applications themselves.
This is accomplished as follows:
o  Basic file access operations, such as read and write, use a stateless
   protocol (the NFS protocol). All interactions between NFS servers and
   clients are atomic: the server doesn't remember anything about its clients
   from one interaction to the next. In the case of a server crash, client
   applications will simply sleep until it comes back up and their NFS
   operations can complete.
o  Stateful services (those that require the server to maintain client
   information from one transaction to the next), such as the locking service,
   are not part of NFS per se. They are separate services that use the status
   monitor (see The Network Status Monitor) to ensure that their implicit
   network state information remains consistent with the real state of the
   network. There are two specific state-related problems involved in providing
   locking in a network context:

   1) if the client has crashed, the lock can be held forever by the server;
   2) if the server has crashed, it loses its state (including all its lock
      information) when it recovers.
   The Network Lock Manager solves both of these problems by cooperating
   with the Network Status Monitor to ensure that it's notified of relevant
   machine crashes. Its own protocol then allows it to recover the lock
   information it needs when crashed machines recover.
The lock manager and the status monitor are both network-service daemons:
they run at user level, but they are essential to the kernel's ability to
provide fundamental network services, and they are therefore run on all network
machines. They are best seen as extensions to the kernel which, for reasons of
space, efficiency, and organization, are implemented as daemons. Most
application programs will request the network service through a system call to
the kernel (like lockf()), though it is possible to interact with the service
directly with RPC.
Revision A. of 27 March 1990
With lockf(), the kernel uses RPC to call the daemon. The network daemons communicate among themselves with RPC (see The Locking Protocol for some details of the lock manager protocol). It should be noted that the daemon-based approach to network services allows for tailoring by users who need customized services.
The following figure depicts the overall architecture of the locking service.
Figure 1-4  Architecture of the NFS Locking Service

[Figure: two machines, A and B. On each machine, local applications issue lock requests to a local lock manager, and each lock manager consults a local status monitor. The lock managers on the two machines exchange RPC lock requests with each other, while the status monitors exchange status messages.]
At each server site, a lock manager process accepts lock requests, made on behalf
of client processes by a remote lock manager, or on behalf of local processes by
the kernel. The client and server lock managers communicate with RPC calls.
Upon receiving a lock request for a machine on which it doesn't already hold a lock, the lock manager registers its interest in that machine with the local status monitor, and waits for that monitor to notify it that the machine is up. The monitor continues to watch the status of registered machines, and notifies the lock manager if one of them is rebooted (after a crash). If the lock request is for a local file, the lock manager tries to satisfy it, and communicates back to the application along the appropriate RPC path.
The crash recovery procedure is very simple. If the failure of a client is detected,
the server releases the failed client's locks, on the assumption that the client
application will request locks again as needed. If the recovery (and, by implication, the crash) of a server is detected, the client lock manager retransmits all
lock requests previously granted by the recovered server. This retransmitted
information is used by the server to reconstruct its locking state. See below for
more details.
The locking service recovers from failure in a stateless manner. Its state information is carefully circumscribed within a pair of system daemons that are set up for automatic, application-transparent crash recovery. If a server crashes, and thus loses its state, it expects that its clients will be notified of the crash and send it the information that it needs to reconstruct its state. The key in this approach is the status monitor, which the lock manager uses to detect both client and server failures.
The Locking Protocol
The lock style implemented by the network lock manager is that specified in the AT&T System V Interface Definition (see the lockf(2) and fcntl(2) man pages for details). There is no interaction between the lock manager's locks and flock()-style locks, which remain supported, but which should be used for non-network applications only.

Locks are presently advisory only, on the (well supported) assumption that cooperating processes can do whatever they wish without mandatory locks. (See the fcntl(2) man page for more information about advisory locks.)
There are four basic Lock Manager requests that are made by the kernel in response to various ioctl()/fcntl() calls:

KLM_LOCK
    Lock the specified record.

KLM_UNLOCK
    Unlock the specified record.

KLM_TEST
    Test if the specified record is locked.

KLM_CANCEL
    Cancel an outstanding lock request.
Despite the fact that the network lock manager adheres to the lockf()/fcntl() semantics, there are a few subtle points about its behavior that deserve mention. These arise directly from the nature of the network:

o  The first and most important of these has to do with crashes. When an NFS client goes down, the lock managers on all of its servers are notified by their status monitors, and they simply release its locks, on the assumption that it will request them again when it wants them. When a server crashes, however, matters are different: the clients will wait for it to come back up, and when it does, its lock manager will give the client lock managers a grace period to submit lock reclaim requests, and during this period will accept only reclaim requests. The client status monitors will notify their respective lock managers when the server recovers. The default grace period is 45 seconds.
o  It is possible that, after a server crash, a client will not be able to recover a lock that it had on a file on that server. This can happen for the simple reason that another process may have beaten the recovering application process to the lock. In this case the SIGLOST signal will be sent to the process (the default action for this signal is to kill the application).
o  The local lock manager does not reply to the kernel lock request until the server lock manager has gotten back to it. Further, if the lock request is on a server new to the local lock manager, the lock manager registers its interest in that server with the local status monitor and waits for its reply. Thus, if either the status monitor or the server's lock manager is unavailable, the reply to a lock request for remote data is delayed until it becomes available.

1.5. The Network Status Monitor
The Network Status Monitor (see the statd(8C) man page for more details) was introduced with the lock manager, which relies heavily on it to maintain the inherently stateful locking service within the stateless NFS environment. However, the status monitor is very general, and can also be used to support other kinds of stateful network services and applications. Normally, crash recovery is one of the most difficult aspects of network application development, and requires a major design and installation effort. The status monitor makes it more or less routine.

It is anticipated that, in the future, new network services, some of them stateful, will be introduced into the Sun system. These services will use the status monitor to keep up with the state of the network and to cope with machine crashes.

The status monitor works by providing a general framework for collecting network status information. Implemented as a daemon that runs on all network machines, it implements a simple protocol which allows applications to easily monitor the status of other machines. Its use improves overall robustness, and avoids situations in which applications running on different machines (or even on the same machine) come to disagree about the status of a site - a potentially dangerous situation that can lead to inconsistencies in many applications.

Applications using the status monitor do so by registering with it the machines that they are interested in. The monitor then tracks the status of those machines, and when one of them crashes[5] it notifies the interested applications to that effect, and they then take whatever actions are necessary to reestablish a consistent state.
There are several major advantages to this approach:

o  Only applications that use stateful services must pay the overhead - in time and in code - of dealing with the status monitor.

o  The implementation of stateful network applications is eased, since the status monitor shields application developers from the complexity of the network.
[5] Actually, when one of them recovers from a crash.
PART ONE: Network Programming - Continued
2
Introduction to Remote Procedure Calls
2.1. Overview

What are Remote Procedure Calls? Simply put, they are a high-level communications paradigm which allows network applications to be developed by way of specialized kinds of procedure calls designed to hide the details of the underlying networking mechanisms.

RPC implements a logical client-to-server communications system designed specifically for the support of network applications. With RPC, the client makes a procedure call which sends requests to the server as necessary. When these requests arrive, the server calls a dispatch routine, performs whatever service is requested, sends back the reply, and the procedure call returns to the client.
How it is useful

The net effect of programming with RPC is that programs are designed to run within a client/server network model. Such programs use RPC mechanisms to avoid the details of interfacing to the network, and provide network services to their callers without even requiring that they be aware of the existence and function of the underlying network.

This mechanism solves the tedious issues of programming by making the calls transparent. For example, a program can simply make a call to rnusers(), a C routine which returns the number of users on a remote machine. The caller is not explicitly aware of using RPC - they simply call a procedure, much like making a system call to malloc().

Even though this discussion only mentions the interface to C, Remote Procedure Calls can be made from any language. Additionally, even though this discussion refers to RPC only as it is used to communicate between processes on different machines, it also works for communication between different processes on the same machine.
Terminology
This chapter discusses servers, services, programs, procedures, clients, and versions. A server provides network services and a network service is a collection
of one or more remote programs. A remote program implements one or more
remote procedures; the procedures, their parameters, and results are documented
in the specific program's protocol specification. Network clients initiate remote
procedure calls to services. A server may support more than one version of a
remote program in order to be forward compatible with changing protocols.
For example, a network file service may be composed of two programs. One program may deal with high-level applications such as file system access control and locking. The other may deal with low-level file I/O and have procedures like
"read" and "write". A client machine of the network file service would call the
procedures associated with the two programs of the service on behalf of some
user on the client machine.
The RPC Model

The remote procedure call model is similar to the local procedure call model. In the local case, the caller places arguments to a procedure in some well-specified location (such as a result register). It then transfers control to the procedure, and eventually gains back control. At that point, the results of the procedure are extracted from the well-specified location, and the caller continues execution.

The remote procedure call is similar, in that one thread of control logically winds through two processes: one is the caller's process, the other is a server's process. That is, the caller process sends a call message to the server process and waits (blocks) for a reply message. The call message contains the procedure's parameters, among other things. The reply message contains the procedure's results, among other things. Once the reply message is received, the results of the procedure are extracted, and the caller's execution is resumed.

On the server side, a process is dormant awaiting the arrival of a call message. When one arrives, the server process extracts the procedure's parameters, computes the results, sends a reply message, and then awaits the next call message. Please refer to Figure 2-1.
Note that in this model, only one of the two processes is active at any given time. The RPC protocol makes no restrictions on the concurrency model implemented, however, and others are possible. For example, an implementation may choose to have RPC calls be asynchronous, so that the client may do useful work while waiting for the reply from the server. Another possibility is to have the server create a task to process an incoming request, so that the server can be free to receive other requests. For a more detailed discussion of the RPC protocol, see Chapter 7, Remote Procedure Calls: Protocol Specification.
Figure 2-1  Network Communication with the Remote Procedure Call

[Figure: a timeline spanning two machines. On Machine A, the client program calls the service and blocks; the RPC call travels to Machine B, where the service daemon invokes the service. The service executes, the request completes, and the answer returns to Machine A, where the reply is returned and the client program continues.]
In the above diagram, the details of the network transport are hidden within the Remote Procedure Call. Note, however, that RPC would not be very useful if those details were entirely unavailable to users and programmers who require access to them.
2.2. Versions and Numbers

Each RPC procedure is uniquely defined by a program number and procedure number. The program number specifies a group of related remote procedures, each of which has a different procedure number. Each program also has a version number, so when a minor change is made to a remote service (adding a new procedure, for example), a new program number doesn't have to be assigned.
For example, when you want to call a procedure to find the number of remote users, you look up the appropriate program, version and procedure numbers in a manual, just as you look up the name of a memory allocator when you want to allocate memory.
2.3. Portmap

The portmap is the only network service that must have such a well-known (dedicated) port. Other network services can be assigned port numbers statically or dynamically, so long as they register their ports with their host's portmap.

The portmap is started automatically whenever a machine is booted. As part of its initialization, a server program calls its host's portmap to create a portmap entry for its program and version number. To find a remote program's port, a client sends an RPC call message to a server's portmap; if the remote program is registered with the portmap, the portmap returns the relevant port number in an RPC reply message. The client program can then send RPC call messages to the remote program's port.
2.4. Transports and Semantics

The RPC protocol is independent of transport protocols. That is, RPC does not care how a message is passed from one process to another. The protocol deals only with the specification and interpretation of messages.

It is important to point out that RPC does not try to implement any kind of reliability, and that the application must be aware of the type of transport protocol underneath RPC. If it knows it is running on top of a reliable transport such as TCP/IP [6], then most of the work is already done for it. On the other hand, if it is running on top of an unreliable transport such as UDP/IP [7], it must implement its own retransmission and time-out policy, as the RPC layer does not provide this service.
Because of transport independence, the RPC protocol does not attach specific
semantics to the remote procedures or their execution. Semantics can be inferred
from (but should be explicitly specified by) the underlying transport protocol.
For example, consider RPC running on top of an unreliable transport such as
UDP/IP. If an application retransmits RPC messages after short time-outs, the
only thing it can infer if it receives no reply is that the procedure was executed
zero or more times. If it does receive a reply, then it can infer that the procedure
was executed at least once.
On the other hand, if using a reliable transport such as TCP/IP, the application can infer from a reply message that the procedure was executed exactly once, but if it receives no reply message, it cannot assume the remote procedure was not executed. Note that even if a connection-oriented protocol like TCP is used, an application still needs time-outs and reconnection to handle server crashes.
Transport Selection

Sun RPC is currently supported on both UDP/IP and TCP/IP transports. The selection of the transport depends upon the requirements of the application. UDP (connectionless) may be the transport of choice if the application has all of the following characteristics:

1. The procedures are idempotent, i.e. the same procedure can be executed more than once without any harmful side effects. For example, reading a block of data is idempotent, while creating a file is a non-idempotent operation.

2. The size of both the arguments and results is smaller than the UDP packet size (8 Kbytes for the Sun UDP implementation).

3. The server is required to handle many (hundreds of) clients. Since the UDP server does not keep any state about the client, it can potentially handle many clients. On the other hand, a TCP server keeps state for each open client connection, and hence the number of clients is limited by the machine's resources.
TCP (connection oriented) may be the transport of choice if the application has any of the following requirements and characteristics:

1. The application needs to maintain a high degree of reliability.

2. The procedures are non-idempotent and at-most-once semantics are required.

3. The size of either the arguments or the results exceeds 8 Kbytes.

2.5. External Data Representation
RPC presumes the existence of the eXternal Data Representation (XDR), a standard for the machine-independent description and encoding of data. XDR is useful for transferring data between different computer architectures, and has been used to communicate data between such diverse machines as the Sun Workstation, VAX, IBM-PC, and Cray.

RPC can handle arbitrary data structures, regardless of different machines' byte orders or structure layout conventions, by always converting them to a network standard called External Data Representation (XDR) before sending them over the wire. The process of converting from a particular machine representation to XDR format is called serializing, and the reverse process is called deserializing. For a detailed discussion of XDR, see Chapter 6, External Data Representation Standard: Protocol Specification.
2.6. rpcinfo

rpcinfo is a command that reports current RPC registration information known to portmap (and can be used by administrators to delete registrations). rpcinfo can be used to find all the RPC services registered on a specified host and to report their port numbers and the transports for which they are registered. It can also be used to call (ping) a specific version of a specific program on a specific host using TCP or UDP transport, and to report whether the response was received. For details, see the rpcinfo(8C) manual pages.

2.7. Assigning Program Numbers

Program numbers are assigned in groups of 0x20000000 according to the following chart:
0x0        - 0x1fffffff    Defined by Sun
0x20000000 - 0x3fffffff    Defined by user
0x40000000 - 0x5fffffff    Transient
0x60000000 - 0x7fffffff    Reserved
0x80000000 - 0x9fffffff    Reserved
0xa0000000 - 0xbfffffff    Reserved
0xc0000000 - 0xdfffffff    Reserved
0xe0000000 - 0xffffffff    Reserved
Sun Microsystems administers the first group of numbers, which should be identical for all Sun customers. If a customer develops an application that might be of general interest, that application should be given an assigned number in the first range. The second group of numbers is reserved for specific customer applications. This range is intended primarily for debugging new programs. The third group is reserved for applications that generate program numbers dynamically. The final groups are reserved for future use, and should not be used.

To register a protocol specification, send a request by network mail to rpc@sun.com, or write to:

RPC Administrator
Sun Microsystems
2550 Garcia Ave.
Mountain View, CA 94043

Please include a compilable rpcgen ".x" file describing your protocol. You will be given a unique program number in return.
Some of the RPC program numbers can be found in /etc/rpc. Protocol specifications of standard Sun RPC services can be found in the include files in /usr/include/rpcsvc. These services, however, constitute only a small subset of those which have been registered. A list of some of the registered programs is:
Table 2-1  Registered RPC Program Numbers

RPC Number   Program          Description
100000       PMAPPROG         portmap
100001       RSTATPROG        remote stats
100002       RUSERSPROG       remote users
100003       NFSPROG          nfs
100004       YPPROG           NIS
100005       MOUNTPROG        mount daemon
100006       DBXPROG          remote dbx
100007       YPBINDPROG       NIS binder
100008       WALLPROG         shutdown msg
100009       YPPASSWDPROG     yppasswd server
100010       ETHERSTATPROG    ether stats
100011       RQUOTAPROG       disk quotas
100012       SPRAYPROG        spray packets
100013       IBM3270PROG      3270 mapper
100014       IBMRJEPROG       RJE mapper
100015       SELNSVCPROG      selection service
100016       RDATABASEPROG    remote database access
100017       REXECPROG        remote execution
100018       ALICEPROG        Alice Office Automation
100019       SCHEDPROG        scheduling service
100020       LOCKPROG         local lock manager
100021       NETLOCKPROG      network lock manager
100022       X25PROG          x.25 inr protocol
Table 2-1  Registered RPC Program Numbers - Continued
RPC Number   Program               Description
100023       STATMON1PROG          status monitor 1
100024       STATMON2PROG          status monitor 2
100025       SELNLIBPROG           selection library
100026       BOOTPARAMPROG         boot parameters service
100027       MAZEPROG              mazewars game
100028       YPUPDATEPROG          NIS update
100029       KEYSERVEPROG          key server
100030       SECURECMDPROG         secure login
100031       NETFWDIPROG           nfs net forwarder init
100032       NETFWDTPROG           nfs net forwarder trans
100033       SUNLINKMAP_PROG       sunlink MAP
100034       NETMONPROG            network monitor
100035       DBASEPROG             lightweight database
100036       PWDAUTHPROG           password authorization
100037       TFSPROG               translucent file svc
100038       NSEPROG               nse server
100039       NSE_ACTIVATE_PROG     nse activate daemon
100043       SHOWFHPROG            showfh

150001       PCNFSDPROG            pc passwd authorization

200000       PYRAMIDLOCKINGPROG    Pyramid-locking
200001       PYRAMIDSYS5           Pyramid-sys5
200002       CADDS_IMAGE           CV cadds_image

300001       ADT_RFLOCKPROG        ADT file locking
3
rpcgen Programming Guide
3.1. The rpcgen Protocol Compiler

The details of programming applications to use Remote Procedure Calls can be tedious. One of the more difficult areas is writing XDR routines to convert procedure arguments and results into their network format and vice-versa.

Fortunately, rpcgen(1) exists to help programmers write RPC applications simply and directly. rpcgen does most of the dirty work, allowing programmers to debug the main features of their application, instead of requiring them to spend most of their time on their network interface code.

rpcgen is a compiler. It accepts a remote program interface definition written in a language, called RPC Language, which is similar to C. It produces C language output for RPC programs. This output includes skeleton versions of the client routines, a server skeleton, XDR filter routines for both parameters and results, a header file that contains common definitions and, optionally, dispatch tables which the server can use to check authorizations and then invoke service routines. The client skeletons interface with the RPC library and effectively hide the network from their callers. The server skeleton similarly hides the network from the server procedures that are to be invoked by remote clients.
rpcgen's output files can be compiled and linked in the usual way. The server code generated by rpcgen has support for inetd, i.e. the server can be started via inetd or at the command line.

The developer writes server procedures - in any language that observes system calling conventions - and links them with the server skeleton produced by rpcgen to get an executable server program. To use a remote program, a programmer writes an ordinary main program that makes local procedure calls to the client skeletons. Linking this program with rpcgen's skeletons creates an executable program. rpcgen options can be used to suppress skeleton generation and to specify the transport to be used by the server skeleton.

Like all compilers, rpcgen reduces development time that would otherwise be spent coding and debugging low-level routines. All compilers, including rpcgen, do this at a small cost in efficiency and flexibility. However, many compilers allow escape hatches for programmers to mix low-level code with high-level code. rpcgen is no exception. In speed-critical applications, hand-written routines can be linked with the rpcgen output without any difficulty. Also, one may proceed by using rpcgen output as a starting point, and then rewriting it as necessary. (For a discussion of RPC programming without rpcgen, see the next chapter, the Remote Procedure Call Programming Guide.)
Converting Local Procedures into Remote Procedures

Assume an application that runs on a single machine, one which we want to convert to run over the network. Here we will demonstrate such a conversion by way of a simple example - a program that prints a message to the console:
/*
 * printmsg.c: print a message on the console
 */
#include <stdio.h>

main(argc, argv)
	int argc;
	char *argv[];
{
	char *message;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <message>\n", argv[0]);
		exit(1);
	}
	message = argv[1];

	if (!printmessage(message)) {
		fprintf(stderr, "%s: couldn't print your message\n",
			argv[0]);
		exit(1);
	}
	printf("Message Delivered!\n");
	exit(0);
}

/*
 * Print a message to the console.
 * Return a boolean indicating whether the message was actually printed.
 */
printmessage(msg)
	char *msg;
{
	FILE *f;

	f = fopen("/dev/console", "w");
	if (f == NULL) {
		return (0);
	}
	fprintf(f, "%s\n", msg);
	fclose(f);
	return (1);
}
And then, of course:
example% cc printmsg.c -o printmsg
example% printmsg "Hello, there."
Message delivered!
example%
If printmessage() was turned into a remote procedure, then it could be called from anywhere in the network. Ideally, one would just like to stick a keyword like remote in front of a procedure to turn it into a remote procedure. Unfortunately, we have to live within the constraints of the C language, since it existed long before RPC did. But even without language support, it's not very difficult to make a procedure remote.

In general, it's necessary to figure out what the types are for all procedure inputs and outputs. In this case, we have a procedure printmessage() which takes a string as input and returns an integer as output. Knowing this, we can write a protocol specification in RPC language that describes the remote version of printmessage(). Here it is:
/*
 * msg.x: Remote message printing protocol
 */

program MESSAGEPROG {
	version MESSAGEVERS {
		int PRINTMESSAGE(string) = 1;
	} = 1;
} = 0x20000099;
Remote procedures are part of remote programs, so we actually declared an entire remote program here which contains the single procedure PRINTMESSAGE. By convention, all RPC services provide for procedure 0; it is normally used for pinging purposes. The above procedure was declared to be in version 1 of the remote program. No null procedure (procedure 0) is necessary in the protocol definition because rpcgen generates it automatically and the user is not concerned with it.

Notice that everything is declared with all capital letters. This is not required, but is a good convention to follow.
Notice also that the argument type is "string" and not "char *". This is because a "char *" in C is ambiguous. Programmers usually intend it to mean a null-terminated string of characters, but it could also represent a pointer to a single character or a pointer to an array of characters. In RPC language, a null-terminated string is unambiguously called a "string".

There are just two more things to write. First, there is the remote procedure itself. Here's the definition of a remote procedure to implement the PRINTMESSAGE procedure we declared above.
/*
 * msg_proc.c: implementation of the remote procedure "printmessage"
 */
#include <stdio.h>
#include <rpc/rpc.h>	/* always needed */
#include "msg.h"	/* msg.h will be generated by rpcgen */

/*
 * Remote version of "printmessage"
 */
int *
printmessage_1(msg)
	char **msg;
{
	static int result;	/* must be static! */
	FILE *f;

	f = fopen("/dev/console", "w");
	if (f == NULL) {
		result = 0;
		return (&result);
	}
	fprintf(f, "%s\n", *msg);
	fclose(f);
	result = 1;
	return (&result);
}
Notice here that the declaration of the remote procedure printmessage_1() differs from that of the local procedure printmessage() in three ways:

1. It takes a pointer to a string instead of a string itself. This is true of all remote procedures: they always take pointers to their arguments rather than the arguments themselves. If there are no arguments, specify void.

2. It returns a pointer to an integer instead of an integer itself. This is also characteristic of remote procedures - they return pointers to their results. Therefore it is important to have the result declared as static. If there are no results, specify void.

3. It has an "_1" appended to its name. In general, all remote procedures called by rpcgen are named by the following rule: the name in the procedure definition (here PRINTMESSAGE) is converted to all lower-case letters, an underbar ("_") is appended to it, and finally the version number (here 1) is appended.
The last thing to do is declare the main client program that will call the remote procedure. Here it is:
/*
 * rprintmsg.c: remote version of "printmsg.c"
 */
#include <stdio.h>
#include <rpc/rpc.h>	/* always needed */
#include "msg.h"	/* msg.h will be generated by rpcgen */

main(argc, argv)
	int argc;
	char *argv[];
{
	CLIENT *cl;
	int *result;
	char *server;
	char *message;

	if (argc != 3) {
		fprintf(stderr,
			"usage: %s host message\n", argv[0]);
		exit(1);
	}
	server = argv[1];
	message = argv[2];

	/*
	 * Create client "handle" used for calling MESSAGEPROG on the
	 * server designated on the command line. We tell the RPC package
	 * to use the "tcp" protocol when contacting the server.
	 */
	cl = clnt_create(server, MESSAGEPROG, MESSAGEVERS, "tcp");
	if (cl == NULL) {
		/*
		 * Couldn't establish connection with server.
		 * Print error message and die.
		 */
		clnt_pcreateerror(server);
		exit(1);
	}

	/*
	 * Call the remote procedure "printmessage" on the server
	 */
	result = printmessage_1(&message, cl);
	if (result == NULL) {
		/*
		 * An error occurred while calling the server.
		 * Print error message and die.
		 */
		clnt_perror(cl, server);
		exit(1);
	}

	/*
	 * Okay, we successfully called the remote procedure.
	 */
	if (*result == 0) {
		/*
		 * Server was unable to print our message.
		 * Print error message and die.
		 */
		fprintf(stderr, "%s: %s couldn't print your message\n",
			argv[0], server);
		exit(1);
	}

	/*
	 * The message got printed on the server's console
	 */
	printf("Message delivered to %s!\n", server);
	exit(0);
}
There are a few points worth noting here:

1. First, a client "handle" is created using the RPC library routine clnt_create(). This client handle will be passed to the skeleton routines which call the remote procedure.

2. The last parameter to clnt_create() is "tcp", the transport on which you want your application to run. It could also have been "udp", as an alternate transport. For more information on transport selection see the section Transport Selection in Chapter 2, Introduction to Remote Procedure Calls.

3. The remote procedure printmessage_1() is called exactly the same way as it is declared in msg_proc.c except for the inserted client handle as the second argument.

4. The remote procedure call can fail in two ways. The RPC mechanism itself can fail or, alternatively, there can be an error in the execution of the actual remote procedure. In the former case, the remote procedure (in this case printmessage_1()) returns with a NULL. In the latter case, however, the details of error reporting are application dependent. Here, the error is being reported via *result.
Here's how to put all of the pieces together:

example% rpcgen msg.x
example% cc rprintmsg.c msg_clnt.c -o rprintmsg
example% cc msg_proc.c msg_svc.c -o msg_server
Two programs were compiled here: the client program rprintmsg and the server program msg_server. Before doing this though, rpcgen was used to fill in the missing pieces.
Here is what rpcgen (called without any flags) did with the input file msg.x:

1. It created a header file called msg.h that contained #define's for MESSAGEPROG, MESSAGEVERS and PRINTMESSAGE for use in the other modules. This file should be included by both the client and the server modules.

2. It created the client "skeleton" routines in the msg_clnt.c file. In this case there is only one, the printmessage_1() that was referred to from the printmsg client program. If the name of the input file is FOO.x, the client skeletons output file is called FOO_clnt.c.

3. It created the server program in msg_svc.c which calls printmessage_1() from msg_proc.c. The rule for naming the server output file is similar to the previous one: for an input file called FOO.x, the output server file is named FOO_svc.c.

(Note that, given the -T argument, rpcgen creates an additional output file which contains index information used for the dispatching of service routines.)
Now we're ready to have some fun. First, copy the server to a remote machine and run it. For this example, the machine is called "moon".

moon% msg_server &

Note that servers generated by rpcgen can be invoked with port monitors like inetd as well as from the command line, if they are invoked with the -I option.

Then on our local machine ("sun") we can print a message on "moon"s console.

sun% rprintmsg moon "Hello, moon."

The message will get printed on "moon"s console. You can print a message on anybody's console (including your own) with this program if you can copy the server to their machine and run it.
An Advanced Example

The previous example only demonstrated the automatic generation of client and server RPC code. rpcgen may also be used to generate XDR routines, that is, the routines necessary to convert local data structures into network format and vice-versa. This next example is more advanced in that it presents a complete RPC service: a remote directory listing service, which uses rpcgen not only to generate skeleton routines, but also to generate the XDR routines. Here is the protocol description file:
/*
 * dir.x: Remote directory listing protocol
 */
const MAXNAMELEN = 255;		/* maximum length of a directory entry */

typedef string nametype<MAXNAMELEN>;	/* a directory entry */

typedef struct namenode *namelist;	/* a link in the listing */

/*
 * A node in the directory listing
 */
struct namenode {
	nametype name;		/* name of directory entry */
	namelist next;		/* next entry */
};
/*
 * The result of a READDIR operation.
 */
union readdir_res switch (int errno) {
case 0:
	namelist list;		/* no error: return directory listing */
default:
	void;			/* error occurred: nothing else to return */
};

/*
 * The directory program definition
 */
program DIRPROG {
	version DIRVERS {
		readdir_res
		READDIR(nametype) = 1;
	} = 1;
} = 0x20000076;
NOTE: Types (like readdir_res in the example above) can be defined using the "struct", "union" and "enum" keywords, but those keywords should not be used in subsequent declarations of variables of those types. For example, if you define a union "foo", you should declare using only "foo" and not "union foo". In fact, rpcgen compiles RPC unions into C structures and it is an error to declare them using the "union" keyword.
Running rpcgen on dir.x creates four output files. First are the basic three itemized above: those containing the header file, client skeleton routines and server skeleton. The fourth contains the XDR routines necessary for converting the data types we declared into XDR format and vice-versa. These are output in the file dir_xdr.c. For each data type used in the .x file, rpcgen assumes that the RPC/XDR library has a routine defined with the name of that data type prepended by xdr_ (e.g. xdr_int). If the data type was defined in the .x file, then rpcgen will generate the required xdr routine. If there are no such data types, then the file (e.g. dir_xdr.c) will not be generated. If the data types were used but not defined, then the user has to provide that xdr routine. This is a way for users to provide their own customized xdr routines.
Here is the implementation of the READDIR procedure.

/*
 * dir_proc.c: remote readdir implementation
 */
#include <rpc/rpc.h>		/* Always needed */
#include <sys/dir.h>
#include "dir.h"		/* Created by rpcgen */

extern int errno;
extern char *malloc();
extern char *strdup();
readdir_res *
readdir_1(dirname)
	nametype *dirname;
{
	DIR *dirp;
	struct direct *d;
	namelist nl;
	namelist *nlp;
	static readdir_res res;		/* must be static! */

	/*
	 * Open directory
	 */
	dirp = opendir(*dirname);
	if (dirp == NULL) {
		res.errno = errno;
		return (&res);
	}

	/*
	 * Free previous result
	 */
	xdr_free(xdr_readdir_res, &res);

	/*
	 * Collect directory entries.
	 * Memory allocated here will be freed by xdr_free
	 * next time readdir_1 is called
	 */
	nlp = &res.readdir_res_u.list;
	while (d = readdir(dirp)) {
		nl = *nlp = (namenode *) malloc(sizeof(namenode));
		nl->name = strdup(d->d_name);
		nlp = &nl->next;
	}
	*nlp = NULL;

	/*
	 * Return the result
	 */
	res.errno = 0;
	closedir(dirp);
	return (&res);
}
Finally, there is the client side program to call the server:

/*
 * rls.c: Remote directory listing client
 */
#include <stdio.h>
#include <rpc/rpc.h>		/* always need this */
#include "dir.h"		/* will be generated by rpcgen */
extern int errno;

main(argc, argv)
	int argc;
	char *argv[];
{
	CLIENT *cl;
	char *server;
	char *dir;
	readdir_res *result;
	namelist nl;

	if (argc != 3) {
		fprintf(stderr, "usage: %s host directory\n", argv[0]);
		exit(1);
	}
	server = argv[1];
	dir = argv[2];

	/*
	 * Create client "handle" used for calling DIRPROG on the
	 * server designated on the command line.  Use the tcp protocol
	 * when contacting the server.
	 */
	cl = clnt_create(server, DIRPROG, DIRVERS, "tcp");
	if (cl == NULL) {
		/*
		 * Couldn't establish connection with server.
		 * Print error message and die.
		 */
		clnt_pcreateerror(server);
		exit(1);
	}

	/*
	 * Call the remote procedure readdir on the server
	 */
	result = readdir_1(&dir, cl);
	if (result == NULL) {
		/*
		 * An RPC error occurred while calling the server.
		 * Print error message and die.
		 */
		clnt_perror(cl, server);
		exit(1);
	}

	/*
	 * Okay, we successfully called the remote procedure.
	 */
	if (result->errno != 0) {
		/*
		 * A remote system error occurred.
		 * Print error message and die.
		 */
		errno = result->errno;
		perror(dir);
		exit(1);
	}

	/*
	 * Successfully got a directory listing.  Print it out.
	 */
	for (nl = result->readdir_res_u.list; nl != NULL; nl = nl->next) {
		printf("%s\n", nl->name);
	}
	exit(0);
}
Compile everything, and run.

sun% rpcgen dir.x
sun% cc -c dir_xdr.c
sun% cc rls.c dir_clnt.c dir_xdr.o -o rls
sun% cc dir_svc.c dir_proc.c dir_xdr.o -o dir_svc

sun% dir_svc &

moon% rls sun /usr/pub
ascii
eqnchar
greek
kbd
marg8
tabclr
tabs
tabs4
moon%
rpcgen generated client code does not release the memory allocated for the results of the RPC call. Users can call xdr_free() to free up the memory once they are done with it. It is quite similar to calling free() except that here one also has to pass the xdr routine for the result. In this example, after printing the list, the user could have called:

xdr_free(xdr_readdir_res, result);
Debugging Applications

It is often difficult to debug distributed applications like these because the client and the server are two different processes. To simplify the testing and debugging process, the client program and the server procedure can be tested together as a single program by simply linking them with each other rather than with the client and server skeletons. This could be done in the previous example by doing:

cc rls.c dir_clnt.c dir_proc.c dir_xdr.c -o rls

The procedure calls will be executed as ordinary local procedure calls and the program can be debugged with a local debugger such as dbxtool. When the program is working, the client program can be linked to the client skeleton produced by rpcgen and the server procedures can be linked to the server skeleton produced by rpcgen.

NOTE: If you do this, you will have to comment out calls to client create RPC library routines (e.g. clnt_create()).
There are two kinds of errors which can happen in an RPC call. The first kind of error is caused if there is some problem with the actual mechanism of the remote procedure calls. This could happen in such cases as the procedure is not available, the remote server is not responding, the remote server is unable to decode the arguments, and so on. In the previous example, an RPC error has occurred if result is NULL. The reason for the failure can be printed by using clnt_perror(), or an error string can be returned through clnt_sperror().

The second type of error is due to the server itself. In the previous example, an error was reported if opendir() fails. Now you can see why readdir_res is of type union. The handling of these types of errors is the responsibility of the programmer.
The C-Preprocessor

The C-preprocessor, cpp, is run on all input files before they are compiled, so all the preprocessor directives are legal within a ".x" file. Five macro identifiers may have been defined, depending upon which output file is getting generated. They are:

Identifier	Usage
RPC_HDR		For header-file output
RPC_XDR		For XDR routine output
RPC_SVC		For server-skeleton output
RPC_CLNT	For client-skeleton output
RPC_TBL		For index table output

Also, rpcgen does some additional preprocessing of the input file. Any line that begins with a percent sign is passed directly into the output file, without any interpretation of the line. Here is a simple example that demonstrates this processing feature.
/*
 * time.x: Remote time protocol
 */
program TIMEPROG {
	version TIMEVERS {
		unsigned int TIMEGET(void) = 1;
	} = 1;
} = 44;

#ifdef RPC_SVC
%int *
%timeget_1()
%{
%	static int thetime;
%
%	thetime = time(0);
%	return (&thetime);
%}
#endif
When using the '%' feature, there is no guarantee that rpcgen will place the output where you intended. If you have problems of this type, we recommend that you not use this feature.
rpcgen Programming Notes

Network Types

By default rpcgen generates server code for both UDP and TCP transports. The -s flag creates a server which responds to requests on the specified transport. The following example creates a udp server:

example% rpcgen -s udp proto.x

User-Provided Define Statements

rpcgen also provides a means of defining symbols and assigning values to them. These defined symbols are passed on to the C preprocessor when it is invoked. This facility is useful when the user wants to, for example, invoke debugging code which is enabled only when the DEBUG symbol is defined. For example:
example% rpcgen -DDEBUG proto.x
Inetd Support

rpcgen can also be used to create RPC servers which can be invoked by inetd when a request for that service comes in.

example% rpcgen -I proto.x

The server code in proto_svc.c has the required support for inetd. For more information on how to set up the entry for RPC services in /etc/inetd.conf, please see the Using Inetd section of the Remote Procedure Call Programming Guide.
In many applications, it is useful for services to wait after satisfying a service request, on the chance that another will follow. However, if there is no call within the specified time, the server will exit and the port monitor will continue to monitor requests for its services. By default, services wait 120 seconds after servicing a request before exiting. The user can, however, change that interval with the -K flag.

example% rpcgen -I -K 20 proto.x

Here the server will wait only for 20 seconds before exiting. If you want the server to exit immediately, -K 0 can be used, while if the server is intended to stay around forever (a normal server) the appropriate argument is -K -1.
Dispatch Tables

There are a number of cases when dispatch tables are useful. For example, the server dispatch routine may need to check authorization and then invoke the service routine; or a client library may want to deal with the details of storage management and XDR data conversion.

example% rpcgen -T proto.x

Here rpcgen generates RPC dispatch tables for each program defined in the protocol description file, proto.x, in the file proto_tbl.i. (The suffix .i stands for "index".) See below for how to use this file when compiling programs. Each entry in the table is a struct rpcgen_table, defined in the header file proto.h as follows:
struct rpcgen_table {
	char		*(*proc)();
	xdrproc_t	xdr_arg;
	unsigned	len_arg;
	xdrproc_t	xdr_res;
	unsigned	len_res;
};
where

proc is a pointer to the service routine,
xdr_arg is a pointer to the input (argument) xdr routine,
len_arg is the length in bytes of the input argument,
xdr_res is a pointer to the output (result) xdr routine, and
len_res is the length in bytes of the output result.

The table, named dirprog_1_table, is indexed by procedure number. The variable dirprog_1_nproc contains the number of entries in the table.

An example of how to locate a procedure in the dispatch tables is demonstrated by the routine find_proc:
struct rpcgen_table *
find_proc(proc)
	long proc;
{
	if (proc >= dirprog_1_nproc)
		return (NULL);		/* error */
	else
		return (&dirprog_1_table[proc]);
}
Each entry in the dispatch table contains a pointer to the corresponding service routine. However, the service routine is not defined in the client code. To avoid generating unresolved external references, and to require only one source file for the dispatch table, the actual service routine initializer is RPCGEN_ACTION(proc_ver).

This way, the same dispatch table can be included in both the client and the server. Use the following define when compiling the client:

#define RPCGEN_ACTION(routine)	0

and use this define when compiling the server:

#define RPCGEN_ACTION(routine)	routine
Client Programming Notes

Timeout Changes

RPC sets a default timeout of 25 seconds for RPC calls when clnt_create() is used. This means RPC will wait for 25 seconds to get the results from the server. If it does not hear within that time period, then perhaps the server isn't running or the remote machine crashed or the network is unreachable. There are many possibilities of why no answer is heard. In such cases the function will return NULL and the error can be printed using clnt_perrno().
There are cases when the user wants to change the timeout value to accommodate the application needs or the fact that the server is slow and quite far away. The timeout can be changed using clnt_control(). Here is a small code fragment to demonstrate use of clnt_control():

struct timeval tv;
CLIENT *cl;

cl = clnt_create("somehost", SOMEPROG, SOMEVERS, "tcp");
if (cl == NULL) {
	exit(1);
}
tv.tv_sec = 60;		/* change timeout to 1 minute */
tv.tv_usec = 0;		/* this should always be set */
clnt_control(cl, CLSET_TIMEOUT, &tv);
Client Authentication

The client create routines do not, by default, have any facilities for client authentication, but the client may sometimes want to authenticate itself to the server. For more information on how to perform authentication, see the Authentication section of the Remote Procedure Call Programming Guide. Doing so is trivial, and looks like this:

CLIENT *cl;

cl = clnt_create("somehost", SOMEPROG, SOMEVERS, "udp");
if (cl != NULL) {
	/* To set UNIX style authentication */
	cl->cl_auth = authunix_create_default();
}
Server Programming Notes

Handling Broadcast on the Server Side

Clients may sometimes broadcast to find out whether a particular server exists on the network or just to find out about all the servers for a particular program and version number. These calls are made via clnt_broadcast(). Note that there is no rpcgen support for that. Please see the Broadcast RPC Synopsis section in the Remote Procedure Call Programming Guide.

When a procedure is known to be called via broadcast RPC, it is usually wise for the server to not reply unless it can provide some useful information to the client. This prevents the network from getting flooded by useless replies.

To prevent the server from replying, a remote procedure can return NULL as its result, and the server code generated by rpcgen will detect this and not send out a reply.
Here is an example of a procedure that replies only if it thinks it is an NFS server:

void *
reply_if_nfsserver()
{
	char notnull;	/* just here so we can use its address */

	if (access("/etc/exports", F_OK) < 0) {
		return (NULL);	/* prevent RPC from replying */
	}
	/*
	 * return non-null pointer so RPC will send out a reply
	 */
	return ((void *)&notnull);
}

Note that if a procedure returns type "void *", it must return a non-NULL pointer if it wants RPC to reply for it.
Other Information Passed to Server Procedures

Server procedures will often want to know more about an RPC call than just its arguments. For example, getting authentication information is important to procedures that want to implement some level of security. This extra information is actually supplied to the server procedure as a second argument. (For details see the structure of svc_req, in the Authentication section of the Remote Procedure Call Programming Guide.) Here is an example to demonstrate its use. What we've done here is rewrite the previous printmessage_1() procedure to only allow root users to print a message to the console.

int *
printmessage_1(msg, rqstp)
	char **msg;
	struct svc_req *rqstp;
{
	static int result;	/* Must be static */
	FILE *f;
	struct authunix_parms *aup;

	aup = (struct authunix_parms *)rqstp->rq_clntcred;
	if (aup->aup_uid != 0) {
		result = 0;
		return (&result);
	}
	/*
	 * Same code as before.
	 */
}
RPC Language

RPC language is an extension of XDR language. The sole extension is the addition of the program and version types. For a complete description of the XDR language syntax, see the External Data Representation Standard: Protocol Specification chapter. For a description of the RPC extensions to the XDR language, see the Remote Procedure Calls: Protocol Specification chapter.

However, XDR language is so close to C that if you know C, you know most of it already. We describe here the syntax of the RPC language, showing a few examples along the way. We also show how the various RPC and XDR type definitions get compiled into C type definitions in the output header file.
Definitions
An RPC language file consists of a series of definitions.
definition-list:
definition ";"
definition ";" definition-list
It recognizes the following types of definitions.
definition:
enum-definition
typedef-definition
const-definition
declaration-definition
struct-definition
union-definition
program-definition
Enumerations
XDR enumerations have the same syntax as C enumerations.
enum-definition:
"enum" enum-ident "{"
enum-value-list
"}"
enum-value-list:
enum-value
enum-value "," enum-value-list
enum-value:
enum-value-ident
enum-value-ident "=" value
Here is a short example of an XDR enum, and the C enum that it gets compiled
into.
enum colortype {
	RED = 0,
	GREEN = 1,
	BLUE = 2
};

	-->

enum colortype {
	RED = 0,
	GREEN = 1,
	BLUE = 2,
};
typedef enum colortype colortype;
Typedefs
XDR typedefs have the same syntax as C typedefs.
typedef-definition:
"typedef" declaration
Here is an example that defines an fname_type used for declaring file name strings that have a maximum length of 255 characters.

typedef string fname_type<255>;	-->	typedef char *fname_type;
Constants
XDR constants may be used wherever an integer constant is used, for example, in array size specifications.
const-definition:
"const" const-ident "=" integer
For example, the following defines a constant DOZEN equal to 12.

const DOZEN = 12;	-->	#define DOZEN 12

Declarations
In XDR, there are only four kinds of declarations.
declaration:
simple-declaration
fixed-array-declaration
variable-array-declaration
pointer-declaration
1) Simple declarations are just like simple C declarations.
simple-declaration:
type-ident variable-ident
Example:
colortype color;
--> colortype color;
2) Fixed-length Array Declarations are just like C array declarations:
fixed-array-declaration:
type-ident variable-ident "[" value "]"
Example:
colortype palette[8];
--> colortype palette[8];
3) Variable-Length Array Declarations have no explicit syntax in C, so XDR
invents its own using angle-brackets.
variable-array-declaration:
	type-ident variable-ident "<" value ">"
	type-ident variable-ident "<" ">"
The maximum size is specified between the angle brackets. The size may be
omitted, indicating that the array may be of any size.
int heights<12>;	/* at most 12 items */
int widths<>;		/* any number of items */
Since variable-length arrays have no explicit syntax in C, these declarations are actually compiled into "struct"s. For example, the "heights" declaration gets compiled into the following struct:

struct {
	u_int heights_len;	/* # of items in array */
	int *heights_val;	/* pointer to array */
} heights;

Note that the number of items in the array is stored in the "_len" component and the pointer to the array is stored in the "_val" component. The first part of each of these component's names is the same as the name of the declared XDR variable.
4) Pointer Declarations are made in XDR exactly as they are in C. You can't
really send pointers over the network, but you can use XDR pointers for sending
recursive data types such as lists and trees. The type is actually called
"optional-data", not "pointer", in XDR language.
pointer-declaration:
	type-ident "*" variable-ident

Example:

listitem *next;		-->	listitem *next;

Structures
An XDR struct is declared almost exactly like its C counterpart. It looks like the
following:
struct-definition:
	"struct" struct-ident "{"
		declaration-list
	"}"

declaration-list:
	declaration ";"
	declaration ";" declaration-list
As an example, here is an XDR structure describing a two-dimensional coordinate, and the C structure that it gets compiled into in the output header file.

struct coord {
	int x;
	int y;
};

	-->

struct coord {
	int x;
	int y;
};
typedef struct coord coord;

The output is identical to the input, except for the added typedef at the end of the output. This allows one to use "coord" instead of "struct coord" in declarations.
Unions
XDR unions are discriminated unions, and look quite different from C unions.
They are more analogous to Pascal variant records than they are to C unions.
union-definition:
	"union" union-ident "switch" "(" simple-declaration ")" "{"
		case-list
	"}"

case-list:
	"case" value ":" declaration ";"
	"case" value ":" declaration ";" case-list
	"default" ":" declaration ";"
Here is an example of a type that might be returned as the result of a "read data"
operation. If there is no error, return a block of data. Otherwise, don't return
anything.
union read_result switch (int errno) {
case 0:
opaque data[1024];
default:
void;
};
It gets compiled into the following:

struct read_result {
	int errno;
	union {
		char data[1024];
	} read_result_u;
};
typedef struct read_result read_result;

Notice that the union component of the output struct has the same name as the structure type name, except for the trailing "_u".
Programs
RPC programs are declared using the following syntax:
program-definition:
"program" program-ident "{"
version-list
"}" "=" value
version-list:
	version ";"
	version ";" version-list

version:
	"version" version-ident "{"
		procedure-list
	"}" "=" value

procedure-list:
	procedure ";"
	procedure ";" procedure-list

procedure:
	type-ident procedure-ident "(" type-ident ")" "=" value
For example, here is the time protocol, revisited:

/*
 * time.x: Get or set the time. Time is represented as number of seconds
 * since 0:00, January 1, 1970.
 */
program TIMEPROG {
	version TIMEVERS {
		unsigned int TIMEGET(void) = 1;
		void TIMESET(unsigned) = 2;
	} = 1;
} = 44;

This file compiles into these #defines in the output header file:

#define TIMEPROG 44
#define TIMEVERS 1
#define TIMEGET 1
#define TIMESET 2
Special Cases

There are a few exceptions to the rules described above.

Booleans: C has no built-in boolean type. However, the RPC library has a boolean type called bool_t that is either TRUE or FALSE. Things declared as type bool in XDR language are compiled into bool_t in the output header file.
Example:

bool married;		-->	bool_t married;
Strings: C has no built-in string type, but instead uses the null-terminated "char *" convention. In XDR language, strings are declared using the "string" keyword, and compiled into "char *"s in the output header file. The maximum size contained in the angle brackets specifies the maximum number of characters allowed in the strings (not counting the NULL character). The maximum size may be left off, indicating a string of arbitrary length.
Examples:

string name<32>;	-->	char *name;
string longname<>;	-->	char *longname;
Opaque Data: Opaque data is used in RPC and XDR to describe untyped data,
that is, just sequences of arbitrary bytes. It may be declared either as a fixed or
variable length array.
Examples:

opaque diskblock[512];	-->	char diskblock[512];

opaque filedata<1024>;	-->	struct {
					u_int filedata_len;
					char *filedata_val;
				} filedata;
Voids: In a void declaration, the variable is not named. The declaration is just "void" and nothing else. Void declarations can only occur in two places: union definitions and program definitions (as the argument or result of a remote procedure).
4. Remote Procedure Call Programming Guide
This document assumes a working knowledge of network theory. It is intended for programmers who wish to write network applications using remote procedure calls (explained below), and who want to understand the RPC mechanisms usually hidden by the rpcgen(1) protocol compiler. rpcgen is described in detail in the previous chapter, the rpcgen Programming Guide.

NOTE: Before attempting to write a network application, or to convert an existing non-network application to run over the network, you may want to understand the material in this chapter. However, for most applications, you can circumvent the need to cope with the details presented here by using rpcgen. The An Advanced Example section of that chapter contains the complete source for a working RPC service: a remote directory listing service which uses rpcgen to generate XDR routines as well as client and server stubs.

4.1. Layers of RPC

The RPC interface can be seen as being divided into three layers.6
The Highest Layer: The highest layer is totally transparent to the operating system, machine and network upon which it is run. It's probably best to think of this level as a way of using RPC, rather than as a part of RPC proper. Programmers who write RPC routines should (almost) always make this layer available to others by way of a simple C front end that entirely hides the networking.

To illustrate, at this level a program can simply make a call to rnusers(), a C routine which returns the number of users on a remote machine. The user is not explicitly aware of using RPC: they simply call a procedure, just as they would call malloc().
The Middle Layer: The middle simplified layer is really "RPC proper." Here, the user doesn't need to consider details about sockets, the UNIX system, or other low-level implementation mechanisms. They simply make remote procedure calls to routines on other machines. The selling point here is simplicity. It's this layer that allows RPC to pass the "hello world" test: simple things should be simple. The middle layer routines are used for most applications.

Simplified RPC calls are made with the system routines registerrpc(), callrpc() and svc_run(). registerrpc() obtains a unique system-wide procedure-identification number, and callrpc() actually executes a remote procedure call. At the middle level, a call to rnusers() is implemented by way of these two routines.

6. For a complete specification of the routines in the remote procedure call library, see the rpc(3N) manual page.
The middle layer is rarely used in serious programming due to its inflexibility
(simplicity). It does not allow timeout specifications or the choice of transport.
It allows no UNIX process control or flexibility in case of errors. It doesn't support multiple kinds of call authentication. The programmer rarely needs all these
kinds of control, but one or two of them is often necessary.
The Lowest Layer: The lowest layer does allow these details to be controlled by the programmer. Programs written at this level are also most efficient and allow for flexibility. The lowest layer routines include client creation routines such as clnt_create(), the actual client call clnt_call(), server creation routines such as svcudp_create(), and the server registration routine svc_register().
Higher Layers of RPC

This layer consists of RPC-library based services. Imagine you're writing a program that needs to know how many users are logged into a remote machine. You can do this by calling the RPC library routine rnusers(), as illustrated below:

#include <stdio.h>

main(argc, argv)
	int argc;
	char **argv;
{
	int num;

	if (argc != 2) {
		fprintf(stderr, "usage: rnusers hostname\n");
		exit(1);
	}
	if ((num = rnusers(argv[1])) < 0) {
		fprintf(stderr, "error: rnusers\n");
		exit(1);
	}
	printf("%d users on %s\n", num, argv[1]);
	exit(0);
}
RPC library routines such as rnusers() are in the RPC services library librpcsvc.a. Thus, the program above should be compiled with

example% cc program.c -lrpcsvc

rnusers(), like the other RPC library routines, is documented in section 3R of the System Services Overview, the same section which documents the standard Sun RPC services. See the intro(3R) manual page for an explanation of the documentation strategy for these services and their RPC protocols.
Here are some of the RPC service library routines available to the C programmer:

Table 4-1	RPC Service Library Routines

Routine		Description
rnusers		Return number of users on remote machine
rusers		Return information about users on remote machine
havedisk	Determine if remote machine has disk
rstat		Get performance data from remote kernel
rwall		Write to specified remote machines
yppasswd	Update user password in Network Information Service
Other RPC services (for example ether, mount, rquota, and spray) are not available to the C programmer as library routines. They do, however, have RPC program numbers so they can be invoked with callrpc(), which will be discussed in the next section. Most of them also have compilable rpcgen(1) protocol description files. Some of the files (in the form *.x) may be found in /usr/include/rpcsvc. (The rpcgen protocol compiler radically simplifies the process of developing network applications. See the rpcgen Programming Guide chapter for detailed information about rpcgen and rpcgen protocol description files.)
Middle Layers of RPC
The simplest interface that explicitly makes RPC calls uses the functions
callrpc() and registerrpc(). Using this method, the number of remote
users can be obtained as follows:
#include <stdio.h>
#include <rpc/rpc.h>
#include <rpcsvc/rusers.h>

main(argc, argv)
    int argc;
    char **argv;
{
    unsigned long nusers;
    int stat;

    if (argc != 2) {
        fprintf(stderr, "usage: nusers hostname\n");
        exit(1);
    }
    if ((stat = callrpc(argv[1],
            RUSERSPROG, RUSERSVERS, RUSERSPROC_NUM,
            xdr_void, 0, xdr_u_long, &nusers)) != 0) {
        clnt_perrno(stat);
        exit(1);
    }
    printf("%d users on %s\n", nusers, argv[1]);
    exit(0);
}
The simplest way of making remote procedure calls is with the RPC library
routine callrpc(). It has eight parameters. The first is the name of the
remote server machine. The next three parameters are the program, version, and
procedure numbers; together they identify the procedure to be called. The
fifth and sixth parameters are an XDR filter and an argument to be encoded and
passed to the remote procedure. An XDR filter is a user-provided procedure
that can encode or decode machine-native data to or from the XDR format. The
final two parameters are an XDR filter for decoding the results returned by
the remote procedure and a pointer to the place where the procedure's results
are to be stored. Multiple arguments and results are handled by embedding them
in structures. If callrpc() completes successfully, it returns zero; otherwise
it returns a nonzero value. The return codes are found in <rpc/clnt.h>.
callrpc() needs both the type of the RPC argument and a pointer to the
argument itself (and similarly for the result). For RUSERSPROC_NUM, the return
value is an unsigned long, so callrpc() has xdr_u_long() as its first return
parameter, which says that the result is of type unsigned long, and &nusers as
its second return parameter, which is a pointer to where the long result will
be placed. Since RUSERSPROC_NUM takes no argument, the argument-encoding
parameter of callrpc() is xdr_void. In such cases the argument pointer should
be NULL.
If callrpc() gets no answer after trying several times to deliver a message,
it returns with an error code. Methods for adjusting the number of retries or
for using a different protocol require you to use the lower layer of the RPC
library, discussed later in this document.
The remote server procedure corresponding to the above might look like this:
unsigned long *
nuser(indata)
    char *indata;
{
    static unsigned long nusers;

    /*
     * Code here to compute the number of users
     * and place result in variable nusers.
     */
    return (&nusers);
}
It takes one argument, which is a pointer to the input of the remote procedure
call (ignored in our example), and it returns a pointer to the result. In the
current version of C, character pointers are the generic pointers, so both the
input argument and the return value can be cast to char *.
Normally, a server registers all of the RPC calls it plans to handle, and then
goes into an infinite loop waiting to service requests. If rpcgen is used to
provide this functionality, it will also generate a server dispatch function.
But users can write the servers themselves using registerrpc(), especially for
simple applications like the one shown here. In this example, there is only a
single procedure to register, so the main body of the server would look like
this:
#include <stdio.h>
#include <rpc/rpc.h>
#include <rpcsvc/rusers.h>  /* required for prog, vers definitions */

unsigned long *nuser();

main()
{
    registerrpc(RUSERSPROG, RUSERSVERS, RUSERSPROC_NUM,
        nuser, xdr_void, xdr_u_long);
    svc_run();          /* Never returns */
    fprintf(stderr, "Error: svc_run returned!\n");
    exit(1);
}
The registerrpc() routine registers a procedure as corresponding to a given
RPC procedure number. The first three parameters, RUSERSPROG, RUSERSVERS, and
RUSERSPROC_NUM, are the program, version, and procedure numbers of the remote
procedure to be registered; nuser() is the name of the local procedure that
implements the remote procedure; and xdr_void() and xdr_u_long() are the XDR
filters for the remote procedure's arguments and results, respectively.
(Multiple arguments or multiple results are passed as structures.)
The underlying transport mechanism used by both callrpc() and
registerrpc() is UDP.
WARNING: the UDP transport mechanism can only deal with arguments and
results less than 8K bytes in length.
After registering the local procedure, the server program's main procedure
calls svc_run(), the RPC library's remote procedure dispatcher. It is this
function that calls the remote procedures in response to RPC requests. Note
that the dispatcher takes care of decoding remote procedure arguments and
encoding results, using the XDR filters specified when the remote procedure
was registered with registerrpc().
Passing Arbitrary Data Types
In the previous example, the RPC call passes a single unsigned long. RPC can
handle arbitrary data structures, regardless of different machines' byte
orders or structure layout conventions, by always converting them to a network
standard called External Data Representation (XDR) before sending them over
the wire. The process of converting from a particular machine representation
to XDR format is called serializing, and the reverse process is called
deserializing. The type-field parameters of callrpc() and registerrpc() can be
a built-in procedure like xdr_u_long() in the previous example, or a
user-supplied one.
XDR has these built-in type routines:
xdr_int()        xdr_u_int()      xdr_enum()
xdr_long()       xdr_u_long()     xdr_bool()
xdr_short()      xdr_u_short()    xdr_wrapstring()
xdr_char()       xdr_u_char()
Note that the routine xdr_string() exists, but cannot be used with callrpc()
and registerrpc(), which only pass two parameters to their XDR routines.
Instead, xdr_wrapstring() can be used; it takes only two parameters, and
simply calls xdr_string().
As an example of a user-defined type routine, if you wanted to send the
structure

struct simple {
    int a;
    short b;
} simple;

then you would call callrpc() as

callrpc(hostname, PROGNUM, VERSNUM, PROCNUM,
    xdr_simple, &simple ... );
where xdr_simple() is written as:

#include <rpc/rpc.h>

xdr_simple(xdrsp, simplep)
    XDR *xdrsp;
    struct simple *simplep;
{
    if (!xdr_int(xdrsp, &simplep->a))
        return (0);
    if (!xdr_short(xdrsp, &simplep->b))
        return (0);
    return (1);
}
An XDR routine returns nonzero (true in the sense of C) if it completes
successfully, and zero otherwise. A complete description of XDR is in the XDR
Protocol Specification section of this manual; only a few implementation
examples are given here.
NOTE: We strongly recommend that rpcgen be used to generate XDR routines. The
-c option of rpcgen can be used to generate just the _xdr.c file.
In addition to the built-in primitives, there are also the prefabricated building
blocks:
xdr_array()      xdr_union()
xdr_vector()     xdr_opaque()
xdr_string()     xdr_reference()
xdr_bytes()      xdr_pointer()
To send a variable-length array of integers, you might package them up as a
structure like this

struct varintarr {
    int *data;
    int arrlnth;
} arr;

and make an RPC call such as

callrpc(hostname, PROGNUM, VERSNUM, PROCNUM,
    xdr_varintarr, &arr ... );
with xdr_varintarr() defined as:

xdr_varintarr(xdrsp, arrp)
    XDR *xdrsp;
    struct varintarr *arrp;
{
    return (xdr_array(xdrsp, &arrp->data, &arrp->arrlnth,
        MAXLEN, sizeof(int), xdr_int));
}
This routine takes as parameters the XDR handle, a pointer to the array, a pointer
to the size of the array, the maximum allowable array size, the size of each array
element, and an XDR routine for handling each array element.
If the size of the array is known in advance, one can use xdr_vector(), which
serializes fixed-length arrays.

int intarr[SIZE];

xdr_intarr(xdrsp, intarr)
    XDR *xdrsp;
    int intarr[];
{
    return (xdr_vector(xdrsp, intarr, SIZE, sizeof(int),
        xdr_int));
}
XDR always converts quantities to 4-byte multiples when serializing. Thus, if
either of the examples above involved characters instead of integers, each
character would occupy 32 bits. That is the reason for the XDR routine
xdr_bytes(), which is like xdr_array() except that it packs characters;
xdr_bytes() has four parameters, similar to the first four parameters of
xdr_array(). For null-terminated strings, there is also the xdr_string()
routine, which is the same as xdr_bytes() without the length parameter. On
serializing it gets the string length from strlen(), and on deserializing it
creates a null-terminated string.
Here is a final example that calls the previously written xdr_simple() as well
as the built-in functions xdr_string() and xdr_reference(), which chases
pointers:
struct finalexample {
    char *string;
    struct simple *simplep;
} finalexample;

xdr_finalexample(xdrsp, finalp)
    XDR *xdrsp;
    struct finalexample *finalp;
{
    if (!xdr_string(xdrsp, &finalp->string, MAXSTRLEN))
        return (0);
    if (!xdr_reference(xdrsp, &finalp->simplep,
            sizeof(struct simple), xdr_simple))
        return (0);
    return (1);
}
Note that we could as easily call xdr_simple() here instead of
xdr_reference().
Lower Layers of RPC
In the examples given so far, RPC takes care of many details automatically for
you. In this section, we'll show you how you can change the defaults by using
lower layers of the RPC library.

There are several occasions when you may need to use lower layers of RPC.
First, you may need to use TCP, since the higher layer uses UDP, which
restricts RPC calls to 8K bytes of data. Using TCP permits calls to send long
streams of data. For an example, see the TCP section below. Second, you may
want to allocate and free memory while serializing or deserializing with XDR
routines. There is no call at the higher level to let you free memory
explicitly. For more explanation, see the Memory Allocation with XDR section
below. Third, you may need to perform authentication on either the client or
server side, by supplying credentials or verifying them. See the explanation
in the Authentication section below.
More on the Server Side
The server for the nusers() program shown below does the same thing as the one
using registerrpc() above, but is written using a lower layer of the RPC
package:
#include <stdio.h>
#include <rpc/rpc.h>
#include <utmp.h>
#include <rpcsvc/rusers.h>

main()
{
    SVCXPRT *transp;
    int nuser();

    transp = svcudp_create(RPC_ANYSOCK);
    if (transp == NULL) {
        fprintf(stderr, "can't create an RPC server\n");
        exit(1);
    }
    pmap_unset(RUSERSPROG, RUSERSVERS);
    if (!svc_register(transp, RUSERSPROG, RUSERSVERS,
            nuser, IPPROTO_UDP)) {
        fprintf(stderr, "can't register RUSER service\n");
        exit(1);
    }
    svc_run();          /* Never returns */
    fprintf(stderr, "should never reach this point\n");
}

nuser(rqstp, transp)
    struct svc_req *rqstp;
    SVCXPRT *transp;
{
    unsigned long nusers;

    switch (rqstp->rq_proc) {
    case NULLPROC:
        if (!svc_sendreply(transp, xdr_void, 0))
            fprintf(stderr, "can't reply to RPC call\n");
        return;
    case RUSERSPROC_NUM:
        /*
         * Code here to compute the number of users
         * and assign it to the variable nusers
         */
        if (!svc_sendreply(transp, xdr_u_long, &nusers))
            fprintf(stderr, "can't reply to RPC call\n");
        return;
    default:
        svcerr_noproc(transp);
        return;
    }
}
First, the server gets a transport handle, which is used for receiving and
replying to RPC messages. If the argument to svcudp_create() is RPC_ANYSOCK,
the RPC library creates a socket on which to receive and reply to RPC calls.
Otherwise, svcudp_create() expects its argument to be a valid socket number.
If you specify your own socket, it can be bound or unbound. If it is bound to
a port by the user, the port numbers passed to svcudp_create() and
clntudp_create() (the low-level client routine) must match.

registerrpc() uses svcudp_create() to get a UDP handle. If you require a more
reliable protocol, call svctcp_create() instead.

After creating an SVCXPRT, the next step is to call pmap_unset() so that if
the nusers() server crashed earlier, any previous trace of it is erased before
restarting. More precisely, pmap_unset() erases the entry for RUSERSPROG from
the portmapper's tables.

Finally, we associate the program number RUSERSPROG and version RUSERSVERS
with the procedure nuser(); the protocol specified, in this case, is
IPPROTO_UDP. Notice that unlike registerrpc(), there are no XDR routines
involved in the registration process. Also, registration is done on the
program level rather than the procedure level. A service may choose to
register its port number with the local portmapper service. This is done by
specifying a non-zero protocol number in the final argument of
svc_register(). A client can discover the server's port number by consulting
the portmapper on the server's machine. This can be done automatically by
specifying a zero port number in clntudp_create() or clnttcp_create().
The user routine nuser() must call and dispatch the appropriate XDR routines
based on the procedure number. Note that two things are handled by nuser()
that registerrpc() handles automatically. The first is that procedure NULLPROC
(currently zero) returns with no results. This can be used as a simple test
for detecting whether a remote program is running. Second, there is a check
for invalid procedure numbers. If one is detected, svcerr_noproc() is called
to handle the error.
The user service routine serializes the results and returns them to the RPC
caller via svc_sendreply(). Its first parameter is the SVCXPRT handle, the
second is the XDR routine, and the third is a pointer to the data to be
returned. Note that it is not required to have nusers declared as static here,
because svc_sendreply() is called within that function itself. Not illustrated
above is how a server handles an RPC program that receives data. As an
example, we can add a procedure RUSERSPROC_BOOL, which takes an argument
nuserquery and returns TRUE or FALSE depending on whether the number of users
logged on equals nuserquery. It would look like this:
case RUSERSPROC_BOOL: {
    int bool;
    unsigned nuserquery;

    if (!svc_getargs(transp, xdr_u_int, &nuserquery)) {
        svcerr_decode(transp);
        return;
    }
    /*
     * Code to set nusers = number of users
     */
    if (nuserquery == nusers)
        bool = TRUE;
    else
        bool = FALSE;
    if (!svc_sendreply(transp, xdr_bool, &bool))
        fprintf(stderr, "can't reply to RPC call\n");
    return;
}
The relevant routine is svc_getargs(), which takes as arguments an SVCXPRT
handle, the XDR routine, and a pointer to where the input is to be placed.
More on the Client Side
When you use callrpc(), you have no control over the RPC delivery mechanism or
the socket used to transport the data. To illustrate the layer of RPC that
lets you adjust these parameters, consider the following code to call the
nusers service:
#include <stdio.h>
#include <rpc/rpc.h>
#include <rpcsvc/rusers.h>
#include <sys/time.h>
#include <netdb.h>

main(argc, argv)
    int argc;
    char **argv;
{
    struct hostent *hp;
    struct timeval pertry_timeout, total_timeout;
    struct sockaddr_in server_addr;
    int sock = RPC_ANYSOCK;
    register CLIENT *client;
    enum clnt_stat clnt_stat;
    unsigned long nusers;

    if (argc != 2) {
        fprintf(stderr, "usage: nusers hostname\n");
        exit(-1);
    }
    if ((hp = gethostbyname(argv[1])) == NULL) {
        fprintf(stderr, "can't get addr for %s\n", argv[1]);
        exit(-1);
    }
    pertry_timeout.tv_sec = 3;
    pertry_timeout.tv_usec = 0;
    bcopy(hp->h_addr, (caddr_t)&server_addr.sin_addr,
        hp->h_length);
    server_addr.sin_family = AF_INET;
    server_addr.sin_port = 0;
    if ((client = clntudp_create(&server_addr, RUSERSPROG,
            RUSERSVERS, pertry_timeout, &sock)) == NULL) {
        clnt_pcreateerror("clntudp_create");
        exit(-1);
    }
    total_timeout.tv_sec = 20;
    total_timeout.tv_usec = 0;
    clnt_stat = clnt_call(client, RUSERSPROC_NUM, xdr_void,
        0, xdr_u_long, &nusers, total_timeout);
    if (clnt_stat != RPC_SUCCESS) {
        clnt_perror(client, "rpc");
        exit(-1);
    }
    printf("%d users on %s\n", nusers, argv[1]);
    clnt_destroy(client);
    exit(0);
}
The CLIENT pointer is encoded with the transport mechanism. callrpc() uses
UDP, thus it calls clntudp_create() to get a CLIENT pointer. To get TCP you
would use clnttcp_create().

The parameters to clntudp_create() are the server address, the program number,
the version number, a timeout value (between tries), and a pointer to a
socket. Only when sin_port is 0 is the remote portmapper queried to find out
the address of the remote service.
The low-level version of callrpc() is clnt_call(), which takes a CLIENT
pointer rather than a host name. The parameters to clnt_call() are a CLIENT
pointer, the procedure number, the XDR routine for serializing the argument, a
pointer to the argument, the XDR routine for deserializing the return value, a
pointer to where the return value will be placed, and the time in seconds to
wait for a reply. If the client does not hear from the server within the time
specified in pertry_timeout, the request may be sent again to the server.
Thus, the number of tries that clnt_call() will make to contact the server is
the clnt_call() timeout divided by the clntudp_create() timeout.
Note that the clnt_destroy() call always deallocates the space associated with
the CLIENT handle. It closes the socket associated with the CLIENT handle only
if the RPC library opened it. If the socket was opened by the user, it stays
open. This makes it possible, in cases where there are multiple client handles
using the same socket, to destroy one handle without closing the socket that
other handles are using.
To make a stream connection, the call to clntudp_create() is replaced with
clnttcp_create():

clnttcp_create(&server_addr, prognum, versnum, &sock,
    inbufsize, outbufsize);
There is no timeout argument; instead, the receive and send buffer sizes must
be specified. When the clnttcp_create() call is made, a TCP connection is
established. All RPC calls using that CLIENT handle use this connection. The
server side of an RPC call using TCP has svcudp_create() replaced by
svctcp_create():

transp = svctcp_create(RPC_ANYSOCK, 0, 0);

The last two arguments to svctcp_create() are the send and receive buffer
sizes, respectively. If 0 is specified for either of these, the system chooses
default values.

The simplest routine to create a client handle is clnt_create():

clnt = clnt_create(server_host, prognum, versnum, transport);

The parameters are the name of the host on which the service resides, the
program and version numbers, and the transport to be used. The transport can
be either "udp" for UDP or "tcp" for TCP. It is possible to change the default
timeouts using clnt_control(). For more details, look under the Client
Programming Notes section in the rpcgen Programming Guide.
Memory Allocation with XDR
XDR routines not only do input and output, they may also do memory allocation.
This is why the second parameter of xdr_array() is a pointer to an array,
rather than the array itself. If it is NULL, then xdr_array() allocates space
for the array and returns a pointer to it, putting the size of the array in
the third argument. As an example, consider the following XDR routine
xdr_chararr1(), which deals with a fixed array of bytes with length SIZE:
xdr_chararr1(xdrsp, chararr)
    XDR *xdrsp;
    char chararr[];
{
    char *p;
    int len;

    p = chararr;
    len = SIZE;
    return (xdr_bytes(xdrsp, &p, &len, SIZE));
}

If space has already been allocated in chararr, it can be called from a server
like this:

char chararr[SIZE];

svc_getargs(transp, xdr_chararr1, chararr);
If you want XDR to do the allocation, you would have to rewrite this routine
in the following way:

xdr_chararr2(xdrsp, chararrp)
    XDR *xdrsp;
    char **chararrp;
{
    int len;

    len = SIZE;
    return (xdr_bytes(xdrsp, chararrp, &len, SIZE));
}
Then the RPC call might look like this:

char *arrptr;

arrptr = NULL;
svc_getargs(transp, xdr_chararr2, &arrptr);
/*
 * Use the result here
 */
svc_freeargs(transp, xdr_chararr2, &arrptr);
Note that, after being used, the character array can be freed with
svc_freeargs(). svc_freeargs() will not attempt to free any memory if the
variable indicating it is NULL. For example, in the routine
xdr_finalexample() given earlier, if finalp->string was NULL, then it would
not be freed. The same is true for finalp->simplep.
To summarize, each XDR routine is responsible for serializing, deserializing,
and freeing memory. When an XDR routine is called from callrpc(), the
serializing part is used. When called from svc_getargs(), the deserializer is
used. And when called from svc_freeargs(), the memory deallocator is used.
When building simple examples like those in this section, a user doesn't have
to worry about the three modes. See the External Data Representation: Sun
Technical Notes chapter for examples of more sophisticated XDR routines that
determine which of the three modes they are in and adjust their behavior
accordingly.
4.2. Raw RPC
Finally, there are two pseudo-RPC interface routines which are intended only
for testing purposes. These routines, clntraw_create() and svcraw_create(),
don't actually involve the use of any real transport at all. They exist to
help the developer debug and test the non-communications-oriented aspects of
an application before running it over a real network. Here's an example of
their use:
/*
 * A simple program to increment the number by 1
 */
#include <stdio.h>
#include <rpc/rpc.h>
#include <rpc/raw.h>        /* required for raw */

struct timeval TIMEOUT = {0, 0};
static void server();

main(argc, argv)
    int argc;
    char **argv;
{
    CLIENT *clnt;
    SVCXPRT *svc;
    int num = 0, num1;

    if (argc == 2)
        num = atoi(argv[1]);
    svc = svcraw_create();
    if (svc == NULL) {
        fprintf(stderr, "Could not create server handle\n");
        exit(1);
    }
    svc_register(svc, 200000, 1, server, 0);
    clnt = clntraw_create(200000, 1);
    if (clnt == NULL) {
        clnt_pcreateerror("raw");
        exit(1);
    }
    if (clnt_call(clnt, 1, xdr_int, &num, xdr_int, &num1,
            TIMEOUT) != RPC_SUCCESS) {
        clnt_perror(clnt, "raw");
        exit(1);
    }
    printf("Client: number returned %d\n", num1);
    exit(0);
}

static void
server(rqstp, transp)
    struct svc_req *rqstp;
    SVCXPRT *transp;
{
    int num;

    switch (rqstp->rq_proc) {
    case 0:
        if (svc_sendreply(transp, xdr_void, 0) == NULL) {
            fprintf(stderr, "error in null proc\n");
            exit(1);
        }
        return;
    case 1:
        break;
    default:
        svcerr_noproc(transp);
        return;
    }
    if (!svc_getargs(transp, xdr_int, &num)) {
        svcerr_decode(transp);
        return;
    }
    num++;
    if (svc_sendreply(transp, xdr_int, &num) == NULL) {
        fprintf(stderr, "error in sending answer\n");
        exit(1);
    }
    return;
}
Note the following points:

1. All the RPC calls occur within the same thread of control.
2. svc_run() is not called.
3. It is necessary that the server be created before the client.
4. svcraw_create() takes no parameters.
5. The last parameter to svc_register() is 0, which means that it will not
   register with the portmapper.
6. The server dispatch routine is the same as it is for normal RPC servers.
4.3. Other RPC Features
This section discusses some other aspects of RPC that are useful for the RPC
programmer.
Select on the Server Side
Suppose a process is processing RPC requests while performing some other
activity. If the other activity involves periodically updating a data
structure, the process can set an alarm signal before calling svc_run(). But
if the other activity involves waiting on a file descriptor, the svc_run()
call won't work. The code for svc_run() is as follows:
void
svc_run()
{
    fd_set readfds;
    int dtbsz = getdtablesize();

    for (;;) {
        readfds = svc_fds;
        switch (select(dtbsz, &readfds, NULL, NULL, NULL)) {
        case -1:
            if (errno != EBADF)
                continue;
            perror("select");
            return;
        case 0:
            continue;
        default:
            svc_getreqset(&readfds);
        }
    }
}
You can bypass svc_run() and call svc_getreqset() yourself. All you need to
know are the file descriptors of the socket(s) associated with the programs
you are waiting on. Thus you can have your own select() that waits on both the
RPC socket and your own descriptors. Note that svc_fds is a bit mask of all
the file descriptors that RPC is using for services. It can change every time
any RPC library routine is called, because descriptors are constantly being
opened and closed, for example for TCP connections.

Caution: if you are handling signals in your application, then either make
sure that you do not make any system calls (which could inadvertently set
errno), or restore errno to its old value before returning from your signal
handler.
Broadcast RPC
The portmapper is a daemon that converts RPC program numbers into DARPA
protocol port numbers; see The Portmapper section in the Network Services
chapter. You can't do broadcast RPC without the portmapper. Here are the main
differences between broadcast RPC and normal RPC:

1. Normal RPC expects one answer, whereas broadcast RPC expects many answers
   (one or more answers from each responding server).
2. Broadcast RPC can only be supported by packet-oriented (connectionless)
   transport protocols like UDP/IP.
3. The implementation of broadcast RPC treats all unsuccessful responses as
   garbage by filtering them out. Thus, if there is a version mismatch between
   the broadcaster and a remote service, the user of broadcast RPC never
   knows.
4. All broadcast messages are sent to the portmap port. Thus, only services
   that register themselves with their portmapper are accessible via the
   broadcast RPC mechanism.
5. Broadcast requests are limited in size to 1400 bytes. Replies can be up to
   8800 bytes (the current maximum UDP packet size).
Broadcast RPC Synopsis

#include <rpc/pmap_clnt.h>

enum clnt_stat clnt_stat;

clnt_stat = clnt_broadcast(prognum, versnum, procnum,
        inproc, in, outproc, out, eachresult)
    u_long    prognum;          /* program number */
    u_long    versnum;          /* version number */
    u_long    procnum;          /* procedure number */
    xdrproc_t inproc;           /* xdr routine for args */
    caddr_t   in;               /* pointer to args */
    xdrproc_t outproc;          /* xdr routine for results */
    caddr_t   out;              /* pointer to results */
    bool_t    (*eachresult)();  /* called with each result gotten */
The procedure eachresult() is called each time a response is obtained. It
returns a boolean that indicates whether or not the user wants more responses.

bool_t done;

done = eachresult(resultsp, raddr)
    caddr_t resultsp;
    struct sockaddr_in *raddr;  /* Addr of responding server */
If done is TRUE, then broadcasting stops and clnt_broadcast() returns
successfully. Otherwise, the routine waits for another response. The request
is rebroadcast after a few seconds of waiting. If no responses come back in a
default total timeout period, the routine returns with RPC_TIMEDOUT. You may
also refer to the Handling Broadcast on the Server Side section in the rpcgen
Programming Guide chapter.
Batching
In normal RPC, clients send a call message and wait for the server to reply
that the call succeeded. This implies that clients do not compute while
servers are processing a call, which is inefficient if the client does not
want or need an acknowledgement for every message sent. With batching, calls
made by clients are buffered and cause no processing on the servers; when the
connection is flushed, a normal RPC request is sent, and the server processes
the request and sends the reply back.
RPC messages can be placed in a "pipeline" of calls to a desired server; this
is called batching. Batching assumes that:

1. Each RPC call in the pipeline requires no response from the server, and
   the server does not send a response message.
2. The pipeline of calls is transported on a reliable byte stream transport
   such as TCP/IP.
Since the server does not respond to every call, the client can generate new
calls in parallel with the server executing previous calls. Furthermore, the
TCP/IP implementation can buffer up many call messages, and send them to the
server in one write() system call. This overlapped execution greatly decreases
the interprocess communication overhead of the client and server processes,
and the total elapsed time of a series of calls.

Since the batched calls are buffered, the client should eventually do a
nonbatched call in order to flush the pipeline.
A contrived example of batching follows. Assume a string rendering service
(like a window system) has two similar calls: one renders a string and returns
void results, while the other renders a string and remains silent. The service
(using the TCP/IP transport) may look like:
#include <stdio.h>
#include <rpc/rpc.h>
#include <suntool/windows.h>

void windowdispatch();

main()
{
    SVCXPRT *transp;

    transp = svctcp_create(RPC_ANYSOCK, 0, 0);
    if (transp == NULL) {
        fprintf(stderr, "can't create an RPC server\n");
        exit(1);
    }
    pmap_unset(WINDOWPROG, WINDOWVERS);
    if (!svc_register(transp, WINDOWPROG, WINDOWVERS,
            windowdispatch, IPPROTO_TCP)) {
        fprintf(stderr, "can't register WINDOW service\n");
        exit(1);
    }
    svc_run();          /* Never returns */
    fprintf(stderr, "should never reach this point\n");
}

void
windowdispatch(rqstp, transp)
    struct svc_req *rqstp;
    SVCXPRT *transp;
{
    char *s = NULL;

    switch (rqstp->rq_proc) {
    case NULLPROC:
        if (!svc_sendreply(transp, xdr_void, 0))
            fprintf(stderr, "can't reply to RPC call\n");
        return;
    case RENDERSTRING:
        if (!svc_getargs(transp, xdr_wrapstring, &s)) {
            fprintf(stderr, "can't decode arguments\n");
            /*
             * Tell caller he screwed up
             */
            svcerr_decode(transp);
            return;
        }
        /*
         * Code here to render the string s
         */
        if (!svc_sendreply(transp, xdr_void, NULL))
            fprintf(stderr, "can't reply to RPC call\n");
        break;
    case RENDERSTRING_BATCHED:
        if (!svc_getargs(transp, xdr_wrapstring, &s)) {
            fprintf(stderr, "can't decode arguments\n");
            /*
             * We are silent in the face of protocol errors
             */
            break;
        }
        /*
         * Code here to render string s, but send no reply!
         */
        break;
    default:
        svcerr_noproc(transp);
        return;
    }
    /*
     * Now free string allocated while decoding arguments
     */
    svc_freeargs(transp, xdr_wrapstring, &s);
}
Of course the service could have one procedure that takes the string and a
boolean to indicate whether or not the procedure should respond.

In order for a client to take advantage of batching, the client must perform
RPC calls on a TCP-based transport, and the actual calls must have the
following attributes:

1. The result's XDR routine must be zero (NULL).
2. The RPC call's timeout must be zero. Do not rely on clnt_control() to
   assist in batching.

If a UDP transport is used instead, the client call becomes a message to the
server and the RPC mechanism reduces to a message-passing system. No batching
is possible here.
Here is an example of a client that uses batching to render a bunch of
strings; the batching is flushed when the client gets a null string (EOF):
#include <stdio.h>
#include <rpc/rpc.h>
#include <suntool/windows.h>

main(argc, argv)
    int argc;
    char **argv;
{
    struct timeval total_timeout;
    register CLIENT *client;
    enum clnt_stat clnt_stat;
    char buf[1000], *s = buf;

    if ((client = clnt_create(argv[1],
            WINDOWPROG, WINDOWVERS, "tcp")) == NULL) {
        perror("clnttcp_create");
        exit(-1);
    }
    /* set timeout to zero */
    total_timeout.tv_sec = 0;
    total_timeout.tv_usec = 0;
    while (scanf("%s", s) != EOF) {
        clnt_stat = clnt_call(client, RENDERSTRING_BATCHED,
            xdr_wrapstring, &s, NULL, NULL, total_timeout);
        if (clnt_stat != RPC_SUCCESS) {
            clnt_perror(client, "batching rpc");
            exit(-1);
        }
    }

    /* Now flush the pipeline */
    total_timeout.tv_sec = 20;
    clnt_stat = clnt_call(client, NULLPROC, xdr_void, NULL,
        xdr_void, NULL, total_timeout);
    if (clnt_stat != RPC_SUCCESS) {
        clnt_perror(client, "batching rpc");
        exit(-1);
    }
    clnt_destroy(client);
    exit(0);
}
Since the server sends no message, the clients cannot be notified of any of the
failures that may occur. Therefore, clients are on their own when it comes to
handling errors.

The above example was completed to render all of the (2000) lines in the file
/etc/termcap. The rendering service did nothing but throw the lines away. The
example was run in the following four configurations:
1. machine to itself, regular RPC - 50 seconds

2. machine to itself, batched RPC - 16 seconds

3. machine to another, regular RPC - 52 seconds

4. machine to another, batched RPC - 10 seconds
Running only fscanf() on /etc/termcap requires six seconds. These timings
show the advantage of protocols that allow for overlapped execution, though
these protocols are often hard to design.
Authentication
In the examples presented so far, the caller never identified itself to the server,
and the server never required an ID from the caller. Clearly, some network
services, such as a network filesystem, require stronger security than what has
been presented so far.

In reality, every RPC call is authenticated by the RPC package on the server, and
similarly, the RPC client package generates and sends authentication parameters.
Just as different transports (TCP/IP or UDP/IP) can be used when creating RPC
clients and servers, different forms of authentication can be associated with RPC
clients; the default authentication type is none.

The authentication subsystem of the RPC package is open-ended. That is,
numerous types of authentication are easy to support.
UNIX Authentication
The Client Side
When a caller creates a new RPC client handle as in:

	clnt = clntudp_create(address, prognum, versnum,
			wait, sockp);

the appropriate transport instance defaults the associated authentication handle
to be

	clnt->cl_auth = authnone_create();
The RPC client can choose to use UNIX style authentication by setting
clnt->cl_auth after creating the RPC client handle:

	clnt->cl_auth = authunix_create_default();
This causes each RPC call associated with clnt to carry with it the following
authentication credentials structure:
/*
 * UNIX style credentials.
 */
struct authunix_parms {
	u_long	aup_time;	/* credentials creation time */
	char	*aup_machname;	/* host name where client is */
	int	aup_uid;	/* client's UNIX effective uid */
	int	aup_gid;	/* client's current group id */
	u_int	aup_len;	/* element length of aup_gids */
	int	*aup_gids;	/* array of groups user is in */
};
These fields are set by authunix_create_default() by invoking the
appropriate system calls. Since the RPC user created this new style of
authentication, the user is responsible for destroying it with:

	auth_destroy(clnt->cl_auth);

This should be done in all cases, to conserve memory.
The Server Side

Service implementors have a harder time dealing with authentication issues since
the RPC package passes the service dispatch routine a request that has an
arbitrary authentication style associated with it. Consider the fields of a request
handle passed to a service dispatch routine:
/*
 * An RPC Service request
 */
struct svc_req {
	u_long	rq_prog;	/* service program number */
	u_long	rq_vers;	/* service protocol vers num */
	u_long	rq_proc;	/* desired procedure number */
	struct opaque_auth rq_cred;	/* raw credentials from wire */
	caddr_t	rq_clntcred;	/* credentials (read only) */
};
The rq_cred is mostly opaque, except for one field of interest: the style or
flavor of authentication credentials:
/*
 * Authentication info.  Mostly opaque to the programmer.
 */
struct opaque_auth {
	enum_t	oa_flavor;	/* style of credentials */
	caddr_t	oa_base;	/* address of more auth stuff */
	u_int	oa_length;	/* not to exceed MAX_AUTH_BYTES */
};
The RPC package guarantees the following to the service dispatch routine:

1. That the request's rq_cred is well formed. Thus the service implementor
   may inspect the request's rq_cred.oa_flavor to determine which style
   of authentication the caller used. The service implementor may also wish to
   inspect the other fields of rq_cred if the style is not one of the styles
   supported by the RPC package.

2. That the request's rq_clntcred field is either NULL or points to a well
   formed structure that corresponds to a supported style of authentication
   credentials. rq_clntcred could be cast to a pointer to an
   authunix_parms structure. If rq_clntcred is NULL, the service
   implementor may wish to inspect the other (opaque) fields of rq_cred in
   case the service knows about a new type of authentication that the RPC
   package does not know about.
Our remote users service example can be extended so that it computes results for
all users except UID 16:
nuser(rqstp, transp)
	struct svc_req *rqstp;
	SVCXPRT *transp;
{
	struct authunix_parms *unix_cred;
	int uid;
	unsigned long nusers;

	/*
	 * we don't care about authentication for null proc
	 */
	if (rqstp->rq_proc == NULLPROC) {
		if (!svc_sendreply(transp, xdr_void, 0))
			fprintf(stderr, "can't reply to RPC call\n");
		return;
	}
	/*
	 * now get the uid
	 */
	switch (rqstp->rq_cred.oa_flavor) {
	case AUTH_UNIX:
		unix_cred =
		    (struct authunix_parms *)rqstp->rq_clntcred;
		uid = unix_cred->aup_uid;
		break;
	case AUTH_NULL:
	default:
		/* return weak authentication error */
		svcerr_weakauth(transp);
		return;
	}
	switch (rqstp->rq_proc) {
	case RUSERSPROC_NUM:
		/*
		 * make sure caller is allowed to call this proc
		 */
		if (uid == 16) {
			svcerr_systemerr(transp);
			return;
		}
		/*
		 * Code here to compute the number of users
		 * and assign it to the variable nusers
		 */
		if (!svc_sendreply(transp, xdr_u_long, &nusers))
			fprintf(stderr, "can't reply to RPC call\n");
		return;
	default:
		svcerr_noproc(transp);
		return;
	}
}
A few things should be noted here. First, it is customary not to check the
authentication parameters associated with the NULLPROC (procedure number
zero). Second, if the authentication parameter's type is not suitable for your
service, you should call svcerr_weakauth(). And finally, the service protocol
itself should return status for access denied; in the case of our example, the
protocol does not have such a status, so we call the service primitive
svcerr_systemerr() instead.

The last point underscores the relation between the RPC authentication package
and the services; RPC deals only with authentication and not with individual
services' access control. The services themselves must implement their own
access control policies and reflect these policies as return statuses in their
protocols.
DES Authentication

UNIX authentication can be defeated, in ways we won't explain here. Therefore
DES authentication is recommended for people who want more security than
what UNIX authentication offers. The details of the DES authentication protocol
are complicated and are not explained here. Please see the Remote Procedure
Calls: Protocol Specification section for the details.

In order for DES authentication to work, the keyserv(8C) daemon must be
running on both the server and client machines. The users on these machines
need public keys assigned by the network administrator in the publickey(5)
database. And, they need to have decrypted their secret keys using their login
password. This happens automatically when one logs in using login(1), or can
be done manually using keylogin(1). The Network Services chapter of
Network Programming explains more about how to set up secure networking.
Client Side

If a client wishes to use DES authentication, it must set its authentication handle
appropriately. Here is an example:

	cl->cl_auth =
		authdes_create(servername, 60, &server_addr, NULL);

The first argument is the network name or "netname" of the owner of the server
process. Typically, server processes are root processes and their netname can be
derived using the following call:

	char servername[MAXNETNAMELEN];

	host2netname(servername, rhostname, NULL);

Here, rhostname is the hostname of the machine the server process is running
on. host2netname() fills in servername to contain this root process's
netname.

If the server process was run by a regular user, one could use the call
user2netname() instead. Here is an example for a server process with the
same user ID as the client:

	char servername[MAXNETNAMELEN];

	user2netname(servername, getuid(), NULL);

The last argument to both of these calls, user2netname() and
host2netname(), is the name of the naming domain where the server is
located. The NULL used here means "use the local domain name."
The second argument to authdes_create() is a lifetime for the credential.
Here it is set to sixty seconds. What that means is that the credential will expire
60 seconds from now. If a user tries to reuse the credential after it has expired,
the server RPC subsystem will recognize this and not grant any requests. If the
same user tries to reuse the credential within the sixty second lifetime, he will
still be rejected, because the server RPC subsystem remembers which credentials
it has already seen in the near past, and will not grant requests to duplicates.
The third argument to authdes_create() is the address of the host to
synchronize with. In order for DES authentication to work, the server and client
must agree upon the time. Here we pass the address of the server itself, so the
client and server will both be using the same time: the server's time. The
argument can be NULL, which means "don't bother synchronizing." You should
only do this if you are sure the client and server are already synchronized.
The final argument to authdes_create() is the address of a DES encryption
key to use for encrypting timestamps and data. If this argument is NULL, as it is
in this example, a random key will be chosen. The client may find out the
encryption key being used by consulting the ah_key field of the authentication
handle.
Server Side
The server side is a lot simpler than the client side. Here is the previous example
rewritten to use AUTH_DES instead of AUTH_UNIX:
#include <sys/time.h>
#include <rpc/auth_des.h>

nuser(rqstp, transp)
	struct svc_req *rqstp;
	SVCXPRT *transp;
{
	struct authdes_cred *des_cred;
	int uid;
	int gid;
	int gidlen;
	int gidlist[10];

	/*
	 * we don't care about authentication for null proc
	 */
	if (rqstp->rq_proc == NULLPROC) {
		/* same as before */
	}
	/*
	 * now get the uid
	 */
	switch (rqstp->rq_cred.oa_flavor) {
	case AUTH_DES:
		des_cred =
		    (struct authdes_cred *) rqstp->rq_clntcred;
		if (!netname2user(des_cred->adc_fullname.name,
		    &uid, &gid, &gidlen, gidlist)) {
			fprintf(stderr, "unknown user: %s\n",
			    des_cred->adc_fullname.name);
			svcerr_systemerr(transp);
			return;
		}
		break;
	case AUTH_NULL:
	default:
		svcerr_weakauth(transp);
		return;
	}
	/*
	 * The rest is the same as before
	 */
}
Note the use of the routine netname2user(), the inverse of
user2netname(): it takes a network ID and converts it to a UNIX ID.
netname2user() also supplies the group IDs, which we don't use in this
example, but which may be useful to other UNIX programs.
Using Inetd
An RPC server can be started from inetd. The only difference from the usual
code is that the service creation routine should be called in the following form:

	transp = svcudp_create(0);	/* For UDP */
	transp = svctcp_create(0, 0, 0);	/* For listener TCP sockets */
	transp = svcfd_create(0, 0, 0);	/* For connected TCP sockets */

since inetd passes a socket as file descriptor 0. Also, svc_register()
should be called as

	svc_register(transp, PROGNUM, VERSNUM, service, 0);

with the final flag as 0, since the program would already be registered with the
portmapper by inetd. Remember that if you want to exit from the server
process and return control to inetd, you need to explicitly exit, since
svc_run() never returns.
The format of entries in /etc/inetd.conf for RPC services is in one of the
following two forms:

	p_name/version dgram rpc/udp wait/nowait user server args
	p_name/version stream rpc/tcp wait/nowait user server args

where p_name is the symbolic name of the program as it appears in rpc(5),
server is the program implementing the server, and program and version are the
program and version numbers of the service. For more information, see
inetd.conf(5).
If the same program handles multiple versions, then the version number can be a
range, as in this example:
rstatd/1-2 dgram rpc/udp wait root /usr/etc/rpc.rstatd
4.4. More Examples
Versions on Server Side
By convention, the first version number of program PROG is PROGVERS_ORIG
and the most recent version is PROGVERS. Suppose there is a new version of the
user program that returns an unsigned short rather than a long. If we
name this version RUSERSVERS_SHORT, then a server that wants to support
both versions would do a double register. Note that there is no need to create
another server handle for the new version.
if (!svc_register(transp, RUSERSPROG, RUSERSVERS_ORIG,
    nuser, IPPROTO_TCP)) {
	fprintf(stderr, "can't register RUSER service\n");
	exit(1);
}
if (!svc_register(transp, RUSERSPROG, RUSERSVERS_SHORT,
    nuser, IPPROTO_TCP)) {
	fprintf(stderr, "can't register new service\n");
	exit(1);
}
Both versions can be handled by the same C procedure:
nuser(rqstp, transp)
	struct svc_req *rqstp;
	SVCXPRT *transp;
{
	unsigned long nusers;
	unsigned short nusers2;

	switch (rqstp->rq_proc) {
	case NULLPROC:
		if (!svc_sendreply(transp, xdr_void, 0)) {
			fprintf(stderr, "can't reply to RPC call\n");
		}
		return;
	case RUSERSPROC_NUM:
		/*
		 * Code here to compute the number of users
		 * and assign it to the variable nusers
		 */
		nusers2 = nusers;
		switch (rqstp->rq_vers) {
		case RUSERSVERS_ORIG:
			if (!svc_sendreply(transp, xdr_u_long,
			    &nusers)) {
				fprintf(stderr, "can't reply to RPC call\n");
			}
			break;
		case RUSERSVERS_SHORT:
			if (!svc_sendreply(transp, xdr_u_short,
			    &nusers2)) {
				fprintf(stderr, "can't reply to RPC call\n");
			}
			break;
		}
		return;
	default:
		svcerr_noproc(transp);
		return;
	}
}
Versions on Client Side
Since different machines may run different versions of the RPC servers, the
client should be prepared to deal with the world. It is possible to have one server
running with the old version of RUSERSPROG (RUSERSVERS_ORIG) while
another server is running with the newer version (RUSERSVERS_SHORT).

If the version of the server running does not match the version number in
the client create routines, then clnt_call() fails with an
RPC_PROGVERSMISMATCH error. You can find out the version numbers
supported by the server and then create a client handle with an appropriate
version number. Either the routine below can be used, or
clnt_create_vers(). See the rpc(3N) manual page for more details.
main()
{
	enum clnt_stat status;
	register CLIENT *clnt;
	struct timeval to;
	u_short num_s;
	u_long num_l;
	struct rpc_err rpcerr;
	int maxvers, minvers;
	char *host;		/* assumed set to the server's hostname */

	clnt = clnt_create(host, RUSERSPROG,
	    RUSERSVERS_SHORT, "udp");
	if (clnt == NULL) {
		clnt_pcreateerror("clnt");
		exit(-1);
	}
	to.tv_sec = 10;		/* set the timeouts */
	to.tv_usec = 0;
	status = clnt_call(clnt, RUSERSPROC_NUM,
	    xdr_void, NULL, xdr_u_short, &num_s, to);
	if (status == RPC_SUCCESS) {
		/* We found the latest version number */
		clnt_destroy(clnt);
		printf("num = %d\n", num_s);
		exit(0);
	}
	if (status != RPC_PROGVERSMISMATCH) {
		/* Some other error */
		clnt_perror(clnt, "rusers");
		exit(-1);
	}
	clnt_geterr(clnt, &rpcerr);
	maxvers = rpcerr.re_vers.high;	/* highest version supported */
	minvers = rpcerr.re_vers.low;	/* lowest version supported */
	if (RUSERSVERS_ORIG < minvers ||
	    RUSERSVERS_ORIG > maxvers) {
		/* doesn't meet minimum standards */
		clnt_perror(clnt, "version mismatch");
		exit(-1);
	}
	/* This version not supported */
	clnt_destroy(clnt);	/* destroy the earlier handle */
	clnt = clnt_create(host, RUSERSPROG,
	    RUSERSVERS_ORIG, "udp");	/* try different version */
	if (clnt == NULL) {
		clnt_pcreateerror("clnt");
		exit(-1);
	}
	status = clnt_call(clnt, RUSERSPROC_NUM,
	    xdr_void, NULL, xdr_u_long, &num_l, to);
	if (status == RPC_SUCCESS) {
		/* We found the latest version number */
		printf("num = %d\n", num_l);
	} else {
		clnt_perror(clnt, "rusers");
		exit(-1);
	}
}
TCP
Here is an example that is essentially rcp. The initiator of the RPC snd call
takes its standard input and sends it to the server rcv, which prints it on standard
output. The RPC call uses TCP. This also illustrates an XDR procedure that
behaves differently on serialization than on deserialization.
/*
 * The xdr routine:
 *	on decode, read from wire, write onto fp
 *	on encode, read from fp, write onto wire
 */
#include <stdio.h>
#include <rpc/rpc.h>

xdr_rcp(xdrs, fp)
	XDR *xdrs;
	FILE *fp;
{
	unsigned long size;
	char buf[BUFSIZ], *p;

	if (xdrs->x_op == XDR_FREE)	/* nothing to free */
		return 1;
	while (1) {
		if (xdrs->x_op == XDR_ENCODE) {
			if ((size = fread(buf, sizeof(char), BUFSIZ,
			    fp)) == 0 && ferror(fp)) {
				fprintf(stderr, "can't fread\n");
				return (1);
			}
		}
		p = buf;
		if (!xdr_bytes(xdrs, &p, &size, BUFSIZ))
			return (0);
		if (size == 0)
			return (1);
		if (xdrs->x_op == XDR_DECODE) {
			if (fwrite(buf, sizeof(char), size,
			    fp) != size) {
				fprintf(stderr, "can't fwrite\n");
				return (1);
			}
		}
	}
}
/*
 * The sender routines
 */
#include <stdio.h>
#include <netdb.h>
#include <rpc/rpc.h>
#include <sys/socket.h>
#include "rcp.h"	/* for prog, vers definitions */

main(argc, argv)
	int argc;
	char **argv;
{
	int xdr_rcp();
	int err;

	if (argc < 2) {
		fprintf(stderr, "usage: %s servername\n", argv[0]);
		exit(-1);
	}
	if ((err = callrpctcp(argv[1], RCPPROG, RCPPROC,
	    RCPVERS, xdr_rcp, stdin, xdr_void, 0)) > 0) {
		clnt_perrno(err);
		fprintf(stderr, "can't make RPC call\n");
		exit(1);
	}
	exit(0);
}
callrpctcp(host, prognum, procnum, versnum,
    inproc, in, outproc, out)
	char *host, *in, *out;
	xdrproc_t inproc, outproc;
{
	struct sockaddr_in server_addr;
	int socket = RPC_ANYSOCK;
	enum clnt_stat clnt_stat;
	struct hostent *hp;
	register CLIENT *client;
	struct timeval total_timeout;

	if ((hp = gethostbyname(host)) == NULL) {
		fprintf(stderr, "can't get addr for '%s'\n", host);
		return (-1);
	}
	bcopy(hp->h_addr, (caddr_t)&server_addr.sin_addr,
	    hp->h_length);
	server_addr.sin_family = AF_INET;
	server_addr.sin_port = 0;
	if ((client = clnttcp_create(&server_addr, prognum,
	    versnum, &socket, BUFSIZ, BUFSIZ)) == NULL) {
		clnt_pcreateerror("rpctcp_create");
		return (-1);
	}
	total_timeout.tv_sec = 20;
	total_timeout.tv_usec = 0;
	clnt_stat = clnt_call(client, procnum,
	    inproc, in, outproc, out, total_timeout);
	clnt_destroy(client);
	return ((int)clnt_stat);
}
/*
 * The receiving routines
 */
#include <stdio.h>
#include <rpc/rpc.h>
#include "rcp.h"	/* for prog, vers definitions */

main()
{
	register SVCXPRT *transp;
	int rcp_service(), xdr_rcp();

	if ((transp = svctcp_create(RPC_ANYSOCK,
	    BUFSIZ, BUFSIZ)) == NULL) {
		fprintf(stderr, "svctcp_create: error\n");
		exit(1);
	}
	pmap_unset(RCPPROG, RCPVERS);
	if (!svc_register(transp, RCPPROG,
	    RCPVERS, rcp_service, IPPROTO_TCP)) {
		fprintf(stderr, "svc_register: error\n");
		exit(1);
	}
	svc_run();	/* never returns */
	fprintf(stderr, "svc_run should never return\n");
}

rcp_service(rqstp, transp)
	register struct svc_req *rqstp;
	register SVCXPRT *transp;
{
	switch (rqstp->rq_proc) {
	case NULLPROC:
		if (svc_sendreply(transp, xdr_void, 0) == 0)
			fprintf(stderr, "err: rcp_service");
		return;
	case RCPPROC_FP:
		if (!svc_getargs(transp, xdr_rcp, stdout)) {
			svcerr_decode(transp);
			return;
		}
		if (!svc_sendreply(transp, xdr_void, 0))
			fprintf(stderr, "can't reply\n");
		return;
	default:
		svcerr_noproc(transp);
		return;
	}
}
Callback Procedures

Occasionally, it is useful to have a server become a client, and make an RPC call
back to the process which is its client. An example is remote debugging, where
the client is a window system program, and the server is a debugger running on
the remote machine. Most of the time, the user clicks a mouse button at the
debugging window, which converts this to a debugger command, and then makes
an RPC call to the server (where the debugger is actually running), telling it to
execute that command. However, when the debugger hits a breakpoint, the roles
are reversed, and the debugger wants to make an RPC call to the window
program, so that it can inform the user that a breakpoint has been reached.

Another case where callback can be useful is when the client cannot block
waiting to hear back from the server (possibly because of the huge amount of
processing involved in serving the request). In such cases, the server would first
acknowledge the request and then use a callback to reply.

In order to do an RPC callback, you need a program number to make the RPC
call on. Since this will be a dynamically generated program number, it should be
in the transient range, 0x40000000 - 0x5fffffff. The routine
gettransient() returns a valid program number in the transient range, and
registers it with the portmapper. It only talks to the portmapper running on the
same machine as the gettransient() routine itself. The call to pmap_set()
is a test-and-set operation, in that it indivisibly tests whether a program number
has already been registered, and if it has not, then reserves it.
#include <stdio.h>
#include <rpc/rpc.h>

gettransient(proto, vers, portnum)
	int proto;
	u_long vers;
	u_short portnum;
{
	static u_long prognum = 0x40000000;

	while (!pmap_set(prognum++, vers, proto, portnum))
		continue;
	return (prognum - 1);
}
NOTE	The call to ntohs() for portnum is not necessary because it was already
passed in host byte order (as pmap_set() expects). See the
byteorder(3N) man page for more details on the conversion of network
addresses from network to host byte order.
The following pair of programs illustrate how to use the gettransient()
routine. The client makes an RPC call to the server, passing it a transient
program number. Then the client waits around to receive a callback from the
server at that program number. The server registers the program
EXAMPLEPROG, so that it can receive the RPC call informing it of the callback
program number. Then at some random time (on receiving an ALRM signal in
this example), it sends a callback RPC call, using the program number it
received earlier.

In this example, both the client and the server are on the same machine. They
could very well be on different machines, in which case the handling of the
hostname would be different.
/*
 * client
 */
#include <stdio.h>
#include <rpc/rpc.h>
#include "example.h"

int callback();

main()
{
	int tmp_prog;
	char hostname[256];
	SVCXPRT *xprt;
	int stat;

	gethostname(hostname, sizeof(hostname));
	if ((xprt = svcudp_create(RPC_ANYSOCK)) == NULL) {
		fprintf(stderr, "rpc_server: svcudp_create\n");
		exit(1);
	}
	if ((tmp_prog = gettransient(IPPROTO_UDP, 1,
	    xprt->xp_port)) == 0) {
		fprintf(stderr, "failed to get transient number\n");
		exit(1);
	}
	fprintf(stderr, "client gets prognum %d\n", tmp_prog);
	/* protocol is 0 - gettransient does registering */
	(void)svc_register(xprt, tmp_prog, 1, callback, 0);
	stat = callrpc(hostname, EXAMPLEPROG, EXAMPLEVERS,
	    EXAMPLEPROC_CALLBACK, xdr_int, &tmp_prog, xdr_void, 0);
	if (stat != RPC_SUCCESS) {
		clnt_perrno(stat);
		exit(1);
	}
	svc_run();
	fprintf(stderr, "Error: svc_run shouldn't return\n");
}

callback(rqstp, transp)
	register struct svc_req *rqstp;
	register SVCXPRT *transp;
{
	switch (rqstp->rq_proc) {
	case 0:
		if (!svc_sendreply(transp, xdr_void, 0)) {
			fprintf(stderr, "err: exampleprog\n");
			return (1);
		}
		return (0);
	case 1:
		fprintf(stderr, "client got callback\n");
		if (!svc_sendreply(transp, xdr_void, 0)) {
			fprintf(stderr, "err: exampleprog\n");
			return (1);
		}
		return (0);
	}
}
/*
 * server
 */
#include <stdio.h>
#include <rpc/rpc.h>
#include <sys/signal.h>
#include "example.h"

char *getnewprog();
char hostname[256];
int docallback();
int pnum = -1;		/* program number for callback routine */

main()
{
	gethostname(hostname, sizeof(hostname));
	registerrpc(EXAMPLEPROG, EXAMPLEVERS,
	    EXAMPLEPROC_CALLBACK, getnewprog, xdr_int, xdr_void);
	signal(SIGALRM, docallback);
	alarm(10);
	svc_run();
	fprintf(stderr, "Error: svc_run shouldn't return\n");
}

char *
getnewprog(pnump)
	int *pnump;
{
	pnum = *(int *)pnump;
	return NULL;
}

docallback()
{
	int ans;

	if (pnum == -1) {
		/* program number not yet received */
		signal(SIGALRM, docallback);
		return;
	}
	ans = callrpc(hostname, pnum, 1, 1, xdr_void, 0,
	    xdr_void, 0);
	if (ans != RPC_SUCCESS)
		fprintf(stderr, "server: %s\n", clnt_sperrno(ans));
}
4.5. Futures

Sun currently supports RPC on top of both UDP (datagram) and TCP (circuit
oriented) transports. The RPC library uses the sockets API for communicating
with the underlying transport layers.

It is likely that in future releases, the RPC library will use the Transport Layer
Interface (TLI) API for communicating with the underlying protocol layers.
Usage of TLI will help in making RPC transport independent, and thus users will
be able to use any TLI-conforming transport for communication.

Almost all of the current RPC API will be supported. Exceptions would include
passing of an open socket to the client and server create routines.

One of the ways to have a very smooth transition to transport-independent RPC
is to use rpcgen to generate the client and the server skeletons, in addition to
not using any transport-specific feature of UDP and TCP. Code written this way
will not be bound to run only on UDP and TCP, but will be able to run on all
transports of datagram and circuit-oriented type. The actual RPC protocol will,
however, remain the same.
5  External Data Representation: Sun Technical Notes
This chapter contains technical notes on Sun's implementation of the External
Data Representation (XDR) standard, a set of library routines that allow a C programmer to describe arbitrary data structures in a machine-independent fashion.
For a formal specification of the XDR standard, see the External Data Representation Standard: Protocol Specification. XDR is the backbone of Sun's Remote
Procedure Call package, in the sense that data for remote procedure calls is
transmitted using the standard. XDR library routines should be used to transmit
data that is accessed (read or written) by more than one type of machine. 7
This chapter contains a short tutorial overview of the XDR library routines, a
guide to accessing currently available XDR streams, and information on defining
new streams and data types. XDR was designed to work across different
languages, operating systems, and machine architectures. Most users (particularly RPC users) will only need the information in the Number Filters, Floating
Point Filters, and Enumeration Filters sections. Programmers wishing to implement RPC and XDR on new machines will be interested in the rest of the
chapter, as well as the External Data Representation Standard: Protocol
Specification, which will be their primary reference.
NOTE
rpcgen can be used to write XDR routines even in cases where no RPC calls
are being made.
On Sun systems, C programs that want to use XDR routines must include the file
<rpc/rpc.h>, which contains all the necessary interfaces to the XDR system.
Since the C library libc.a contains all the XDR routines, compile as normal:

	example% cc program.c
7 For a complete specification of the system External Data Representation routines, see the xdr(3N)
manual page.
Justification
Consider the following two programs, writer:

	#include <stdio.h>

	main()			/* writer.c */
	{
		long i;

		for (i = 0; i < 8; i++) {
			if (fwrite((char *)&i, sizeof(i), 1, stdout) != 1) {
				fprintf(stderr, "failed!\n");
				exit(1);
			}
		}
		exit(0);
	}
and reader:

	#include <stdio.h>

	main()			/* reader.c */
	{
		long i, j;

		for (j = 0; j < 8; j++) {
			if (fread((char *)&i, sizeof(i), 1, stdin) != 1) {
				fprintf(stderr, "failed!\n");
				exit(1);
			}
			printf("%ld ", i);
		}
		printf("\n");
		exit(0);
	}
The two programs appear to be portable, because (a) they pass lint checking,
and (b) they exhibit the same behavior when executed on two different hardware
architectures, a Sun and a VAX.

Piping the output of the writer program to the reader program gives
identical results on a Sun or a VAX:

	sun% writer | reader
	0 1 2 3 4 5 6 7
	sun%

	vax% writer | reader
	0 1 2 3 4 5 6 7
	vax%
With the advent of local area networks and 4.2BSD came the concept of
"network pipes" - a process produces data on one machine, and a second
process consumes data on another machine. A network pipe can be constructed
with writer and reader. Here are the results if the first produces data on a
Sun, and the second consumes data on a VAX:

	sun% writer | rsh vax reader
	0 16777216 33554432 50331648 67108864 83886080 100663296
	117440512
	sun%

Identical results can be obtained by executing writer on the VAX and
reader on the Sun. These results occur because the byte ordering of long
integers differs between the VAX and the Sun, even though the word size is the
same. Note that 16777216 is 2^24 - when four bytes are reversed, the 1 winds
up in the 24th bit.
Whenever data is shared by two or more machine types, there is a need for
portable data. Programs can be made data-portable by replacing the read()
and write() calls with calls to an XDR library routine xdr_long(), a filter
that knows the standard representation of a long integer in its external form.
Here are the revised versions of writer:
	#include <stdio.h>
	#include <rpc/rpc.h>	/* xdr is a sub-library of rpc */

	main()			/* writer.c */
	{
		XDR xdrs;
		long i;

		xdrstdio_create(&xdrs, stdout, XDR_ENCODE);
		for (i = 0; i < 8; i++) {
			if (!xdr_long(&xdrs, &i)) {
				fprintf(stderr, "failed!\n");
				exit(1);
			}
		}
		exit(0);
	}
and reader:

	#include <stdio.h>
	#include <rpc/rpc.h>	/* xdr is a sub-library of rpc */

	main()			/* reader.c */
	{
		XDR xdrs;
		long i, j;

		xdrstdio_create(&xdrs, stdin, XDR_DECODE);
		for (j = 0; j < 8; j++) {
			if (!xdr_long(&xdrs, &i)) {
				fprintf(stderr, "failed!\n");
				exit(1);
			}
			printf("%ld ", i);
		}
		printf("\n");
		exit(0);
	}
The new programs were executed on a Sun, on a VAX, and from a Sun to a
VAX; the results are shown below.

	sun% writer | reader
	0 1 2 3 4 5 6 7
	sun%

	vax% writer | reader
	0 1 2 3 4 5 6 7
	vax%

	sun% writer | rsh vax reader
	0 1 2 3 4 5 6 7
	sun%
A Canonical Standard
Integers are just the tip of the portable-data iceberg. Arbitrary data structures
present portability problems, particularly with respect to alignment and pointers.
Alignment on word boundaries may cause the size of a structure to vary from
machine to machine. And pointers, which are very convenient to use, have no
meaning outside the machine where they are defined.
XDR's approach to standardizing data representations is canonical. That is,
XDR defines a single byte order (Big Endian), a single floating-point representation (IEEE), and so on. Any program running on any machine can use XDR to
create portable data by translating its local representation to the XDR standard
representations; similarly, any program running on any machine can read portable data by translating the XDR standard representations to its local equivalents.
The single standard completely decouples programs that create or send portable
data from those that use or receive portable data. The advent of a new machine
or a new language has no effect upon the community of existing portable data
creators and users. A new machine joins this community by being "taught" how
to convert the standard representations and its local representations; the local
representations of other machines are irrelevant. Conversely, to existing programs running on other machines, the local representations of the new machine
are also irrelevant; such programs can immediately read portable data produced
by the new machine because such data conforms to the canonical standards that
they already understand.
There are strong precedents for XDR's canonical approach. For example,
TCP/IP, UDP/IP, XNS, Ethernet, and, indeed, all protocols below layer five of
the ISO model, are canonical protocols. The advantage of any canonical
approach is simplicity; in the case of XDR, a single set of conversion routines is
written once and is never touched again. The canonical approach has a disadvantage, but it is unimportant in real-world data transfer applications. Suppose two
Little-Endian machines are transferring integers according to the XDR standard.
The sending machine converts the integers from Little-Endian byte order to XDR
(Big-Endian) byte order; the receiving machine performs the reverse conversion.
Because both machines observe the same byte order, their conversions are
unnecessary. The point, however, is not necessity, but cost as compared to the
alternative.
The time spent converting to and from a canonical representation is insignificant,
especially in networking applications. Most of the time required to prepare a
data structure for transfer is not spent in conversion but in traversing the elements
of the data structure. To transmit a tree, for example, each leaf must be visited
and each element in a leaf record must be copied to a buffer and aligned there;
storage for the leaf may have to be deallocated as well. Similarly, to receive a
tree, storage must be allocated for each leaf, data must be moved from the buffer
to the leaf and properly aligned, and pointers must be constructed to link the
leaves together. Every machine pays the cost of traversing and copying data
structures whether or not conversion is required. In networking applications,
communications overhead (the time required to move the data down through the
sender's protocol layers, across the network, and up through the receiver's protocol layers) dwarfs conversion overhead.
The XDR Library
The XDR library not only solves data portability problems, it also allows you to
write and read arbitrary C constructs in a consistent, specified, well-documented
manner. Thus, it can make sense to use the library even when the data is not
shared among machines on a network.
The XDR library has filter routines for strings (null-terminated arrays of bytes),
structures, unions, and arrays, to name a few. Using more primitive routines, you
can write your own specific XDR routines to describe arbitrary data structures,
including elements of arrays, arms of unions, or objects pointed at from other
structures. The structures themselves may contain arrays of arbitrary elements,
or pointers to other structures.
Let's examine the two programs more closely. There is a family of XDR stream
creation routines in which each member treats the stream of bits differently. In
our example, data is manipulated using standard I/O routines, so we use
xdrstdio_create (). The parameters to XDR stream creation routines vary
according to their function. In our example, xdrstdio_create () takes a
pointer to an XDR structure that it initializes, a pointer to a FILE that the input
or output is performed on, and the operation. The operation may be
XDR_ENCODE for serializing in the writer program, or XDR_DECODE for
deserializing in the reader program.
Note: RPC users never need to create XDR streams; the RPC system itself
creates these streams, which are then passed to the users.
The xdr_long () primitive is characteristic of most XDR library primitives
and all client XDR routines. First, the routine returns FALSE (0) if it fails, and
TRUE (1) if it succeeds. Second, for each data type, xxx, there is an associated
XDR routine of the form:
xdr_xxx(xdrs, xp)
XDR *xdrs;
xxx *xp;
In our case, xxx is long, and the corresponding XDR routine is a primitive,
xdr_long(). The client could also define an arbitrary structure xxx, in which
case the client would also supply the routine xdr_xxx(), describing each field
by calling XDR routines of the appropriate type. In all cases the first parameter,
xdrs, can be treated as an opaque handle and passed to the primitive routines.
XDR routines are direction independent; that is, the same routines are called to
serialize or deserialize data. This feature is critical to software engineering of
portable data. The idea is to call the same routine for either operation - this
almost guarantees that serialized data can also be deserialized. One routine is
used by both producer and consumer of networked data. This is implemented by
always passing the address of an object rather than the object itself - only in the
case of deserialization is the object modified. This feature is not shown in our
trivial example, but its value becomes obvious when nontrivial data structures are
passed among machines. If needed, the user can obtain the direction of the XDR
operation. See the XDR Operation Directions section of this chapter for details.
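The direction-independence idea can be sketched in miniature. The following is a hypothetical stand-in for the XDR stream (the struct stream type, filter_long(), and filter_gnumbers() are invented for illustration and are not the library's types): one filter routine either serializes or deserializes depending on an operation field, and only decoding modifies the object whose address is passed.

```c
#include <string.h>

enum op { ENCODE, DECODE };

/* A toy stream: an operation, a buffer, and a position. */
struct stream {
    enum op op;
    unsigned char buf[64];
    int pos;
};

/* Filter a (nonnegative, 32-bit) value through the stream in
 * big-endian order.  Returns 1 (TRUE) on success, 0 (FALSE) on
 * failure, matching the XDR convention. */
int filter_long(struct stream *s, long *lp)
{
    unsigned char *p = s->buf + s->pos;

    if (s->pos + 4 > (int)sizeof(s->buf))
        return 0;                           /* no room */
    if (s->op == ENCODE) {
        p[0] = (*lp >> 24) & 0xff;
        p[1] = (*lp >> 16) & 0xff;
        p[2] = (*lp >>  8) & 0xff;
        p[3] = *lp & 0xff;
    } else {
        /* only DECODE modifies the caller's object */
        *lp = ((long)p[0] << 24) | ((long)p[1] << 16)
            | ((long)p[2] << 8) | (long)p[3];
    }
    s->pos += 4;
    return 1;
}

/* One routine describes the data for either direction. */
int filter_gnumbers(struct stream *s, long *assets, long *liabilities)
{
    return filter_long(s, assets) && filter_long(s, liabilities);
}
```

To round-trip a value, encode with op set to ENCODE, reset pos, switch op to DECODE, and call the same filter on fresh variables.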
Let's look at a slightly more complicated example. Assume that a person's gross
assets and liabilities are to be exchanged among processes. Also assume that
these values are important enough to warrant their own data type:
struct gnumbers {
long g_assets;
long g_liabilities;
} ;
The corresponding XDR routine describing this structure would be:
bool_t                      /* TRUE is success, FALSE is failure */
xdr_gnumbers(xdrs, gp)
    XDR *xdrs;
    struct gnumbers *gp;
{
    if (xdr_long(xdrs, &gp->g_assets) &&
        xdr_long(xdrs, &gp->g_liabilities))
        return (TRUE);
    return (FALSE);
}
Note that the parameter xdrs is never inspected or modified; it is only passed on
to the subcomponent routines. It is imperative to inspect the return value of each
XDR routine call, and to give up immediately and return FALSE if the subroutine fails.
This example also shows that the type bool_t is declared as an integer whose
only values are TRUE (1) and FALSE (0). This document uses the following
definitions:

#define bool_t  int
#define TRUE    1
#define FALSE   0
Keeping these conventions in mind, xdr_gnumbers() can be rewritten as follows:

xdr_gnumbers(xdrs, gp)
    XDR *xdrs;
    struct gnumbers *gp;
{
    return (xdr_long(xdrs, &gp->g_assets) &&
        xdr_long(xdrs, &gp->g_liabilities));
}
This document uses both coding styles.
5.1. XDR Library Primitives
This section gives a synopsis of each XDR primitive. It starts with basic data
types and moves on to constructed data types. Finally, XDR utilities are discussed. The interface to these primitives and utilities is defined in the include file
<rpc/xdr.h>, automatically included by <rpc/rpc.h>.
Number Filters
The XDR library provides primitives to translate between numbers and their
corresponding external representations. Primitives cover the set of numbers in:
[signed, unsigned] * [short, int, long]
Specifically, the eight primitives are:
bool_t xdr_char(xdrs, cp)
    XDR *xdrs;
    char *cp;

bool_t xdr_u_char(xdrs, ucp)
    XDR *xdrs;
    unsigned char *ucp;

bool_t xdr_int(xdrs, ip)
    XDR *xdrs;
    int *ip;

bool_t xdr_u_int(xdrs, up)
    XDR *xdrs;
    unsigned *up;

bool_t xdr_long(xdrs, lip)
    XDR *xdrs;
    long *lip;

bool_t xdr_u_long(xdrs, lup)
    XDR *xdrs;
    u_long *lup;

bool_t xdr_short(xdrs, sip)
    XDR *xdrs;
    short *sip;

bool_t xdr_u_short(xdrs, sup)
    XDR *xdrs;
    u_short *sup;
The first parameter, xdrs, is an XDR stream handle. The second parameter is
the address of the number that provides data to the stream or receives data from
it. All routines return TRUE if they complete successfully, and FALSE otherwise.
Floating Point Filters
The XDR library also provides primitive routines for C's floating point types:
bool_t xdr_float(xdrs, fp)
    XDR *xdrs;
    float *fp;

bool_t xdr_double(xdrs, dp)
    XDR *xdrs;
    double *dp;
The first parameter, xdrs, is an XDR stream handle. The second parameter is
the address of the floating point number that provides data to the stream or
receives data from it. Both routines return TRUE if they complete successfully,
and FALSE otherwise.
Note: Since the numbers are represented in IEEE floating point, routines may fail
when decoding a valid IEEE representation into a machine-specific representation, or vice-versa.
Enumeration Filters
The XDR library provides a primitive for generic enumerations. The primitive
assumes that a C enum has the same representation inside the machine as a C
integer. The boolean type is an important instance of the enum. The external
representation of a boolean is always TRUE (1) or FALSE (0).

#define bool_t  int
#define FALSE   0
#define TRUE    1
#define enum_t  int

bool_t xdr_enum(xdrs, ep)
    XDR *xdrs;
    enum_t *ep;

bool_t xdr_bool(xdrs, bp)
    XDR *xdrs;
    bool_t *bp;
The second parameters ep and bp are addresses of the associated type that provides data to, or receives data from, the stream xdrs.
No Data
Occasionally, an XDR routine must be supplied to the RPC system, even when
no data is passed or required. The library provides such a routine:

bool_t xdr_void();    /* always returns TRUE */

Constructed Data Type Filters
Constructed or compound data type primitives require more parameters and perform more complicated functions than the primitives discussed above. This section includes primitives for strings, arrays, unions, and pointers to structures.
Constructed data type primitives may use memory management. In many cases,
memory is allocated when deserializing data with XDR_DECODE. Therefore, the
XDR package must provide means to deallocate memory. This is done by an
XDR operation, XDR_FREE. To review, the three XDR directional operations
are XDR_ENCODE, XDR_DECODE, and XDR_FREE.
Strings
In C, a string is defined as a sequence of bytes terminated by a null byte, which is
not considered when calculating string length. However, when a string is passed
or manipulated, a pointer to it is employed. Therefore, the XDR library defines a
string to be a char *, and not a sequence of characters. The external representation of a string is drastically different from its internal representation. Externally, strings are represented as sequences of ASCII characters, while internally,
they are represented with character pointers. Conversion between the two
representations is accomplished with the routine xdr_string () :
bool_t xdr_string(xdrs, sp, maxlength)
XDR *xdrs;
char **sp;
u_int maxlength;
The first parameter xdr s is the XDR stream handle. The second parameter sp
is a pointer to a string (type char **). The third parameter maxlength
specifies the maximum number of bytes allowed during encoding or decoding.
Its value is usually specified by a protocol. For example, a protocol specification
may say that a file name may be no longer than 255 characters.
Note: keep maxlength small. If it is too big you can blow the heap, since
xdr_string() will call malloc() for space.
The routine returns FALSE if the number of characters exceeds maxlength,
and TRUE if it doesn't.
The behavior of xdr_string() is similar to the behavior of other routines discussed in this section. The direction XDR_ENCODE is easiest to understand. The
parameter sp points to a string of a certain length; if the string does not exceed
maxlength, the bytes are serialized.
The effect of deserializing a string is subtle. First the length of the incoming
string is determined; it must not exceed maxlength. Next sp is dereferenced;
if the value is NULL, then a string of the appropriate length is allocated and *sp
is set to this string. If the original value of *sp is non-null, then the XDR package assumes that a target area has been allocated, which can hold strings no
longer than maxlength. In either case, the string is decoded into the target
area. The routine then appends a null character to the string.
In the XDR_FREE operation, the string is obtained by dereferencing sp. If the
string is not NULL, it is freed and *sp is set to NULL. In this operation,
xdr_string () ignores the maxlength parameter.
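The allocation rules above can be sketched in isolation. In the following hypothetical helpers (decode_string() and free_string() are illustrations, not the library routines), a NULL target triggers allocation, decoding appends the terminating null, and the free operation clears the pointer:

```c
#include <stdlib.h>
#include <string.h>

/* Decode len bytes into *sp, allocating the target if *sp is NULL.
 * Returns 1 (TRUE) on success, 0 (FALSE) if len exceeds maxlength
 * or allocation fails -- mirroring the rules described above. */
int decode_string(char **sp, const char *wire, unsigned len,
    unsigned maxlength)
{
    if (len > maxlength)
        return 0;
    if (*sp == NULL) {
        *sp = malloc(len + 1);      /* caller now owns the target */
        if (*sp == NULL)
            return 0;
    }
    memcpy(*sp, wire, len);
    (*sp)[len] = '\0';              /* append the terminating null */
    return 1;
}

/* The XDR_FREE rule: free a non-NULL string and clear the pointer. */
void free_string(char **sp)
{
    if (*sp != NULL) {
        free(*sp);
        *sp = NULL;
    }
}
```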
Byte Arrays
Often variable-length arrays of bytes are preferable to strings. Byte arrays differ
from strings in the following three ways: 1) the length of the array (the byte
count) is explicitly located in an unsigned integer, 2) the byte sequence is not terminated by a null character, and 3) the external representation of the bytes is the
same as their internal representation. The primitive xdr_bytes () converts
between the internal and external representations of byte arrays:
bool_t xdr_bytes(xdrs, bpp, lp, maxlength)
    XDR *xdrs;
    char **bpp;
    u_int *lp;
    u_int maxlength;
The usage of the first, second and fourth parameters is identical to that of the first,
second and third parameters of xdr_string(), respectively. The length of
the byte area is obtained by dereferencing lp when serializing; *lp is set to the
byte length when deserializing.
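As a sketch of the external form (following the XDR standard's layout of a four-byte big-endian count, the bytes themselves, and zero padding to the next four-byte boundary; encode_bytes() is an invented helper, not a library routine):

```c
#include <string.h>

/* Lay a counted byte array out in its external form.  Returns the
 * number of wire bytes written, or 0 if out has insufficient room. */
unsigned encode_bytes(unsigned char *out, unsigned outlen,
    const unsigned char *bp, unsigned len)
{
    unsigned padded = (len + 3) & ~3u;  /* round up to a multiple of 4 */
    unsigned total = 4 + padded;

    if (outlen < total)
        return 0;
    out[0] = (len >> 24) & 0xff;        /* big-endian byte count */
    out[1] = (len >> 16) & 0xff;
    out[2] = (len >>  8) & 0xff;
    out[3] = len & 0xff;
    memcpy(out + 4, bp, len);           /* bytes travel verbatim */
    memset(out + 4 + len, 0, padded - len);     /* zero padding */
    return total;
}
```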
Arrays
The XDR library package provides a primitive for handling arrays of arbitrary
elements. The xdr_bytes () routine treats a subset of generic arrays, in which
the size of array elements is known to be 1, and the external description of each
element is built-in. The generic array primitive, xdr_array(), requires parameters identical to those of xdr_bytes() plus two more: the size of array elements, and an XDR routine to handle each of the elements. This routine is called
to encode or decode each element of the array.
bool_t
xdr_array(xdrs, ap, lp, maxlength, elementsiz, xdr_element)
    XDR *xdrs;
    char **ap;
    u_int *lp;
    u_int maxlength;
    u_int elementsiz;
    bool_t (*xdr_element)();
The parameter ap is the address of the pointer to the array. If *ap is NULL
when the array is being deserialized, XDR allocates an array of the appropriate
size and sets *ap to that array. The element count of the array is obtained from
*lp when the array is serialized; *lp is set to the array length when the array is
deserialized. The parameter maxlength is the maximum number of elements
that the array is allowed to have; elementsiz is the byte size of each element
of the array (the C function sizeof() can be used to obtain this value). The
xdr_element() routine is called to serialize, deserialize, or free each element
of the array.
Before defining more constructed data types, it is appropriate to present three
examples.
Example A:
A user on a networked machine can be identified by (a) the machine name, such
as krypton: see the gethostname man page; (b) the user's UID: see the
geteuid man page; and (c) the group numbers to which the user belongs: see
the getgroups man page. A structure with this information and its associated
XDR routine could be coded like this:
#define NLEN 255    /* machine names < 256 chars */
#define NGRPS 20    /* user can't be in > 20 groups */

struct netuser {
    char *nu_machinename;
    int nu_uid;
    u_int nu_glen;
    int *nu_gids;
};

bool_t
xdr_netuser(xdrs, nup)
    XDR *xdrs;
    struct netuser *nup;
{
    return (xdr_string(xdrs, &nup->nu_machinename, NLEN) &&
        xdr_int(xdrs, &nup->nu_uid) &&
        xdr_array(xdrs, &nup->nu_gids, &nup->nu_glen,
            NGRPS, sizeof (int), xdr_int));
}
Example B:
A party of network users could be implemented as an array of netuser structures. The declaration and its associated XDR routines are as follows:
struct party {
    u_int p_len;
    struct netuser *p_nusers;
};

#define PLEN 500    /* max number of users in a party */

bool_t
xdr_party(xdrs, pp)
    XDR *xdrs;
    struct party *pp;
{
    return (xdr_array(xdrs, &pp->p_nusers, &pp->p_len, PLEN,
        sizeof (struct netuser), xdr_netuser));
}
Example C:
The well-known parameters to main, argc and argv can be combined into a
structure. An array of these structures can make up a history of commands. The
declarations and XDR routines might look like:
#define ALEN 1000   /* args cannot be > 1000 chars */
#define NARGC 100   /* commands cannot have > 100 args */

struct cmd {
    u_int c_argc;
    char **c_argv;
};

#define NCMDS 75    /* history is no more than 75 commands */

struct history {
    u_int h_len;
    struct cmd *h_cmds;
};

bool_t
xdr_wrapstring(xdrs, sp)
    XDR *xdrs;
    char **sp;
{
    return (xdr_string(xdrs, sp, ALEN));
}

bool_t
xdr_cmd(xdrs, cp)
    XDR *xdrs;
    struct cmd *cp;
{
    return (xdr_array(xdrs, &cp->c_argv, &cp->c_argc, NARGC,
        sizeof (char *), xdr_wrapstring));
}

bool_t
xdr_history(xdrs, hp)
    XDR *xdrs;
    struct history *hp;
{
    return (xdr_array(xdrs, &hp->h_cmds, &hp->h_len, NCMDS,
        sizeof (struct cmd), xdr_cmd));
}
The most confusing part of this example is that the routine
xdr_wrapstring () is needed to package the xdr_string () routine,
because the implementation of xdr_array () only passes two parameters to
the array element description routine; xdr_wrapstring() supplies the third
parameter to xdr_string ().
By now the recursive nature of the XDR library should be obvious. Let's continue with more constructed data types.
Opaque Data
In some protocols, handles are passed from a server to a client. The client passes
the handle back to the server at some later time. Handles are never inspected by
clients; they are obtained and submitted. That is to say, handles are opaque. The
xdr_opaque() primitive is used for describing fixed-sized, opaque bytes.
bool_t xdr_opaque(xdrs, p, len)
XDR *xdrs;
char *p;
u_int len;
The parameter p is the location of the bytes; len is the number of bytes in the
opaque object. By definition, the actual data contained in the opaque object are
not machine portable.
Fixed Sized Arrays
The XDR library provides a primitive, xdr_vector () , for fixed-length arrays.
#define NLEN 255    /* machine names must be < 256 chars */
#define NGRPS 20    /* user belongs to exactly 20 groups */

struct netuser {
    char *nu_machinename;
    int nu_uid;
    int nu_gids[NGRPS];
};

bool_t
xdr_netuser(xdrs, nup)
    XDR *xdrs;
    struct netuser *nup;
{
    if (!xdr_string(xdrs, &nup->nu_machinename, NLEN))
        return (FALSE);
    if (!xdr_int(xdrs, &nup->nu_uid))
        return (FALSE);
    if (!xdr_vector(xdrs, nup->nu_gids, NGRPS, sizeof(int),
        xdr_int)) {
        return (FALSE);
    }
    return (TRUE);
}
Discriminated Unions
The XDR library supports discriminated unions. A discriminated union is a C
union and an enum_t value that selects an "arm" of the union.
struct xdr_discrim {
    enum_t value;
    bool_t (*proc)();
};

bool_t xdr_union(xdrs, dscmp, unp, arms, defaultarm)
    XDR *xdrs;
    enum_t *dscmp;
    char *unp;
    struct xdr_discrim *arms;
    bool_t (*defaultarm)();    /* may equal NULL */
First the routine translates the discriminant of the union located at *dscmp. The
discriminant is always an enum_t. Next the union located at *unp is
translated. The parameter arms is a pointer to an array of xdr_discrim
structures. Each structure contains an ordered pair of [value, proc]. If the
union's discriminant is equal to the associated value, then the proc is called to
translate the union. The end of the xdr_discrim structure array is denoted by
a routine of value NULL (0). If the discriminant is not found in the arms array,
then the defaultarm procedure is called if it is non-null; otherwise the routine
returns FALSE.
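The arm lookup just described can be sketched with simplified types (union_dispatch(), the one-argument proc signature, and the demo arms are invented for illustration; the real xdr_union() also translates the discriminant through the stream):

```c
typedef int enum_t;
typedef int bool_t;

/* Simplified arm descriptor: a discriminant value and a routine. */
struct xdr_discrim {
    enum_t value;
    bool_t (*proc)(char *unp);
};

/* Two sample arms, recording which one was chosen. */
static int last_arm;
static bool_t take_int(char *unp) { (void)unp; last_arm = 1; return 1; }
static bool_t take_str(char *unp) { (void)unp; last_arm = 2; return 1; }

static struct xdr_discrim demo_arms[] = {
    { 1, take_int },
    { 2, take_str },
    { 0, 0 }                    /* terminate arms with a NULL proc */
};

/* Scan the arms for the discriminant; call the matching proc,
 * else the default arm if non-null, else fail (FALSE). */
bool_t
union_dispatch(enum_t dscm, char *unp, struct xdr_discrim *arms,
    bool_t (*defaultarm)(char *unp))
{
    for (; arms->proc != 0; arms++)
        if (arms->value == dscm)
            return ((*arms->proc)(unp));
    if (defaultarm != 0)
        return ((*defaultarm)(unp));
    return (0);
}
```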
Example D: Suppose the type of a union may be integer, character pointer (a
string), or a gnumbers structure. Also, assume the union and its current type
are declared in a structure. The declaration is:

enum utype { INTEGER=1, STRING=2, GNUMBERS=3 };

struct u_tag {
    enum utype utype;    /* the union's discriminant */
    union {
        int ival;
        char *pval;
        struct gnumbers gn;
    } uval;
};
The following constructs and XDR procedure (de)serialize the discriminated
union:
struct xdr_discrim u_tag_arms[4] = {
    { INTEGER, xdr_int },
    { GNUMBERS, xdr_gnumbers },
    { STRING, xdr_wrapstring },
    { __dontcare__, NULL }
    /* always terminate arms with a NULL xdr_proc */
};

bool_t
xdr_u_tag(xdrs, utp)
    XDR *xdrs;
    struct u_tag *utp;
{
    return (xdr_union(xdrs, &utp->utype, &utp->uval,
        u_tag_arms, NULL));
}
The routine xdr_gnumbers() was presented above in The XDR Library section. xdr_wrapstring() was presented in Example C. The default arm
parameter to xdr_union() (the last parameter) is NULL in this example.
Therefore the value of the union's discriminant may legally take on only values
listed in the u_tag_arms array. This example also demonstrates that the elements of the arms array do not need to be sorted.
It is worth pointing out that the values of the discriminant may be sparse, though
in this example they are not. It is always good practice to assign explicit
integer values to each element of the discriminant's type. This practice both
documents the external representation of the discriminant and guarantees that different C compilers emit identical discriminant values.
Exercise: Implement xdr_union () using the other primitives in this section.
Pointers
In C it is often convenient to put pointers to another structure within a structure.
The xdr_reference () primitive makes it easy to serialize, deserialize, and
free these referenced structures.
bool_t xdr_reference(xdrs, pp, ssize, proc)
    XDR *xdrs;
    char **pp;
    u_int ssize;
    bool_t (*proc)();
Parameter pp is the address of the pointer to the structure; parameter ssize is
the size in bytes of the structure (use the C function sizeof() to obtain this
value); and proc is the XDR routine that describes the structure. When decoding data, storage is allocated if *pp is NULL.
There is no need for a primitive xdr_struct () to describe structures within
structures, because pointers are always sufficient.
Exercise: Implement xdr_reference() using xdr_array(). Warning:
xdr_reference() and xdr_array() are NOT interchangeable external
representations of data.
Example E: Suppose there is a structure containing a person's name and a pointer
to a gnumbers structure containing the person's gross assets and liabilities.
The construct is:
struct pgn {
    char *name;
    struct gnumbers *gnp;
};

The corresponding XDR routine for this structure is:

bool_t
xdr_pgn(xdrs, pp)
    XDR *xdrs;
    struct pgn *pp;
{
    if (xdr_string(xdrs, &pp->name, NLEN) &&
        xdr_reference(xdrs, &pp->gnp,
            sizeof(struct gnumbers), xdr_gnumbers))
        return (TRUE);
    return (FALSE);
}
Pointer Semantics and XDR
In many applications, C programmers attach double meaning to the values of a
pointer. Typically the value NULL (or zero) means data is not needed, yet some
application-specific interpretation applies. In essence, the C programmer is
encoding a discriminated union efficiently by overloading the interpretation of
the value of a pointer. For instance, in example E a NULL pointer value for gnp
could indicate that the person's assets and liabilities are unknown. That is, the
pointer value encodes two things: whether or not the data is known; and if it is
known, where it is located in memory. Linked lists are an extreme example of
the use of application-specific pointer interpretation.
The primitive xdr_reference() cannot and does not attach any special
meaning to a null-value pointer during serialization. That is, passing an address
of a pointer whose value is NULL to xdr_reference() when serializing data
will most likely cause a memory fault and, on the UNIX system, a core dump.
xdr_pointer() correctly handles NULL pointers. For more information
about its use, see Linked Lists.
Exercise: After reading the section on Linked Lists, return here and extend Example E so that it can correctly deal with NULL pointer values.
Exercise: Using the xdr_union(), xdr_reference(), and xdr_void()
primitives, implement a generic pointer handling primitive that implicitly deals
with NULL pointers. That is, implement xdr_pointer().
Non-filter Primitives
XDR streams can be manipulated with the primitives discussed in this section.
u_int xdr_getpos(xdrs)
XDR *xdrs;
bool_t xdr_setpos(xdrs, pos)
XDR *xdrs;
u_int pos;
xdr_destroy(xdrs)
XDR *xdrs;
The routine xdr_getpos() returns an unsigned integer that describes the
current position in the data stream. Warning: In some XDR streams, the returned
value of xdr_getpos() is meaningless; the routine returns a -1 in this case
(though -1 should be a legitimate value).
The routine xdr_setpos() sets a stream position to pos. Warning: In some
XDR streams, setting a position is impossible; in such cases, xdr_setpos()
will return FALSE. This routine will also fail if the requested position is out of bounds. The definition of bounds varies from stream to stream.
The xdr_destroy () primitive destroys the XDR stream. Usage of the stream
after calling this routine is undefined.
XDR Operation Directions
At times you may wish to optimize XDR routines by taking advantage of the
direction of the operation: XDR_ENCODE, XDR_DECODE, or XDR_FREE.
The value xdrs->x_op always contains the direction of the XDR operation.
Programmers are not encouraged to take advantage of this information. Therefore, no example is presented here. However, an example in the Linked Lists section, below, demonstrates the usefulness of the xdrs->x_op field.
XDR Stream Access
An XDR stream is obtained by calling the appropriate creation routine. These
creation routines take arguments that are tailored to the specific properties of the
stream.
Streams currently exist for (de)serialization of data to or from standard I/O FILE
streams, TCP/IP connections and UNIX files, and memory.
Standard I/O Streams
XDR streams can be interfaced to standard I/O using the
xdrstdio_create () routine as follows:
#include <stdio.h>
#include <rpc/rpc.h>    /* xdr streams part of rpc */

void
xdrstdio_create(xdrs, fp, x_op)
    XDR *xdrs;
    FILE *fp;
    enum xdr_op x_op;

The routine xdrstdio_create() initializes an XDR stream pointed to by
xdrs. The XDR stream interfaces to the standard I/O library. Parameter fp is
an open file, and x_op is an XDR direction.
Memory Streams
Memory streams allow the streaming of data into or out of a specified area of
memory:
#include <rpc/rpc.h>

void
xdrmem_create(xdrs, addr, len, x_op)
    XDR *xdrs;
    char *addr;
    u_int len;
    enum xdr_op x_op;

The routine xdrmem_create() initializes an XDR stream in local memory.
The memory is pointed to by parameter addr; parameter len is the length in
bytes of the memory. The parameters xdrs and x_op are identical to the
corresponding parameters of xdrstdio_create(). Currently, the UDP/IP
implementation of RPC uses xdrmem_create(). Complete call or result
messages are built in memory before calling the sendto() system routine.
Record (TCP/IP) Streams
A record stream is an XDR stream built on top of a record marking standard that
is built on top of the UNIX file or 4.2 BSD connection interface.
#include <rpc/rpc.h>    /* xdr streams part of rpc */

xdrrec_create(xdrs,
    sendsize, recvsize, iohandle, readproc, writeproc)
    XDR *xdrs;
    u_int sendsize, recvsize;
    char *iohandle;
    int (*readproc)(), (*writeproc)();

The routine xdrrec_create() provides an XDR stream interface that allows
for a bidirectional, arbitrarily long sequence of records. The contents of the
records are meant to be data in XDR form. The stream's primary use is for interfacing RPC to TCP connections. However, it can be used to stream data into or
out of normal UNIX files.
The parameter xdrs is similar to the corresponding parameter described above.
The stream does its own data buffering, similar to that of standard I/O. The
parameters sendsize and recvsize determine the size in bytes of the output
and input buffers, respectively; if their values are zero (0), then predetermined
defaults are used. When a buffer needs to be filled or flushed, the routine
readproc() or writeproc() is called, respectively. The usage and behavior of
these routines are similar to the UNIX system calls read() and write().
However, the first parameter to each of these routines is the opaque parameter
iohandle. The other two parameters (buf and nbytes) and the results (byte
count) are identical to the system routines. If xxx is readproc() or
writeproc(), then it has the following form:
/*
 * returns the actual number of bytes transferred;
 * -1 is an error
 */
int
xxx(iohandle, buf, nbytes)
    char *iohandle;
    char *buf;
    int nbytes;
The XDR stream provides means for delimiting records in the byte stream. The
implementation details of delimiting records in a stream are discussed in the
Advanced Topics section, below. The primitives that are specific to record
streams are as follows:
bool_t
xdrrec_endofrecord(xdrs, flushnow)
    XDR *xdrs;
    bool_t flushnow;

bool_t
xdrrec_skiprecord(xdrs)
    XDR *xdrs;

bool_t
xdrrec_eof(xdrs)
    XDR *xdrs;
The routine xdrrec_endofrecord() causes the current outgoing data to be
marked as a record. If the parameter flushnow is TRUE, then the stream's
writeproc will be called; otherwise, writeproc will be called when the
output buffer has been filled.
The routine xdrrec_skiprecord () causes an input stream's position to be
moved past the current record boundary and onto the beginning of the next
record in the stream.
If there is no more data in the stream's input buffer, then the routine
xdrrec_eof () returns TRUE. That is not to say that there is no more data in
the underlying file descriptor.
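The record delimiting that these primitives rely on is covered in the Advanced Topics section; as a preview, the following sketch encodes and decodes the four-byte record-marking header standardized for RPC over TCP (a big-endian word whose high-order bit flags the last fragment of a record and whose remaining 31 bits carry the fragment length; the helper names are invented for illustration):

```c
#define LAST_FRAG 0x80000000UL

/* Write a fragment header: length plus the last-fragment flag,
 * in big-endian byte order. */
void put_header(unsigned char *p, unsigned long len, int last)
{
    unsigned long word = len | (last ? LAST_FRAG : 0);

    p[0] = (word >> 24) & 0xff;
    p[1] = (word >> 16) & 0xff;
    p[2] = (word >>  8) & 0xff;
    p[3] = word & 0xff;
}

/* Recover the fragment length, masking off the flag bit. */
unsigned long get_len(const unsigned char *p)
{
    unsigned long word = ((unsigned long)p[0] << 24)
        | ((unsigned long)p[1] << 16)
        | ((unsigned long)p[2] << 8) | (unsigned long)p[3];

    return word & ~LAST_FRAG;
}

/* Was this the last fragment of the record? */
int is_last(const unsigned char *p)
{
    return (p[0] & 0x80) != 0;
}
```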
XDR Stream Implementation
This section provides the abstract data types needed to implement new instances
of XDR streams.
The XDR Object
The following structure defines the interface to an XDR stream:
enum xdr_op { XDR_ENCODE=0, XDR_DECODE=1, XDR_FREE=2 };

typedef struct {
    enum xdr_op x_op;            /* operation; fast additional param */
    struct xdr_ops {
        bool_t  (*x_getlong)();  /* get long from stream */
        bool_t  (*x_putlong)();  /* put long to stream */
        bool_t  (*x_getbytes)(); /* get bytes from stream */
        bool_t  (*x_putbytes)(); /* put bytes to stream */
        u_int   (*x_getpostn)(); /* return stream offset */
        bool_t  (*x_setpostn)(); /* reposition offset */
        caddr_t (*x_inline)();   /* ptr to buffered data */
        void    (*x_destroy)();  /* free private area */
    } *x_ops;
    caddr_t x_public;            /* users' data */
    caddr_t x_private;           /* pointer to private data */
    caddr_t x_base;              /* private for position info */
    int     x_handy;             /* extra private word */
} XDR;
The x_op field is the current operation being performed on the stream. This
field is important to the XDR primitives, but should not affect a stream's
implementation. That is, a stream's implementation should not depend on this value.
The fields x_private, x_base, and x_handy are private to the particular
stream's implementation. The field x_public is for the XDR client and should
never be used by the XDR stream implementations or the XDR primitives.
x_getpostn(), x_setpostn(), and x_destroy() are macros for
accessing operations. The operation x_inline() takes two parameters: an
XDR *, and an unsigned integer, which is a byte count. The routine returns a
pointer to a piece of the stream's internal buffer. The caller can then use the
buffer segment for any purpose. From the stream's point of view, the bytes in
the buffer segment have been consumed or put. The routine may return NULL if
it cannot return a buffer segment of the requested size. (The x_inline() routine
is for cycle squeezers. Use of the resulting buffer is not data-portable. Users
are encouraged not to use this feature.)
The operations x_getbytes() and x_putbytes() blindly get and put
sequences of bytes from or to the underlying stream; they return TRUE if they are
successful, and FALSE otherwise. The routines have identical parameters
(replace xxx):
bool_t
xxxbytes(xdrs, buf, bytecount)
    XDR *xdrs;
    char *buf;
    u_int bytecount;
The operations x_getlong() and x_putlong() receive and put long
numbers from and to the data stream. It is the responsibility of these routines to
translate the numbers between the machine representation and the (standard)
external representation. The UNIX primitives htonl () and ntohl () can be
helpful in accomplishing this. The higher-level XDR implementation assumes
that signed and unsigned long integers contain the same number of bits, and that
nonnegative integers have the same bit representations as unsigned integers. The
routines return TRUE if they succeed, and FALSE otherwise. They have identical parameters:
bool_t
xxxlong(xdrs, lp)
    XDR *xdrs;
    long *lp;
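As a sketch of what the text describes (not the SunOS library source), an x_putlong/x_getlong pair for a simple memory-buffer stream might use htonl() and ntohl() to translate between the host representation and the big-endian external form. The helper names here are illustrative, not part of the XDR API:

```c
#include <arpa/inet.h>  /* htonl(), ntohl() */
#include <string.h>

/* Hypothetical helpers in the spirit of x_putlong()/x_getlong() for a
 * memory-buffer stream: translate between host byte order and the
 * big-endian XDR wire form with htonl()/ntohl(). */
static void put_long(unsigned char *buf, long value)
{
    unsigned int wire = htonl((unsigned int)value); /* host -> network order */
    memcpy(buf, &wire, 4);
}

static long get_long(const unsigned char *buf)
{
    unsigned int wire;
    memcpy(&wire, buf, 4);
    return (long)(int)ntohl(wire);  /* network -> host order */
}
```

On a big-endian machine htonl() and ntohl() are identity operations, so the same code serves both byte orders.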
Implementors of new XDR streams must make an XDR structure (with new
operation routines) available to clients, using some kind of create routine.
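The shape of such a create routine can be sketched with a toy stream that supports only a "put long" operation. This is an illustration of the pattern (an operations table plus a create routine that installs it), not the SunOS implementation; all names here are invented:

```c
#include <string.h>

/* A toy XDR-like stream: the operations live in a function-pointer
 * table (cf. x_ops) and a create routine hands clients an initialized
 * stream, analogous in spirit to xdrmem_create(). */
struct toy_stream;
struct toy_ops {
    int (*put_long)(struct toy_stream *, long);
};
struct toy_stream {
    const struct toy_ops *ops;   /* like x_ops */
    unsigned char buf[64];       /* like x_private/x_base */
    int pos;                     /* like x_handy */
};

static int mem_put_long(struct toy_stream *s, long v)
{
    unsigned long u = (unsigned long)v;
    int i;
    if (s->pos + 4 > (int)sizeof(s->buf))
        return 0;                        /* FALSE: buffer exhausted */
    for (i = 3; i >= 0; i--) {           /* big-endian byte order */
        s->buf[s->pos + i] = (unsigned char)(u & 0xff);
        u >>= 8;
    }
    s->pos += 4;
    return 1;                            /* TRUE */
}

static const struct toy_ops mem_ops = { mem_put_long };

/* The create routine that clients call. */
void toymem_create(struct toy_stream *s)
{
    s->ops = &mem_ops;
    s->pos = 0;
    memset(s->buf, 0, sizeof(s->buf));
}
```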
5.2. Advanced Topics
This section describes techniques for passing data structures that are not covered
in the preceding sections. Such structures include linked lists (of arbitrary
lengths). Unlike the simpler examples covered in the earlier sections, the following examples are written using both the XDR C library routines and the XDR
data description language. The External Data Representation Standard: Protocol Specification chapter of this Network Programming manual describes this
language in complete detail.
Linked Lists
The last example in the Pointers section presented a C data structure and its associated XDR routines for an individual's gross assets and liabilities. The example
is duplicated below:
struct gnumbers {
    long g_assets;
    long g_liabilities;
};

bool_t
xdr_gnumbers(xdrs, gp)
    XDR *xdrs;
    struct gnumbers *gp;
{
    if (xdr_long(xdrs, &(gp->g_assets)))
        return (xdr_long(xdrs, &(gp->g_liabilities)));
    return (FALSE);
}
Now assume that we wish to implement a linked list of such information. A data
structure could be constructed as follows:
struct gnumbers_node {
    struct gnumbers gn_numbers;
    struct gnumbers_node *gn_next;
};
typedef struct gnumbers_node *gnumbers_list;
The head of the linked list can be thought of as the data object; that is, the head is
not merely a convenient shorthand for a structure. Similarly the gn_next field
is used to indicate whether or not the object has terminated. Unfortunately, if the
object continues, the gn_next field is also the address of where it continues.
The link addresses carry no useful information when the object is serialized.
The XDR data description of this linked list is described by the recursive declaration of gnumbers_list:
struct gnumbers {
    int g_assets;
    int g_liabilities;
};

struct gnumbers_node {
    gnumbers gn_numbers;
    gnumbers_node *gn_next;
};

typedef gnumbers_node *gnumbers_list;
In this description, the boolean indicates whether there is more data following it.
If the boolean is FALSE, then it is the last data field of the structure. If it is
TRUE, then it is followed by a gnumbers structure and (recursively) by a
gnumbers_list. Note that the C declaration has no boolean explicitly
declared in it (though the gn_next field implicitly carries the information),
while the XDR data description has no pointer explicitly declared in it.
Revision A, of 27 March 1990
126
Network Programming
Hints for writing the XDR routines for a gnumbers_list follow easily from
the XDR description above. Note how the primitive xdr_pointer() is used
to implement the XDR union above.
bool_t
xdr_gnumbers_node(xdrs, gn)
    XDR *xdrs;
    gnumbers_node *gn;
{
    return (xdr_gnumbers(xdrs, &gn->gn_numbers) &&
        xdr_gnumbers_list(xdrs, &gn->gn_next));
}

bool_t
xdr_gnumbers_list(xdrs, gnp)
    XDR *xdrs;
    gnumbers_list *gnp;
{
    return (xdr_pointer(xdrs, gnp,
        sizeof(struct gnumbers_node),
        xdr_gnumbers_node));
}
The unfortunate side effect of XDR'ing a list with these routines is that the C
stack grows linearly with respect to the number of nodes in the list. This is due to
the recursion. The following routine collapses the above two mutually recursive
routines into a single, non-recursive one.
bool_t
xdr_gnumbers_list(xdrs, gnp)
    XDR *xdrs;
    gnumbers_list *gnp;
{
    bool_t more_data;
    gnumbers_list *nextp;

    for (;;) {
        more_data = (*gnp != NULL);
        if (!xdr_bool(xdrs, &more_data))
            return (FALSE);
        if (!more_data)
            break;
        if (xdrs->x_op == XDR_FREE)
            nextp = &(*gnp)->gn_next;
        if (!xdr_reference(xdrs, gnp,
            sizeof(struct gnumbers_node), xdr_gnumbers))
        {
            return (FALSE);
        }
        gnp = (xdrs->x_op == XDR_FREE) ?
            nextp : &(*gnp)->gn_next;
    }
    *gnp = NULL;
    return (TRUE);
}
The first task is to find out whether there is more data or not, so that this boolean
information can be serialized. Notice that this statement is unnecessary in the
XDR_DECODE case, since the value of more_data is not known until we
deserialize it in the next statement.
The next statement XDR's the more_data field of the XDR union. Then if there
is truly no more data, we set this last pointer to NULL to indicate the end of the
list, and return TRUE because we are done. Note that setting the pointer to NULL
is only important in the XDR_DECODE case, since it is already NULL in the
XDR_ENCODE and XDR_FREE cases.
Next, if the direction is XDR_FREE, the value of nextp is set to indicate the
location of the next pointer in the list. We do this now because we need to
dereference gnp to find the location of the next item in the list, and after the next
statement the storage pointed to by gnp will be freed up and no longer be valid.
We can't do this for all directions though, because in the XDR_DECODE direction the value of gnp won't be set until the next statement.
Next, we XDR the data in the node using the primitive xdr_reference().
xdr_reference() is like xdr_pointer(), which we used before, but it
does not send over the boolean indicating whether there is more data. We use it
instead of xdr_pointer() because we have already XDR'd this information
ourselves. Notice that the xdr routine passed is not the same type as an element
in the list. The routine passed is xdr_gnumbers(), for XDR'ing gnumbers,
but each element in the list is actually of type gnumbers_node. We don't
pass xdr_gnumbers_node() because it is recursive, and instead use
xdr_gnumbers() which XDR's all of the non-recursive part. Note that this
trick will work only if the gn_numbers field is the first item in each element,
so that their addresses are identical when passed to xdr_reference().
Finally, we update gnp to point to the next item in the list. If the direction is
XDR_FREE, we set it to the previously saved value, otherwise we can dereference gnp to get the proper value. Though harder to understand than the recursive
version, this non-recursive routine is far less likely to blow the C stack. It will
also run more efficiently since a lot of procedure call overhead has been
removed. Most lists are small though (in the hundreds of items or less) and the
recursive version should be sufficient for them.
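The wire form this routine produces, each node prefixed by a TRUE boolean and the list terminated by FALSE, can be illustrated with a toy iterative serializer that writes into a plain byte buffer instead of an XDR stream. All names here are illustrative; this is not the library code:

```c
#include <stddef.h>

/* Toy serializer mirroring the list encoding described in the text:
 * TRUE, node, TRUE, node, ..., FALSE.  Each 32-bit quantity is
 * written most significant byte first, as XDR requires. */
struct gnumbers_node {
    long g_assets;
    long g_liabilities;
    struct gnumbers_node *gn_next;
};

static unsigned char *put_u32(unsigned char *p, unsigned long v)
{
    int i;
    for (i = 3; i >= 0; i--) { p[i] = (unsigned char)(v & 0xff); v >>= 8; }
    return p + 4;
}

/* Returns the bytes written: 12 per node (boolean + two integers)
 * plus 4 for the terminating FALSE boolean. */
size_t encode_gnumbers_list(const struct gnumbers_node *n, unsigned char *out)
{
    unsigned char *p = out;
    for (; n != NULL; n = n->gn_next) {
        p = put_u32(p, 1);                        /* more_data == TRUE */
        p = put_u32(p, (unsigned long)n->g_assets);
        p = put_u32(p, (unsigned long)n->g_liabilities);
    }
    p = put_u32(p, 0);                            /* more_data == FALSE */
    return (size_t)(p - out);
}
```

Because the loop carries its own cursor rather than recursing, the stack usage is constant no matter how long the list is, which is the point of the non-recursive routine above.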
PART TWO: Protocol Specifications
6
External Data Representation Standard: Protocol Specification
6.1. Status of this Standard
Note: This chapter specifies a protocol that Sun Microsystems, Inc., and others
are using. It has been designated RFC 1014 by the ARPA Network Information
Center.
6.2. Introduction
XDR is a standard for the description and encoding of data. It is useful for
transferring data between different computer architectures, and has been used to
communicate data between such diverse machines as the Sun Workstation, VAX,
IBM-PC, and Cray. XDR fits into the ISO presentation layer, and is roughly
analogous in purpose to X.409, ISO Abstract Syntax Notation. The major
difference between these two is that XDR uses implicit typing, while X.409 uses explicit typing.
XDR uses a language to describe data formats. The language can be used
only to describe data; it is not a programming language. This language allows
one to describe intricate data formats in a concise manner. The alternative of
using graphical representations (itself an informal language) quickly becomes
incomprehensible when faced with complexity. The XDR language itself is
similar to the C language [1], just as Courier [4] is similar to Mesa. Protocols
such as Sun RPC (Remote Procedure Call) and the NFS (Network File System)
use XDR to describe the format of their data.
The XDR standard makes the following assumption: that bytes (or octets) are
portable, where a byte is defined to be 8 bits of data. A given hardware device
should encode the bytes onto the various media in such a way that other
hardware devices may decode the bytes without loss of meaning. For example,
the Ethernet standard suggests that bytes be encoded in "little-endian" style [2],
or least significant bit first.
Basic Block Size
The representation of all items requires a multiple of four bytes (or 32 bits) of
data. The bytes are numbered 0 through n-1. The bytes are read or written to
some byte stream such that byte m always precedes byte m+1. If the n bytes
needed to contain the data are not a multiple of four, then the n bytes are
followed by enough (0 to 3) residual zero bytes, r, to make the total byte count a
multiple of 4.
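The padding rule amounts to a one-line computation. As a small sketch (the function name is ours, not part of any XDR library):

```c
/* Residual zero-byte count r for an n-byte item, per the basic block
 * size rule: pad n up to the next multiple of four. */
unsigned int xdr_pad(unsigned int n)
{
    return (4 - (n % 4)) % 4;   /* 0 when n is already a multiple of 4 */
}
```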
We include the familiar graphic box notation for illustration and comparison. In
most illustrations, each box (delimited by a plus sign at the 4 corners and vertical
bars and dashes) depicts a byte. Ellipses (...) between boxes show zero or more
additional bytes where required.
A Block
+--------+--------+...+--------+--------+...+--------+
| byte 0 | byte 1 |...|byte n-1|    0   |...|    0   |
+--------+--------+...+--------+--------+...+--------+
|<-----------n bytes---------->|<------r bytes------>|
|<-----------n+r (where (n+r) mod 4 = 0)------------>|
6.3. XDR Data Types
Each of the sections that follow describes a data type defined in the XDR
standard, shows how it is declared in the language, and includes a graphic
illustration of its encoding.
For each data type in the language we show a general paradigm declaration.
Note that angle brackets (< and >) denote variable-length sequences of data and
square brackets ([ and ]) denote fixed-length sequences of data. "n", "m" and "r"
denote integers. For the full language specification and more formal definitions
of terms such as "identifier" and "declaration", refer to The XDR Language
Specification, below.
For some data types, more specific examples are included. A more extensive
example of a data description is in An Example of an XDR Data Description,
below.
Integer
An XDR signed integer is a 32-bit datum that encodes an integer in the range
[-2147483648,2147483647]. The integer is represented in two's complement
notation. The most and least significant bytes are 0 and 3, respectively. Integers are
declared as follows:
Integer

int identifier;

(MSB)                     (LSB)
+-------+-------+-------+-------+
|byte 0 |byte 1 |byte 2 |byte 3 |
+-------+-------+-------+-------+
<------------32 bits------------>
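As a sketch of the encoding rule (two's complement, most significant byte first), the following pair encodes and decodes a 32-bit signed integer by hand; the function names are ours, not the library's:

```c
/* Encode/decode an XDR signed integer: 32-bit two's complement,
 * most significant byte first (byte 0 is the MSB). */
void xdr_put_int(unsigned char buf[4], long v)
{
    unsigned long u = (unsigned long)v & 0xffffffffUL; /* two's complement bits */
    buf[0] = (unsigned char)(u >> 24);  /* byte 0: most significant */
    buf[1] = (unsigned char)(u >> 16);
    buf[2] = (unsigned char)(u >> 8);
    buf[3] = (unsigned char)u;          /* byte 3: least significant */
}

long xdr_get_int(const unsigned char buf[4])
{
    unsigned long u = ((unsigned long)buf[0] << 24) |
                      ((unsigned long)buf[1] << 16) |
                      ((unsigned long)buf[2] << 8)  |
                       (unsigned long)buf[3];
    /* Bit 31 is the two's complement sign bit. */
    return (u & 0x80000000UL) ? -(long)(0x100000000UL - u) : (long)u;
}
```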
Unsigned Integer
An XDR unsigned integer is a 32-bit datum that encodes a nonnegative integer in
the range [0,4294967295]. It is represented by an unsigned binary number whose
most and least significant bytes are 0 and 3, respectively. An unsigned integer is
declared as follows:
Unsigned Integer

unsigned int identifier;

(MSB)                     (LSB)
+-------+-------+-------+-------+
|byte 0 |byte 1 |byte 2 |byte 3 |
+-------+-------+-------+-------+
<------------32 bits------------>
Enumerations have the same representation as signed integers. Enumerations are
handy for describing subsets of the integers. Enumerated data is declared as follows:
Enumeration
enum { name-identifier = constant,
... } identifier;
For example, the three colors red, yellow, and blue could be described by an
enumerated type:
enum { RED = 2, YELLOW = 3, BLUE = 5 } colors;
It is an error to encode as an enum any other integer than those that have been
given assignments in the enum declaration.
Booleans are important enough and occur frequently enough to warrant their own
explicit type in the standard. Booleans are declared as follows:
Boolean
bool identifier;
This is equivalent to:
enum { FALSE = 0, TRUE = 1 } identifier;

Hyper Integer and Unsigned Hyper Integer
The standard also defines 64-bit (8-byte) numbers called hyper integer and
unsigned hyper integer. Their representations are the obvious extensions of
integer and unsigned integer defined above. They are represented in two's
complement notation. The most and least significant bytes are 0 and 7, respectively.
Their declarations:
Hyper Integer
Unsigned Hyper Integer
(MSB)                                                     (LSB)
+-------+-------+-------+-------+-------+-------+-------+-------+
|byte 0 |byte 1 |byte 2 |byte 3 |byte 4 |byte 5 |byte 6 |byte 7 |
+-------+-------+-------+-------+-------+-------+-------+-------+
<----------------------------64 bits---------------------------->
Floating-point
The standard defines the floating-point data type "float" (32 bits or 4 bytes). The
encoding used is the IEEE standard for normalized single-precision floating-point
numbers [3]. The following three fields describe the single-precision
floating-point number:
S: The sign of the number. Values 0 and 1 represent positive and
   negative, respectively. One bit.

E: The exponent of the number, base 2. 8 bits are devoted to this
   field. The exponent is biased by 127.

F: The fractional part of the number's mantissa, base 2. 23 bits are
   devoted to this field.
Therefore, the floating-point number is described by:
Revision A, of 27 March 1990
134
Protocol Specifications
(-1)**S * 2**(E-Bias) * 1.F
It is declared as follows:
Single-Precision Floating-Point
+-------+-------+-------+-------+
|byte 0 |byte 1 |byte 2 |byte 3 |
S|   E   |           F          |
+-------+-------+-------+-------+
1|<- 8 ->|<-------23 bits------>|
<------------32 bits------------>
Just as the most and least significant bytes of a number are 0 and 3, the most and
least significant bits of a single-precision floating-point number are 0 and 31.
The beginning bit (and most significant bit) offsets of S, E, and F are 0, 1, and 9,
respectively. Note that these numbers refer to the mathematical positions of the
bits, and NOT to their actual physical locations (which vary from medium to
medium).
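The S, E, and F fields can be pulled out of a native float by viewing its raw bits, which is a convenient way to check the layout above. This sketch assumes the C "float" is the IEEE single-precision format, as on essentially all current machines; the function name is ours:

```c
#include <string.h>

/* Split an IEEE single-precision float into the S, E, F fields
 * described in the text (1, 8, and 23 bits). */
void float_fields(float f, unsigned *s, unsigned *e, unsigned *frac)
{
    unsigned bits;
    memcpy(&bits, &f, 4);            /* view the float's raw 32 bits */
    *s = bits >> 31;                 /* sign: 1 bit */
    *e = (bits >> 23) & 0xff;        /* exponent: 8 bits, bias 127 */
    *frac = bits & 0x7fffff;         /* fraction: 23 bits */
}
```

For example, 1.0 is (+1) * 2**(127-127) * 1.0, so S=0, E=127, F=0.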
The IEEE specifications should be consulted concerning the encoding for signed
zero, signed infinity (overflow), and denormalized numbers (underflow) [3].
According to IEEE specifications, the "NaN" (not a number) is system dependent
and should not be used externally.
Double-Precision Floating-Point
The standard defines the encoding for the double-precision floating-point data
type "double" (64 bits or 8 bytes). The encoding used is the IEEE standard for
normalized double-precision floating-point numbers [3]. The standard encodes
the following three fields, which describe the double-precision floating-point
number:
S: The sign of the number. Values 0 and 1 represent positive and
   negative, respectively. One bit.

E: The exponent of the number, base 2. 11 bits are devoted to this
   field. The exponent is biased by 1023.

F: The fractional part of the number's mantissa, base 2. 52 bits are
   devoted to this field.
Therefore, the floating-point number is described by:
(-1)**S * 2**(E-Bias) * 1.F
It is declared as follows:
Double-Precision Floating-Point
+------+------+------+------+------+------+------+------+
|byte 0|byte 1|byte 2|byte 3|byte 4|byte 5|byte 6|byte 7|
S|    E   |                    F                        |
+------+------+------+------+------+------+------+------+
1|<--11-->|<-----------------52 bits------------------->|
<-----------------------64 bits------------------------->
Just as the most and least significant bytes of a number are 0 and 3, the most and
least significant bits of a double-precision floating-point number are 0 and 63.
The beginning bit (and most significant bit) offsets of S, E, and F are 0, 1, and
12, respectively. Note that these numbers refer to the mathematical positions of
the bits, and NOT to their actual physical locations (which vary from medium to
medium).
The IEEE specifications should be consulted concerning the encoding for signed
zero, signed infinity (overflow), and denormalized numbers (underflow) [3].
According to IEEE specifications, the "NaN" (not a number) is system dependent
and should not be used externally.
Fixed-length Opaque Data
At times, fixed-length uninterpreted data needs to be passed among machines.
This data is called "opaque" and is declared as follows:
opaque identifier[n];
where the constant n is the (static) number of bytes necessary to contain the
opaque data. If n is not a multiple of four, then the n bytes are followed by
enough (0 to 3) residual zero bytes, r, to make the total byte count of the opaque
object a multiple of four.
Fixed-Length Opaque
 0        1     ...
+--------+--------+...+--------+--------+...+--------+
| byte 0 | byte 1 |...|byte n-1|    0   |...|    0   |
+--------+--------+...+--------+--------+...+--------+
|<-----------n bytes---------->|<------r bytes------>|
|<-----------n+r (where (n+r) mod 4 = 0)------------>|
Variable-length Opaque Data
The standard also provides for variable-length (counted) opaque data, defined as
a sequence of n (numbered 0 through n-1) arbitrary bytes to be the number n
encoded as an unsigned integer (as described below), and followed by the n bytes
of the sequence.
Byte m of the sequence always precedes byte m+1 of the sequence, and byte 0 of
the sequence always follows the sequence's length (count). If n is not a multiple
of four, then the n bytes are followed by enough (0 to 3) residual zero bytes, r, to
make the total byte count a multiple of four. Variable-length opaque data is
declared in the following way:
opaque identifier<m>;

or

opaque identifier<>;
The constant m denotes an upper bound of the number of bytes that the sequence
may contain. If m is not specified, as in the second declaration, it is assumed to
be (2**32) - 1, the maximum length. The constant m would normally be found
in a protocol specification. For example, a filing protocol may state that the maximum data transfer size is 8192 bytes, as follows:
opaque filedata<8192>;
This can be illustrated as follows:
Variable-Length Opaque
   0     1     2     3     4     5   ...
+-----+-----+-----+-----+-----+-----+...+-----+-----+...+-----+
|        length n       |byte0|byte1|...| n-1 |  0  |...|  0  |
+-----+-----+-----+-----+-----+-----+...+-----+-----+...+-----+
|<-------4 bytes------->|<------n bytes------>|<---r bytes--->|
                        |<----n+r (where (n+r) mod 4 = 0)---->|
It is an error to encode a length greater than the maximum described in the
specification.
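The whole variable-length opaque encoding (length word, data, zero padding) can be sketched in a few lines; the function name is illustrative, not part of the XDR library:

```c
#include <stddef.h>
#include <string.h>

/* Encode a variable-length opaque item: 4-byte big-endian length n,
 * then the n data bytes, then 0-3 residual zero bytes.  Returns the
 * total bytes written, always a multiple of four. */
size_t encode_opaque(unsigned char *out, const unsigned char *data, unsigned n)
{
    unsigned r = (4 - (n % 4)) % 4;          /* residual zero bytes */
    out[0] = (unsigned char)(n >> 24);       /* length, MSB first */
    out[1] = (unsigned char)(n >> 16);
    out[2] = (unsigned char)(n >> 8);
    out[3] = (unsigned char)n;
    memcpy(out + 4, data, n);
    memset(out + 4 + n, 0, r);               /* padding must be zero */
    return 4 + n + r;
}
```

Forcing the padding bytes to zero (rather than leaving them uninitialized) is what lets two encodings of the same data be compared byte for byte, as the Discussion section later notes.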
The standard defines a string of n (numbered 0 through n-1) ASCII bytes to be
the number n encoded as an unsigned integer (as described above), and followed
by the n bytes of the string. Byte m of the string always precedes byte m+1 of
the string, and byte 0 of the string always follows the string's length. If n is not a
multiple of four, then the n bytes are followed by enough (0 to 3) residual zero
bytes, r, to make the total byte count a multiple of four. Counted byte strings are
declared as follows:
declared as follows:
String
string object<m>;
or
string object<>;
The constant m denotes an upper bound of the number of bytes that a string may
contain. If m is not specified, as in the second declaration, it is assumed to be
(2**32) - 1, the maximum length. The constant m would normally be found in a
protocol specification. For example, a filing protocol may state that a file name
can be no longer than 255 bytes, as follows:
string filename<255>;
Which can be illustrated as:
A String
   0     1     2     3     4     5   ...
+-----+-----+-----+-----+-----+-----+...+-----+-----+...+-----+
|        length n       |byte0|byte1|...| n-1 |  0  |...|  0  |
+-----+-----+-----+-----+-----+-----+...+-----+-----+...+-----+
|<-------4 bytes------->|<------n bytes------>|<---r bytes--->|
                        |<----n+r (where (n+r) mod 4 = 0)---->|
It is an error to encode a length greater than the maximum described in the
specification.
Fixed-length Array
Declarations for fixed-length arrays of homogeneous elements are in the following form:
type-name identifier[n];
Fixed-length arrays of elements numbered 0 through n-l are encoded by
individually encoding the elements of the array in their natural order, 0 through
n-1. Each element's size is a multiple of four bytes. Though all elements are of
the same type, the elements may have different sizes. For example, in a
fixed-length array of strings, all elements are of type "string", yet each element will
vary in its length.
Fixed-Length Array
+---+---+---+---+---+---+---+---+...+---+---+---+---+
|   element 0   |   element 1   |...|  element n-1  |
+---+---+---+---+---+---+---+---+...+---+---+---+---+
|<--------------------n elements------------------->|
Variable-length Array
Counted arrays provide the ability to encode variable-length arrays of
homogeneous elements. The array is encoded as the element count n (an unsigned
integer) followed by the encoding of each of the array's elements, starting with
element 0 and progressing through element n-1. The declaration for
variable-length arrays follows this form:
type-name identifier<m>;
or
type-name identifier<>;
The constant m specifies the maximum acceptable element count of an array; if
m is not specified, as in the second declaration, it is assumed to be (2**32) - 1.
Counted Array
 0  1  2  3
+--+--+--+--+--+--+--+--+--+--+--+--+...+--+--+--+--+
|     n     | element 0 | element 1 |...|element n-1|
+--+--+--+--+--+--+--+--+--+--+--+--+...+--+--+--+--+
|<-4 bytes->|<--------------n elements------------->|
It is an error to encode a value of n that is greater than the maximum described in
the specification.
Structure
Structures are declared as follows:
struct {
    component-declaration-A;
    component-declaration-B;
    ...
} identifier;
The components of the structure are encoded in the order of their declaration in
the structure. Each component's size is a multiple of four bytes, though the
components may be different sizes.
Structure
+-------------+-------------+...
| component A | component B |...
+-------------+-------------+...
Discriminated Union
A discriminated union is a type composed of a discriminant followed by a type
selected from a set of prearranged types according to the value of the
discriminant. The type of discriminant is either "int", "unsigned int", or an
enumerated type, such as "bool". The component types are called "arms" of the union, and
are preceded by the value of the discriminant which implies their encoding.
Discriminated unions are declared as follows:
union switch (discriminant-declaration) {
case discriminant-value-A:
    arm-declaration-A;
case discriminant-value-B:
    arm-declaration-B;
...
default: default-declaration;
} identifier;
Each "case" keyword is followed by a legal value of the discriminant. The
default arm is optional. If it is not specified, then a valid encoding of the union
cannot take on unspecified discriminant values. The size of the implied arm is
always a multiple of four bytes.
The discriminated union is encoded as its discriminant followed by the encoding
of the implied arm.
Discriminated Union
  0   1   2   3
+---+---+---+---+---+---+---+---+
|  discriminant |  implied arm  |
+---+---+---+---+---+---+---+---+
|<---4 bytes--->|
Void
An XDR void is a 0-byte quantity. Voids are useful for describing operations
that take no data as input or no data as output. They are also useful in unions,
where some arms may contain data and others do not. The declaration is simply
as follows:
void;
Voids are illustrated as follows:
++
||
++
--><-- 0 bytes
Constant
The data declaration for a constant follows this form:

const name-identifier = n;
"const" is used to define a symbolic name for a constant; it does not declare any
data. The symbolic constant may be used anywhere a regular constant may be
used. For example, the following defines a symbolic constant DOZEN, equal to
12.
const DOZEN = 12;

Typedef
"typedef" does not declare any data either, but serves to define new identifiers for
declaring data. The syntax is:
typedef declaration;
The new type name is actually the variable name in the declaration part of the
typedef. For example, the following defines a new type called "eggbox" using an
existing type called "egg":
typedef egg eggbox[DOZEN];
Variables declared using the new type name have the same type as the new type
name would have in the typedef, if it was considered a variable. For example,
the following two declarations are equivalent in declaring the variable
"fresheggs" :
eggbox fresheggs;
egg    fresheggs[DOZEN];
When a typedef involves a struct, enum, or union definition, there is another
(preferred) syntax that may be used to define the same type. In general, a typedef of
the following form:

typedef <<struct, union, or enum definition>> identifier;
may be converted to the alternative form by removing the "typedef" part and
placing the identifier after the "struct", "union", or "enum" keyword, instead of at
the end. For example, here are the two ways to define the type "bool":
typedef enum {    /* using typedef */
    FALSE = 0,
    TRUE = 1
} bool;

enum bool {       /* preferred alternative */
    FALSE = 0,
    TRUE = 1
};
The reason this syntax is preferred is that one does not have to wait until the end of a
declaration to figure out the name of the new type.
Optional-data
Optional-data is one kind of union that occurs so frequently that we give it a special syntax of its own for declaring it. It is declared as follows:
type-name *identifier;
This is equivalent to the following union:
union switch (bool opted) {
case TRUE:
    type-name element;
case FALSE:
    void;
} identifier;
It is also equivalent to the following variable-length array declaration, since the
boolean "opted" can be interpreted as the length of the array:
type-name identifier<1>;
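The equivalence is easy to see on the wire: a present element encodes as the 4-byte boolean TRUE (a length of 1) followed by the element, and an absent one as FALSE alone. A toy encoder for an optional XDR integer, with invented names, looks like this:

```c
#include <stddef.h>

static unsigned char *put_u32(unsigned char *p, unsigned long v)
{
    int i;
    for (i = 3; i >= 0; i--) { p[i] = (unsigned char)(v & 0xff); v >>= 8; }
    return p + 4;
}

/* Optional-data sketch: a 4-byte boolean ("opted", the array length),
 * followed by the element when present.  ip == NULL encodes as FALSE
 * alone; otherwise TRUE then the value. */
size_t encode_optional_int(unsigned char *out, const long *ip)
{
    unsigned char *p = put_u32(out, ip != NULL);
    if (ip != NULL)
        p = put_u32(p, (unsigned long)*ip);
    return (size_t)(p - out);
}
```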
Optional-data is not so interesting in itself, but it is very useful for describing
recursive data-structures such as linked-lists and trees. For example, the following defines a type "stringlist" that encodes lists of arbitrary length strings:
struct *stringlist {
    string item<>;
    stringlist next;
};
It could have been equivalently declared as the following union:
union stringlist switch (bool opted) {
case TRUE:
    struct {
        string item<>;
        stringlist next;
    } element;
case FALSE:
    void;
};
or as a variable-length array:
struct stringlist<1> {
    string item<>;
    stringlist next;
};
Both of these declarations obscure the intention of the stringlist type, so the
optional-data declaration is preferred over both of them. The optional-data type
also has a close correlation to how recursive data structures are represented in
high-level languages such as Pascal or C by use of pointers. In fact, the syntax is
the same as that of the C language for pointers.
Areas for Future Enhancement
The XDR standard lacks representations for bit fields and bitmaps, since the
standard is based on bytes. Also missing are packed (or binary-coded) decimals.
The intent of the XDR standard was not to describe every kind of data that
people have ever sent or will ever want to send from machine to machine. Rather, it
only describes the most commonly used data-types of high-level languages such
as Pascal or C so that applications written in these languages will be able to
communicate easily over some medium.
One could imagine extensions to XDR that would let it describe almost any
existing protocol, such as TCP. The minimum necessary for this are support for
different block sizes and byte-orders. The XDR discussed here could then be
considered the 4-byte big-endian member of a larger XDR family.
6.4. Discussion

Why a Language for Describing Data?
There are many advantages in using a data-description language such as XDR
versus using diagrams. Languages are more formal than diagrams and lead to
less ambiguous descriptions of data. Languages are also easier to understand and
allow one to think of other issues instead of the low-level details of bit-encoding.
Also, there is a close analogy between the types of XDR and a high-level
language such as C or Pascal. This makes the implementation of XDR encoding
and decoding modules an easier task. Finally, the language specification itself is
an ASCII string that can be passed from machine to machine to perform on-the-fly data interpretation.
Why Only One Byte-Order for an XDR Unit?
Supporting two byte-orderings requires a higher level protocol for determining in
which byte-order the data is encoded. Since XDR is not a protocol, this can't be
done. The advantage of this, though, is that data in XDR format can be written
to a magnetic tape, for example, and any machine will be able to interpret it,
since no higher level protocol is necessary for determining the byte-order.
Why does XDR use Big-Endian Byte-Order?
Yes, it is unfair, but having only one byte-order means you have to be unfair to
somebody. Many architectures, such as the Motorola 68000 and IBM 370,
support the big-endian byte-order.
Why is the XDR Unit Four Bytes Wide?
There is a tradeoff in choosing the XDR unit size. Choosing a small size such as
two makes the encoded data small, but causes alignment problems for machines
that aren't aligned on these boundaries. A large size such as eight means the data
will be aligned on virtually every machine, but causes the encoded data to grow
too big. We chose four as a compromise. Four is big enough to support most
architectures efficiently, except for rare machines such as the eight-byte aligned
Cray. Four is also small enough to keep the encoded data restricted to a reasonable size.
Why must Variable-Length Data be Padded with Zeros?
It is desirable that the same data encode into the same thing on all machines, so
that encoded data can be meaningfully compared or checksummed. Forcing the
padded bytes to be zero ensures this.
Why is there No Explicit Data-Typing?
Data-typing has a relatively high cost for what small advantages it may have.
One cost is the expansion of data due to the inserted type fields. Another is the
added cost of interpreting these type fields and acting accordingly. And most
protocols already know what type they expect, so data-typing supplies only
redundant information. However, one can still get the benefits of data-typing
using XDR. One way is to encode two things: first a string which is the XDR
data description of the encoded data, and then the encoded data itself. Another
way is to assign a value to all the types in XDR, and then define a universal type
which takes this value as its discriminant and for each value, describes the
corresponding data type.
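The second approach can be sketched in the XDR language itself. The type names and values below are purely illustrative, not part of the standard:

```
/* Hypothetical universal type: the discriminant says what follows. */
enum typetag {
   TYPE_INT   = 1,
   TYPE_FLOAT = 2,
   TYPE_STR   = 3
};

union universal switch (typetag tag) {
case TYPE_INT:
   int ival;
case TYPE_FLOAT:
   float fval;
case TYPE_STR:
   string sval<>;
};
```

A decoder reads the tag first and then knows how to decode the arm that follows, at the cost of four extra bytes per item.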
6.5. The XDR Language Specification
Notational Conventions
This specification uses an extended Backus-Naur Form notation for describing
the XDR language. Here is a brief description of the notation:
1. The characters |, (, ), [, ], ", and * are special.

2. Terminal symbols are strings of any characters surrounded by double quotes.

3. Non-terminal symbols are strings of non-special characters.

4. Alternative items are separated by a vertical bar ("|").

5. Optional items are enclosed in brackets.

6. Items are grouped together by enclosing them in parentheses.

7. A * following an item means 0 or more occurrences of that item.
For example, consider the following pattern:
"a " "very" (", " " very")* [" cold" "and"] " rainy" ("day" I "night")
An infinite number of strings match this pattern. A few of them are:

   "a very rainy day"
   "a very, very rainy day"
   "a very cold and rainy day"
   "a very, very, very cold and rainy night"

Lexical Notes
1. Comments begin with '/*' and terminate with '*/'.

2. White space serves to separate items and is otherwise ignored.

3. An identifier is a letter followed by an optional sequence of letters, digits or underbar ('_'). The case of identifiers is not ignored.
4. A constant is a sequence of one or more decimal digits, optionally preceded by a minus-sign ('-').

Syntax Information
declaration:
     type-specifier identifier
   | type-specifier identifier "[" value "]"
   | type-specifier identifier "<" [ value ] ">"
   | "opaque" identifier "[" value "]"
   | "opaque" identifier "<" [ value ] ">"
   | "string" identifier "<" [ value ] ">"
   | type-specifier "*" identifier
   | "void"

value:
     constant
   | identifier

type-specifier:
     [ "unsigned" ] "int"
   | [ "unsigned" ] "hyper"
   | "float"
   | "double"
   | "bool"
   | enum-type-spec
   | struct-type-spec
   | union-type-spec
   | identifier

enum-type-spec:
   "enum" enum-body

enum-body:
   "{"
      ( identifier "=" value )
      ( "," identifier "=" value )*
   "}"

struct-type-spec:
   "struct" struct-body

struct-body:
   "{"
      ( declaration ";" )
      ( declaration ";" )*
   "}"

union-type-spec:
   "union" union-body

union-body:
   "switch" "(" declaration ")" "{"
      ( "case" value ":" declaration ";" )
      ( "case" value ":" declaration ";" )*
      [ "default" ":" declaration ";" ]
   "}"

constant-def:
   "const" identifier "=" constant ";"

type-def:
     "typedef" declaration ";"
   | "enum" identifier enum-body ";"
   | "struct" identifier struct-body ";"
   | "union" identifier union-body ";"

definition:
     type-def
   | constant-def

specification:
     definition *

Syntax Notes
1. The following are keywords and cannot be used as identifiers: "bool", "case", "const", "default", "double", "enum", "float", "hyper", "opaque", "string", "struct", "switch", "typedef", "union", "unsigned" and "void".

2. Only unsigned constants may be used as size specifications for arrays. If an identifier is used, it must have been declared previously as an unsigned constant in a "const" definition.

3. Constant and type identifiers within the scope of a specification are in the same name space and must be declared uniquely within this scope.

4. Similarly, variable names must be unique within the scope of struct and union declarations. Nested struct and union declarations create new scopes.

5. The discriminant of a union must be of a type that evaluates to an integer. That is, "int", "unsigned int", "bool", an enumerated type or any typedefed type that evaluates to one of these is legal. Also, the case values must be one of the legal values of the discriminant. Finally, a case value may not be specified more than once within the scope of a union declaration.
6.6. An Example of an XDR Data Description
Here is a short XDR data description of a thing called a "file", which might be
used to transfer files from one machine to another.
const MAXUSERNAME = 32;     /* max length of a user name */
const MAXFILELEN = 65535;   /* max length of a file      */
const MAXNAMELEN = 255;     /* max length of a file name */

/*
 * Types of files:
 */
enum filekind {
   TEXT = 0,       /* ascii data */
   DATA = 1,       /* raw data   */
   EXEC = 2        /* executable */
};

/*
 * File information, per kind of file:
 */
union filetype switch (filekind kind) {
case TEXT:
   void;                           /* no extra information */
case DATA:
   string creator<MAXNAMELEN>;     /* data creator         */
case EXEC:
   string interpretor<MAXNAMELEN>; /* program interpretor  */
};

/*
 * A complete file:
 */
struct file {
   string filename<MAXNAMELEN>; /* name of file    */
   filetype type;               /* info about file */
   string owner<MAXUSERNAME>;   /* owner of file   */
   opaque data<MAXFILELEN>;     /* file data       */
};
Suppose now that there is a user named "john" who wants to store his lisp program "sillyprog" that contains just the data "(quit)". His file would be encoded as
follows:
 Offset    Hex Bytes      ASCII    Description
 ------    -----------    -----    -----------
  0        00 00 00 09    ....     Length of filename = 9
  4        73 69 6c 6c    sill     Filename characters
  8        79 70 72 6f    ypro     ... and more characters ...
 12        67 00 00 00    g...     ... and 3 zero-bytes of fill
 16        00 00 00 02    ....     Filekind is EXEC = 2
 20        00 00 00 04    ....     Length of interpretor = 4
 24        6c 69 73 70    lisp     Interpretor characters
 28        00 00 00 04    ....     Length of owner = 4
 32        6a 6f 68 6e    john     Owner characters
 36        00 00 00 06    ....     Length of file data = 6
 40        28 71 75 69    (qui     File data bytes ...
 44        74 29 00 00    t)..     ... and 2 zero-bytes of fill

6.7. References
[1] Brian W. Kernighan & Dennis M. Ritchie, "The C Programming Language",
Bell Laboratories, Murray Hill, New Jersey, 1978.
[2] Danny Cohen, "On Holy Wars and a Plea for Peace", IEEE Computer,
October 1981.
[3] "IEEE Standard for Binary Hoating-Point Arithmetic", ANSI/IEEE Standard
754-1985, Institute of Electrical and Electronics Engineers, August 1985.
[4] "Courier: The Remote Procedure Call Protocol", XEROX Corporation, XSIS
038112, December 1981.
7. Remote Procedure Calls: Protocol Specification
7.1. Status of this Memo
Note: This chapter specifies a protocol that Sun Microsystems, Inc., and others
are using. It has been designated RFC 1050 by the ARPA-Internet Network
Information Center.
7.2. Introduction
This chapter specifies a message protocol used in implementing Sun's Remote
Procedure Call (RPC) package. (The message protocol is specified with the
External Data Representation (XDR) language. See the External Data Representation Standard: Protocol Specification for the details. Here, we assume that the
reader is familiar with XDR and do not attempt to justify it or its uses). The
paper by Birrell and Nelson [1] is recommended as an excellent background to
and justification of RPC.
Terminology
This chapter discusses servers, services, programs, procedures, clients, and versions. A server is a piece of software where network services are implemented.
A network service is a collection of one or more remote programs. A remote
program implements one or more remote procedures; the procedures, their
parameters, and results are documented in the specific program's protocol
specification (see the Port Mapper Program Protocol, below, for an example).
Network clients are pieces of software that initiate remote procedure calls to services. A server may support more than one version of a remote program in order
to be forward compatible with changing protocols.
For example, a network file service may be composed of two programs. One
program may deal with high-level applications such as file system access control
and locking. The other may deal with low-level file IO and have procedures like
"read" and "write". A client machine of the network file service would call the
procedures associated with the two programs of the service on behalf of some
user on the client machine.
The RPC Model
The remote procedure call model is similar to the local procedure call model. In
the local case, the caller places arguments to a procedure in some well-specified
location (such as a result register). It then transfers control to the procedure, and
eventually gains back control. At that point, the results of the procedure are
extracted from the well-specified location, and the caller continues execution.
The remote procedure call is similar, in that one thread of control logically winds
through two processes-one is the caller's process, the other is a server's
process. That is, the caller process sends a call message to the server process and
waits (blocks) for a reply message. The call message contains the procedure's
parameters, among other things. The reply message contains the procedure's
results, among other things. Once the reply message is received, the results of
the procedure are extracted, and the caller's execution is resumed.
On the server side, a process is dormant awaiting the arrival of a call message.
When one arrives, the server process extracts the procedure's parameters, computes the results, sends a reply message, and then awaits the next call message.
Note that in this model, only one of the two processes is active at any given time.
However, this model is only given as an example. The RPC protocol makes no
restrictions on the concurrency model implemented, and others are possible. For
example, an implementation may choose to have RPC calls be asynchronous, so
that the client may do useful work while waiting for the reply from the server.
Another possibility is to have the server create a task to process an incoming
request, so that the server can be free to receive other requests.
Transports and Semantics
The RPC protocol is independent of transport protocols. That is, RPC does not
care how a message is passed from one process to another. The protocol deals
only with specification and interpretation of messages.
It is important to point out that RPC does not try to implement any kind of reliability and that the application must be aware of the type of transport protocol underneath RPC. If it knows it is running on top of a reliable transport such as TCP/IP[6], then most of the work is already done for it. On the other hand, if it is running on top of an unreliable transport such as UDP/IP[7], it must implement its own retransmission and time-out policy, as the RPC layer does not provide this service.
Because of transport independence, the RPC protocol does not attach specific
semantics to the remote procedures or their execution. Semantics can be inferred
from (but should be explicitly specified by) the underlying transport protocol.
For example, consider RPC running on top of an unreliable transport such as
UDP/IP. If an application retransmits RPC messages after short time-outs, the
only thing it can infer if it receives no reply is that the procedure was executed
zero or more times. If it does receive a reply, then it can infer that the procedure
was executed at least once.
A server may wish to remember previously granted requests from a client and not
regrant them in order to insure some degree of execute-at-most-once semantics.
A server can do this by taking advantage of the transaction ID that is packaged
with every RPC request. The main use of this transaction is by the client RPC
layer in matching replies to requests. However, a client application may choose
to reuse its previous transaction ID when retransmitting a request. The server
application, knowing this fact, may choose to remember this ID after granting a
request and not regrant requests with the same ID in order to achieve some
degree of execute-at-most-once semantics. The server is not allowed to examine
this ID in any other way except as a test for equality.
On the other hand, if using a reliable transport such as TCP/IP, the application
can infer from a reply message that the procedure was executed exactly once, but
if it receives no reply message, it cannot assume the remote procedure was not
executed. Note that even if a connection-oriented protocol like TCP is used, an
application still needs time-outs and reconnection to handle server crashes.
There are other possibilities for transports besides datagram- or connection-oriented protocols. For example, a request-reply protocol such as VMTP[2] is perhaps the most natural transport for RPC.
NOTE: At Sun, RPC is currently implemented on top of both TCP/IP and UDP/IP transports.

Binding and Rendezvous Independence

The act of binding a client to a service is NOT part of the remote procedure call specification. This important and necessary function is left up to some higher-level software. (The software may use RPC itself; see the Port Mapper Program Protocol, below.)
Implementors should think of the RPC protocol as the jump-subroutine instruction ("JSR") of a network; the loader (binder) makes JSR useful, and the loader
itself uses JSR to accomplish its task. Likewise, the network makes RPC useful,
using RPC to accomplish this task.
Authentication
The RPC protocol provides the fields necessary for a client to identify itself to a
service and vice-versa. Security and access control mechanisms can be built on
top of the message authentication. Several different authentication protocols can
be supported. A field in the RPC header indicates which protocol is being used.
More information on specific authentication protocols can be found in the
Authentication Protocols, below.
7.3. RPC Protocol Requirements

The RPC protocol must provide for the following:

1. Unique specification of a procedure to be called.

2. Provisions for matching response messages to request messages.

3. Provisions for authenticating the caller to service and vice-versa.

Besides these requirements, features that detect the following are worth supporting because of protocol roll-over errors, implementation bugs, user error, and network administration:

1. RPC protocol mismatches.

2. Remote program protocol version mismatches.

3. Protocol errors (such as misspecification of a procedure's parameters).

4. Reasons why remote authentication failed.

5. Any other reasons why the desired procedure was not called.
Programs and Procedures
The RPC call message has three unsigned fields: remote program number, remote
program version number, and remote procedure number. The three fields
uniquely identify the procedure to be called. Program numbers are administered
by some central authority (like Sun). Once an implementor has a program
number, he can implement his remote program; the first implementation would
most likely have the version number of 1. Because most new protocols evolve
into better, stable, and mature protocols, a version field of the call message
identifies which version of the protocol the caller is using. Version numbers
make speaking old and new protocols through the same server process possible.
The procedure number identifies the procedure to be called. These numbers are
documented in the specific program's protocol specification. For example, a file
service's protocol specification may state that its procedure number 5 is "read" and procedure number 12 is "write".
Just as remote program protocols may change over several versions, the actual
RPC message protocol could also change. Therefore, the call message also has
in it the RPC version number, which is always equal to two for the version of
RPC described here.
The reply message to a request message has enough information to distinguish
the following error conditions:
1. The remote implementation of RPC does not speak protocol version 2. The lowest and highest supported RPC version numbers are returned.

2. The remote program is not available on the remote system.

3. The remote program does not support the requested version number. The lowest and highest supported remote program version numbers are returned.

4. The requested procedure number does not exist. (This is usually a caller-side protocol or programming error.)

5. The parameters to the remote procedure appear to be garbage from the server's point of view. (Again, this is usually caused by a disagreement about the protocol between client and service.)

Authentication
Provisions for authentication of caller to service and vice-versa are provided as a
part of the RPC protocol. The call message has two authentication fields, the
credentials and verifier. The reply message has one authentication field, the
response verifier. The RPC protocol specification defines all three fields to be the
following opaque type:
enum auth_flavor {
   AUTH_NULL  = 0,
   AUTH_UNIX  = 1,
   AUTH_SHORT = 2,
   AUTH_DES   = 3
   /* and more to be defined */
};

struct opaque_auth {
   auth_flavor flavor;
   opaque body<400>;
};
In simple English, any opaque_auth structure is an auth_flavor enumeration followed by bytes which are opaque to the RPC protocol implementation. The interpretation and semantics of the data contained within the authentication fields is specified by individual, independent authentication protocol specifications. (See Authentication Protocols, below, for definitions of the various authentication protocols.)
If authentication parameters were rejected, the response message contains information stating why they were rejected.
Program Number Assignment
Program numbers are given out in groups of 0x20000000 (decimal 536870912) according to the following chart:
Program Numbers             Description
-------------------------   ---------------
       0 - 1fffffff         Defined by Sun
20000000 - 3fffffff         Defined by user
40000000 - 5fffffff         Transient
60000000 - 7fffffff         Reserved
80000000 - 9fffffff         Reserved
a0000000 - bfffffff         Reserved
c0000000 - dfffffff         Reserved
e0000000 - ffffffff         Reserved
The first group is a range of numbers administered by Sun Microsystems and
should be identical for all sites. The second range is for applications peculiar to a
particular site. This range is intended primarily for debugging new programs.
When a site develops an application that might be of general interest, that application should be given an assigned number in the first range. The third group is
for applications that generate program numbers dynamically. The final groups
are reserved for future use, and should not be used.
Other Uses of the RPC Protocol
The intended use of this protocol is for calling remote procedures. That is, each
call message is matched with a response message. However, the protocol itself is
a message-passing protocol with which other (non-RPC) protocols can be implemented. Sun currently uses, or perhaps abuses, the RPC message protocol for the
following two (non-RPC) protocols: batching (or pipelining) and broadcast RPC.
These two protocols are discussed but not defined below.
Batching
Batching allows a client to send an arbitrarily large sequence of call messages to
a server; batching typically uses reliable byte stream protocols (like TCP/IP) for its transport. In the case of batching, the client never waits for a reply from the server, and the server does not send replies to batch requests. A sequence of batch calls is usually terminated by a legitimate RPC in order to flush the pipeline (with positive acknowledgement).
Broadcast RPC
In broadcast RPC-based protocols, the client sends a broadcast packet to the network and waits for numerous replies. Broadcast RPC uses unreliable, packet-based protocols (like UDP/IP) as its transports. Servers that support broadcast protocols only respond when the request is successfully processed, and are silent in the face of errors. Broadcast RPC uses the Port Mapper RPC service to
achieve its semantics. See the Port Mapper Program Protocol, below, for more
information.
7.4. The RPC Message Protocol
This section defines the RPC message protocol in the XDR data description
language. The message is defined in a top-down style.
enum msg_type {
   CALL  = 0,
   REPLY = 1
};

/*
 * A reply to a call message can take on two forms:
 * The message was either accepted or rejected.
 */
enum reply_stat {
   MSG_ACCEPTED = 0,
   MSG_DENIED   = 1
};

/*
 * Given that a call message was accepted, the following is the
 * status of an attempt to call a remote procedure.
 */
enum accept_stat {
   SUCCESS       = 0, /* RPC executed successfully       */
   PROG_UNAVAIL  = 1, /* remote hasn't exported program  */
   PROG_MISMATCH = 2, /* remote can't support version #  */
   PROC_UNAVAIL  = 3, /* program can't support procedure */
   GARBAGE_ARGS  = 4  /* procedure can't decode params   */
};

/*
 * Reasons why a call message was rejected:
 */
enum reject_stat {
   RPC_MISMATCH = 0, /* RPC version number != 2          */
   AUTH_ERROR   = 1  /* remote can't authenticate caller */
};
/*
 * Why authentication failed:
 */
enum auth_stat {
   AUTH_BADCRED      = 1, /* bad credentials               */
   AUTH_REJECTEDCRED = 2, /* client must begin new session */
   AUTH_BADVERF      = 3, /* bad verifier                  */
   AUTH_REJECTEDVERF = 4, /* verifier expired or replayed  */
   AUTH_TOOWEAK      = 5  /* rejected for security reasons */
};
/*
 * The RPC message:
 * All messages start with a transaction identifier, xid,
 * followed by a two-armed discriminated union. The union's
 * discriminant is a msg_type which switches to one of the two
 * types of the message. The xid of a REPLY message always
 * matches that of the initiating CALL message. NB: The xid
 * field is only used for clients matching reply messages with
 * call messages or for servers detecting retransmissions; the
 * service side cannot treat this id as any type of sequence
 * number.
 */
struct rpc_msg {
   unsigned int xid;
   union switch (msg_type mtype) {
   case CALL:
      call_body cbody;
   case REPLY:
      reply_body rbody;
   } body;
};
/*
 * Body of an RPC request call:
 * In version 2 of the RPC protocol specification, rpcvers must
 * be equal to 2. The fields prog, vers, and proc specify the
 * remote program, its version number, and the procedure within
 * the remote program to be called. After these fields are two
 * authentication parameters: cred (authentication credentials)
 * and verf (authentication verifier). The two authentication
 * parameters are followed by the parameters to the remote
 * procedure, which are specified by the specific program
 * protocol.
 */
struct call_body {
   unsigned int rpcvers;  /* must be equal to two (2) */
   unsigned int prog;
   unsigned int vers;
   unsigned int proc;
   opaque_auth cred;
   opaque_auth verf;
   /* procedure specific parameters start here */
};
/*
 * Body of a reply to an RPC request:
 * The call message was either accepted or rejected.
 */
union reply_body switch (reply_stat stat) {
case MSG_ACCEPTED:
   accepted_reply areply;
case MSG_DENIED:
   rejected_reply rreply;
} reply;
/*
 * Reply to an RPC request that was accepted by the server:
 * there could be an error even though the request was accepted.
 * The first field is an authentication verifier that the server
 * generates in order to validate itself to the caller. It is
 * followed by a union whose discriminant is an enum
 * accept_stat. The SUCCESS arm of the union is protocol
 * specific. The PROG_UNAVAIL, PROC_UNAVAIL, and GARBAGE_ARGS
 * arms of the union are void. The PROG_MISMATCH arm specifies
 * the lowest and highest version numbers of the remote program
 * supported by the server.
 */
struct accepted_reply {
   opaque_auth verf;
   union switch (accept_stat stat) {
   case SUCCESS:
      opaque results[0];
      /* procedure-specific results start here */
   case PROG_MISMATCH:
      struct {
         unsigned int low;
         unsigned int high;
      } mismatch_info;
   default:
      /*
       * Void. Cases include PROG_UNAVAIL, PROC_UNAVAIL,
       * and GARBAGE_ARGS.
       */
      void;
   } reply_data;
};
/*
 * Reply to an RPC request that was rejected by the server:
 * The request can be rejected for two reasons: either the
 * server is not running a compatible version of the RPC
 * protocol (RPC_MISMATCH), or the server refuses to
 * authenticate the caller (AUTH_ERROR). In case of an RPC
 * version mismatch, the server returns the lowest and highest
 * supported RPC version numbers. In case of refused
 * authentication, failure status is returned.
 */
union rejected_reply switch (reject_stat stat) {
case RPC_MISMATCH:
   struct {
      unsigned int low;
      unsigned int high;
   } mismatch_info;
case AUTH_ERROR:
   auth_stat stat;
};
7.5. Authentication Protocols
As previously stated, authentication parameters are opaque, but open-ended to
the rest of the RPC protocol. This section defines some "flavors" of authentication implemented at (and supported by) Sun. Other sites are free to invent new
authentication types, with the same rules of flavor number assignment as there are
for program number assignment.
Null Authentication
Often calls must be made where the caller does not know who he is or the server
does not care who the caller is. In this case, the flavor value (the discriminant of
the opaque_auth's union) of the RPC message's credentials, verifier, and
response verifier is AUTH_NULL. The bytes of the opaque_auth's body are
undefined. It is recommended that the opaque length be zero.
UNIX Authentication
The caller of a remote procedure may wish to identify himself as he is identified
on a UNIX system. The value of the credential's discriminant of an RPC call
message is AUTH_UNIX. The bytes of the credential's opaque body encode the
following structure:
struct auth_unix {
unsigned int stamp;
string machinename<255>;
unsigned int uid;
unsigned int gid;
unsigned int gids<10>;
} ;
The stamp is an arbitrary ID which the caller machine may generate. The
machinename is the name of the caller's machine (like "krypton"). The uid is
the caller's effective user ID. The gid is the caller's effective group ID. The
gids is a counted array of groups which contain the caller as a member. The
verifier accompanying the credentials should be of AUTH_NULL (defined above).
The value of the discriminant of the response verifier received in the reply message from the server may be AUTH_NULL or AUTH_SHORT. In the case of
AUTH_SHORT, the bytes of the response verifier's string encode an opaque structure. This new opaque structure may now be passed to the server instead of the
original AUTH _UNIX flavor credentials. The server keeps a cache which maps
shorthand opaque structures (passed back by way of an AUTH_SHORT style
response verifier) to the original credentials of the caller. The caller can save network bandwidth and server CPU cycles by using the new credentials.
The server may flush the shorthand opaque structure at any time. If this happens,
the remote procedure call message will be rejected due to an authentication error.
The reason for the failure will be AUTH_REJECTEDCRED. At this point, the
caller may wish to try the original AUTH_UNIX style of credentials.
DES Authentication
UNIX authentication suffers from two major problems:
1. The naming is too UNIX-system oriented.
2. There is no verifier, so credentials can easily be faked.
DES authentication attempts to fix these two problems.
Naming
The first problem is handled by addressing the caller by a simple string of characters instead of by an operating system specific integer. This string of characters
is known as the "netname" or network name of the caller. The server is not
allowed to interpret the contents of the caller's name in any other way except to
identify the caller. Thus, netnames should be unique for every caller in the internet.
It is up to each operating system's implementation of DES authentication to gen-
erate netnames for its users that insure this uniqueness when they call upon
remote servers. Operating systems already know how to distinguish users local
to their systems. It is usually a simple matter to extend this mechanism to the
network. For example, a UNIX user at Sun with a user ID of 515 might be
assigned the following netname: "unix.515@sun.com". This netname contains
three items that serve to insure it is unique. Going backwards, there is only one
naming domain called "sun.com" in the internet. Within this domain, there is
only one UNIX user with user ID 515. However, there may be another user on
another operating system, for example VMS, within the same naming domain
that, by coincidence, happens to have the same user ID. To insure that these two
users can be distinguished we add the operating system name. So one user is
"[email protected]" and the other is "[email protected]".
The first field is actually a naming method rather than an operating system name.
It just happens that today there is almost a one-to-one correspondence between
naming methods and operating systems. If the world could agree on a naming
standard, the first field could be the name of that standard, instead of an operating
system name.
DES Authentication Verifiers
Unlike UNIX authentication, DES authentication does have a verifier so the
server can validate the client's credential (and vice-versa). The contents of this
verifier is primarily an encrypted timestamp. The server can decrypt this timestamp, and if it is close to what the real time is, then the client must have
encrypted it correctly. The only way the client could encrypt it correctly is to
know the "conversation key" of the RPC session. And if the client knows the
conversation key, then it must be the real client.
The conversation key is a DES [5] key which the client generates and notifies the
server of in its first RPC call. The conversation key is encrypted using a public
key scheme in this first transaction. The particular public key scheme used in
DES authentication is Diffie-Hellman [3] with 192-bit keys. The details of this
encryption method are described later.
The client and the server need the same notion of the current time in order for all
of this to work. If network time synchronization cannot be guaranteed, then the
client can synchronize with the server before beginning the conversation, perhaps
by consulting the Internet Time Server (TIME[4]).
The way a server determines if a client timestamp is valid is somewhat complicated. For any other transaction but the first, the server just checks for two things:

1. the timestamp is greater than the one previously seen from the same client.

2. the timestamp has not expired.
A timestamp is expired if the server's time is later than the sum of the client's
timestamp plus what is known as the client's "window". The "window" is a
number the client passes (encrypted) to the server in its first transaction. You can
think of it as a lifetime for the credential.
This explains everything but the first transaction. In the first transaction, the
server checks only that the timestamp has not expired. If this were all that was
done, though, it would be quite easy for the client to send random data in
place of the timestamp with a fairly good chance of succeeding. As an added
check, the client sends an encrypted item in the first transaction known as the
"window verifier", which must be equal to the window minus one, or the server
will reject the credential.

The client too must check the verifier returned from the server to be sure it is
legitimate. The server sends back to the client the encrypted timestamp it
received from the client, minus one second. If the client gets anything different
from this, it will reject it.
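The validity checks described above can be sketched as a small model. This is an illustrative Python sketch, not Sun's implementation; the function names are invented, and times are plain integer seconds (the real protocol works on decrypted DES timestamps).

```python
# Illustrative model of the DES authentication timestamp checks described
# above (function names are invented; all times are in seconds).

def first_transaction_ok(timestamp, window, winverf, server_time):
    """First transaction: the timestamp must be unexpired and the
    window verifier must equal the window minus 1."""
    unexpired = server_time <= timestamp + window
    return unexpired and winverf == window - 1

def later_transaction_ok(timestamp, prev_timestamp, window, server_time):
    """Later transactions: the timestamp must be greater than the one
    previously seen from this client, and must be unexpired."""
    return timestamp > prev_timestamp and server_time <= timestamp + window

def server_verifier_timestamp(client_timestamp):
    """The server echoes the client's timestamp minus one second."""
    return client_timestamp - 1
```

A client that receives anything other than `server_verifier_timestamp(t)` for the timestamp `t` it sent would reject the server's verifier.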
Nicknames and Clock Synchronization
After the first transaction, the server's DES authentication subsystem returns in
its verifier to the client an integer "nickname" which the client may use in its
further transactions instead of passing its netname, encrypted DES key and window every time. The nickname is most likely an index into a table on the server
which stores for each client its netname, decrypted DES key and window.
Though they originally were synchronized, the client's and server's clocks can
get out of sync again. When this happens the client RPC subsystem most likely
will get back RPC_AUTHERROR at which point it should resynchronize.
A client may still get the RPC_AUTHERROR error even though it is synchronized
with the server. The reason is that the server's nickname table is of limited size,
and it may flush entries whenever it wants. A client should resend its original
credential in this case and the server will give it a new nickname. If a server
crashes, the entire nickname table gets flushed, and all clients will have to resend
their original credentials.
DES Authentication Protocol (in XDR language)
/*
 * There are two kinds of credentials: one in which the client uses
 * its full network name, and one in which it uses its "nickname"
 * (just an unsigned integer) given to it by the server. The
 * client must use its fullname in its first transaction with the
 * server, in which the server will return to the client its
 * nickname. The client may use its nickname in all further
 * transactions with the server. There is no requirement to use the
 * nickname, but it is wise to use it for performance reasons.
 */
enum authdes_namekind {
    ADN_FULLNAME = 0,
    ADN_NICKNAME = 1
};
/*
 * A 64-bit block of encrypted DES data
 */
typedef opaque des_block[8];

/*
 * Maximum length of a network user's name
 */
const MAXNETNAMELEN = 255;
/*
 * A fullname contains the network name of the client, an encrypted
 * conversation key and the window. The window is actually a
 * lifetime for the credential. If the time indicated in the
 * verifier timestamp plus the window has passed, then the server
 * should expire the request and not grant it. To ensure that
 * requests are not replayed, the server should insist that
 * timestamps are greater than the previous one seen, unless it is
 * the first transaction. In the first transaction, the server
 * checks instead that the window verifier is one less than the
 * window.
 */
struct authdes_fullname {
    string name<MAXNETNAMELEN>;    /* name of client */
    des_block key;                 /* PK encrypted conversation key */
    unsigned int window;           /* encrypted window */
};
/*
 * A credential is either a fullname or a nickname
 */
union authdes_cred switch (authdes_namekind adc_namekind) {
case ADN_FULLNAME:
    authdes_fullname adc_fullname;
case ADN_NICKNAME:
    unsigned int adc_nickname;
};
/*
 * A timestamp encodes the time since midnight, January 1, 1970.
 */
struct timestamp {
    unsigned int seconds;     /* seconds */
    unsigned int useconds;    /* and microseconds */
};
/*
 * Verifier: client variety
 * The window verifier is only used in the first transaction. In
 * conjunction with a fullname credential, these items are packed
 * into the following structure before being encrypted:
 *
 * struct {
 *     adv_timestamp;          -- one DES block
 *     adc_fullname.window;    -- one half DES block
 *     adv_winverf;            -- one half DES block
 * }
 *
 * This structure is encrypted using CBC mode encryption with an
 * input vector of zero. All other encryptions of timestamps use
 * ECB mode encryption.
 */
struct authdes_verf_clnt {
    timestamp adv_timestamp;     /* encrypted timestamp */
    unsigned int adv_winverf;    /* encrypted window verifier */
};
/*
 * Verifier: server variety
 * The server returns (encrypted) the same timestamp the client
 * gave it minus one second. It also tells the client its nickname
 * to be used in future transactions (unencrypted).
 */
struct authdes_verf_svr {
    timestamp adv_timeverf;      /* encrypted verifier */
    unsigned int adv_nickname;   /* new nickname for client */
};
Diffie-Hellman Encryption
In this scheme, there are two constants, BASE and MODULUS. The particular
values Sun has chosen for these for the DES authentication protocol are:
const BASE = 3;
const MODULUS = "d4a0ba0250b6fd2ec626e7efd637df76c716e22d09441
The way this scheme works is best explained by an example. Suppose there are
two people "A" and "B" who want to send encrypted messages to each other. So,
A and B both generate "secret" keys at random which they do not reveal to anyone. Let these keys be represented as SK(A) and SK(B). They also publish in a
public directory their "public" keys. These keys are computed as follows:
    PK(A) = ( BASE ** SK(A) ) mod MODULUS
    PK(B) = ( BASE ** SK(B) ) mod MODULUS
The "**" notation is used here to represent exponentiation. Now, both A and B
can arrive at the "common" key between them, represented here as CK(A, B),
without revealing their secret keys.
A computes:

    CK(A, B) = ( PK(B) ** SK(A) ) mod MODULUS

while B computes:

    CK(A, B) = ( PK(A) ** SK(B) ) mod MODULUS

These two can be shown to be equivalent:

    ( PK(B) ** SK(A) ) mod MODULUS = ( PK(A) ** SK(B) ) mod MODULUS
We drop the "mod MODULUS" parts and assume modulo arithmetic to simplify
things:

    PK(B) ** SK(A) = PK(A) ** SK(B)

Then, replace PK(B) by what B computed earlier, and likewise for PK(A):

    ( BASE ** SK(B) ) ** SK(A) = ( BASE ** SK(A) ) ** SK(B)

which leads to:

    BASE ** ( SK(A) * SK(B) ) = BASE ** ( SK(A) * SK(B) )
This common key CK(A, B) is not used to encrypt the timestamps used in the
protocol. Rather, it is used only to encrypt a conversation key which is then used
to encrypt the timestamps. The reason for doing this is to use the common key as
little as possible, for fear that it could be broken. Breaking the conversation key
is a far less serious offense, since conversations are relatively short-lived.
The conversation key is encrypted using 56-bit DES keys, yet the common key is
192 bits. To reduce the number of bits, 56 bits are selected from the common
key as follows. The middle-most 8 bytes are selected from the common key, and
then parity is added to the lower-order bit of each byte, producing a 56-bit key
with 8 bits of parity.
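That reduction can be sketched as follows. This is an assumption-laden illustration: the text does not say which parity Sun uses, so odd parity (the usual DES convention) is assumed here, and the helper name is invented.

```python
# Reduce the 24-byte (192-bit) common key to an 8-byte DES key: take the
# middle-most 8 bytes, then overwrite each byte's low-order bit so the
# byte has odd parity (odd parity is an assumption; it is the DES
# convention, but the text does not specify it).

def common_to_des_key(common_key: bytes) -> bytes:
    assert len(common_key) == 24          # 192 bits
    middle = common_key[8:16]             # the middle-most 8 bytes
    out = bytearray()
    for byte in middle:
        high7 = byte & 0xFE               # the 7 key bits of this byte
        ones = bin(high7).count("1")
        out.append(high7 | (ones % 2 == 0))   # low bit makes popcount odd
    return bytes(out)
```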
7.6. Record Marking Standard

When RPC messages are passed on top of a byte stream protocol (like TCP/IP), it
is necessary, or at least desirable, to delimit one message from another in order to
detect and possibly recover from user protocol errors. This is called record
marking (RM). Sun uses this RM/TCP/IP transport for passing RPC messages on
TCP streams. One RPC message fits into one RM record.
A record is composed of one or more record fragments. A record fragment is a
four-byte header followed by 0 to (2**31) - 1 bytes of fragment data. The bytes
encode an unsigned binary number; as with XDR integers, the byte order is from
highest to lowest. The number encodes two values: a boolean which indicates
whether the fragment is the last fragment of the record (bit value 1 implies the
fragment is the last fragment) and a 31-bit unsigned binary value which is the
length in bytes of the fragment's data. The boolean value is the highest-order bit
of the header; the length is the 31 low-order bits. (Note that this record
specification is NOT in XDR standard form!)
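The header layout described above can be packed and unpacked like this; an illustrative Python sketch with invented function names:

```python
import struct

# Record-marking header: four bytes, highest byte first. The top bit
# flags the last fragment of a record; the low 31 bits hold the length
# in bytes of the fragment's data.

def pack_rm_header(length: int, last: bool) -> bytes:
    assert 0 <= length < 2 ** 31
    return struct.pack(">I", length | (0x80000000 if last else 0))

def unpack_rm_header(header: bytes):
    """Return (length, is_last_fragment) from a four-byte header."""
    (word,) = struct.unpack(">I", header)
    return word & 0x7FFFFFFF, bool(word & 0x80000000)
```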
7.7. The RPC Language
Just as there was a need to describe the XDR data types in a formal language,
there is also a need to describe the procedures that operate on these XDR data
types in a formal language as well. We use the RPC Language for this purpose.
It is an extension to the XDR language. The following example is used to
describe the essence of the language.
An Example Service Described in the RPC Language
Here is an example of the specification of a simple ping program.
/*
 * Simple ping program
 */
program PING_PROG {
    /* Latest and greatest version */
    version PING_VERS_PINGBACK {
        void
        PINGPROC_NULL(void) = 0;

        /*
         * Ping the caller, return the round-trip time
         * (in microseconds). Returns -1 if the operation
         * timed out.
         */
        int
        PINGPROC_PINGBACK(void) = 1;
    } = 2;

    /*
     * Original version
     */
    version PING_VERS_ORIG {
        void
        PINGPROC_NULL(void) = 0;
    } = 1;
} = 1;

const PING_VERS = 2;      /* latest version */
The first version described is PING_VERS_PINGBACK with two procedures,
PINGPROC_NULL and PINGPROC_PINGBACK. PINGPROC_NULL takes no
arguments and returns no results, but it is useful for computing round-trip times
from the client to the server and back again. By convention, procedure 0 of any
RPC protocol should have the same semantics, and never require any kind of
authentication. The second procedure is used for the client to have the server do
a reverse ping operation back to the client, and it returns the amount of time (in
microseconds) that the operation used. The next version, PING_VERS_ORIG,
is the original version of the protocol and it does not contain the
PINGPROC_PINGBACK procedure. It is useful for compatibility with old client
programs, and as this program matures it may be dropped from the protocol
entirely.
The RPC Language Specification
The RPC language is identical to the XDR language, except for the added
definition of a program-def described below.
program-def:
    "program" identifier "{"
        version-def
        version-def *
    "}" "=" constant ";"

version-def:
    "version" identifier "{"
        procedure-def
        procedure-def *
    "}" "=" constant ";"

procedure-def:
    type-specifier identifier "(" type-specifier ")"
    "=" constant ";"
Syntax Notes

1. The following keywords are added and cannot be used as identifiers:
   "program" and "version".

2. A version name cannot occur more than once within the scope of a program
   definition. Nor can a version number occur more than once within the scope
   of a program definition.

3. A procedure name cannot occur more than once within the scope of a
   version definition. Nor can a procedure number occur more than once within
   the scope of a version definition.

4. Program identifiers are in the same name space as constant and type
   identifiers.

5. Only unsigned constants can be assigned to programs, versions and
   procedures.

7.8. Port Mapper Program Protocol
The port mapper program maps RPC program and version numbers to transport-specific port numbers. This program makes dynamic binding of remote programs possible.
This is desirable because the range of reserved port numbers is very small and the
number of potential remote programs is very large. By running only the port
mapper on a reserved port, the port numbers of other remote programs can be
ascertained by querying the port mapper.
The port mapper also aids in broadcast RPC. A given RPC program will usually
have different port number bindings on different machines, so there is no way to
directly broadcast to all of these programs. The port mapper, however, does have
a fixed port number. So, to broadcast to a given program, the client actually
sends its message to the port mapper located at the broadcast address. Each port
mapper that picks up the broadcast then calls the local service specified by the
client. When the port mapper gets the reply from the local service, it sends the
reply on back to the client.
Port Mapper Protocol Specification (in RPC Language)

const PMAP_PORT = 111;      /* portmapper port number */
/*
 * A mapping of (program, version, protocol) to port number
 */
struct mapping {
    unsigned int prog;
    unsigned int vers;
    unsigned int prot;
    unsigned int port;
};
/*
 * Supported values for the "prot" field
 */
const IPPROTO_TCP = 6;       /* protocol number for TCP/IP */
const IPPROTO_UDP = 17;      /* protocol number for UDP/IP */
/*
 * A list of mappings
 */
struct *pmaplist {
    mapping map;
    pmaplist next;
};
/*
 * Arguments to callit
 */
struct call_args {
    unsigned int prog;
    unsigned int vers;
    unsigned int proc;
    opaque args<>;
};
/*
 * Results of callit
 */
struct call_result {
    unsigned int port;
    opaque res<>;
};
/*
 * Port mapper procedures
 */
program PMAP_PROG {
    version PMAP_VERS {
        void
        PMAPPROC_NULL(void)         = 0;

        bool
        PMAPPROC_SET(mapping)       = 1;

        bool
        PMAPPROC_UNSET(mapping)     = 2;

        unsigned int
        PMAPPROC_GETPORT(mapping)   = 3;

        pmaplist
        PMAPPROC_DUMP(void)         = 4;

        call_result
        PMAPPROC_CALLIT(call_args)  = 5;
    } = 2;
} = 100000;

Port Mapper Operation
The portmapper program currently supports two protocols (UDP/IP and TCP/IP).
The portmapper is contacted by talking to it on assigned port number 111
(SUNRPC [8]) on either of these protocols. The following is a description of
each of the portmapper procedures:
PMAPPROC_NULL:
This procedure does no work. By convention, procedure zero of any protocol
takes no parameters and returns no results.
PMAPPROC_SET:
When a program first becomes available on a machine, it registers itself with
the port mapper program on the same machine. The program passes its program
number "prog", version number "vers", transport protocol number
"prot", and the port "port" on which it awaits service requests. The procedure
returns a boolean response whose value is TRUE if the procedure successfully
established the mapping and FALSE otherwise. The procedure refuses
to establish a mapping if one already exists for the tuple "(prog, vers, prot)".
PMAPPROC_UNSET:
When a program becomes unavailable, it should unregister itself with the
port mapper program on the same machine. The parameters and results have
meanings identical to those of PMAPPROC_SET. The protocol and port
number fields of the argument are ignored.
PMAPPROC_GETPORT:
Given a program number "prog", version number "vers", and transport protocol
number "prot", this procedure returns the port number on which the program
is awaiting call requests. A port value of zero means the program has
not been registered. The "port" field of the argument is ignored.
PMAPPROC_DUMP:
This procedure enumerates all entries in the port mapper's database. The
procedure takes no parameters and returns a list of program, version, protocol, and port values.
PMAPPROC_CALLIT:
This procedure allows a caller to call another remote procedure on the same
machine without knowing the remote procedure's port number. It is
intended for supporting broadcasts to arbitrary remote programs via the
well-known port mapper's port. The parameters "prog", "vers", "proc", and
the bytes of "args" are the program number, version number, procedure
number, and parameters of the remote procedure. Note:

1. This procedure only sends a response if the procedure was successfully
   executed and is silent (no response) otherwise.

2. The port mapper communicates with the remote program using UDP/IP
   only.
The procedure returns the remote program's port number, and the bytes of results
are the results of the remote procedure.
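The SET, UNSET, and GETPORT semantics described above can be modeled with a simple in-memory table, keyed on the (prog, vers, prot) tuple. This is a toy Python sketch of the behavior, not Sun's code; the class and method names are invented.

```python
# Toy model of the port mapper database semantics described above.
# Mappings are keyed on (prog, vers, prot); the value is the port.

class PortMapperModel:
    def __init__(self):
        self.db = {}                          # (prog, vers, prot) -> port

    def set(self, prog, vers, prot, port):
        """Register a mapping; refused (FALSE) if one already exists
        for the tuple (prog, vers, prot)."""
        if (prog, vers, prot) in self.db:
            return False
        self.db[(prog, vers, prot)] = port
        return True

    def unset(self, prog, vers, prot=0, port=0):
        """Unregister; the protocol and port arguments are ignored,
        as in PMAPPROC_UNSET."""
        victims = [k for k in self.db if k[:2] == (prog, vers)]
        for k in victims:
            del self.db[k]
        return bool(victims)

    def getport(self, prog, vers, prot):
        """Return the registered port; zero means not registered."""
        return self.db.get((prog, vers, prot), 0)
```

For example, registering NFS (program 100003, version 2) on UDP (protocol 17) at port 2049 succeeds once; a second registration for the same tuple is refused, and a GETPORT for an unregistered protocol returns zero.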
7.9. References

[1] Birrell, Andrew D. & Nelson, Bruce Jay; "Implementing Remote Procedure
Calls"; XEROX CSL-83-7, October 1983.

[2] Cheriton, D.; "VMTP: Versatile Message Transaction Protocol", Preliminary
Version 0.3; Stanford University, January 1987.

[3] Diffie & Hellman; "New Directions in Cryptography"; IEEE Transactions on
Information Theory IT-22, November 1976.

[4] Harrenstien, K.; "Time Server", RFC 738; Information Sciences Institute,
October 1977.

[5] National Bureau of Standards; "Data Encryption Standard"; Federal
Information Processing Standards Publication 46, January 1977.

[6] Postel, J.; "Transmission Control Protocol - DARPA Internet Program
Protocol Specification", RFC 793; Information Sciences Institute, September 1981.

[7] Postel, J.; "User Datagram Protocol", RFC 768; Information Sciences
Institute, August 1980.

[8] Reynolds, J. & Postel, J.; "Assigned Numbers", RFC 923; Information
Sciences Institute, October 1984.
Chapter 8 - Network File System: Version 2 Protocol Specification
8.1. Status of this Standard
Note: This chapter specifies a protocol that Sun Microsystems, Inc., and others
are using. It specifies it in standard ARPA RFC form.
8.2. Introduction
The Sun Network Filesystem (NFS) protocol provides transparent remote access
to shared filesystems over local area networks. The NFS protocol is designed to
be machine, operating system, network architecture, and transport protocol
independent. This independence is achieved through the use of Remote Procedure Call (RPC) primitives built on top of an External Data Representation
(XDR). Implementations exist for a variety of machines, from personal computers to supercomputers.
The supporting mount protocol allows the server to hand out remote access
privileges to a restricted set of clients. It performs the operating system-specific
functions that allow, for example, attaching remote directory trees to some local
file system.
Remote Procedure Call
Sun's remote procedure call specification provides a procedure-oriented interface to remote services. Each server supplies a program that is a set of procedures. NFS is one such "program". The combination of host address, program
number, and procedure number specifies one remote service procedure. RPC
does not depend on services provided by specific protocols, so it can be used with
any underlying transport protocol. See the Remote Procedure Calls: Protocol
Specification chapter of this manual.
External Data Representation
The External Data Representation (XDR) standard provides a common way of
representing a set of data types over a network. The NFS Protocol Specification
is written using the RPC data description language. For more information, see
the External Data Representation Standard: Protocol Specification chapter of
this manual. Sun provides implementations of XDR and RPC, but NFS does not
require their use. Any software that provides equivalent functionality can be
used, and if the encoding is exactly the same it can interoperate with other implementations of NFS.
Stateless Servers
The NFS protocol is stateless. That is, a server does not need to maintain any
extra state information about any of its clients in order to function correctly.
Stateless servers have a distinct advantage over stateful servers in the event of a
failure. With stateless servers, a client need only retry a request until the server
responds; it does not even need to know that the server has crashed or that the
network temporarily went down. The client of a stateful server, on the other
hand, needs to either detect a server crash and rebuild the server's state when it
comes back up, or cause client operations to fail.
This may not sound like an important issue, but it affects the protocol in some
unexpected ways. We feel that it is worth a bit of extra complexity in the protocol to be able to write very simple servers that do not require fancy crash
recovery.
On the other hand, NFS deals with objects such as files and directories that
inherently have state -- what good would a file be if it did not keep its contents
intact? The goal is to not introduce any extra state in the protocol itself. Another
way to simplify recovery is by making operations "idempotent" whenever possible (so that they can potentially be repeated).
8.3. NFS Protocol Definition

Servers have been known to change over time, and so can the protocol that they
use. So RPC provides a version number with each RPC request. This RFC
describes version two of the NFS protocol. Even in the second version, there are
various obsolete procedures and parameters, which will be removed in later
versions. An RFC for version three of the NFS protocol is currently under
preparation.
File System Model
NFS assumes a file system that is hierarchical, with directories as all but the
bottom-level files. Each entry in a directory (file, directory, device, etc.) has a
string name. Different operating systems may have restrictions on the depth of
the tree or the names used, as well as using different syntax to represent the
"pathname", which is the concatenation of all the "components" (directory and
file names) in the name. A "file system" is a tree on a single server (usually a
single disk or physical partition) with a specified "root". Some operating systems
provide a "mount" operation to make all file systems appear as a single tree,
while others maintain a "forest" of file systems. Files are unstructured streams of
uninterpreted bytes. Version 3 of NFS uses a slightly more general file system
model.
NFS looks up one component of a pathname at a time. It may not be obvious
why it does not just take the whole pathname, traipse down the directories, and
return a file handle when it is done. There are several good reasons not to do
this. First, pathnames need separators between the directory components, and
different operating systems use different separators. We could define a Network
Standard Pathname Representation, but then every pathname would have to be
parsed and converted at each end. Other issues are discussed in NFS Implementation Issues below.
Although files and directories are similar objects in many ways, different procedures are used to read directories and files. This provides a network standard
format for representing directories. The same argument as above could have
been used to justify a procedure that returns only one directory entry per call.
The problem is efficiency. Directories can contain many entries, and a remote
call to return each would be just too slow.
RPC Information
Authentication
The NFS service uses AUTH_UNIX, AUTH_DES, or AUTH_SHORT style
authentication, except in the NULL procedure where AUTH_NONE is also
allowed.
Transport Protocols
NFS currently is supported on UDP/IP only.
Port Number
The NFS protocol currently uses the UDP port number 2049. This is not an
officially assigned port, so later versions of the protocol use the "Portmapping" facility of RPC.
Sizes of XDR Structures
These are the sizes, given in decimal bytes, of various XDR structures used in the
protocol:
/* The maximum number of bytes of data in a READ or WRITE request */
const MAXDATA = 8192;

/* The maximum number of bytes in a pathname argument */
const MAXPATHLEN = 1024;

/* The maximum number of bytes in a file name argument */
const MAXNAMLEN = 255;

/* The size in bytes of the opaque "cookie" passed by READDIR */
const COOKIESIZE = 4;

/* The size in bytes of the opaque file handle */
const FHSIZE = 32;
Basic Data Types
The following XDR definitions are basic structures and types used in other structures described further on.
stat
enum stat {
    NFS_OK = 0,
    NFSERR_PERM = 1,
    NFSERR_NOENT = 2,
    NFSERR_IO = 5,
    NFSERR_NXIO = 6,
    NFSERR_ACCES = 13,
    NFSERR_EXIST = 17,
    NFSERR_NODEV = 19,
    NFSERR_NOTDIR = 20,
    NFSERR_ISDIR = 21,
    NFSERR_FBIG = 27,
    NFSERR_NOSPC = 28,
    NFSERR_ROFS = 30,
    NFSERR_NAMETOOLONG = 63,
    NFSERR_NOTEMPTY = 66,
    NFSERR_DQUOT = 69,
    NFSERR_STALE = 70,
    NFSERR_WFLUSH = 99
};
The stat type is returned with every procedure's results. A value of
NFS_OK indicates that the call completed successfully and the results are valid.
The other values indicate some kind of error occurred on the server side during
the servicing of the procedure. The error values are derived from UNIX error
numbers.
NFSERR_PERM:
Not owner. The caller does not have correct ownership to perform the
requested operation.

NFSERR_NOENT:
No such file or directory. The file or directory specified does not exist.

NFSERR_IO:
Some sort of hard error occurred when the operation was in progress. This
could be a disk error, for example.

NFSERR_NXIO:
No such device or address.

NFSERR_ACCES:
Permission denied. The caller does not have the correct permission to perform
the requested operation.

NFSERR_EXIST:
File exists. The file specified already exists.

NFSERR_NODEV:
No such device.

NFSERR_NOTDIR:
Not a directory. The caller specified a non-directory in a directory operation.
NFSERR_ISDIR:
Is a directory. The caller specified a directory in a non-directory operation.

NFSERR_FBIG:
File too large. The operation caused a file to grow beyond the server's limit.

NFSERR_NOSPC:
No space left on device. The operation caused the server's filesystem to
reach its limit.

NFSERR_ROFS:
Read-only filesystem. Write attempted on a read-only filesystem.

NFSERR_NAMETOOLONG:
File name too long. The file name in an operation was too long.

NFSERR_NOTEMPTY:
Directory not empty. Attempted to remove a directory that was not empty.

NFSERR_DQUOT:
Disk quota exceeded. The client's disk quota on the server has been
exceeded.

NFSERR_STALE:
The "fhandle" given in the arguments was invalid. That is, the file referred
to by that file handle no longer exists, or access to it has been revoked.

NFSERR_WFLUSH:
The server's write cache used in the WRITECACHE call got flushed to disk.
ftype
enum ftype {
    NFNON = 0,
    NFREG = 1,
    NFDIR = 2,
    NFBLK = 3,
    NFCHR = 4,
    NFLNK = 5
};
The enumeration ftype gives the type of a file. The type NFNON indicates a
non-file, NFREG is a regular file, NFDIR is a directory, NFBLK is a block-special
device, NFCHR is a character-special device, and NFLNK is a symbolic link.
fhandle
typedef opaque fhandle[FHSIZE];
The fhandle is the file handle passed between the server and the client. All
file operations are done using file handles to refer to a file or directory. The file
handle can contain whatever information the server needs to distinguish an individual file.
timeval
struct timeval {
    unsigned int seconds;
    unsigned int useconds;
};

The timeval structure gives the number of seconds and microseconds since
midnight January 1, 1970, Greenwich Mean Time. It is used to pass time and
date information.
fattr
struct fattr {
    ftype        type;
    unsigned int mode;
    unsigned int nlink;
    unsigned int uid;
    unsigned int gid;
    unsigned int size;
    unsigned int blocksize;
    unsigned int rdev;
    unsigned int blocks;
    unsigned int fsid;
    unsigned int fileid;
    timeval      atime;
    timeval      mtime;
    timeval      ctime;
};
The fattr structure contains the attributes of a file; "type" is the type of the
file; "nlink" is the number of hard links to the file (the number of different names
for the same file); "uid" is the user identification number of the owner of the file;
"gid" is the group identification number of the group of the file; "size" is the size
in bytes of the file; "blocksize" is the size in bytes of a block of the file; "rdev" is
the device number of the file if it is type NFCHR or NFBLK; "blocks" is the
number of blocks the file takes up on disk; "fsid" is the file system identifier for
the filesystem containing the file; "fileid" is a number that uniquely identifies the
file within its filesystem; "atime" is the time when the file was last accessed for
either read or write; "mtime" is the time when the file data was last modified
(written); and "ctime" is the time when the status of the file was last changed.
Writing to the file also changes "ctime" if the size of the file changes.
"mode" is the access mode encoded as a set of bits. Notice that the file type is
specified both in the mode bits and in the file type. This is really a bug in the
protocol and will be fixed in future versions. The descriptions given below
specify the bit positions using octal numbers.
Bit       Description
0040000   This is a directory; "type" field should be NFDIR.
0020000   This is a character special file; "type" field should be NFCHR.
0060000   This is a block special file; "type" field should be NFBLK.
0100000   This is a regular file; "type" field should be NFREG.
0120000   This is a symbolic link file; "type" field should be NFLNK.
0140000   This is a named socket; "type" field should be NFNON.
0004000   Set user id on execution.
0002000   Set group id on execution.
0001000   Save swapped text even after use.
0000400   Read permission for owner.
0000200   Write permission for owner.
0000100   Execute and search permission for owner.
0000040   Read permission for group.
0000020   Write permission for group.
0000010   Execute and search permission for group.
0000004   Read permission for others.
0000002   Write permission for others.
0000001   Execute and search permission for others.
Notes:
The bits are the same as the mode bits returned by the stat (2) system call
in the UNIX system. The file type is specified both in the mode bits and in
the file type. This is fixed in future versions.
The "rdev" field in the attributes structure is an operating system specific
device specifier. It will be removed and generalized in the next revision of
the protocol.
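As a worked example, a mode word can be decoded against the table above. This is an illustrative Python sketch; the helper name and output format are invented for the example.

```python
# Decode the octal mode bits from the table above. The top bits
# (mask 0o170000) carry the file type; the low nine bits are rwx
# permissions for owner, group, and others.

FILE_TYPES = {
    0o040000: "NFDIR", 0o020000: "NFCHR", 0o060000: "NFBLK",
    0o100000: "NFREG", 0o120000: "NFLNK", 0o140000: "NFNON (socket)",
}

def describe_mode(mode: int):
    ftype = FILE_TYPES.get(mode & 0o170000, "unknown")
    bits = [0o400, 0o200, 0o100, 0o040, 0o020, 0o010, 0o004, 0o002, 0o001]
    perms = "".join(f if mode & b else "-" for b, f in zip(bits, "rwxrwxrwx"))
    return ftype, perms
```

For instance, a mode of 0100644 decodes to a regular file (NFREG) with "rw-r--r--" permissions.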
sattr
struct sattr {
    unsigned int mode;
    unsigned int uid;
    unsigned int gid;
    unsigned int size;
    timeval      atime;
    timeval      mtime;
};
The sattr structure contains the file attributes which can be set from the client.
The fields are the same as for fattr above. A "size" of zero means the file
should be truncated. A value of -1 indicates a field that should be ignored.
filename
typedef string filename<MAXNAMLEN>;
The type filename is used for passing file names or pathname components.
path
typedef string path<MAXPATHLEN>;
The type path is a pathname. The server considers it as a string with no internal
structure, but to the client it is the name of a node in a file system tree.
attrstat
union attrstat switch (stat status) {
case NFS_OK:
    fattr attributes;
default:
    void;
};

The attrstat structure is a common procedure result. It contains a "status"
and, if the call succeeded, it also contains the attributes of the file on which the
operation was done.
diropargs
struct diropargs {
    fhandle dir;
    filename name;
};
The diropargs structure is used in directory operations. The "fhandle" "dir"
is the directory in which to find the file "name". A directory operation is one in
which the directory is affected.
diropres
union diropres switch (stat status) {
case NFS_OK:
    struct {
        fhandle file;
        fattr   attributes;
    } diropok;
default:
    void;
};
The results of a directory operation are returned in a diropres structure. If the
call succeeded, a new file handle "file" and the "attributes" associated with that
file are returned along with the "status".
Server Procedures
The protocol definition is given as a set of procedures with arguments and results
defined using the RPC language. A brief description of the function of each
procedure should provide enough information to allow implementation.
All of the procedures in the NFS protocol are assumed to be synchronous. When
a procedure returns to the client, the client can assume that the operation has
completed and any data associated with the request is now on stable storage. For
example, a client WRITE request may cause the server to update data blocks,
filesystem information blocks (such as indirect blocks), and file attribute information (size and modify times). When the WRITE returns to the client, it can
assume that the write is safe, even in case of a server crash, and it can discard the
data written. This is a very important part of the statelessness of the server. If
the server waited to flush data from remote requests, the client would have to
save those requests so that it could resend them in case of a server crash.
/*
 * Remote file service routines
 */
program NFS_PROGRAM {
    version NFS_VERSION {
        void        NFSPROC_NULL(void) = 0;
        attrstat    NFSPROC_GETATTR(fhandle) = 1;
        attrstat    NFSPROC_SETATTR(sattrargs) = 2;
        void        NFSPROC_ROOT(void) = 3;
        diropres    NFSPROC_LOOKUP(diropargs) = 4;
        readlinkres NFSPROC_READLINK(fhandle) = 5;
        readres     NFSPROC_READ(readargs) = 6;
        void        NFSPROC_WRITECACHE(void) = 7;
        attrstat    NFSPROC_WRITE(writeargs) = 8;
        diropres    NFSPROC_CREATE(createargs) = 9;
        stat        NFSPROC_REMOVE(diropargs) = 10;
        stat        NFSPROC_RENAME(renameargs) = 11;
        stat        NFSPROC_LINK(linkargs) = 12;
        stat        NFSPROC_SYMLINK(symlinkargs) = 13;
        diropres    NFSPROC_MKDIR(createargs) = 14;
        stat        NFSPROC_RMDIR(diropargs) = 15;
        readdirres  NFSPROC_READDIR(readdirargs) = 16;
        statfsres   NFSPROC_STATFS(fhandle) = 17;
    } = 2;
} = 100003;
Do Nothing
void
NFSPROC_NULL(void) = 0;
This procedure does no work. It is made available in all RPC services to allow
server response testing and timing.
Get File Attributes
attrstat
NFSPROC_GETATTR(fhandle) = 1;
If the reply status is NFS_OK, then the reply attributes contains the attributes for
the file given by the input fhandle.
Set File Attributes
struct sattrargs {
fhandle file;
sattr attributes;
};
attrstat
NFSPROC_SETATTR(sattrargs) = 2;
The "attributes" argument contains fields which are either -1 or are the new value
for the attributes of "file". If the reply status is NFS_OK, then the reply attributes
have the attributes of the file after the "SETATTR" operation has completed.
Note: The use of -1 to indicate an unused field in "attributes" is changed in the
next version of the protocol.
Get Filesystem Root
void
NFSPROC_ROOT(void) = 3;
Obsolete. This procedure is no longer used because finding the root file handle
of a filesystem requires moving pathnames between client and server. To do this
right we would have to define a network standard representation of pathnames.
Instead, the function of looking up the root file handle is done by the
MNTPROC_MNT () procedure. (See the Mount Protocol Definition below for
details).
Look Up File Name
diropres
NFSPROC_LOOKUP(diropargs) = 4;
If the reply "status" is NFS_OK, then the reply "file" and reply "attributes" are the
file handle and attributes for the file "name" in the directory given by "dir" in the
argument.
Read From Symbolic Link
union readlinkres switch (stat status) {
case NFS_OK:
    path data;
default:
    void;
};
readlinkres
NFSPROC_READLINK(fhandle) = 5;
If "status" has the value NFS_OK, then the reply "data" is the data in the symbolic link given by the file referred to by the fhandle argument.
Note: since NFS always parses pathnames on the client, the pathname in a symbolic link may mean something different (or be meaningless) on a different client
or on the server if a different pathname syntax is used.
Read From File
struct readargs {
fhandle file;
unsigned offset;
unsigned count;
unsigned totalcount;
} ;
union readres switch (stat status) {
case NFS_OK:
    fattr attributes;
    opaque data<NFS_MAXDATA>;
default:
    void;
};
readres
NFSPROC_READ(readargs) = 6;
Returns up to "count" bytes of "data" from the file given by "file", starting at
"offset" bytes from the beginning of the file. The first byte of the file is at offset
zero. The file attributes after the read takes place are returned in "attributes".
Note: The argument "totalcount" is unused, and is removed in the next protocol
revision.
Write to Cache
void
NFSPROC_WRITECACHE(void) = 7;
To be used in the next protocol revision.
Write to File
struct writeargs {
fhandle file;
unsigned beginoffset;
unsigned offset;
unsigned totalcount;
opaque data<NFS_MAXDATA>;
};
attrstat
NFSPROC_WRITE(writeargs) = 8;
Writes "data" beginning "offset" bytes from the beginning of "file". The first
byte of the file is at offset zero. If the reply "status" is NFS_OK, then the reply
"attributes" contains the attributes of the file after the write has completed. The
write operation is atomic. Data from this call to WRITE will not be mixed with
data from another client's calls.
Note: The arguments "beginoffset" and "totalcount" are ignored and are removed
in the next protocol revision.
Create File
struct createargs {
diropargs where;
sattr attributes;
};
diropres
NFSPROC_CREATE(createargs) = 9;
The file "name" is created in the directory given by "dir". The initial attributes of
the new file are given by "attributes". A reply "status" of NFS_OK indicates that
the file was created, and reply "file" and reply "attributes" are its file handle and
attributes. Any other reply "status" means that the operation failed and no file
was created.
Note: This routine should pass an exclusive create flag, meaning "create the file
only if it is not already there".
Remove File
stat
NFSPROC_REMOVE(diropargs) = 10;
The file "name" is removed from the directory given by "dir". A reply of
NFS_OK means the directory entry was removed.
Note: possibly non-idempotent operation.
Rename File
struct renameargs {
diropargs from;
diropargs to;
};
stat
NFSPROC_RENAME(renameargs) = 11;
The existing file "from.name" in the directory given by "from.dir" is renamed to
"to.name" in the directory given by "to.dir". If the reply is NFS_OK, the file was
renamed. The RENAME operation is atomic on the server; it cannot be interrupted in the middle.
Note: possibly non-idempotent operation.
Create Link to File
struct linkargs {
fhandle from;
diropargs to;
} ;
stat
NFSPROC_LINK(linkargs) = 12;
Creates the file "to.name" in the directory given by "to.dir", which is a hard link
to the existing file given by "from". If the return value is NFS_OK, a link was
created. Any other return value indicates an error, and the link was not created.
A hard link should have the property that changes to either of the linked files are
reflected in both files. When a hard link is made to a file, the attributes for the
file should have a value for "nlink" that is one greater than the value before the
link.
Note: possibly non-idempotent operation.
Create Symbolic Link
struct symlinkargs {
diropargs from;
path to;
sattr attributes;
} ;
stat
NFSPROC_SYMLINK(symlinkargs) = 13;
Creates the file "from.name" with ftype NFLNK in the directory given by
"from.dir". The new file contains the pathname "to" and has initial attributes
given by "attributes". If the return value is NFS_OK, a link was created. Any
other return value indicates an error, and the link was not created.
A symbolic link is a pointer to another file. The name given in "to" is not interpreted by the server, only stored in the newly created file. When the client references a file that is a symbolic link, the contents of the symbolic link are normally
transparently reinterpreted as a pathname to substitute. A READLINK operation
returns the data to the client for interpretation.
Note: On UNIX servers the attributes are never used, since symbolic links always
have mode 0777.
Create Directory
diropres
NFSPROC_MKDIR(createargs) = 14;
The new directory "where.name" is created in the directory given by "where.dir".
The initial attributes of the new directory are given by "attributes". A reply
"status" of NFS_OK indicates that the new directory was created, and reply "file"
and reply "attributes" are its file handle and attributes. Any other reply "status"
means that the operation failed and no directory was created.
Note: possibly non-idempotent operation.
Remove Directory
stat
NFSPROC_RMDIR(diropargs) = 15;
The existing empty directory "name" in the directory given by "dir" is removed.
If the reply is NFS_OK, the directory was removed.
Note: possibly non-idempotent operation.
Read From Directory
struct readdirargs {
fhandle dir;
nfscookie cookie;
unsigned count;
};
struct entry {
unsigned fileid;
filename name;
nfscookie cookie;
entry *nextentry;
};
union readdirres switch (stat status) {
case NFS_OK:
    struct {
        entry *entries;
        bool eof;
    } readdirok;
default:
    void;
};
readdirres
NFSPROC_READDIR(readdirargs) = 16;
Returns a variable number of directory entries, with a total size of up to "count"
bytes, from the directory given by "dir". If the returned value of "status" is
NFS_OK, then it is followed by a variable number of "entry"s. Each "entry" contains a "fileid" which consists of a unique number to identify the file within a
filesystem, the "name" of the file, and a "cookie" which is an opaque pointer to
the next entry in the directory. The cookie is used in the next READDIR call to
get more entries starting at a given point in the directory. The special cookie
zero (all bits zero) can be used to get the entries starting at the beginning of the
directory. The "fileid" field should be the same number as the "fileid" in the attributes of the file. (See the Basic Data Types section.) The "eof" flag has a value
of TRUE if there are no more entries in the directory.
Get Filesystem Attributes
union statfsres switch (stat status) {
case NFS_OK:
    struct {
        unsigned tsize;
        unsigned bsize;
        unsigned blocks;
        unsigned bfree;
        unsigned bavail;
    } info;
default:
    void;
};
statfsres
NFSPROC_STATFS(fhandle) = 17;
If the reply "status" is NFS_OK, then the reply "info" gives the attributes for the
file system that contains the file referred to by the input fhandle. The attribute fields
contain the following values:
tsize:
The optimum transfer size of the server in bytes. This is the number of bytes
the server would like to have in the data part of READ and WRITE requests.
bsize:
The block size in bytes of the filesystem.
blocks:
The total number of "bsize" blocks on the filesystem.
bfree:
The number of free "bsize" blocks on the filesystem.
bavail:
The number of "bsize" blocks available to non-privileged users.
Note: This call does not work well if a filesystem has variable size blocks.
8.4. NFS Implementation Issues
The NFS protocol is designed to be operating system independent, but since this
version was designed in a UNIX environment, many operations have semantics
similar to the operations of the UNIX file system. This section discusses some of
the implementation-specific semantic issues.
Server/Client Relationship
The NFS protocol is designed to allow servers to be as simple and general as possible. Sometimes the simplicity of the server can be a problem, if the client
wants to implement complicated file system semantics.
For example, some operating systems allow removal of open files. A process can
open a file and, while it is open, remove it from the directory. The file can be
read and written as long as the process keeps it open, even though the file has no
name in the filesystem. It is impossible for a stateless server to implement these
semantics. The client can do some tricks such as renaming the file on remove,
and only removing it on close. We believe that the server provides enough functionality to implement most file system semantics on the client.
Every NFS client can also potentially be a server, and remote and local mounted
filesystems can be freely intermixed. This leads to some interesting problems
when a client travels down the directory tree of a remote filesystem and reaches
the mount point on the server for another remote filesystem. Allowing the server
to follow the second remote mount would require loop detection, server lookup,
and user revalidation. Instead, we decided not to let clients cross a server's
mount point. When a client does a LOOKUP on a directory on which the server
has mounted a filesystem, the client sees the underlying directory instead of the
mounted directory. A client can do remote mounts that match the server's mount
points to maintain the server's view.
Pathname Interpretation
There are a few complications to the rule that pathnames are always parsed on
the client. For example, symbolic links could have different interpretations on
different clients. Another common problem for non-UNIX implementations is
the special interpretation of the pathname ".. " to mean the parent of a given directory. The next revision of the protocol uses an explicit flag to indicate the parent
instead.
Permission Issues
The NFS protocol, strictly speaking, does not define the permission checking
used by servers. However, it is expected that a server will do normal operating
system permission checking using AUTH_UNIX style authentication as the basis
of its protection mechanism. The server gets the client's effective "uid", effective "gid", and groups on each call and uses them to check permission. There are
various problems with this method that can be resolved in interesting ways.
Using "uid" and "gid" implies that the client and server share the same "uid" list.
Every server and client pair must have the same mapping from user to "uid" and
from group to "gid". Since every client can also be a server, this tends to imply
that the whole network shares the same "uid/gid" space. AUTH_DES (and the
next revision of the NFS protocol) uses string names instead of numbers, but
there are still complex problems to be solved.
Another problem arises due to the usually stateful open operation. Most operating systems check permission at open time, and then check that the file is open
on each read and write request. With stateless servers, the server has no idea that
the file is open and must do permission checking on each read and write call. On
a local filesystem, a user can open a file and then change the permissions so that
no one is allowed to touch it, but will still be able to write to the file because it is
open. On a remote filesystem, by contrast, the write would fail. To get around
this problem, the server's permission checking algorithm should allow the owner
of a file to access it regardless of the permission setting.
A similar problem has to do with paging in from a file over the network. The
operating system usually checks for execute permission before opening a file for
demand paging, and then reads blocks from the open file. The file may not have
read permission, but after it is opened it doesn't matter. An NFS server can not
tell the difference between a normal file read and a demand page-in read. To
make this work, the server allows reading of files if the "uid" given in the call has
execute or read permission on the file.
In most operating systems, a particular user (the user with user ID zero) has access to all
files no matter what permission and ownership they have. This "super-user" permission may not be allowed on the server, since anyone who can become superuser on their workstation could gain access to all remote files. The UNIX server
by default maps user id 0 to -2 before doing its access checking. This works
except for NFS root filesystems, where super-user access cannot be avoided.
Setting RPC Parameters
Various file system parameters and options should be set at mount time. The
mount protocol is described in the appendix below. For example, "Soft" mounts
as well as "Hard" mounts are usually both provided. Soft mounted file systems
return errors when RPC operations fail (after a given number of optional
retransmissions), while hard mounted file systems continue to retransmit forever.
Clients and servers may need to keep caches of recent operations to help avoid
problems with non-idempotent operations.
8.5. Mount Protocol Definition
Introduction
The mount protocol is separate from, but related to, the NFS protocol. It provides operating system specific services to get the NFS off the ground -- looking
up server path names, validating user identity, and checking access permissions.
Clients use the mount protocol to get the first file handle, which allows them
entry into a remote filesystem.
The mount protocol is kept separate from the NFS protocol to make it easy to
plug in new access checking and validation methods without changing the NFS
server protocol.
Notice that the protocol definition implies stateful servers because the server
maintains a list of clients' mount requests. The mount list information is not critical for the correct functioning of either the client or the server. It is intended for
advisory use only, for example, to warn possible clients when a server is going
down.
Version one of the mount protocol is used with version two of the NFS protocol.
The only connecting point is the fhandle structure, which is the same for both
protocols.
RPC Information
Authentication
The mount service uses AUTH_UNIX and AUTH_DES style authentication
only.
Transport Protocols
The mount service is currently supported on UDP/IP only.
Port Number
Consult the server's portmapper, described in the Remote Procedure Calls:
Protocol Specification, to find the port number on which the mount service
is registered.
Sizes of XDR Structures
These are the sizes, given in decimal bytes, of various XDR structures used in the
protocol:
/* The maximum number of bytes in a pathname argument */
const MNTPATHLEN = 1024;
/* The maximum number of bytes in a name argument */
const MNTNAMLEN = 255;
/* The size in bytes of the opaque file handle */
const FHSIZE = 32;
Basic Data Types
This section presents the data types used by the mount protocol. In many cases
they are similar to the types used in NFS.
fhandle
typedef opaque fhandle[FHSIZE];
The type fhandle is the file handle that the server passes to the client. All file
operations are done using file handles to refer to a file or directory. The file handle can contain whatever information the server needs to distinguish an individual file.
This is the same as the "fhandle" XDR definition in version 2 of the NFS protocol; see Basic Data Types in the definition of the NFS protocol, above.
fhstatus
union fhstatus switch (unsigned status) {
case 0:
    fhandle directory;
default:
    void;
};
The type fhstatus is a union. If a "status" of zero is returned, the call completed successfully, and a file handle for the "directory" follows. A non-zero
status indicates some sort of error. In this case the status is a UNIX error
number.
dirpath
typedef string dirpath<MNTPATHLEN>;
The type dirpath is a server pathname of a directory.
name
typedef string name<MNTNAMLEN>;
The type name is an arbitrary string used for various names.
Server Procedures
The following sections define the RPC procedures supplied by a mount server.
/*
 * Protocol description for the mount program
 */
program MOUNTPROG {
    /*
     * Version 1 of the mount protocol used with
     * version 2 of the NFS protocol.
     */
    version MOUNTVERS {
        void       MOUNTPROC_NULL(void) = 0;
        fhstatus   MOUNTPROC_MNT(dirpath) = 1;
        mountlist  MOUNTPROC_DUMP(void) = 2;
        void       MOUNTPROC_UMNT(dirpath) = 3;
        void       MOUNTPROC_UMNTALL(void) = 4;
        exportlist MOUNTPROC_EXPORT(void) = 5;
    } = 1;
} = 100005;
Do Nothing
void
MNTPROC_NULL(void) = 0;
This procedure does no work. It is made available in all RPC services to allow
server response testing and timing.
Add Mount Entry
fhstatus
MNTPROC_MNT(dirpath) = 1;
If the reply "status" is 0, then the reply "directory" contains the file handle for the
directory "dirname". This file handle may be used in the NFS protocol. This
procedure also adds a new entry to the mount list for this client mounting "dirname".
Return Mount Entries
struct *mountlist {
    name      hostname;
    dirpath   directory;
    mountlist nextentry;
};
mountlist
MNTPROC_DUMP(void) = 2;
Returns the list of remote mounted filesystems. The "mountlist" contains one
entry for each "hostname" and "directory" pair.
Remove Mount Entry
void
MNTPROC_UMNT(dirpath) = 3;
Removes the mount list entry for the input "dirpath".
Remove All Mount Entries
void
MNTPROC_UMNTALL(void) = 4;
Removes all of the mount list entries for this client.
Return Export List
struct *groups {
name grname;
groups grnext;
};
struct *exportlist {
dirpath filesys;
groups groups;
exportlist next;
};
exportlist
MNTPROC_EXPORT(void) = 5;
Returns a variable number of export list entries. Each entry contains a filesystem
name and a list of groups that are allowed to import it. The filesystem name is in
"filesys", and the group name is in the list "groups".
Note: The exportlist should contain more information about the status of the
filesystem, such as a read-only flag.
PART THREE: Transport-Level Programming
9
Transport Level Interface Programming
This chapter provides detailed information, with various examples, on the UNIX
system Transport Interface. This interface is intended to supersede the socket-based interprocess communications mechanisms as the standard means of gaining direct access to transport services. Network application developers who do
not require such direct access should instead work within the Remote Procedure
Call (RPC) framework, which is documented in PART I of this manual.
NOTE
SunOS 4.1 does not support RPC on TLI. This is a feature that will appear in
future products.
The following discussion assumes a working knowledge of UNIX system programming and data communication concepts. Familiarity with the Reference
Model of Open Systems Interconnection (OSI) is required as well.
9.1. Background
To place the Transport Interface in perspective, a discussion of the OSI Reference Model is first presented. The Reference Model partitions networking functions into seven layers, as depicted in Figure 9-1.
Figure 9-1
OSI Reference Model
Layer 7    application
Layer 6    presentation
Layer 5    session
Layer 4    transport
Layer 3    network
Layer 2    data link
Layer 1    physical
Layer 1
The physical layer is responsible for the transmission of raw data over a
communication medium.
Layer 2
The data link layer provides the exchange of data between network layer
entities. It detects and corrects any errors that may occur in the physical
layer transmission.
Layer 3
The network layer manages the operation of the network. In particular, it is
responsible for the routing and management of data exchange between transport layer entities within the network.
Layer 4
The transport layer provides transparent data transfer services between session layer entities by relieving them from concerns of how reliable and
cost-effective transfer of data is achieved.
Layer 5
The session layer provides the services needed by presentation layer entities
that enable them to organize and synchronize their dialogue and manage
their data exchange.
Layer 6
The presentation layer manages the representation of information that application layer entities either communicate or reference in their communication.
Layer 7
The application layer serves as the window between corresponding application processes that are exchanging information.
A basic principle of the Reference Model is that each layer provides services
needed by the next higher layer in a way that frees the upper layer from concern
about how these services are provided. This approach simplifies the design of
each particular layer.
Industry standards either have been or are being defined at each layer of the
Reference Model. Two standards are defined at each layer: one that specifies an
interface to the services of the layer, and one that defines the protocol by which
services are provided. A service interface standard at any layer frees users of the
service from details of how that layer's protocol is implemented, or even which
protocol is used to provide the service.
The transport layer is important because it is the lowest layer in the Reference
Model that provides the basic service of reliable, end-to-end data transfer needed
by applications and higher layer protocols. In doing so, this layer hides the
topology and characteristics of the underlying network from its users. More
important, however, the transport layer defines a set of services common to
layers of many contemporary protocol suites, including the International Standards Organization (ISO) protocols, the Transmission Control Protocol and Internet Protocol (TCPIIP) of the ARPANET, Xerox Network Systems (XNS), and
the Systems Network Architecture (SNA).
A transport service interface, then, enables applications and higher layer protocols to be implemented without knowledge of the underlying protocol suite.
That is a principal goal of the UNIX system Transport Interface. Also, because
an inherent characteristic of the transport layer is that it hides details of the physical medium being used, the Transport Interface offers both protocol and medium
independence to networking applications and higher layer protocols.
The UNIX system Transport Interface was modeled after the industry standard
ISO Transport Service Definition (ISO 8072). As such, it is intended for those
applications and protocols that require transport services. Because the Transport
Interface provides reliable data transfer, and because its services are common to
several protocol suites, many networking applications will find these services
useful.
The Transport Interface is implemented as a user library using the STREAMS
input/output mechanism. Therefore, many services available to STREAMS
applications are also available to users of the Transport Interface. These services
will be highlighted throughout this guide. For detailed information about
STREAMS, see the STREAMS Programming manual.
9.2. Document Organization
This section is organized as follows:
o  Overview of the Transport Interface, a summary of the basic services
   available to Transport Interface users and a presentation of the background
   information needed for the remainder of the section.
o  Introduction to Connection-Mode Service, a description of the services
   associated with connection-based (or virtual circuit) communication.
o  Introduction to Connectionless-Mode Service, a description of the services associated with connectionless (or datagram) communication.
o  A Read/Write Interface, a description of how users can use the services of
   read(2) and write(2) to communicate over a transport connection.
o  Advanced Topics, a discussion of important concepts not covered in earlier
   sections. These include asynchronous event handling and processing of
   multiple, simultaneous connect requests.
o  State Transitions, an appendix which defines the allowable state transitions
   associated with the Transport Interface.
o  Guidelines for Protocol Independence, an appendix which establishes
   necessary guidelines for developing software that can be run without change
   over any transport protocol developed for the Transport Interface.
o  Examples, an appendix that presents the full listing of each programming
   example used throughout the guide.
o  Glossary, a definition of the Transport Interface terms and acronyms used in
   this section.
This section describes the more important and common facilities of the Transport
Interface, but is not meant to be exhaustive. Section 3N of the SunOS Reference
Manual contains a complete description of each Transport Interface routine.
9.3. Overview of the Transport Interface
This section presents a high level overview of the services of the Transport Interface, which supports the transfer of data between two user processes. Figure 9-2
illustrates the Transport Interface.
Figure 9-2    Transport Interface
[Diagram: the transport user issues service requests across the Transport Interface to the transport provider; the provider returns service events and indications to the user.]
The transport provider is the entity that provides the services of the Transport
Interface, and the transport user is the entity that requires these services. An
example of a transport provider is the ISO transport protocol, while a transport
user may be a networking application or session layer protocol.
The transport user accesses the services of the transport provider by issuing the
appropriate service requests. One example is a request to transfer data over a
connection. Similarly, the transport provider notifies the user of various events,
such as the arrival of data on a connection.
The Network Services Library of UNIX System V includes a set of functions that
support the services of the Transport Interface for user processes [see
intro(3)]. These functions enable a user to initiate requests to the provider and
process incoming events. Programs using the Transport Interface can link the
appropriate routines as follows:
cc prog.c -lnsl_s
Modes of Service
Two modes of service, connection-mode and connectionless-mode, are provided
by the Transport Interface. Connection-mode is circuit-oriented and enables data
to be transmitted over an established connection in a reliable, sequenced manner.
It also provides an identification mechanism that avoids the overhead of address
resolution and transmission during the data transfer phase. This service is attractive for applications that require relatively long-lived, datastream-oriented
interactions.
Connectionless-mode, in contrast, is message-oriented and supports data transfer
in self-contained units with no logical relationship required among multiple
units. This service requires only a preexisting association between the peer users
involved, which determines the characteristics of the data to be transmitted. All
the information required to deliver a unit of data (for example, the destination
address) is presented to the transport provider, together with the data to be
transmitted, in one service access (which need not relate to any other service
access). Each unit of data transmitted is entirely self-contained.
Connectionless-mode service is attractive for applications that:
o  involve short-term request/response interactions
o  exhibit a high level of redundancy
o  are dynamically reconfigurable
o  do not require guaranteed, in-sequence delivery of data
Connection-Mode Service
The connection-mode transport service is characterized by four phases: local
management, connection establishment, data transfer, and connection release.
Local Management
The local management phase defines local operations between a transport user
and a transport provider. For example, a user must establish a channel of communication with the transport provider, as illustrated in Figure 9-3. Each channel
between a transport user and transport provider is a unique endpoint of communication, and will be called the transport endpoint. The t_open(3N) routine
enables a user to choose a particular transport provider that will supply the
connection-mode services, and establishes the transport endpoint.
Figure 9-3    Channel Between User and Provider
[Diagram: a transport endpoint connects the transport user, through the Transport Interface, to the transport provider.]
Another necessary local function for each user is to establish an identity with the
transport provider. Each user is identified by a transport address. More
accurately, a transport address is associated with each transport endpoint, and one
user process may manage several transport endpoints. In connection-mode
service, one user requests a connection to another user by specifying that user's
address. The structure of a transport address is defined by the address space of
the transport provider. An address may be as simple as a random character string
(for example, "file_server"), or as complex as an encoded bit pattern that
specifies all information needed to route data through a network. Each transport
provider defines its own mechanism for identifying users. Addresses may be
assigned to each transport endpoint by t_bind(3N).
Table 9-1    Local Management Routines

Command      Description
t_alloc      Allocates Transport Interface data structures.
t_bind       Binds a transport address to a transport endpoint.
t_close      Closes a transport endpoint.
t_error      Prints a Transport Interface error message.
t_free       Frees structures allocated using t_alloc.
t_getinfo    Returns a set of parameters associated with a particular transport provider.
t_getstate   Returns the state of a transport endpoint.
t_look       Returns the current event on a transport endpoint.
t_open       Establishes a transport endpoint connected to a chosen transport provider.
t_optmgmt    Negotiates protocol-specific options with the transport provider.
t_sync       Synchronizes a transport endpoint with the transport provider.
t_unbind     Unbinds a transport address from a transport endpoint.
In addition to t_open and t_bind, several routines are available to support
local operations. Table 9-1 summarizes all local management routines of the
Transport Interface.
Connection Establishment
The connection establishment phase enables two users to create a connection, or
virtual circuit, between them, as demonstrated in Figure 9-4.
Figure 9-4    Transport Connection

[Figure: two transport users, user 1 and user 2, sit above the Transport
Interface; a transport connection joins them through the transport provider.]
This phase is illustrated by a client-server relationship between two transport
users. One user, the server, typically advertises some service to a group of users,
and then listens for requests from those users. As each client requires the service,
it attempts to connect itself to the server using the server's advertised transport
address. The t_connect(3N) routine initiates the connect request. One argument to t_connect, the transport address, identifies the server the client
wishes to access. The server is notified of each incoming request using
t_listen(3N), and may call t_accept(3N) to accept the client's request for
access to the service. If the request is accepted, the transport connection is established.
Table 9-2 summarizes all routines available for establishing a transport connection.
Table 9-2    Connection Establishment Routines

Command        Description
t_accept       Accepts a request for a transport connection.
t_connect      Establishes a connection with the transport user at a specified destination.
t_listen       Retrieves an indication of a connect request from another transport user.
t_rcvconnect   Completes connection establishment if t_connect was called in asynchronous mode (see the Advanced Topics section).
Data Transfer
The data transfer phase enables users to transfer data in both directions over an
established connection. Two routines, t_snd(3N) and t_rcv(3N), send and
receive data over this connection. All data sent by a user is guaranteed to be
delivered to the user on the other end of the connection in the order in which it
was sent. Table 9-3 summarizes the connection mode data transfer routines.
Table 9-3    Connection Mode Data Transfer Routines

Command    Description
t_rcv      Retrieves data that has arrived over a transport connection.
t_snd      Sends data over an established transport connection.
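The data-transfer loop that t_snd and t_rcv imply can be sketched as follows. Since <tiuser.h> and the TLI library are not assumed to be available here, t_snd and t_rcv are modeled by stand-in functions over plain write(2) and read(2); what the sketch illustrates is the argument shapes, the flags parameter, and the gather loop that the in-order delivery guarantee makes possible.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Stand-ins for t_snd(3N) and t_rcv(3N), modeled over write(2)/read(2)
 * so this sketch is self-contained.  A real TLI program would include
 * <tiuser.h> and call the routines of Table 9-3 with these arguments.
 */
static int t_snd_sketch(int fd, char *buf, unsigned nbytes, int flags)
{
        (void)flags;                    /* no expedited data in this sketch */
        return (int)write(fd, buf, nbytes);
}

static int t_rcv_sketch(int fd, char *buf, unsigned nbytes, int *flags)
{
        *flags = 0;
        return (int)read(fd, buf, nbytes);
}

/* Send a message on sndfd and collect it from rcvfd, piece by piece. */
int transfer(int sndfd, int rcvfd, char *msg, char *out, unsigned outlen)
{
        int flags, n;
        unsigned total = 0, len = (unsigned)strlen(msg);

        if (t_snd_sketch(sndfd, msg, len, 0) < 0)
                return -1;
        /* Data arrives in the order sent; gather until all of it is here. */
        while (total < len && total < outlen - 1) {
                if ((n = t_rcv_sketch(rcvfd, out + total,
                    outlen - 1 - total, &flags)) <= 0)
                        return -1;
                total += (unsigned)n;
        }
        out[total] = '\0';
        return (int)total;
}
```

A connected pair of descriptors stands in for the transport connection that t_connect and t_accept would establish; a pipe is enough to exercise the shape of the loop.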
Connection Release
The connection release phase provides a mechanism for breaking an established
connection. When you decide that the conversation should terminate, you can
request that the provider release the transport connection. Two types of
connection release are supported by the Transport Interface. The first is an
abortive release, which directs the transport provider to release the connection
immediately. Any previously sent data that has not yet reached the other transport
user may be discarded by the transport provider. The t_snddis(3N) routine
initiates this abortive disconnect, and t_rcvdis(3N) processes the incoming
indication of an abortive disconnect.
All transport providers must support the abortive release procedure. In addition,
some transport providers may also support an orderly release facility that enables
users to terminate communication gracefully with no data loss. The functions
t_sndrel(3N) and t_rcvrel(3N) support this capability. Table 9-4
summarizes the connection release routines.
Table 9-4    Connection Release Routines

Command    Description
t_rcvdis   Returns an indication of an aborted connection, including a reason code and user data.
t_rcvrel   Returns an indication that the remote user has requested an orderly release of a connection.
t_snddis   Aborts a connection or rejects a connect request.
t_sndrel   Requests the orderly release of a connection.
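The behavior of an orderly release can be sketched with ordinary sockets, since TCP's half-close is a close analogy: shutdown(SHUT_WR) on a socketpair plays the role of t_sndrel (the sender promises to send no more data), and a read returning 0 plays the orderly release indication that t_rcvrel consumes. This is an analogy, not the TLI API itself; the point it demonstrates is that, unlike the abortive case, data sent before the release still reaches the peer.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Orderly release, modeled on a socketpair: shutdown(SHUT_WR) plays
 * t_sndrel(3N), and a read returning 0 plays the orderly release
 * indication that t_rcvrel(3N) consumes.  Data queued before the
 * release is not lost.
 */
int orderly_release_demo(char *out, unsigned outlen)
{
        int fds[2];
        char buf[64];
        ssize_t n;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
                return -1;

        write(fds[0], "last words", 10);    /* sent before the release     */
        shutdown(fds[0], SHUT_WR);          /* "t_sndrel": request release */

        n = read(fds[1], buf, sizeof(buf)); /* peer still gets the data    */
        if (n < 0 || (unsigned)n >= outlen)
                return -1;
        memcpy(out, buf, (size_t)n);
        out[n] = '\0';

        n = read(fds[1], buf, sizeof(buf)); /* 0: release indication       */
        close(fds[0]);
        close(fds[1]);
        return (int)n;                      /* 0 on graceful release       */
}
```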
Connectionless-Mode Service
The connectionless-mode transport service is characterized by two phases: local
management and data transfer. The local management phase defines the same
local operations described above for the connection-mode service.
The data transfer phase enables a user to transfer data units (sometimes called
datagrams) to the specified peer user. Each data unit must be accompanied by
the transport address of the destination user. Two routines, t_sndudata(3N)
and t_rcvudata(3N), support this message-based data transfer facility. Table
9-5 summarizes all routines associated with connectionless-mode data transfer.
Table 9-5    Connectionless-mode Data Transfer Routines

Command      Description
t_rcvudata   Retrieves a message sent by another transport user.
t_rcvuderr   Retrieves error information associated with a previously sent message.
t_sndudata   Sends a message to the specified destination user.
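The routines in Table 9-5 operate on a t_unitdata structure that bundles the destination address, any options, and the datagram itself into netbuf members, following the same pattern as the t_bind and t_call structures used elsewhere in this chapter. The sketch below reproduces the structures so it compiles without the TLI headers (the field layout is an assumption drawn from that pattern), and fills in a datagram the way a sender would before calling t_sndudata.

```c
#include <assert.h>
#include <string.h>

/* The netbuf structure from <tiuser.h>, reproduced for the sketch. */
struct netbuf {
        unsigned int maxlen;
        unsigned int len;
        char *buf;
};

/* The structure t_sndudata(3N)/t_rcvudata(3N) operate on. */
struct t_unitdata {
        struct netbuf addr;     /* destination (or source) address */
        struct netbuf opt;      /* protocol-specific options       */
        struct netbuf udata;    /* the datagram itself             */
};

/* Fill in a datagram for the integer-address provider of the examples. */
void prepare_datagram(struct t_unitdata *ud, int *dest, char *msg)
{
        ud->addr.len = sizeof(int);
        ud->addr.buf = (char *)dest;    /* destination user's address */
        ud->opt.len = 0;                /* no protocol options        */
        ud->opt.buf = NULL;
        ud->udata.len = (unsigned)strlen(msg);
        ud->udata.buf = msg;            /* self-contained data unit   */
}
```

Each unit of data carries its own address, so no call here relates to any other: the structure is filled in and handed to the provider in one service access.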
State Transitions
The Transport Interface has two components:
o  the library routines that provide the transport services to users
o  the state transition rules that define the sequence in which the transport
   routines may be invoked
The state transition rules can be found in the State Transitions section of this
chapter in the form of state tables. The state tables define the legal sequence of
library calls based on state information and the handling of events. These events
include user-generated library calls, as well as provider-generated event
indications.
NOTE
Any user of the Transport Interface must completely understand all possible state
transitions before writing software using the interface.

9.4. Introduction to Connection-Mode Services
This section describes the connection-mode service of the Transport Interface.
As discussed in the previous section, the connection-mode service can be
illustrated using a client-server paradigm. The important concepts of
connection-mode service will be presented using two programming examples. The
examples are related in that the first illustrates how a client establishes a
connection to a server and then communicates with the server. The second example
shows the server's side of the interaction. All examples discussed in this guide are
presented in their entirety in the Some Examples section, below.
In the examples, the client establishes a connection with a server process. The
server then transfers a file to the client. The client, in turn, receives the data from
the server and writes it to its standard output file.
Local Management
Before the client and server can establish a transport connection, each must first
establish a local channel (the transport endpoint) to the transport provider using
t_open, and establish its identity (or address) using t_bind.
The set of services supported by the Transport Interface may not be implemented
by all transport protocols. Each transport provider has a set of characteristics
associated with it that determine the services it offers and the limits associated
with those services. This information is returned to the user by t_open, and
consists of the following:
addr
maximum size of a transport address
options
maximum bytes of protocol-specific options that may be passed between the
transport user and transport provider
tsdu
maximum message size that may be transmitted in either connection-mode
or connectionless-mode
etsdu
maximum expedited data message size that may be sent over a transport
connection
connect
maximum number of bytes of user data that may be passed between users
during connection establishment
discon
maximum bytes of user data that may be passed between users during the
abortive release of a connection
servtype
the type of service supported by the transport provider
The three service types defined by the Transport Interface are:
T_COTS
The transport provider supports connection-mode service but does not provide the optional orderly release facility.
T_COTS_ORD
The transport provider supports connection-mode service with the optional
orderly release facility.
T_CLTS
The transport provider supports connectionless-mode service. Only one
such service can be associated with the transport provider identified by
t_open.
NOTE
t_open returns the default provider characteristics associated with a transport
endpoint. However, some characteristics may change after an endpoint has been
opened. This will occur if the characteristics are associated with negotiated
options (option negotiation is described later in this section). For example, if the
support of expedited data transfer is a negotiated option, the value of this
characteristic may change. t_getinfo may be called to retrieve the current
characteristics of a transport endpoint.
Once a user establishes a transport endpoint with the chosen transport provider, it
must establish its identity. As mentioned earlier, t_bind accomplishes this by
binding a transport address to the transport endpoint. In addition, for servers, this
routine informs the transport provider that the endpoint will be used to listen for
incoming connect requests, also called connect indications.
An optional facility, t_optmgmt(3N), is also available during the local
management phase. It enables a user to negotiate the values of protocol options
with the transport provider. Each transport protocol is expected to define its own
set of negotiable protocol options, which may include such information as
Quality-of-Service parameters. Because of the protocol-specific nature of
options, only applications written for a particular protocol environment are
expected to use this facility.
The Client
The local management requirements of the example client and server are used to
discuss details of these facilities. The following are the definitions needed by the
client program, followed by its necessary local management steps.
#include <stdio.h>
#include <tiuser.h>
#include <fcntl.h>

#define SRV_ADDR 1                      /* server's well known address */

main()
{
        int fd;
        int nbytes;
        int flags = 0;
        char buf[1024];
        struct t_call *sndcall;
        extern int t_errno;

        if ((fd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
                t_error("t_open failed");
                exit(1);
        }
        if (t_bind(fd, NULL, NULL) < 0) {
                t_error("t_bind failed");
                exit(2);
        }
The first argument to t_open is the pathname of a file system node that
identifies the transport protocol that will supply the transport service. In this
example, /dev/tivc is a STREAMS clone device node that identifies a
generic, connection-based transport protocol [see clone(4)].
NOTE
The name /dev/tivc does not exist in SunOS. This is just a name used as an
example that represents the transport selection node.
The clone device finds an available minor device of the transport provider for
the user. It is opened for both reading and writing, as specified by the O_RDWR
open flag. The third argument may be used to return the service characteristics of
the transport provider to the user. This information is useful when writing
protocol-independent software (discussed in the Guidelines for Protocol
Independence section, below.) For simplicity, the client and server in this example ignore this information and assume the transport provider has the following
characteristics:
o  The transport address is an integer value that uniquely identifies each user.
o  The transport provider supports the T_COTS_ORD service type, and the
   example will use the orderly release facility to release the connection.
o  User data may not be passed between users during either connection
   establishment or abortive release.
o  The transport provider does not support protocol-specific options.
Because these characteristics are not needed by the user, NULL is specified in
the third argument to t_open. If the user needed a service other than
T_COTS_ORD, another transport provider would be opened. An example
of the T_CLTS service invocation is presented in the Introduction to
Connectionless-Mode Service section.
The return value of t_open is an identifier for the transport endpoint that will
be used by all subsequent Transport Interface function calls. This identifier is
actually a file descriptor obtained by opening the transport protocol file [see
open(2)]. The significance of this fact is highlighted in the A Read/Write
Interface section.
After the transport endpoint is created, the client calls t_bind to assign an
address to the endpoint. The first argument identifies the transport endpoint. The
second argument describes the address the user would like to bind to the endpoint, and the third argument is set on return from t_bind to specify the
address that the provider bound.
The address associated with a server's transport endpoint is important, because
that is the address used by all clients to access the server. However, the typical
client does not care what its own address is, because no other process will try to
access it. That is the case in this example, where the second and third arguments
to t_bind are set to NULL. A NULL second argument will direct the transport provider to choose an address for the user. A NULL third argument indicates that the user does not care what address was assigned to the endpoint.
If either t_open or t_bind fail, the program will call t_error(3N) to
print an appropriate error message to stderr. If any Transport Interface routine
fails, the global integer t_errno will be assigned an appropriate transport
error value. A set of such error values has been defined (in <tiuser.h>) for
the Transport Interface, and t_error will print an error message corresponding
to the value in t_errno. This routine is analogous to perror(3), which
prints an error message based on the value of errno. If the error associated
with a transport function is a system error, t_errno will be set to TSYSERR,
and errno will be set to the appropriate value.
The Server
The server in this example must take similar local management steps before
communication can begin. The server must establish a transport endpoint through
which it will listen for connect indications. The necessary definitions and local
management steps are shown below:
#include <tiuser.h>
#include <stropts.h>
#include <fcntl.h>
#include <stdio.h>
#include <signal.h>

#define DISCONNECT      -1
#define SRV_ADDR        1       /* server's well known address */

int conn_fd;                    /* connection established here */
extern int t_errno;

main()
{
        int listen_fd;          /* listening transport endpoint */
        struct t_bind *bind;
        struct t_call *call;

        if ((listen_fd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
                t_error("t_open failed for listen_fd");
                exit(1);
        }

        /*
         * By assuming that the address is an integer value,
         * this program may not run over another protocol.
         */
        if ((bind = (struct t_bind *)t_alloc(listen_fd,
                T_BIND, T_ALL)) == NULL) {
                t_error("t_alloc of t_bind structure failed");
                exit(2);
        }
        bind->qlen = 1;
        bind->addr.len = sizeof(int);
        *(int *)bind->addr.buf = SRV_ADDR;

        if (t_bind(listen_fd, bind, bind) < 0) {
                t_error("t_bind failed for listen_fd");
                exit(3);
        }
        /*
         * Was the correct address bound?
         */
        if (*(int *)bind->addr.buf != SRV_ADDR) {
                fprintf(stderr, "t_bind bound wrong address\n");
                exit(4);
        }
As with the client, the first step is to call t_open to establish a transport
endpoint with the desired transport provider. This endpoint, listen_fd, will be
used to listen for connect indications. Next, the server must bind its well-known
address to the endpoint. This address is used by each client to access the server.
The second argument to t_bind requests that a particular address be bound to
the transport endpoint. This argument points to a t_bind structure with the
following format:
struct t_bind {
        struct netbuf addr;
        unsigned qlen;
};
where addr describes the address to be bound, and qlen indicates the maximum
outstanding connect indications that may arrive at this endpoint. All Transport
Interface structure and constant definitions are found in <tiuser.h>.
The address is specified using a netbuf structure that contains the following
members:
struct netbuf {
        unsigned int maxlen;
        unsigned int len;
        char *buf;
};
where buf points to a buffer containing the data, len specifies the bytes of data in
the buffer, and maxlen indicates the maximum bytes the buffer can hold (and
need only be set when data is returned to the user by a Transport Interface
routine). For the t_bind structure, the data pointed to by buf identifies a
transport address. It is expected that the structure of addresses will vary among
each protocol implementation under the Transport Interface. The netbuf
structure is intended to support any such structure.
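To make the relationship between the two structures concrete, here is the server's bind request assembled by hand. The structures are reproduced from the definitions above so the sketch compiles without <tiuser.h>; in the example program, t_alloc performs the allocation instead.

```c
#include <assert.h>
#include <stdlib.h>

/* Reproduced from the definitions above, so the sketch is self-contained. */
struct netbuf {
        unsigned int maxlen;
        unsigned int len;
        char *buf;
};

struct t_bind {
        struct netbuf addr;
        unsigned qlen;
};

#define SRV_ADDR 1                      /* server's well known address */

/* Assemble the bind request the server passes to t_bind. */
void prepare_bind(struct t_bind *bind)
{
        bind->addr.maxlen = sizeof(int);        /* room in the buffer  */
        bind->addr.buf = malloc(bind->addr.maxlen);
        bind->addr.len = sizeof(int);           /* bytes of valid data */
        *(int *)bind->addr.buf = SRV_ADDR;      /* the address itself  */
        bind->qlen = 1;         /* one outstanding connect indication  */
}
```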
If the value of qlen is greater than 0, the transport endpoint may be used to listen
for connect indications. In such cases, t _bind directs the transport provider to
immediately begin queueing connect indications destined for the bound address.
Furthermore, the value of qlen indicates the maximum outstanding connect indications the server wishes to process. The server must respond to each connect
indication, either accepting or rejecting the request for connection. An outstanding connect indication is one to which the server has not yet responded. Often, a
server will fully process a single connect indication and respond to it before
receiving the next indication. In this case, a value of 1 is appropriate for qlen.
However, some servers may wish to retrieve several connect indications before
responding to any of them. In such cases, qlen indicates the maximum number
of such outstanding indications the server will process. An example of a server
that manages multiple outstanding connect indications is presented in the
Advanced Topics section.
t_alloc(3N) is called to allocate the t_bind structure needed by t_bind.
t_alloc takes three arguments. The first is a file descriptor that references a
transport endpoint. This is used to access the characteristics of the transport
provider [see t_open(3N)]. The second argument identifies the appropriate
Transport Interface structure to be allocated. The third argument specifies which,
if any, netbuf buffers should be allocated for that structure. T_ALL specifies
that all netbuf buffers associated with the structure should be allocated, and
will cause the addr buffer to be allocated in this example. The size of this buffer
is determined from the transport provider characteristic that defines the maximum
address size. The maxlen field of this netbuf structure will be set to the
size of the newly allocated buffer by t_alloc. The use of t_alloc will
help ensure the compatibility of user programs with future releases of the
Transport Interface.
The server in this example will process connect indications one at a time, so qlen
is set to 1. The address information is then assigned to the newly allocated
t_bind structure. This t_bind structure will be used to pass information to
t_bind in the second argument, and also will be used to return information to
the user in the third argument.
On return, the t_bind structure will contain the address that was bound to the
transport endpoint. If the provider could not bind the requested address (perhaps
because it had been bound to another transport endpoint), it will choose another
appropriate address.
NOTE
Each transport provider will manage its address space differently. Some transport
providers may allow a single transport address to be bound to several transport
endpoints, while others may require a unique address per endpoint. The
Transport Interface supports either choice. Based on its address management
rules, a provider will determine if it can bind the requested address. If not, it
will choose another valid address from its address space and bind it to the
transport endpoint.
The server must check the bound address to ensure that it is the one previously
advertised to clients. Otherwise, the clients will be unable to reach the server.
If t_bind succeeds, the provider will begin queueing connect indications. The
next phase of communication, connection establishment, is entered.
Connection Establishment
The connection establishment procedures highlight the distinction between
clients and servers. The Transport Interface imposes a different set of procedures
in this phase for each type of transport user. The client initiates the connection
establishment procedure by requesting a connection to a particular server using
t_connect(3N). The server is then notified of the client's request by calling
t_listen(3N). The server may either accept or reject the client's request. It
will call t_accept(3N) to establish the connection, or call t_snddis(3N)
to reject the request. The client will be notified of the server's decision when
t_connect completes.
The Transport Interface supports two facilities during connection establishment
that may not be supported by all transport providers. The first is the ability to
transfer data between the client and server when establishing the connection.
The client may send data to the server when it requests a connection. This data
will be passed to the server by t_listen. Similarly, the server can send data
to the client when it accepts or rejects the connection. The connect characteristic
returned by t_open determines how much data, if any, two users may transfer
during connect establishment.
The second optional service supported by the Transport Interface during
connection establishment is the negotiation of protocol options. The client may
specify protocol options that it would like the transport provider and/or the
remote user to support. The Transport Interface supports both local and remote
option negotiation. As discussed earlier, option negotiation is inherently a
protocol-specific function. Use of this facility is discouraged if
protocol-independent software is a goal (see the Guidelines for Protocol
Independence section).
The Client
Continuing with the client/server example, the steps needed by the client to
establish a connection are shown next:
/*
 * By assuming that the address is an integer value,
 * this program may not run over another protocol.
 */
if ((sndcall = (struct t_call *)t_alloc(fd,
        T_CALL, T_ADDR)) == NULL) {
        t_error("t_alloc failed");
        exit(3);
}
sndcall->addr.len = sizeof(int);
*(int *)sndcall->addr.buf = SRV_ADDR;

if (t_connect(fd, sndcall, NULL) < 0) {
        t_error("t_connect failed for fd");
        exit(4);
}
The t_connect call establishes the connection with the server. The first
argument to t_connect identifies the transport endpoint through which the
connection is established, and the second argument identifies the destination
server. This argument is a pointer to a t_call structure, which has the
following format:
struct t_call {
        struct netbuf addr;
        struct netbuf opt;
        struct netbuf udata;
        int sequence;
};
addr identifies the address of the server, opt may be used to specify
protocol-specific options that the client would like to associate with the
connection, and udata identifies user data that may be sent with the connect
request to the server. The sequence field has no meaning for t_connect.
t_alloc is called above to allocate the t_call structure dynamically. Once
allocated, the appropriate values are assigned. In this example, no options or
user data are associated with the t_connect call, but the server's address
must be set. The third argument to t_alloc is set to T_ADDR to indicate that
an appropriate netbuf buffer should be allocated for the address. The server's
address is then assigned to buf, and len is set accordingly.
The third argument to t_connect can be used to return information about the
newly established connection to the user, and may retrieve any user data sent by
the server in its response to the connect request. It is set to NULL by the client
here to indicate that this information is not needed. The connection will be
established on successful return of t_connect. If the server rejects the connect
request, t_connect will fail and set t_errno to TLOOK.
Event Handling
The TLOOK error has special significance in the Transport Interface. Some
Transport Interface routines may be interrupted by an unexpected asynchronous
transport event on the given transport endpoint, and TLOOK notifies the user that
an event has occurred. As such, TLOOK does not indicate an error with a
Transport Interface routine, but the normal processing of that routine will not be
performed because of the pending event. The events defined by the Transport
Interface are listed here:
T_LISTEN
A request for a connection, called a connect indication, has arrived at the
transport endpoint.
T_CONNECT
The confirmation of a previously sent connect request, called a connect
confirmation, has arrived at the transport endpoint. The confirmation is
generated when a server accepts a connect request.
T_DATA
User data has arrived at the transport endpoint.
T_EXDATA
Expedited user data has arrived at the transport endpoint. Expedited data
will be discussed later in this section.
T_DISCONNECT
A notification that the connection was aborted or that the server rejected a
connect request, called a disconnect indication, has arrived at the transport
endpoint.
T_ORDREL
A request for the orderly release of a connection, called an orderly release
indication, has arrived at the transport endpoint.
T_UDERR
The notification of an error in a previously sent datagram, called a unitdata
error indication, has arrived at the transport endpoint (see the Introduction to
Connectionless-Mode Service section).
It is possible in some states to receive one of several asynchronous events, as
described in the state tables of the State Transitions section. The t_look(3N)
routine enables a user to determine what event has occurred if a TLOOK error is
returned. The user can then process that event accordingly. In the example, if a
connect request is rejected, the event passed to the client will be a disconnect
indication. The client will exit if its request is rejected.
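The usual response to TLOOK is a dispatch on the result of t_look, pairing each pending event with the routine that consumes it. The sketch below captures that pairing; the enum values are illustrative stand-ins (the real T_LISTEN, T_CONNECT, ... constants come from <tiuser.h> and differ), and a real program would call the named routine in each branch rather than return its name.

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-ins for the event codes returned by t_look(3N). */
enum tli_event {
        EV_LISTEN,      /* connect indication         */
        EV_CONNECT,     /* connect confirmation       */
        EV_DATA,        /* normal data                */
        EV_EXDATA,      /* expedited data             */
        EV_DISCONNECT,  /* disconnect indication      */
        EV_ORDREL,      /* orderly release indication */
        EV_UDERR        /* unitdata error indication  */
};

/* Map each pending event to the routine that consumes it. */
const char *consuming_routine(enum tli_event ev)
{
        switch (ev) {
        case EV_LISTEN:         return "t_listen";
        case EV_CONNECT:        return "t_rcvconnect";
        case EV_DATA:           return "t_rcv";
        case EV_EXDATA:         return "t_rcv";
        case EV_DISCONNECT:     return "t_rcvdis";
        case EV_ORDREL:         return "t_rcvrel";
        case EV_UDERR:          return "t_rcvuderr";
        }
        return "none";
}
```

In the client example, the only event possible after a rejected connect request is the disconnect indication, so the corresponding branch would call t_rcvdis before the client exits.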
The Server
Returning to the example, when the client calls t_connect, a connect indication will be generated on the server's listening transport endpoint. The steps
required by the server to process the event are presented below. For each client,
the server accepts the connect request and spawns a server process to manage the
connection.
if ((call = (struct t_call *)t_alloc(listen_fd,
        T_CALL, T_ALL)) == NULL) {
        t_error("t_alloc of t_call structure failed");
        exit(5);
}
while (1) {
        if (t_listen(listen_fd, call) < 0) {
                t_error("t_listen failed for listen_fd");
                exit(6);
        }
        if ((conn_fd = accept_call(listen_fd, call)) != DISCONNECT)
                run_server(listen_fd);
}
The server will loop forever, processing each connect indication. First, the server
calls t_listen to retrieve the next connect indication. When one arrives, the
server calls accept_call to accept the connect request. accept_call
accepts the connection on an alternate transport endpoint (as discussed below)
and returns the value of that endpoint. conn_fd is a global variable that identifies
the transport endpoint where the connection is established. Because the
connection is accepted on an alternate endpoint, the server may continue listening
for connect indications on the endpoint that was bound for listening. If the call is
accepted without error, run_server will spawn a process to manage the
connection.
The server allocates a t_call structure to be used by t_listen. The third
argument to t_alloc, T_ALL, specifies that all necessary buffers should be
allocated for retrieving the caller's address, options, and user data. As mentioned
earlier, the transport provider in this example does not support the transfer of
user data during connection establishment, and also does not support any
protocol options. Therefore, t_alloc will not allocate buffers for the user data
and options. It must, however, allocate a buffer large enough to store the address
of the caller. t_alloc determines the buffer size from the addr characteristic
returned by t_open. The maxlen field of each netbuf structure will be set
to the size of the newly allocated buffer by t_alloc (maxlen is 0 for the user
data and options buffers).
Using the t_call structure, the server calls t_listen to retrieve the next
connect indication. If one is currently available, it is returned to the server
immediately. Otherwise, t_listen will block until a connect indication
arrives.
NOTE
The Transport Interface supports an asynchronous mode for such routines that
will prevent a process from blocking. This feature is discussed in the Advanced
Topics section.
When a connect indication arrives, the server calls accept_call to accept the
client's request, as follows:
accept_call(listen_fd, call)
int listen_fd;
struct t_call *call;
{
        int resfd;

        if ((resfd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
                t_error("t_open for responding fd failed");
                exit(7);
        }
        if (t_bind(resfd, NULL, NULL) < 0) {
                t_error("t_bind for responding fd failed");
                exit(8);
        }
        if (t_accept(listen_fd, resfd, call) < 0) {
                if (t_errno == TLOOK) {  /* must be a disconnect */
                        if (t_rcvdis(listen_fd, NULL) < 0) {
                                t_error("t_rcvdis failed for listen_fd");
                                exit(9);
                        }
                        if (t_close(resfd) < 0) {
                                t_error("t_close failed for responding fd");
                                exit(10);
                        }
                        /* go back up and listen for other calls */
                        return(DISCONNECT);
                }
                t_error("t_accept failed");
                exit(11);
        }
        return(resfd);
}
accept_call takes two arguments. listen_fd identifies the transport endpoint
where the connect indication arrived, and call is a pointer to a t_call structure
that contains all information associated with the connect indication. The server
will first establish another transport endpoint by opening the clone device node of
the transport provider and binding an address. As with the client, a NULL value
is passed to t_bind to specify that the user does not care what address is
bound by the provider. The newly established transport endpoint, resfd, is used
to accept the client's connect request.
The first two arguments of t_accept specify the listening transport endpoint
and the endpoint where the connection will be accepted, respectively. A
connection may be accepted on the listening endpoint; however, this would prevent
other clients from accessing the server for the duration of that connection.
The third argument of t_accept points to the t_call structure associated
with the connect indication. This structure should contain the address of the
calling user and the sequence number returned by t_listen. The value of
sequence has particular significance if the server manages multiple outstanding
connect indications. The Advanced Topics section presents such an example.
Also, the t_call structure should identify protocol options the user would like
to specify, and user data that may be passed to the client. Because the transport
provider in this example does not support protocol options or the transfer of user
data during connection establishment, the t_call structure returned by
t_listen may be passed without change to t_accept.
For simplicity in the example, the server will exit if either the t_open or
t_bind call fails. exit(2) will close the transport endpoint associated with
listen_fd, causing the transport provider to pass a disconnect indication to the
client that requested the connection. This disconnect indication notifies the client
that the connection was not established; t_connect will fail, setting
t_errno to TLOOK.
t_accept may fail if an asynchronous event has occurred on the listening
transport endpoint before the connection is accepted, and t_errno will be set to
TLOOK. The state transition table in the State Transitions section shows that the
only event that may occur in this state with only one outstanding connect
indication is a disconnect indication. This event may occur if the client decides to
undo the connect request it had previously initiated. If a disconnect indication
arrives, the server must retrieve the disconnect indication using t_rcvdis.
This routine takes a pointer to a t_discon structure as an argument, which is
used to retrieve information associated with a disconnect indication. In this
example, however, the server does not care to retrieve this information, so it sets
the argument to NULL. After receiving the disconnect indication,
accept_call closes the responding transport endpoint and returns
DISCONNECT, which informs the server that the connection was disconnected by
the client. The server then listens for further connect indications.
Figure 9-5 illustrates how the server establishes connections.
Figure 9-5	Listening and Responding Transport Endpoints

[Diagram: the client's endpoint is connected, through the transport provider, to
the server's responding endpoint; the server's listening endpoint remains ready
to retrieve further connect indications.]
The transport connection is established on the newly created responding endpoint, and the listening endpoint is freed to retrieve further connect indications.
Data Transfer
Once the connection has been established, both the client and server may begin
transferring data over the connection using t_snd and t_rcv. In fact, the
Transport Interface does not differentiate the client from the server from this
point on. Either user may send and receive data, or release the connection. The
Transport Interface guarantees reliable, sequenced delivery of data over an
existing connection.
Two classes of data may be transferred over a transport connection: normal and
expedited. Expedited data is typically associated with information of an urgent
nature. The exact semantics of expedited data are subject to the interpretations of
the transport provider. Furthermore, not all transport protocols support the
notion of an expedited data class [see t_open(3N)].
All transport protocols support the transfer of data in byte stream mode, where
"byte stream" implies no concept of message boundaries on data that is
transferred over a connection. However, some transport protocols support the
preservation of message boundaries over a transport connection. This service is
supported by the Transport Interface, but protocol-independent software must not
rely on its existence.
The message interface for data transfer is supported by a special flag of t_snd
and t_rcv called T_MORE. The messages, called Transport Service Data
Units (TSDUs), may be transferred between two transport users as distinct units.
The maximum size of a TSDU is a characteristic of the underlying transport
protocol. This information is available to the user from t_open and
t_getinfo. Because the maximum TSDU size can be large (possibly
unlimited), the Transport Interface enables a user to transmit a message in
multiple units.
To send a message in multiple units over a transport connection, the user must
set the T_MORE flag on every t_snd call except the last. This flag indicates
that the user will send more data associated with the message in a subsequent call
to t_snd. The last message unit should be transmitted with T_MORE turned
off to indicate that this is the end of the TSDU.
Similarly, a TSDU may be passed to the user on the receiving side in multiple
units. Again, if t_rcv returns with the T_MORE flag set, the user should continue calling t_rcv to retrieve the remainder of the message. The last unit in
the message will be indicated by a call to t_rcv that does not set T_MORE.
CAUTION
The T_MORE flag implies nothing about how the data may be packaged
below the Transport Interface. Furthermore, it implies nothing about how
the data may be delivered to the remote user. Each transport protocol, and
each implementation of that protocol, may package and deliver the data differently.
For example, if a user sends a complete message in a single call to t_snd, there
is no guarantee that the transport provider will deliver the data in a single unit to
the remote transport user. Similarly, a TSDU transmitted in two message units
may be delivered in a single unit to the remote transport user. The message
boundaries may only be preserved by noting the value of the T_MORE flag on
t_snd and t_rcv. This will guarantee that the receiving user will see a
message with the same contents and message boundaries as was sent by the
remote user.
The Client
Continuing with the client/server example, the server will transfer a log file to the
client over the transport connection. The client receives this data and writes it to
its standard output file. A byte stream interface is used by the client and server,
where message boundaries (that is, the T_MORE flag) are ignored. The client
receives data using the following instructions:
	while ((nbytes = t_rcv(fd, buf, 1024, &flags)) != -1) {
		if (fwrite(buf, 1, nbytes, stdout) < 0) {
			fprintf(stderr, "fwrite failed\n");
			exit(5);
		}
	}
The client continuously calls t_rcv to process incoming data. If no data is
currently available, t_rcv blocks until data arrives. t_rcv will retrieve the
available data up to 1024 bytes, which is the size of the client's input buffer, and
will return the number of bytes that were received. The client then writes this
data to standard output and continues. The data transfer phase will complete
when t_rcv fails. t_rcv will fail if an orderly release indication or
disconnect indication arrives, as will be discussed later in this section. If the
fwrite(3S) call fails for any reason, the client will exit, thereby closing the
transport endpoint. If the transport endpoint is closed (either by exit or
t_close) when it is in the data transfer phase, the connection will be aborted
and the remote user will receive a disconnect indication.
The Server
Looking now at the other side of the connection, the server manages its data
transfer by spawning a child process to send the data to the client. The parent
process then loops back to listen for further connect indications.
run_server is called by the server to spawn this child process as follows:
	connrelease()
	{
		/* conn_fd is global because needed here */
		if (t_look(conn_fd) == T_DISCONNECT) {
			fprintf(stderr, "connection aborted\n");
			exit(12);
		}
		/* else orderly release indication - normal exit */
		exit(0);
	}

	run_server(listen_fd)
	int listen_fd;
	{
		int nbytes;
		FILE *logfp;	/* file pointer to log file */
		char buf[1024];

		switch (fork()) {
		case -1:
			perror("fork failed");
			exit(20);

		default:	/* parent */
			/* close conn_fd and then go up and listen again */
			if (t_close(conn_fd) < 0) {
				t_error("t_close failed for conn_fd");
				exit(21);
			}
			return;

		case 0:		/* child */
			/* close listen_fd and do service */
			if (t_close(listen_fd) < 0) {
				t_error("t_close failed for listen_fd");
				exit(22);
			}
			if ((logfp = fopen("logfile", "r")) == NULL) {
				perror("cannot open logfile");
				exit(23);
			}
			signal(SIGPOLL, connrelease);
			if (ioctl(conn_fd, I_SETSIG, S_INPUT) < 0) {
				perror("ioctl I_SETSIG failed");
				exit(24);
			}
			if (t_look(conn_fd) != 0) {	/* was disconnect there? */
				fprintf(stderr, "t_look: unexpected event\n");
				exit(25);
			}
			while ((nbytes = fread(buf, 1, 1024, logfp)) > 0)
				if (t_snd(conn_fd, buf, nbytes, 0) < 0) {
					t_error("t_snd failed");
					exit(26);
				}
		}
	}
After the fork, the parent process will return to the main processing loop and
listen for further connect indications. Meanwhile, the child process will manage
the newly established transport connection. If the fork call fails, exit will
close the transport endpoint associated with listen_fd. This action will cause a
disconnect indication to be passed to the client, and the client's t_connect
call will fail.
The server process reads 1024 bytes of the log file at a time and sends that data to
the client using t_snd. buf points to the start of the data buffer, and nbytes
specifies the number of bytes to be transmitted. The fourth argument is used to
specify optional flags. Two flags are currently supported: T_EXPEDITED may
be set to indicate that the data is expedited, and T_MORE may be set to define
message boundaries when transmitting messages over a connection. Neither flag
is set by the server in this example.
If the user begins to flood the transport provider with data, the provider may exert
back pressure to provide flow control. In such cases, t_snd will block until the
flow control is relieved, and will then resume its operation. t_snd will not
complete until nbytes bytes have been passed to the transport provider.
The t_snd routine does not look for a disconnect indication (signifying that the
connection was broken) before passing data to the provider. Also, because the
data traffic is flowing in one direction, the user will never look for incoming
events. If, for some reason, the connection is aborted, the user should be notified
because data may be lost. One option available to the user is to use t_look to
check for incoming events before each t_snd call. A more efficient solution is
the one presented in the example. The STREAMS I_SETSIG ioctl enables a
user to request a signal when a given event occurs [see streamio(5) and
signal(2)]. The STREAMS event of concern here is S_INPUT, which will
cause a signal to be sent to the user if any input arrives on the Stream referenced
by conn_fd. If a disconnect indication arrives, the signal catching routine
(connrelease) will print an appropriate error message and then exit.
If the data traffic flowed in both directions in this example, the user would not
have to monitor the connection for disconnects. If the client alternated t_snd
and t_rcv calls, it could rely on t_rcv to recognize an incoming disconnect
indication.
Connection Release
At any point during data transfer, either user may release the transport connection
and end the conversation. As mentioned earlier, two forms of connection release
are supported by the Transport Interface. The first, abortive release, breaks a
connection immediately and may result in the loss of any data that has not yet
reached the destination user. t_snddis may be called by either user to
generate an abortive release. Also, the transport provider may abort a connection
if a problem occurs below the Transport Interface. t_snddis enables a user to
send data to the remote user when aborting a connection. Although the abortive
release is supported by all transport providers, the ability to send data when
aborting a connection is not.
When the remote user is notified of the aborted connection, t_rcvdis must be
called to retrieve the disconnect indication. This call will return a reason code
that indicates why the connection was aborted, and will return any user data that
may have accompanied the disconnect indication (if the abortive release was
initiated by the remote user). This reason code is specific to the underlying
transport protocol, and should not be interpreted by protocol-independent
software.
The second form of connection release is orderly release, which gracefully
terminates a connection and guarantees that no data will be lost. All transport
providers must support the abortive release procedure, but orderly release is an
optional facility that is not supported by all transport protocols.
The Server
The client-server example in this section assumes that the transport provider does
support the orderly release of a connection. When all the data has been
transferred by the server, the connection may be released as follows:
	if (t_sndrel(conn_fd) < 0) {
		t_error("t_sndrel failed");
		exit(27);
	}
	pause();	/* until orderly release indication arrives */
The orderly release procedure consists of two steps by each user. The first user
to complete data transfer may initiate a release using t_sndrel, as illustrated
in the example. This routine informs the client that no more data will be sent by
the server. When the client receives such an indication, it may continue sending
data back to the server if desired. When all data has been transferred, however,
the client must also call t_sndrel to indicate that it is ready to release the
connection. The connection will be released only after both users have requested
an orderly release and received the corresponding indication from the other user.
In this example, data is transferred in one direction from the server to the client,
so the server does not expect to receive data from the client after it has initiated
the release procedure. Thus, the server simply calls pause(2) after initiating
the release. Eventually, the remote user will respond with its orderly release
request, and the indication will generate a signal that will be caught by
connrelease. Remember that the server earlier issued an I_SETSIG ioctl
call to generate a signal on any incoming event. Since the only possible
Transport Interface events that can occur in this situation are a disconnect
indication or orderly release indication, connrelease will terminate normally
when the orderly release indication arrives. The exit call in connrelease
will close the transport endpoint, thereby freeing the bound address for use by
another user. If a user process wants to close a transport endpoint without
exiting, it may call t_close.
The Client
The client's view of connection release is similar to that of the server. As
mentioned earlier, the client continues to process incoming data until t_rcv
fails. If the server releases the connection (using either t_snddis or
t_sndrel), t_rcv will fail and set t_errno to TLOOK. The client then
processes the connection release as follows:
	if ((t_errno == TLOOK) && (t_look(fd) == T_ORDREL)) {
		if (t_rcvrel(fd) < 0) {
			t_error("t_rcvrel failed");
			exit(6);
		}
		if (t_sndrel(fd) < 0) {
			t_error("t_sndrel failed");
			exit(7);
		}
		exit(0);
	}
	t_error("t_rcv failed");
	exit(8);
Under normal circumstances, the client terminates the transfer of data by calling
t_sndrel to initiate the connection release. When the orderly release
indication arrives at the client's side of the connection, the client checks to make
sure the expected orderly release indication has arrived. If so, it proceeds with the
release procedures by calling t_rcvrel to process the indication and
t_sndrel to inform the server that it is also ready to release the connection.
At this point the client exits, thereby closing its transport endpoint.
Because not all transport providers support the orderly release facility just
described, users may have to use the abortive release facility provided by
t_snddis and t_rcvdis. However, steps must be taken by each user to
prevent any loss of data. For example, a special byte pattern may be inserted in
the data stream to indicate the end of a conversation. Many mechanisms are
possible for preventing data loss. Each application and high-level protocol must
choose an appropriate mechanism given the target protocol environment and
requirements.
9.5. Introduction to Connectionless-Mode Service
This section describes the connectionless-mode service of the Transport
Interface. Connectionless-mode service is appropriate for short-term
request/response interactions, such as transaction processing applications. Data
are transferred in self-contained units with no logical relationship required among
multiple units.
The connectionless-mode services will be described using a transaction server as
an example. This server waits for incoming transaction queries, and processes
and responds to each query.
Local Management
Just as with connection-mode service, the transport users must perform
appropriate local management steps before data can be transferred. A user must
choose the appropriate connectionless service provider using t_open and
establish its identity using t_bind.
t_optmgmt may be used to negotiate protocol options that may be associated
with the transfer of each data unit. As with the connection-mode service, each
transport provider specifies the options, if any, that it supports. Option
negotiation is therefore a protocol-specific activity.
In the example, the definitions and local management calls needed by the transaction server are as follows:
	#include <stdio.h>
	#include <fcntl.h>
	#include <tiuser.h>

	#define SRV_ADDR	2	/* server's well known address */

	main()
	{
		int fd;
		int flags;
		struct t_bind *bind;
		struct t_unitdata *ud;
		struct t_uderr *uderr;
		extern int t_errno;

		if ((fd = t_open("/dev/tidg", O_RDWR, NULL)) < 0) {
			t_error("unable to open /dev/provider");
			exit(1);
		}

		if ((bind = (struct t_bind *)t_alloc(fd,
		    T_BIND, T_ADDR)) == NULL) {
			t_error("t_alloc of t_bind structure failed");
			exit(2);
		}
		bind->addr.len = sizeof(int);
		*(int *)bind->addr.buf = SRV_ADDR;
		bind->qlen = 0;

		if (t_bind(fd, bind, bind) < 0) {
			t_error("t_bind failed");
			exit(3);
		}

		/*
		 * is the bound address correct?
		 */
		if (*(int *)bind->addr.buf != SRV_ADDR) {
			fprintf(stderr, "t_bind bound wrong address\n");
			exit(4);
		}
The local management steps should look familiar by now. The server establishes
a transport endpoint with the desired transport provider using t_open. Each
provider has an associated service type, so the user may choose a particular
service by opening the appropriate transport provider file. This
connectionless-mode server ignores the characteristics of the provider returned
by t_open in the same way as the users in the connection-mode example,
setting the third argument to NULL. For simplicity, the transaction server
assumes the transport provider has the following characteristics:
o  The transport address is an integer value that uniquely identifies each user.
o  The transport provider supports the T_CLTS service type (connectionless
   transport service, or datagram).
o  The transport provider does not support any protocol-specific options.
The connectionless server also binds a transport address to the endpoint, so that
potential clients may identify and access the server. A t_bind structure is
allocated using t_alloc, and the buf and len fields of the address are set
accordingly.
One important difference between the connection-mode server and this
connectionless-mode server is that the qlen field of the t_bind structure has no
meaning for connectionless-mode service. That is because all users are capable
of receiving datagrams once they have bound an address. The Transport
Interface defines an inherent client-server relationship between two users while
establishing a transport connection in the connection-mode service. However, no
such relationship exists in the connectionless-mode service. It is the context of
this example, not the Transport Interface, that defines one user as a server and
another as a client.
Because the address of the server is known by all potential clients, the server
checks the bound address returned by t_bind to ensure it is correct.
Data Transfer
Once a user has bound an address to the transport endpoint, datagrams may be
sent or received over that endpoint. Each outgoing message is accompanied by
the address of the destination user. In addition, the Transport Interface enables a
user to specify protocol options that should be associated with the transfer of the
data unit (for example, transit delay). As discussed earlier, each transport
provider defines the set of options, if any, that may accompany a datagram. When
the datagram is passed to the destination user, the associated protocol options
may be returned as well.
The following sequence of calls illustrates the data transfer phase of the
connectionless-mode server:
	if ((ud = (struct t_unitdata *)t_alloc(fd,
	    T_UNITDATA, T_ALL)) == NULL) {
		t_error("t_alloc of t_unitdata structure failed");
		exit(5);
	}
	if ((uderr = (struct t_uderr *)t_alloc(fd,
	    T_UDERROR, T_ALL)) == NULL) {
		t_error("t_alloc of t_uderr structure failed");
		exit(6);
	}

	while (1) {
		if (t_rcvudata(fd, ud, &flags) < 0) {
			if (t_errno == TLOOK) {
				/*
				 * Error on previously sent datagram
				 */
				if (t_rcvuderr(fd, uderr) < 0)
					exit(7);
				fprintf(stderr, "bad datagram, error = %d\n",
				    uderr->error);
				continue;
			}
			t_error("t_rcvudata failed");
			exit(8);
		}

		/*
		 * query() processes the request and places the
		 * response in ud->udata.buf, setting ud->udata.len
		 */
		query(ud);

		if (t_sndudata(fd, ud, 0) < 0) {
			t_error("t_sndudata failed");
			exit(9);
		}
	}
	}

	query()
	{
		/* Merely a stub for simplicity */
	}
The server must first allocate a t_unitdata structure for storing datagrams,
which has the following format:

	struct t_unitdata {
		struct netbuf addr;
		struct netbuf opt;
		struct netbuf udata;
	};
addr holds the source address of incoming datagrams and the destination address
of outgoing datagrams, opt identifies any protocol options associated with the
transfer of the datagram, and udata holds the data itself. The addr, opt, and
udata fields must all be allocated with buffers that are large enough to hold any
possible incoming values. As described in the previous section, the T_ALL
argument to t_alloc will ensure this and will set the maxlen field of each
netbuf structure accordingly. Because the provider does not support protocol
options in this example, no options buffer will be allocated, and maxlen will be
set to zero in the netbuf structure for options. A t_uderr structure is also
allocated by the server for processing any datagram errors, as will be discussed
later in this section.
The transaction server loops forever, receiving queries, processing the queries,
and responding to the clients. It first calls t_rcvudata to receive the next
query. t_rcvudata will retrieve the next available incoming datagram. If
none is currently available, t_rcvudata will block, waiting for a datagram to
arrive. The second argument of t_rcvudata identifies the t_unitdata
structure where the datagram should be stored.
The third argument, flags, must point to an integer variable and may be set to
T_MORE on return from t_rcvudata to indicate that the user's udata buffer
was not large enough to store the full datagram. In this case, subsequent calls to
t_rcvudata will retrieve the remainder of the datagram. Because t_alloc
allocates a udata buffer large enough to store the maximum datagram size, the
transaction server does not have to check the value of flags.
If a datagram is received successfully, the transaction server calls the query
routine to process the request. This routine will store the response in the structure
pointed to by ud, and will set ud->udata.len to indicate the number of bytes in
the response. The source address returned by t_rcvudata in ud->addr will
be used as the destination address by t_sndudata.
When the response is ready, t_sndudata is called to return the response to
the client. The Transport Interface prevents a user from flooding the transport
provider with datagrams using the same flow control mechanism described for
the connection-mode service. In such cases, t_sndudata will block until the
flow control is relieved, and will then resume its operation.
Datagram Errors
If the transport provider cannot process a datagram that was passed to it by
t_sndudata, it will return a unit data error event, T_UDERR, to the user.
This event includes the destination address and options associated with the
datagram, plus a protocol-specific error value that describes what may be wrong
with the datagram. The reason a datagram could not be processed is
protocol-specific. One reason may be that the transport provider could not interpret the destination address or options. Each transport protocol is expected to
specify all reasons for which it is unable to process a datagram.
NOTE
The unit data error indication is not necessarily intended to indicate success or
failure in delivering the datagram to the specified destination. The transport
protocol decides how the indication will be used. Remember, the connectionless
service does not guarantee reliable delivery of data.
The transaction server will be notified of this error event when it attempts to
receive another datagram. In this case, t_rcvudata will fail, setting
t_errno to TLOOK. If TLOOK is set, the only possible event is T_UDERR,
so the server calls t_rcvuderr to retrieve the event. The second argument to
t_rcvuderr is the t_uderr structure that was allocated earlier. This
structure is filled in by t_rcvuderr and has the following format:
	struct t_uderr {
		struct netbuf addr;
		struct netbuf opt;
		long error;
	};
where addr and opt identify the destination address and protocol options as
specified in the bad datagram, and error is a protocol-specific error code that
indicates why the provider could not process the datagram. The transaction
server prints the error code and then continues by entering the processing loop
again.
9.6. A Read/Write Interface
A user may wish to establish a transport connection and then exec(2) an
existing user program such as cat(1) to process the data as it arrives over the
connection. However, existing programs use read(2) and write(2) for their
input/output needs. The Transport Interface does not directly support a
read/write interface to a transport provider, but one is available with UNIX
System V. This interface enables a user to issue read and write calls over a
transport connection that is in the data transfer phase. This section describes the
read/write interface to the connection-mode service of the Transport
Interface. This interface is not available with the connectionless-mode service.
The read/write interface is presented using the client example of the
Connection-Mode Client section with some minor modifications. The clients are
identical until the data transfer phase is reached. At that point, this client will use
the read/write interface and cat(1) to process incoming data. cat can
be run without change over the transport connection. Only the differences
between this client and that of the example in the Connection-Mode Client section are shown below.
	#include <stropts.h>

	/*
	 * Same local management and connection
	 * establishment steps.
	 */

	if (ioctl(fd, I_PUSH, "tirdwr") < 0) {
		perror("I_PUSH of tirdwr failed");
		exit(5);
	}
	close(0);
	dup(fd);
	execl("/bin/cat", "/bin/cat", 0);
	perror("execl of /bin/cat failed");
	exit(6);
The client invokes the read/write interface by pushing the tirdwr(5)
module onto the Stream associated with the transport endpoint where the
connection was established [see I_PUSH in streamio(5)]. This module
converts the Transport Interface above the transport provider into a pure
read/write interface. With the module in place, the client calls close(2)
and dup(2) to establish the transport endpoint as its standard input file, and
uses /bin/cat to process the input. Because the transport endpoint identifier
is a file descriptor, the facility for duping the endpoint is available to users.
Because the Transport Interface has been implemented using STREAMS, the
facilities of this character input/output mechanism can be used to provide
enhanced user services. By pushing the tirdwr module above the transport
provider, the user's interface is effectively changed. The semantics of read
and write must be followed, and message boundaries will not be preserved.
CAUTION
The tirdwr module may only be pushed onto a Stream when the transport
endpoint is in the data transfer phase. Once the module is pushed, the user may
not call any Transport Interface routines. If a Transport Interface routine is
invoked, tirdwr will generate a fatal protocol error, EPROTO, on that Stream,
rendering it unusable. Furthermore, if the user pops the tirdwr module off the
Stream [see I_POP in streamio(5)], the transport connection will be aborted.
The exact semantics of write, read, and close using tirdwr are
described below. To summarize, tirdwr enables a user to send and receive
data over a transport connection using read and write. This module will
translate all Transport Interface indications into the appropriate actions. The
connection can be released with the close system call.
write
The user may transmit data over the transport connection using write. The
tirdwr module will pass data through to the transport provider. However, if a
user attempts to send a zero-length data packet, which the STREAMS
mechanism allows, tirdwr will discard the message. If for some reason the
transport connection is aborted (for example, the remote user aborts the
connection using t_snddis), a STREAMS hangup condition will be generated
on that Stream, and further write calls will fail and set errno to ENXIO. The
user can still retrieve any available data after a hangup, however.
read
read may be used to retrieve data that has arrived over the transport connection.
The tirdwr module will pass data through to the user from the transport provider. However, any other event or indication passed to the user from the provider will be processed by tirdwr as follows:
o
read cannot process expedited data because it cannot distinguish expedited
data from nonnal data for the user. If an expedited data indication is
received, tirdwr will generate a fatal protocol error, EPROTO, on that
Stream. This error will cause further system calls to fail. You must therefore be aware that you should not communicate with a process that is sending expedited data.
o
If an abortive disconnect indication is received, tirdwr will discard the
indication and generate a STREAMS hangup condition on that Stream. Subsequent read calls will retrieve any remaining data, and then read will
return zero for all further calls (indicating end-of-file).
o
If an orderly release indication is received, tirdwr will discard the indica-
tion and deliver a zero-length STREAMS message to the user. As described
in read(2), this notifies the user of end-of-file by returning 0 to the user.
o
If any other Transport Interface indication is received, tirdwr will gen-
erate a fatal protocol error, EPROTO, on that Stream. This will cause
further system calls to fail. If a user pushes tirdwr onto a Stream after
the connection has been established, such indications will not be generated.
close
With tirdwr on a Stream, the user can send and receive data over a transport
connection for the duration of that connection. Either user may terminate the
connection by closing the file descriptor associated with the transport endpoint or
by popping the tirdwr module off the Stream. In either case, tirdwr will
take the following actions:
o
If an orderly release indication had previously been received by tirdwr,
an orderly release request will be passed to the transport provider to com-
plete the orderly release of the connection. The remote user, who initiated
the orderly release procedure, will receive the expected indication when data
transfer completes.
o
If a disconnect indication had previously been received by tirdwr, no
special action is taken.
o
If neither an orderly release indication nor disconnect indication had previously been received by tirdwr, a disconnect request will be passed to the
transport provider to abortively release the connection.
o
If an error had previously occurred on the Stream and a disconnect indication has not been received by tirdwr, a disconnect request will be passed
to the transport provider.
A process may not initiate an orderly release after tirdwr is pushed onto a
Stream, but tirdwr will handle an orderly release properly if it is initiated by
the user on the other side of a transport connection. If the client in this section is
communicating with the server program in the Connection-Mode Client section,
that server will terminate the transfer of data with an orderly release request. The
server then waits for the corresponding indication from the client. At that point,
the client exits and the transport endpoint is closed. As explained in the first bullet item above, when the file descriptor is closed, tirdwr will initiate the orderly release request from the client's side of the connection. This will generate
the indication that the server is expecting, and the connection will be released
properly.
9.7. Advanced Topics
This section presents important concepts of the Transport Interface that have not
been covered in the previous section. First, an optional non-blocking (asynchronous) mode for some library calls is described. Then, an advanced programming
example is presented that defines a server that supports multiple outstanding connect indications and operates in an event driven manner.
Asynchronous Execution Mode
Many Transport Interface library routines may block waiting for an incoming
event or the relaxation of flow control. However, some time-critical applications
should not block for any reason. Similarly, an application may wish to do local
processing while waiting for some asynchronous transport interface event.
Support for asynchronous processing of Transport Interface events is available to
applications using a combination of the STREAMS asynchronous features
(poll) and the non-blocking mode of the Transport Interface library routines
(I_SETSIG ioctl).
In addition, each Transport Interface routine that may block waiting for some
event can be run in a special non-blocking mode. For example, t_listen will
normally block, waiting for a connect indication. However, a server can periodically poll a transport endpoint for existing connect indications by calling
t_listen in the non-blocking (or asynchronous) mode. The asynchronous
mode is enabled by setting O_NDELAY on the file descriptor. This can be set as
a flag on t_open, or by calling fcntl(2) before calling the Transport Interface routine. fcntl can be used to enable or disable this mode at any time.
All programming examples illustrated throughout this guide use the default, synchronous mode of processing.
O_NDELAY affects each Transport Interface routine in a different manner. To
determine the exact semantics of O_NDELAY for a particular routine, see the
appropriate pages in Section 3N of the SunOS Reference Manual.
Advanced Programming Example
The following example demonstrates two important concepts. The first is a
server's ability to manage multiple outstanding connect indications. The second
is an illustration of the ability to write event-driven software using the Transport
Interface and the STREAMS system call interface.
The server example in the Connection-Mode Client section was capable of supporting only one outstanding connect indication, but the Transport Interface supports the ability to manage multiple outstanding connect indications. One reason
a server might wish to receive several simultaneous connect indications is to
impose a priority scheme on each client. A server may retrieve several connect
indications, and then accept them in an order based on a priority associated with
each client. A second reason for handling several outstanding connect indications is that the single-threaded scheme has some limitations. Depending on the
implementation of the transport provider, it is possible that while the server is
processing the current connect indication, other clients will find it busy. If, however, multiple connect indications can be processed simultaneously, the server
will be found to be busy only if the maximum allowed number of clients attempt
to call the server simultaneously.
The server example is event-driven: the process polls a transport endpoint for
incoming Transport Interface events, and then takes the appropriate actions for
the current event. The example demonstrates the ability to poll multiple transport endpoints for incoming events.
The definitions and local management functions needed by this example are
similar to those of the server example in the Introduction to Connectionless-Mode Service section.
#include <tiuser.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <stropts.h>
#include <signal.h>

#define NUM_FDS       1
#define MAX_CONN_IND  4
#define SRV_ADDR      1    /* server's well known address */

int conn_fd;               /* server connection here */
extern int t_errno;

/* holds connect indications */
struct t_call *calls[NUM_FDS][MAX_CONN_IND];

main()
{
    struct pollfd pollfds[NUM_FDS];
    struct t_bind *bind;
    int i;
    /*
     * Only opening and binding one transport endpoint,
     * but more could be supported
     */
    if ((pollfds[0].fd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
        t_error("t_open failed");
        exit(1);
    }

    if ((bind = (struct t_bind *)t_alloc(pollfds[0].fd,
        T_BIND, T_ALL)) == NULL) {
        t_error("t_alloc of t_bind structure failed");
        exit(2);
    }
    bind->qlen = MAX_CONN_IND;
    bind->addr.len = sizeof(int);
    *(int *)bind->addr.buf = SRV_ADDR;
    if (t_bind(pollfds[0].fd, bind, bind) < 0) {
        t_error("t_bind failed");
        exit(3);
    }

    /*
     * Was the correct address bound?
     */
    if (*(int *)bind->addr.buf != SRV_ADDR) {
        fprintf(stderr, "t_bind bound wrong address\n");
        exit(4);
    }
The file descriptor returned by t_open is stored in a pollfd structure [see
poll(2)] that will be used to poll the transport endpoint for incoming data.
Notice that only one transport endpoint is established in this example. However,
the remainder of the example is written to manage multiple transport endpoints.
Several endpoints could be supported with minor changes to the above code.
An important aspect of this server is that it sets qlen to a value greater than 1 for
t _bind. This indicates that the server is willing to handle multiple outstanding
connect indications. Remember that the earlier examples single-threaded the
connect indications and responses. The server would accept the current connect
indication before retrieving additional connect indications. This example, however, can retrieve up to MAX_CONN_IND connect indications at one time
before responding to any of them. The transport provider may negotiate the
value of qlen downward if it cannot support MAX_CONN_IND outstanding connect indications.
Once the server has bound its address and is ready to process incoming connect
requests, it does the following:
    pollfds[0].events = POLLIN;

    while (1) {
        if (poll(pollfds, NUM_FDS, -1) < 0) {
            perror("poll failed");
            exit(5);
        }
        for (i = 0; i < NUM_FDS; i++) {
            switch (pollfds[i].revents) {
            default:
                perror("poll returned error event");
                exit(6);
            case 0:
                continue;
            case POLLIN:
                do_event(i, pollfds[i].fd);
                service_conn_ind(i, pollfds[i].fd);
            }
        }
    }
}
The events field of the pollfd structure is set to POLLIN, which will notify
the server of any incoming Transport Interface events. The server then enters an
infinite loop, in which it will poll the transport endpoint(s) for events, and
then process those events as they occur.
The poll call will block indefinitely, waiting for an incoming event. On
return, each entry (corresponding to each transport endpoint) is checked for an
existing event. If revents is set to 0, no event has occurred on that endpoint. In
this case, the server continues to the next transport endpoint. If revents is set to
POLLIN, an event does exist on the endpoint. In this case, do_event is called
to process the event. If revents contains any other value, an error must have
occurred on the transport endpoint, and the server will exit.
For each iteration of the loop, if any event is found on the transport endpoint,
service_conn_ind is called to process any outstanding connect indications.
However, if another connect indication is pending, service_conn_ind will
save the current connect indication and respond to it later. This routine will be
explained shortly.
If an incoming event is discovered, the following routine is called to process it:
do_event(slot, fd)
{
    struct t_discon *discon;
    int i;

    switch (t_look(fd)) {
    default:
        fprintf(stderr, "t_look: unexpected event\n");
        exit(7);
    case T_ERROR:
        fprintf(stderr, "t_look returned T_ERROR event\n");
        exit(8);
    case -1:
        t_error("t_look failed");
        exit(9);
    case 0:
        /* since POLLIN returned, this should not happen */
        fprintf(stderr, "t_look returned no event\n");
        exit(10);
    case T_LISTEN:
        /*
         * find free element in calls array
         */
        for (i = 0; i < MAX_CONN_IND; i++) {
            if (calls[slot][i] == NULL)
                break;
        }
        if ((calls[slot][i] = (struct t_call *)t_alloc(fd,
            T_CALL, T_ALL)) == NULL) {
            t_error("t_alloc of t_call structure failed");
            exit(11);
        }
        if (t_listen(fd, calls[slot][i]) < 0) {
            t_error("t_listen failed");
            exit(12);
        }
        break;
    case T_DISCONNECT:
        discon = (struct t_discon *)t_alloc(fd, T_DIS, T_ALL);
        if (t_rcvdis(fd, discon) < 0) {
            t_error("t_rcvdis failed");
            exit(13);
        }
        /*
         * find call ind in array and delete it
         */
        for (i = 0; i < MAX_CONN_IND; i++) {
            if (discon->sequence == calls[slot][i]->sequence) {
                t_free(calls[slot][i], T_CALL);
                calls[slot][i] = NULL;
                t_free(discon, T_DIS);
                break;
            }
        }
        break;
    }
}
This routine takes a number, slot, and a file descriptor, fd, as arguments. slot is
used as an index into the global array calls. This array contains an entry for each
polled transport endpoint, where each entry consists of an array of t _call
structures that hold incoming connect indications for that transport endpoint.
The value of slot is used to identify the transport endpoint of interest.
do_event calls t_look to determine the Transport Interface event that has
occurred on the transport endpoint referenced by fd. If a connect indication
(T_LISTEN event) or disconnect indication (T_DISCONNECT event) has
arrived, the event is processed. Otherwise, the server prints an appropriate error
message and exits.
For connect indications, do_event scans the array of outstanding connect
indications looking for the first free entry. A t_call structure is then allocated
for that entry, and the connect indication is retrieved using t_listen. There
must always be at least one free entry in the connect indication array, because the
array is large enough to hold the maximum number of outstanding connect indications as negotiated by t_bind. The processing of the connect indication is
deferred until later.
If a disconnect indication arrives, it must correspond to a previously received
connect indication. This scenario arises if a client attempts to undo a previous
connect request. In this case, do_event allocates a t_discon structure to
retrieve the relevant disconnect information. This structure has the following
members:
struct t_discon {
    struct netbuf udata;
    int reason;
    int sequence;
};
where udata identifies any user data that might have been sent with the disconnect indication, reason contains a protocol-specific disconnect reason code, and
sequence identifies the outstanding connect indication that matches this disconnect indication.
Next, t_rcvdis is called to retrieve the disconnect indication. The array of
connect indications for slot is then scanned for one that contains a sequence
number that matches the sequence number in the disconnect indication. When
the connect indication is found, it is freed and the corresponding entry is set to
NULL.
As mentioned earlier, if any event is found on a transport endpoint,
service_conn_ind is called to process all currently outstanding connect
indications associated with that endpoint as follows:
service_conn_ind(slot, fd)
{
    int i;

    for (i = 0; i < MAX_CONN_IND; i++) {
        if (calls[slot][i] == NULL)
            continue;

        if ((conn_fd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
            t_error("open failed");
            exit(14);
        }
        if (t_bind(conn_fd, NULL, NULL) < 0) {
            t_error("t_bind failed");
            exit(15);
        }
        if (t_accept(fd, conn_fd, calls[slot][i]) < 0) {
            if (t_errno == TLOOK) {
                t_close(conn_fd);
                return;
            }
            t_error("t_accept failed");
            exit(16);
        }
        t_free(calls[slot][i], T_CALL);
        calls[slot][i] = NULL;
        run_server(fd);
    }
}
For the given slot (the transport endpoint), the array of outstanding connect indications is scanned. For each indication, the server will open a responding transport endpoint, bind an address to the endpoint, and then accept the connection
on that endpoint. If another event (connect indication or disconnect indication)
arrives before the current indication is accepted, t_accept will fail and set
t_errno to TLOOK.
NOTE
The user cannot accept an outstanding connect indication if any pending connect
indication events or disconnect indication events exist on that transport endpoint.
If this error occurs, the responding transport endpoint is closed and
service_conn_ind will return immediately (saving the current connect indication for later processing). This causes the server's main processing loop to be
entered, and the new event will be discovered by the next call to poll. In this
way, multiple connect indications may be queued by the user.
Eventually, all events will be processed, and service_conn_ind will be
able to accept each connect indication in turn. Once the connection has been
established, the run_server routine used by the server in the Connection-Mode Client section is called to manage the data transfer.
9.8. State Transitions
These tables describe all state transitions associated with the Transport Interface.
First, however, the states and events will be described.
Transport Interface States
Table 9-6 defines the states used to describe the Transport Interface state transitions.
Table 9-6    Transport Interface States

 State        Description                     Service Type
 T_UNINIT     uninitialized - initial and     T_COTS, T_COTS_ORD, T_CLTS
              final state of interface
 T_UNBND      initialized but not bound       T_COTS, T_COTS_ORD, T_CLTS
 T_IDLE       no connection established       T_COTS, T_COTS_ORD, T_CLTS
 T_OUTCON     outgoing connection             T_COTS, T_COTS_ORD
              pending for client
 T_INCON      incoming connection             T_COTS, T_COTS_ORD
              pending for server
 T_DATAXFER   data transfer                   T_COTS, T_COTS_ORD
 T_OUTREL     outgoing orderly release        T_COTS_ORD
              (waiting for orderly
              release indication)
 T_INREL      incoming orderly release        T_COTS_ORD
              (waiting to send orderly
              release request)

Outgoing Events
The outgoing events described in Table 9-7 correspond to the return of the
specified transport routines, where these routines send a request or response to
the transport provider.
In Table 9-7, some events (such as accept1) are distinguished by the context in
which they occur. The context is based on the values of the following variables:
ocnt
count of outstanding connect indications
fd
file descriptor of the current transport endpoint
resfd
file descriptor of the transport endpoint where a connection will be accepted
Revision A, of 27 March 1990
230
Transport-Level Programming
Table 9-7    Transport Interface Outgoing Events

 Event      Description                            Service Type
 opened     successful return of t_open            T_COTS, T_COTS_ORD, T_CLTS
 bind       successful return of t_bind            T_COTS, T_COTS_ORD, T_CLTS
 optmgmt    successful return of t_optmgmt         T_COTS, T_COTS_ORD, T_CLTS
 unbind     successful return of t_unbind          T_COTS, T_COTS_ORD, T_CLTS
 closed     successful return of t_close           T_COTS, T_COTS_ORD, T_CLTS
 connect1   successful return of t_connect in      T_COTS, T_COTS_ORD
            synchronous mode
 connect2   TNODATA error on t_connect             T_COTS, T_COTS_ORD
            in asynchronous mode, or TLOOK
            error due to a disconnect indication
            arriving on the transport endpoint
 accept1    successful return of t_accept          T_COTS, T_COTS_ORD
            with ocnt == 1, fd == resfd
 accept2    successful return of t_accept          T_COTS, T_COTS_ORD
            with ocnt == 1, fd != resfd
 accept3    successful return of t_accept          T_COTS, T_COTS_ORD
            with ocnt > 1
 snd        successful return of t_snd             T_COTS, T_COTS_ORD
 snddis1    successful return of t_snddis          T_COTS, T_COTS_ORD
            with ocnt <= 1
 snddis2    successful return of t_snddis          T_COTS, T_COTS_ORD
            with ocnt > 1
 sndrel     successful return of t_sndrel          T_COTS_ORD
 sndudata   successful return of t_sndudata        T_CLTS

Incoming Events
The incoming events correspond to the successful return of the specified routines,
where these routines retrieve data or event information from the transport provider. The only incoming event not associated directly with the return of a routine is pass_conn, which occurs when a user transfers a connection to another
transport endpoint. This event occurs on the endpoint that is being passed the
connection, despite the fact that no Transport Interface routine is issued on that
endpoint. pass_conn is included in the state tables to describe the behavior when
a user accepts a connection on another transport endpoint.
In Table 9-8, the rcvdis events are distinguished by the context in which they
occur. The context is based on the value of ocnt, which is the count of outstanding connect indications on the transport endpoint.
Table 9-8    Transport Interface Incoming Events

 Incoming
 Event        Description                            Service Type
 listen       successful return of t_listen          T_COTS, T_COTS_ORD
 rcvconnect   successful return of t_rcvconnect      T_COTS, T_COTS_ORD
 rcv          successful return of t_rcv             T_COTS, T_COTS_ORD
 rcvdis1      successful return of t_rcvdis with     T_COTS, T_COTS_ORD
              ocnt <= 0
 rcvdis2      successful return of t_rcvdis with     T_COTS, T_COTS_ORD
              ocnt == 1
 rcvdis3      successful return of t_rcvdis with     T_COTS, T_COTS_ORD
              ocnt > 1
 rcvrel       successful return of t_rcvrel          T_COTS_ORD
 rcvudata     successful return of t_rcvudata        T_CLTS
 rcvuderr     successful return of t_rcvuderr        T_CLTS
 pass_conn    receive a passed connection            T_COTS, T_COTS_ORD

Transport User Actions
In the state tables that follow, some state transitions are accompanied by a list of
actions the transport user must take. These actions are represented by the notation [n], where n is the number of the specific action as described below.
[1] Set the count of outstanding connect indications to zero.
[2] Increment the count of outstanding connect indications.
[3] Decrement the count of outstanding connect indications.
[4] Pass a connection to another transport endpoint as indicated in t_accept.
State Tables
The following tables describe the Transport Interface state transitions. Given a
current state and an event, the transition to the next state is shown, as well as any
actions that must be taken by the transport user (indicated by [n]). The state is
that of the transport provider as seen by the transport user.
The contents of each box represent the next state, given the current state
(column) and the current incoming or outgoing event (row). An empty box
represents a state/event combination that is invalid. Along with the next state,
each box may include an action list (as specified in the previous section). The
transport user must take the specific actions in the order specified in the state
table.
The following should be understood when studying the state tables:
o
The t_close routine is referenced in the state tables (see closed event in
Table 9-7), but may be called from any state to close a transport endpoint. If
t_close is called when a transport address is bound to an endpoint, the
address will be unbound. Also, if t _close is called when the transport
connection is still active, the connection will be aborted.
o
If a transport user issues a routine out of sequence, the transport provider
will recognize this and the routine will fail, setting t_errno to TOUTSTATE. The state will not change.
o
If any other transport error occurs, the state will not change unless explicitly
stated on the manual page for that routine. The exception to this is a
TLOOK or TNODATA error on t_connect, as described in Table 9-7.
o
The state tables assume correct use of the Transport Interface. The support
routines t_getinfo, t_getstate, t_alloc, t_free, t_sync,
t_look, and t_error are excluded from the state tables because they do
not affect the state.
A separate table is shown for common local management steps, data transfer in
connectionless-mode, and connection-establishment/connection-release/data-transfer in connection-mode.
Figure 9-6   Common Local Management State Table

  state      T_UNINIT    T_UNBND      T_IDLE
 event
 opened      T_UNBND
 bind                    T_IDLE [1]
 optmgmt                              T_IDLE
 unbind                               T_UNBND
 closed                  T_UNINIT

Figure 9-7   Connectionless-Mode State Table

  state      T_IDLE
 event
 sndudata    T_IDLE
 rcvudata    T_IDLE
 rcvuderr    T_IDLE
Figure 9-8   Connection-Mode State Table

  state      T_IDLE       T_OUTCON     T_INCON         T_DATAXFER   T_OUTREL   T_INREL
 event
 connect1    T_DATAXFER
 connect2    T_OUTCON
 rcvconnect               T_DATAXFER
 listen      T_INCON [2]               T_INCON [2]
 accept1                               T_DATAXFER [3]
 accept2                               T_IDLE [3][4]
 accept3                               T_INCON [3][4]
 snd                                                   T_DATAXFER              T_INREL
 rcv                                                   T_DATAXFER   T_OUTREL
 snddis1                  T_IDLE       T_IDLE [3]      T_IDLE       T_IDLE     T_IDLE
 snddis2                               T_INCON [3]
 rcvdis1                  T_IDLE                       T_IDLE       T_IDLE     T_IDLE
 rcvdis2                               T_IDLE [3]
 rcvdis3                               T_INCON [3]
 sndrel                                                T_OUTREL                T_IDLE
 rcvrel                                                T_INREL      T_IDLE
 pass_conn   T_DATAXFER

9.9. Guidelines for Protocol Independence
By defining a set of services common to many transport protocols, the Transport
Interface offers protocol independence for user software. However, not all transport
protocols support all of the services defined by the Transport Interface. If
software must be run in a variety of protocol environments, only the common
services should be accessed. The following guidelines highlight services that
may not be common to all transport protocols.
o
In the connection-mode service, the concept of a transport service data unit
(TSDU) may not be supported by all transport providers. The user should
make no assumptions about the preservation of logical data boundaries
across a connection. If messages must be transferred over a connection, a
protocol should be implemented above the Transport Interface to support
message boundaries.
o
Protocol and implementation specific service limits are returned by the
t_open and t_getinfo routines. These limits are useful when allocating buffers to store protocol-specific transport addresses and options. It is
the responsibility of the user to access these limits and then adhere to the
limits throughout the communication process.
o
User data should not be transmitted with connect requests or disconnect
requests [see t_connect(3N) and t_snddis(3N)]. Not all transport protocols support this capability.
o
The buffers in the t_call structure used for t_listen must be large
enough to hold any information passed by the client during connection
establishment. The server should use the T_ALL argument to t_alloc,
which will determine the maximum buffer sizes needed to store the address,
options, and user data for the current transport provider.
o
The user program should not look at or change options that are associated
with any Transport Interface routine. These options are specific to the
underlying transport protocol. The user should choose not to pass options
with t_connect or t_sndudata. In such cases, the transport provider
will use default values. Also, a server should use the options returned by
t_listen when accepting a connection.
o
Protocol-specific addressing issues should be hidden from the user program.
A client should not specify any protocol address on t_bind, but instead
should allow the transport provider to assign an appropriate address to the
transport endpoint. Similarly, a server should retrieve its address for
t_bind in such a way that it does not require knowledge of the transport
provider's address space. Such addresses should not be hard-coded into a
program. A name server mechanism could be useful in this scenario, but the
details for providing such a service are outside the scope of the Transport
Interface.
o
The reason codes associated with t_rcvdis are protocol-dependent. The
user should not interpret this information if protocol independence is a concern.
o
The error codes associated with t_rcvuderr are protocol-dependent.
The user should not interpret this information if protocol independence is a
concern.
o
The names of devices should not be hard-coded into programs, because the
device node identifies a particular transport provider, and is not protocol
independent.
o
The optional orderly release facility of the connection-mode service (provided by t_sndrel and t_rcvrel) should not be used by programs
targeted for multiple protocol environments. This facility is not supported
by all connection-based transport protocols. In particular, its use will
prevent programs from successfully communicating with ISO open systems.

9.10. Some Examples
The examples presented throughout this guide are shown in their entirety in this
section.
Connection-Mode Client
The following code represents the connection-mode client program described in
the Connection-Mode Client section. This client establishes a transport connection with a server, and then receives data from the server and writes it to its standard output. The connection is released using the orderly release facility of the
Transport Interface. This client will communicate with each of the connection-mode servers presented in the guide.
#include <stdio.h>
#include <tiuser.h>
#include <fcntl.h>

#define SRV_ADDR 1           /* server's well known address */

main()
{
    int fd;
    int nbytes;
    int flags = 0;
    char buf[1024];
    struct t_call *sndcall;
    extern int t_errno;

    if ((fd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
        t_error("t_open failed");
        exit(1);
    }
    if (t_bind(fd, NULL, NULL) < 0) {
        t_error("t_bind failed");
        exit(2);
    }

    /*
     * By assuming that the address is an integer value,
     * this program may not run over another protocol.
     */
    if ((sndcall = (struct t_call *)t_alloc(fd,
        T_CALL, T_ADDR)) == NULL) {
        t_error("t_alloc failed");
        exit(3);
    }
    sndcall->addr.len = sizeof(int);
    *(int *)sndcall->addr.buf = SRV_ADDR;

    if (t_connect(fd, sndcall, NULL) < 0) {
        t_error("t_connect failed for fd");
        exit(4);
    }

    while ((nbytes = t_rcv(fd, buf, 1024, &flags)) != -1) {
        if (fwrite(buf, 1, nbytes, stdout) < 0) {
            fprintf(stderr, "fwrite failed\n");
            exit(5);
        }
    }

    if ((t_errno == TLOOK) && (t_look(fd) == T_ORDREL)) {
        if (t_rcvrel(fd) < 0) {
            t_error("t_rcvrel failed");
            exit(6);
        }
        if (t_sndrel(fd) < 0) {
            t_error("t_sndrel failed");
            exit(7);
        }
        exit(0);
    }
    t_error("t_rcv failed");
    exit(8);
}
Connection-Mode Server
The following code represents the connection-mode server program described in
the Connection-Mode Client section. This server establishes a transport connection with a client, and then transfers a log file to the client on the other side of the
connection. The connection is released using the orderly release facility of the
Transport Interface. The connection-mode client presented earlier will communicate with this server.
#include <tiuser.h>
#include <stropts.h>
#include <fcntl.h>
#include <stdio.h>
#include <signal.h>

#define DISCONNECT -1
#define SRV_ADDR    1        /* server's well known address */

int conn_fd;                 /* connection established here */
extern int t_errno;

main()
{
    int listen_fd;           /* listening transport endpoint */
    struct t_bind *bind;
    struct t_call *call;

    if ((listen_fd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
        t_error("t_open failed for listen_fd");
        exit(1);
    }

    /*
     * By assuming that the address is an integer value,
     * this program may not run over another protocol.
     */
    if ((bind = (struct t_bind *)t_alloc(listen_fd,
        T_BIND, T_ALL)) == NULL) {
        t_error("t_alloc of t_bind structure failed");
        exit(2);
    }
    bind->qlen = 1;
    bind->addr.len = sizeof(int);
    *(int *)bind->addr.buf = SRV_ADDR;
    if (t_bind(listen_fd, bind, bind) < 0) {
        t_error("t_bind failed for listen_fd");
        exit(3);
    }

    /*
     * Was the correct address bound?
     */
    if (*(int *)bind->addr.buf != SRV_ADDR) {
        fprintf(stderr, "t_bind bound wrong address\n");
        exit(4);
    }

    if ((call = (struct t_call *)t_alloc(listen_fd,
        T_CALL, T_ALL)) == NULL) {
        t_error("t_alloc of t_call structure failed");
        exit(5);
    }

    while (1) {
        if (t_listen(listen_fd, call) < 0) {
            t_error("t_listen failed for listen_fd");
            exit(6);
        }
        if ((conn_fd = accept_call(listen_fd, call)) != DISCONNECT)
            run_server(listen_fd);
    }
}

accept_call(listen_fd, call)
int listen_fd;
struct t_call *call;
{
    int resfd;

    if ((resfd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
        t_error("t_open for responding fd failed");
        exit(7);
    }
    if (t_bind(resfd, NULL, NULL) < 0) {
        t_error("t_bind for responding fd failed");
        exit(8);
    }
    if (t_accept(listen_fd, resfd, call) < 0) {
        if (t_errno == TLOOK) {      /* must be a disconnect */
            if (t_rcvdis(listen_fd, NULL) < 0) {
                t_error("t_rcvdis failed for listen_fd");
                exit(9);
            }
            if (t_close(resfd) < 0) {
                t_error("t_close failed for responding fd");
                exit(10);
            }
            /* go back up and listen for other calls */
            return(DISCONNECT);
        }
        t_error("t_accept failed");
        exit(11);
    }
    return(resfd);
}

connrelease()
{
    /* conn_fd is global because needed here */
    if (t_look(conn_fd) == T_DISCONNECT) {
        fprintf(stderr, "connection aborted\n");
        exit(12);
    }
    /* else orderly release indication - normal exit */
    exit(0);
}

run_server(listen_fd)
int listen_fd;
{
    int nbytes;
    FILE *logfp;             /* file pointer to log file */
    char buf[1024];

    switch (fork()) {
    case -1:
        perror("fork failed");
        exit(20);
    default:                 /* parent */
        /* close conn_fd and then go up and listen again */
        if (t_close(conn_fd) < 0) {
            t_error("t_close failed for conn_fd");
            exit(21);
        }
        return;
    case 0:                  /* child */
        /* close listen_fd and do service */
        if (t_close(listen_fd) < 0) {
            t_error("t_close failed for listen_fd");
            exit(22);
        }
        if ((logfp = fopen("logfile", "r")) == NULL) {
            perror("cannot open logfile");
            exit(23);
        }
        signal(SIGPOLL, connrelease);
        if (ioctl(conn_fd, I_SETSIG, S_INPUT) < 0) {
            perror("ioctl I_SETSIG failed");
            exit(24);
        }
        if (t_look(conn_fd) != 0) {  /* was disconnect there? */
            fprintf(stderr, "t_look: unexpected event\n");
            exit(25);
        }
        while ((nbytes = fread(buf, 1, 1024, logfp)) > 0) {
            if (t_snd(conn_fd, buf, nbytes, 0) < 0) {
                t_error("t_snd failed");
                exit(26);
            }
        }
        if (t_sndrel(conn_fd) < 0) {
            t_error("t_sndrel failed");
            exit(27);
        }
        pause();    /* until orderly release indication arrives */
    }
}
Connectionless-Mode
Transaction Server
/ * until orderly release indication arrives * /
The following code represents the connectionless-mode transaction server program described in the Introduction to Connectionless-Mode Service section. This server waits for incoming datagram queries, then processes each query and sends a response.
#include <stdio.h>
#include <fcntl.h>
#include <tiuser.h>

#define SRV_ADDR        2       /* server's well known address */

main()
{
        int fd;
        int flags;
        struct t_bind *bind;
        struct t_unitdata *ud;
        struct t_uderr *uderr;
        extern int t_errno;

        if ((fd = t_open("/dev/tidg", O_RDWR, NULL)) < 0) {
                t_error("unable to open /dev/provider");
                exit(1);
        }
        if ((bind = (struct t_bind *)t_alloc(fd,
            T_BIND, T_ADDR)) == NULL) {
                t_error("t_alloc of t_bind structure failed");
                exit(2);
        }
        bind->addr.len = sizeof(int);
        *(int *)bind->addr.buf = SRV_ADDR;
        bind->qlen = 0;

        if (t_bind(fd, bind, bind) < 0) {
                t_error("t_bind failed");
                exit(3);
        }
        /*
         * Is the bound address correct?
         */
        if (*(int *)bind->addr.buf != SRV_ADDR) {
                fprintf(stderr, "t_bind bound wrong address\n");
                exit(4);
        }
        if ((ud = (struct t_unitdata *)t_alloc(fd,
            T_UNITDATA, T_ALL)) == NULL) {
                t_error("t_alloc of t_unitdata structure failed");
                exit(5);
        }
        if ((uderr = (struct t_uderr *)t_alloc(fd,
            T_UDERROR, T_ALL)) == NULL) {
                t_error("t_alloc of t_uderr structure failed");
                exit(6);
        }
        while (1) {
                if (t_rcvudata(fd, ud, &flags) < 0) {
                        if (t_errno == TLOOK) {
                                /*
                                 * Error on previously sent datagram
                                 */
                                if (t_rcvuderr(fd, uderr) < 0) {
                                        t_error("t_rcvuderr failed");
                                        exit(7);
                                }
                                fprintf(stderr, "bad datagram, error = %d\n",
                                    uderr->error);
                                continue;
                        }
                        t_error("t_rcvudata failed");
                        exit(8);
                }
                /*
                 * query() processes the request and places the
                 * response in ud->udata.buf, setting ud->udata.len
                 */
                query(ud);
                if (t_sndudata(fd, ud, 0) < 0) {
                        t_error("t_sndudata failed");
                        exit(9);
                }
        }
}

query()
{
        /* Merely a stub for simplicity */
}
Read/Write Client
The following code represents the connection-mode read/write client program described in the A Read/Write Interface section. This client establishes a transport connection with a server, and then uses cat(1) to retrieve the data sent by the server and write it to its standard output. This client will communicate with each of the connection-mode servers presented in the guide.
#include <stdio.h>
#include <tiuser.h>
#include <fcntl.h>
#include <stropts.h>

#define SRV_ADDR        1       /* server's well known address */

main()
{
        int fd;
        int nbytes;
        int flags = 0;
        char buf[1024];
        struct t_call *sndcall;
        extern int t_errno;

        if ((fd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
                t_error("t_open failed");
                exit(1);
        }
        if (t_bind(fd, NULL, NULL) < 0) {
                t_error("t_bind failed");
                exit(2);
        }
        /*
         * By assuming that the address is an integer value,
         * this program may not run over another protocol.
         */
        if ((sndcall = (struct t_call *)t_alloc(fd,
            T_CALL, T_ADDR)) == NULL) {
                t_error("t_alloc failed");
                exit(3);
        }
        sndcall->addr.len = sizeof(int);
        *(int *)sndcall->addr.buf = SRV_ADDR;

        if (t_connect(fd, sndcall, NULL) < 0) {
                t_error("t_connect failed for fd");
                exit(4);
        }
        if (ioctl(fd, I_PUSH, "tirdwr") < 0) {
                perror("I_PUSH of tirdwr failed");
                exit(5);
        }
        close(0);
        dup(fd);
        execl("/bin/cat", "/bin/cat", 0);
        perror("execl of /bin/cat failed");
        exit(6);
}
Event-Driven Server
The following code represents the connection-mode server program described in the Advanced Topics section. This server manages multiple connect indications in an event-driven manner. Either connection-mode client presented earlier will communicate with this server.
#include <tiuser.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <stropts.h>
#include <signal.h>

#define NUM_FDS         1
#define MAX_CONN_IND    4
#define SRV_ADDR        1       /* server's well known address */

int conn_fd;                    /* server connection here */
extern int t_errno;

/* holds connect indications */
struct t_call *calls[NUM_FDS][MAX_CONN_IND];

main()
{
        struct pollfd pollfds[NUM_FDS];
        struct t_bind *bind;
        int i;

        /*
         * Only opening and binding one transport endpoint,
         * but more could be supported
         */
        if ((pollfds[0].fd = t_open("/dev/tivc", O_RDWR, NULL)) < 0) {
                t_error("t_open failed");
                exit(1);
        }
        if ((bind = (struct t_bind *)t_alloc(pollfds[0].fd,
            T_BIND, T_ALL)) == NULL) {
                t_error("t_alloc of t_bind structure failed");
                exit(2);
        }
        bind->qlen = MAX_CONN_IND;
        bind->addr.len = sizeof(int);
        *(int *)bind->addr.buf = SRV_ADDR;

        if (t_bind(pollfds[0].fd, bind, bind) < 0) {
                t_error("t_bind failed");
                exit(3);
        }
        /*
         * Was the correct address bound?
         */
        if (*(int *)bind->addr.buf != SRV_ADDR) {
                fprintf(stderr, "t_bind bound wrong address\n");
                exit(4);
        }
        pollfds[0].events = POLLIN;

        while (1) {
                if (poll(pollfds, NUM_FDS, -1) < 0) {
                        perror("poll failed");
                        exit(5);
                }
                for (i = 0; i < NUM_FDS; i++) {
                        switch (pollfds[i].revents) {
                        default:
                                perror("poll returned error event");
                                exit(6);
                        case 0:
                                continue;
                        case POLLIN:
                                do_event(i, pollfds[i].fd);
                                service_conn_ind(i, pollfds[i].fd);
                        }
                }
        }
}
do_event (slot, fd)
{
struct t discon *discon;
int i;
switch (t_look(fd»
default:
fprintf(stderr,"t_look: unexpected event\O);
exit(7);
case TERROR:
fprintf(stderr,"t_look returned TERROR event\O);
exit(8);
case -1:
t error("t_look failed");
exit(9);
Revision A, of 27 March 1990
Chapter 9 - Transport Level Interface Programming
245
case 0:
/ * since POWN returned, this should not happen * /
fprintf(stderr,"t_Iook returned no event\O):
exit(lO):
case T LISTEN:
1*
* find free element in calls array
*1
for (i = 0; i < MAX CONN_IND: i++)
if (calls[slot] [i] == NULL)
break:
if «calls[slot] [i] = (struct t
T_CALL, T_ALL»
== NULL)
call *)t_alloc(fd,
t_error(nt_alloc of t call structure failed"):
exit(ll):
if (t_Iisten (fd, calls [slot] [i]) < 0)
t_error(nt_listen failed"):
exit(12):
{
break:
case T DISCONNECT:
discon = (struct t discon *)t_alloc(fd,
T_DIS, T_ALL);
if (t_rcvdis(fd, discon) < 0) {
t_error(lIt_rcvdis failed"):
exit (13) ;
/*
* find call ind in array and delete it
*1
for (i = 0; i < MAX CONN_IND; i++)
if (discon->sequence ==
calls[slot] [i]->sequence) {
t_free(calls[slot] [i], T_CALL):
calls[slot] [i] = NULL:
t_free(discon, T_DIS):
break:
service_conn_ind(slot, fd)
{
        int i;

        for (i = 0; i < MAX_CONN_IND; i++) {
                if (calls[slot][i] == NULL)
                        continue;
                if ((conn_fd = t_open("/dev/tivc",
                    O_RDWR, NULL)) < 0) {
                        t_error("open failed");
                        exit(14);
                }
                if (t_bind(conn_fd, NULL, NULL) < 0) {
                        t_error("t_bind failed");
                        exit(15);
                }
                if (t_accept(fd, conn_fd, calls[slot][i]) < 0) {
                        if (t_errno == TLOOK) {
                                t_close(conn_fd);
                                return;
                        }
                        t_error("t_accept failed");
                        exit(16);
                }
                t_free(calls[slot][i], T_CALL);
                calls[slot][i] = NULL;
                run_server(fd);
        }
}
connrelease()
{
        /* conn_fd is global because needed here */
        if (t_look(conn_fd) == T_DISCONNECT) {
                fprintf(stderr, "connection aborted\n");
                exit(12);
        }
        /* else orderly release indication - normal exit */
        exit(0);
}

run_server(listen_fd)
int listen_fd;
{
        int nbytes;
        FILE *logfp;            /* file pointer to log file */
        char buf[1024];

        switch (fork()) {
        case -1:
                perror("fork failed");
                exit(20);
        default:                /* parent */
                /* close conn_fd and then go up and listen again */
                if (t_close(conn_fd) < 0) {
                        t_error("t_close failed for conn_fd");
                        exit(21);
                }
                return;
        case 0:                 /* child */
                /* close listen_fd and do service */
                if (t_close(listen_fd) < 0) {
                        t_error("t_close failed for listen_fd");
                        exit(22);
                }
                if ((logfp = fopen("logfile", "r")) == NULL) {
                        perror("cannot open logfile");
                        exit(23);
                }
                signal(SIGPOLL, connrelease);
                if (ioctl(conn_fd, I_SETSIG, S_INPUT) < 0) {
                        perror("ioctl I_SETSIG failed");
                        exit(24);
                }
                if (t_look(conn_fd) != 0) {   /* disconnect already there? */
                        fprintf(stderr, "t_look: unexpected event\n");
                        exit(25);
                }
                while ((nbytes = fread(buf, 1, 1024, logfp)) > 0)
                        if (t_snd(conn_fd, buf, nbytes, 0) < 0) {
                                t_error("t_snd failed");
                                exit(26);
                        }
                if (t_sndrel(conn_fd) < 0) {
                        t_error("t_sndrel failed");
                        exit(27);
                }
                pause();        /* until orderly release indication arrives */
        }
}
9.11. Glossary
The following terms apply to the Transport Interface:
Abortive release
An abrupt termination of a transport connection, which may result in the loss
of data.
Asynchronous execution
The mode of execution in which Transport Interface routines will never
block while waiting for specific asynchronous events to occur, but instead
will return immediately if the event is not pending.
Client
The transport user in connection-mode that initiates the establishment of a
transport connection.
Connection establishment
The phase in connection-mode that enables two transport users to create a
transport connection between them.
Connection-mode
A circuit-oriented mode of transfer in which data are passed from one user to
another over an established connection in a reliable, sequenced manner.
Connectionless-mode
A mode of transfer in which data are passed from one user to another in
self-contained units with no logical relationship required among multiple
units.
Connection release
The phase in connection-mode that terminates a previously established transport connection between two users.
Datagram
A unit of data transferred between two users of the connectionless-mode service.
Data transfer
The phase in connection-mode or connectionless-mode that supports the
transfer of data between two transport users.
Expedited data
Data that are considered urgent. The specific semantics of expedited data
are defined by the transport protocol that provides the transport service.
Expedited transport service data
The amount of expedited user data the identity of which is preserved from
one end of a transport connection to the other (that is, an expedited message).
Local management
The phase in either connection-mode or connectionless-mode in which a
transport user establishes a transport endpoint and binds a transport address
to the endpoint. Functions in this phase perform local operations, and
require no transport layer traffic over the network.
Orderly release
A procedure for gracefully terminating a transport connection with no loss of
data.
Peer user
The user with whom a given user is communicating above the Transport
Interface.
Server
The transport user in connection-mode that offers services to other users
(clients) and enables these clients to establish a transport connection to it.
Service indication
The notification of a pending event generated by the provider to a user of a
particular service.
Service primitive
The unit of information passed across a service interface that contains either a service request or service indication.
Service request
A request for some action generated by a user to the provider of a particular
service.
Synchronous execution
The mode of execution in which Transport Interface routines may block
while waiting for specific asynchronous events to occur.
Transport address
The identifier used to differentiate and locate specific transport endpoints in
a network.
Transport connection
The communication circuit that is established between two transport users in
connection-mode.
Transport endpoint
The local communication channel between a transport user and a transport
provider.
Transport Interface
The library routines and state transition rules that support the services of a
transport protocol.
Transport provider
The transport protocol that provides the services of the Transport Interface.
Transport service data unit
The amount of user data whose identity is preserved from one end of a transport connection to the other (that is, a message).
Transport user
The user-level application or protocol that accesses the services of the Transport Interface.
Virtual circuit
A transport connection established in connection-mode.

The following acronyms are used throughout this guide:
CLTS
Connectionless Transport Service
COTS
Connection Oriented Transport Service
ETSDU
Expedited Transport Service Data Unit
TSDU
Transport Service Data Unit
10
A Socket-Based Interprocess
Communications Tutorial
WARNING
Socket-based interprocess communication (IPC), while still supported, is no longer the preferred framework for transport-level programming. Socket-based IPC has been superseded as the "standard" method of accessing network protocols by a set of OSI-compatible transport mechanisms based upon STREAMS and accessed by way of a Transport Library Interface (TLI). For details on the TLI, see the previous chapter, Transport Level Interface Programming.
If you are building a new network application that requires direct access to transport facilities, use the TLI mechanisms. If you do not require such direct access, Remote Procedure Calls (RPC) are the preferred programming framework; see the Remote Procedure Call Programming Guide section of this manual for details. New programs should not be based on sockets.
Various approaches are possible within the socket paradigm; this manual discusses them, and then illustrates them by way of a series of example programs. These programs demonstrate in a simple way the use of pipes, socketpairs, and datagram socket and stream socket communication.
NOTE
Unlike RPC-based networking (which presumes XDR), socket-based IPC does not contain a mechanism for ensuring architecture-independent code. Socket-based programs must make judicious use of the host-to-network byte-order conversion macros described in byteorder(3N) if they are to be portable.
The intent of this chapter is to present a few simple example programs, not to describe the socket-based networking facilities in full. For more information, see the next chapter, An Advanced Socket-Based Interprocess Communications Tutorial.
10.1. Goals
Facilities for interprocess communication (IPC) and networking were a major
addition to the UNIX system - first introduced in 4.2BSD. These facilities
required major additions and some changes to the system interface. The basic
idea of this interface is to make IPC similar to file I/O. In the UNIX system a
process has a set of I/O descriptors, from which one reads and to which one
writes. Descriptors may refer to normal files, to devices (including terminals), or
to communication channels. The use of a descriptor has three phases: creation,
use for reading and writing, and destruction. By using descriptors to write files,
rather than simply naming the target file in the write call, one gains a surprising
amount of flexibility. Often, the program that creates a descriptor will be different from the program that uses the descriptor. For example, the shell can create a descriptor for the output of the ls command that will cause the listing to appear in a file rather than on a terminal. Pipes are another form of descriptor
that have been used in the UNIX system for some time. Pipes allow one-way
data transmission from one process to another; the two processes and the pipe
must be set up by a common ancestor.
The use of descriptors is not the only communication interface provided by the
UNIX system. The signal mechanism sends a tiny amount of information from
one process to another. The signaled process receives only the signal type, not
the identity of the sender, and the number of possible signals is small. The signal
semantics limit the flexibility of the signaling mechanism as a means of interprocess communication.
The identification of IPC with I/O is quite longstanding in the UNIX system and
has proved quite successful. At first, however, IPC was limited to processes
communicating within a single machine. With 4.2BSD this expanded to include
IPC between machines. This expansion has necessitated some change in the way
that descriptors are created. Additionally, new possibilities for the meaning of
read and write have been admitted. Originally the meanings, or semantics, of
these terms were fairly simple. When you wrote something, it was delivered.
When you read something, you were blocked until the data arrived. Other possibilities exist, however. One can write without full assurance of delivery if one
can check later to catch occasional failures. Messages can be kept as discrete
units or merged into a stream. One can ask to read, but insist on not waiting if
nothing is immediately available. These new possibilities were implemented in
4.3BSD and then incorporated into SunOS.
Socket-based IPC offers several choices. This chapter presents simple examples
that illustrate some of them. The reader is presumed to be familiar with the C
programming language, but not necessarily with UNIX system calls or processes
and interprocess communication. The chapter reviews the notion of a process
and the types of communication that are supported by the socket abstraction. A
series of examples are presented that create processes that communicate with one
another. The programs show different ways of establishing channels of communication. Finally, the calls that actually transfer data are reviewed. To clearly
present how communication can take place, the example programs have been
cleared of anything that might be construed as useful work. They can serve as
models for the programmer trying to construct programs that are composed of
cooperating processes.
10.2. Processes
A process can be thought of as a single line of control in a program. Programs
can have a point where control splits into two independent lines, an action called
forking. In the UNIX system these lines can never join again. A call to the system routine fork() causes a process to split in this way. The result of this call
is that two independent processes will be running, executing exactly the same
code. Memory values will be the same for all values set before the fork, but, subsequently, each version will be able to change only the value of its own copy of
each variable. Initially, the only difference between the two will be the value
Revision A, of 27 March 1990
Chapter 10 - A Socket-Based Interprocess Communications Tutorial
253
returned by fork(). The parent will receive the process id of the child; the child will receive zero. Calls to fork() typically precede, or are included in, an if statement.
A process views the rest of the system through a private table of descriptors. The
descriptors can represent open files or sockets (sockets are the endpoints of communications channels, as discussed below). Descriptors are referred to by their
index numbers in the table. The first three descriptors are often known by special
names, stdin, stdout, and stderr. These are the standard input, output, and error.
When a process forks, its descriptor table is copied to the child. Thus, if the
parent's standard input is being taken from a terminal (devices are also treated as
files in the UNIX system), the child's input will be taken from the same terminal.
Whoever reads first will get the input. If, before forking, the parent changes its
standard input so that it is reading from a new file, the child will take its input
from the new file. It is also possible to take input from a socket, rather than from
a file.
10.3. Pipes
Most users of the UNIX system know that they can pipe the output of a program
progl, to the input of another, prog2, by typing the command
example# prog1 | prog2
This is called "piping" the output of one program to another because the mechanism used to transfer the output is called a pipe. When the user types a command,
the command is read by the shell, which decides how to execute it. If the command is simple, for example,
example# prog1
the shell forks a process, which executes the program, prog1, and then dies.
The shell waits for this termination and then prompts for the next command. If
the command is a compound command,
example# prog1 | prog2
the shell creates two processes connected by a pipe. One process runs the program, prog1; the other runs prog2. The pipe is an I/O mechanism with two
ends. Data that is written into one end can be read from the other.
Since a program specifies its input and output only by the descriptor table
indices, the input source and output destination can be changed without changing
the text of the program. It is in this way that the shell is able to set up pipes.
Before executing prog1, the process can close whatever is at stdout and replace
it with one end of a pipe. Similarly, the process that will execute prog2 can
substitute the opposite end of the pipe for stdin.
Now let's examine a program that creates a pipe for communication between its
child and itself. A pipe is created by a parent process, which then forks. When a
process forks, the parent's descriptor table is copied into the child's.
Figure 10-1
Use of a Pipe

#include <stdio.h>

#define DATA "Bright star, would I . . . "

/*
 * This program creates a pipe, then forks. The child communicates to the
 * parent over the pipe. Notice that a pipe is a one-way communications
 * device. I can write to the output socket (sockets[1], the second
 * socket of the array returned by pipe) and read from the input
 * socket (sockets[0]), but not vice versa.
 */
main()
{
        int sockets[2], child;

        /* Create a pipe */
        if (pipe(sockets) < 0) {
                perror("opening stream socket pair");
                exit(10);
        }

        if ((child = fork()) == -1)
                perror("fork");
        else if (child) {
                char buf[1024];

                /* This is still the parent. It reads the child's message. */
                close(sockets[1]);
                if (read(sockets[0], buf, 1024) < 0)
                        perror("reading message");
                printf("-->%s\n", buf);
                close(sockets[0]);
        } else {
                /* This is the child. It writes a message to its parent. */
                close(sockets[0]);
                if (write(sockets[1], DATA, sizeof(DATA)) < 0)
                        perror("writing message");
                close(sockets[1]);
        }
        exit(0);
}
Here the parent process makes a call to the system routine pipe(). This routine creates a pipe and places descriptors for the sockets for the two ends of the pipe in the process's descriptor table. pipe() is passed an array into which it places the index numbers of the sockets it creates. The two ends are not equivalent. The socket whose index is returned in the first word of the array is opened for reading only, while the socket in the second word is opened only for writing. This corresponds to the fact that the standard input is the first descriptor of a process's descriptor table and the standard output is the second. After creating the pipe, the parent creates the child with which it will share the pipe by calling fork().
The following figure illustrates the effect of such a call to fork(). The parent
process's descriptor table points to both ends of the pipe. After the fork, both
parent's and child's descriptor tables point to the pipe. The child can then use
the pipe to send a message to the parent.
Figure 10-2
Sharing a Pipe between Parent and Child
[Diagram: before the fork, the parent's descriptor table references both ends of the pipe; after the fork, the parent's and child's descriptor tables each reference both ends of the pipe.]
Just what is a pipe? It is a one-way communication mechanism, with one end opened for reading and the other end for writing. Therefore, parent and child need to agree on which way to turn the pipe, from parent to child or the other way around. Using the same pipe for communication both from parent to child and from child to parent would be possible (since both processes have references to both ends), but very complicated. If the parent and child are to have a two-way conversation, the parent creates two pipes, one for use in each direction. (In
accordance with their plans, both parent and child in the example above close the
socket that they will not use. It is not required that unused descriptors be closed,
but it is good practice.) A pipe is also a stream communication mechanism; that
is, all messages sent through the pipe are placed in order and reliably delivered.
When the reader asks for a certain number of bytes from this stream, it is given
as many bytes as are available, up to the amount of the request. Note that these
bytes may have come from the same call to write() or from several calls to write() that were concatenated.
10.4. Socketpairs
SunOS provides a slight generalization of pipes. A pipe is now a pair of connected sockets for one-way stream communication. One may obtain a pair of connected sockets for two-way stream communication by calling the routine socketpair(). The program in Figure 10-3, below, calls socketpair() to create such a connection. The program uses the link for communication in both directions. Since socketpairs are an extension of pipes, their use resembles that of pipes. Figure 10-4 illustrates the result of a fork following a call to socketpair().
socketpair() takes as arguments a specification of a communication domain, a style of communication, and a protocol. These are the parameters shown in the example. Domains and protocols will be discussed in the next section. Briefly, a domain specifies a socket name space and implies a set of conventions for manipulating socket names. Currently, socketpairs have only been implemented for the UNIX domain. The UNIX domain uses UNIX path names for naming sockets. It only allows communication between sockets on the same machine.
Note that the header files <sys/socket.h> and <sys/types.h> are required in this program. The constants AF_UNIX and SOCK_STREAM are defined in <sys/socket.h>, which in turn requires the file <sys/types.h> for some of its definitions.
Figure 10-3
Use of a Socketpair

#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

#define DATA1 "In Xanadu, did Kublai Khan . . . "
#define DATA2 "A stately pleasure dome decree . . . "

/*
 * This program creates a pair of connected sockets, then forks and
 * communicates over them. This is very similar to communication with pipes;
 * however, socketpairs are two-way communications objects. Therefore I can
 * send messages in both directions.
 */
main()
{
        int sockets[2], child;
        char buf[1024];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockets) < 0) {
                perror("opening stream socket pair");
                exit(1);
        }

        if ((child = fork()) == -1)
                perror("fork");
        else if (child) {
                /* This is the parent */
                close(sockets[0]);
                if (read(sockets[1], buf, 1024) < 0)
                        perror("reading stream message");
                printf("-->%s\n", buf);
                if (write(sockets[1], DATA2, sizeof(DATA2)) < 0)
                        perror("writing stream message");
                close(sockets[1]);
        } else {
                /* This is the child */
                close(sockets[1]);
                if (write(sockets[0], DATA1, sizeof(DATA1)) < 0)
                        perror("writing stream message");
                if (read(sockets[0], buf, 1024) < 0)
                        perror("reading stream message");
                printf("-->%s\n", buf);
                close(sockets[0]);
        }
        exit(0);
}
Figure 10-4
Sharing a Socketpair between Parent and Child
[Diagram: after the fork, the parent's and child's descriptor tables each reference both sockets of the socketpair, giving a two-way channel.]

10.5. Domains and Protocols
Pipes and socketpairs are a simple solution for communicating between a parent and child or between child processes. What if we wanted to communicate between processes that have no common ancestor? Neither standard UNIX pipes nor socketpairs are the answer here, since both mechanisms require a common ancestor to set up the communication. We would like to have two processes separately create sockets and then have messages sent between them. This is often the case when providing or using a service in the system. This is also the case when the communicating processes are on separate machines.

Sockets created by different programs use names to refer to one another; names generally must be translated into addresses for use. The space from which an address is drawn is specified by a domain. There are several such domains for sockets. Two that will be used in the examples here are the UNIX domain (or AF_UNIX, for Address Format UNIX) and the Internet domain (or AF_INET). In the UNIX
domain, a socket is given a path name within the file system name space. A file
system node is created for the socket and other processes may then refer to it by
giving its pathname. UNIX domain names, thus, allow communication between
any two processes that reside on the same machine and that are able to access the
socket pathnames. The Internet domain is the UNIX implementation of the
DARPA Internet standard protocols IP/TCP/UDP. Addresses in the Internet
domain consist of a machine network address and an identifying number, called a
port. Internet domain names allow communication between separate machines.
Communication follows some particular "style." Currently, communication is
either through a stream socket or by datagram. Stream communication implies a
connection. The communication is reliable, error-free, and, as in pipes, no message boundaries are kept. Reading from a stream may result in reading the data
sent from one or several calls to wri te () or only part of the data from a single
call, if there is not enough room for the entire message, or if not all the data from
a large message has been transferred. The protocol implementing such a style
will retransmit messages received with errors. It will also return error messages
if one tries to send a message after the connection has been broken. Datagram
communication does not use connections. Each message is addressed individually. If the address is correct, it will generally be received, although this is not
guaranteed. Often datagrams are used for requests that require a response from
the recipient. If no response arrives in a reasonable amount of time, the request
is repeated. The individual datagrams will be kept separate when they are read,
that is, message boundaries are preserved.
NOTE
Sockets under TLI Emulation: write() should fail with errno set to ENOTCONN if it is used on an unconnected socket; however, under TLI emulation, it will instead return success. Likewise, write() should fail with errno set to EPIPE if a connection is broken, but instead it will return with errno set to ENXIO. Similarly, read() should fail with errno set to ENOTCONN if it is used on an unconnected socket, but instead it will return success, with zero bytes read. In all of these cases, however, so_error will be correctly set. Along the same lines, write() should allow zero-length data messages on the Internet UDP transport. This will not be the case. If it is attempted, write() will return -1 with errno set to ERANGE. These incompatibilities are considered very minor. Note that calling send(), sendto(), or sendmsg() on a CLTS network will succeed.
The difference in performance between the two styles of communication is generally less important than the difference in semantics. The performance gain that
one might find in using datagrams must be weighed against the increased complexity of the program, which must now concern itself with lost or out-of-order
messages. If lost messages may simply be ignored, the quantity of traffic may be
a consideration. The expense of setting up a connection is best justified by frequent use of the connection. Since the performance of a protocol changes as it is
tuned for different situations, it is best to seek the most up-to-date information
when making choices for a program in which performance is crucial.
A protocol is a set of rules, data formats, and conventions that regulate the
transfer of data between participants in the communication. In general, there is
one protocol for each socket type (stream, datagram, etc.) within each domain.
The code that implements a protocol keeps track of the names that are bound to
sockets, sets up connections, and transfers data between sockets, perhaps sending
the data across a network. It is possible for several protocols, differing only in
low-level details, to implement the same style of communication within a particular
domain. Although it is possible to select which protocol should be used, for
nearly all uses it is sufficient to request the default protocol. This has been done
in all of the example programs.
One specifies the domain, style, and protocol of a socket when it is created. For
example, in Figure 10-6 the call to socket() causes the creation of a datagram
socket with the default protocol in the UNIX domain.
10.6. Datagrams in the UNIX Domain
Let us now look at two programs that create sockets separately. The programs in
Figures 10-5 and 10-6 use datagram communication rather than a stream. The
structure used to name UNIX domain sockets is defined in the file
<sys/un.h>. The definition has also been included in the example for clarity.
Each program creates a socket with a call to socket(). These sockets are in the
UNIX domain. Once a name has been decided upon, it is attached to a socket by
the system call bind(). The program in Figure 10-5 uses the name "socket",
which it binds to its socket. This name will appear in the working directory of
the program. The routines in Figure 10-6 use the socket only for sending messages. They do not create a name for the socket because no other process has to
refer to it.
Figure 10-5
Reading UNIX Domain Datagrams

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>

/*
 * The include file <sys/un.h> defines sockaddr_un as follows:
 * struct sockaddr_un {
 *     short sun_family;
 *     char  sun_path[108];
 * };
 */

#define NAME "socket"

/*
 * This program creates a UNIX domain datagram socket, binds a name to it,
 * then reads from the socket.
 */
main()
{
    int sock, length;
    struct sockaddr_un name;
    char buf[1024];

    /* Create socket from which to read. */
    sock = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("opening datagram socket");
        exit(1);
    }
    /* Create name. */
    name.sun_family = AF_UNIX;
    strcpy(name.sun_path, NAME);
    if (bind(sock, (struct sockaddr *)&name,
        strlen(NAME) + 3) < 0) {
        perror("binding name to datagram socket");
        exit(1);
    }
    printf("socket -->%s\n", NAME);
    /* Read from the socket. */
    if (read(sock, buf, 1024) < 0)
        perror("receiving datagram packet");
    printf("-->%s\n", buf);
    close(sock);
    unlink(NAME);
    exit(0);
}
Note that, in the call to bind() above, the &name parameter is cast to a
(struct sockaddr *). In writing networking code, one invariably has to
cast such address arguments to network-related system calls, since the system-call
routines must be able to handle a variety of address formats, yet each individual
call will use a specialization of the general format. It is poor programming
style to omit these casts, a fact of which lint will be only too glad to remind you.
Figure 10-6
Sending a UNIX Domain Datagram

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>

#define DATA "The sea is calm, the tide is full . . ."

/*
 * Here I send a datagram to a receiver whose name I get from the command
 * line arguments. The form of the command line is: udgramsend pathname
 */
main(argc, argv)
    int argc;
    char *argv[];
{
    int sock;
    struct sockaddr_un name;

    /* Create socket on which to send. */
    sock = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("opening datagram socket");
        exit(1);
    }
    /* Construct name of socket to send to. */
    name.sun_family = AF_UNIX;
    strcpy(name.sun_path, argv[1]);
    /* Send message. */
    if (sendto(sock, DATA, sizeof(DATA), 0,
        (struct sockaddr *)&name,
        sizeof(struct sockaddr_un)) < 0) {
        perror("sending datagram message");
    }
    close(sock);
    exit(0);
}
Names in the UNIX domain are path names. Like file path names they may be
either absolute (e.g. "/dev/imaginary") or relative (e.g. "socket"). Because these
names are used to allow processes to rendezvous, relative path names can pose
difficulties and should be used with care. When a name is bound into the name
space, a file (vnode) is allocated in the file system. If the vnode is not deallocated, the name will continue to exist even after the bound socket is closed. This
can cause subsequent runs of a program to find that a name is unavailable, and
can cause directories to fill up with these objects. The names are removed by
calling unlink() or using the rm(1) command. Names in the UNIX domain
are only used for rendezvous. They are not used for message delivery once a
connection is established. Therefore, in contrast with the Internet domain,
unbound sockets need not be (and are not) automatically given addresses when
they are connected.
There is no established means of communicating names to interested parties. In
the example, the program in Figure 10-6 gets the name of the socket to which it
will send its message through its command line arguments. Once a line of communication has been created, one can send the names of additional, perhaps new,
sockets over the link.
10.7. Datagrams in the Internet Domain
Figure 10-7
Reading Internet Domain Datagrams

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>

/*
 * The include file <netinet/in.h> defines sockaddr_in as follows:
 * struct sockaddr_in {
 *     short sin_family;
 *     u_short sin_port;
 *     struct in_addr sin_addr;
 *     char sin_zero[8];
 * };
 *
 * This program creates a datagram socket, binds a name to it, then reads
 * from the socket.
 */
main()
{
    int sock, length;
    struct sockaddr_in name;
    char buf[1024];

    /* Create socket from which to read. */
    sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("opening datagram socket");
        exit(1);
    }
    /* Create name with wildcards. */
    name.sin_family = AF_INET;
    name.sin_addr.s_addr = INADDR_ANY;
    name.sin_port = 0;
    if (bind(sock, (struct sockaddr *)&name,
        sizeof name) < 0) {
        perror("binding datagram socket");
        exit(1);
    }
    /* Find assigned port value and print it out. */
    length = sizeof(name);
    if (getsockname(sock, (struct sockaddr *)&name,
        &length) < 0) {
        perror("getting socket name");
        exit(1);
    }
    printf("Socket port #%d\n", ntohs(name.sin_port));
    /* Read from the socket. */
    if (read(sock, buf, 1024) < 0)
        perror("receiving datagram packet");
    printf("-->%s\n", buf);
    close(sock);
    exit(0);
}
The examples in Figures 10-7 and 10-8 are very close to the previous examples
except that the socket is in the Internet domain. The structure of Internet domain
addresses is defined in the file <netinet/in.h>. Internet addresses specify a
host address (a 32-bit number) and a delivery slot, or port, on that machine.
These ports are managed by the system routines that implement a particular protocol. Unlike UNIX domain names, Internet socket names are not entered into
the file system and, therefore, they do not have to be unlinked after the socket has
been closed. When a message must be sent between machines it is sent to the
protocol routine on the destination machine, which interprets the address to
determine to which socket the message should be delivered. Several different
protocols may be active on the same machine, but, in general, they will not communicate with one another. As a result, different protocols are allowed to use the
same port numbers. Thus, implicitly, an Internet address is a triple including a
protocol as well as the port and machine address. An association is a temporary
or permanent specification of a pair of communicating sockets. An association is
thus identified by the tuple <protocol, local machine address, local port, remote
machine address, remote port>. An association may be transient when using
datagram sockets; the association actually exists during a send() operation.
Figure 10-8
Sending an Internet Domain Datagram

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>

#define DATA "The sea is calm, the tide is full . . ."

/*
 * Here I send a datagram to a receiver whose name I get from the command
 * line arguments. The form of the command line is:
 *     dgramsend hostname portnumber
 */
main(argc, argv)
    int argc;
    char *argv[];
{
    int sock;
    struct sockaddr_in name;
    struct hostent *hp, *gethostbyname();

    /* Create socket on which to send. */
    sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("opening datagram socket");
        exit(1);
    }
    /*
     * Construct name, with no wildcards, of the socket to send to.
     * gethostbyname returns a structure including the network address
     * of the specified host. The port number is taken from the command
     * line.
     */
    hp = gethostbyname(argv[1]);
    if (hp == 0) {
        fprintf(stderr, "%s: unknown host\n", argv[1]);
        exit(2);
    }
    bcopy((char *)hp->h_addr, (char *)&name.sin_addr,
        hp->h_length);
    name.sin_family = AF_INET;
    name.sin_port = htons(atoi(argv[2]));
    /* Send message. */
    if (sendto(sock, DATA, sizeof DATA, 0,
        (struct sockaddr *)&name, sizeof name) < 0)
        perror("sending datagram message");
    close(sock);
    exit(0);
}
The protocol for a socket is chosen when the socket is created. The local
machine address for a socket can be any valid network address of the machine, if
it has more than one, or it can be the wildcard value INADDR_ANY. The wildcard value is used in the program in Figure 10-7. If a machine has several network addresses, it is likely that messages sent to any of the addresses should be
deliverable to a socket. This will be the case if the wildcard value has been
chosen. Note that even if the wildcard value is chosen, a program sending messages to the named socket must specify a valid network address. One can be willing to receive from "anywhere," but one cannot send a message "anywhere."
The program in Figure 10-8 is given the destination host name as a command
line argument. To determine a network address to which it can send the message, it looks up the host address by the call to gethostbyname(). The
returned structure includes the host's network address, which is copied into the
structure specifying the destination of the message.
The port number can be thought of as the number of a mailbox, into which the
protocol places one's messages. Certain daemons, offering certain advertised
services, have reserved or "well-known" port numbers. These fall in the range
from 1 to 1023. Higher numbers are available to general users. Only servers
need to ask for a particular number. The system will assign an unused port
number when an address is bound to a socket. This may happen when an explicit
bind() call is made with a port number of 0, or when a connect() or
send() is performed on an unbound socket. Note that port numbers are not
automatically reported back to the user. After calling bind(), asking for port 0,
one may call getsockname() to discover what port was actually assigned.
The routine getsockname() will not work for names in the UNIX domain.
NOTE
Sockets under TLI Emulation: getsockname() can only work if the underlying
transport provider provides the necessary support, and under the TLI, this is not
always true. Specifically, if the address given to bind() was INADDR_ANY,
the socket module will not be able to map back from its real network address to
its local name. This is only a minor problem.
The format of the socket address is specified in part by standards within the Internet domain. The specification includes the order of the bytes in the address.
Because machines differ in the internal representation they ordinarily use to
represent integers, printing out the port number as returned by getsockname() may
result in a misinterpretation. To print out the number, it is necessary to use the
routine ntohs() (for network to host: short) to convert the number from the
network representation to the host's representation. On some machines, such as
68000-based machines, this is a null operation. On others, such as VAXes, this
results in a swapping of bytes. Another routine exists to convert a short integer
from the host format to the network format, called htons(); similar routines
exist for long integers. For further information, see byteorder(3).
10.8. Connections
To send data between stream sockets (having communication style
SOCK_STREAM), the sockets must be connected. Figures 10-9 and 10-10 show
two programs that create such a connection. The program in 10-9 is relatively
simple. To initiate a connection, this program simply creates a stream socket,
then calls connect(), specifying the address of the socket to which it wishes its
socket connected. Provided that the target socket exists and is prepared to handle
a connection, the connection will be completed, and the program can begin to send
messages. Messages will be delivered in order without message boundaries, as
with pipes. The connection is destroyed when either socket is closed (or soon
thereafter). If a process persists in sending messages after the connection is
closed, a SIGPIPE signal is sent to the process by the operating system. Unless
explicit action is taken to handle the signal (see the signal(3) or
sigvec(3) man pages), the process will terminate.
Figure 10-9
Initiating an Internet Domain Stream Connection

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>

#define DATA "Half a league, half a league . . ."

/*
 * This program creates a socket and initiates a connection with the socket
 * given in the command line. One message is sent over the connection and
 * then the socket is closed, ending the connection. The form of the command
 * line is: streamwrite hostname portnumber
 */
main(argc, argv)
    int argc;
    char *argv[];
{
    int sock;
    struct sockaddr_in server;
    struct hostent *hp, *gethostbyname();
    char buf[1024];

    /* Create socket. */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("opening stream socket");
        exit(1);
    }
    /* Connect socket using name specified by command line. */
    server.sin_family = AF_INET;
    hp = gethostbyname(argv[1]);
    if (hp == 0) {
        fprintf(stderr, "%s: unknown host\n", argv[1]);
        exit(2);
    }
    bcopy((char *)hp->h_addr, (char *)&server.sin_addr,
        hp->h_length);
    server.sin_port = htons(atoi(argv[2]));
    if (connect(sock,
        (struct sockaddr *)&server, sizeof server) < 0) {
        perror("connecting stream socket");
        exit(1);
    }
    if (write(sock, DATA, sizeof DATA) < 0)
        perror("writing on stream socket");
    close(sock);
    exit(0);
}
Forming a connection is asymmetrical; one process, such as the program in Figure 10-9, requests a connection with a particular socket, while the other process accepts
connection requests. Before a connection can be accepted, a socket must be
created and an address bound to it. This situation is illustrated in the top half of
Figure 10-12. Process 2 has created a socket and bound a port number to it. Process 1 has created an unnamed socket. The address bound to Process 2's socket
is then made known to Process 1 and, perhaps, to several other potential communicants as well. If there are several possible communicants, this one socket might
receive several requests for connections. As a result, a new socket is created for
each connection. This new socket is the endpoint for communication within this
process for this connection. A connection may be destroyed by closing the
corresponding socket.
The program in Figure 10-10 is a rather trivial example of a server. It creates a
socket to which it binds a name, which it then advertises. (In this case it prints
out the socket number.) The program then calls listen() for this socket.
Since several clients may attempt to connect more or less simultaneously, a
queue of pending connections is maintained in the system address space.
listen() marks the socket as willing to accept connections and initializes the
queue. When a connection is requested, it is listed in the queue. If the queue is
full, an error status may be returned to the requester. The maximum length of
this queue is specified by the second argument of listen(); the maximum
length is limited by the system. Once the listen call has been completed, the program enters an infinite loop. On each pass through the loop, a new connection is
accepted and removed from the queue, and, hence, a new socket for the connection is created. The bottom half of Figure 10-12 shows the result of Process 1
connecting with the named socket of Process 2, and Process 2 accepting the connection. After the connection is created, the service, in this case printing out the
messages, is performed and the connection socket closed. The accept() call
will take a pending connection request from the queue if one is available, or
block waiting for a request. Messages are read from the connection socket.
Reads from an active connection will normally block until data is available. The
number of bytes read is returned. When a connection is destroyed, the read call
returns immediately. The number of bytes returned will be zero.
NOTE
Sockets under TLI Emulation: listen() has an unfortunate failure condition
under TLI emulation. The problem is rooted in the difference between TLI and
socket semantics, which creates a timing window within which a second transport
user can be allocated the address previously allocated to the caller of
listen(). If this happens, the socket library will return -1, and errno will be
set to EADDRINUSE, an error not usually possible in sockets. Also note that
both read() and write() should return with errno set to ENOTCONN when
used on an unconnected socket. Under the socket emulation, however, they will
return success (read() will also report zero bytes read). so_error will still
be properly set, so these incompatibilities are very minor.
The program in Figure 10-11 is a slight variation on the server in Figure 10-10.
It avoids blocking when there are no pending connection requests by calling
select() to check for pending requests before calling accept(). This strategy is useful when connections may be received on more than one socket, or
when data may arrive on other connected sockets before another connection
request.
The programs in Figures 10-13 and 10-14 show a program using stream socket
communication in the UNIX domain. Streams in the UNIX domain can be used
for this sort of program in exactly the same way as Internet domain streams,
except for the form of the names and the restriction of the connections to a single
machine. There are some differences, however, in the functionality of streams in
the two domains, notably in the handling of out-of-band data (discussed briefly
below). These differences are beyond the scope of this chapter.
Figure 10-10
Accepting an Internet Domain Stream Connection

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>

#define TRUE 1

/*
 * This program creates a socket and then begins an infinite loop. Each time
 * through the loop it accepts a connection and prints out messages from it.
 * When the connection breaks, or a termination message comes through, the
 * program accepts a new connection.
 */
main()
{
    int sock, length;
    struct sockaddr_in server;
    int msgsock;
    char buf[1024];
    int rval;
    int i;

    /* Create socket. */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("opening stream socket");
        exit(1);
    }
    /* Name socket using wildcards. */
    server.sin_family = AF_INET;
    server.sin_addr.s_addr = INADDR_ANY;
    server.sin_port = 0;
    if (bind(sock, (struct sockaddr *)&server,
        sizeof server) < 0) {
        perror("binding stream socket");
        exit(1);
    }
    /* Find out assigned port number and print it out. */
    length = sizeof server;
    if (getsockname(sock, (struct sockaddr *)&server,
        &length) < 0) {
        perror("getting socket name");
        exit(1);
    }
    printf("Socket port #%d\n", ntohs(server.sin_port));
    /* Start accepting connections. */
    listen(sock, 5);
    do {
        msgsock = accept(sock,
            (struct sockaddr *) 0, (int *) 0);
        if (msgsock == -1)
            perror("accept");
        else do {
            bzero(buf, sizeof buf);
            if ((rval = read(msgsock, buf, 1024)) < 0)
                perror("reading stream message");
            i = 0;
            if (rval == 0)
                printf("Ending connection\n");
            else
                printf("-->%s\n", buf);
        } while (rval != 0);
        close(msgsock);
    } while (TRUE);
    /*
     * Since this program has an infinite loop, the socket "sock" is
     * never explicitly closed. However, all sockets will be closed
     * automatically when a process is killed or terminates normally.
     */
    exit(0);
}
Figure 10-11
Using select() to Check for Pending Connections

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>

#define TRUE 1

/*
 * This program uses select to check that someone is trying to connect
 * before calling accept.
 */
main()
{
    int sock, length;
    struct sockaddr_in server;
    int msgsock;
    char buf[1024];
    int rval;
    fd_set ready;
    struct timeval to;

    /* Create socket. */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("opening stream socket");
        exit(1);
    }
    /* Name socket using wildcards. */
    server.sin_family = AF_INET;
    server.sin_addr.s_addr = INADDR_ANY;
    server.sin_port = 0;
    if (bind(sock, (struct sockaddr *)&server,
        sizeof server) < 0) {
        perror("binding stream socket");
        exit(1);
    }
    /* Find out assigned port number and print it out. */
    length = sizeof server;
    if (getsockname(sock, (struct sockaddr *)&server,
        &length) < 0) {
        perror("getting socket name");
        exit(1);
    }
    printf("Socket port #%d\n", ntohs(server.sin_port));
    /* Start accepting connections. */
    listen(sock, 5);
    do {
        FD_ZERO(&ready);
        FD_SET(sock, &ready);
        to.tv_sec = 5;
        to.tv_usec = 0;
        if (select(sock + 1, &ready, (fd_set *)0,
            (fd_set *)0, &to) < 0) {
            perror("select");
            continue;
        }
        if (FD_ISSET(sock, &ready)) {
            msgsock = accept(sock, (struct sockaddr *)0,
                (int *)0);
            if (msgsock == -1)
                perror("accept");
            else do {
                bzero(buf, sizeof buf);
                if ((rval = read(msgsock, buf, 1024)) < 0)
                    perror("reading stream message");
                else if (rval == 0)
                    printf("Ending connection\n");
                else
                    printf("-->%s\n", buf);
            } while (rval > 0);
            close(msgsock);
        } else
            printf("Do something else\n");
    } while (TRUE);
    exit(0);
}
Figure 10-12
Establishing a Stream Connection

[Diagram omitted. Top half: Process 1 holds an unnamed socket while Process 2
has created a socket and bound a name to it. Bottom half: Process 1 has connected
to Process 2's named socket, and Process 2 has accepted the connection on a
newly created socket.]

Figure 10-13
Initiating a UNIX Domain Stream Connection
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>

#define DATA "Half a league, half a league . . ."

/*
 * This program connects to the socket named in the command line and sends a
 * one line message to that socket. The form of the command line is:
 *     ustreamwrite pathname
 */
main(argc, argv)
    int argc;
    char *argv[];
{
    int sock;
    struct sockaddr_un server;
    char buf[1024];

    /* Create socket. */
    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("opening stream socket");
        exit(1);
    }
    /* Connect socket using name specified by command line. */
    server.sun_family = AF_UNIX;
    strcpy(server.sun_path, argv[1]);
    if (connect(sock, (struct sockaddr *)&server,
        sizeof(struct sockaddr_un)) < 0) {
        close(sock);
        perror("connecting stream socket");
        exit(1);
    }
    if (write(sock, DATA, sizeof(DATA)) < 0)
        perror("writing on stream socket");
    exit(0);
}
Figure 10-14
Accepting a UNIX Domain Stream Connection

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>

#define NAME "socket"

/*
 * This program creates a socket in the UNIX domain and binds a name to it.
 * After printing the socket's name it begins a loop. Each time through the
 * loop it accepts a connection and prints out messages from it. When the
 * connection breaks, or a termination message comes through, the program
 * accepts a new connection.
 */
main()
{
    int sock, msgsock, rval;
    struct sockaddr_un server;
    char buf[1024];

    /* Create socket. */
    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0) {
        perror("opening stream socket");
        exit(1);
    }
    /* Name socket using file system name. */
    server.sun_family = AF_UNIX;
    strcpy(server.sun_path, NAME);
    if (bind(sock, (struct sockaddr *)&server,
        sizeof(struct sockaddr_un)) < 0) {
        perror("binding stream socket");
        exit(1);
    }
    printf("Socket has name %s\n", server.sun_path);
    /* Start accepting connections. */
    listen(sock, 5);
    for (;;) {
        msgsock = accept(sock, (struct sockaddr *)0,
            (int *)0);
        if (msgsock == -1)
            perror("accept");
        else do {
            bzero(buf, sizeof buf);
            if ((rval = read(msgsock, buf, 1024)) < 0)
                perror("reading stream message");
            else if (rval == 0)
                printf("Ending connection\n");
            else
                printf("-->%s\n", buf);
        } while (rval > 0);
        close(msgsock);
    }
    /*
     * The following statements are not executed, because they follow an
     * infinite loop. However, most ordinary programs will not run
     * forever. In the UNIX domain it is necessary to tell the file
     * system that one is through using NAME. In most programs one uses
     * the call unlink as below. Since the user will have to kill this
     * program, it will be necessary to remove the name by a command from
     * the shell.
     */
    close(sock);
    unlink(NAME);
    exit(0);
}
10.9. Reads, Writes, Recvs, etc.
SunOS has several system calls for reading and writing information. The simplest calls are read() and write(). write() takes as arguments the index
of a descriptor, a pointer to a buffer containing the data, and the size of the data.
The descriptor may indicate either a file or a connected socket. "Connected" can
mean either a connected stream socket (as described in the Connections section
above) or a datagram socket for which a connect(3) call has provided a default
destination. read() also takes a descriptor that indicates either a file or a
socket. write() requires a connected socket, since no destination is specified
in the parameters of the system call. read() can be used for either a connected
or an unconnected socket. These calls are, therefore, quite flexible and may be
used to write applications that make no assumptions about the source of their
input or the destination of their output. There are variations on read() and
write() that allow the source and destination of the input and output to use
several separate buffers, while retaining the flexibility to handle both files and
sockets. These are readv() and writev(), for read and write vector.
It is sometimes necessary to send high priority data over a connection that may
have unread low priority data at the other end. For example, a user interface process may be interpreting commands and sending them on to another process
through a stream socket connection. The user interface may have filled the
stream with as yet unprocessed requests when the user types a command to cancel all outstanding requests. Rather than have the high priority data wait to be
processed after the low priority data, it is possible to send it as out-of-band
(OOB) data. The notification of pending OOB data results in the generation of a
SIGURG signal, if this signal has been enabled (see the signal(3) and
sigvec(3) man pages). See An Advanced Socket-Based Interprocess Communications Tutorial for a more complete description of the OOB mechanism.
There are a pair of calls similar to read() and write() that allow options,
including sending and receiving OOB information; these are send() and
recv(). These calls are used only with sockets; specifying a descriptor for a file
will result in the return of an error status. These calls also allow peeking at data
in a stream. That is, they allow a process to read data without removing the data
from the stream. One use of this facility is to read ahead in a stream to determine
the size of the next item to be read. When not using these options, these calls
have the same functions as read() and write().
To send datagrams, one must be allowed to specify the destination. The call
sendto() takes a destination address as an argument and is therefore used for
sending datagrams. The call recvfrom() is often used to read datagrams,
since this call returns the address of the sender, if it is available, along with the
data. If the identity of the sender does not matter, one may use read() or
recv().
NOTE
Sockets under TLI Emulation: A call to recvfrom() or recvmsg() should
return the source address if the user supplies a non-NULL buffer. Under emulation, though, if the user specifies MSG_PEEK and/or MSG_OOB then the source
address will not be returned. This is only a minor problem.
Finally, there are a pair of calls that allow the sending and receiving of messages
from multiple buffers, when the address of the recipient must be specified. These
are sendmsg() and recvmsg(). These calls are actually quite general and
have other uses, including, in the UNIX domain, the transmission of a file
descriptor from one process to another.
The various options for reading and writing, together with their parameters, are
shown in Figure 10-15 below. The parameters for each system call reflect the
differences in function of the different calls. In the examples given in this
chapter, the calls read() and write() have been used whenever possible.
Figure 10-15
Varieties of Read and Write Commands

/*
 * The variable descriptor may be the descriptor of either a file
 * or of a socket.
 */
cc = read(descriptor, buf, nbytes)
int cc, descriptor; char *buf; int nbytes;

/*
 * An iovec can include several source buffers.
 */
cc = readv(descriptor, iov, iovcnt)
int cc, descriptor; struct iovec *iov; int iovcnt;

cc = write(descriptor, buf, nbytes)
int cc, descriptor; char *buf; int nbytes;

cc = writev(descriptor, iovec, ioveclen)
int cc, descriptor; struct iovec *iovec; int ioveclen;

/*
 * The variable "sock" must be the descriptor of a socket.
 * Flags may include MSG_OOB and MSG_PEEK.
 */
cc = send(sock, msg, len, flags)
int cc, sock; char *msg; int len, flags;

cc = sendto(sock, msg, len, flags, to, tolen)
int cc, sock; char *msg; int len, flags;
struct sockaddr *to; int tolen;

cc = sendmsg(sock, msg, flags)
int cc, sock; struct msghdr msg[]; int flags;

cc = recv(sock, buf, len, flags)
int cc, sock; char *buf; int len, flags;

cc = recvfrom(sock, buf, len, flags, from, fromlen)
int cc, sock; char *buf; int len, flags;
struct sockaddr *from; int *fromlen;

cc = recvmsg(sock, msg, flags)
int cc, sock; struct msghdr msg[]; int flags;
Note that the meaning assigned to the msg_accrights and
msg_accrightslen fields of the msghdr structure used in the recvmsg ()
and sendmsg () system calls is protocol-dependent. See the Scatter/Gather
and Exchanging Access Rights section of the System Services Overview for
details about the msghdr structure.
10.10. Choices
This chapter has presented examples of some of the forms of communication
supported by SunOS. These have been presented in an order chosen for ease of
presentation. It is useful to review these options emphasizing the factors that
make each attractive.
Pipes have the advantage of portability, in that they are supported in all UNIX
systems. They also are relatively simple to use. Socketpairs share this simplicity
and have the additional advantage of allowing bidirectional communication. The
major shortcoming of these mechanisms is that they require communicating
processes to be descendants of a common process. They do not allow intermachine communication.
The two communication domains, the UNIX domain and the Internet domain,
allow processes with no common ancestor to communicate. Of the two, only the
Internet domain allows communication between machines. This makes the Internet domain a necessary choice for processes running on separate machines.
The choice between datagrams and socket stream communication is best made
by carefully considering the semantic and performance requirements of the application. Streams can be both advantageous and disadvantageous. One disadvantage is that, since a process is only allowed a limited number of open file descriptors (normally 64) there is a limit on the number of streams that a process can
have open at any given time. This can cause problems if a single server must
talk with a large number of clients. Another is that for delivering a short message the stream setup and teardown time can be unnecessarily long. Weighed
against this is the reliability built into streams. This will often be the deciding factor in favor of streams.
10.11. What to do Next
Many of the examples presented here can serve as models for multiprocess programs and for programs distributed across several machines. In developing a
new multiprocess program, it is often easiest to first write the code to create the
processes and communication paths. After this code is debugged, the code
specific to the application can be added.
Further documentation of the socket-based IPC mechanisms can be found in An
Advanced Socket-Based Interprocess Communications Tutorial. More detailed
information about particular calls and protocols is provided in the SunOS Reference Manual.
11
An Advanced Socket-Based
Interprocess Communications Tutorial
WARNING
Socket-based interprocess communication (IPC), while still supported, is no
longer the preferred framework for transport-level programming. Socket-based IPC has been superseded as the "standard" method of accessing network
protocols by a set of OSI-compatible transport mechanisms based upon
STREAMS and accessed by way of a Transport Library Interface (TLI). For
details on the TLI, see the previous chapter, Transport Level Interface Programming.
If you are building a new network application that requires direct access to
transport facilities, use the TLI mechanisms. If you do not require such direct access,
Remote Procedure Calls (RPC) are the preferred programming framework - see
the Remote Procedure Call Programming Guide section of this manual for
details. New programs should not be based on sockets.
SunOS contains socket-based IPC mechanisms derived from Berkeley UNIX.
This chapter describes the fine points of those mechanisms by supplementing the
more introductory information given in A Socket-Based Interprocess Communications Tutorial. The majority of the chapter considers the use of these primitives in developing network applications. The reader is expected to be familiar
with the C programming language.
Socket-based interprocess communication was first introduced in 4.2BSD and
subsequently incorporated into SunOS. The design of these facilities was the
result of more than two years of discussion and research, and they incorporated
many ideas from then-current research, while maintaining the UNIX philosophy
of simplicity and conciseness.
Prior to the 4.2BSD IPC facilities, the only standard mechanism that allowed two
processes to communicate was pipes (the mpx files that were in Version 7 were
experimental). Unfortunately, pipes are very restrictive in that the two communicating processes must be related through a common ancestor. Further, the
semantics of pipes makes them almost impossible to maintain in a distributed
environment.
Earlier attempts at extending the IPC facilities of the UNIX system have met
with mixed reaction. The majority of the problems have been related to the fact
that these facilities have been tied to the UNIX file system, either through naming or implementation. Consequently, the 4.3BSD IPC facilities were designed
as a totally independent subsystem. They allow processes to rendezvous in many
ways. Processes may rendezvous through a UNIX file system-like name space (a
space where all names are path names) as well as through a network name space.
In fact, new name spaces may be added at a future time with only minor changes
visible to users. Further, the communication facilities have been extended to
include more than the simple byte stream provided by a pipe.
This chapter provides a high-level description of the socket-based IPC facilities
and their use. It is designed to complement the manual pages for the IPC primitives with examples of their use. After this initial description come four more
sections. The Basics section introduces the IPC-related system calls and the
basic model of communication. The Library Routines section describes some of
the supporting library routines that users may find useful in constructing distributed applications. The Client/Server Model section is concerned with the
client/server model used in developing applications and includes examples of the
two major types of servers. The Advanced Topics section delves into advanced
topics that sophisticated users are likely to encounter when using these IPC
facilities.
11.1. Basics
The basic building block for communication is the socket. A socket is an
endpoint of communication to which a name may be bound. Each socket in use
has a type and one or more associated processes. Sockets exist within communications domains. Domains are abstractions which imply both an addressing
structure (address family) and a set of protocols which implement various socket
types within the domain (protocol family). Communications domains are introduced to bundle common properties of processes communicating through sockets. One such property is the scheme used to name sockets. For example, in the
UNIX domain sockets are named with UNIX path names; e.g. a socket may be
named /dev/foo. Sockets normally exchange data only with sockets in the
same domain (it may be possible to cross between communications domains, but
only if some translation process is performed). The 4.3BSD, and thus the
socket-based SunOS IPC facilities support several separate communications
domains: notably the UNIX domain, for on-system communication, and the
Internet domain, which is used by processes that communicate using the DARPA
standard communication protocols. The underlying communication facilities
provided by these domains have a significant influence on the internal system
implementation as well as the interface to socket facilities available to a user. An
example of the latter is that a socket operating in the UNIX domain sees a subset
of the error conditions that are possible when operating in the Internet, DECNET,
X.25, or OSI domains.
Socket Types
Sockets are typed according to the communication properties visible to a user.
Processes are presumed to communicate only between sockets of the same type,
although there is nothing that prevents communication between sockets of different types should the underlying communication protocols support this.
There are several types of sockets currently available:
o
A stream socket provides for the bidirectional, reliable, sequenced, and
unduplicated flow of data without record boundaries. Aside from the
bidirectionality of data flow, a pair of connected stream sockets provides an
interface nearly identical to that of pipes8.
8 In the UNIX domain, in fact, the semantics are identical and, as one might expect, pipes have been
implemented internally as simply a pair of connected stream sockets.
o
A datagram socket supports bidirectional flow of data that is not promised to
be sequenced, reliable, or unduplicated. That is, a process receiving messages on a datagram socket may find messages duplicated, and, possibly, in
an order different from the order in which they were sent. An important
characteristic of a datagram socket is that record boundaries in data are
preserved. Datagram sockets closely model the facilities found in many
contemporary packet switched networks such as the Ethernet.
o
A raw socket provides users access to the underlying communication protocols which support socket abstractions. These sockets are normally
datagram oriented, though their exact characteristics are dependent on the
interface provided by the protocol. Raw sockets are not intended for the
general user; they have been provided mainly for those interested in
developing new communication protocols, or for gaining access to some of
the more esoteric facilities of an existing protocol. The use of raw sockets is
considered in the Advanced Topics section below.
Another potential socket type with interesting properties is the sequenced packet
socket. Such a socket would have properties similar to those of a stream socket,
except that it would preserve record boundaries. There is currently no support
for this type of socket.
Another potential socket type which has interesting properties is the reliably
delivered message socket. The reliably delivered message socket has similar
properties to a datagram socket, but with reliable delivery. There is currently no
support for this type of socket.
Socket Creation
To create a socket, the socket () system call is used:
s = socket(domain, type, protocol);
This call requests that the system create a socket in the specified domain and of
the specified type. A particular protocol may also be requested. If the protocol is
left unspecified (a value of 0), the system will select an appropriate protocol from
those that comprise the domain and that may be used to support the requested
socket type. The user is returned a descriptor (a small integer number) that may
be used in later system calls that operate on sockets. The domain is specified as
one of the manifest constants defined in the file <sys/socket.h>. For the
UNIX domain the constant is AF_UNIX; for the Internet domain, it is
AF_INET 9. The socket types are also defined in this file and one of
SOCK_STREAM, SOCK_DGRAM, or
SOCK_RAW must be specified. To create a stream socket in the Internet domain
the following call might be used:
s = socket(AF_INET, SOCK_STREAM, 0);
This call would result in a stream socket being created with the TCP protocol
9 The manifest constants are named AF_whatever as they indicate the "address format" to use in interpreting
names.
providing the underlying communication support. To create a datagram socket
for on-machine use the call might be:
s = socket(AF_UNIX, SOCK_DGRAM, 0);
The default protocol (used when the protocol argument to the socket () call is
0) should be correct for almost every situation. However, it is possible to specify a
protocol other than the default; this will be covered in the Advanced Topics section below.
There are several reasons a socket call may fail. Aside from the rare occurrence
of lack of memory (ENOBUFS), a socket request may fail due to a request for an
unknown protocol (EPROTONOSUPPORT), or a request for a type of socket for
which there is no supporting protocol (EPROTOTYPE).
Binding Local Names
A socket is created without a name. Until a name is bound to a socket, processes
have no way to reference it and, consequently, no messages may be received on
it. Communicating processes are bound by an association. In the Internet
domain, an association is composed of local and foreign addresses, and local and
foreign ports, while in the UNIX domain, an association is composed of local and
foreign path names (the phrase "foreign pathname" means a pathname created by
a foreign process, not a pathname on a foreign system). In most domains, associations must be unique. In the Internet domain there may never be duplicate
<protocol, local address, local port, foreign address, foreign port>
tuples. UNIX domain sockets need not always be bound to a name, but when
bound there may never be duplicate
<protocol, local pathname, foreign pathname>
tuples. Currently, the pathnames may not refer to files already existing on the
system, though this may change in future releases.
The bind () system call allows a process to specify half of an association,
<local address, local port> (or <local pathname>)
while the connect () and accept () primitives are used to complete a
socket's association.
In the Internet domain, binding names to sockets can be fairly complex. Fortunately, it is usually not necessary to specifically bind an address and port
number to a socket, because the connect () and send () calls will automatically bind an appropriate address if they are used with an unbound socket.
The bind () system call is used as follows:
bind(s, name, namelen);
The bound name is a variable length byte string that is interpreted by the supporting protocol(s). Its interpretation may vary between communication domains
(this is one of the properties that comprise a domain). As mentioned, Internet
domain names contain an Internet address and port number. In the UNIX
domain, names contain a path name and a family, which is always AF _UNIX. If
one wanted to bind the name /tmp/foo to a UNIX domain socket, the
following code would be used: 10
#include <sys/un.h>

struct sockaddr_un addr;

strcpy(addr.sun_path, "/tmp/foo");
addr.sun_family = AF_UNIX;
bind(s, (struct sockaddr *) &addr, strlen(addr.sun_path) +
    sizeof (addr.sun_family));
Note that in determining the size of a UNIX domain address null bytes are not
counted, which is why strlen () is used. In the current implementation of
UNIX domain IPC, the file name referred to in addr. sun_path is created as a
socket in the system file space. The caller must, therefore, have write permission
in the directory where addr. sun_path is to reside, and this file should be
deleted by the caller when it is no longer needed. Future versions may not create
this file.
In binding an Internet address things become more complicated. The actual call
is similar,
#include <sys/types.h>
#include <netinet/in.h>

struct sockaddr_in sin;

bind(s, (struct sockaddr *) &sin, sizeof sin);
but the selection of what to place in the address sin requires some discussion.
We will come back to the problem of formulating Internet addresses in the
Library Routines section when the library routines used in name resolution are
discussed.
Connection Establishment
Connection establishment is usually asymmetric, with one process a client and
the other a server. The server, when willing to offer its advertised services, binds
a socket to a well-known address associated with the service and then passively
listens on its socket. It is then possible for an unrelated process to rendezvous
with the server. The client requests services from the server by initiating a connection to the server's socket. On the client side the connect () call is used to
initiate a connection. Using the UNIX domain, this might appear as,
10 Beware of the tendency to call the "addr" structure "sun", which collides with a symbol predefined by the
C preprocessor.
struct sockaddr_un server;

connect(s, (struct sockaddr *)&server,
    strlen(server.sun_path) + sizeof (server.sun_family));
while in the Internet domain,
struct sockaddr_in server;

connect(s, (struct sockaddr *)&server, sizeof server);
where server in the example above would contain either the UNIX pathname, or
the Internet address and port number of the server to which the client process
wishes to speak. If the client process's socket is unbound at the time of the connect call, the system will automatically select and bind a name to the socket if
necessary. See the Signals and Process Groups section below. This is the usual
way that local addresses are bound to a socket.
An error is returned if the connection was unsuccessful (however, any name
automatically bound by the system remains). Otherwise, the socket is associated
with the server and data transfer may begin. Some of the more common errors
returned when a connection attempt fails are:
ETIMEDOUT
After failing to establish a connection for a period of time, the system
decided there was no point in retrying the connection attempt any more.
This usually occurs because the destination host is down, or because problems in the network resulted in transmissions being lost.
ECONNREFUSED
The host refused service for some reason. This is usually due to a server
process not being present at the requested name.
ENETDOWN or EHOSTDOWN
These operational errors are returned based on status information delivered
to the client host by the underlying communication services.
ENETUNREACH or EHOSTUNREACH
These operational errors can occur either because the network or host is unknown (no route to the network or host is present), or because of status information returned by intermediate gateways or switching nodes. Many times
the status returned is not sufficient to distinguish a network being down from
a host being down, in which case the system indicates the entire network is
unreachable.
For the server to receive a client's connection it must perform two steps after
binding its socket. The first is to indicate a willingness to listen for incoming
connection requests:
listen(s, 5);
The second parameter to the listen () call specifies the maximum number of
outstanding connections that may be queued awaiting acceptance by the server
process; this number may be limited by the system. Should a connection be
requested while the queue is full, the connection will not be refused, but rather
the individual messages that comprise the request will be ignored. This gives a
harried server time to make room in its pending connection queue while the
client retries the connection request. Had the connection been returned with the
ECONNREFUSED error, the client would be unable to tell if the server was up or
not. As it is now it is still possible to get the ETIMEDOUT error back, though
this is unlikely. The backlog figure supplied with the listen call is currently limited by the system to a maximum of 5 pending connections on any one queue.
This avoids the problem of processes hogging system resources by setting an
infinite backlog, then ignoring all connection requests.
With a socket marked as listening, a server may accept () a connection:
struct sockaddr_in from;

fromlen = sizeof from;
newsock = accept(s, (struct sockaddr *) &from, &fromlen);
(For the UNIX domain, from would be declared as a struct sockaddr_un,
but nothing different would need to be done as far as fromlen is concerned. In
the examples that follow, only Internet routines will be discussed.) A new
descriptor is returned on receipt of a connection (along with a new socket). If the
server wishes to find out who its client is, it may supply a buffer for the client
socket's name. The value-result parameter fromlen is initialized by the server to
indicate how much space is associated with from, then modified on return to
reflect the true size of the name. If the client's name is not of interest, the second
parameter may be a null pointer.
accept () normally blocks. That is, accept () will not return until a connection is available or the system call is interrupted by a signal to the process.
Further, there is no way for a process to indicate it will accept connections from
only a specific individual, or individuals. It is up to the user process to consider
who the connection is from and close down the connection if it does not wish to
speak to the process. If the server process wants to accept connections on more
than one socket, or wants to avoid blocking on the accept call, there are alternatives; they will be considered in the Advanced Topics section below.
Data Transfer
With a connection established, data may begin to flow. To send and receive data
there are a number of possible calls. With the peer entity at each end of a connection anchored, a user can send or receive a message without specifying the
peer. As one might expect, in this case the normal read () and write ()
system calls are usable,
write(s, buf, sizeof buf);
read(s, buf, sizeof buf);
In addition to read () and write (), the calls send () and recv () may be
used:
send(s, buf, sizeof buf, flags);
recv(s, buf, sizeof buf, flags);
While send () and recv () are virtually identical to read () and write (),
the extra flags argument is important. The flags, defined in <sys/socket.h>,
may be specified as a non-zero value if one or more of the following is required:
MSG_OOB        send/receive out of band data
MSG_PEEK       look at data without reading
MSG_DONTROUTE  send data without routing packets (internal only)
Out of band data is a notion specific to stream sockets, and one that we will not
immediately consider. The option to have data sent without routing applied to
the outgoing packets is currently used only by the routing table management process, and is unlikely to be of interest to the casual user. However, the ability to
preview data is of interest. When MSG_PEEK is specified with a recv () call,
any data present is returned to the user, but treated as still "unread". That is, the
next read () or recv () call applied to the socket will return the data previously previewed.
Discarding Sockets
Once a socket is no longer of interest, it may be discarded by applying a
close () to the descriptor,
close(s);
If data is associated with a socket that promises reliable delivery (e.g. a stream
socket) when a close takes place, the system will continue to attempt to transfer
the data. However, after a fairly long period of time, if the data is still
undelivered, it will be discarded. Should a user have no use for any pending
data, it may perform a shutdown () on the socket prior to closing it. This call
is of the form:
shutdown(s, how);
where how is 0 if the user is no longer interested in reading data, 1 if no more
data will be sent, or 2 if no data is to be sent or received.
Connectionless Sockets
To this point we have been concerned mostly with sockets that follow a connection oriented model. However, there is also support for connectionless interactions typical of the datagram facilities found in contemporary packet switched
networks. A datagram socket provides a symmetric interface to data exchange.
While processes are still likely to be client and server, there is no requirement for
connection establishment. Instead, each message includes the destination
address.
Datagram sockets are created as before. If a particular local address is needed,
the bind () operation must precede the first data transmission. Otherwise, the
system will set the local address and/or port when data is first sent. To send data,
the sendto () primitive is used,
sendto(s, buf, buflen, flags,
(struct sockaddr *)&to, tolen);
The s, buf, buflen, and flags parameters are used as before. The to and tolen
values are used to indicate the address of the intended recipient of the message.
When using an unreliable datagram interface, it is unlikely that any errors will be
reported to the sender. When information is present locally to recognize a message that can not be delivered (for instance when a network is unreachable), the
call will return -1 and the global value errno will contain an error number.
To receive messages on an unconnected datagram socket, the recvfrom ()
primitive is provided:
recvfrom(s, buf, buflen, flags, (struct sockaddr *)&from,
    &fromlen);
Once again, the fromlen parameter is handled in a value-result fashion, initially
containing the size of the from buffer, and modified on return to indicate the
actual size of the address from which the datagram was received.
In addition to the two calls mentioned above, datagram sockets may also use the
connect () call to associate a socket with a specific destination address. In
this case, any data sent on the socket will automatically be addressed to the connected peer, and only data received from that peer will be delivered to the user.
Only one connected address is permitted for each socket at one time; a second
connect will change the destination address, and a connect to a null address
(domain AF_UNSPEC) will disconnect. Connect requests on datagram sockets
return immediately, as this simply results in the system recording the peer's
address (as compared to a stream socket, where a connect request initiates establishment of an end to end connection). accept () and listen () are not used
with datagram sockets.
While a datagram socket is connected, errors from recent send () calls may be
returned asynchronously. These errors may be reported on subsequent operations
on the socket, or a special socket option used with getsockopt (),
SO_ERROR, may be used to interrogate the error status. A select () for reading or writing will return true when an error indication has been received. The
next operation will return the error, and the error status is cleared. Other, less
important details of datagram sockets are described in the Advanced Topics
section below.
Input/Output Multiplexing
One last facility often used in developing applications is the ability to multiplex
i/o requests among multiple sockets and/or files. This is done using the
select () call:
#include <sys/time.h>
#include <sys/types.h>
fd_set readmask, writemask, exceptmask;
struct timeval timeout;
select(nfds, &readmask, &writemask, &exceptmask, &timeout);
select () takes as arguments pointers to three sets, one for the set of file
descriptors on which the caller wishes to be able to read data, one for those
descriptors to which data is to be written, and one for which exceptional conditions are pending; out-of-band data is the only exceptional condition currently
implemented by the socket abstraction. If the user is not interested in certain
conditions (i.e., read, write, or exceptions), the corresponding argument to the
select () should be a properly cast null pointer.
Each set is actually a structure containing an array of long integer bit masks; the
size of the array is set by the definition FD_SETSIZE. The array is long enough
to hold one bit for each of FD_SETSIZE file descriptors.
The macros FD_SET (fd, &mask) and FD_CLR (fd, &mask) have been provided for adding and removing file descriptor fd in the set mask. The set should
be zeroed before use, and the macro FD_ZERO (&mask) has been provided to
clear the set mask. The parameter nfds in the select () call specifies the range
of file descriptors (i.e. one plus the value of the largest descriptor) to be examined in a set.
A timeout value may be specified if the selection is not to last more than a
predetermined period of time. If the fields in timeout are set to 0, the selection
takes the form of a poll, returning immediately. If the last parameter is a null
pointer, the selection will block indefinitely 11. select () normally returns the
number of file descriptors selected; if the select () call returns due to the
timeout expiring, then the value 0 is returned. If the select () terminates
because of an error or interruption, a -1 is returned with the error number in
errno, and with the file descriptor masks unchanged.
Assuming a successful return, the three sets will indicate which file descriptors
are ready to be read from, written to, or have exceptional conditions pending.
The status of a file descriptor in a select mask may be tested with the
FD_ISSET (fd, &mask) macro, which returns a non-zero value if fd is a
member of the set mask, and 0 if it is not.
11 To be more specific, a return takes place only when a descriptor is selectable, or when a signal is received
by the caller, interrupting the system call.
To determine if there are connections waiting on a socket to be used with an
accept () call, select () can be used, followed by a FD_ISSET (fd,
&mask) macro to check for read readiness on the appropriate socket. If
FD_ISSET returns a non-zero value, indicating permission to read, then a connection is pending on the socket.
As an example, to read data from two sockets, s1 and s2, as it is available from
each and with a one-second timeout, the following code might be used:
#include <sys/time.h>
#include <sys/types.h>

fd_set read_template;
struct timeval wait;

for (;;) {
    wait.tv_sec = 1;            /* one second */
    wait.tv_usec = 0;

    FD_ZERO(&read_template);
    FD_SET(s1, &read_template);
    FD_SET(s2, &read_template);

    nb = select(FD_SETSIZE, &read_template, (fd_set *) 0,
        (fd_set *) 0, &wait);
    if (nb <= 0) {
        /*
         * An error occurred during the select, or
         * the select timed out.
         */
    }

    if (FD_ISSET(s1, &read_template)) {
        /* Socket #1 is ready to be read from. */
    }
    if (FD_ISSET(s2, &read_template)) {
        /* Socket #2 is ready to be read from. */
    }
}
In previous versions of select (), its arguments were pointers to integers
instead of pointers to fd_sets. This type of call will still work as long as the
number of file descriptors being examined is less than the number of bits in an
integer; however, the methods illustrated above should be used in all current programs.
select () provides a synchronous multiplexing scheme. Asynchronous
notification of output completion, input availability, and exceptional conditions
is possible through use of the SIGIO and SIGURG signals described in the
Advanced Topics section below.
11.2. Library Routines
The discussion in the Basics section above indicated the possible need to locate
and construct network addresses when using the interprocess communication
facilities in a distributed environment. To aid in this task a number of routines
have been added to the standard C run-time library. In this section we will consider the new routines provided to manipulate network addresses.
Locating a service on a remote host requires many levels of mapping before
client and server may communicate. A service is assigned a name that is
intended for human consumption; e.g., the login server on host monet. This
name, and the name of the peer host, must then be translated into network
addresses that are not necessarily suitable for human consumption. Finally, the
address must then be used in locating a physical location and route to the service.
The specifics of these three mappings are likely to vary between network architectures. For instance, it is desirable for a network not to require hosts to be
named in such a way that their physical location is known by the client host.
Instead, underlying services in the network may discover the actual location of
the host at the time a client host wishes to communicate. This ability to have
hosts named in a location-independent manner may induce overhead in connection establishment, as a discovery process must take place, but it allows a host to be
physically mobile without requiring it to notify its clientele of its current location.
Standard routines are provided for mapping host names to network addresses,
network names to network numbers, protocol names to protocol numbers, and
service names to port numbers and the appropriate protocol to use in communicating with the server process. The file <netdb.h> must be included when
using any of these routines.
Host Names
An Internet host name to address mapping is represented by the hostent structure:

    struct hostent {
        char    *h_name;        /* official name of host */
        char    **h_aliases;    /* alias list */
        int     h_addrtype;     /* host address type (e.g., AF_INET) */
        int     h_length;       /* length of address */
        char    **h_addr_list;  /* list of addresses, null terminated */
    };

    #define h_addr h_addr_list[0]   /* first address, network byte order */
The routine gethostbyname(3N) takes an Internet host name and returns a
hostent structure, while the routine gethostbyaddr(3N) maps Internet
host addresses into a hostent structure. The routine inet_ntoa(3N) maps an
Internet host address into an ASCII string for printing by log and error messages.
The official name of the host and its public aliases are returned by these routines,
along with the address type (domain) and a null-terminated list of variable-length
addresses. This list of addresses is required because it is possible for a host to
have many addresses, all having the same name. The h_addr definition is provided for backward compatibility, and is defined to be the first address in the list
of addresses in the hostent structure.
The database for these calls is provided either by the Network Information Service lookup (the preferred alternative), from the /etc/hosts file (see
hosts(5)), or by use of the resolver(5) nameserver. Because of the differences in these databases and their access protocols, the information returned may
differ. When using the Network Information Service or the host table version of
gethostbyname(), only one address will be returned, but all listed aliases will
be included. The nameserver version may return alternate addresses, but will not
provide any aliases other than the one given as argument.
Network Names
As for host names, routines for mapping network names to numbers, and back,
are provided. These routines return a netent structure:
    /*
     * Assumption here is that a network number
     * fits in 32 bits -- probably a poor one.
     */
    struct netent {
        char    *n_name;        /* official name of net */
        char    **n_aliases;    /* alias list */
        int     n_addrtype;     /* net address type */
        int     n_net;          /* network number, host byte order */
    };
The routines getnetbyname(3N), getnetbynumber(3N), and
getnetent(3N) are the network counterparts to the host routines described
above. The routines extract their information from the Network Information
Service maps networks.byname and networks.byaddr, or from
/etc/networks.
Protocol Names
For protocols (which are defined in the Network Information Service
protocols.byname map and /etc/protocols) the protoent structure
defines the protocol-name mapping used with the routines
getprotobyname(3N), getprotobynumber(3N), and
getprotoent(3N):
    struct protoent {
        char    *p_name;        /* official protocol name */
        char    **p_aliases;    /* alias list */
        int     p_proto;        /* protocol number */
    };
Service Names
Information regarding services is a bit more complicated. A service is expected
to reside at a specific port and employ a particular communication protocol. This
view is consistent with the Internet domain, but inconsistent with other network
architectures. Further, a service may reside on multiple ports. If this occurs, the
higher-level library routines will have to be bypassed or extended. Services
available are contained in the Network Information Service
services.byname map and the file /etc/services. (Actually, the name
services.byname is a misnomer, since the map actually orders Internet ports
by number and protocol.) 12 A service mapping is described by the servent
structure:
    struct servent {
        char    *s_name;        /* official service name */
        char    **s_aliases;    /* alias list */
        int     s_port;         /* port number, network byte order */
        char    *s_proto;       /* protocol to use */
    };
The routine getservbyname(3N) maps service names to a servent structure
by specifying a service name and, optionally, a qualifying protocol. Thus the call

    sp = getservbyname("telnet", (char *) 0);

returns the service specification for a telnet server using any protocol, while the
call

    sp = getservbyname("telnet", "tcp");

returns only that telnet server which uses the TCP protocol. The routines
getservbyport(3N) and getservent(3N) are also provided. The
getservbyport() routine has an interface similar to that provided by
getservbyname(); an optional protocol name may be specified to qualify
lookups.

12 For details about the association of RPC services with ports, see the Port Mapper Program Protocol
section of the Network Services chapter.
With the support routines described above, an Internet application program
should rarely have to deal directly with addresses. This allows services to be
developed as much as possible in a network-independent fashion. It is clear,
however, that purging all network dependencies is very difficult. So long as the
user is required to supply network addresses when naming services and sockets,
there will always be some network dependency in a program. For example, the normal code included in client programs, such as the remote login program, is of the
form shown in Figure 11-1. (This example will be considered in more detail in
the Client/Server Model section below.)
Miscellaneous
Aside from the address-related data base routines, there are several other routines
available in the run-time library that are of interest to users. These are intended
mostly to simplify manipulation of names and addresses. Table 11-1 summarizes
the routines for manipulating variable length byte strings and handling byte
swapping of network addresses and values.
Table 11-1  C Run-time Routines

    Call                    Synopsis
    bcmp(s1, s2, n)         Compare byte-strings; 0 if same, not 0 otherwise
    bcopy(s1, s2, n)        Copy n bytes from s1 to s2
    bzero(base, n)          Zero-fill n bytes starting at base
    htonl(val)              32-bit quantity from host into network byte order
    htons(val)              16-bit quantity from host into network byte order
    ntohl(val)              32-bit quantity from network into host byte order
    ntohs(val)              16-bit quantity from network into host byte order
The byte swapping routines are provided because the operating system expects
addresses to be supplied in network order. On some architectures, such as the
VAX, host byte ordering is different from network byte ordering. Consequently,
programs are sometimes required to byte swap quantities. The library routines
that return network addresses provide them in network order so that they may
simply be copied into the structures provided to the system. This implies users
should encounter the byte swapping problem only when interpreting network
addresses. For example, if an Internet port is to be printed out, the following code
would be required:

    printf("port number %d\n", ntohs(sp->s_port));

On machines such as the Sun-3 and Sun-4, where these routines are unneeded,
they are defined as null macros. 13
13 Sun-4 (SPARC) machines do have alignment restrictions which network programmers need to be aware
of. See Porting Software to SPARC Systems.
Figure 11-1  Remote Login Client Code
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <netdb.h>

    main(argc, argv)
        int argc;
        char *argv[];
    {
        struct sockaddr_in server;
        struct servent *sp;
        struct hostent *hp;
        int s;

        sp = getservbyname("login", "tcp");
        if (sp == NULL) {
            fprintf(stderr,
                "rlogin: tcp/login: unknown service\n");
            exit(1);
        }
        hp = gethostbyname(argv[1]);
        if (hp == NULL) {
            fprintf(stderr,
                "rlogin: %s: unknown host\n", argv[1]);
            exit(2);
        }
        bzero((char *)&server, sizeof server);
        bcopy(hp->h_addr, (char *)&server.sin_addr,
            hp->h_length);
        server.sin_family = hp->h_addrtype;
        server.sin_port = sp->s_port;
        s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) {
            perror("rlogin: socket");
            exit(3);
        }
        /*
         * Connect does the bind for us
         */
        if (connect(s, (struct sockaddr *)&server,
                sizeof server) < 0) {
            perror("rlogin: connect");
            exit(5);
        }
        exit(0);
    }
11.3. Client/Server Model
The most commonly used paradigm in constructing distributed applications is the
client/server model. In this scheme client applications request services from a
server process. This implies an asymmetry in establishing communication
between the client and server that has been examined in the Basics section above.
In this section we will look more closely at the interactions between client and
server, and consider some of the problems in developing client and server applications.
The client and server require a well known set of conventions before service may
be rendered (and accepted). This set of conventions comprises a protocol that
must be implemented at both ends of a connection. Depending on the situation,
the protocol may be symmetric or asymmetric. In a symmetric protocol, either
side may play the master or slave roles. In an asymmetric protocol, one side is
immutably recognized as the master, with the other as the slave. An example of
a symmetric protocol is the TELNET protocol used in the Internet for remote terminal emulation. An example of an asymmetric protocol is the Internet file
transfer protocol, FTP. No matter whether the specific protocol used in obtaining
a service is symmetric or asymmetric, when accessing a service there is a client
process and a server process. We will first consider the properties of server
processes, then client processes.
Servers

A server process normally listens at a well known address for service requests.
That is, the server process remains dormant until a connection is requested by a
client's connection to the server's address. At such a time the server process
"wakes up" and services the client, performing whatever appropriate actions the
client requests of it.

Alternative schemes that use a service server may be used to eliminate a flock of
server processes clogging the system while remaining dormant most of the time.
For Internet servers, this scheme has been implemented via inetd, the so-called
"internet super-server." inetd listens at a variety of ports, determined at startup by reading a configuration file. When a connection is requested to a port on
which inetd is listening, inetd executes the appropriate server program to
handle the client. With this method, clients are unaware that an intermediary
such as inetd has played any part in the connection. inetd will be described
in more detail in the Advanced Topics section below.

In SunOS most servers are accessed at well known Internet addresses or UNIX
domain names. The form of their main loop is illustrated by the following code
from the remote-login server:

Figure 11-2  Remote Login Server
    main(argc, argv)
        int argc;
        char *argv[];
    {
        int f;
        struct sockaddr_in from;
        struct sockaddr_in sin;
        struct servent *sp;

        sp = getservbyname("login", "tcp");
        if (sp == NULL) {
            fprintf(stderr,
                "rlogind: tcp/login: unknown service\n");
            exit(1);
        }
    #ifndef DEBUG
        /* Disassociate server from controlling terminal. */
        ...
    #endif
        sin.sin_port = sp->s_port;      /* Restricted port */
        sin.sin_addr.s_addr = INADDR_ANY;
        ...
        f = socket(AF_INET, SOCK_STREAM, 0);
        ...
        if (bind(f, (struct sockaddr *)&sin, sizeof sin) < 0) {
            ...
        }
        ...
        listen(f, 5);
        for (;;) {
            int g, len = sizeof from;

            g = accept(f, (struct sockaddr *) &from, &len);
            if (g < 0) {
                if (errno != EINTR)
                    syslog(LOG_ERR, "rlogind: accept: %m");
                continue;
            }
            if (fork() == 0) {
                close(f);
                doit(g, &from);
                exit(0);
            }
            close(g);
        }
    }
The first step taken by the server is to look up its service definition:

    sp = getservbyname("login", "tcp");
    if (sp == NULL) {
        fprintf(stderr,
            "rlogind: tcp/login: unknown service\n");
        exit(1);
    }

The result of the getservbyname() call is used in later portions of the code
to define the Internet port at which it listens for service requests (indicated by a
connection). Some standard port numbers are given in the file
/usr/include/netinet/in.h for backward compatibility purposes.
Step two is to disassociate the server from the controlling terminal of its invoker:

    for (i = getdtablesize()-1; i >= 0; --i)
        close(i);
    open("/dev/null", O_RDONLY);
    dup2(0, 1);
    dup2(0, 2);
    i = open("/dev/tty", O_RDWR);
    if (i >= 0) {
        ioctl(i, TIOCNOTTY, 0);
        close(i);
    }

This step is important as the server will likely not want to receive signals
delivered to the process group of the controlling terminal. Note, however, that
once a server has disassociated itself it can no longer send reports of errors to a
terminal, and must log errors via syslog().
Once a server has established a pristine environment, it creates a socket and
begins accepting service requests. The bind() call is required to insure the
server listens at its expected location. It should be noted that the remote login
server listens at a restricted port number, and must therefore be run with a user-id
of root. This concept of a "restricted port number" is covered in the Advanced
Topics section below.
The main body of the loop is fairly simple:

    for (;;) {
        int g, len = sizeof from;

        g = accept(f, (struct sockaddr *)&from, &len);
        if (g < 0) {
            if (errno != EINTR)
                syslog(LOG_ERR, "rlogind: accept: %m");
            continue;
        }
        if (fork() == 0) {      /* Child */
            close(f);
            doit(g, &from);
        }
        close(g);               /* Parent */
    }
An accept() call blocks the server until a client requests service. This call
could return a failure status if the call is interrupted by a signal such as
SIGCHLD (to be discussed in the Advanced Topics section below). Therefore,
the return value from accept() is checked to insure a connection has actually
been established, and an error report is logged via syslog() if an error has
occurred.

With a connection in hand, the server then forks a child process and invokes the
main body of the remote login protocol processing. Note how the socket used by
the parent for queuing connection requests is closed in the child, while the socket
created as a result of the accept() is closed in the parent. The address of the
client is also handed to the doit() routine because it requires it in authenticating
clients.
Clients
The client side of the remote login service was shown earlier in Figure 11-1.
One can see the separate, asymmetric roles of the client and server clearly in the
code. The server is a passive entity, listening for client connections, while the
client process is an active entity, initiating a connection when invoked.
Let us consider more closely the steps taken by the client remote login process.
As in the server process, the first step is to locate the service definition for a
remote login:
    sp = getservbyname("login", "tcp");
    if (sp == NULL) {
        fprintf(stderr,
            "rlogin: tcp/login: unknown service\n");
        exit(1);
    }
Next the destination host is looked up with a gethostbyname() call:

    hp = gethostbyname(argv[1]);
    if (hp == NULL) {
        fprintf(stderr, "rlogin: %s: unknown host\n", argv[1]);
        exit(2);
    }
With this accomplished, all that is required is to establish a connection to the
server at the requested host and start up the remote login protocol. The address
buffer is cleared, then filled in with the Internet address of the foreign host and
the port number at which the login process resides on the foreign host:

    bzero((char *)&server, sizeof server);
    bcopy(hp->h_addr, (char *)&server.sin_addr, hp->h_length);
    server.sin_family = hp->h_addrtype;
    server.sin_port = sp->s_port;

A socket is created, and a connection initiated. Note that connect() implicitly
performs a bind() call, since s is unbound.
    s = socket(hp->h_addrtype, SOCK_STREAM, 0);
    if (s < 0) {
        perror("rlogin: socket");
        exit(3);
    }
    if (connect(s, (struct sockaddr *)&server,
            sizeof server) < 0) {
        perror("rlogin: connect");
        exit(4);
    }
The details of the remote login protocol will not be considered here.
Connectionless Servers
While connection-based services are the norm, some services are based on the
use of datagram sockets. One, in particular, is the rwho service, which provides
users with status information for hosts connected to a local area network. This
service, while predicated on the ability to broadcast information to all hosts connected to a particular network, is of interest as an example usage of datagram
sockets.

A user on any machine running the rwho server may find out the current status of
a machine with the ruptime program. The output generated is illustrated in
Table 11-2.
Table 11-2  ruptime Output

    arpa        up      9:45,       5 users, load  1.15, 1.39, 1.31
    cad         up      2+12:04,    8 users, load  4.67, 5.13, 4.59
    calder      up      10:10,      0 users, load  0.27, 0.15, 0.14
    dali        up      2+06:28,    9 users, load  1.04, 1.20, 1.65
    degas       up      25+09:48,   0 users, load  1.49, 1.43, 1.41
    ear         up      5+00:05,    0 users, load  1.51, 1.54, 1.56
    ernie       down    0:24
    esvax       down    17:04
    oz          down    16:09
    statvax     up      2+15:57,    3 users, load  1.52, 1.81, 1.86
Status information for each host is periodically broadcast by rwho server
processes on each machine. The same server process also receives the status
information and uses it to update a database. This database is then interpreted to
generate the status information for each host. Servers operate autonomously,
coupled only by the local network and its broadcast capabilities.

Note that the use of broadcast for such a task is fairly inefficient, as all hosts must
process each message, whether or not using an rwho server. Unless such a service is sufficiently universal and is frequently used, the expense of periodic
broadcasts outweighs the simplicity.
Revision A, of 27 March 1990
300
Transport-Level Progranuning
The rwho server, in a simplified form, is pictured below. It performs two
separate tasks. The first is to act as a receiver of status information broadcast by
other hosts on the network. This job is carried out in the main loop of the program. Packets received at the rwho port are interrogated to insure they've been
sent by another rwho server process, then are time stamped with their arrival time
and used to update a file indicating the status of the host. When a host has not
been heard from for an extended period of time, the database interpretation routines assume the host is down and indicate such on the status reports. This algorithm is prone to error, as a server may be down while a host is actually up.
Figure 11-3  rwho Server
    main()
    {
        ...
        sp = getservbyname("who", "udp");
        net = getnetbyname("localnet");
        sin.sin_addr = inet_makeaddr(INADDR_ANY, net);
        sin.sin_port = sp->s_port;
        ...
        s = socket(AF_INET, SOCK_DGRAM, 0);
        ...
        on = 1;
        if (setsockopt(s, SOL_SOCKET, SO_BROADCAST, &on,
            sizeof on) < 0) {
            syslog(LOG_ERR, "setsockopt SO_BROADCAST: %m");
            exit(1);
        }
        bind(s, (struct sockaddr *) &sin, sizeof sin);
        ...
        signal(SIGALRM, onalrm);
        onalrm();
        for (;;) {
            struct whod wd;
            int cc, whod, len = sizeof from;

            cc = recvfrom(s, (char *)&wd, sizeof (struct whod),
                0, (struct sockaddr *)&from, &len);
            if (cc <= 0) {
                if (cc < 0 && errno != EINTR)
                    syslog(LOG_ERR, "rwhod: recv: %m");
                continue;
            }
            if (from.sin_port != sp->s_port) {
                syslog(LOG_ERR, "rwhod: %d: bad from port",
                    ntohs(from.sin_port));
                continue;
            }
            ...
            if (!verify(wd.wd_hostname)) {
                syslog(LOG_ERR, "rwhod: bad host name from %x",
                    ntohl(from.sin_addr.s_addr));
                continue;
            }
            (void) sprintf(path, "%s/whod.%s", RWHODIR,
                wd.wd_hostname);
            whod = open(path, O_WRONLY|O_CREAT|O_TRUNC, 0666);
            ...
            (void) time(&wd.wd_recvtime);
            (void) write(whod, (char *)&wd, cc);
            (void) close(whod);
        }
        exit(0);
    }
The second task performed by the server is to supply information regarding the
status of its host. This involves periodically acquiring system status information,
packaging it up in a message and broadcasting it on the local network for other
rwho servers to hear. The supply function is triggered by a timer and runs off a
signal. Locating the system status information is somewhat involved, but uninteresting. Deciding where to transmit the resultant packet is somewhat problematic, however.
Status information must be broadcast on the local network. For networks that do
not support the notion of broadcast another scheme must be used to simulate or
replace broadcasting. One possibility is to enumerate the known neighbors
(based on the status messages received from other rwho servers). This, unfortunately, requires some bootstrapping information, for a server will have no idea
what machines are its neighbors until it receives status messages from them.
Therefore, if all machines on a net are freshly booted, no machine will have any
known neighbors and thus never receive, or send, any status information. This is
the identical problem faced by the routing table management process in propagating routing status information. The standard solution, unsatisfactory as it
may be, is to inform one or more servers of known neighbors and request that
they always communicate with these neighbors. If each server has at least one
neighbor supplied to it, status information may then propagate through a neighbor to hosts that are not (possibly) directly neighbors. If the server is able to support networks that provide a broadcast capability, as well as those which do not,
then networks with an arbitrary topology may share status information. 14
It is important that software operating in a distributed environment not have any
site-dependent information compiled into it. This would require a separate copy
of the server at each host and make maintenance a severe headache. SunOS
attempts to isolate host-specific information from applications by providing system calls that return the necessary information. 15 A mechanism exists, in the
form of an ioctl() call, for finding the collection of networks to which a host
is directly connected. Further, a local network broadcasting mechanism has been
implemented at the socket level. Combining these two features allows a process
to broadcast on any directly connected local network which supports the notion
of broadcasting in a site-independent manner. This allows a solution to the problem of deciding how to propagate status information in the case of rwho, or
more generally in broadcasting. Such status information is broadcast to connected networks at the socket level, where the connected networks have been
obtained via the appropriate ioctl() calls. The specifics of such broadcasting
are complex, however, and will be covered in the Advanced Topics section
below.

14 One must, however, be concerned about loops. That is, if a host is connected to multiple networks, it will
receive status information from itself. This can lead to an endless, wasteful exchange of information.

15 An example of such a system call is the gethostname(2) call that returns the host's official name.
11.4. Advanced Topics
A number of facilities have yet to be discussed. For most programmers, the
mechanisms already described will suffice in constructing distributed applications. However, others will find the need to utilize some of the features that we
consider in this section.
Out Of Band Data
The stream socket abstraction includes the notion of out of band data. Out of
band data is a logically independent transmission channel associated with each
pair of connected stream sockets. Out of band data is delivered to the user
independently of normal data. The abstraction defines that the out of band data
facilities must support the reliable delivery of at least one out of band message at
a time. This message may contain at least one byte of data, and at least one message may be pending delivery to the user at any one time. For communications
protocols (such as TCP) that support only in-band signaling (i.e., the urgent data
is delivered in sequence with the normal data), the system normally extracts the
data from the normal data stream and stores it separately. This allows users to
choose between receiving the urgent data in order and receiving it out of
sequence without having to buffer all the intervening data. It is possible to
"peek" (via MSG_PEEK) at out of band data. If the socket has a process group, a
SIGURG signal is generated when the protocol is notified of its existence. A process can set the process group or process id to be informed by the SIGURG signal
via the appropriate fcntl() call, as described below for SIGIO. If multiple
sockets may have out of band data awaiting delivery, a select() call for
exceptional conditions may be used to determine those sockets with such data
pending. Neither the signal nor the select indicate the actual arrival of the out-of-band data, but only notification that it is pending.
In addition to the information passed, a logical mark is placed in the data stream
to indicate the point at which the out of band data was sent. The remote login
and remote shell applications use this facility to propagate signals between client
and server processes. When a signal flushes any pending output from the remote
process(es), all data up to the mark in the data stream is discarded.
To send an out of band message the MSG_OOB flag is supplied to a send() or
sendto() call, while to receive out of band data MSG_OOB should be indicated when performing a recvfrom() or recv() call. To find out if the read
pointer is currently pointing at the mark in the data stream, the SIOCATMARK
ioctl is provided:

    ioctl(s, SIOCATMARK, &yes);

If yes is 1 on return, the next read will return data after the mark. Otherwise
(assuming out of band data has arrived), the next read will provide data sent by
the client prior to transmission of the out of band signal. The routine used in the
remote login process to flush output on receipt of an interrupt or quit signal is
shown in the following example. This code reads the normal data up to the mark
(to discard it), then reads the out-of-band byte.
Figure 11-4  Flushing Terminal I/O on Receipt of Out Of Band Data
    #include <sys/ioctl.h>
    #include <sys/file.h>

    oob()
    {
        int out = FWRITE;
        char waste[BUFSIZ];
        int mark;

        /* flush local terminal output */
        ioctl(1, TIOCFLUSH, (char *)&out);
        for (;;) {
            if (ioctl(rem, SIOCATMARK, &mark) < 0) {
                perror("ioctl");
                break;
            }
            if (mark)
                break;
            (void) read(rem, waste, sizeof waste);
        }
        if (recv(rem, &mark, 1, MSG_OOB) < 0)
            perror("recv");
    }
A process may also read or peek at the out-of-band data without first reading up
to the mark. This is more difficult when the underlying protocol delivers the
urgent data in-band with the normal data, and only sends notification of its presence ahead of time (e.g., the TCP protocol used to implement socket streams in
the Internet domain). With such protocols, the out-of-band byte may not yet
have arrived when a recv() is done with the MSG_OOB flag. In that case, the
call will return an error of EWOULDBLOCK. Worse, there may be enough in-band data in the input buffer that normal flow control prevents the peer from
sending the urgent data until the buffer is cleared. The process must then read
enough of the queued data that the urgent data may be delivered.
Certain programs that use multiple bytes of urgent data and must handle multiple
urgent signals (e.g., telnet(1C)) need to retain the position of urgent data
within the socket stream. This treatment is available as a socket-level option,
SO_OOBINLINE; see setsockopt(2) for usage. With this option, the
position of urgent data (the "mark") is retained, but the urgent data immediately
follows the mark within the normal data stream returned without the MSG_OOB
flag. Reception of multiple urgent indications causes the mark to move, but no
out-of-band data are lost.
Non-Blocking Sockets
It is occasionally convenient to make use of sockets that do not block; that is, I/O
requests that cannot complete immediately and would therefore cause the process
to be suspended awaiting completion are not executed, and an error code is
returned. Once a socket has been created via the socket() call, it may be
marked as non-blocking by fcntl() as follows:

    #include <fcntl.h>
    ...
    int s;
    ...
    s = socket(AF_INET, SOCK_STREAM, 0);
    ...
    if (fcntl(s, F_SETFL, FNDELAY) < 0) {
        perror("fcntl F_SETFL, FNDELAY");
        exit(1);
    }
When performing non-blocking I/O on sockets, one must be careful to check for
the error EWOULDBLOCK (stored in the global variable errno), which occurs
when an operation would normally block, but the socket it was performed on is
marked as non-blocking. In particular, accept(), connect(), send(),
recv(), read(), and write() can all return EWOULDBLOCK, and processes
should be prepared to deal with such return codes. If an operation such as a
send() cannot be done in its entirety, but partial writes are sensible (for example, when using a stream socket), the data that can be sent immediately will be
processed, and the return value will indicate the amount actually sent.
Interrupt Driven Socket I/O
The SIGIO signal allows a process to be notified via a signal when a socket (or
more generally, a file descriptor) has data waiting to be read. Use of the SIGIO
facility requires three steps: First, the process must set up a SIGIO signal
handler by use of the signal() or sigvec() calls. Second, it must set the
process id or process group id that is to receive notification of pending input to
its own process id, or the process group id of its process group (note that the
default process group of a socket is group zero). This can be accomplished by
use of an fcntl() call. Third, it must enable asynchronous notification of
pending I/O requests with another fcntl() call. Sample code to allow a given
process to receive information on pending I/O requests as they occur for a socket
s is given in Figure 11-5. With the addition of a handler for SIGURG, this code
can also be used to prepare for receipt of SIGURG signals.
Figure 11-5
Use of Asynchronous Notification of I/O Requests
#include <fcntl.h>

int io_handler();

signal(SIGIO, io_handler);

/* Set the process receiving SIGIO/SIGURG signals to us. */
if (fcntl(s, F_SETOWN, getpid()) < 0) {
    perror("fcntl F_SETOWN");
    exit(1);
}

/* Allow receipt of asynchronous I/O signals. */
if (fcntl(s, F_SETFL, FASYNC) < 0) {
    perror("fcntl F_SETFL, FASYNC");
    exit(1);
}
Signals and Process Groups
Due to the existence of the SIGURG and SIGIO signals, each socket has an
associated process number, just as is done for terminals. This value is
initialized to zero, but may be redefined at a later time with the F_SETOWN
fcntl(), such as was done in the code above for SIGIO. To set the socket's
process id for signals, positive arguments should be given to the fcntl()
call. To set the socket's process group for signals, negative arguments should
be passed to fcntl(). Note that the process number indicates either the
associated process id or the associated process group; it is impossible to
specify both at the same time. A similar fcntl(), F_GETOWN, is available for
determining the current process number of a socket.
Note that the receipt of SIGURG and SIGIO can also be enabled by using the
ioctl() call to assign the socket to the user's process group:
/* oobdata is the out-of-band data handling routine */
signal(SIGURG, oobdata);

int pid = -getpid();
if (ioctl(client, SIOCSPGRP, (char *) &pid) < 0) {
    perror("ioctl: SIOCSPGRP");
}
Another signal that is useful when constructing server processes is SIGCHLD.
This signal is delivered to a process when any child processes have changed
state. Normally servers use the signal to "reap" child processes that have
exited, without explicitly awaiting their termination or periodically polling
for exit status. For example, the remote login server loop shown in Figure 11-2
may be augmented as shown here:
Figure 11-6
Use of the SIGCHLD Signal
int reaper();

signal(SIGCHLD, reaper);
listen(f, 5);
for (;;) {
    int g, len = sizeof from;

    g = accept(f, (struct sockaddr *)&from, &len);
    if (g < 0) {
        if (errno != EINTR)
            syslog(LOG_ERR, "rlogind: accept: %m");
        continue;
    }
    ...
}

#include <wait.h>

reaper()
{
    union wait status;

    while (wait3(&status, WNOHANG, 0) > 0)
        continue;
}
If the parent server process fails to reap its children, a large number of
zombie processes may be created.
Pseudo Terminals
Many programs will not function properly without a terminal for standard input
and output. Since sockets do not provide the semantics of terminals, it is
often necessary to have a process communicating over the network do so through
a pseudo-terminal. A pseudo-terminal is actually a pair of devices, master and
slave, which allow a process to serve as an active agent in communication
between processes and users. Data written on the slave side of a
pseudo-terminal are supplied as input to a process reading from the master
side, while data written on the master side are processed as terminal input for
the slave. In this way, the process manipulating the master side of the
pseudo-terminal has control over the information read and written on the slave
side, as if it were manipulating the keyboard and reading the screen of a real
terminal. The purpose of this abstraction is to preserve terminal semantics
over a network connection; that is, the slave side appears as a normal terminal
to any process reading from or writing to it.
For example, the remote login server uses pseudo-terminals for remote login
sessions. A user logging in to a machine across the network is provided a shell
with a slave pseudo-terminal as standard input, output, and error. The server
process then handles the communication between the programs invoked by the
remote shell and the user's local client process. When a user sends a character
that generates an interrupt on the remote machine that flushes terminal output,
the pseudo-terminal generates a control message for the server process. The
server then sends an out-of-band message to the client process to signal a
flush of data at the real terminal and on the intervening data buffered in the
network.
The name of the slave side of a pseudo-terminal is of the form /dev/ttyxy,
where x is a single letter starting at 'p' and continuing to 't', and y is a
hexadecimal digit (i.e., a single character in the range 0 through 9 or 'a'
through 'f'). The master side of a pseudo-terminal is /dev/ptyxy, where x and y
correspond to the slave side of the pseudo-terminal.
In general, the method of obtaining a pair of master and slave
pseudo-terminals is to find a pseudo-terminal that is not currently in use. The
master half of a pseudo-terminal is a single-open device; thus, each master may
be opened in turn until an open succeeds. The slave side of the pseudo-terminal
is then opened, and is set to the proper terminal modes if necessary. The
process then fork()s; the child closes the master side of the pseudo-terminal,
and exec()s the appropriate program. Meanwhile, the parent closes the slave
side of the pseudo-terminal and begins reading and writing from the master
side. Sample code making use of pseudo-terminals is given in the following
example. This code assumes that a connection on a socket s exists, connected to
a peer who wants a service of some kind, and that the process has disassociated
itself from any previous controlling terminal.
Figure 11-7
Creation and Use of a Pseudo Terminal
gotpty = 0;
for (c = 'p'; !gotpty && c <= 's'; c++) {
    line = "/dev/ptyXX";
    line[sizeof "/dev/pty" - 1] = c;
    line[sizeof "/dev/ptyp" - 1] = '0';
    if (stat(line, &statbuf) < 0)
        break;
    for (i = 0; i < 16; i++) {
        line[sizeof "/dev/ptyp" - 1] = "0123456789abcdef"[i];
        master = open(line, O_RDWR);
        if (master >= 0) {
            gotpty = 1;
            break;
        }
    }
}
if (!gotpty) {
    syslog(LOG_ERR, "All network ports in use");
    exit(1);
}
line[sizeof "/dev/" - 1] = 't';
/* slave is now slave side */
slave = open(line, O_RDWR);
if (slave < 0) {
    syslog(LOG_ERR, "Cannot open slave pty %s", line);
    exit(1);
}

/* Set slave tty modes */
ioctl(slave, TIOCGETP, &b);
b.sg_flags = CRMOD|XTABS|ANYP;
ioctl(slave, TIOCSETP, &b);

i = fork();
if (i < 0) {
    syslog(LOG_ERR, "fork: %m");
    exit(1);
} else if (i) {
    /* Parent */
    close(slave);
    ...
} else {
    /* Child */
    close(s);
    close(master);
    dup2(slave, 0);
    dup2(slave, 1);
    dup2(slave, 2);
    if (slave > 2)
        close(slave);
    ...
}
Selecting Specific Protocols
If the third argument to the socket() call is 0, socket() will select a
default protocol to use with the returned socket of the type requested. The
default protocol is usually correct, and alternate choices are not usually
available. However, when using "raw" sockets to communicate directly with
lower-level protocols or hardware interfaces, the protocol argument may be
important for setting up demultiplexing. For example, raw sockets in the
Internet domain may be used to implement a new protocol above IP, and the
socket will receive packets only for the protocol specified. To obtain a
particular protocol, one determines the protocol number as defined within the
protocol domain. For the Internet domain, one may use one of the library
routines discussed in the Library Routines section above, such as
getprotobyname():
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

pp = getprotobyname("newtcp");
s = socket(AF_INET, SOCK_STREAM, pp->p_proto);
This would result in a socket s using a stream based connection, but with
protocol type of "newtcp" instead of the default "tcp."
Address Binding
As was mentioned in the Basics section, binding addresses to sockets in the
Internet domain can be fairly complex. As a brief reminder, these associations
are composed of local and foreign addresses, and local and foreign ports. Port
numbers are allocated out of separate spaces, one for each system and one for
each domain on that system. Through the bind() system call, a process may
specify half of an association, the <local address, local port> part, while the
connect() and accept() primitives are used to complete a socket's association
by specifying the <foreign address, foreign port> part. Since the association
is created in two steps, the association uniqueness requirement indicated
previously could be violated unless care is taken. Further, it is unrealistic
to expect user programs to always know proper values to use for the local
address and local port, since a host may reside on multiple networks and the
set of allocated port numbers is not directly accessible to a user.
To simplify local address binding in the Internet domain, the notion of a
wildcard address has been provided. When an address is specified as INADDR_ANY
(a manifest constant defined in <netinet/in.h>), the system interprets the
address as any valid address. For example, to bind a specific port number to a
socket, but leave the local address unspecified, the following code might be
used:
#include <sys/types.h>
#include <netinet/in.h>

struct sockaddr_in sin;

s = socket(AF_INET, SOCK_STREAM, 0);
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = htonl(INADDR_ANY);
sin.sin_port = htons(MYPORT);
bind(s, (struct sockaddr *) &sin, sizeof sin);
Sockets with wildcarded local addresses may receive messages directed to the
specified port number, and sent to any of the possible addresses assigned to a
host. For example, if a host has addresses 128.32.0.4 and 10.0.0.78, and a
socket is bound as above, the process will be able to accept connection
requests that are addressed to 128.32.0.4 or 10.0.0.78. If a server process
wished to allow only hosts on a given network to connect to it, it would bind
the address of the host on the appropriate network.
In a similar fashion, a local port may be left unspecified (specified as zero), in
which case the system will select an appropriate port number for it. For example,
to bind a specific local address to a socket, but to leave the local port number
unspecified:
hp = gethostbyname(hostname);
if (hp == NULL) {
    /* handle unknown host */
}
bcopy(hp->h_addr, (char *) &sin.sin_addr, hp->h_length);
sin.sin_port = htons(0);
bind(s, (struct sockaddr *) &sin, sizeof sin);
The system selects the local port number based on two criteria. The first is
that Internet ports below IPPORT_RESERVED (1024) are reserved for privileged
users (i.e., the super-user); Internet ports above IPPORT_USERRESERVED (50000)
are reserved for non-privileged servers. The second is that the port number is
not currently bound to some other socket. In order to find a free Internet
port number in the privileged range, the rresvport() library routine may be
used as follows to return a stream socket with a privileged port number:
int lport = IPPORT_RESERVED - 1;
int s;

s = rresvport(&lport);
if (s < 0) {
    if (errno == EAGAIN)
        fprintf(stderr, "socket: all ports in use\n");
    else
        perror("rresvport: socket");
}
The restriction on allocating ports was done to allow processes executing in a
"secure" environment to perform authentication based on the originating
address and port number. For example, the rlogin(1) command allows users to
log in across a network without being asked for a password, if two conditions
hold: First, the name of the system the user is logging in from is in the file
/etc/hosts.equiv on the system s/he is logging in to (or the system name and
the user name are in the user's .rhosts file in the user's home directory),
and second, the user's rlogin process is coming from a privileged port on the
machine from which s/he is logging in. The port number and network address of
the machine from which the user is logging in can be determined either from
the from result of the accept() call, or from the getpeername() call.
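The privileged-port check described above might be sketched as follows; from_privileged_port() is a hypothetical helper (a real rlogind performs additional checks), and the length argument is declared with the modern socklen_t type, where the 1990 interface used int:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: given a connected socket, check whether the
 * peer's port lies in the privileged range (below IPPORT_RESERVED),
 * implying the remote client was started by the super-user.
 * Returns 1 if so, 0 if not, -1 on error.
 */
int
from_privileged_port(int s)
{
    struct sockaddr_in from;
    socklen_t fromlen = sizeof from;    /* declared int in older systems */

    if (getpeername(s, (struct sockaddr *)&from, &fromlen) < 0)
        return (-1);
    if (from.sin_family != AF_INET)
        return (0);                     /* not an Internet peer */
    return (ntohs(from.sin_port) < IPPORT_RESERVED);
}
```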
In certain cases the algorithm used by the system in selecting port numbers is
unsuitable for an application. This is because associations are created in a
two-step process. For example, the Internet file transfer protocol, FTP,
specifies that data connections must always originate from the same local
port. However, duplicate associations are avoided by connecting to different
foreign ports. In this situation the system would disallow binding the same
local address and port number to a socket if a previous data connection's
socket still existed. To override the default port selection algorithm, an
option call must be performed prior to address binding:
int on = 1;

setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
bind(s, (struct sockaddr *) &sin, sizeof sin);
With the above call, local addresses may be bound that are already in use. This
does not violate the uniqueness requirement as the system still checks at connect
time to be sure any other sockets with the same local address and port do not
have the same foreign address and port. If the association already exists, the
error EADDRINUSE is returned.
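The effect of omitting the option can be demonstrated with a small sketch; second_bind_errno() is a hypothetical routine that binds one socket to a system-chosen port, then attempts a second bind() of the same <local address, local port> without SO_REUSEADDR and returns the resulting errno:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical demonstration: bind a socket to a free port chosen by
 * the system, then try to bind a second socket to the same address
 * and port.  Returns the errno from the failed second bind()
 * (expected: EADDRINUSE), 0 if it unexpectedly succeeds, -1 on error.
 */
int
second_bind_errno(void)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof sin;
    int s1, s2, result;

    s1 = socket(AF_INET, SOCK_STREAM, 0);
    s2 = socket(AF_INET, SOCK_STREAM, 0);
    if (s1 < 0 || s2 < 0)
        return (-1);
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(0);    /* let the system pick a free port */
    if (bind(s1, (struct sockaddr *)&sin, sizeof sin) < 0 ||
        getsockname(s1, (struct sockaddr *)&sin, &len) < 0)
        result = -1;
    else if (bind(s2, (struct sockaddr *)&sin, sizeof sin) == 0)
        result = 0;             /* unexpectedly succeeded */
    else
        result = errno;         /* expected: EADDRINUSE */
    close(s1);
    close(s2);
    return (result);
}
```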
Broadcasting and Determining Network Configuration
By using a datagram socket, it is possible to send broadcast packets on many
networks connected to the system. The network itself must support broadcast;
the system provides no simulation of broadcast in software. Broadcast messages
can place a high load on a network, since they force every host on the network
to service them. Consequently, the ability to send broadcast packets has been
limited to sockets that are explicitly marked as allowing broadcasting.
Broadcast is typically used for one of two reasons: it is desired to find a
resource on a local network without prior knowledge of its address, or
important functions such as routing require that information be sent to all
accessible neighbors.
To send a broadcast message, a datagram socket should be created:
s = socket(AF_INET, SOCK_DGRAM, 0);
The socket is marked as allowing broadcasting,
int on = 1;
setsockopt(s, SOL_SOCKET, SO_BROADCAST, &on, sizeof on);
and at least a port number should be bound to the socket:
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = htonl(INADDR_ANY);
sin.sin_port = htons(MYPORT);
bind(s, (struct sockaddr *) &sin, sizeof sin);
The destination address of the message to be broadcast depends on the
network(s) on which the message is to be broadcast. The Internet domain
supports a shorthand notation for broadcast on the local network, the address
INADDR_BROADCAST (defined in <netinet/in.h>). To determine the list of
addresses for all reachable neighbors requires knowledge of the networks to
which the host is connected. Since this information should be obtained in a
host-independent fashion and may be impossible to derive, SunOS provides a
method of retrieving this information from the system data structures. The
SIOCGIFCONF ioctl call returns the interface configuration of a host in the
form of a single ifconf structure; this structure contains a "data area" that
is made up of an array of ifreq structures, one for each address domain
supported by each network interface to which the host is connected. These
structures are defined in <net/if.h> as follows:
struct ifconf {
    int ifc_len;                        /* size of associated buffer */
    union {
        caddr_t ifcu_buf;
        struct  ifreq *ifcu_req;
    } ifc_ifcu;
};
#define ifc_buf ifc_ifcu.ifcu_buf       /* buffer address */
#define ifc_req ifc_ifcu.ifcu_req       /* array of structures returned */

struct ifreq {
#define IFNAMSIZ    16
    char ifr_name[IFNAMSIZ];            /* if name, e.g. "en0" */
    union {
        struct  sockaddr ifru_addr;
        struct  sockaddr ifru_dstaddr;
        char    ifru_oname[IFNAMSIZ];   /* other if name */
        short   ifru_flags;
        char    ifru_data[1];           /* interface dependent data */
    } ifr_ifru;
};
#define ifr_addr    ifr_ifru.ifru_addr      /* address */
#define ifr_dstaddr ifr_ifru.ifru_dstaddr   /* other end of link */
#define ifr_oname   ifr_ifru.ifru_oname     /* other if name */
#define ifr_flags   ifr_ifru.ifru_flags     /* flags */
#define ifr_data    ifr_ifru.ifru_data      /* for use by interface */
The actual call that obtains the interface configuration is
struct ifconf ifc;
char buf[BUFSIZ];

ifc.ifc_len = sizeof buf;
ifc.ifc_buf = buf;
if (ioctl(s, SIOCGIFCONF, (char *) &ifc) < 0) {
    ...
}
After this call, buf will contain a list of ifreq structures, one for each
network to which the host is connected. These structures will be ordered first
by interface name and then by supported address families. ifc.ifc_len will
have been modified to reflect the number of bytes used by the ifreq
structures.

For each structure there exists a set of "interface flags" that tell whether
the network corresponding to that interface is up or down, point-to-point or
broadcast, etc. The SIOCGIFFLAGS ioctl retrieves these flags for an interface
specified by an ifreq structure as follows:
struct ifreq *ifr;

ifr = ifc.ifc_req;
for (n = ifc.ifc_len / sizeof (struct ifreq); --n >= 0; ifr++) {
    /*
     * We must be careful that we don't use an interface
     * devoted to an address domain other than those intended;
     * if we were interested in NS interfaces, the
     * AF_INET would be AF_NS.
     */
    if (ifr->ifr_addr.sa_family != AF_INET)
        continue;
    if (ioctl(s, SIOCGIFFLAGS, (char *) ifr) < 0) {
        ...
    }
    /*
     * Skip boring cases
     */
    if ((ifr->ifr_flags & IFF_UP) == 0 ||
        (ifr->ifr_flags & IFF_LOOPBACK) ||
        (ifr->ifr_flags &
            (IFF_BROADCAST | IFF_POINTTOPOINT)) == 0)
        continue;
Once the flags have been obtained, the broadcast address must be obtained. In
the case of broadcast networks this is done via the SIOCGIFBRDADDR ioctl,
while for point-to-point networks the address of the destination host is
obtained with SIOCGIFDSTADDR.
struct sockaddr dst;

if (ifr->ifr_flags & IFF_POINTTOPOINT) {
    if (ioctl(s, SIOCGIFDSTADDR, (char *) ifr) < 0) {
        ...
    }
    bcopy((char *) &ifr->ifr_dstaddr, (char *) &dst,
        sizeof ifr->ifr_dstaddr);
} else if (ifr->ifr_flags & IFF_BROADCAST) {
    if (ioctl(s, SIOCGIFBRDADDR, (char *) ifr) < 0) {
        ...
    }
    bcopy((char *) &ifr->ifr_broadaddr, (char *) &dst,
        sizeof ifr->ifr_broadaddr);
}
After the appropriate ioctl()s have obtained the broadcast or destination
address (now in dst), the sendto() call may be used:
sendto(s, buf, buflen, 0,
    (struct sockaddr *)&dst, sizeof dst);
In the above loop one sendto () occurs for every interface to which the host is
connected that supports the notion of broadcast or point-to-point addressing. If a
process only wished to send broadcast messages on a given network, code similar
to that outlined above would be used, but the loop would need to find the correct
destination address.
Received broadcast messages contain the sender's address and port, as datagram
sockets are bound before a message is allowed to go out.
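Collected together, the steps above might be sketched as a single routine. broadcast_local() is hypothetical, the MYPORT constant of the text is replaced by a port parameter, and whether the send succeeds depends on the network configuration of the host:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical sketch: create a datagram socket, mark it for
 * broadcasting, bind a port, and send buf to the local network's
 * broadcast address.  Returns the sendto() result, or -1 on any
 * earlier failure.
 */
int
broadcast_local(unsigned short port, char *buf, int buflen)
{
    struct sockaddr_in sin, dst;
    int s, on = 1, ret = -1;

    if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
        return (-1);
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);
    if (setsockopt(s, SOL_SOCKET, SO_BROADCAST,
            (char *)&on, sizeof on) == 0 &&
        bind(s, (struct sockaddr *)&sin, sizeof sin) == 0) {
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_addr.s_addr = htonl(INADDR_BROADCAST);
        dst.sin_port = htons(port);
        ret = sendto(s, buf, buflen, 0,
            (struct sockaddr *)&dst, sizeof dst);
    }
    close(s);
    return (ret);
}
```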
Socket Options
It is possible to set and get a number of options on sockets via the
setsockopt() and getsockopt() system calls. These options include such things
as marking a socket for broadcasting, not to route, to linger on close, etc.
The general forms of the calls are:
setsockopt(s, level, optname, optval, optlen);
and
getsockopt(s, level, optname, optval, optlen);
The parameters to the calls are as follows: s is the socket on which the
option is to be applied. level specifies the protocol layer on which the
option is to be applied; in most cases this is the "socket level", indicated
by the symbolic constant SOL_SOCKET, defined in <sys/socket.h>. The actual
option is specified in optname, and is a symbolic constant also defined in
<sys/socket.h>. optval and optlen point to the value of the option (in most
cases, whether the option is to be turned on or off), and the length of the
value of the option, respectively. For getsockopt(), optlen is a value-result
parameter, initially set to the size of the storage area pointed to by optval,
and modified upon return to indicate the actual amount of storage used.
An example should help clarify things. It is sometimes useful to determine the
type (e.g., stream, datagram, etc.) of an existing socket; programs invoked by
inetd (described below) may need to perform this task. This can be
accomplished as follows via the SO_TYPE socket option and the getsockopt()
call:
#include <sys/types.h>
#include <sys/socket.h>

int type, size;

size = sizeof (int);
if (getsockopt(s, SOL_SOCKET, SO_TYPE, (char *) &type, &size) < 0) {
    ...
}
After the getsockopt() call, type will be set to the value of the socket type,
as defined in <sys/socket.h>. If, for example, the socket were a datagram
socket, type would have the value corresponding to SOCK_DGRAM.
inetd
One of the daemons provided with SunOS is inetd, the so-called "Internet
super-server." inetd is invoked at boot time, and determines from the file
/etc/inetd.conf the services for which it is to listen. Once this information
has been read and a pristine environment created, inetd proceeds to create one
socket for each service it is to listen for, binding the appropriate port
number to each socket.
inetd then performs a select() on all these sockets for read availability,
waiting for somebody wishing a connection to the service corresponding to that
socket. inetd then performs an accept() on the socket in question, fork()s,
dup()s the new socket to file descriptors 0 and 1 (stdin and stdout), closes
other open file descriptors, and exec()s the appropriate server.
Servers making use of inetd are considerably simplified, as inetd takes care
of the majority of the IPC work required in establishing a connection. The
server invoked by inetd expects the socket connected to its client on file
descriptors 0 and 1, and may immediately perform any operations such as
read(), write(), send(), or recv(). Indeed, servers may use buffered I/O as
provided by the "stdio" conventions, as long as they remember to use fflush()
when appropriate.
One call that may be of interest to individuals writing servers to be invoked
by inetd is the getpeername() call, which returns the address of the peer
(process) connected on the other end of the socket. For example, to log the
Internet address in "dot notation" (e.g., "128.32.0.4") of a client connected
to a server under inetd, the following code might be used:
struct sockaddr_in name;
int namelen = sizeof name;

if (getpeername(0, (struct sockaddr *)&name, &namelen) < 0) {
    syslog(LOG_ERR, "getpeername: %m");
    exit(1);
} else
    syslog(LOG_INFO, "Connection from %s",
        inet_ntoa(name.sin_addr));
While the getpeername() call is especially useful when writing programs to run
with inetd, it can be used under other circumstances. Be warned, however, that
getpeername() will fail on UNIX domain sockets.
12
Socket-Based IPC Implementation Notes
This chapter describes the internal structure of the socket-based networking
facilities originally developed for the 4.2BSD version of the UNIX system and
subsequently integrated into SunOS. These facilities are based on several
central abstractions that structure and unify the external (user) view of
network communication as well as the internal (system) implementation. In
addition, the implementation introduces a structure for network communications
that may be used by system implementors in adding new networking facilities.
The internal structure is not visible to the user; rather, it is intended to
aid implementors of communication protocols and network services by providing
a framework that promotes code sharing and minimizes implementation effort.

The reader is expected to be familiar with the C programming language and
system interface, as described in the System Services Overview. Basic
understanding of network communication concepts is assumed; where required,
any additional ideas are introduced.
The remainder of this document provides a description of the system internals,
avoiding, when possible, overlap with the interprocess communication tutorials.
Overview
If we consider the International Standards Organization's (ISO) Open System
Interconnection (OSI) model of network communication [ISO81] [Zimmermann80],
the networking facilities described here correspond to a portion of the
session layer, all of the transport and network layers, and some of the
data-link layer.

The network layer provides possibly imperfect data transport services with
minimal addressing structure. Addressing at this level is normally host to
host, with implicit or explicit routing optionally supported by the
communicating agents.
At the transport layer the notions of reliable transfer, data sequencing, flow
control, and service addressing are normally included. Reliability is usually
managed by explicit acknowledgement of data delivered. Failure to acknowledge
a transfer results in retransmission of the data. Sequencing may be handled by
tagging each message handed to the network layer with a sequence number and
maintaining state at the endpoints of communication to utilize received
sequence numbers in reordering data that arrives out of order.
The session layer facilities may provide forms of addressing that are mapped
into formats required by the transport layer, service authentication and
client authentication, etc. Various systems also provide services such as data
encryption and address and protocol translation.
The following sections begin by describing some of the common data structures
and utility routines, then examine the internal layering. The contents of each
layer and its interface are considered. Certain of the interfaces are
protocol-implementation specific. For these cases examples have been drawn
from the Internet [Cerf78] protocol family. Later sections cover routing
issues, the design of the raw socket interface, and other miscellaneous
topics.
Goals
The networking system was designed with the goal of supporting multiple
protocol families and addressing styles. This required information to be
"hidden" in common data structures that could be manipulated by all the pieces
of the system, but that required interpretation only by the protocols that
"controlled" it. The system described here attempts to minimize the use of
shared data structures to those kept by a suite of protocols (a protocol
family), and those used for rendezvous between "synchronous" and
"asynchronous" portions of the system (e.g., queues of data packets that are
filled at interrupt time and emptied based on user requests).
A major goal of the system was to provide a framework within which new
protocols and hardware could easily be supported. To this end, a great deal of
effort has been expended to create utility routines that hide many of the more
complex and/or hardware-dependent chores of networking. Later sections
describe the utility routines and the underlying data structures they
manipulate.
12.1. Memory, Addressing
Address Representation
Common to all portions of the system are two data structures. These structures
are used to represent addresses and various data objects. Addresses are internally
described by the sockaddr structure,
struct sockaddr {
    short   sa_family;      /* address family */
    char    sa_data[14];    /* up to 14 bytes of direct address */
};
All addresses belong to one or more address families which define their format
and interpretation. The sa_family field indicates the address family to which
the address belongs, and the sa_data field contains the actual data value. The
size of the data field, 14 bytes, was selected based on a study of current
address formats. Specific address formats use private structure definitions
that define the format of the data field. The system interface supports larger
address structures, although address-family-independent support facilities,
for example routing and raw socket interfaces, provide only 14 bytes for
address storage. Protocols that do not use those facilities (e.g., the current
UNIX domain) may use larger data areas.[16]

[16] Later versions of the system may support variable-length addresses.
Memory Management
A single structure is used for data storage: the memory buffer, or "mbuf".
There are three kinds of mbufs, "small", "cluster", and "loaned", which differ
in the policies and mechanisms by which their associated storage is allocated
and managed.
Small mbufs
Small mbufs are the fundamental type, and are used both on their own and as
building blocks for cluster and loaned mbufs. They contain their own storage
in the array (see below) named m_dat. That array is defined as containing 112
(MLEN) bytes, so that is all the data that a small mbuf can hold. Small mbufs
are guaranteed to start on a 128-byte boundary. The dtom macro, described
below, works correctly only with small mbufs; mistaken attempts to use dtom
with cluster and loaned mbufs are a common source of insidious error.
Cluster mbufs
Cluster mbufs support the storage and sharing of larger amounts of data. They
do so by dynamically allocating storage, as necessary, from a pool of
fixed-sized buffers called clusters. These clusters, each of which is MCLBYTES
(1K) in size, are managed by the mbuf system itself. The mbuf system uses a
small mbuf to refer to a given cluster by setting its m_off field to refer to
a location in the interior (most commonly, the beginning) of the cluster. This
combination of a small mbuf and a cluster is what constitutes a cluster mbuf.

Cluster mbufs can be shared because clusters are reference-counted. The
routine mcldup() arranges to share an existing cluster mbuf by increasing its
reference count and attaching a new small mbuf to it. Cluster mbufs always
have their m_cltype field set to MCL_STATIC.
Loaned mbufs
Loaned mbufs provide for treating storage not directly managed by the mbuf
system in the same way as normal mbufs. The mbuf system uses small mbufs to
store bookkeeping information about loaned mbufs, as it does with cluster
mbufs. With loaned mbufs, however, storage is provided by the allocator, who
is ultimately responsible for freeing it as well. To allocate a loaned mbuf,
one calls mclgetx(), which takes as arguments the address of the buffer to be
loaned, its length, a pointer to a function, and an argument to be passed to
that function when it is called. This function is called when the loaned mbuf
is freed, and must do whatever is necessary to clean up the loaned buffer. The
m_clfun and m_clarg fields of the mbuf structure record the pointer to this
function and its argument. Loaned mbufs have their m_cltype field set to
MCL_LOANED.
An mbuf structure has the form:
#define MSIZE       128
#define MMINOFF     12
#define MTAIL       4
#define MLEN        (MSIZE-MMINOFF-MTAIL)

struct mbuf {
    struct  mbuf *m_next;           /* next buffer in chain */
    u_long  m_off;                  /* offset of data */
    short   m_len;                  /* amount of data in this mbuf */
    short   m_type;                 /* mbuf type (0 == free) */
    union {
        u_char  mun_dat[MLEN];      /* data storage */
        struct {
            short   mun_cltype;     /* "cluster" type */
            int     (*mun_clfun)();
            int     mun_clarg;
            int     (*mun_clswp)();
        } mun_cl;
    } m_un;
    struct  mbuf *m_act;            /* link in higher-level mbuf list */
};

#define m_dat       m_un.mun_dat
#define m_cltype    m_un.mun_cl.mun_cltype
#define m_clfun     m_un.mun_cl.mun_clfun
#define m_clarg     m_un.mun_cl.mun_clarg
The m_next field is used to chain mbufs together on linked lists, while the
m_act field allows lists of mbuf chains to be accumulated. By convention, the
mbufs common to a single object (for example, a packet) are chained together
with the m_next field, while groups of objects are linked via the m_act field
(possibly when in a queue).

The m_len field indicates the amount of data, while the m_off field is an
offset to the beginning of the data from the base of the mbuf. Thus, for
example, the macro mtod(), which converts a pointer to an mbuf to a pointer to
the data stored in the mbuf, has the form
    #define mtod(x,t)       ((t)((int)(x) + (x)->m_off))
(note the t parameter, a C type cast, which is used to cast the resultant pointer for
proper assignment). Since a small mbuf's data always resides in the mbuf's own
m_dat array, its m_off value is always less than MSIZE. On the other hand,
storage for cluster and loaned mbufs is external to the mbufs themselves, so their
m_off values are always at least MSIZE. The M_HASCL macro distinguishes
these two cases and is defined as

    #define M_HASCL(m)      ((m)->m_off >= MSIZE)
As mentioned above, the dtom macro is safe to use only if M_HASCL evaluates
false.
Chapter 12 - Socket-Based IPC Implementation Notes
The following routines and macros may be used to allocate and free mbufs:
m = m_get(wait, type);
MGET(m, wait, type);
    The subroutine m_get() and the macro MGET() each allocate an mbuf,
    placing its address in m. The argument wait is either M_WAIT or
    M_DONTWAIT according to whether allocation should block or fail if no
    mbuf is available. The type is one of the predefined mbuf types for use in
    accounting of mbuf allocation.
MCLGET(m);
    This macro attempts to allocate an mbuf cluster to associate with the mbuf
    m. If successful, the length of the mbuf is set to MCLSIZE. The routine
    mclget() is similar, but returns success/failure.
mclgetx(fun, arg, addr, len, wait);
    This routine wraps the storage defined by addr and len with an
    MCL_LOANED mbuf. The fun argument gives a function to be called when
    the resulting loaned mbuf is freed, and arg is a value that will be supplied to
    that function as its argument. The argument wait is either M_WAIT or
    M_DONTWAIT according to whether allocation should block or fail if no
    mbuf is available.
mcldup(m, n, off);
    A duplicator for cluster and loaned mbufs, which duplicates m into n. If m is
    a cluster mbuf, mcldup() simply bumps its reference count and ignores
    off. But if m is a loaned mbuf, mcldup() allocates a chunk of memory and
    copies it, starting at offset off.
n = m_free(m);
MFREE(m, n);
    The routine m_free() and the macro MFREE() each free a single mbuf,
    m, and any associated external storage area, placing a pointer to its successor
    in the chain it heads, if any, in n.
m_freem(m);
This routine frees an mbuf chain headed by m.
By ensuring that mbufs always reside on 128-byte boundaries, it is always possible to locate the mbuf associated with a data area by masking off the low bits of
the virtual address. This allows modules to store data structures in mbufs and
pass them around without concern for locating the original mbuf when it comes
time to free the structure. Note that this works only with objects stored in the
internal data buffer of the mbuf. The dtom macro is used to convert a pointer
into an mbuf's data area to a pointer to the mbuf:

    #define dtom(x)         ((struct mbuf *)((int)x & ~(MSIZE-1)))
Mbufs are used for dynamically allocated data structures such as sockets as well
as memory allocated for packets and headers. Statistics are maintained on mbuf
usage and can be viewed by users using the netstat program. The following
utility routines are available for manipulating mbuf chains:
m = m_copy(m0, off, len);
    The m_copy() routine creates a copy of all, or part, of a list of the mbufs in
    m0. len bytes of data, starting off bytes from the front of the chain, are
    copied. Where possible, reference counts are manipulated in preference to
    core-to-core copies. The original mbuf chain must have at least off + len
    bytes of data. If len is specified as M_COPYALL, all the data present, offset
    as before, is copied.
m_cat(m, n);
    The mbuf chain, n, is appended to the end of m. Where possible, compaction is performed.
m_cpytoc(m, off, len, cp);
    Copies a part of the contents of the mbuf m to the contiguous memory
    pointed to by cp, skipping the first off bytes and copying the next len bytes.
    It returns the number of bytes remaining in the mbuf following the portion
    copied. m is left unaltered.
m_adj(m, diff);
    The mbuf chain, m, is adjusted in size by diff bytes. If diff is non-negative,
    diff bytes are shaved off the front of the mbuf chain. If diff is negative, the
    alteration is performed from back to front. No space is reclaimed in this
    operation; alterations are accomplished by changing the m_len and m_off
    fields of mbufs.
m = m_pullup(m0, size);
    After a successful call to m_pullup(), the mbuf at the head of the returned
    list, m, is guaranteed to have at least size bytes of data in contiguous memory
    within the data area of the mbuf (allowing access via a pointer, obtained
    using the mtod() macro, and allowing the mbuf to be located from a
    pointer to the data area using dtom, defined above). If the original data was
    less than size bytes long, size was greater than the size of an mbuf data area
    (112 bytes), or required resources were unavailable, m is 0 and the original
    mbuf chain is deallocated.
    This routine is particularly useful when verifying packet header lengths on
    reception. For example, if a packet is received and only 8 of the necessary
    16 bytes required for a valid packet header are present at the head of the list
    of mbufs representing the packet, the remaining 8 bytes may be "pulled up"
    with a single m_pullup() call. If the call fails, the invalid packet will
    have been discarded.
12.2. Internal Layering
The internal structure of the network system is divided into three layers. These
layers correspond to the services provided by the socket abstraction, those provided by the communication protocols, and those provided by the hardware interfaces. The communication protocols are normally layered into two or more individual cooperating layers, though they are collectively viewed in the system as
one layer providing services supportive of the appropriate socket abstraction.
The following sections describe the properties of each layer in the system and the
interfaces to which each must conform.
Socket Layer
The socket layer deals with the interprocess communication facilities provided
by the system. A socket is a bidirectional endpoint of communication which is
"typed" by the semantics of communication it supports. For more information
about the system calls used to manipulate sockets, see A Socket-Based Interprocess Communications Tutorial and An Advanced Socket-Based Interprocess
Communications Tutorial, both sections of Network Programming.
A socket consists of the following data structure:
    struct socket {
            short   so_type;                /* generic type, see socket.h */
            short   so_options;             /* from socket call */
            short   so_linger;              /* time to linger while closing */
            short   so_state;               /* internal state flags SS_*, below */
            caddr_t so_pcb;                 /* protocol control block */
            struct  protosw *so_proto;      /* protocol handle */
    /*
     * Variables for connection queueing. A socket where accepts occur is so_head
     * in all subsidiary sockets. If so_head is 0, the socket is not related to an
     * accept. For a head socket so_q0 queues partially completed connections, while
     * so_q is a queue of connections ready to be accepted. If a connection is
     * aborted and it has so_head set, then it has to be pulled out of either
     * so_q0 or so_q. We allow connections to queue up based on current
     * queue lengths and limit on number of queued connections for this socket.
     */
            struct  socket *so_head;        /* back pointer to accept socket */
            struct  socket *so_q0;          /* queue of partial connections */
            struct  socket *so_q;           /* queue of incoming connections */
            short   so_q0len;               /* partials on so_q0 */
            short   so_qlen;                /* number of connections on so_q */
            short   so_qlimit;              /* max # of queued connections */
            short   so_timeo;               /* connection timeout */
            u_short so_error;               /* error affecting connection */
            short   so_pgrp;                /* pgrp for signals */
            u_short so_oobmark;             /* chars to oob mark */
    /*
     * Variables for socket buffering.
     */
            struct  sockbuf so_rcv;         /* receive buffer */
            struct  sockbuf so_snd;         /* send buffer */
    /*
     * Hooks for alternative wakeup strategies.
     * These are used by kernel subsystems wishing to access the socket
     * abstraction. If so_wupfunc is nonnull, it is called in place of
     * wakeup any time that wakeup would otherwise be called with an
     * argument whose value is an address lying within a socket structure.
     */
            struct  wupalt *so_wupalt;
    };

    struct wupalt {
            int     (*wup_func)();          /* function to call instead of wakeup */
            caddr_t wup_arg;                /* argument for so_wupfunc */
            /*
             * Other state information here, e.g. for a stream
             * connected to a socket
             */
    };
Each socket contains two data queues, so_rcv and so_snd, for receiving and
sending (see below for a discussion), as well as protocol information, private data, error
information and pointers to routines which provide supporting services.
The type of the socket, so_type, is defined at socket creation time and used in
selecting those services that are appropriate to support it. The supporting protocol is selected at socket creation time and recorded in the socket data structure
for later use. Protocols are defined by a table of procedures, the protosw structure, which will be described in detail later. A pointer to a protocol-specific data
structure, the "protocol control block," is also present in the socket structure.
Protocols control this data structure, which normally includes a back pointer to
the parent socket structure to allow easy lookup when returning information to a
user (for example, placing an error number in the so_error field). Other
entries in the socket structure are used in queuing connection requests, validating
user requests, storing socket characteristics (e.g. options supplied at the time a
socket is created), and maintaining a socket's state.
Processes "rendezvous at a socket" in many instances. For instance, when a process wishes to extract data from a socket's receive queue and it is empty, or lacks
sufficient data to satisfy the request, the process blocks, supplying the address of
the receive queue as a "wait channel" to be used in notification. When data
arrives for the process and is placed in the socket's queue, the blocked process is
identified by the fact it is waiting "on the queue."
Socket State
A socket's state is defined from the following:
    #define SS_NOFDREF              0x001   /* no file table ref any more */
    #define SS_ISCONNECTED          0x002   /* socket connected to a peer */
    #define SS_ISCONNECTING         0x004   /* in process of connecting to peer */
    #define SS_ISDISCONNECTING      0x008   /* in process of disconnecting */
    #define SS_CANTSENDMORE         0x010   /* can't send more data to peer */
    #define SS_CANTRCVMORE          0x020   /* can't take more data from peer */
    #define SS_RCVATMARK            0x040   /* at mark on input */

    #define SS_PRIV                 0x080   /* privileged */
    #define SS_NBIO                 0x100   /* non-blocking ops */
    #define SS_ASYNC                0x200   /* async i/o notify */
The state of a socket is manipulated both by the protocols and the user (through
system calls). When a socket is created, the state is defined based on the type of
socket. It may change as control actions are performed, for example connection
establishment. It may also change according to the type of input/output the user
wishes to perform, as indicated by options set with fcntl(). "Non-blocking"
I/O implies that a process should never be blocked to await resources. Instead,
any call that would block returns prematurely with the error EWOULDBLOCK, or
the service request (e.g. a request for more data than is present) may be only partially fulfilled.
If a process requested "asynchronous" notification of events related to the socket,
the SIGIO signal is posted to the process when such events occur. An event is a
change in the socket's state; examples of such occurrences are space becoming
available in the send queue, new data available in the receive queue, connection
establishment or disestablishment, etc.
A socket may be marked "privileged" if it was created by the super-user. Only
privileged sockets may bind addresses in privileged portions of an address space
or use "raw" sockets to access lower levels of the network.
Socket Data Queues
A socket's data queue contains a pointer to the data stored in the queue and other
entries related to the management of the data. The structure of a data queue,
struct sockbuf, is:
    struct sockbuf {
            u_short sb_cc;                  /* actual chars in buffer */
            u_short sb_hiwat;               /* max actual char count */
            u_short sb_mbcnt;               /* chars of mbufs used */
            u_short sb_mbmax;               /* max chars of mbufs to use */
            u_short sb_lowat;               /* low water mark (not used yet) */
            struct  mbuf *sb_mb;            /* the mbuf chain */
            struct  proc *sb_sel;           /* process selecting read/write */
            short   sb_timeo;               /* timeout (not used yet) */
            short   sb_flags;               /* flags, see below */
    } so_rcv, so_snd;
Data is stored in a queue as a chain of mbufs. The actual count of data characters
as well as high and low water marks are used by the protocols in controlling the
flow of data. The amount of buffer space (characters of mbufs and associated
data clusters) is also recorded along with the limit on buffer allocation. The
socket routines cooperate in implementing the flow control policy by blocking a
process when it requests to send data and the high water mark has been reached,
or when it requests to receive data and less than the low water mark is present
(assuming non-blocking I/O has not been specified). 17
A socket queue has a number of flags used in synchronizing access to the data
and in acquiring resources:
    #define SB_MAX          65535   /* max chars in sockbuf */
    #define SB_LOCK         0x01    /* lock on data queue (so_rcv only) */
    #define SB_WANT         0x02    /* someone is waiting to lock */
    #define SB_WAIT         0x04    /* someone is waiting for data/space */
    #define SB_SEL          0x08    /* buffer is selected */
    #define SB_COLL         0x10    /* collision selecting */

17 The low-water mark is always presumed to be 0 in the current implementation.
The last two flags are manipulated by the system in implementing the select
mechanism.
When a socket is created, the supporting protocol "reserves" space for the send
and receive queues of the socket. The limit on buffer allocation is set somewhat
higher than the limit on data characters to account for the granularity of buffer
allocation. The actual storage associated with a socket queue may fluctuate during a socket's lifetime, but it is assumed that this reservation will always allow a
protocol to acquire enough memory to satisfy the high water marks.
The timeout and select values are manipulated by the socket routines in implementing various portions of the interprocess communications facilities and will
not be described here.
Data queued at a socket is stored in one of two styles. Stream-oriented sockets
queue data with no addresses, headers or record boundaries. The data are in
mbufs linked through the m_next field. Buffers containing access rights may be
present within the chain if the underlying protocol supports passage of access
rights. Record-oriented sockets, including datagram sockets, queue data as a list
of packets; the sections of packets are distinguished by the types of the mbufs
containing them. The mbufs that comprise a record are linked through the
m_next field; records are linked from the m_act field of the first mbuf of one
packet to the first mbuf of the next. Each packet begins with an mbuf containing
the "from" address if the protocol provides it, then any buffers containing access
rights, and finally any buffers containing data. If a record contains no data, no
data buffers are required unless neither address nor access rights are present.
Socket Connection Queuing
In dealing with connection-oriented sockets (e.g. SOCK_STREAM) the two ends
are considered distinct. One end is termed active, and generates connection
requests. The other end is called passive and accepts connection requests.
From the passive side, a socket is marked with SO_ACCEPTCONN when a
listen() call is made, creating two queues of sockets: so_q0 for connections
in progress and so_q for connections already made and awaiting user acceptance. As a protocol is preparing incoming connections, it creates a socket structure queued on so_q0 by calling the routine sonewconn(). When the connection is established, the socket structure is then transferred to so_q, making it
available for an accept().
If an SO_ACCEPTCONN socket is closed with sockets on either so_q0 or so_q,
these sockets are dropped, with notification to the peers as appropriate.
Protocol Layer(s)
Each socket is created in a communications domain, which usually implies both
an addressing structure (address family) and a set of protocols that implement
various socket types within the domain (protocol family). Each domain is
defined by the following structure:
    struct domain {
            int     dom_family;             /* PF_xxx */
            char    *dom_name;
            int     (*dom_init)();          /* initialize domain structures */
            int     (*dom_externalize)();   /* externalize access rights */
            int     (*dom_dispose)();       /* dispose of internalized rights */
            struct  protosw *dom_protosw, *dom_protoswNPROTOSW;
            struct  domain *dom_next;
    };
At boot time, each domain configured into the kernel is added to a linked list of
domains. The initialization procedure of each domain is then called. After that
time, the domain structure is used to locate protocols within the protocol family.
It may also contain procedure references for externalization of access rights at the
receiving socket and the disposal of access rights that are not received.
Protocols are described by a set of entry points and certain socket-visible characteristics, some of which are used in deciding which socket type(s) they may support.
An entry in the "protocol switch" table exists for each protocol module
configured into the system. It has the following form:
    struct protosw {
            short   pr_type;                /* socket type used for */
            struct  domain *pr_domain;      /* domain protocol a member of */
            short   pr_protocol;            /* protocol number */
            short   pr_flags;               /* socket visible attributes */
    /* protocol-protocol hooks */
            int     (*pr_input)();          /* input to protocol (from below) */
            int     (*pr_output)();         /* output to protocol (from above) */
            int     (*pr_ctlinput)();       /* control input (from below) */
            int     (*pr_ctloutput)();      /* control output (from above) */
    /* user-protocol hook */
            int     (*pr_usrreq)();         /* user request */
    /* utility hooks */
            int     (*pr_init)();           /* initialization routine */
            int     (*pr_fasttimo)();       /* fast timeout (200ms) */
            int     (*pr_slowtimo)();       /* slow timeout (500ms) */
            int     (*pr_drain)();          /* flush any excess space possible */
    };
A protocol is called through the pr_init entry before any other. Thereafter it
is called every 200 milliseconds through the pr_fasttimo entry and every
500 milliseconds through the pr_slowtimo for timer-based actions. The system will call the pr_drain entry if it is low on space and this should throw
away any non-critical data.
Protocols pass data between themselves as chains of mbufs using the
pr_input() and pr_output() routines. pr_input() passes data up
(towards the user) and pr_output() passes it down (towards the network);
control information passes up and down on pr_ctlinput() and
pr_ctloutput(). The protocol is responsible for the space occupied by any
of the arguments to these entries and must either pass it onward or dispose of it.
(On output, the lowest level reached must free buffers storing the arguments; on
input, the highest level is responsible for freeing buffers.)
The pr_usrreq() routine interfaces protocols to the socket code and is
described below.
The pr_flags field is constructed from the following values:

    #define PR_ATOMIC       0x01    /* exchange atomic messages only */
    #define PR_ADDR         0x02    /* addresses given with messages */
    #define PR_CONNREQUIRED 0x04    /* connection required by protocol */
    #define PR_WANTRCVD     0x08    /* want PRU_RCVD calls */
    #define PR_RIGHTS       0x10    /* passes capabilities */
Protocols that are connection-based specify the PR_CONNREQUIRED flag so
that the socket routines will never attempt to send data before a connection has
been established. If the PR_WANTRCVD flag is set, the socket routines will
notify the protocol when the user has removed data from the socket's receive
queue. This allows the protocol to implement acknowledgement on user receipt,
and also update windowing information based on the amount of space available
in the receive queue. The PR_ADDR field indicates that any data placed in the
socket's receive queue will be preceded by the address of the sender. The
PR_ATOMIC flag specifies that each user request to send data must be performed
in a single protocol send request; it is the protocol's responsibility to maintain
record boundaries on data to be sent. The PR_RIGHTS flag indicates that the
protocol supports the passing of capabilities; this is currently used only by the
protocols in the UNIX protocol family.
When a socket is created, the socket routines scan the protocol table for the
domain looking for an appropriate protocol to support the type of socket being
created. The pr_type field contains one of the possible socket types (e.g.
SOCK_STREAM), while the pr_domain is a back pointer to the domain structure. The pr_protocol field contains the protocol number of the protocol,
normally a well-known value.
Network-Interface Layer
Each network interface configured into a system defines a path through which
packets may be sent and received. Normally a hardware device is associated
with this interface, though there is no requirement for this (for example, all systems have a software "loopback" interface used for debugging and performance
analysis). In addition to manipulating the hardware device, an interface module
is responsible for encapsulation and decapsulation of any link-layer header information required to deliver a message to its destination. The selection of which
interface to use in delivering packets is a routing decision carried out at a higher
level than the network-interface layer. An interface may have addresses in one or
more address families. The address is set at boot time using an ioctl() on a
socket in the appropriate domain; this operation is implemented by the protocol
family, after verifying the operation through the device ioctl() entry.
An interface is defined by the following structure:

    struct ifnet {
            char    *if_name;               /* name, e.g. "en" or "lo" */
            short   if_unit;                /* sub-unit for lower level driver */
            short   if_mtu;                 /* maximum transmission unit */
            short   if_flags;               /* up/down, broadcast, etc. */
            short   if_timer;               /* time 'til if_watchdog called */
            u_short if_promisc;             /* # of requests for promiscuous mode */
            int     if_metric;              /* routing metric (external only) */
            struct  ifaddr *if_addrlist;    /* linked list of addresses per if */
            struct  ifqueue {
                    struct  mbuf *ifq_head;
                    struct  mbuf *ifq_tail;
                    int     ifq_len;
                    int     ifq_maxlen;
                    int     ifq_drops;
            } if_snd;                       /* output queue */
    /* procedure handles */
            int     (*if_init)();           /* init routine */
            int     (*if_output)();         /* output routine */
            int     (*if_ioctl)();          /* ioctl routine */
            int     (*if_reset)();          /* bus reset routine */
            int     (*if_watchdog)();       /* timer routine */
    /* generic interface statistics */
            int     if_ipackets;            /* packets received on interface */
            int     if_ierrors;             /* input errors on interface */
            int     if_opackets;            /* packets sent on interface */
            int     if_oerrors;             /* output errors on interface */
            int     if_collisions;          /* collisions on csma interfaces */
    /* end statistics */
            struct  ifnet *if_next;
            struct  ifnet *if_upper;        /* next layer up */
            struct  ifnet *if_lower;        /* next layer down */
            int     (*if_input)();          /* input routine */
            int     (*if_ctlin)();          /* control input routine */
            int     (*if_ctlout)();         /* control output routine */
    #ifdef sun
            struct  map *if_memmap;         /* rmap for interface specific memory */
    #endif
    };
Each interface address has the following form:

    struct ifaddr {
            struct  sockaddr ifa_addr;      /* address of interface */
            union {
                    struct  sockaddr ifu_broadaddr;
                    struct  sockaddr ifu_dstaddr;
            } ifa_ifu;
            struct  ifnet *ifa_ifp;         /* back-pointer to interface */
            struct  ifaddr *ifa_next;       /* next address for interface */
    };
    #define ifa_broadaddr   ifa_ifu.ifu_broadaddr   /* broadcast address */
    #define ifa_dstaddr     ifa_ifu.ifu_dstaddr     /* other end of link */
The protocol generally maintains this structure as part of a larger structure containing additional information concerning the address.
Each interface has a send queue and routines used for initialization (if_init),
input (if_input), and output (if_output). If the interface resides on a system bus, the routine if_reset will be called after a bus reset has been performed. An interface may also specify a timer routine, if_watchdog; if
if_timer is non-zero, it is decremented once per second until it reaches zero,
at which time the watchdog routine is called.
The state of an interface and certain characteristics are stored in the if_flags
field. The following values are possible:
    #define IFF_UP          0x1     /* interface is up */
    #define IFF_BROADCAST   0x2     /* broadcast is possible */
    #define IFF_DEBUG       0x4     /* turn on debugging */
    #define IFF_LOOPBACK    0x8     /* is a loopback net */
    #define IFF_POINTOPOINT 0x10    /* interface is point-to-point link */
    #define IFF_NOTRAILERS  0x20    /* avoid use of trailers */
    #define IFF_RUNNING     0x40    /* resources allocated */
    #define IFF_NOARP       0x80    /* no address resolution protocol */
    #define IFF_PROMISC     0x100   /* receive all packets */
    #define IFF_ALLMULTI    0x200   /* receive all multicast packets */
If the interface is connected to a network that supports transmission of broadcast
packets, the IFF_BROADCAST flag will be set and the ifa_broadaddr field
will contain the address to be used in sending or accepting a broadcast packet. If
the interface is associated with a point-to-point hardware link (for example,
SunLink/INR), the IFF_POINTOPOINT flag will be set and ifa_dstaddr
will contain the address of the host on the other side of the connection. These
addresses and the local address of the interface, if_addr, are used in filtering
incoming packets. The interface sets IFF_RUNNING after it has allocated system resources and posted an initial read on the device it manages. This state bit
is used to avoid multiple allocation requests when an interface's address is
changed. The IFF_NOTRAILERS flag indicates the interface should refrain
from using a trailer encapsulation on outgoing packets, or (where per-host negotiation of trailers is possible) that trailer encapsulations should not be requested;
trailer protocols are described in section 14. The IFF_NOARP flag indicates the
interface should not use an "address resolution protocol" in mapping internetwork addresses to local network addresses. The IFF_PROMISC bit is set when
the interface is in promiscuous mode, indicating that it should receive all incoming packets regardless of their intended destination.
Various statistics are also stored in the interface structure. These may be viewed
by users using the netstat(1) program.
The interface address and flags may be set with the SIOCSIFADDR and
SIOCSIFFLAGS ioctls. SIOCSIFADDR is used initially to define each
interface's address; SIOCSIFFLAGS can be used to mark an interface down and
perform site-specific configuration. The destination address of a point-to-point
link is set with SIOCSIFDSTADDR. Corresponding operations exist to read
each value. Protocol families may also support operations to set and read the
broadcast address. The SIOCADDMULTI and SIOCDELMULTI ioctls may
be used to add and remove multicast addresses from the set that the interface
accepts. In addition, the SIOCGIFCONF ioctl retrieves a list of interface
names and addresses for all interfaces and address families on the host.
12.3. Socket/Protocol Interface
The interface between the socket routines and the communication protocols is
through the pr_usrreq() routine defined in the protocol switch table. The
following requests to a protocol module are possible:
    #define PRU_ATTACH      0       /* attach protocol */
    #define PRU_DETACH      1       /* detach protocol */
    #define PRU_BIND        2       /* bind socket to address */
    #define PRU_LISTEN      3       /* listen for connection */
    #define PRU_CONNECT     4       /* establish connection to peer */
    #define PRU_ACCEPT      5       /* accept connection from peer */
    #define PRU_DISCONNECT  6       /* disconnect from peer */
    #define PRU_SHUTDOWN    7       /* won't send any more data */
    #define PRU_RCVD        8       /* have taken data; more room now */
    #define PRU_SEND        9       /* send this data */
    #define PRU_ABORT       10      /* abort (fast DISCONNECT, DETACH) */
    #define PRU_CONTROL     11      /* control operations on protocol */
    #define PRU_SENSE       12      /* return status into m */
    #define PRU_RCVOOB      13      /* retrieve out of band data */
    #define PRU_SENDOOB     14      /* send out of band data */
    #define PRU_SOCKADDR    15      /* fetch socket's address */
    #define PRU_PEERADDR    16      /* fetch peer's address */
    #define PRU_CONNECT2    17      /* connect two sockets */
    /* begin for protocol's internal use */
    #define PRU_FASTTIMO    18      /* 200ms timeout */
    #define PRU_SLOWTIMO    19      /* 500ms timeout */
    #define PRU_PROTORCV    20      /* receive from below */
    #define PRU_PROTOSEND   21      /* send to below */
A call on the user request routine is of the form:

    error = (*protosw[].pr_usrreq)(so, req, m, addr, rights);
    int error;
    struct socket *so;
    int req;
    struct mbuf *m, *addr, *rights;

The mbuf data chain m is supplied for output operations and for certain other
operations where it is to receive a result. The address addr is supplied for
address-oriented requests such as PRU_BIND and PRU_CONNECT. The rights
parameter is an optional pointer to an mbuf chain containing user-specified
capabilities (see the sendmsg() and recvmsg() system calls). The protocol
is responsible for disposal of the data mbuf chains on output operations. A nonzero return value gives a UNIX error number that should be passed to higher
level software. The following paragraphs describe each of the requests possible.
PRU_ATTACH
    When a protocol is bound to a socket (with the socket() system call) the
    protocol module is called with this request. It is the responsibility of the
    protocol module to allocate any resources necessary. The "attach" request
    will always precede any of the other requests, and should not occur more
    than once.
PRU_DETACH
    This is the antithesis of the attach request, and is used at the time a socket is
    deleted. The protocol module may deallocate any resources assigned to the
    socket.
PRU_BIND
    When a socket is initially created it has no address bound to it. This request
    indicates that an address should be bound to an existing socket. The protocol module must verify that the requested address is valid and available for
    use.
PRU_LISTEN
The "listen" request indicates the user wishes to listen for incoming connection requests on the associated socket. The protocol module should perform
any state changes needed to carry out this request (if possible). A "listen"
request always precedes a request to accept a connection.
PRU_CONNECT
The "connect" request indicates the user wants to establish an association.
The addr parameter supplied describes the peer to be connected to. The
effect of a connect request may vary depending on the protocol. Virtual circuit protocols, such as TCP [Postel81b], use this request to initiate establishment of a TCP connection. Datagram protocols, such as UDP [Postel80],
simply record the peer's address in a private data structure and use it to tag
all outgoing packets. There are no restrictions on how many times a connect
request may be used after an attach. If a protocol supports the notion of
multi-casting, it is possible to use multiple connects to establish a multi-cast
group. Alternatively, an association may be broken by a
PRU_DISCONNECT request, and a new association created with a subsequent connect request; all without destroying the socket and creating a new one.
PRU_ACCEPT
Following a successful PRU_LISTEN request and the arrival of one or more
connections, this request is made to indicate the user has accepted the first
connection on the queue of pending connections. The protocol module
should fill in the supplied address buffer with the address of the connected
party.
PRU_DISCONNECT
Eliminate an association created with a PRU_CONNECT request.
Chapter 12 -
Socket-Based IPC Implementation Notes
PRU_SHUTDOWN
This call is used to indicate no more data will be sent and/or received (the
addr parameter indicates the direction of the shutdown, as encoded in the
soshutdown() system call). The protocol may, at its discretion, deallocate any data structures related to the shutdown and/or notify a connected
peer of the shutdown.
PRU_RCVD
This request is made only if the protocol entry in the protocol switch table
includes the PR_WANTRCVD flag. When a user removes data from the
receive queue this request will be sent to the protocol module. It may be
used to trigger acknowledgements, refresh windowing information, initiate
data transfer, etc.
PRU_SEND
Each user request to send data is translated into one or more PRU_SEND
requests (a protocol may indicate that a single user send request must be
translated into a single PRU_SEND request by specifying the PR_ATOMIC
flag in its protocol description). The data to be sent is presented to the protocol as a list of mbufs, and an address is, optionally, supplied in the addr
parameter. The protocol is responsible for preserving the data in the
socket's send queue if it is not able to send it immediately, or if it may need
it at some later time (e.g. for retransmission).
PRU_ABORT
This request indicates an abnormal termination of service. The protocol
should delete any existing association(s).
PRU_CONTROL
The "control" request is generated when a user performs a UNIX ioctl()
system call on a socket (and the ioctl is not intercepted by the socket routines). It allows protocol-specific operations to be provided outside the
scope of the common socket interface. The addr parameter contains a
pointer to a static kernel data area where relevant information may be
obtained or returned. The m parameter contains the actual ioctl()
request code (note the non-standard calling convention). The rights parameter contains a pointer to an ifnet structure if the ioctl() operation pertains to a particular network interface.
PRU_SENSE
The "sense" request is generated when the user makes an fstat() system
call on a socket; it requests status of the associated socket. This currently
returns a standard stat() structure. It typically contains only the optimal
transfer size for the connection (based on buffer size, windowing information and maximum packet size). The m parameter contains a pointer to a
static kernel data area where the status buffer should be placed.
PRU_RCVOOB
Any "out-of-band" data presently available is to be returned. An mbuf is
passed to the protocol module, and the protocol should either place data in
the mbuf or attach new mbufs to the one supplied if there is insufficient
space in the single mbuf. An error may be returned if out-of-band data is not
(yet) available or has already been consumed. The addr parameter contains
any options such as MSG_PEEK to examine data without consuming it.
PRU_SENDOOB
Like PRU_SEND, but for out-of-band data.
PRU_SOCKADDR
The local address of the socket is returned, if any is currently bound to it.
The address (with protocol specific format) is returned in the addr parameter.
PRU_PEERADDR
The address of the peer to which the socket is connected is returned. The
socket must be in an SS_ISCONNECTED state for this request to be made to
the protocol. The address format (protocol specific) is returned in the addr
parameter.
PRU_CONNECT2
The protocol module is supplied two sockets and requested to establish a
connection between the two without binding any addresses, if possible. This
call is used in implementing the socketpair(2) system call.
The following requests are used internally by the protocol modules and are never
generated by the socket routines. In certain instances, they are handed to the
pr_usrreq routine solely for convenience in tracing a protocol's operation
(e.g. PRU_SLOWTIMO).
PRU_FASTTIMO
A "fast timeout" has occurred. This request is made when a timeout occurs
in the protocol's pr_fasttimo routine. The addr parameter indicates
which timer expired.
PRU_SLOWTIMO
A "slow timeout" has occurred. This request is made when a timeout occurs
in the protocol's pr_slowtimo() routine. The addr parameter indicates
which timer expired.
PRU_PROTORCV
This request is used in the protocol-protocol interface, not by the socket routines. It
requests reception of data destined for the protocol and not the user. No protocols currently use this facility.
PRU_PROTOSEND
This request allows a protocol to send data destined for another protocol
module, not a user. The details of how data is marked "addressed to protocol" instead of "addressed to user" are left to the protocol modules. No protocols currently use this facility.
12.4. Protocol to Protocol Interface
The interface between protocol modules is through the pr_usrreq(),
pr_input(), pr_output(), pr_ctlinput(), and pr_ctloutput() routines. The calling conventions for all but the pr_usrreq() routine are
expected to be specific to the protocol modules and are not guaranteed to be consistent across protocol families. We will examine the conventions used for some
of the Internet protocols in this section as an example.
The Internet protocol UDP uses the convention,
error = udp_output(inp, m);
int error;
struct inpcb *inp;
struct mbuf *m;
where the inp, "internet protocol control block", passed between modules conveys per-connection state information, and the mbuf chain contains the data to be
sent. UDP performs consistency checks, appends its header, calculates a checksum, etc. before passing the packet on. UDP is based on the Internet Protocol, IP
[Postel81a], as its transport. UDP passes a packet to the IP module for output as
follows:
error = ip_output(m, opt, ro, flags);
int error;
struct mbuf *m, *opt;
struct route *ro; int flags;
The call to IP's output routine is more complicated than that for UDP, as befits
the additional work the IP module must do. The m parameter is the data to be
sent, and the opt parameter is an optional list of IP options which should be
placed in the IP packet header. The ro parameter is used in making routing decisions (and passing them back to the caller for use in subsequent calls). The final
parameter, flags, contains flags indicating whether the user is allowed to
transmit a broadcast packet and if routing is to be performed. The broadcast flag
may be inconsequential if the underlying hardware does not support the notion of
broadcasting.
All output routines return 0 on success and a UNIX error number if a failure
occurred that could be detected immediately (no buffer space available, no route
to destination, etc.).
Both UDP and TCP use the following calling convention,
(void) (*protosw[].pr_input)(m, ifp);
struct mbuf *m;
struct ifnet *ifp;
Each mbuf list passed is a single packet to be processed by the protocol module.
The interface from which the packet was received is passed as the second parameter.
The IP input routine is a software interrupt level routine, and so is not called with
any parameters. It instead communicates with network interfaces through a
queue, ipintrq, which is identical in structure to the queues used by the network interfaces for storing packets awaiting transmission. The software interrupt
is enabled by the network interfaces when they place input data on the input
queue.
pr_ctlinput()
This routine is used to convey "control" information to a protocol module (i.e.
information that might be passed to the user, but is not data).
The common calling convention for this routine is,
(void) (*protosw[].pr_ctlinput)(req, addr);
int req;
struct sockaddr *addr;
The req parameter is one of the following,
#define PRC_IFDOWN              0   /* interface transition */
#define PRC_ROUTEDEAD           1   /* select new route if possible */
#define PRC_QUENCH              4   /* some said to slow down */
#define PRC_MSGSIZE             5   /* message size forced drop */
#define PRC_HOSTDEAD            6   /* normally from IMP */
#define PRC_HOSTUNREACH         7   /* ditto */
#define PRC_UNREACH_NET         8   /* no route to network */
#define PRC_UNREACH_HOST        9   /* no route to host */
#define PRC_UNREACH_PROTOCOL   10   /* dst says bad protocol */
#define PRC_UNREACH_PORT       11   /* bad port # */
#define PRC_UNREACH_NEEDFRAG   12   /* IP_DF caused drop */
#define PRC_UNREACH_SRCFAIL    13   /* source route failed */
#define PRC_REDIRECT_NET       14   /* net routing redirect */
#define PRC_REDIRECT_HOST      15   /* host routing redirect */
#define PRC_REDIRECT_TOSNET    16   /* redirect for type & net */
#define PRC_REDIRECT_TOSHOST   17   /* redirect for tos & host */
#define PRC_TIMXCEED_INTRANS   18   /* packet expired in transit */
#define PRC_TIMXCEED_REASS     19   /* lifetime expired on reass q */
#define PRC_PARAMPROB          20   /* header incorrect */
while the addr parameter is the address to which the condition applies. Many of
the requests have obviously been derived from ICMP (the Internet Control Message Protocol [Postel81c]), and from error messages defined in the 1822
host/IMP convention [BBN78]. Mapping tables exist to convert control requests
to UNIX error codes that are delivered to a user.
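Such a mapping table might be sketched as below. The table contents and the name prc_to_errno are assumptions made for illustration, not the kernel's actual mapping table; the point is only the shape: an array indexed by control request, with zero meaning no error is delivered to the user.

```c
#include <assert.h>
#include <errno.h>

#define PRC_ROUTEDEAD       1
#define PRC_UNREACH_NET     8
#define PRC_UNREACH_HOST    9
#define PRC_UNREACH_PORT   11
#define PRC_PARAMPROB      20
#define PRC_NCMDS          21

/* 0 means "no error delivered to the user for this request". */
int ctlerrmap[PRC_NCMDS] = {
    [PRC_ROUTEDEAD]    = EHOSTUNREACH,
    [PRC_UNREACH_NET]  = ENETUNREACH,
    [PRC_UNREACH_HOST] = EHOSTUNREACH,
    [PRC_UNREACH_PORT] = ECONNREFUSED,
    [PRC_PARAMPROB]    = EPROTO,
};

/* Convert a pr_ctlinput() request code into a UNIX error number. */
int prc_to_errno(int req)
{
    if (req < 0 || req >= PRC_NCMDS)
        return EINVAL;
    return ctlerrmap[req];
}
```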
pr_ctloutput()
This is the routine that implements per-socket options at the protocol level for
getsockopt() and setsockopt(). The calling convention is,
error = (*protosw[].pr_ctloutput)(op, so, level, optname, mp);
int op;
struct socket *so;
int level, optname;
struct mbuf **mp;
where op is one of PRCO_SETOPT or PRCO_GETOPT, so is the socket whence
the call originated, and level and optname are the protocol level and option
name supplied by the user. The results of a PRCO_GETOPT call are returned in
an mbuf whose address is placed in mp before return. On a PRCO_SETOPT call,
mp contains the address of an mbuf containing the option data; the mbuf should
be freed before return.
12.5. Protocol/Network-Interface Interface
The lowest layer in the set of protocols that comprise a protocol family must
interface itself to one or more network interfaces in order to transmit and receive
packets. It is assumed that any routing decisions have been made before handing
a packet to a network interface; in fact this is absolutely necessary in order to
locate any interface at all (unless, of course, one uses a single "hardwired" interface). There are two cases with which to be concerned, transmission of a packet
and receipt of a packet; each will be considered separately.
Packet Transmission
Assuming a protocol has a handle on an interface, ifp, a (struct ifnet *), it
transmits a fully formatted packet with the following call,
error = (*ifp->if_output)(ifp, m, dst);
int error;
struct ifnet *ifp;
struct mbuf *m;
struct sockaddr *dst;
The output routine for the network interface transmits the packet m to the dst
address, or returns an error indication (a UNIX error number). In reality
transmission may not be immediate or successful; normally the output routine
simply queues the packet on its send queue and primes an interrupt driven routine to actually transmit the packet. For unreliable media, such as the Ethernet,
"successful" transmission simply means that the packet has been placed on the
cable without a collision. On the other hand, an 1822 interface guarantees proper
delivery or an error indication for each message transmitted. The model
employed in the networking system attaches no promises of delivery to the packets handed to a network interface, and thus corresponds more closely to the Ethernet. Errors returned by the output routine are only those that can be detected
immediately, and are normally trivial in nature (no buffer space, address format
not handled, etc.). No indication is received if errors are detected after the call
has returned.
Packet Reception
Each protocol family must have one or more "lowest level" protocols. These
protocols deal with internetwork addressing and are responsible for the delivery
of incoming packets to the proper protocol processing modules. In the PUP
model [Boggs78] these protocols are termed Level 1 protocols; in the ISO model,
network layer protocols. In this system each such protocol module has an input
packet queue assigned to it. Incoming packets received by a network interface
are queued for the protocol module, and a software interrupt is posted to initiate
processing.
Three macros are available for queuing and dequeuing packets:
IF_ENQUEUE(ifq, m)
This places the packet m at the tail of the queue ifq.
IF_DEQUEUE(ifq, m)
This places a pointer to the packet at the head of queue ifq in m and removes
the packet from the queue. A zero value will be returned in m if the queue is
empty.
IF_DEQUEUEIF(ifq, m, ifp)
Like IF_DEQUEUE, this removes the next packet from the head of a queue
and returns it in m. A pointer to the interface on which the packet was
received is placed in ifp, a (struct ifnet *).
IF_PREPEND(ifq, m)
This places the packet m at the head of the queue ifq.
Each queue has a maximum length associated with it as a simple form of congestion control. The macro IF_QFULL(ifq) returns 1 if the queue is filled, in
which case the macro IF_DROP(ifq) should be used to increment the count of
the number of packets dropped, and the offending packet is dropped. For example, the following code fragment is commonly found in a network interface's
input routine,
if (IF_QFULL(inq)) {
        IF_DROP(inq);
        m_freem(m);
} else
        IF_ENQUEUE(inq, m);
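The drop-when-full discipline can be exercised in user space by rendering the IF_* operations as functions over a simple counted queue; the structure layout and function names here are illustrative stand-ins, not the kernel macros themselves.

```c
#include <assert.h>
#include <stddef.h>

#define IFQ_MAXLEN 4

struct pkt { struct pkt *p_next; };
struct ifqueue {
    struct pkt *ifq_head, *ifq_tail;
    int ifq_len, ifq_maxlen, ifq_drops;
};

int  if_qfull(struct ifqueue *q) { return q->ifq_len >= q->ifq_maxlen; }
void if_drop(struct ifqueue *q)  { q->ifq_drops++; }

void if_enqueue(struct ifqueue *q, struct pkt *p)
{
    p->p_next = NULL;
    if (q->ifq_tail) q->ifq_tail->p_next = p; else q->ifq_head = p;
    q->ifq_tail = p;
    q->ifq_len++;
}

/* The idiom from the fragment above: drop when full, else enqueue. */
void if_input(struct ifqueue *q, struct pkt *p)
{
    if (if_qfull(q))
        if_drop(q);              /* caller would also m_freem() the packet */
    else
        if_enqueue(q, p);
}
```

Feeding six packets into a queue limited to four leaves four queued and two counted as drops.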
12.6. Gateways and Routing Issues
The system has been designed with the expectation that it will be used in an
internetwork environment. The "canonical" environment was envisioned to be a
collection of local area networks connected at one or more points through hosts
with multiple network interfaces (one on each local area network), and possibly a
connection to a long haul network (for example, the ARPANET). In such an
environment, issues of gatewaying and packet routing become very important.
Certain of these issues, such as congestion control, have been handled in a
simplistic manner or specifically not addressed. Instead, where possible, the network system attempts to provide simple mechanisms upon which more involved
policies may be implemented. As some of these problems become better understood, the solutions developed will be incorporated into the system.
This section will describe the facilities provided for packet routing. The simplistic mechanisms provided for congestion control are described in the Buffering,
Congestion Control section below.
Routing Tables
The network system maintains a set of routing tables for selecting a network
interface to use in delivering a packet to its destination. These tables are of the
form:
struct rtentry {
        u_long rt_hash;                 /* hash key for lookups */
        struct sockaddr rt_dst;         /* destination net or host */
        struct sockaddr rt_gateway;     /* forwarding agent */
        short rt_flags;                 /* see below */
        short rt_refcnt;                /* # of references to structure */
        u_long rt_use;                  /* packets sent using route */
        struct ifnet *rt_ifp;           /* interface to give packet to */
};
The routing information is organized in two separate tables, one for routes to a
host and one for routes to a network. The distinction between hosts and networks
is necessary so that a single mechanism may be used for both broadcast and
multi-drop type networks, and also for networks built from point-to-point links.
Each table is organized as a hashed set of linked lists. Two 32-bit hash values
are calculated by routines defined for each address family; one based on the destination being a host, and one assuming the target is the network portion of the
address. Each hash value is used to locate a hash chain to search (by taking the
value modulo the hash table size) and the entire 32-bit value is then used as a key
in scanning the list of routes. Lookups are applied first to the routing table for
hosts, then to the routing table for networks. If both lookups fail, a final lookup
is made for a "wildcard" route (by convention, network 0). The first appropriate
route discovered is used. By doing this, routes to a specific host on a network
may be present as well as routes to the network. This also allows a "fall back"
network route to be defined to a "smart" gateway which may then perform more
intelligent routing.
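The lookup order just described can be sketched in C. The hashing, table sizes, and names (rt_scan, rtlookup) are simplified stand-ins for the per-address-family routines; what the sketch preserves is the discipline: hash selects a chain, the full 32-bit value is the scan key, and lookup tries the host table, then the network table, then the wildcard route.

```c
#include <assert.h>
#include <stddef.h>

#define RTHASHSIZ 8

struct rt {
    struct rt *rt_next;
    unsigned long rt_hash;       /* full 32-bit key */
    int rt_ifno;                 /* stand-in for the interface pointer */
};

struct rt *hosthash[RTHASHSIZ];
struct rt *nethash[RTHASHSIZ];

void rt_insert(struct rt **tbl, struct rt *r)
{
    struct rt **chain = &tbl[r->rt_hash % RTHASHSIZ];
    r->rt_next = *chain;         /* most recently installed is found first */
    *chain = r;
}

/* Locate the chain by hash modulo table size, then scan on the full key. */
struct rt *rt_scan(struct rt **tbl, unsigned long key)
{
    struct rt *r;
    for (r = tbl[key % RTHASHSIZ]; r != NULL; r = r->rt_next)
        if (r->rt_hash == key)
            return r;
    return NULL;
}

/* Host lookup, then network, then wildcard (network 0 by convention). */
struct rt *rtlookup(unsigned long hostkey, unsigned long netkey)
{
    struct rt *r;
    if ((r = rt_scan(hosthash, hostkey)) != NULL) return r;
    if ((r = rt_scan(nethash, netkey)) != NULL) return r;
    return rt_scan(nethash, 0);
}
```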
Each routing table entry contains a destination (the desired final destination), a
gateway to which to send the packet, and various flags which indicate the route's
status and type (host or network). A count of the number of packets sent using
the route is kept, along with a count of "held references" to the dynamically allocated structure to ensure that memory reclamation occurs only when the route is
not in use. Finally, a pointer to the network interface is kept; packets sent using
the route should be handed to this interface.
Routes are typed in two ways: either as host or network, and as "direct" or
"indirect". The host/network distinction determines how to compare the
rt_dst field during lookup. If the route is to a network, only a packet's destination network is compared to the rt_dst entry stored in the table. If the route
is to a host, the addresses must match bit for bit.
The distinction between "direct" and "indirect" routes indicates whether the destination is directly connected to the source. This is needed when performing
local network encapsulation. If a packet is destined for a peer at a host or network which is not directly connected to the source, the internetwork packet
header will contain the address of the eventual destination, while the local network header will address the intervening gateway. Should the destination be
directly connected, these addresses are likely to be identical, or a mapping
between the two exists. The RTF_GATEWAY flag indicates that the route is to an
"indirect" gateway agent, and that the local network header should be filled in
from the rt_gateway field instead of from the final internetwork destination
address.
It is assumed that multiple routes to the same destination will not be present;
only one of multiple routes, that most recently installed, will be used.
Routing redirect control messages are used to dynamically modify existing routing table entries as well as dynamically create new routing table entries. On
hosts where exhaustive routing information is too expensive to maintain (e.g.
workstations), the combination of wildcard routing entries and routing redirect
messages can be used to provide a simple routing management scheme without
the use of a higher level policy process. Current connections may be rerouted
after notification of the protocols by means of their pr_ctlinput() entries.
Statistics are kept by the routing table routines on the use of routing redirect messages and their effect on the routing tables. These statistics may be viewed using
netstat(1).
Status information other than routing redirect control messages may be used in
the future, but at present it is ignored. Likewise, more intelligent "metrics"
may be used to describe routes in the future, possibly based on bandwidth and
monetary costs.
Routing Table Interface
A protocol accesses the routing tables through three routines, one to allocate a
route, one to free a route, and one to process a routing redirect control message.
The routine rtalloc () performs route allocation; it is called with a pointer to
the following structure containing the desired destination:
struct route {
struct rtentry *ro_rt;
struct sockaddr ro_dst;
};
The route returned is assumed "held" by the caller until released with an
rtfree () call. Protocols which implement virtual circuits, such as TCP, hold
onto routes for the duration of the circuit's lifetime, while connection-less protocols, such as UDP, allocate and free routes whenever their destination address
changes.
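The "held" reference discipline can be modeled as follows; rtalloc_ref and rtfree_ref are invented names for a simplified user-space sketch, not the kernel routines, but they capture the rule that the entry is reclaimed only when no holder remains.

```c
#include <assert.h>
#include <stdlib.h>

struct rtentry {
    short rt_refcnt;             /* # of holders of this route */
    unsigned long rt_use;        /* packets sent using this route */
};

/* "Allocate" a route: the caller now holds a reference. */
struct rtentry *rtalloc_ref(struct rtentry *rt)
{
    rt->rt_refcnt++;
    return rt;
}

/* Release a reference; returns 1 if the entry was reclaimed, 0 if still held. */
int rtfree_ref(struct rtentry *rt)
{
    rt->rt_refcnt--;
    if (rt->rt_refcnt <= 0) {
        free(rt);
        return 1;
    }
    return 0;
}
```

This mirrors the usage pattern in the text: a virtual circuit protocol holds its reference for the circuit's lifetime, while a datagram protocol takes and drops references as its destination changes.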
The routine rtredirect() is called to process a routing redirect control message. It is called with a destination address, the new gateway to that destination,
and the source of the redirect. Redirects are accepted only from the current
router for the destination. If a non-wildcard route exists to the destination, the
gateway entry in the route is modified to point at the new gateway supplied. Otherwise, a new routing table entry is inserted reflecting the information supplied.
Routes to interfaces and routes to gateways which are not directly accessible
from the host are ignored.
User Level Routing Policies
Routing policies implemented in user processes manipulate the kernel routing
tables through two ioctl() calls. The commands SIOCADDRT and SIOCDELRT add and delete routing entries, respectively; the tables are read through
the /dev/kmem device. The decision to place policy decisions in a user process
implies that routing table updates may lag a bit behind the identification of new
routes, or the failure of existing routes, but this period of instability is normally
very small with proper implementation of the routing process. Advisory information, such as ICMP error messages and IMP diagnostic messages, may be read
from raw sockets (described in the next section).
Several routing policy processes have already been implemented. The system
standard "routing daemon" uses a variant of the Xerox NS Routing Information
Protocol [Xerox82] to maintain up-to-date routing tables in our local environment. Interaction with other existing routing protocols, such as the Internet EGP
(Exterior Gateway Protocol), has been accomplished using a similar process.
12.7. Raw Sockets
A raw socket is an object that allows users direct access to a lower-level protocol.
Raw sockets are intended for knowledgeable processes that wish to take advantage of some protocol feature not directly accessible through the normal interface, or for the development of new protocols built atop existing lower level protocols. For example, a new version of TCP might be developed at the user level
by utilizing a raw IP socket for delivery of packets. The raw IP socket interface
attempts to provide an identical interface to the one a protocol would have if it
were resident in the kernel.
The raw socket support is built around a generic raw socket interface, (possibly)
augmented by protocol-specific processing routines. This section will describe
the core of the raw socket interface.
Control Blocks
Every raw socket has a protocol control block of the following form:
struct rawcb {
        struct rawcb *rcb_next;         /* doubly linked list */
        struct rawcb *rcb_prev;
        struct socket *rcb_socket;      /* back pointer to socket */
        struct sockaddr rcb_faddr;      /* destination address */
        struct sockaddr rcb_laddr;      /* socket's address */
        struct sockproto rcb_proto;     /* protocol family, protocol */
        caddr_t rcb_pcb;                /* protocol specific stuff */
        struct mbuf *rcb_options;       /* protocol specific options */
        struct route rcb_route;         /* routing information */
        int rcb_cc;                     /* bytes of rawintr queued data */
        int rcb_mbcnt;                  /* bytes of rawintr queued mbufs */
        short rcb_flags;
};
All the control blocks are kept on a doubly linked list for performing lookups
during packet dispatch. Associations may be recorded in the control block and
used by the output routine in preparing packets for transmission. The
rcb_proto structure contains the protocol family and protocol number with
which the raw socket is associated. The protocol, family, and addresses are used
to filter packets on input; this will be described in more detail shortly. If any
protocol-specific information is required, it may be attached to the control block
using the rcb_pcb field. Protocol-specific options for transmission in outgoing
packets may be stored in rcb_options. rcb_cc and rcb_mbcnt are used
to keep track of the resources consumed by the raw socket.
A raw socket interface is datagram oriented. That is, each send or receive on the
socket requires a destination address. This address may be supplied by the user
or stored in the control block and automatically installed in the outgoing packet
by the output routine. Since it is not possible to determine whether an address is
present or not in the control block, two flags, RAW_LADDR and RAW_FADDR,
indicate if a local and foreign address are present. Routing is expected to be performed by the underlying protocol if necessary.
Input Processing
Input packets are "assigned" to raw sockets based on a simple pattern matching
scheme. Each network interface or protocol gives unassigned packets to the raw
input routine with the call:
raw_input(m, proto, src, dst)
struct mbuf *m;
struct sockproto *proto;
struct sockaddr *src, *dst;
The data packet then has a generic header prepended to it of the form
struct raw_header {
        struct sockproto raw_proto;
        struct sockaddr raw_dst;
        struct sockaddr raw_src;
};
and it is placed in a packet queue for the "raw input protocol" module. Packets
taken from this queue are copied into any raw sockets that match the header
according to the following rules,
1) The protocol family of the socket and header agree.
2) If the protocol number in the socket is non-zero, then it agrees with that
found in the packet header.
3) If a local address is defined for the socket, the address format of the local
address is the same as the destination address's and the two addresses agree
bit for bit.
4) The rules of 3) are applied to the socket's foreign address and the packet's
source address.
A basic assumption is that addresses present in the control block and packet
header (as constructed by the network interface and any raw input protocol
module) are in a canonical form that may be "block compared".
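The four rules can be expressed directly in C. Addresses are modeled here as fixed-size byte arrays so they can be "block compared" with memcmp, as the text assumes; the structure layouts and the RAW_LADDR/RAW_FADDR presence flags are simplified stand-ins for the kernel's definitions.

```c
#include <assert.h>
#include <string.h>

#define ADDRLEN 4

struct sockproto { int sp_family, sp_protocol; };
struct addr      { int a_family; unsigned char a_data[ADDRLEN]; };

struct rawcb {
    struct sockproto rcb_proto;
    struct addr rcb_laddr, rcb_faddr;
    int rcb_has_laddr, rcb_has_faddr;   /* RAW_LADDR / RAW_FADDR flags */
};

int addr_match(const struct addr *a, const struct addr *b)
{
    return a->a_family == b->a_family &&
           memcmp(a->a_data, b->a_data, ADDRLEN) == 0;
}

/* Returns 1 if the packet should be copied to this raw socket. */
int raw_match(const struct rawcb *rp, const struct sockproto *proto,
              const struct addr *src, const struct addr *dst)
{
    if (rp->rcb_proto.sp_family != proto->sp_family)
        return 0;                                /* rule 1 */
    if (rp->rcb_proto.sp_protocol != 0 &&
        rp->rcb_proto.sp_protocol != proto->sp_protocol)
        return 0;                                /* rule 2 */
    if (rp->rcb_has_laddr && !addr_match(&rp->rcb_laddr, dst))
        return 0;                                /* rule 3 */
    if (rp->rcb_has_faddr && !addr_match(&rp->rcb_faddr, src))
        return 0;                                /* rule 4 */
    return 1;
}
```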
Output Processing
On output the raw pr_usrreq() routine passes the packet and a pointer to the
raw control block to the raw protocol output routine for any processing required
before it is delivered to the appropriate network interface. The output routine is
normally the only code required to implement a raw socket interface.
12.8. Buffering, Congestion Control
One of the major factors in the performance of a protocol is the buffering policy
used. Lack of a proper buffering policy can force packets to be dropped, cause
falsified windowing information to be emitted by protocols, fragment host
memory, degrade the overall host performance, etc. Due to problems such as
these, most systems allocate a fixed pool of memory to the networking system
and impose a policy optimized for "normal" network operation.
The networking system developed for UNIX is little different in this respect. At
boot time a fixed amount of memory is allocated by the networking system. At
later times more system memory may be requested as the need arises, but at no
time is memory ever returned to the system. It is possible to garbage collect
memory from the network, but difficult. In order to perform this garbage collection properly, some portion of the network will have to be "turned off" as data
structures are updated. The interval over which this occurs must be kept small compared to the average inter-packet arrival time, or too much traffic may be lost,
impacting other hosts on the network, as well as increasing load on the interconnecting media. In our environment we have not experienced a need for such
compaction, and thus have left the problem unresolved.
The mbuf structure was introduced in the Memory, Addressing section, above. In
this section a brief description will be given of the allocation mechanisms, and
policies used by the protocols in performing connection level buffering.
Memory Management
The basic memory allocation routines manage a private page map, the size of
which determines the maximum amount of memory that may be allocated by the
network. A small amount of memory is allocated at boot time to initialize the
mbuf and mbuf cluster free lists. When the free lists are exhausted, more
memory is requested from the system memory allocator if space remains in the
map. If memory cannot be allocated, callers may block awaiting free memory, or
the failure may be reflected to the caller immediately. The allocator will not
block awaiting free map entries, however, as exhaustion of the resource map usually indicates that buffers have been lost due to a "leak." An array of reference
counts parallels the cluster pool and is used when multiple references to a cluster
are present.
64 mbufs fit into an 8 Kbyte page of memory. Data can be placed into an mbuf by
copying, or, better, the memory that contains that data can be treated as a temporary ("loaned") mbuf. This second alternative is far more efficient than an
actual copy.
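The difference between copying and loaning can be shown with a toy model; the structure and names here are invented for illustration and are much simpler than a real mbuf. A copied mbuf owns its bytes, while a loaned mbuf merely points at the caller's storage, so later changes to that storage are visible through it.

```c
#include <assert.h>
#include <string.h>

#define MLEN 112

struct toy_mbuf {
    char  m_dat[MLEN];  /* internal storage, used when copying */
    char *m_ext;        /* loaned external storage, if non-null */
    int   m_len;
};

/* Fill the mbuf by copying: pays for a data copy, but owns the bytes. */
void mbuf_copy_in(struct toy_mbuf *m, char *data, int len)
{
    memcpy(m->m_dat, data, (size_t)len);
    m->m_ext = NULL;
    m->m_len = len;
}

/* Fill the mbuf by loaning: no copy, the storage is shared. */
void mbuf_loan(struct toy_mbuf *m, char *data, int len)
{
    m->m_ext = data;
    m->m_len = len;
}

char *mbuf_data(struct toy_mbuf *m)
{
    return m->m_ext ? m->m_ext : m->m_dat;
}
```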
Protocol Buffering Policies
Protocols reserve fixed amounts of buffering for send and receive queues at
socket creation time. These amounts define the high and low water marks used
by the socket routines in deciding when to block and unblock a process. The
reservation of space does not currently result in any action by the memory
management routines.
Protocols that provide connection level flow control do this based on the amount
of space in the associated socket queues. That is, send windows are calculated
based on the amount of free space in the socket's receive queue, while receive
windows are adjusted based on the amount of data awaiting transmission in the
send queue. Care has been taken to avoid the "silly window syndrome"
described in [Clark82] at both the sending and receiving ends.
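The receive-window side of that calculation can be sketched as follows. The field names follow the familiar high-water-mark/character-count convention, but the code is an illustrative sketch, not the kernel's window computation (which in TCP also involves silly-window-syndrome avoidance, omitted here).

```c
#include <assert.h>

struct sockbuf {
    unsigned long sb_hiwat;  /* reserved buffering (high water mark) */
    unsigned long sb_cc;     /* bytes currently queued */
};

/* Window offered to the peer: free space in the receive queue. */
long recv_window(const struct sockbuf *rcv)
{
    long space = (long)rcv->sb_hiwat - (long)rcv->sb_cc;
    return space > 0 ? space : 0;
}
```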
Queue Limiting
Incoming packets from the network are always received unless memory allocation fails. However, each Level 1 protocol input queue has an upper bound on
the queue's length, and any packets exceeding that bound are discarded. It is
possible for a host to be overwhelmed by excessive network traffic (for instance a
host acting as a gateway from a high bandwidth network to a low bandwidth network). As a "defensive" mechanism the queue limits may be adjusted to throttle
network traffic load on a host. Consider a host willing to devote some percentage
of its machine to handling network traffic. If the cost of handling an incoming
packet can be calculated so that an acceptable "packet handling rate" can be
determined, then input queue lengths may be dynamically adjusted based on a
host's network load and the number of packets awaiting processing. Obviously,
discarding packets is not a satisfactory solution to a problem such as this (simply
dropping packets is likely to increase the load on a network); the queue lengths
were incorporated mainly as a safeguard mechanism.
Packet Forwarding
When packets cannot be forwarded because of memory limitations, the system
attempts to generate a "source quench" message. In addition, any other problems
encountered during packet forwarding are also reflected back to the sender in the
form of ICMP packets. This helps hosts avoid unneeded retransmissions.
Broadcast packets are never forwarded due to possible dire consequences. In an
early stage of network development, broadcast packets were forwarded and a
"routing loop" resulted in network saturation and every host on the network
crashing.
12.9. Out of Band Data
Out of band data is a facility peculiar to the stream socket abstraction defined here.
Little agreement appears to exist as to what its semantics should be. TCP defines
the notion of "urgent data" as in-line, while the NBS protocols [Burruss81] and
numerous others provide a fully independent logical transmission channel along
which out of band data is to be sent. In addition, the amount of the data which
may be sent as an out of band message varies from protocol to protocol; everything from 1 bit to 16 bytes or more.
A stream socket's notion of out of band data has been defined as the lowest reasonable common denominator (at least reasonable in our minds); clearly this is
subject to debate. Out of band data is expected to be transmitted out of the normal sequencing and flow control constraints of the data stream. A minimum of 1
byte of out of band data and one outstanding out of band message are expected to
be supported by the protocol supporting a stream socket. It is a protocol's prerogative to support larger-sized messages, or more than one outstanding out of
band message at a time.
Out of band data is maintained by the protocol and is usually not stored in the
socket's receive queue. A socket-level option, SO_OOBINLINE, is provided to
force out-of-band data to be placed in the normal receive queue when urgent data
is received; this sometimes ameliorates problems due to loss of data when multiple out-of-band segments are received before the first has been passed to the user.
The PRU_SENDOOB and PRU_RCVOOB requests to the pr_usrreq() routine
are used in sending and receiving data.
12.10. Acknowledgements
The internal structure of the system is patterned after the Xerox PUP architecture
[Boggs79], while in certain places the Internet protocol family has had a great
deal of influence in the design. The use of software interrupts for process invocation is based on similar facilities found in the VMS operating system. Many of
the ideas related to protocol modularity, memory management, and network
interfaces are based on Rob Gurwitz's TCP/IP implementation for the 4.1BSD
version of the UNIX system [Gurwitz81].
12.11. References
[Boggs79]
Boggs, D. R., J. F. Shoch, E. A. Taft, and R. M. Metcalfe; PUP:
An Internetwork Architecture. Report CSL-79-10. XEROX Palo
Alto Research Center, July 1979.
[BBN78]
Bolt Beranek and Newman; Specification for the Interconnection
of Host and IMP. BBN Technical Report 1822. May 1978.
[Cerf78]
Cerf, V. G.; The Catenet Model for Internetworking. Internet
Working Group, IEN 48. July 1978.
[Clark82]
Clark, D. D.; Window and Acknowledgement Strategy in TCP,
RFC-813. Network Information Center, SRI International. July
1982.
[Gurwitz81]
Gurwitz, R. F.; VAX-UNIX Networking Support Project: Implementation Description. Internetwork Working Group, IEN
168. January 1981.
[ISO81]
International Organization for Standardization. ISO Open Systems
Interconnection - Basic Reference Model. ISO/TC 97/SC 16 N
719. August 1981.
[Joy86]
Joy, W.; Fabry, R.; Leffler, S.; McKusick, M.; and Karels, M.;
Berkeley Software Architecture Manual, 4.3BSD Edition. UNIX
Programmer's Supplementary Documents, Vol. 1 (PS1:6). Computer Systems Research Group, University of California, Berkeley. May 1986.
[Leffler84]
Leffler, S. J. and Karels, M. J.; Trailer Encapsulations, RFC-893.
Network Information Center, SRI International. April 1984.
[Postel80]
Postel, J. User Datagram Protocol, RFC-768. Network Information Center, SRI International. May 1980.
[Postel81a]
Postel, J., ed. Internet Protocol, RFC-791. Network Information
Center, SRI International. September 1981.
[Postel81b]
Postel, J., ed. Transmission Control Protocol, RFC-793. Network
Information Center, SRI International. September 1981.
[Postel81c]
Postel, J. Internet Control Message Protocol, RFC-792. Network
Information Center, SRI International. September 1981.
[Xerox81]
Xerox Corporation. Internet Transport Protocols. Xerox System
Integration Standard 028112. December 1981.
[Zimmermann80]
Zimmermann, H. OSI Reference Model - The ISO Model of
Architecture for Open Systems Interconnection. IEEE Transactions on Communications. COM-28(4): 425-432. April 1980.
Index
4
4.2BSD networking, 251
4.3BSD networking, 252
A
accept(), 297
address
binding, 309
families, 318
variable length, 318
wildcard, 309
administering
networks, 9
servers, 8
administering a server, 8
administration
of RPC, 38
advisory locks, 29
AF - address format, 281
ALRM, 99
application, 1
arbitrary data types, 69
Arpanet, 338
assigning program numbers, 37
authentication, 16, 86, 310
authunix_create_default(), 87
authunix_parms, 88
B
batching, 82
bind(), 311
binding local names, 282
bool_t, 109
broadcast RPC, 56, 81
synopsis, 82
broadcasting, 311
buffering, 343
byte swapping, 293
C
caller process, 2, 33
callrpc(), 65, 67, 68, 75, 76, 78
changing passwords, 26
CLIENT, 76
client handle, used by rpcgen, 46
client machines, 1
client/server model, 294
clients and servers, 21
clnt_broadcast(), 82
clnt_call(), 76
clnt_control(), 56
clnt_create(), 55
clnt_destroy(), 76
clnttcp_create(), 76
clntudp_create(), 74
communications domains, 278
computing environments, 5
congestion control, 343
connect(), 275
connection
errors, 284
establishment, 283
connectionless
servers, 299
sockets, 286
control blocks, 341
D
data transfer, 285
datagram
Internet domain, 263, 264
socket, 281
UNIX domain, 260
vs streams, 278
dbm(), 23, 25
debugging with rpcgen, 51
define statements, 53
direction of XDR operations, 120
domainname command, 23, 24
domains and protocols, 258
E
eachresult(), 82
enum clnt_stat (in RPC programming), 68
/etc/ethers, 25
/etc/exports, 8
/etc/group, 25
/etc/hosts, 25, 26
/etc/mount, 6
/etc/netgroup, 25
-347-
/etc/networks, 25
/etc/passwd, 25, 26
/etc/protocols, 25
/etc/services, 25
ether, 67
EXAMPLEPROG, 99
export a filesystem, 7
exports, 8
extensible design, 8
External Data Representation, 129, 168
F
fcntl(), 27
FD_CLR(), 288
FD_SET(), 288
FILE, 120
filehandle, 12
filesystem
data, 1
exportation, 7
model, 169
operations, 1
flock(), 26
fscanf(), 86
ftp, 17
futures, RPC, 101
G
gateways issues, 338
getdomainname(), 24
getgrent(), 25
gethostbyaddr(), 26, 291
gethostbyname(), 26, 291
gethostent(), 25
gethostname(), 24
getpeername(), 315
getprotobyname, 292
getprotobynumber, 292
getprotoent, 292
getpwent(), 25
getsockopt(), 314
gettransient(), 98, 99
H
heterogeneity of machines, 8
high-water mark, 325
host names, 291
hostname, 24
htonl(), 124
I
I/O multiplexing, 288
I/O requests, asynchronous notification, 305
inet_ntoa(), 291
using inetd, 92, 295, 315
inode, 10
input processing, 342
Inter-Process Communication, 187
interface flags, 312
Internet, 278
Domain Stream Connection, 266
interrupt-driven socket I/O, 304
ip_output(), 335
IPC
access rights, 277
address representation, 318
basics, 280
C run-time routines, 293
connection, 266
gather, 277
implementation notes, 317
internal layering, 322
Internet domain datagrams, 263
Internet Domain Stream Connection, 269
library routines, 290
memory addressing, 318
memory management, 319
multiplexing, 288
pipe, 253
pipes, 253
processes, 252
read(), 275
receive queues, 324
recv(), 275
scatter, 277
select(), 270
send queues, 324
socket layer, 323
socket naming, 280
socketpair, 256
socketpairs, 256
sockets, 256
UNIX domain, 280
UNIX domain datagrams, 260
UNIX domain stream connection, 272, 273
write(), 275
IPPROTO_UDP, 74
L
layers of RPC, 65
libc.a, 103
library primitives for XDR, 109
library routines, 290
librpcsvc.a, 66
listen(), 284
local names, 282
Lock Manager, 26
crashing, 27
protocol, 29
state, 27
lockf(), 27
locking, 2
locks, advisory, 29
long, 92
low-water mark, 325
lower layers of RPC, 72
-348-
M
m_adj(), 322
m_cat(), 322
m_copy(), 322
m_cpytoc(), 322
m_free(), 321
m_get(), 321
m_pullup(), 322
makedbm, 25
malloc(), 33, 65
master and slave, 24
mbuf, 320
mcldup(), 321
MCLGET(), 321
mclgetx(), 321
memory allocation with XDR, 77
memory management, 343
MFREE(), 321
MGET(), 321
miscellaneous RPC features, 80
mount, 6, 67
NFS, 12
NFS servers, 13
mount data types, 184
dirpath, 184
fhandle, 184
fhstatus, 184
name, 185
mount protocol, 183, 14
basic data types, 184
introduction, 183
RPC information, 184
XDR structure sizes, 184
mount server procedures, 185
MNTPROC_DUMP(), 185
MNTPROC_EXPORT(), 186
MNTPROC_MNT(), 185
MNTPROC_NULL(), 185
MNTPROC_UMNT(), 186
MNTPROC_UMNTALL(), 186
mounting a remote filesystem, 6
MSG_OOB, 302
MSG_PEEK, 302
msghdr, 277
mtom(), 321
multiplexing, 288
N
name binding, 282
names
host, 291
network, 291
protocol, 292
netstat(), 321
network
administration, 9
computing environments, 5
configuration, 311
major services, 2
names, 291
services, 1
Network File System, 168, 1, 4
version-2 protocol specification, 168
Network Information Service, 22, 2
clients, 25
data storage, 25
default files, 25
domain, 23
explained, 23
hosts database, 26
maps, 23
naming, 24
password database, 26
servers, 25
Network Lock Manager, 26, 2
Network Status Monitor, 30, 3, 27
network-interface layer, 328
NFS, 168, 1, 2, 4
administration, 9
architecture, 8
basic data types, 170
different machines, 8
different operating systems, 8
example usage, 6
extensibility, 8
filesystem example, 7
implementation, 181
Interface, 12
introduction, 168
mount servers, 13
pathname interpretation, 182
pathnames, 12
performance, 9
permission issues, 182
protocol, 9, 169
protocol definition, 169
reliability, 9
RPC information, 170
server/client relationship, 182
setting RPC parameters, 183
special files, 16
stateful devices, 16
stateless protocol, 9, 15
Sun implementation, 10
the mount protocol, 12
transparencies, 11
transparent access, 8
version-2 protocol specification, 168
NFS data types, 170
attrstat, 175
diropargs, 175
diropres, 175
fattr, 173
fhandle, 172
filename, 174
ftype, 172
path, 174
sattr, 174
stat, 171
timeval, 173
NFS server procedures, 175
NFSPROC_CREATE(), 178
-349-
NFS server procedures, continued
NFSPROC_GETATTR(), 176
NFSPROC_LINK(), 179
NFSPROC_LOOKUP(), 177
NFSPROC_MKDIR(), 180
NFSPROC_NULL(), 176
NFSPROC_READ, 177
NFSPROC_READDIR(), 180
NFSPROC_READLINK(), 177
NFSPROC_REMOVE(), 178
NFSPROC_RENAME(), 179
NFSPROC_RMDIR(), 180
NFSPROC_ROOT, 177
NFSPROC_SETATTR(), 176
NFSPROC_STATFS(), 181
NFSPROC_SYMLINK(), 179
NFSPROC_WRITE(), 178
NFSPROC_WRITECACHE(), 178
NIS, 3
non-blocking sockets, 304
ntohl(), 124
NULLPROC, 74, 89
O
OSI model, 317
out of band data, 302, 344
output processing, 343
P
packet
forwarding, 344
reception, 337
transmission, 337
passwd, 26
passwords, changing, 26
performance, 10
pipe semantics, 280
pmap_set(), 98
pmap_unset(), 74
pointer semantics and XDR, 119
port allocation, 310
portability, 301
porting
SPARC, 293
Sun-4,293
portmapper, 21, 2
page registration, 21
typical mapping sequence, 22
pr_ctlinput(),336
pr_ctloutput (), 336
pr_input (), 335
process groups, 305
PROG,92
program number assignment, 37
PROGVERS, 92
PROGVERS_ORIG, 92
protocol
buffering policies, 343
families, 258
layers, 326
names, 292
protocol, continued
to network interface, 337
to protocol interface, 334
protocols, selecting specific, 308
pseudo terminals, 306
Q
queue limiting, 344
R
raw sockets, 281, 341
rcp, 17,95
rcv, 95
recv (), 275
recvfrorn (), 275, 287
recvrnsg () , 276, 277
registerrpc (), 65,67,69
reliability, 9
Remote File Sharing (RFS), 27
remote mounting, 6
Remote Procedure Call, 2, 33, 168
resolver (), 291
REX, 1,3
RFS, 27
rlogin,17
rnusers (), 66
routing
issues, 338
routing table interface, 340
tables, 338
user-level policies, 341
RPC, 2, 33, 41
administration, 38
an advanced example, 47
authentication, 86, 89
batching, 82
broadcast, 81
broadcast synopsis, 82
built-in routines, 69
callback procedures, 98
calling side, 75
DES, 89
futures, 101
generating XDR routines, 47
guarantees, 88
introduction, 33
layers, 65
library based services, 66
lower layers, 72
miscellaneous features, 80
select ( ) , 80
server side, 73
simplified interface, 67
The Highest Layer, 65
The Lowest Layer, 66
The Middle Layer, 65
The Simplified Layer, 65
versions on client side, 94
versions on server side, 92
RPC library based services, 66
RPC Programming Guide, 65
-350-
RPC Services, 66
[email protected] sun. com, 38
RPC_ANYSOCK,74
RPC_TIMEDOUT, 82
rpcgen, 41, 65, 67
broadcast RPC, 56
C-preprocessor,52
client authentication, 56
client programming, 55
constants, 59
declarations, 59
definitions, 58
dispatch tables, 54
enumerations, 58
local procedures, 42
network types, 53
other operations, 55
programming notes, 53
programs, 61
remote procedures, 42
RPC Language, 58
server programming, 56
special cases, 62
structures, 60
timeout changes, 55
typedef,59
unions, 60
rpcgen Inetd support, 54
RPCL,58
rq_clntcred, 88
rq_cred, 87, 88
rq_cred.oa_flavor, 88
rquota,67
rsh,17
ruptime, 299
RUSERSPROC_ BOOL, 74
RUSERSPROG,74
RUSERSVERS,74
RUSERSVERS_SHORT,92
S
select(), 80, 81, 270, 288, 308
connection, 270
send(), 275
sendmsg(), 276, 277
sendto(), 275, 313
sequenced packet socket, 281
server machines, 1
server process, 2, 33
servers
administration, 8
and clients, 21
connectionless, 299
network services, 1
stateless, 169
setsockopt(), 311, 314
SIGCHLD, 306
signal(), 275
signals, 252
and process groups, 305
SIGURG, 275
sigvec(), 275
simplified interface of RPC, 67
SIZE, 77
sizeof(), 113
slave and master, 24
snd, 95
sockaddr, 318
socket
connection queuing, 326
connectionless, 286
creation, 281
data queues, 325
datagram, 281, 286
discarding, 286
failure, 282
flags, 286
ioctl(), 333
non-blocking, 304
options, 314
raw, 281, 341
sequenced packet, 281
state, 324
stream, 280
to protocol interface, 331
types, 280
Socket-based IPC, 317
Socket-Based IPC
advanced tutorial, 279
tutorial, 251
SPARC
alignment restrictions, 293
porting, 293
spray, 67
statd, 30
stateful services, 27
stateless servers, 169
statelessness of NFS, 15
Status Monitor, 30, 27
stream connection
accepting, 273
initiating, 272
Internet domain, 266
stream implementation in XDR, 123
stream sockets, 280
streams vs datagrams, 278
Sun-4
alignment restrictions, 293
porting, 293
svc_freeargs(), 78
svc_getargs(), 75, 78
svc_getreqset(), 81
svc_register(), 92
svc_run(), 80, 81, 92
svc_sendreply(), 74
svcerr_noproc(), 74
svcerr_systemerr(), 89
svcerr_weakauth(), 89
svctcp_create(), 74, 77
svcudp_create(), 74, 77
SVCXPRT, 74, 75
-351-
T
TCP, 95
telnet,17
terminals, pseudo, 306
tftp, 17
transparency of NFS, 8, 11
Transport-Level Programming, 187
U
UDP 8K warning, 69
udp_output (), 335
UNIX Authentication, 86
unsigned short,92
user, 1,92
V
versions on client side, 94
versions on server side, 92
VFS, 10
virtual file system, 10
inode, 10
W
wildcard address, 309
write (), 83
X
x_destroy (), 123
x_getbytes (), 123
x_getlong(), 124
x_getpostn(), 123
x_inline(), 123
x_putbytes(), 123
x_putlong(), 124
x_setpostn(), 123
XDR
advanced topics, 124
array, fixed length, 136
array, variable length, 137
basic block size, 131
block size, 131
boolean, 133
byte order, 141
canonical standard, 106
constant, 139
data types, 132
data, optional, 140
discriminated union, 138
double-precision floating-point integer, 134
enumeration, 133
fixed-length array, 136
fixed-length opaque data, 135
floating-point integer, 133
futures, 141
hyper integer, 133
integer, 132
integer, double-precision floating point, 134
integer, floating point, 133
integer, hyper, 133
integer, unsigned, 132
XDR, continued
justification, 104
language, 141, 142
library, 107
library primitives, 109
linked lists, 124
memory allocation, 77
memory streams, 121
non-filter primitives, 120
object, 123
opaque data, fixed length, 135
opaque data, variable length, 135
operation directions, 120
optional data, 140
portable data, 106
protocol specification, 131
record (TCP/IP) streams, 121
RFC, 131
RFC status, 131
standard I/O streams, 120
stream access, 120
stream implementation, 123
string, 136
structure, 137
Sun technical notes, 103
system routines, 103
typedef,139
union, 138
unsigned integer, 132
variable-length array, 137
variable-length data, 142
variable-length opaque data, 135
void,138
XDR language
notation, 142
syntax, 143, 144
XDR library
arrays, 113
byte arrays, 112
constructed data type filters, 111
discriminated unions, 116
enumeration filters, 111
fixed sized arrays, 116
floating point filters, 110
no data, 111
number filters, 109
opaque data, 115
pointers, 118
strings, 111
XDR structure sizes, 170
xdr_array () , 71, 77, 113
xdr_bytes (), 71, 112
xdr_chararr1(), 77
XDR_DECODE, 108, 111, 120
xdr_destroy(),120
xdr_element (), 113
XDR_ENCODE, 108, 120
XDR_FREE, 111,120
xdr_getpos (), 120
xdr_long(), 105, 108
xdr_opaque (), 115
xdr_reference (), 118, 119
-352-
xdr_setpos (), 120
xdr_string (), 72,111, 112
xdrmem_create(),121
xdrrec_endofrecord(),122
xdrrec_eof(),122
xdrrec_skiprecord(),122
xdrstdio_create(),107,120
y
ypbind command, 25
ypcat command, 25
ypinit command, 25
ypmake command, 26
yppasswd command, 26
yppasswdd command, 26
yppush command, 25
ypwhich command, 25
ypxfr command, 25
-353-