VCMTP: A Reliable Message Multicast
Transport Protocol for Virtual Circuits
Jie Li, Student Member, IEEE, Malathi Veeraraghavan, Senior Member, IEEE, Steve Emmerson, and Robert D. Russell, Life Member, IEEE

• J. Li and M. Veeraraghavan are with the Charles L. Brown Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22911. Email: mv5g@virginia.edu
• S. Emmerson is with the University Corporation for Atmospheric Research (UCAR), Boulder, CO. Email: emmerson@ucar.edu
• R. D. Russell is with the Computer Science Department and InterOperability Laboratory, University of New Hampshire, Durham, NH 03824. Email: rdr@iol.unh.edu
Abstract—Currently, application-layer (AL) multicasting is commonly used to distribute meteorology and other types of data via unicast
TCP connections between the AL servers. For the same latency/throughput, an AL-multicasting solution requires more servers at the
sender and higher network capacity than a network-layer multicast solution. The lack of adoption of IP multicast has been attributed to
the complexity of multicast routing protocols and variable path congestion, which are concerns that can be addressed with multicast
virtual circuits. Given the recent deployment of dynamic virtual circuit (VC) services, we propose a reliable message transport-layer
protocol called VCMTP for use over multicast VCs. An analytical model was developed for single file distribution, and validated
experimentally. For continuously generated files, a key design aspect of VCMTP is the tradeoff between file-delivery throughput for fast
receivers and robustness for slow receivers. A VCMTP configurable parameter called retransmission timeout factor can be adjusted
to tradeoff these two metrics. For a traffic load of 0.4, and a multicast group with 30 receivers, robustness increased significantly from
81.4 to 97.5% when the retransmission timeout factor was increased from 10 to 50. The corresponding drop in average throughput for
fast receivers was small (86.9 to 85.8 Mbps).
Index Terms—Network protocols, reliable multicast, virtual circuits, transport layer, scientific data distribution
1 INTRODUCTION

This section presents the problem statement and motivation, outlines our solution approach, and highlights the novelty and contributions of this work.
Problem statement and motivation Increasing volumes of scientific data are collected by instruments, obtained from experiments, and generated by simulations executed on high-performance computing platforms. These datasets often need to be distributed to researchers located at various institutions. For example, the University Corporation for Atmospheric Research (UCAR) Internet Data Distribution (IDD) project [1] distributes large amounts of meteorology data on a near real-time basis to a subscriber base of 170 institutions. Over 30 types of data products (referred to as feedtypes) are distributed through the IDD project.

Currently, the IDD project uses Application-Layer (AL) multicasting by creating a tree of Local Data Manager (LDM) servers (LDM is the application software used in the IDD project [2]). For example, the LDM tree used to distribute CONDUIT high-resolution model data consists of 163 servers, with 22 root, 35 middle, and 106 leaf nodes. Unicast TCP connections are used between all upstream and downstream LDM servers.

While this method of using unicast TCP connections over IP-routed service has an ease-of-use advantage, it has certain disadvantages when compared to multicast delivery. For the same performance metric (e.g., latency), the unicast TCP approach requires more servers at the sender and a higher access-link bandwidth than the multicast approach. Alternately, for the same number of servers and access-link bandwidth, the latency of delivering data products could be smaller with multicast. Therefore, the problem statement of this work is to develop a reliable multicast transport protocol, and the motivation for this work stems from the IDD and other scientific data-distribution projects.

Potential for broader impact Besides the IDD project, large data sets are created in other scientific research projects such as the Earth Systems Grid [3] and Large Hadron Collider (LHC) [4] experiments. For the many scientists involved in these projects, some newly created data files could be distributed in push mode instead of requiring each scientist to download files in pull mode. In addition, there are commercial applications for data distribution to multiple receivers, e.g., electronic distribution of newspapers, financial data, teaching modules, and video files. Finally, with Content Delivery Networks (CDNs) [5], Web sites and other data are often replicated to many servers deployed across the Internet to ensure proximity, and hence lower propagation delays in reaching users.
Solution approach: Use multicast virtual circuits The
advantages and disadvantages of using virtual circuits
are discussed below.
To avoid the costs of AL-multicasting, there has long
been an interest in using IP-multicast, whereby network
routers rather than application servers replicate packets
for delivery to multiple receivers. However, native IP
multicast has proven to be a challenge [6], [7] because of
the complexity of distributed IP multicast routing protocols, such as PIM-SM [8] and MSDP [9], and because receivers without credentials can join a multicast group using
IGMP [10] and MLD [11] accidentally or to maliciously
undermine the throughput of the legitimate recipients.
Also, since the IP network is connectionless, congestion-related packet losses are possible. A congested path to any one of the receivers can decrease throughput for all other receivers.
These problems can be mitigated with a multicast
virtual-circuit (VC) service [12] by leveraging the setup
phase during which switches on the end-to-end path are
configured for the VC. As a path is explicitly selected
during VC setup, loop-free routes are ensured thus
avoiding the problems of IP-multicast routing protocols.
Since the network’s VC scheduling/provisioning system
needs to communicate with client software running on
each receiver prior to data transfer, user credentials can
be verified. Finally, since bandwidth and buffer resources
are assigned to VCs at each switch in the setup phase,
data-plane congestion is avoided.
While VCs have the above-stated advantages, their
disadvantages are as follows: (i) VC service is not as
ubiquitous as IP-routed service, especially in its dynamic
form, and (ii) VC setup delay could be on the order of
milliseconds due to round-trip propagation delays.
The first disadvantage is being addressed by a growing
deployment of dynamic virtual-circuit service in both
research-and-education networks (RENs) and in commercial networks. ESnet [13], Internet2 [14], GEANT
[15], and JGN-X [16] offer a dynamic virtual-circuit
service in which advance-reservation schedulers handle
requests for rate-guaranteed virtual circuits. The Dynamic Network System (DYNES) project has extended
the reach of dynamic virtual-circuit service to about
40 US universities and 11 regional RENs [17]. In the
commercial arena, AT&T offers an optical mesh service,
and Verizon offers a Bandwidth-on-Demand service [18].
Technologies such as Virtual Private LAN Service (VPLS)
[12] leverage the growing base of MultiProtocol Label
Switching (MPLS) to create wide-area Ethernet Virtual
LANs (VLANs). Since these VLANs can be multipoint,
networks using these technologies can offer multicast
virtual-circuit service.
The second disadvantage, long VC setup delay, is
relevant only for dynamic VCs. For data distribution
projects such as the IDD in which feedtypes are almost
continuous, there is no opportunity to tear down VCs
and reestablish them, and therefore setup delay is not
relevant. As an example, Fig. 1 shows the data product
(file) sizes and number of data products distributed as
part of a radar-data feedtype on a typical day. Data
products are transmitted almost continuously in this
feedtype. Our analysis shows that silence periods were
larger than 1 second in less than 2% of the cases. Per
day, there were only about 100 silence periods that lasted
longer than 60 seconds [19]. In VC switches, transmitters can be configured to operate in work-conserving mode, whereby a transmitter serves packets from other queues during silence periods on any single VC, which means bandwidth is not wasted, unlike in circuit-switched networks.

Fig. 1: Daily Characteristics of NEXRAD2
For other types of data distribution involving single
files, VC setup delay will be a factor, especially in high-speed networks, since transmission delays decrease as
link rates increase. For such applications, multicast VCs
are useful only if the datasets being distributed are large,
which is the case for scientific data [20].
Based on the above consideration of the pros and
cons of using VCs, and identification of applications for
which multicast VCs are suitable for data distribution,
we propose a new reliable Virtual Circuit Multicast
Transport Protocol (VCMTP). Reliability is required as
this transport protocol is designed for data distribution,
and not audio/video streaming.
Novelty and contributions One of the key features of the VCMTP design is the ability to trade off file-transfer throughput for "fast" receivers (receivers that can keep up with the data arrival rate) against robustness (the percentage of successful file deliveries) for slow receivers.
This tradeoff is achieved with a per-file configurable parameter called retransmission timeout factor. This factor
determines the amount of time during which receivers
can request retransmissions of missed blocks after a file
has been multicast. It is defined as a multiple of the file
multicast time to make the retransmission period longer
for larger files. For a given file arrival rate, the larger this
factor, the higher the robustness for slow receivers but
the lower the throughput for fast receivers.
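As a concrete illustration (a minimal sketch; the names are ours, not from the VCMTP source), the timer computation amounts to:

#include <chrono>

// Minimal sketch of the timer described above: the retransmission period for
// a file is the configurable retransmission timeout factor multiplied by the
// time it took to multicast that file, so larger files get proportionally
// longer retransmission periods.
std::chrono::duration<double> RetxTimeout(
        std::chrono::duration<double> multicastTime,
        double timeoutFactor)  // e.g., 10 or 50 in the experiments of Sec. 6
{
    return timeoutFactor * multicastTime;
}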
The VCMTP design was prototyped and evaluated on U. Utah's Emulab testbed [21]. Random number generators were used to create samples of file inter-arrival times and file sizes. The sizes were used to create
files for actual transfers over VCMTP in the testbed,
and measurements were obtained. Packet losses were
injected using the Linux tc utility at a fraction of the
receivers to emulate “slow” receivers. Throughput and
robustness metrics were characterized as a function of
traffic load (arrival rate divided by service rate).
In summary, the key contributions of this work are
(i) a new reliable multicast transport protocol designed
for virtual circuits, (ii) a validated analytical model for
single-file multicasts, and (iii) the evaluation of VCMTP
in the context of a continuous file-arrival process rather
than for single files as in prior work [22], [23].
Section 2 reviews prior work on reliable multicast protocols. Section 3 describes the VCMTP design, including
its finite state machines. Different underlying network
service options for VCMTP are discussed in Section 4.
Sections 5 and 6 describe the VCMTP prototype and its
evaluation. The evaluation section includes (i) analytical models for VCMTP and the multiple-unicast-TCP-connections approach for single-file transfers, which are validated with experimental measurements, and (ii) an
experimental evaluation of VCMTP for continuous file
arrivals. The paper is concluded in Section 7.
2 RELATED WORK
Application-layer multicasting solutions that leverage peer-to-peer methods such as BitTorrent have been proposed for Grid applications [24]. In contrast, the VCMTP approach is designed for network multicast.
While VCMTP is designed for efficient reliable data
distribution across virtual circuits, multicast protocols
have been designed for other reasons and other types of
networks, e.g., for energy efficiency in wireless networks
[25], for improved application throughput in data-center
networks [26], and to improve MPI-collective operation
performance in clouds [27].
Given the popularity of ATM virtual-circuit networks
in the nineties, we searched the literature for prior work
on reliable transport protocols for ATM networks. A
reliable multicast solution for ATM networks was developed by Turner [28]. It required hardware assistance at the switches, and did not handle the flow-control problem (arising from receiver-buffer overflows). VCMTP
addresses flow control, and does not require hardware
assistance in the VC switches. There were other transport
protocols for ATM multicast such as the one by Ma and
El Zarki [29], but these solutions were for audio/video
streaming, not data distribution.
Work on reliable multicast transport protocols for IP networks in the IETF Reliable Multicast Transport working group [30] includes Asynchronous Layered Coding (ALC) [31] and NACK-Oriented Reliable Multicast
(NORM) [32]. With ALC, the sender multicasts packets
within a session on multiple channels at potentially different rates, allowing for multiple-rate congestion control.
Each receiver can obtain packets from one or more channels within the session based on the available bandwidth
on the path from the sender, and its own reception rate.
This combination of layered coding transport and forward error correcting codes allows for massively scalable
reliable multicast. More generally, solutions that combine
error correcting codes such as RaptorQ codes [33] with
techniques such as data carousel [34] or broadcast disk
[35] are well suited to situations in which the number
of multicast receivers is large (e.g., millions). While this
approach works well when massive scalability is required, reception overhead can be high [36]. The sender
computing resources and network bandwidth could be
more than in a NACK-based scheme if the number of
receivers is moderate. As pointed out by Barcellos et al.
[23], “the sender has no knowledge about the network
and needs to be conservative in terms of redundancy to
guarantee reliable delivery.” Since the target applications
of VCMTP are in the scientific community, the number
of receivers is in the hundreds, not millions. For example, the CONDUIT feedtype in the IDD project has a
maximum fan-out of 104 receivers [19].
When the number of receivers is small-to-moderate,
a negative acknowledgment (NACK) based scheme, as in Scalable Reliable Multicast (SRM) [37] and NORM, is a better approach (positive acknowledgment schemes,
as in Reliable Multicast Transport Protocol (RMTP) [22],
have the ACK implosion problem). In SRM, requests
and retransmissions are also multicast, but in VCMTP,
requests and retransmissions are sent over unicast TCP
connections as in MTP-2 [38] and Reliable Adaptive
Multicast Protocol (RAMP) [39].
In the VCMTP design, several concepts were taken
from protocols proposed in the research literature on
reliable multicast transport protocols [22], [23], [37]–[45],
as well as unicast reliable message-based protocols such
as Stream Control Transmission Protocol (SCTP) [46].
For example, should VCMTP be message-based or byte-stream based? SCTP is a unicast reliable message-based
transport protocol, while TCP is byte-stream based. Most
reliable multicast transport protocols, such as SRM, Multicast Transport Protocol (MTP) [41], RAMP, and NORM,
support message-based communications, with RAMP
and NORM additionally supporting byte-stream mode.
RMTP appears to be the exception in that it is byte-stream based. Our reasons for choosing the message-based option are explained in Section 3.
However, unlike NORM, VCMTP does not require
data-plane congestion control schemes, such as TCP-friendly multicast congestion control [47], as it is designed for virtual circuits. NORM determines a group
Round Trip Time (RTT) based on feedback from receivers and the RTT estimate of the current limiting
receiver determines the sending rate. Some other reliable
multicast proposals, such as SRM, note that multicast
congestion control, which is required if the underlying
network service is a connectionless service such as IP, is a
difficult proposition. In VCMTP, data-plane congestion is
avoided through the use of admit/reject decisions made
by the VC scheduling/provisioning control-plane system
in the setup phase.
3 VCMTP
VCMTP is a negative-acknowledgment (NACK) based
reliable transport protocol, designed for multicast virtual circuits, in which a file is segmented into blocks
(limited by a maximum block size) and transmitted over
a multicast network service to one or more receivers,
and retransmissions of errored/lost blocks for individual
receivers are carried over unicast reliable connections
between the sender and each receiver. Even though the use of rate-guaranteed virtual circuits eliminates congestion-related packet losses, retransmissions may be required due to bit errors and receive-buffer overflows in multitasking receivers (the "flow-control" problem). A first
version of VCMTP, designed for single-file transfers, was
described in a conference paper [48]. Significant design
changes were made in this second version of VCMTP to
support continuous file distribution.
VCMTP Application Programming Interface (API) The VCMTP API is message-based (unlike TCP's byte-stream API) and asynchronous (unlike TCP's synchronous API). A message-based API is more suitable for multicast communications than a byte-stream API [37]. In message-based communications, application data units (ADUs) are preserved by the transport layer [49], unlike in TCP, where a TCP segment can include bytes from different application data units. This design choice allows for easier identification of missed application data units if one of the multiple receivers suffers a network loss and needs a complete message retransmission. The API is asynchronous, which means that when an application invokes the VCMTP function to send a file, the application is not blocked, but it cannot modify the user data until VCMTP completes the transfer. The VCMTP API requires separate calls to send files and to receive completion notifications. This is unlike the synchronous API of TCP, which blocks the application when the TCP send function is invoked. However, the TCP send call returns quickly, i.e., as soon as the data is copied from user space to kernel space within the sender; in other words, TCP send does not wait until the data is delivered to the receiver before returning. The copy held in kernel space allows TCP to slowly send the data to the receiver and perform retransmissions if required. Meanwhile, control of the user-space data is immediately released back to the application, allowing the latter to perform other actions, such as delete, move, or copy, on this data. The disadvantage of TCP's approach is the higher latency incurred in copying the data from user space to kernel space, an operation that is avoided in the zero-copy approach of an asynchronous API.

On the receive side, there are two alternatives by which an application can determine where incoming files are stored. In the first method, called "per-file notification," the VCMTP receiver notifies the application upon reception of the Begin-of-File (BOF) control message for each new file, at which point the application must specify whether or not VCMTP should receive the file, and if so, where to store the file, and optionally what new name to give it. In the second method, called "batched notification," the application provides VCMTP a folder into which each new file is stored and a period for receiving notifications. VCMTP in turn notifies the application periodically via a completion event queue.
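A hypothetical C++ rendering of these two notification modes is sketched below; the type and member names are assumptions, since the paper does not give the API signatures.

#include <cstdint>
#include <string>

// Hypothetical receive-side notification interface (names are assumptions).
struct BofResponse {
    bool accept;            // should VCMTP receive this file?
    std::string savePath;   // where to store it, optionally under a new name
};

class ReceiverNotifier {
public:
    // Per-file notification: invoked on each BOF control message; the
    // application decides whether and where the file is received.
    virtual BofResponse onBof(uint64_t fileId, const std::string& fileName,
                              uint64_t fileSize) = 0;
    // Batched notification: files land in a pre-specified folder, and the
    // application periodically drains a completion event queue instead.
    virtual void onCompletionBatch() = 0;
    virtual ~ReceiverNotifier() = default;
};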
VCMTP functions As VCMTP is a reliable transport
protocol, it has error control functionality. Unlike in
TCP where positive acknowledgments (ACKs) are used,
VCMTP uses negative ACKs (NACKs), which are triggered by out-of-sequence blocks. Since virtual circuits
guarantee sequenced delivery, sequence numbers can
be used to detect missing blocks. VCMTP does not
require data-plane congestion control since it is designed
for virtual circuits that are established with guaranteed bandwidth and buffer allocations. Therefore, packet
losses due to congestion events in VC switch buffers
should be small, if any. For flow control, ON-OFF and
window-based schemes are not feasible in the presence
of multiple receivers, and hence there is no data-plane
support for flow control in VCMTP. Instead, VCMTP relies on a control-plane solution in which each receiver calibrates its ideal receive rate and sends this rate to the multicast sender to help the sender plan the multicast VCs. This approach may not eliminate, but can reduce, packet loss at the receivers.
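Since virtual circuits deliver blocks in sequence, loss detection reduces to a gap check on block sequence numbers (byte offsets). The following sketch, with hypothetical names, shows the idea.

#include <cstdint>
#include <optional>

// Illustrative gap detection, assuming in-sequence delivery on the VC: a data
// block whose sequence number (its starting byte offset in the file) exceeds
// the next expected offset implies the bytes in between were lost, so the
// receiver NACKs that range.
struct Gap { uint64_t start; uint64_t end; };    // byte range to request

std::optional<Gap> CheckForGap(uint64_t& expectedOffset,
                               uint64_t seqNumber, uint64_t payloadLen)
{
    std::optional<Gap> gap;
    if (seqNumber > expectedOffset)              // missing bytes detected
        gap = Gap{expectedOffset, seqNumber};
    expectedOffset = seqNumber + payloadLen;     // advance past this block
    return gap;                                  // caller sends a Retx-Request
}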
VCMTP messages The term “message” is reserved for
control messages exchanged between the VCMTP sender
and receivers. Examples include Begin-of-File
(BOF) and End-of-File (EOF) control messages.
The term “file” is used to denote both disk and memory
user data that is passed down by the application to
VCMTP for multicasting.
Fig. 2 illustrates VCMTP messaging with the solid
lines showing the multicast tree, and the dashed
lines representing the reliable unicast connections. The
VCMTP sender starts by multicasting a BOF message (1) and then multicasts the file in the form of blocks (2). The BOF carries metadata such as the name and size of the file. After the sender completes multicasting a file, it multicasts an EOF message to all receivers (3). Procedures to handle the loss of BOF and EOF control messages are specified in the VCMTP design document [50].

Each VCMTP receiver j identifies lost/errored blocks, if any, based on out-of-order block reception, at which point it requests retransmissions by sending Retx-Request control messages (4b) and then receives retransmitted blocks (5), both on the reliable unicast connection. File multicasting and loss recovery are concurrent, which means the VCMTP sender can be transmitting blocks from a file while concurrently handling loss recovery for previously multicast blocks from the same file, or for blocks from a previous file.

After the EOF and retransmissions of all lost/errored blocks have been received, each receiver notifies the sender on the reliable unicast connection, via an End-Of-Retx-Reqs control message (4a or 6), that it has no more retransmission requests. A sender-side timer is used for this retransmission phase because the file eventually needs to be released for the application to reclaim (recall the asynchronous API). If the sender receives a Retx-Request control message from a receiver for a file whose timer has expired (7), the sender rejects the request by sending a Retx-Reject control message (8), and the transmission of the file to that particular receiver is deemed a failure. The application has to use some alternative mechanism to request the file individually from the sender. Fig. 2 illustrates three cases: receiver 1 receives all blocks successfully and hence sends just the End-Of-Retx-Reqs control message; receiver j misses certain blocks, but requests them and receives retransmissions successfully; and receiver n is somehow delayed in its requests for retransmissions and hence receives a Retx-Reject.

Fig. 2: VCMTP Messaging for file i

Fig. 3: VCMTP Packet Format

VCMTP packet format In VCMTP, every data block or control message is encapsulated in a VCMTP packet.
Each VCMTP packet includes a 32-byte header with the
four 8-byte fields shown in Fig. 3. The File ID is a
number assigned by the sender that uniquely identifies each file within a multicast group. The Sequence
Number is used in a data packet to indicate the starting byte position of the VCMTP data block within the
file identified by File ID. The Payload Length field
indicates the size of the payload (either a data block or
a control message) in bytes. Finally, the Flags field is
a bit vector that indicates the type of VCMTP packet:
(i) data block; (ii) retransmitted data block; (iii) BOF
message; (iv) EOF message; (v) BOF-Request message;
(vi) Retx-Request message; (vii) End-Of-Retx-Reqs
message; and (viii) Retx-Reject message. The first two
types of VCMTP packets carry data-plane file blocks,
while the remaining six types of VCMTP packets carry
control-plane messages. The roles of all the control-plane
messages except the BOF-Request were explained in
the previous paragraph along with Fig. 2. The purpose
of the BOF-Request message is to handle lost BOF messages. A receiver detects BOF loss if it starts receiving
data blocks for a new file without a corresponding BOF.
Upon receiving a BOF-Request from a receiver, the
sender sends a corresponding BOF on the unicast reliable
connection to that particular receiver.
TABLE 1: VCMTP control message formats

Message Type     | Format (field sizes in bytes)
BOF              | transfer_type (1); file_size (8); file_name (256)
EOF              | none
BOF-Request      | none
Retx-Request     | start_pos (8); end_pos (8)
End-Of-Retx-Reqs | none
Retx-Reject      | none

Table 1 shows the format of each control message type. The number in parentheses after each field name indicates the size of the field in bytes. Four types of VCMTP control messages, the EOF, BOF-Request, End-Of-Retx-Reqs, and Retx-Reject messages, do not have any payload fields. For these messages, the File ID and Flags fields in the VCMTP header are sufficient to serve their operational functions. On the other hand, a BOF message includes three fields: transfer_type, file_size (in bytes), and file_name (as an ASCII string). The transfer_type
is used to indicate whether the forthcoming transfer is memory-to-memory, memory-to-disk, disk-to-memory, or disk-to-disk. Finally, a Retx-Request message includes two fields in the payload: start_pos and end_pos, which indicate the starting and ending byte positions of the missing/errored data blocks for which retransmissions are being requested. The file for which retransmissions are being requested is identified by the File ID in the VCMTP packet header of the Retx-Request message.
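The 32-byte header lends itself to a fixed-layout struct. The following sketch uses illustrative names and flag values, since the paper specifies the field sizes and packet types but not the encoding.

#include <cstdint>

// Sketch of the 32-byte VCMTP packet header described above: four 8-byte
// fields (File ID, Sequence Number, Payload Length, Flags). Names are
// illustrative, not taken from the VCMTP source.
struct VcmtpHeader {
    uint64_t file_id;      // uniquely identifies a file within a multicast group
    uint64_t seq_number;   // starting byte position of this data block in the file
    uint64_t payload_len;  // size of the payload (data block or control message)
    uint64_t flags;        // bit vector giving the packet type (see below)
};
static_assert(sizeof(VcmtpHeader) == 32, "header must be 32 bytes");

// Hypothetical flag values for the eight packet types listed in the text.
enum PacketFlags : uint64_t {
    VCMTP_DATA             = 1u << 0,
    VCMTP_RETX_DATA        = 1u << 1,
    VCMTP_BOF              = 1u << 2,
    VCMTP_EOF              = 1u << 3,
    VCMTP_BOF_REQUEST      = 1u << 4,
    VCMTP_RETX_REQUEST     = 1u << 5,
    VCMTP_END_OF_RETX_REQS = 1u << 6,
    VCMTP_RETX_REJECT      = 1u << 7,
};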
VCMTP Finite State Machines (FSMs) Each multicast
group consists of one sender and an arbitrary number of
receivers. On the sender, three FSMs are defined: multicast sender, coordinator, and receiver-specific retransmitter.
There is one instance of each of the first two FSMs,
and as many instances of the last FSM as there are
receivers in that group. The multicast sender transmits
the BOF, file blocks, and EOF in multicast mode. Each
receiver-specific retransmitter retransmits data blocks in
response to retransmission requests from its corresponding receiver. The coordinator manages the receiver-specific retransmitters and oversees retransmissions. The
coordinator is informed by the sender when all data
blocks of a file have been multicast, at which time the
coordinator starts the retransmission timer for the file.
Upon receiving notifications of successful file completions from all receiver-specific retransmitters, or when
the retransmission timer for the file expires (whichever
occurs first), the coordinator notifies the application that
file transfer is complete. Since the API is asynchronous,
this notification is required for the application to reclaim
the file.
On each receiver, there are two FSMs: data receiver,
and retransmission requester. The data receiver receives
the original multicast data blocks and the retransmitted data blocks, writes these blocks into memory/disk
locations, and handles the reception of the BOF, EOF
and Retx-Reject control messages. The data receiver
also interacts with the application, notifying it of successful or failed file receptions. The retransmission
requester is responsible for sending Retx-Request,
BOF-Request and End-Of-Retx-Reqs control messages to the sender.
The reliable unicast connection is initiated by the data
receiver FSM. On the sending side, the coordinator listens for connection requests from receivers. Connection
closure is handled by the receiver-specific retransmitter
on the sending side and the data receiver on the receiving side. An overview of the FSMs is provided below;
for details of these FSMs, the reader is referred to the
VCMTP design document [50].
VCMTP Sender FSMs Tables 2, 3 and 4 show the state transition tables, with input and output events, for the three VCMTP sender FSMs. In addition, Fig. 4 shows the events communicated between the three FSMs in a VCMTP sender, along with interactions with external entities such as the user application, the VCMTP receivers, and the network service. As a convention, the following prefixes are used in event names: API for events generated or consumed by the user application, and INL for internal events created by one VCMTP FSM and consumed by another FSM on the same side (sender or receiver).

Fig. 4: Interaction between FSMs of a VCMTP sender

The initialization phase is triggered by the user application sending an API_Init_Sender event (see Fig. 4). This causes the multicast sender FSM to enter its READY-TO-SEND state (see Table 2) and to create the output event INL_Init_Coordinator, which in turn instantiates the coordinator FSM, placing it in its ACTIVE state (see Table 4). When the coordinator FSM receives notification from the network service (lower layers) that a reliable unicast connection has been established from receiver j, via the INL_Unicast_Connection_Created(j) event (see Fig. 4 and Table 4), the coordinator FSM issues an output event INL_Init_Retransmitter(j), which instantiates the receiver-specific retransmitter FSM corresponding to receiver j.

File multicasting begins with the user application sending an API_Send(i) event for file i, causing the multicast sender FSM to transition to the SENDING-MSGS state (see Fig. 4 and Table 2) after generating the BOF(i) message. In the SENDING-MSGS state, as shown in Table 2, the multicast sender continues to multicast the file in segmented data blocks DATA_Block(i, b), for all blocks b in file i, followed by the EOF(i) message. For completed files, it generates the INL_Handle_Completion event, as shown in Table 2. This event is received by the coordinator FSM, as seen in both Fig. 4 and Table 4, which then initiates the retransmission timer for file i. All retransmission requests that arrive after the expiry of this timer are rejected, to support the asynchronous API. The impact of the timer value on robustness is measured experimentally, as described in a later section. Table 2 also shows that while the multicast sender FSM is in its SENDING-MSGS state, the user application may send it an API_Send event for a new file, triggering it to send a BOF message for the new file. This feature prevents the transfer of one large file from holding up transfers of other files.
TABLE 2: State Transition Table for Multicast Sender FSM

Current State | Input Event | Output Event | Next State
NULL | API_Init_Sender | INL_Init_Coordinator | READY-TO-SEND
READY-TO-SEND | API_Send(i) | BOF(i) | SENDING-MSGS
READY-TO-SEND | API_Close_Sender | INL_Close_Coordinator | NULL
SENDING-MSGS | API_Send for new files | DATA_Block, EOF, INL_Handle_Completion, BOF for new files | SENDING-MSGS (files pending) or READY-TO-SEND (no files pending)

TABLE 3: State Transition Table for Receiver-Specific Retransmitter FSM(j)

Current State | Input Event | Output Event | Next State
NULL | INL_Init_Retransmitter(j) | — | ACTIVE
ACTIVE | Retx-Request(i,B) | DATA_Block(i, b) (not timed out) or Retx-Reject(i) (timeout) | ACTIVE
ACTIVE | BOF-Request(i) | BOF(i) | ACTIVE
ACTIVE | End_Of_Retx_Reqs(i) | INL_File(i)_Done_Rcvr(j) (after all retx data blocks have been sent) | ACTIVE
ACTIVE | INL_Close_Retransmitter(j) | INL_Close_Unicast_Connection | ACTIVE
ACTIVE | INL_Unicast_Connection_Closed | INL_Retxmitter(j)_Exited | NULL

TABLE 4: State Transition Table for Coordinator FSM

Current State | Input Event | Output Event | Next State
NULL | INL_Init_Coordinator | — | ACTIVE
ACTIVE | INL_Unicast_Connection_Created(j) | INL_Init_Retransmitter(j) | ACTIVE
ACTIVE | INL_Handle_Completion(i) | — | ACTIVE
ACTIVE | INL_File(i)_Done_Rcvr(j) | API_File_Done(i) (no pending receivers) | ACTIVE
ACTIVE | INL_Retx(i)_Timeout | API_File_Done(i) | ACTIVE
ACTIVE | INL_Close_Coordinator | INL_Close_Retransmitter(j) | CLOSING-RETXMITTERS
CLOSING-RETXMITTERS | INL_Retxmitter(j)_Exited | — | CLOSING-RETXMITTERS (more pending receivers) or NULL (no pending receivers)
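The multicast sender FSM of Table 2 can be rendered as a small dispatch function. The states and events below follow the table; the code structure itself is an illustrative sketch, not the prototype's implementation.

// Compact sketch of the multicast sender FSM of Table 2.
enum class SenderState { NULL_STATE, READY_TO_SEND, SENDING_MSGS };
enum class SenderEvent { API_INIT_SENDER, API_SEND, API_CLOSE_SENDER };

SenderState Next(SenderState s, SenderEvent e, bool filesPending)
{
    switch (s) {
    case SenderState::NULL_STATE:
        if (e == SenderEvent::API_INIT_SENDER)   // emits INL_Init_Coordinator
            return SenderState::READY_TO_SEND;
        break;
    case SenderState::READY_TO_SEND:
        if (e == SenderEvent::API_SEND)          // emits BOF(i)
            return SenderState::SENDING_MSGS;
        if (e == SenderEvent::API_CLOSE_SENDER)  // emits INL_Close_Coordinator
            return SenderState::NULL_STATE;
        break;
    case SenderState::SENDING_MSGS:              // emits DATA_Block, EOF, and
        return filesPending                      // INL_Handle_Completion
                   ? SenderState::SENDING_MSGS
                   : SenderState::READY_TO_SEND;
    }
    return s;  // ignore events not valid in the current state
}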
Retransmissions occur concurrently with file multicasting. A Retx-Request(i,B) message for a missing set of blocks B in file i sent by a receiver, as shown in Fig. 4, is received by the corresponding receiver-specific retransmitter FSM. As shown in Table 3, the receiver-specific retransmitter FSM transmits the requested data blocks, denoted DATA_Block(i,b), for each block b ∈ B, as long as the retransmission timer has not expired for the file (if the timer has expired, the FSM transmits a Retx-Reject(i) message). When all required retransmissions for file i have been completed successfully and the End_Of_Retx_Reqs(i) message has been received from the receiver, each receiver-specific retransmitter FSM j notifies the coordinator FSM via an INL_File(i)_Done_Rcvr(j) event.

Releasing the file back to the user application, part of the asynchronous API, is handled by the coordinator FSM. It keeps track of the delivery status of all files to all receivers using a status matrix. When it detects that a file has been delivered to all receivers, or if the retransmission timer for a file i expires (see event INL_Retx(i)_Timeout in Table 4), it generates an API_File_Done(i) event to the user application.

The close-out phase typically starts when a user application completes multicasting all its files. It issues the API_Close_Sender event, which causes the multicast sender FSM to send an INL_Close_Coordinator event to the coordinator FSM before it terminates itself. The coordinator will in turn notify each of the receiver-specific retransmitter FSMs by sending an INL_Close_Retransmitter(j) event. The coordinator waits to close itself until after it receives INL_Retxmitter(j)_Exited events from all the receiver-specific retransmitter FSMs. These events are sent by the latter after they have all closed the unicast connections to their corresponding receivers by issuing the INL_Close_Unicast_Connection events and receiving the INL_Unicast_Connection_Closed events from the network service. If a receiver initiates connection closure, then the corresponding receiver-specific retransmitter FSM will be notified by the network service via an INL_Unicast_Connection_Closed event, causing the FSM to output an INL_Retxmitter(j)_Exited event, as shown in Table 3.

For further details on the FSMs, including scenarios in which individual receivers leave a group or new receivers join a group, please see the VCMTP design document [50].
VCMTP Receiver FSMs Tables 5 and 6 show the state transition tables, with input and output events, for the two VCMTP receiver FSMs. In addition, Fig. 5 shows the interaction between the two FSMs in a VCMTP receiver, along with interactions with external entities such as the user application, the VCMTP sender, and the network service.

Fig. 5: Interaction between FSMs of a VCMTP receiver

The initialization phase is triggered by the user application sending an API_Init_Receiver event (see Fig. 5), which causes the data receiver FSM to enter its IDLE state after issuing an INL_Create_Unicast_Connection output event (see Table 5). When the network service establishes the reliable unicast connection, it sends the INL_Unicast_Connection_Created event. This causes the data receiver FSM to transition into its RECVING-DATA state after issuing an INL_Init_Retx_Requester event, which in turn causes the retransmission requester FSM to be created (see Tables 5 and 6).

Multicast file reception starts when a BOF(i) control message is received from the VCMTP sender. If the user application had asked for per-file notifications instead of batched notifications when it initialized the data receiver, then the data receiver will send an API_BOF(i)_Received to the user application, as shown in Fig. 5 and Table 5. The user application can specify parameters, such as the name and location for the file, or ask the data receiver to ignore the file, in its API_BOF_Response(i) event. Upon receiving a data
block DATA_Block(i,b) from the VCMTP sender, the data receiver checks whether it is a new multicast data block or a retransmitted data block received on the reliable unicast connection. If it is a new data block, it compares the sequence number of the data block against that of the last received data block. Data loss is detected if a gap exists between the sequence numbers of consecutively received data blocks, as virtual circuits guarantee sequenced delivery. Loss recovery is described in the next paragraph. For each successfully received block, the data receiver FSM writes the received data block into the corresponding position in the disk or memory file. Finally, when an EOF(i) message is received, the data receiver checks to see if there is data loss at the end of file i. If so, it first sends an INL_Retx_Req(i,B) event to the retransmission requester. It then sends an INL_End_Of_Retx_Reqs(i) event to the retransmission requester, which in turn sends an End_Of_Retx_Reqs(i) to the receiver-specific retransmitter at the sender to indicate that this receiver has sent all retransmission requests for file i (see Fig. 5, and Tables 5 and 6).

Loss recovery procedures are as follows. If data loss is detected at the data receiver, it sends an INL_Retx_Req(i,B) event to the retransmission requester, causing the latter FSM to send Retx-Request(i,B) messages to the VCMTP sender requesting retransmissions (see Fig. 5, and Tables 5 and 6). Retransmitted data blocks are received and copied into memory/disk by the data receiver FSM.

Notification of the application, on a per-file or batched basis, occurs through two events, API_File(i)_Received or API_File(i)_Receive_Failed, sent by the data receiver FSM (see Fig. 5). When all retransmitted data blocks have been received successfully, it sends an API_File(i)_Received event to the application and an INL_End_Of_Retx_Reqs(i) event to the retransmission requester FSM, which then sends an End-Of-Retx-Reqs(i) message to the VCMTP sender, as shown in Tables 5 and 6. If instead the data receiver FSM receives a Retx-Reject(i) message, it means that the receiver's request for a retransmission arrived at the sender after the sender's retransmission timer for the file expired. The data receiver responds to the Retx-Reject(i) message by sending an API_File(i)_Receive_Failed event to the application.

The close-out phase occurs when a user application chooses to leave a VCMTP multicast group by sending an API_Close_Receiver event to the data receiver, or when the sender closes the multicast session. In the first case, before it terminates itself, the data receiver sends an INL_Close_Retx_Requester event to the retransmission requester, and then sends an INL_Close_Unicast_Connection event to the network service to close the unicast connection. Upon receiving an INL_Unicast_Connection_Closed event from its network service, the data receiver terminates itself. In the second case, the data receiver FSM receives an INL_Unicast_Connection_Closed event from its network service, causing it to send an INL_Close_Retx_Requester event to the retransmission requester before terminating itself (see Tables 5 and 6).
TABLE 5: State Transition Table for Data Receiver FSM

Current State | Input Event | Output Event | Next State
NULL | API_Init_Receiver | INL_Create_Unicast_Connection | IDLE
IDLE | INL_Unicast_Connection_Created | INL_Init_Retx_Requester | RECVING-DATA
RECVING-DATA | BOF(i) | API_BOF(i)_Received | RECVING-DATA
RECVING-DATA | API_BOF_Response(i) | — | RECVING-DATA
RECVING-DATA | DATA_Block(i, b) | INL_BOF_Req(i) (BOF loss) or INL_Retx_Req(i,B) (data loss) or API_File(i)_Received | RECVING-DATA
RECVING-DATA | EOF(i) | INL_Retx_Req(i,B) (if data loss); INL_End_Of_Retx_Reqs(i) | RECVING-DATA
RECVING-DATA | Retx-Reject(i) | API_File(i)_Receive_Failed | RECVING-DATA
RECVING-DATA | API_Close_Receiver | INL_Close_Retx_Requester; INL_Close_Unicast_Connection | RECVING-DATA
RECVING-DATA | INL_Unicast_Connection_Closed | INL_Close_Retx_Requester (if sender-initiated closure) | NULL

TABLE 6: State Transition Table for Retransmission Requester FSM(j)

Current State | Input Event | Output Event | Next State
NULL | INL_Init_Retx_Requester | — | READY-TO-SEND
READY-TO-SEND | INL_Retx_Req(i,B) | Retx-Request(i,B) | READY-TO-SEND
READY-TO-SEND | INL_BOF_Req(i) | BOF-Request(i) | READY-TO-SEND
READY-TO-SEND | INL_End_Of_Retx_Reqs(i) | End-Of-Retx-Reqs(i) | READY-TO-SEND
READY-TO-SEND | INL_Close_Retx_Requester | — | NULL
4 MULTICAST NETWORK SERVICE INSTANTIATIONS

As described in Section 3, VCMTP requires an underlying multicast network service (which can be unreliable) and a reliable unicast connection service.

Fig. 6: VCMTP implementation options

Fig. 6 illustrates three potential options for the underlying network service: two use the left-hand side protocol stack, and the third uses the right-hand side protocol stack. On the left-hand side, the host network interface card is standard Ethernet, while on the right-hand side, the host needs a Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) [51] or InfiniBand (IB) adapter [52].

The left-hand side protocol stack can be executed across two types of multicast paths: (i) a Layer-2 switched path, in which all switches on the path (e.g., in the network topology of Fig. 2) perform packet forwarding (including at the multicast points) on Ethernet VLAN identifiers, MAC addresses, and/or MPLS labels, not on IP addresses; in this configuration, IP is used at the end points strictly to exploit the easy-to-use socket API; and (ii) an IP-multicast path, in which packet multicasting is performed on IP packet headers, i.e., the multicast nodes in Fig. 2 are IP routers. In both cases, a UDP socket is used by VCMTP to send file blocks over the multicast path, and TCP is used for the reliable unicast connections for loss recovery. In the IP-multicast configuration, in-sequence delivery is not guaranteed, which means a VCMTP receiver could receive duplicates (the original delayed block plus a retransmitted block sent in response to the retransmission request that the VCMTP receiver would issue soon after receiving an out-of-sequence block), but VCMTP is designed to drop duplicate blocks. Thus, even though VCMTP was designed to run on multicast VCs, it can be run over an IP-multicast path. This is not an ideal solution, but it is useful because of the ubiquity of inter-domain IP-routed service when compared to inter-domain virtual-circuit service.

The right-hand side protocol stack of Fig. 6 is designed to run VCMTP over RDMA networks, such as RoCE and IB [53]. This stack may be a better option for high-speed multicasting. Tierney et al. found through experiments that CPU utilization is significantly lower with RoCE than with TCP at speeds of 10 Gbps and higher [51]. This is because the InfiniBand (IB) transport- and network-layer protocols [52], which are part of RoCE, are implemented in hardware, while TCP/IP is implemented in software. InfiniBand is a packet-switched network that is typically used within the local area, though some WAN extensions have been proposed [54]. The RoCE adapter card implements Ethernet and is therefore connected to an Ethernet switched network. End-to-end virtual circuits, realized as Ethernet VLANs, VLANs over MPLS label switched paths, or other Virtual Private LAN Service (VPLS) options [12], can be used across the wide area to carry RoCE frames.

A VCMTP implementation designed for RoCE/IB would use the IB transport-layer unreliable multicast and reliable connection services (two of the several IB transport-layer service types). As shown in Fig. 6, the interface equivalent to sockets for RoCE/IB is called Verbs [55]. The verbs API includes operations such as RDMA Write, RDMA Read, and RDMA Send/Recv. RDMA relies on message-based communications and is asynchronous. The OpenFabrics Enterprise Distribution (OFED) software, provided by the OpenFabrics Alliance [56], implements the verbs API for RoCE and InfiniBand adapters produced by several manufacturers. RDMA leverages zero-copy transfer, whereby data is moved from the sender's user space directly to the channel adapter, bypassing the kernel. At the receiver, the data is copied directly by the receiver channel adapter to user space, again bypassing the kernel. The OFED kernel stack is used for control operations, but can be bypassed in the data-movement phase.
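As a concrete (and simplified) illustration of the left-hand side stack, the sketch below opens a UDP socket for multicasting file blocks to a Class-D group address; the function name, address, and port are placeholders, not VCMTP code. Retransmissions travel over separate per-receiver TCP connections.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdint>
#include <cstring>

// Sketch: UDP socket for multicasting file blocks to a Class-D group address.
int OpenMulticastSendSocket(const char* groupAddr /* e.g. "224.0.0.1" */,
                            uint16_t port, sockaddr_in& group)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   // UDP for the multicast path
    std::memset(&group, 0, sizeof(group));
    group.sin_family = AF_INET;
    group.sin_port   = htons(port);
    inet_pton(AF_INET, groupAddr, &group.sin_addr);
    return fd;  // then sendto(fd, block, len, 0, (sockaddr*)&group, sizeof(group))
}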
5 VCMTP PROTOTYPING

We implemented a VCMTP prototype in C++ and tested it on Linux systems. The implementation consists of two sets of components: (i) VCMTP sender and VCMTP receiver applications; and (ii) VCMTP library functions. In the FSMs described in Section 3, we referred to the layer beneath VCMTP as the "network service," and assumed that the network offers a multicast service and a reliable unicast connection service. Given the ease of socket programming, we used UDP and TCP sockets in our VCMTP prototype. Effectively, our prototype implements the left-hand side protocol stack of Fig. 6. It can be run across an MPLS virtual-circuit or an IP-routed wide-area network.

VCMTP sender and receiver applications The application software reads/writes files and calls VCMTP library functions (which are described next). The application software is multi-threaded, as illustrated in Fig. 7. On the sending side, the (single) multicasting thread and the (single) coordinator thread implement the multicast sender FSM and the coordinator FSM described in Section 3, respectively. For a multicast group consisting of N receivers, there are N retransmission threads in the VCMTP sender application; each thread implements the receiver-specific retransmitter FSM described in Section 3. Each receiving host runs a VCMTP receiver application process, which consists of two threads, as shown in Fig. 7: the receiving thread and the retransmission request thread implement the data receiver and retransmission requester FSMs described in Section 3, respectively. Fig. 7 also illustrates the interactions (see arrows) between the various threads of a VCMTP sender application and the threads of the multiple VCMTP receiver applications.

Fig. 7: VCMTP Prototype Implementation
VCMTP library functions The sender-side functions are
as follows. (i) The StartGroup function corresponds to the
API_Init_Sender event, which causes the initialization phase actions described in Section 3 to be executed.
The multicast sender FSM code implemented as part
of this function opens a UDP socket using a multicast-group (Class-D) IP address, which configures the IP and
Ethernet layers of the sending host. (ii) The SendMemoryData function corresponds to the API_Send(i) event
for a memory file. (iii) The SendFile function corresponds
to the same event, but for a disk file. Both these functions
implement the file multicasting and retransmission actions described for the sender FSMs in Section 3. (iv) The
GetNextEvent function supports the asynchronous API, and implements the actions described under "releasing the file
back to the user application” in Section 3. It fetches the
next event (e.g., API_File_Done) from the event queue
and returns control of the file back to the VCMTP sender
application. (v) The CloseSender function corresponds
to the API_Close_Sender event and implements the
close-out-phase tasks described in Section 3.
The receive-side functions are as follows. (i) The JoinGroup function opens a UDP socket with the multicast group IP address used by the corresponding
sender. (ii) The StartReceiver function corresponds to
the API_Init_Receiver event, and implements the
initialization, multicast file reception, and loss recovery
actions described for the VCMTP receiver FSMs in Section 3. (iii) The GetNextEvent function implements the
“notification of the application” actions related to the
asynchronous API of VCMTP on the receive side. (iv)
The LeaveGroup function executes the close-out phase
actions described for the VCMTP receiver FSMs in Section 3.
6 VCMTP EVALUATION
The VCMTP prototype was tested on the University of Utah's Emulab [21], which is a local-area computer cluster. The
underlying network service conformed to the left-hand
side protocol stack of Fig. 6. Since all the hosts used
in our experiments were in the same Ethernet-switched
network, packet forwarding was based on MAC addresses. The destination MAC address is automatically
derived from the Class-D (multicast) IP address, and
multicast VCMTP frames were received by the hosts
whose VCMTP receivers were configured to receive
packets with these destination MAC and IP addresses.
The IP layer is involved only at the hosts; it was used to
exploit the easy-to-use socket API as noted in Section 4.
Two experiments were executed. The goals of Experiment 1 were to compare VCMTP with parallel unicast
TCP connections (one per receiver) for a single large
file transfer in a no-loss environment, and to validate an
analytical model. The goal of Experiment 2 was to study
the performance of VCMTP when serving continuously
generated files as in the Unidata IDD project. In both
experiments, the VCMTP block size was set to 1428B (bytes), so that with the 32B VCMTP header, the 20B TCP header (since retransmissions are carried in TCP segments, this larger header size was considered instead of the 8B UDP header used in the multicast packets), and the 20B IP header, the packet would fit within the 1500B maximum transmission unit of an Ethernet frame.
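The arithmetic behind the chosen block size can be checked at compile time:

// Ethernet MTU budget for a retransmitted (TCP-carried) VCMTP block:
constexpr int kMTU = 1500, kVcmtpHdr = 32, kTcpHdr = 20, kIpHdr = 20;
static_assert(kMTU - kVcmtpHdr - kTcpHdr - kIpHdr == 1428,
              "1428B block plus headers fill a 1500B Ethernet frame");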
Fig. 8: Experiment 1 results. Total transfer time (seconds) vs. number of receivers (2 to 10) for a 1 GB file transfer using VCMTP vs. TCP (send rate: 600 Mbps, receive rate: 150 Mbps).
Experiment 1 The first experiment validated a model for large-file (size F) transfer time across no-loss, low-RTT (round-trip time) paths to n identical receivers. The transfer times using parallel unicast TCP connections and VCMTP, respectively, are given by:
$$ T_{tcp} = \max\left\{ \frac{nF}{l_s}, \frac{nF}{c_s}, \frac{F}{l_r}, \frac{F}{c_r} \right\}, \qquad T_{vcmtp} = \max\left\{ \frac{F}{l_s}, \frac{F}{c_s}, \frac{F}{l_r}, \frac{F}{c_r} \right\} \qquad (1) $$
where $l_s$, $l_r$, $c_s$, and $c_r$ correspond to the access-link rates at the sender and receiver(s), and the server capacities (processing rates and memory/disk access rates) at the sender and receiver(s), respectively. In $T_{tcp}$, the first two terms model the case in which the sender bottleneck rate ($l_s$ or $c_s$) is less than n times the receiver bottleneck rate, while the next two terms model the case in which the sender bottleneck rate is at least n times the receiver bottleneck rate; in the latter case, the sender can sustain n simultaneous transfers, each at the receiver bottleneck rate. The transfer time under VCMTP is independent of the number of receivers under the no-loss assumption.
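As a numerical check of (1) under the Experiment 1 settings (F = 1 GB, i.e., 8000 Mb; $l_s$ = 600 Mbps; $l_r$ = 150 Mbps; server capacities large enough not to bind), the sketch below reproduces the analytical values quoted in the next paragraph: Ttcp(8000, 6, ...) = 80 s and Tvcmtp(8000, ...) ≈ 53 s.

#include <algorithm>

// Direct transcription of equation (1); rates in Mbps, file size in Mb.
double Ttcp(double F, int n, double ls, double cs, double lr, double cr)
{
    return std::max({n * F / ls, n * F / cs, F / lr, F / cr});
}
double Tvcmtp(double F, double ls, double cs, double lr, double cr)
{
    return std::max({F / ls, F / cs, F / lr, F / cr});
}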
The results of the first experiment, in which the model assumptions held true, are shown in Fig. 8. The Linux tc facility was used to limit rates at the sender and receivers. This experiment was executed on Emulab hosts with 1 Gbps NICs, Intel quad-core Xeon CPUs, 12 GB memory, and commodity disks. Since the sending rate is four times the receiver rate, when the number of receivers is equal to or less than 4, the transfer times under VCMTP and parallel unicast TCP connections are almost the same. But with larger numbers of receivers, the transfer times are lower under VCMTP. This is consistent with the model of (1), and demonstrates the basic advantage of multicast communications. For example, with 6 receivers, the time to transfer a 1 GB file using TCP is computed analytically to be 80 sec (using (1) and a send rate of 600 Mbps as indicated in Fig. 8), while the experimental measurement was 86.2 sec; the analytical and experimental results for $T_{vcmtp}$ are 53 sec and 59.9 sec, respectively. The spread between the TCP delay and the VCMTP delay increases with the number of receivers, as seen in Fig. 8.
Experiment 2 Since files are generated and distributed continuously in the Unidata IDD project, the second experiment was designed to test VCMTP performance in this setting. Furthermore, unlike the no-loss environment of Experiment 1, in this second experiment losses were injected deliberately. The VCMTP solution was evaluated on two metrics: the throughput of "fast" receivers and the robustness of "slow" receivers, the distinction being that random packet losses were injected at slow receivers but not at fast receivers. For this experiment, low-end Emulab hosts with 100 Mbps NICs, 850 MHz Intel Pentium III processors, and 512 MB memory were selected, as it was easier to access large numbers of these hosts (when compared to the hosts with 1 Gbps NICs).

For each experimental run, a sample of 500 files was generated assuming the file arrival process to be Poisson
with a rate of 25 files/sec, and that file sizes fit the Pareto
distribution [57]. The shape parameter α was chosen
to be 2, and two values of the scale (minimum-value)
parameter k, 100 KB and 200 KB, were used. Traffic load
ρ is the mean service time multiplied by call arrival rate.
Since the mean for the Pareto distribution is αk/(α − 1)
for α > 1, the traffic load corresponding to k values of
100 and 200 KB are 0.4 and 0.8, respectively, assuming
the full link rate of 100 Mbps.
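For concreteness, a minimal sketch of this workload model, assuming NumPy; the arrival rate, shape parameter, and scale parameter come from the text above, while all variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

RATE = 25.0        # Poisson file arrival rate (files/sec)
ALPHA = 2.0        # Pareto shape parameter
K = 100e3          # Pareto scale (minimum file size) in bytes: 100 KB
LINK_MBPS = 100.0  # full link rate in Mbps
M = 500            # number of files per run

# Poisson arrival process: exponential inter-arrival times.
arrival_times = np.cumsum(rng.exponential(1.0 / RATE, size=M))

# Classical Pareto file sizes with minimum K: NumPy's pareto() draws the
# Lomax form, so shift by 1 and scale by K.
file_sizes = K * (1.0 + rng.pareto(ALPHA, size=M))

# Traffic load: mean service time multiplied by the arrival rate.
mean_service_sec = (file_sizes.mean() * 8 / 1e6) / LINK_MBPS
rho = RATE * mean_service_sec
print(f"rho = {rho:.2f}")  # ~0.4 for K = 100 KB, ~0.8 for K = 200 KB
```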
Experiments were run for eight configurations defined by two values of ρ, two values of the packet loss rate at slow receivers, and two values of the retransmission timeout factor (see Table 7). Packet losses were injected at random, according to a set packet loss rate, at a fixed fraction (40%) of the receivers (referred to as the "slow receivers"). The sender-side timer for the retransmission phase of a file was set to a multiple of the total multicast time for that file. Since file sizes are small, this factor was set to be larger than one; the specific values chosen were 10 and 50.
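The two experimental knobs just described can be summarized in a short sketch; this is an illustration with hypothetical names, not code from the VCMTP prototype.

```python
import random

LOSS_RATE = 0.05       # injected packet loss rate at slow receivers (5% or 10%)
TIMEOUT_FACTOR = 10    # retransmission timeout factor (10 or 50)

def drop_packet(at_slow_receiver: bool) -> bool:
    """Decide whether to drop an incoming multicast packet; losses are
    injected only at the 40% of receivers designated as slow."""
    return at_slow_receiver and random.random() < LOSS_RATE

def retx_timeout_sec(multicast_time_sec: float) -> float:
    """Sender-side retransmission-phase timer for one file: a multiple
    of the time spent multicasting that file."""
    return TIMEOUT_FACTOR * multicast_time_sec
```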
TABLE 7: Experiment 2 results (continuous file transfers)

Config.   ρ     Loss Rate   Timeout Factor   Robustness R as a percentage (SD)          Throughput Γ in Mbps (SD)
                                             n = 10       n = 20       n = 30           n = 10       n = 20       n = 30
1         0.4   5%          10               86.3 (2.3)   83.4 (0.4)   81.4 (0.7)       92.8 (1.1)   90.5 (0.7)   86.9 (0.7)
2         0.4   5%          50               98.9 (0.8)   98.1 (0.7)   97.5 (0.3)       92.7 (0.8)   89.2 (0.7)   85.8 (0.8)
3         0.4   10%         10               79.9 (2.3)   74.0 (1.2)   65.2 (0.5)       90.5 (0.8)   85.1 (0.6)   82.3 (1.0)
4         0.4   10%         50               96.2 (1.9)   94.0 (1.6)   85.0 (1.3)       89.9 (0.5)   84.9 (0.7)   80.3 (1.6)
5         0.8   5%          10               35.0 (4.7)   25.9 (3.5)   15.7 (3.8)       93.9 (1.0)   92.0 (0.5)   91.7 (0.8)
6         0.8   5%          50               68.2 (6.0)   60.3 (5.5)   55.8 (6.0)       92.1 (0.8)   88.9 (2.7)   88.4 (1.5)
7         0.8   10%         10               22.9 (4.4)   11.6 (3.2)   10.6 (1.5)       93.6 (0.3)   91.1 (0.6)   89.2 (1.2)
8         0.8   10%         50               56.3 (9.5)   53.3 (1.7)   50.2 (3.6)       92.7 (1.5)   88.1 (1.8)   85.0 (1.6)

Table 7 shows the results for experiments with 10, 20 and 30 receivers. For each configuration, 5 runs were executed. Means and standard deviations, computed from measurements taken in the 5 runs, are shown for robustness and throughput in Table 7. Robustness, R, and throughput, Γ, are defined as

$$R = \frac{\sum_{j=1}^{n_s} \sum_{i=1}^{m} S_{ij}}{m \times n_s}, \qquad \Gamma = \frac{\sum_{j=1}^{n_f} \sum_{i=1}^{m'} (F_i / T_{ij})}{m' \times n_f}, \qquad n = n_s + n_f \qquad (2)$$

where $S_{ij}$ is an indicator variable that is set to 1 if file i was successfully received by receiver j and 0 otherwise (recall that when the sender-side retransmission timer for a particular file times out, all subsequent retransmission requests for blocks of that file are rejected), m is 500 (the number of files in a run), m′ is the number of files larger than 500 KB (to reduce timer-precision error, measurements for smaller files were dropped), $n_s$ is the number of slow receivers, $n_f$ is the number of fast receivers, $F_i$ is the size of file i, and $T_{ij}$ is the time taken for the reception of file i at receiver j.
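Given per-file logs, the metrics in (2) could be computed along the following lines; a minimal sketch with hypothetical data layouts, not code from the VCMTP prototype.

```python
def robustness(success):
    """R per (2): success[j][i] is 1 if slow receiver j successfully
    received file i (n_s rows of m entries each)."""
    n_s, m = len(success), len(success[0])
    return sum(sum(row) for row in success) / (m * n_s)

def throughput_mbps(file_bits, recv_times):
    """Gamma per (2): file_bits[i] is the size of large file i in bits;
    recv_times[j][i] is fast receiver j's reception time in seconds for
    file i (n_f rows of m' entries each). Result in Mbps."""
    n_f, m_prime = len(recv_times), len(file_bits)
    total = sum(F / T for row in recv_times for F, T in zip(file_bits, row))
    return total / (m_prime * n_f) / 1e6
```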
The main observations from the results of Table 7 are as follows. First, the number of receivers affects both robustness and throughput in this continuously-arriving-files scenario. Since the ratio of slow receivers is fixed at 40%, the number of slow receivers increases when the total number of receivers n is increased. Consequently, a larger number of retransmission threads compete for sender-side computing and network resources, and hence robustness decreases. For the same reason, throughput is also impacted adversely. Second, as traffic intensity ρ increases or the loss rate increases, both robustness and throughput decrease, for the same resource-contention reason. To meet given robustness and throughput objectives, the server capacity at the sender should be selected based on three parameters: the number of receivers, the traffic intensity, and the measured loss rate. Individual receivers that suffer high loss rates should be moved to a lower-rate virtual circuit, if available, to reduce their influence on the throughput of other receivers. Finally, the sender-side retransmission timeout factor offers administrators a knob for trading off robustness against throughput for a given combination of sender, set of receivers, and traffic intensity. By increasing the retransmission timeout factor, robustness can be increased at a cost to throughput. For example, compare the results for Configurations 1 and 2 with n = 30 in Table 7: for a small drop in throughput (86.9 to 85.8 Mbps on average), robustness can be increased significantly, from 81.4 to 97.5% (on average), just by increasing the retransmission timeout factor from 10 to 50. As another example, consider Configuration 7 of Table 7, with a 0.8 traffic intensity and a 10% packet loss rate. In this case, robustness is very low, at 10.6%, when there are 30 receivers. Increasing the retransmission timeout factor to 50 (moving to Configuration 8) increases robustness, but only to 50.2%. Since such low robustness is likely to be unacceptable, either some of the slow receivers should be moved to a lower-rate virtual circuit to reduce their packet loss rate, or more server/network capacity is required at the sender.
7 CONCLUSIONS

Motivated by the need for efficient scientific data distribution to multiple receivers, we designed, prototyped, and evaluated a reliable message multicast transport-layer protocol called VCMTP. The growth in dynamic virtual circuit (VC) services, certain advantages of multicast VCs over multicast IP, and the continuous nature of meteorology data were factors in assuming a VC-based underlying network service for VCMTP. An analytical model was developed for TCP- and VCMTP-based single-file distribution, and validated experimentally. For continuously generated files, a key design aspect of VCMTP is the tradeoff between file-delivery throughput for fast receivers and robustness for slow receivers. A VCMTP configurable parameter called the retransmission timeout factor can be adjusted to trade off these two metrics. For a traffic load of 0.4 and a multicast group with 30 receivers, robustness can be increased significantly, from 81.4 to 97.5%, by increasing the retransmission timeout factor from 10 to 50; the corresponding drop in average throughput for fast receivers is small (86.9 to 85.8 Mbps).
8 FUTURE WORK

Our planned future work items include: (a) a VCMTP-over-RDMA implementation of the right-hand-side protocol stack of Fig. 6; and (b) experimentation with the current VCMTP implementation on wide-area virtual circuits using DYNES and Internet2's dynamic circuit service.
ACKNOWLEDGMENTS

The authors would like to thank our funding agencies. This work was supported by NSF grants OCI-1038058, OCI-1127340, and OCI-1127228, and by DOE grant DE-SC0007341.
REFERENCES

[1] Internet Data Distribution. [Online]. Available: http://www.unidata.ucar.edu/software/idd/
[2] Local Data Manager. [Online]. Available: http://www.unidata.ucar.edu/software/ldm/
[3] Earth System Grid Federation (ESGF). [Online]. Available: http://esgf.org/
[4] The Large Hadron Collider. [Online]. Available: http://lhc.web.cern.ch/lhc/
[5] D. Li, A. Desai, Z. Yang, K. Mueller, S. Morris, and D. Stavisky, "Web content caching and distribution," F. Douglis and B. D. Davison, Eds. Norwell, MA, USA: Kluwer Academic Publishers, 2004, ch. Multicast cloud with integrated multicast and unicast content distribution routing, pp. 109–118.
[6] S. Ratnasamy, A. Ermolinskiy, and S. Shenker, "Revisiting IP multicast," in Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ser. SIGCOMM '06. New York, NY, USA: ACM, 2006, pp. 15–26. [Online]. Available: http://doi.acm.org/10.1145/1159913.1159917
[7] B. Quinn and K. Almeroth, "IP Multicast Applications: Challenges and Solutions," RFC 3170 (Informational), Internet Engineering Task Force, Sep. 2001. [Online]. Available: http://www.ietf.org/rfc/rfc3170.txt
[8] B. Fenner, M. Handley, H. Holbrook, and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)," RFC 4601 (Proposed Standard), Internet Engineering Task Force, Aug. 2006, updated by RFCs 5059, 5796, 6226. [Online]. Available: http://www.ietf.org/rfc/rfc4601.txt
[9] B. Fenner and D. Meyer, "Multicast Source Discovery Protocol (MSDP)," RFC 3618 (Experimental), Internet Engineering Task Force, Oct. 2003. [Online]. Available: http://www.ietf.org/rfc/rfc3618.txt
[10] B. Cain, S. Deering, I. Kouvelas, B. Fenner, and A. Thyagarajan, "Internet Group Management Protocol, Version 3," Internet Engineering Task Force, RFC 3376, Oct. 2002.
[11] R. Vida and L. Costa, "Multicast Listener Discovery Version 2 (MLDv2) for IPv6," RFC 3810 (Proposed Standard), Internet Engineering Task Force, Jun. 2004, updated by RFC 4604. [Online]. Available: http://www.ietf.org/rfc/rfc3810.txt
[12] K. Kompella and Y. Rekhter, "Virtual Private LAN Service (VPLS) Using BGP for Auto-Discovery and Signaling," RFC 4761 (Proposed Standard), Internet Engineering Task Force, Jan. 2007, updated by RFC 5462. [Online]. Available: http://www.ietf.org/rfc/rfc4761.txt
[13] ESnet. [Online]. Available: http://www.es.net/
[14] Internet2. [Online]. Available: http://www.internet2.edu/
[15] GÉANT Plus Connectivity Service. [Online]. Available: http://www.geant.net/Services/ConnectivityServices/Pages/home.aspx
[16] Next Generation Network Testbed JGN-X. [Online]. Available: http://www.jgn.nict.go.jp/english/index.html
[17] MRI-R2 Consortium: Development of Dynamic Network System (DYNES). [Online]. Available: http://www.internet2.edu/ion/dynes.html
[18] M. Veeraraghavan, M. Karol, and G. Clapp, "Optical dynamic circuit services," IEEE Communications Magazine, vol. 48, pp. 109–117, Nov. 2010.
[19] J. Li, M. Veeraraghavan, M. Manley, and S. Emmerson, "Analysis and selection of a network service for a scientific data distribution project," in International Conference on Communications, Mobility, and Computing (CMC), May 21-23, 2012.
[20] Z. Liu, M. Veeraraghavan, Z. Yan, C. Tracy, J. Tie, I. Foster, J. Dennis, J. Hick, Y. Li, and W. Yang, "On using virtual circuits for GridFTP transfers," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, pp. 81:1–81:11.
[21] Emulab. [Online]. Available: http://www.emulab.net/
[22] S. Paul, K. K. Sabnani, J. C.-H. Lin, and S. Bhattacharyya, "Reliable Multicast Transport Protocol (RMTP)," IEEE Journal on Selected Areas in Communications, vol. 15, p. 407, Apr. 1997.
[23] M. P. Barcellos, M. Nekovee, M. Koyabe, M. Daw, and J. Brooke, "Evaluating high-throughput reliable multicast for grid applications in production networks," in Cluster Computing and the Grid, 2005, pp. 442–449.
[24] M. den Burger and T. Kielmann, "Collective receiver-initiated multicast for grid applications," Parallel and Distributed Systems, IEEE Transactions on, vol. 22, no. 2, pp. 231–244, 2011.
[25] P. Karunakaran, H. Bagheri, and M. Katz, "Energy efficient multicast data delivery using cooperative mobile clouds," in European Wireless, 2012. EW. 18th European Wireless Conference, 2012, pp. 1–5.
[26] D. Li, Y. Li, J. Wu, S. Su, and J. Yu, "ESM: Efficient and scalable data center multicast routing," Networking, IEEE/ACM Transactions on, vol. 20, no. 3, pp. 944–955, 2012.
[27] T. Chiba, M. den Burger, T. Kielmann, and S. Matsuoka, "Dynamic load-balanced multicast for data-intensive applications on clouds," in Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, 2010, pp. 5–14.
[28] J. S. Turner, "Extending ATM networks for efficient reliable multicast," 1996. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.6034
[29] H. Ma and M. El Zarki, "A new transport protocol for broadcasting/multicasting MPEG-2 video over wireless ATM access networks," Wirel. Netw., vol. 8, no. 4, pp. 371–380, Jul. 2002. [Online]. Available: http://dx.doi.org/10.1023/A:1015534506045
[30] IETF Reliable Multicast Transport Working Group. [Online]. Available: http://datatracker.ietf.org/wg/rmt/charter/
[31] M. Luby, M. Watson, and L. Vicisano, "Asynchronous Layered Coding (ALC) Protocol Instantiation," RFC 5775 (Proposed Standard), Internet Engineering Task Force, Apr. 2010. [Online]. Available: http://www.ietf.org/rfc/rfc5775.txt
[32] B. Adamson, C. Bormann, M. Handley, and J. Macker, "NACK-Oriented Reliable Multicast (NORM) Transport Protocol," Internet Engineering Task Force, RFC 5740, 2009. [Online]. Available: http://tools.ietf.org/html/rfc5740
[33] M. Luby, A. Shokrollahi, M. Watson, T. Stockhammer, and L. Minder, "RaptorQ Forward Error Correction Scheme for Object Delivery," Internet Engineering Task Force, RFC 6330, Aug. 2011. [Online]. Available: http://tools.ietf.org/html/rfc6330
[34] L. Rizzo and L. Vicisano, "A reliable multicast data distribution protocol based on software FEC techniques," in High-Performance Communication Systems, 1997 (HPCS '97), The Fourth IEEE Workshop on, 1997, pp. 116–125.
[35] S. Acharya, M. Franklin, and S. Zdonik, "Dissemination-based data delivery using broadcast disks," Personal Communications, IEEE, vol. 2, no. 6, pp. 50–60, Dec. 1995.
[36] J. W. Byers, M. Luby, and M. Mitzenmacher, "A digital fountain approach to asynchronous reliable multicast," IEEE Journal on Selected Areas in Communications, vol. 20, pp. 1528–1540, 2002.
[37] S. Floyd, V. Jacobson, C.-G. Liu, S. McCanne, and L. Zhang, "A reliable multicast framework for light-weight sessions and application level framing," in ACM SIGCOMM, Aug. 1995, p. 342.
[38] C. Bormann, J. Ott, H.-C. Gehrcke, T. Kerschat, and N. Seifert, "MTP-2: Towards achieving the S.E.R.O. properties for multicast transport," in International Conference on Computer Communications and Networks (ICCCN), Sep. 1994.
[39] A. Koifman and S. Zabele, "RAMP: A reliable adaptive multicast protocol," in IEEE Infocom '96, Mar. 1996, p. 1142.
[40] M. Beck, Y. Ding, E. Fuentes, and S. Kancherla, "An exposed approach to reliable multicast in heterogeneous logistical networks," in Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, 2003, pp. 526–533.
[41] S. Armstrong, A. Freier, and K. Marzullo, "Multicast Transport Protocol," RFC 1301 (Informational), Internet Engineering Task Force, Feb. 1992. [Online]. Available: http://www.ietf.org/rfc/rfc1301.txt
[42] B. Whetten and G. Taskale, "An overview of reliable multicast transport protocol II," Network, IEEE, vol. 14, no. 1, pp. 37–47, Jan./Feb. 2000.
[43] S. McCanne, V. Jacobson, and M. Vetterli, "Receiver-driven layered multicast," in Conference Proceedings on Applications, Technologies, Architectures, and Protocols for Computer Communications, ser. SIGCOMM '96. New York, NY, USA: ACM, 1996, pp. 117–130. [Online]. Available: http://doi.acm.org/10.1145/248156.248168
[44] S. Wen, J. Griffioen, and K. L. Calvert, "Building multicast services from unicast forwarding and ephemeral state," Computer Networks, vol. 38, no. 3, pp. 327–345, 2002.
[45] JGroups - a toolkit for reliable multicast communication. [Online]. Available: http://www.jgroups.org
[46] R. Stewart, "Stream Control Transmission Protocol," Internet Engineering Task Force, RFC 4960, Sep. 2007. [Online]. Available: http://tools.ietf.org/html/rfc4960
[47] J. Widmer and M. Handley, "TCP-Friendly Multicast Congestion Control (TFMCC): Protocol Specification," Internet Engineering Task Force, RFC 4654, Aug. 2006. [Online]. Available: http://tools.ietf.org/html/rfc4654
[48] J. Li and M. Veeraraghavan, "A reliable message multicast transport protocol for virtual circuits," in International Conference on Communications, Mobility, and Computing (CMC), May 21-23, 2012.
[49] D. D. Clark and D. L. Tennenhouse, "Architectural considerations for a new generation of protocols," in Proceedings of the ACM Symposium on Communications Architectures & Protocols, ser. SIGCOMM '90. New York, NY, USA: ACM, 1990, pp. 200–208. [Online]. Available: http://doi.acm.org/10.1145/99508.99553
[50] J. Li, M. Veeraraghavan, S. Emmerson, and R. D. Russell, "Virtual Circuit Multicast Transport Protocol (VCMTP) Design Document." [Online]. Available: http://www.ece.virginia.edu/mv/research/EAGER/documents/documents.html
[51] B. Tierney, E. Kissel, M. Swany, and E. Pouyoul, "Efficient data transfer protocols for big data," in 2012 IEEE 8th International Conference on E-Science (e-Science), Oct. 2012, pp. 1–9.
[52] InfiniBand Trade Association. (2007, Nov.) InfiniBand Architecture Specification Volume 1, Release 1.2.1. [Online]. Available: http://infinibandta.org
[53] Overview of UNH EXS 1.3.0 for Programmers. [Online]. Available: https://www.iol.unh.edu/services/research/unh-exs/exs-overview.pdf
[54] H. Subramoni, P. Lai, R. Kettimuthu, and D. Panda, "High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand," in Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, 2010, pp. 557–564.
[55] RDMA Aware Networks Programming User Manual, Rev 1.4. [Online]. Available: http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
[56] OpenFabrics Alliance. [Online]. Available: http://www.openfabrics.org, 2009.
[57] V. Paxson and S. Floyd, "Wide area traffic: the failure of Poisson modeling," Networking, IEEE/ACM Transactions on, vol. 3, no. 3, pp. 226–244, 1995.

Jie Li received the BS degree in Electrical Engineering from Tsinghua University, China, in 2006, and the MS degree in Computer Engineering from the University of Virginia, USA, in 2008. He is currently a PhD student in Computer Engineering at the University of Virginia. His research interests include inter-domain routing and addressing, future Internet architecture, and virtual circuit services.

Malathi Veeraraghavan (M'88-SM'97) is a Professor in the Charles L. Brown Department of Electrical and Computer Engineering at the University of Virginia. After a ten-year career at Bell Laboratories, she served on the faculty at Polytechnic University, Brooklyn, New York, from 1999 to 2002. She served as Director of the Computer Engineering Program at UVa from 2003 to 2006. She holds twenty-nine patents, has over 90 publications, and has received six best-paper awards. Most recently, she served as the Technical Program Committee Co-Chair for the NGN Symposium at IEEE ICC 2013.

Steve Emmerson received a BSc in Physics from the University of California at Irvine and an MSc in Physical Oceanography from the University of Miami, and currently works as a software engineer for the University Corporation for Atmospheric Research. His interests include near real-time data distribution via the Internet, scientific data analysis, and scientific data visualization. He has authored four papers and is a member of the ACM and AGU.

Robert D. Russell (Ph.D. '72) is an Associate Professor in the Computer Science Department and InterOperability Laboratory at the University of New Hampshire. His interests are in Operating Systems, High Performance Networking, and Storage. He is also the principal developer and instructor for the OpenFabrics Alliance (OFA) training course entitled "Writing Application Programs for RDMA Using OFA Software."