Adaptive Message Pipelining for Network Memory and Network Storage

Kenneth G. Yocum, Darrell C. Anderson, Jeffrey S. Chase, Syam Gadde,
Andrew J. Gallatin, Alvin R. Lebeck
{grant, anderson, chase, gadde, gallatin, alvy}@cs.duke.edu
Department of Computer Science
Duke University
Durham, NC 27708-0129
(http://www.cs.duke.edu/ari/trapeze)

June 3, 1998
Abstract
Recent advances in cluster file systems, network memory, and network-attached disks make it possible to construct storage systems whose performance tracks network technology rather than disk technology. However, delivering the potential of high-speed networks for network storage systems depends on communication support that meets client demands for low network latency for random block accesses and high bandwidth for streaming block accesses.
This paper describes an adaptive message pipelining technique that reduces network latency of messages larger than 1 KB while delivering the full bandwidth of gigabit-per-second networks and PCI I/O buses. We report on experiences with the Trapeze messaging system, which implements adaptive message pipelining in firmware for a Myrinet network interface. We show that the Trapeze approach is superior to other message pipelining alternatives, and illustrate its benefits with application experiments using the Global Memory Service, a network memory system on a Myrinet/Alpha cluster. Message pipelining reduces GMS/Trapeze 8KB page fault latency by 38% to 165 μs, and yields file read bandwidths up to 97 MB/s.
1 Introduction
Recent advances in distributed file systems and network memory systems have produced storage systems whose performance tracks the rapid advances in network technology rather than the slower rate of advances in disk technology. With gigabit-per-second networks, a fetch request for a faulted page or file block can complete up to two orders of magnitude faster from remote memory than from
a local disk (assuming a seek). Moreover, a storage system built from disks distributed through the network (e.g., attached to dedicated servers [13, 17], cooperating peers [3, 18], or the network itself [10]) is incrementally scalable and can source and sink data to and from individual clients at network speeds.

This work is supported in part by NSF grant CDA-95-12356, Duke University, Intel, Myricom, and the Open Group. Chase and Lebeck are partially supported by NSF CAREER awards CCR-96-24857 and MIP-97-02547.
These systems shift the storage access bottleneck from the disk to the network. However, network storage access presents different challenges from other driving applications of high-speed networks. While small-message latency is still important, server throughput and client I/O stall times are determined largely by the latency and bandwidth of data transfer messages carrying file blocks or memory pages in the 1 KB to 16 KB range. The relative importance of latency and bandwidth varies
with workload. A client issuing random fetch requests requires low latency; other clients may be
bandwidth-limited due to multiprogramming, prefetching, or write-behind.
Reconciling these conflicting demands requires careful attention to data movement through the
messaging system and network interface. One way to achieve high bandwidth is to send larger packets,
reducing per-packet overheads. On the other hand, a key technique for achieving low latency is to
fragment messages and pipeline the fragments through the network, overlapping transfers on the
network links and I/O buses. Since it is not possible to do both at once, systems must select which
strategy to use.
This paper describes an adaptive message pipelining scheme that reduces network latency of
messages larger than 1 KB, while delivering the full bandwidth of gigabit-per-second networks and
I/O buses under load. We report performance results from an implementation of adaptive message
pipelining in Trapeze, a messaging system designed for network storage access through 1.28 Gb/s
Myrinet networks [4]. Message pipelining in Trapeze is supported by a simple local policy called cut-through delivery implemented within the network interface card (NIC), eliminating host overheads
for message fragmentation and reassembly. We show that cut-through delivery automatically adjusts
to congestion, yielding peak bandwidth for streams of messages without the need for a separate bulk
data transfer mechanism. We compare Trapeze to recently proposed message pipelining strategies
and their Myrinet-based implementations where they exist. Our results show that while one recent
system (BIP [15]) delivers latencies comparable to Trapeze, Trapeze delivers significantly higher
bandwidth.
Our goal with Trapeze is to define the performance bounds for network storage systems using
Myrinet. We report on experiments using Trapeze in a kernel-based network memory system, the
Global Memory Service (GMS) [9]. In our GMS experiments, all external data access is to network memory (remote memory), stressing network performance by removing disks from the critical
path. On a Myrinet/Alpha cluster, Trapeze reduces the time to handle an 8KB page fault in GMS
from 576 μs to 165 μs, cutting I/O stall times by 72% over IP-based messaging using vendor-supplied Myrinet firmware. This is a 90% fetch latency reduction over the original ATM-based GMS prototype, and a 70% reduction over a version that uses subpages to reduce latency [11]. Moreover, GMS/Trapeze can deliver file access bandwidth just under 98 MB/s (throughout this paper, a megabyte is one million bytes unless otherwise specified) for a single process reading a memory-mapped file sequentially from network memory.
One contribution of this paper is to show that gigabit-per-second networks and adaptive message
pipelining allow network memory systems to attain previously unreachable levels of performance,
enabling practical use of network memory for large-dataset problems. We now evaluate network
memory performance in terms of slowdown from internal memory configurations, rather than speedup from disk-bound configurations. Some demanding large-dataset applications show modest slowdowns below 50%, even on 500 MHz Alpha CPUs, and recent prefetching extensions to GMS [18, 1] promise
to reduce this gap further. These more aggressive performance goals emphasize the importance of
low-overhead file systems and virtual memory systems that view the network as the primary access
path to external storage.
This paper is organized as follows. Section 2 gives an overview of message pipelining in Trapeze
and its use in GMS. Section 3 compares several message pipelining alternatives, and includes experimental results comparing Trapeze to leading Myrinet messaging systems. Section 4 presents
application experiments with GMS/Trapeze. We conclude in Section 5.
2 Adaptive Message Pipelining in Trapeze
This section outlines the Trapeze messaging system and its use in the Global Memory Service (GMS).
We motivate message pipelining, define the adaptive message pipelining scheme implemented in
Trapeze, and discuss related work.
[Figure 1 depicts the in-kernel configuration: user applications and user data pages in user space; the socket layer, TCP/IP stack, file/VM system, Global Memory Service, GMS RPC, Trapeze network driver, and the raw Trapeze message layer in the host kernel above the system page frame pool; and, across the PCI bus, the NIC running the Trapeze firmware (Trapeze-MCP) with its send ring, receive ring, and incoming payload table.]
Figure 1: In-kernel Trapeze for GMS network memory and TCP/IP networking.
Trapeze is implemented as host-side messaging software and a firmware program for Myrinet NICs. Like other Myrinet-based messaging systems (e.g., Hamlyn [5], VMMC-2 [8], Active Messages [6], FM [14], and BIP [15]), Trapeze can be used as a memory-mapped network interface allowing user processes to access the network without kernel intervention. However, Trapeze is designed primarily for in-kernel use in the configuration depicted in Figure 1, in which the raw Trapeze layer hosts an RPC-like package (gms_rpc) for fast kernel-to-kernel messaging in GMS, alongside a network device driver servicing the conventional TCP/IP protocol stack. The raw Trapeze layer interacts with the NIC through two circular producer-consumer message queues (message rings) that reside in SRAM on the NIC.
For GMS, protocol operations that move a virtual memory page or file block are the most critical for performance. GMS clients fetch pages or blocks from remote nodes using a getpage request/response exchange driven by local page faults, misses in the file cache, or internal prefetching policies. GMS uses its putpage and movepage operations to write or migrate a page or block to a remote storage server or caching site. Most messages in GMS are sent as short, fixed-size Trapeze control messages; a getpage response or putpage/movepage request consists of a control message carrying the 8KB page or file block contents piggybacked on the packet as a Trapeze payload. The payload buffering scheme is designed to allow low-overhead, copy-free movement of pages and blocks between the system file/VM page cache and the network interface; Trapeze supports early demultiplexing of getpage replies using an incoming payload table on the NIC. These features are described in our previous work [1].

Network and Messaging System          Latency (μs)   Incremental Benefit
IP over 100 Mb/s Ethernet (Tulip)     1287           0
IP/Myrinet (standard firmware)        578.9          55%
IP/Myrinet (Trapeze firmware)         427.5          26%
Raw Trapeze w/o pipelining            269.5          37%
Raw Trapeze w/ pipelining             165.6          38%

Table 1: GMS remote 8KB page fault times for Alpha/Myrinet.
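As a concrete illustration of this message structure, the sketch below shows roughly how a control message with an optional piggybacked payload might be laid out. The field names, sizes, and operation codes here are purely hypothetical illustrations; they are not the actual Trapeze or GMS headers, which are described in [1].

    /* Hypothetical sketch of a GMS-style control message with an optional
     * piggybacked payload, in the spirit of the Trapeze messages described
     * above.  Field names and sizes are illustrative assumptions only. */
    #include <stdint.h>

    #define PAYLOAD_BYTES 8192              /* one VM page or file block */

    enum gms_op {                           /* hypothetical operation codes */
        GMS_GETPAGE_REQ,
        GMS_GETPAGE_REPLY,
        GMS_PUTPAGE,
        GMS_MOVEPAGE
    };

    struct gms_ctrl_msg {                   /* short, fixed-size control message */
        uint32_t op;                        /* one of enum gms_op */
        uint32_t req_id;                    /* matches a reply to its request */
        uint64_t page_id;                   /* global identifier of the page or block */
        uint32_t payload_len;               /* 0 for pure control messages */
        uint32_t payload_token;             /* hypothetical index into the NIC's incoming
                                               payload table, letting the receiving NIC DMA
                                               the payload directly into a waiting frame */
    };

    /* On the wire, a getpage reply would be the control message above with an
     * 8 KB payload attached; the payload moves to and from system page frames
     * without copying, as described above and in [1]. */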
2.1 Motivation for Message Pipelining
Our goal is to deliver the lowest latency and highest bandwidth for Trapeze payloads, in order to
minimize I/O stall times. Two obvious ways to reduce network latency are (1) use a faster network
and (2) minimize overheads due to copying or protocol processing. Table 1 shows the effect of these improvements on 8KB remote page fault times in GMS under Digital Unix 4.0 on a Myrinet/Alpha cluster (see the note on platforms following Figure 2). The message pipelining techniques described in this paper yield an incremental 38% improvement after these other optimizations have been applied, achieving 8KB page fetch latencies of 165.6 μs.
[Figure 2 depicts a payload transfer from the sending host's memory, across the sender's PCI I/O bus and host/PCI bridge into the sending NIC's SRAM buffer (HostTx), over the Myrinet network links (NetTx to NetRcv), into the receiving NIC's SRAM buffer, and across the receiver's PCI into host memory (HostRcv). The three pipeline stages are: stage 1, sender PCI; stage 2, network links; stage 3, receiver PCI. Each NIC contains a LANai CPU.]
Figure 2: Data transfers for a Trapeze/Myrinet payload, and the corresponding pipeline model.
Note on GMS and IP measurements: in the GMS and IP results in this paper, receivers (e.g., GMS clients) are 500 MHz PWS 500au (Miata) systems, and senders (e.g., GMS servers) are 266 MHz AlphaStation 500 (Alcor) systems. We always use Alcor and Miata in pairs to compensate for asymmetry in their delivered I/O bus bandwidths.
Message pipelining reduces latency by overlapping stages of the "wire time" for Trapeze payloads. A payload send/receive requires four DMA transfers through DMA engines on the NIC (Figure 2): (1) host memory to NIC memory across the sender's PCI I/O bus, (2) sender's NIC to network link, (3) network link to receiver's NIC, and (4) NIC to host memory across the receiver's PCI. We refer to the corresponding NIC DMA engines as HostTx, NetTx, NetRcv, and HostRcv respectively. We ignore transfers through components that never introduce store-and-forward delays, such as the host/PCI bridges, the NetRcv engine, and the wormhole-routed Myrinet network switches. The NIC firmware controls all transfers between stages in the resulting three-stage pipeline model.
[Figure 3 plots the HostTx, NetTx, and HostRcv transfer schedules over time (microseconds) for each of the three strategies.]
Figure 3: Data transfer and latencies for an 8KB Myrinet message sent using store-and-forward, optimal fixed fragmentation, and optimal variable fragmentation, assuming G_i = 132 MB/s (33 MHz 32-bit PCI) and g_i = (2 μs, 4 μs, 2 μs).
Figure 3 shows how message pipelining lowers latency. The naive approach (a) is to send the message as a single packet with a store-and-forward delay between each pipeline stage, as the standard Myrinet firmware does when configured with a maximum packet size (MTU) large enough for the entire message. Sending the message as a sequence of separate transfers hides latency by overlapping the transfer time in the different stages. The strategy may use fixed transfer sizes as in Figure 3 (b) or variable sizes as in (c).

These diagrams depict fragmentation, the most common pipelining approach. Fragments are sent as separate packets on the network and reassembled at the destination; transmission of a fragment through pipeline stage i cannot begin until the entire fragment exits stage i − 1. Using the analysis in Wang et al. [19], we can determine the optimal strategy for both fixed and variable fragmentation strategies, given two parameters for each stage i: the fixed per-transfer overhead g_i, and the inverse bandwidth (time per unit of data) G_i. Figure 3 (b) and (c) depict fragment schedules derived from this analysis.
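To make this model concrete, the following C sketch applies the fragment-pipeline recurrence to the parameters assumed in Figure 3 (132 MB/s on all three stages, g = (2 μs, 4 μs, 2 μs)), computing the store-and-forward latency and searching for the best fixed fragment count for an 8 KB message. It is our own illustration of the model, not code from [19], and the numbers it prints are model predictions rather than measurements.

    /* Sketch: latency of an 8 KB message under store-and-forward and
     * fixed-size fragmentation in the three-stage pipeline of Section 2.1.
     * Parameters follow Figure 3; this is a model, not a measurement. */
    #include <stdio.h>

    #define STAGES 3

    static const double g_us[STAGES]    = { 2.0, 4.0, 2.0 };       /* per-transfer overhead, us */
    static const double bw_MBps[STAGES] = { 132.0, 132.0, 132.0 }; /* 1 MB/s == 1 byte/us */

    /* Time for one fragment of 'bytes' to cross stage i. */
    static double stage_time(int i, double bytes)
    {
        return g_us[i] + bytes / bw_MBps[i];
    }

    /* Latency of a message split into nfrag equal fragments, each handled
     * store-and-forward at every stage (the fragmentation model above). */
    static double frag_latency(double msg_bytes, int nfrag)
    {
        double frag = msg_bytes / nfrag;
        double finish[STAGES] = { 0.0, 0.0, 0.0 };  /* last completion time per stage */

        for (int k = 0; k < nfrag; k++) {
            double prev = 0.0;                      /* fragment k is ready in host memory at t = 0 */
            for (int i = 0; i < STAGES; i++) {
                double start = prev > finish[i] ? prev : finish[i];
                finish[i] = start + stage_time(i, frag);
                prev = finish[i];
            }
        }
        return finish[STAGES - 1];
    }

    int main(void)
    {
        const double msg = 8192.0;

        printf("store-and-forward: %.1f us\n", frag_latency(msg, 1));

        int best_n = 2;
        double best = frag_latency(msg, 2);
        for (int n = 3; n <= 64; n++) {
            double t = frag_latency(msg, n);
            if (t < best) { best = t; best_n = n; }
        }
        printf("best fixed fragmentation: %d fragments of %.0f bytes -> %.1f us\n",
               best_n, msg / best_n, best);
        return 0;
    }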
On Ethernet and other networks with small MTUs, message pipelining occurs automatically as IP fragments large messages to send them on the network. To show the effect of IP fragmentation on message latency, Table 2 shows the GMS/IP 8KB page fault times for several IP MTUs on Myrinet. The latency improvement from pipelining is largely overshadowed by overheads for fragmentation, packet handling (e.g., interrupts), and reassembly in the hosts. The lower-overhead Trapeze firmware yields better latency improvements, but still never improves by more than 13%. (Note: both network drivers are locally produced and equivalently optimized.) The table shows that for message pipelining to be effective, per-message overheads must be lower than IP fragmentation allows. This motivates message pipelining at lower levels of the messaging system.
Network                  Fragment Size (MTU)   GMS 8KB Fault Time (μs)   Stream Bandwidth (MB/s)
IP/Myrinet               1KB                   579                       12.2 (TCP)
(standard firmware)      2KB                   576                       19.8 (TCP)
                         4KB                   577                       27.6 (TCP)
                         8KB                   579                       37.8 (TCP)
IP/Myrinet               1KB                   404                       27.7 (TCP)
(Trapeze firmware)       2KB                   372                       37.5 (TCP)
                         4KB                   384                       37.5 (TCP)
                         8KB                   427.5                     54.4 (TCP)
                                                                         91.5 (zero-copy TCP)
Raw Trapeze              8KB                   269.5                     103.5 (gms_rpc)
(pipelined)                                    165.6                     103.4 (gms_rpc)

Table 2: Some network alternatives for GMS and stream communication on Myrinet.
2.2 The Trapeze Approach: Cut-Through Delivery
Rather than using fragmentation, Trapeze places all control of pipelining within the pipeline, hidden within the NICs. Figure 4 illustrates the Trapeze approach. The basic function of the NIC and its firmware is to move data from a source to a sink: when sending, the NIC sources data from hostTx and sinks it to netTx; when receiving, it sources data from netRcv and sinks it to hostRcv. The Trapeze firmware implements message pipelining by initiating transfers to idle sinks whenever the amount of data buffered on the NIC awaiting transfer to the sink exceeds a configurable threshold (minpulse). When the firmware initiates a transfer to a DMA sink, it transfers all of the data it has buffered for the current packet.
do forever
    for each idle DMA sink in {NetTx, HostRcv}
        waiting = words awaiting transfer to sink
        if (waiting > MINPULSE)
            initiate transfer of waiting words to sink
    end
    for each idle DMA source in {HostTx, NetRcv}
        if (transfer waiting and buffer available)
            initiate transfer from source
    end
loop
[The accompanying timeline plots the resulting cut-through transfer schedule over time (microseconds).]
Figure 4: Cut-through delivery policy and resulting pipeline transfers.
We call this technique cut-through delivery in an analogy to the technique used to eliminate store-and-forward latency in network switches [12]. Our contribution here is to show how this cut-through technique is applied to network adapters and to evaluate its performance benefits for network storage and network memory systems (see the note to reviewers at the end of this subsection). We find that this simple technique offers compelling benefits over pipelining strategies recently proposed in other work (see Section 2.4). In particular, it produces
near-optimal pipeline schedules automatically, because it naturally adapts to different G_i and g_i values at each stage. For example, if a fast source feeds a slow sink, data builds up in the NIC buffers behind the sink, triggering larger transfers through the bottleneck, reducing the total per-transfer overhead. Similarly, if a slow source feeds a fast sink, the policy sends a sequence of minpulse-size transfers that use the idle sink bandwidth to minimize latency; the higher overheads do not matter because bandwidth is bounded by the slow source.
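The following C sketch illustrates this adaptation for the fast-source/slow-sink case. It models a source that deposits data into the NIC buffer at a constant rate and a sink that, whenever it is idle and more than minpulse bytes are buffered, drains everything buffered in one transfer, mirroring the policy in Figure 4. The rates, overheads, and minpulse value are illustrative assumptions, not Trapeze firmware parameters; the point is only that the printed transfer sizes grow as data backs up behind the slower sink.

    /* Sketch of cut-through delivery adapting transfer sizes: a fast source
     * feeding a slow sink causes data to accumulate on the NIC, so each sink
     * transfer carries more data (lower per-byte overhead).  All rates and
     * overheads are illustrative assumptions, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        const double msg_bytes  = 8192.0;  /* one payload */
        const double minpulse   = 128.0;   /* sink transfer threshold, bytes (illustrative) */
        const double src_MBps   = 132.0;   /* source deposit rate; 1 MB/s == 1 byte/us */
        const double sink_MBps  = 66.0;    /* a slower sink */
        const double sink_g_us  = 4.0;     /* per-transfer setup overhead at the sink */

        double now  = 0.0;   /* current time, microseconds */
        double sent = 0.0;   /* bytes the sink has drained so far */

        while (sent < msg_bytes) {
            /* Bytes the source has deposited in the NIC buffer by time 'now'
             * (the source's own setup overhead is ignored in this sketch). */
            double deposited = now * src_MBps;
            if (deposited > msg_bytes)
                deposited = msg_bytes;
            double buffered = deposited - sent;

            if (buffered <= minpulse && deposited < msg_bytes) {
                /* Sink is idle but the buffer has not crossed minpulse yet:
                 * advance time until slightly more than minpulse is buffered. */
                now = (sent + minpulse + 1.0) / src_MBps;
                continue;
            }
            /* Cut-through delivery: transfer everything buffered for this packet. */
            printf("t = %7.1f us   sink transfer of %5.0f bytes\n", now, buffered);
            now  += sink_g_us + buffered / sink_MBps;
            sent += buffered;
        }
        printf("packet fully drained at t = %.1f us\n", now);
        return 0;
    }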
Our implementation takes advantage of properties of the Myrinet network. In particular, Myrinet permits a NIC to transmit a packet on the link as a sequence of separate transfers. Myrinet network switches and the receiving NIC recognize all data placed on the link by the sender as a continuation of the current packet until the sender transmits a tail flit to mark the end of the packet; the receiver does not receive any part of any other packet until the tail flit arrives. Reassembly is trivial, and fragments cannot be dropped independently. However, the approach can be used with other networks as well. On the receiving side, cut-through delivery is independent of the network, since the cut-through transfers are on the host I/O bus, not the network. On the sending side, cut-through can be used on any network that allows segmentation of a packet or frame. The APIC network adapter uses a similar technique at an ATM cell granularity [7].

Note to reviewers: A previous paper [20] describes an early version of the Trapeze cut-through delivery technique, developed concurrently with the earliest work on Myrinet message pipelining by the Fast Messages group at UIUC. The current paper gives a more detailed treatment and makes the following additional contributions: (1) it shows that our original scheme extends easily to deliver high bandwidth as well as low latency, (2) it shows that the scheme extends easily to adapt to network congestion, (3) it compares our scheme to more recent theoretical work and alternative practical implementations of message pipelining, and (4) it reports experimental results from a full implementation on platforms that deliver full 32-bit/33MHz PCI bandwidth, showing the benefits of adaptive message pipelining for TCP/IP communication and data-intensive applications running under GMS. Our other two accepted papers on GMS/Trapeze [1, 18] mention our message pipelining approach only in passing.
2.3 Balancing Latency and Bandwidth
The purpose of message pipelining is to reduce latency, rather than to increase bandwidth. In fact, pipelined messages always yield lower bandwidth because each transfer incurs the fixed cost g_i at each pipeline stage i: the effective bandwidth of each stage for a transfer of size B is B/(g_i + G_i·B). The bandwidth effect of per-fragment overheads is visible in Table 2, which shows bandwidth decreasing with the degree of fragmentation. Lower overhead interfaces reduce the bandwidth penalty, but they cannot eliminate fundamental per-transfer costs such as bus cycles consumed to set up each transfer.
To quantify this, Figure 5 shows the minimum effective bandwidth of LANai-4 Myrinet NIC DMA engines as a function of transfer size for Myrinet links and Pentium-II/440LX and Alcor (CIA) I/O bus implementations (32-bit 33MHz PCI). With 128-byte transfers, the NIC can drive at most 55% of its peak Myrinet link speed and 46% to 55% of peak speed on the host PCI buses. In practice, these transfers incur much higher firmware overheads in the LANai-4 CPU, increasing the penalty. Figure 5 includes curves with g parameters more reflective of realistic overheads.
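The computed curves in Figure 5 follow directly from the B/(g + G·B) formula. The short C sketch below reproduces that calculation for the g values named in the figure, assuming a 132 MB/s peak stage bandwidth (our assumption, based on the PCI figure used elsewhere in the paper); it models the formula only and does not reproduce the measured curves.

    /* Sketch: effective bandwidth of a single pipeline stage as a function of
     * transfer size B, computed as B / (g + B/peak), matching the computed
     * curves in Figure 5.  Peak bandwidth here is an assumption. */
    #include <stdio.h>

    /* Effective bandwidth (MB/s) for a transfer of 'bytes' through a stage with
     * per-transfer overhead g_us (microseconds) and peak bandwidth peak_MBps.
     * With 1 MB/s == 1 byte/us, the transfer takes g + B/peak microseconds. */
    static double effective_bw(double bytes, double g_us, double peak_MBps)
    {
        return bytes / (g_us + bytes / peak_MBps);
    }

    int main(void)
    {
        const double peak = 132.0;                       /* assumed peak, MB/s */
        const double overheads[] = { 4.0, 8.0, 12.0 };   /* the g values plotted in Figure 5 */

        for (int i = 0; i < 3; i++) {
            printf("g = %4.1f us:", overheads[i]);
            for (double b = 128.0; b <= 8192.0; b *= 2.0)
                printf("  %4.0f B -> %5.1f MB/s", b, effective_bw(b, overheads[i], peak));
            printf("\n");
        }
        return 0;
    }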
Thus message pipelining can have a significant bandwidth penalty and should not be used in bandwidth-constrained scenarios. For continuous streams of packets, pipelining yields no bandwidth benefit because the packets pipeline naturally in the network and adapters even without fragmentation; fragmenting individual messages only increases overhead and hence lowers the effective bandwidth. Similarly, message pipelining is detrimental in the presence of congestion delays. The messaging system could supply users with a separate bulk transfer mechanism for bandwidth, but it may not be clear when a network storage system should use such a mechanism, particularly when throughput is limited by streams of messages moving between different hosts but passing through a shared bottleneck. This can occur when clients stripe read or write requests across multiple servers, when servers handle requests from multiple clients, or when contention occurs on the network or host I/O bus.

[Figure 5 plots effective bandwidth (MB/s) against message size (bytes, up to 8192) for the Myrinet link engines (NetTx/NetRcv), HostTx on Alcor, HostTx and HostRcv on the Pentium-II, and computed curves for g = 4, 8, and 12 μs.]
Figure 5: Effective bandwidths for pipeline stages with various overheads, as a function of transfer size. Curves for the PCI I/O buses and Myrinet links are measured by a minimal Myrinet firmware program; the others are computed as B/(g_i + G_i·B) for transfer size B.
In general, message pipelining reduces latency only when all links between the source and sink
are at low utilization. Cut-through delivery naturally produces the right pipelining behavior when combined with buffer queues on the NICs. In addition to absorbing bursts and allowing overlapped transfers of separate packets through double buffering, buffer queues allow the adaptive behavior of cut-through delivery to carry over to subsequent packets headed for the same sink. If a leading packet encounters a queueing delay, more of the data from trailing packets accumulates in the NIC buffer queue during the delay. Since each sink transfer always carries as much data as the NIC has buffered for the packet, buffering triggers larger transfers, helping to compensate for the delay.
[Figure 6 plots the HostTx, NetTx, and HostRcv transfer schedules over time (0 to 1000 μs).]
Figure 6: Cut-through delivery falls back to larger transfer sizes for a stream of 8K payloads. The fixed-size transfer at the start of each packet on NetTx and HostRcv is the control message data, which is always handled as a separate transfer. Control messages do not appear on HostTx because they are sent using programmed I/O rather than DMA on this platform (300 MHz Pentium-II/440LX).
Since pipelining itself causes delays due to per-transfer g_i costs, cut-through delivery allows the system to degrade naturally to store-and-forward for continuous streams of packets. Figure 6 shows this effect for a stream of 8KB payloads. This diagram was generated from logs of the DMA activity taken by an instrumented version of the Trapeze firmware. The transfers for successive packets are represented with alternating shadings; all consecutive stripes with the same shading represent transfers for the same packet. The width of each vertical stripe shows the duration of the transfer, as determined by a cycle counter on the NIC. If there is no contention, this duration is proportional to the size of the transfer (it is G_i·B). In practice, hostTx and hostRcv transfers on the host I/O buses may take longer if another I/O device (or the CPU) contends for the I/O bus, or if the CPU demands host memory bandwidth. Similarly, netTx transfers on Myrinet may stall due to memory bandwidth limitations on the NIC, or to back-pressure flow control in the Myrinet link interfaces and switches. Back-pressure may occur due to competition for a shared link from another sender, or if the receiving NIC does not consume data from the link fast enough.
We make the following three important observations from Figure 6:
- Pipelining with cut-through delivery always uses full-size transfers on hostTx, yielding peak bandwidth at the first stage of the pipeline. When the host sends a packet, the firmware initiates a full 8K transfer on hostTx to bring it across the bus. In this example, the first hostTx transfer takes 254 μs to complete due to competition from the host CPU as it uses programmed I/O to write to control fields in the send ring entries for subsequent packets. However, subsequent hostTx transfers run at close to full I/O bus bandwidth.

- High effective bandwidths at the first stage (hostTx) exert pressure to "pull up" the transfer sizes and effective bandwidths of downstream stages. These stages initially deliver lower effective bandwidth since cut-through dictates that they will use smaller transfers. As the source engines deposit data on the NICs faster than the downstream engines can sink it, data accumulates in the NIC buffers, triggering progressively larger transfers and higher effective bandwidths propagating downstream in a ripple effect.

- The NIC buffer queues allow increasing transfer sizes to carry over to subsequent packets. Thus the receiver gradually drops out of cut-through as it comes under load and the amount of data buffered on each NIC increases.
Figure 7 shows the effective bandwidth through each DMA engine for the transfers at each point in the experiment. The PCI send DMA (hostTx) spikes above 112 MB/s almost immediately, but the network channel bandwidth and receiver's PCI (hostRcv) bandwidth are "pulled up" to the sender's speed more gradually. The network channel ramps up first, followed by the receiver's PCI bus. This experiment reaches steady state at T = 1350 μs, with all DMA engines transferring at an average bandwidth above 100 MB/s.
[Figure 7 plots effective bandwidth (MB/s) over time (microseconds) for HostTx, NetTx, and HostRcv.]
Figure 7: Stepping up effective link bandwidths for a continuous stream of packets. (Note: the effective bandwidths include idle time between transfers.)
2.4 Related Work
A number of systems have carried over the fragmentation/reassembly approach to reduce network latency on networks whose MTUs do not require fragmentation. Fixed fragmentation was first used to reduce network latency in a version of GMS that treats each page as a sequence of subpages sent as separate messages, as reported in the last ASPLOS [11] and discussed in more detail in Section 3.4. Several Myrinet messaging systems now use fixed fragmentation to reduce latency of large messages. FM [14] and VMMC-2 [8] use fixed-size fragments to lower latency of large messages. BIP [15] improved on these approaches by selecting the fragment size based on the size of the message, yielding latencies comparable to Trapeze on platforms with balanced I/O buses. Our approach appears to be unique among published firmware implementations for Myrinet. PM appears to use a similar technique on the sending side (immediate sending), but it is mentioned only in passing in a technical report [16].
Wang et al. [19] define pipelining strategies that are similar to cut-through delivery in that they are not limited to fixed-size transfers. In particular, the work shows that the optimal fragment size for fixed-size fragmentation depends on the size of the message as well as the pipeline parameters, and that variable-size fragmentation delivers lower latency than fixed-size fragmentation when a bottleneck link exists, i.e., one pipeline stage has higher per-fragment overhead or lower bandwidth than the others. Their work yields important insights into pipeline behavior, but it assumes that g_i and G_i values are fixed and known statically. Moreover, their work considers only the fragmentation approach to pipelining, in which each fragment encounters a store-and-forward delay, and it addresses only latency of individual messages, and not bandwidth.
The Trapeze approach differs from these other approaches in that (1) pipelining is implemented on the NIC, transparently to the hosts and the host network software, and (2) selection of transfer sizes at each stage is automatic, dynamic, and adaptive to congestion conditions encountered within the pipeline.
3 Performance of Message Pipelining Alternatives
This section compares Trapeze message pipelining using cut-through delivery to other approaches using fragmentation. We first compare Trapeze latency and bandwidth to two of the top Myrinet-based messaging packages that use pipelining: BIP and Illinois Fast Messages (FM), both of which use fixed fragmentation. Next we analytically contrast our approach with fragmentation strategies proposed in Wang [19]. Finally, we compare Trapeze message pipelining to the GMS subpage work [11].
3.1 Comparing Myrinet Messaging Alternatives
We first compare the latency and bandwidth of Trapeze with the latest FM and BIP releases. We also show the effects of our pipelining choices by comparing to two modified versions of Trapeze: one with all message pipelining disabled (store-and-forward) and one with transfer sizes fixed to 1664 bytes (fixed-1664), a value determined empirically to deliver the lowest latency for 8KB payloads. For these experiments, we use a cluster of 300 MHz Pentium-II PCs based on the Intel 82433LX PCI chipset (440/LX), a platform with excellent PCI performance.
(We do not experimentally compare to the approach in [19], since no PCI prototype is available and the SBUS implementation of Active Messages for Myrinet is limited to 40 MB/s peak bandwidth, even without pipelining.)
[Figure 8 plots one-way latency (microseconds) against message size (bytes, up to 32768) for FM (observed), FM (reported), Trapeze (store-and-forward), Trapeze (fixed-1664), Trapeze (cut-through), and BIP (observed).]
Figure 8: Latency as a function of message size.
We ran each messaging system on its primary operating system platform: FreeBSD 2.2.5 for Trapeze, Debian GNU Linux (kernel version 2.0.33) for BIP, and Windows NT 4.0 (SP3) for FM. However, all benchmarks in this section are user-user tests that access the network interface directly; thus the results are only minimally affected by OS performance.
We used developer-supplied benchmarks to obtain the most favorable results for each system. We
modified the BIP bandwidth test slightly to achieve higher bandwidths. The resulting benchmarks are
essentially identical for all three systems. However, Trapeze tests report latencies and bandwidths as
a function of payload size rather than packet size; thus each Trapeze packet actually carries additional
control message data not accounted for in the message size. This biases the tests slightly against the
Trapeze variants, particularly for small message sizes.
Figure 8 and Figure 9 show latency and bandwidth for all the systems. For messages larger than a few hundred bytes, Trapeze has latency comparable to or lower than the competing systems. At the same time, Trapeze delivers higher point-to-point bandwidths, up to 126 MB/s for streams of 64K packets.
This high bandwidth is due to the ability of cut-through delivery to adjust transfer sizes as data backs up in the network, as described in Section 2.3. The results show that this approach combines the high bandwidth of store-and-forward with latency equivalent to or lower than the approaches using fixed transfer sizes (FM, BIP, Fixed-1664). At 8KB payload sizes, Trapeze can deliver higher bandwidths to a user process through TCP sockets (with checksums disabled and zero-copy payload remapping) or the file system than we measured through the raw FM and BIP message interfaces (but note that the TCP and file system bandwidths we report were measured on Alpha systems with faster CPUs).

(We plot latencies reported on the FM web site as well as those measured in our testbed. The FM team is working with us to identify the cause of the higher observed latencies.)
[Figure 9 plots bandwidth (MB/s) against message size (bytes, up to 32768) for Trapeze (cut-through), Trapeze (store-and-forward), BIP, Trapeze (fixed-1664), and FM.]
Figure 9: Bandwidth as a function of message size.
Total Loop Overhead   HostTx/Rcv   NetTx     NetRcv
3.19 μs               1.48 μs      1.08 μs   0.63 μs

Table 3: Trapeze firmware resource loop overhead on the LANai-4 Myrinet NICs.
It is interesting to note that the systems using fixed transfer sizes show a sawtooth bandwidth profile caused by lower effective bandwidth for message sizes not evenly divided by the transfer size. The oscillations in the FM bandwidth curve, most pronounced for small messages, reflect the higher overhead of host-side fragmentation relative to the NIC-based approaches used by Trapeze and BIP. The terraced BIP bandwidth profile reflects BIP's selection of fragment sizes based on the size of the messages. The smooth Trapeze bandwidth profile reflects fine-grained adjustment of transfer sizes to adjust to message sizes as well as congestion.
Cut-through delivery imposes a modest overhead cost in the Trapeze firmware, which must poll each NIC DMA engine in a loop to monitor pending DMA transfers (see Figure 4). When a DMA transfer completes, the firmware executes half of its main loop on average before recognizing that the engine is idle. Moreover, the status checks on the DMA engines add overhead to the main loop, as shown in Table 3, which shows the average penalties for traversing the resource loop when the NIC is idle. This overhead is most apparent for small messages and may explain the small difference in latency between BIP and Trapeze in Figure 8. The new LANai-5 NICs will reduce this small-message penalty for cut-through delivery.
[Figure 10 plots aggregate bandwidth (MB/s) against the number of clients (1 to 3) for Trapeze and BIP.]
Figure 10: Aggregate bandwidth for many-to-one and one-to-many traffic.
3.2 Adjusting to Contention
In the presence of contention, pipelining using fixed fragmentation is suboptimal because the pipeline parameters vary with competing traffic. Cut-through delivery is the only strategy that adjusts to network congestion. Figure 10 shows that Trapeze reverts to store-and-forward when a client or server network interface saturates, even when the packets passing through the interface are moving to or from different hosts. In this experiment, a single node either sends or receives streams of 8KB
messages to or from its peers. Each stream demands roughly one-half (60 MB/s) of the bandwidth
on the network. Figure 10 shows the resulting aggregate bandwidths for BIP and Trapeze with one,
two, or three peers. In a network storage system, these cases could occur for a server with multiple
clients, or a client striping data across multiple servers.
Both BIP and Trapeze satisfy the first stream easily. When the second peer is added, congestion
causes Trapeze to revert to store-and-forward, allowing it to deliver its peak bandwidth although
neither stream alone demands full bandwidth. In contrast, BIP saturates at its maximum per-stream
bandwidth on the second peer. The addition of the third peer does not increase aggregate bandwidth
since both systems are already saturated. (Interestingly, the Myrinet network allocates bandwidth
equally among the Trapeze peers, while the bandwidth share is unbalanced under BIP.) The results
are the same in either direction; both systems deliver aggregate bandwidths equivalent to their peak
point-to-point bandwidths at saturation. This is not surprising; what is important here is that the
message pipelining in Trapeze adapts eectively to congestion in these many-to-one and one-to-many
scenarios.
A similar effect occurs for contention on a shared link or host I/O bus. This is important because a sender using cut-through delivery may acquire network links earlier in order to begin sending a packet before all of the data is ready, possibly increasing contention at shared links or the destination interface. This is not a problem in practice because congestion causes Trapeze to drop out of cut-through if any significant competing traffic exists.
3.3 Comparison to Variable Fragmentation
[Figure 11 plots latency (microseconds) against message size (bytes, 0 to 16384) for store-and-forward, optimal fixed fragmentation, optimal variable fragmentation, and cut-through, for two pipelines: LANai-4 with G = (132, 132, 132) MB/s and g = (2, 4, 2) μs, and LANai-5 with G = (266, 160, 266) MB/s and g = (1, 1, 1) μs.]
Figure 11: Latency of four fragmentation schemes for two sample pipelines representative of LANai-4 and LANai-5 Myrinet NICs.
We now give evidence that Trapeze automatically selects pipeline schedules comparable to the theoretically near-optimal fragmentation strategies outlined in [19]. In many cases the cut-through delivery latencies are better, because cut-through delivery is not constrained to delay forwarding of a "fragment" to stage i until it has cleared stage i − 1.

Using models based on [19] and the cut-through delivery model of Section 2.1, we compared the latency of cut-through delivery with that of optimal fixed and variable fragmentation strategies for two "interesting" sets of pipeline parameters representative of a 32-bit/33MHz PCI/LANai-4 NIC and a next-generation 64-bit/33MHz PCI/LANai-5, both in balanced-bus configurations. Figure 11 shows that cut-through pipelining yields lower latency than the three less flexible strategies in these configurations. We do not compare to the more sophisticated hierarchical fragmentation strategy, which may improve latencies in pipelines with significant bottlenecks.

These results are derived analytically, not empirically. While we believe the analysis is correct, further validation is needed. However, we note that the behavior of cut-through delivery closely matches the conditions for optimal pipelining given in [19]. Cut-through delivery always uses small (minpulse) initial transfers to fill the pipeline quickly at the start of a packet; it uses larger transfers for lower overhead at a bottleneck stage; it always reverts to smaller transfers for better pipelining on exiting the bottleneck stage.
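For reference, the sketch below shows how such an analytical comparison can be set up for the two pipelines of Figure 11, using only the G and g values stated in the caption. It evaluates store-and-forward and the best fixed fragmentation under the same fragment-pipeline recurrence used earlier; it does not model cut-through delivery itself, so it is a baseline for the comparison rather than a reproduction of the figure.

    /* Sketch: store-and-forward vs. best fixed fragmentation for the two
     * pipelines of Figure 11.  Each fragment is treated store-and-forward at
     * every stage; cut-through delivery is not modeled here. */
    #include <stdio.h>

    struct pipe {
        const char *name;
        double g[3];    /* per-transfer overhead per stage, microseconds */
        double bw[3];   /* peak bandwidth per stage, MB/s (1 MB/s == 1 byte/us) */
    };

    /* Latency of a message split into nfrag equal fragments; nfrag == 1 is
     * plain store-and-forward. */
    static double frag_latency(const struct pipe *p, double msg, int nfrag)
    {
        double frag = msg / nfrag;
        double finish[3] = { 0.0, 0.0, 0.0 };

        for (int k = 0; k < nfrag; k++) {
            double prev = 0.0;              /* fragment k is ready in host memory at t = 0 */
            for (int i = 0; i < 3; i++) {
                double start = prev > finish[i] ? prev : finish[i];
                finish[i] = start + p->g[i] + frag / p->bw[i];
                prev = finish[i];
            }
        }
        return finish[2];
    }

    int main(void)
    {
        const struct pipe pipes[2] = {
            { "LANai-4", { 2.0, 4.0, 2.0 }, { 132.0, 132.0, 132.0 } },
            { "LANai-5", { 1.0, 1.0, 1.0 }, { 266.0, 160.0, 266.0 } },
        };
        const double msg = 8192.0;

        for (int j = 0; j < 2; j++) {
            double sf = frag_latency(&pipes[j], msg, 1);
            double best = sf;
            int best_n = 1;
            for (int n = 2; n <= 64; n++) {
                double t = frag_latency(&pipes[j], msg, n);
                if (t < best) { best = t; best_n = n; }
            }
            printf("%s: store-and-forward %.1f us, best fixed fragmentation %.1f us (%d fragments)\n",
                   pipes[j].name, sf, best, best_n);
        }
        return 0;
    }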
3.4 Subpages
We now compare message pipelining to the subpage approach to reducing GMS fetch latencies. At first approximation, subpages are merely a special case of fixed fragmentation; the fragment size is the subpage size. Subpages are subject to the same disadvantages as host-side fixed fragmentation: high host overheads, inflexibility, and reduced bandwidth. NIC-level message pipelining in Trapeze subsumes the pipelining benefit of subpages, and in fact offers significantly better latency due to lower overhead (as well as the faster CPU, bus, and network on current systems). For example, the time to arrival of the first subpage in the GMS/ATM subpage prototype was 450 μs with 265-byte subpages, 2.7 times the 165 μs time to fetch the entire 8KB page using Trapeze/Myrinet. The full 8KB page fault time in the GMS/subpage prototype was 1250 μs with optimal 2048-byte subpages.
A second benefit of subpages is overlapped execution when the system restarts a page-faulting process as soon as the faulted fragment arrives, assuming some mechanism for subpage-grain memory protection in the CPU. If the restarted process faults another page before the first page fetch is complete, then the wire time for the remaining subpages of the original page overlaps with request origination for the second fetch (and perhaps some of the hostTx and netTx wire time).
The subpage experiments showed that most of the benefit from overlapped execution is in fact due to this overlapping I/O effect rather than latency-hiding in the CPU. In GMS/Trapeze, 114 μs of the page fetch time is wire time for the page; request origination and host processing account for only 51 μs. Assuming perfect overlapping, this can produce a reduction of about 35% over the Trapeze fetch time, for a marginal benefit of 22% over the 38% improvement due to message pipelining in Trapeze. This potential improvement is small but significant, and we intend to explore the possibility of combining the two approaches on CPUs that support subpage protection. This could be implemented using an extension to Trapeze to allow the host to specify which subpage of the payload to send first, and to generate an interrupt at the receiver when the hostRcv DMA of the first subpage is complete.
4 Performance of Network Memory with Trapeze
This section reports results from synthetic benchmarks and application experiments using the GMS
network memory system over Trapeze/Myrinet. The most demanding network storage workload
for Trapeze consists of I/O bound applications accessing data sets that are too large for the local
memory of any node, but small enough to fit in the GMS network memory. This allows us to study
the performance of Trapeze under a network storage workload with all disk accesses removed from
the critical path, showing the maximum achievable performance of I/O-intensive applications for any
network storage system using Trapeze.
For these experiments, we compared the performance of the applications under several Digital
Unix 4.0 variants, listed in order from slowest to fastest:
disk. Disk-bound runs without network memory.
gmsSF. A GMS system managing network memory as described in [9]: each evicted dirty page
is written to the local paging disk, and reads from global memory are destructive. This system
uses zero-copy page and block transfers over raw Trapeze, but without message pipelining (store
and forward). We have also extended it to do sequential readahead from network memory for
file accesses.
gms+SF. Similar to gmsSF, but modified to manage network memory as a writeback cache. Evicted dirty pages are written only to a memory server and never to the disk, and pages fetched from network memory remain there where they can be fetched again if not dirtied. Evicted clean pages are written to network memory only if they are not already resident. In our experiments, pages in network memory are never evicted and there is sufficient network memory to store the entire dataset.
gms+. Identical to gms+SF, but with Trapeze message pipelining enabled to reduce fault
latencies.
internal. Internal memory runs with minimal I/O and user CPU utilizations above 97%.
We report runtime components for each benchmark normalized to the time for the disk runs. The runtime components are GMS and VM overhead (gms/vm), Trapeze host overhead (tpz), programmed I/O (pio) overhead, I/O stall time, and user time. We determine the size of each component using DCPI [2]. All experiments use a single 500 MHz Alpha Miata client and a sufficient number of 266 MHz Alcor memory servers to hold the dataset in network memory, if used. The network memory runs use a warm cache. For the disk-bound runs, the clients used a fast-wide 7200 RPM Seagate Barracuda paging disk.

[Figure 12 is a bar chart of normalized execution time for seq, random, and racer under the disk, gmsSF, gms+SF, gms+, and internal configurations, with each bar broken down into user, I/O stalls, pio, tpz, gms/vm, and other components.]
Figure 12: Synthetic benchmark execution times normalized to disk-bound run.
4.1 Synthetic Benchmarks
Figure 12 shows performance of three synthetic benchmarks in all configurations. Seq accesses anonymous virtual memory pages sequentially, random accesses its pages randomly, and racer accesses memory-mapped file pages sequentially. Each page access is a single integer load, so the data is always clean. These synthetic benchmarks perform no computation between page references. They represent the worst cases for any I/O system.
In the network memory configurations, seq and random incur fault stall times proportional to the GMS/Trapeze page fault latency from network memory. This latency is constant at 269.5 μs for gmsSF and gms+SF; for gms+, adaptive message pipelining reduces this latency to 165 μs. Note that random shows more benefit from network memory (relative to disk) than seq because the seq disk-bound runs are faster due to Digital Unix clustered readahead from the paging disk. Sequential accesses in racer benefit from readahead in both the disk and GMS configurations; under GMS, it is limited by messaging bandwidth. Since message pipelining in Trapeze is adaptive, the bandwidth of the racer runs with message pipelining is comparable to the store-and-forward runs. Under gms+, racer sustains bandwidth just under 98 MB/s, touching one 8KB page every 85 μs.
4.2 Application Experiments
For these experiments, we selected benchmarks from the SPEC95 suite, constrained their local memory to 50 MB for the non-internal runs, and measured their performance under each system. Table 4 lists the applications, their Resident Set Sizes, and their peak performance under the disk, gms+, and internal configurations. These are demanding tests: Applu generates nearly two million page faults for its non-internal runs.

Application   Max RSS (KB)   Total Time (s)        User CPU %          Description
                             disk/GMS/internal     disk/GMS/internal
applu         38192          44433 / 1452 / 408    0.9 / 28 / –        Partial differential equations (SPEC)
compress95    36104          937 / 301 / 187       24 / 79 / –         Lempel-Ziv code/decode (SPEC)
m88ksim       37984          2461 / 279 / 181      9.2 / 68 / –        Motorola 88K chip simulator (SPEC)
su2cor        24544          4320 / 733 / 139      3.4 / 28 / –        Monte Carlo simulation (SPEC)

Table 4: Application execution and user times.

The effectiveness of network memory is best evaluated from the user CPU utilizations. We see that m88ksim shines with 68% user CPU utilization, up from 9.2% in the disk configuration. This is a 32% runtime penalty from the internal memory runs, and a speedup of 7.4 over the disk runs. We ran Applu on Alcor due to unstable behavior on Miatas, which have flaws in the PCI bridge and machine-check under I/O load. This inhibits Applu's performance because Alcor delivers only half the PCI I/O bus bandwidth when receiving.
[Figure 13 is a bar chart of normalized execution time for applu, compress95, m88ksim, and su2cor under the disk, gmsSF, gms+SF, gms+, and internal configurations, with each bar broken down into user, I/O stalls, pio, tpz, gms/vm, and other components.]
Figure 13: Application execution times normalized to disk-bound runs.

Figure 13 shows the execution time breakdowns for the runs, normalized to the disk runs. Note that compress95 has a large but primarily read-only dataset, while m88ksim and su2cor dirty most of their data, and therefore show larger improvements in the gms+ configurations (relative to gms) since they are no longer limited by writes to the paging disk in gms+. Due to the asymmetry between network-limited reads and disk-limited writes, even a modest write rate incurs a significant performance cost in the gms configurations. However, applications cannot survive dropped messages or failure of a memory server in the gms+ configuration. The fact that no data was lost and the applications ran to completion from network memory is evidence that our messaging system is reliable (though we did not discuss those mechanisms) and our system is stable. We intend to produce a full set of runs (and more analysis) for the final paper.

It is interesting to note the variance in user CPU times across the runs, increasing in configurations with higher I/O bandwidth demands. This is presumably due to memory contention from I/O, although Miata has excellent memory system bandwidth.
5 Conclusion
Network storage systems are a driving application for high-speed networks. They can improve the
performance of everyday applications without modifying them. Combined with cheaper disk storage,
they help meet the growing need to store and process larger volumes of data.
However, network storage systems place conflicting demands on the network. They demand low latency for accesses that incur I/O stalls, particularly when the source and destination of the data block is memory rather than disk, as is often the case when memory is used effectively as a disk
cache. They require high bandwidth for streaming transfers due to prefetching, multiprogramming,
write-behind, or concurrency between multiple clients and servers. In many cases, high-bandwidth
streams have many-to-one and one-to-many patterns, as in a storage client using multiple servers
(e.g., in a network RAID) or a storage server serving multiple clients. In these cases, high bandwidth
is needed not just for point-to-point bulk data transfer, but for streams of independent block-transfer
messages moving through the network.
The Trapeze messaging system combines low network latency with high bandwidth by careful pipelining of data movement between network links and host memory. In Trapeze, message pipelining is implemented using a cut-through delivery policy implemented in custom firmware for Myrinet network interface cards (NICs). Cut-through delivery on the NIC eliminates fragmentation and reassembly overheads common to other pipelining approaches. Moreover, cut-through delivery automatically adapts pipeline transfers, compensating for varying message sizes, varying overheads and bandwidths at different stages in the communication pipeline, and congestion conditions encountered within the pipeline.
We use microbenchmarks on a Myrinet/Intel Pentium-II cluster to show that adaptive message
pipelining in Trapeze yields comparable latencies to other leading Myrinet messaging systems (FM
and BIP). Analytical models show that cut-through delivery yields lower latencies than theoretically
optimal strategies using the less flexible alternative of fragmentation, for pipelines corresponding to
real cluster interconnects. Moreover, adaptive message pipelining in Trapeze eliminates the bandwidth penalty of the other pipelining approaches, yielding bandwidths from 91-98 MB/s to user
programs through TCP streams and network storage access.
We present results from a fully operational network memory system (the Global Memory Service)
using Trapeze on a Myrinet/Alpha cluster. Our results show that GMS/Trapeze can satisfy an 8KB
page fault in 165 μs, compared to 576 μs using IP and the vendor-supplied Myrinet messaging system. Adaptive message pipelining yields a 38% page fault stall time improvement after all other optimizations have been applied. We show that these improvements in communication performance translate directly into significantly reduced storage access costs for a range of data-intensive applications.
References
[1] Darrell Anderson, Jeffrey S. Chase, Syam Gadde, Andrew J. Gallatin, Kenneth G. Yocum, and Michael J. Feeley. Cheating the I/O bottleneck: Network storage with Trapeze/Myrinet. In Proceedings of the 1998 Usenix Technical Conference, June 1998. Duke University CS Technical Report CS-1997-21.
[2] Jennifer Anderson, Lance Berc, Jeff Dean, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Leung, Mark Vandevoorde, Carl Waldspurger, and Bill Weihl. Continuous profiling: Where have all the cycles gone? In Proceedings of the Sixteenth ACM Symposium on Operating System Principles (SOSP), October 1997.
[3] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. Serverless network file systems. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 109-126, December 1995.
[4] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W-K Su. Myrinet: A gigabit-per-second local area network. IEEE Micro, February 1995.
[5] Greg Buzzard, David Jacobson, Milon Mackey, Scott Marovich, and John Wilkes. An implementation of the Hamlyn sender-managed interface architecture. In Proc. of the Second Symposium on Operating Systems Design and Implementation, pages 245-259, Seattle, WA, October 1996. USENIX Assoc.
[6] David E. Culler, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Brent Chun, Steven Lumetta,
Alan Mainwaring, Richard Martin, Chad Yoshikawa, and Frederick Wong. Parallel computing
on the Berkeley NOW. In Proceedings of the 9th Joint Symposium on Parallel Processing (JSPP
97), 1997.
[7] Zubin D. Dittia, Guru M. Parulkar, and Jerome R. Cox. The APIC approach to high performance
network interface design: Protected DMA and other techniques. In Proceedings of IEEE Infocom,
1997. WUCS-96-12 technical report.
[8] Cezary Dubnicki, Angelos Bilas, Yuqun Chen, Stefanos Damianakis, and Kai Li. VMMC-2: Efficient support for reliable, connection-oriented communication. In Hot Interconnects Symposium V, August 1997.
[9] Michael J. Feeley, William E. Morgan, Frederic H. Pighin, Anna R. Karlin, and Henry M.
Levy. Implementing global memory management in a workstation cluster. In Proceedings of the
Fifteenth ACM Symposium on Operating Systems Principles, 1995.
[10] Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Eugene M. Feinberg, Howard Gobioff, Chen Lee, Berend Ozceri, Erik Riedel, David Rochberg, and Jim Zelenka. File server scaling with network-attached secure disks. In Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 97), June 1997.
[11] Herve A. Jamrozik, Michael J. Feeley, Geoffrey M. Voelker, James Evans III, Anna R. Karlin, Henry M. Levy, and Mary K. Vernon. Reducing network latency using subpages in a global memory environment. In Proceedings of the Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), pages 258-267, October 1996.
[12] Parviz Kermani and Leonard Kleinrock. Virtual cut-through: A new computer communication switching technique. Computer Networks, 3:267-286, 1979.
[13] Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84-92, Cambridge, MA, October 1996.
[14] Scott Pakin, Vijay Karamcheti, and Andrew Chien. Fast Messages (FM): Efficient, portable communication for workstation clusters and massively-parallel processors. IEEE Parallel and Distributed Technology, 1997.
[15] Loic Prylli and Bernard Tourancheau. BIP: A new protocol designed for high performance
networking on Myrinet. In Workshop PC-NOW, IPPS/SPDP98, 1998.
[16] Hiroshi Tezuka, Atsushi Hori, and Yutaka Ishikawa. PM: A high-performance communication library for multi-user parallel environments. Technical report, Real World Computing Partnership.
[17] Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A scalable distributed file system. In Proceedings of the Sixteenth ACM Symposium on Operating System Principles (SOSP), October 1997.
[18] Geoff Voelker, Tracy Kimbrel, Eric Anderson, Mike Feeley, Anna Karlin, Jeff Chase, and Hank Levy. Implementing cooperative prefetching and caching in a globally-managed memory system. In Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98), June 1998.
[19] Randolph Y. Wang, Arvind Krishnamurthy, Richard P. Martin, Thomas E. Anderson, and
David E. Culler. Modeling and optimizing communication pipelines. In Proceedings of ACM
International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS
98), June 1998.
[20] Kenneth G. Yocum, Jeffrey S. Chase, Andrew J. Gallatin, and Alvin R. Lebeck. Cut-through delivery in Trapeze: An exercise in low-latency messaging. In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing (HPDC-6), pages 243-252, August 1997.