IT Licentiate theses
2003-008

Efficient Synchronization and Coherence for Nonuniform Communication Architectures

by
Zoran Radović

September 2003

Division of Computer Systems
Department of Information Technology
Uppsala University
Uppsala, Sweden
Dissertation for the degree of Licentiate of Philosophy in Computer Science
at Uppsala University 2003

Efficient Synchronization and Coherence for Nonuniform Communication Architectures

Zoran Radović
[email protected]

Division of Computer Systems
Department of Information Technology
Uppsala University
Box 337
SE-751 05 Uppsala
Sweden
http://www.it.uu.se/

© Zoran Radović 2003
ISSN 1404-5117
Printed by the Department of Information Technology, Uppsala University, Sweden
Abstract
Nonuniformity is a common characteristic of contemporary computer systems,
mainly because of physical distances in computer designs. In large multiprocessors, the access to shared memory is often nonuniform and may vary by as much as a factor of ten for some nonuniform memory access (NUMA) architectures, depending on whether the memory is close to the requesting processor or not. Much research has
been devoted to optimizing such systems.
This thesis identifies another important property of computer designs, nonuniform communication architecture (NUCA). High-end hardware-coherent machines built from a few large nodes or from chip multiprocessors are typical NUCA
systems that have a lower penalty for reading recently written data from a neighbor’s cache than from a remote cache. The first part of the thesis identifies
node affinity as an important property for scalable general-purpose locks. Several software-based hierarchical lock implementations that exploit NUCAs are
presented and investigated. This type of lock is shown to be almost twice as fast
for contended locks compared with other software-based lock implementations,
without introducing significant overhead for uncontested locks.
Physical distance in very large systems also limits hardware coherence to a
subsection of the system. Software implementations of distributed shared memory (DSM) are cost-effective solutions that extend the upper scalability limit of
such machines by providing the “illusion” of shared memory across the entire system. This also creates NUCAs with even larger local-remote penalties, since the
coherence is maintained entirely in software.
The major source of inefficiency for traditional software DSM implementations
comes from the cost of interrupt-based asynchronous protocol processing, not
from the actual network latency. As the raw hardware latency of internode communication decreases, the asynchronous overhead in the communication becomes
more dominant. This thesis introduces the DSZOOM system that removes this
type of overhead by running the entire coherence protocol in the requesting processor.
Acknowledgments
First of all, I would like to thank my supervisor, Professor Erik Hagersten, for
introducing me to the world of high performance computer architectures, and for
offering me a Ph.D. student position at Uppsala University. Erik spent countless hours guiding me and my colleagues, improving our writing and “giving the talk” skills. Erik also directly contributed to several key topics presented in
this thesis.
I would like to thank all brave members of the newly formed Uppsala Architecture Research Team (UART) for their comments, great time, and debates. Many
thanks to Lars Albertsson, Erik Berg, Martin Karlsson, Henrik Löf, Dan Wallin,
and Håkan Zeffer.
Oskar Grenholm, a student from Uppsala University, contributed to this work
during his final six months of study at UART. Oskar implemented a SPARC
Assembler Instrumentation Tool (SAIT) and several low-level DSZOOM optimization techniques.
The anonymous reviewers have provided valuable insights, pointers to literature, and criticism that I have used to make this research stronger. I am also
grateful to Karin Hagersten for her careful review of several manuscripts presented in this thesis.
This research is a close cooperation with Sun Microsystems, Inc. Many people
at Sun have provided valuable insights and criticism; especially Anders Landin,
Larry Meadows, and Steven Sistare. This research is supported in part by Sun
Microsystems, Inc., and the Parallel and Scientific Computing Institute (PSCI),
Sweden.
I would also like to thank Sverker Holmgren and the Department of Scientific
Computing at Uppsala University for the use of their Sun WildFire machine.1
This thesis is dedicated to my mother and father; Mira and Gojko Radović,
who have provided so much love and support throughout my life.
1 Erik's design... [40]
Publications by the Author
This thesis includes, summarizes, and discusses the results and contributions
presented in several papers listed below. These papers will be referred to in the
summary as papers A through E.
Paper A Zoran Radović and Erik Hagersten.
Efficient Synchronization for Nonuniform Communication Architectures. In Proceedings of Supercomputing 2002 (SC2002), Baltimore, Maryland, USA, November 2002.
Paper B Zoran Radović and Erik Hagersten.
Hierarchical Backoff Locks for Nonuniform Communication
Architectures. In Proceedings of the Ninth International Symposium on
High Performance Computer Architecture (HPCA-9), Anaheim, California,
USA, February 2003.
Paper C Zoran Radović and Erik Hagersten.
Removing the Overhead from Software-Based Shared Memory. In Proceedings of Supercomputing 2001 (SC2001), Denver, Colorado,
USA, November 2001.
Paper D Henrik Löf, Zoran Radović, and Erik Hagersten.
THROOM — Running POSIX Multithreaded Binaries on a Cluster. Technical Report 2003-026, Department of Information Technology,
Uppsala University, Sweden, April 2003. A shorter version of this paper
is published in Proceedings of the 9th International Euro-Par Conference
(Euro-Par 2003), Klagenfurt, Austria, August 2003.
Paper E Oskar Grenholm, Zoran Radović, and Erik Hagersten.
Latency-hiding and Optimizations of the DSZOOM Instrumentation System. Technical Report 2003-029, Department of Information
Technology, Uppsala University, Sweden, May 2003.
Reprints of papers A, B, C, and D were made with permission from the publishers.
Paper A © 2002 IEEE
Paper B © 2003 IEEE/ACM
Paper C © 2001 ACM
Paper D © 2003 Springer-Verlag
Comments on My Participation
Paper A I am the principal author of this paper.
Paper B I am the principal author of this paper.
Paper C I am the principal author of this paper.
Paper D Henrik Löf is the principal author of this paper. I am responsible
for the instrumentation support, which is based on the original DSZOOM
implementation.
Paper E This paper is based on Oskar Grenholm’s Master’s thesis, which was
co-advised by me and Professor Erik Hagersten.
Other Papers and Reports
In addition to papers A through E, I am the principal author of a number of
work-in-progress papers [75, 76, 79].
Contents
1. Introduction

2. Experimentation Environment
   2.1. Hardware: Sun's WildFire Prototype
   2.2. SPLASH-2 Benchmark Suite

3. NUCA-Aware Locks
   3.1. The RH Lock
   3.2. HBO Locks

4. DSZOOM — All-Software Fine-Grained DSM
   4.1. Computer Clusters
   4.2. Beowulfs
   4.3. Software-Based Distributed Shared Memory
   4.4. The DSZOOM System
   4.5. THROOM: Pthreads Support for DSZOOM
   4.6. Write Permission Cache

5. Future Work
   5.1. Future Experimentation Platform
        5.1.1. Sun Fire Link
        5.1.2. InfiniBand
   5.2. Possible Improvements

6. Summary of the Contributions

A. Efficient Synchronization for Nonuniform Communication Architectures
   A.1. Introduction
   A.2. Nonuniform Communication Architectures
   A.3. Background and Related Work
        A.3.1. Atomic Primitives
        A.3.2. Simple Lock Algorithms
        A.3.3. Queue-Based Locks
        A.3.4. Alternative Approaches
   A.4. Key Idea Behind RH Lock
   A.5. The RH Lock
   A.6. Performance Evaluation
        A.6.1. Uncontested Performance
        A.6.2. Traditional Microbenchmark
        A.6.3. New Microbenchmark
        A.6.4. Application Performance
   A.7. Conclusions

B. Hierarchical Backoff Locks for Nonuniform Communication Architectures
   B.1. Introduction
   B.2. Nonuniform Communication Architectures
   B.3. Background and Related Work
   B.4. Hierarchical Backoff Locks
        B.4.1. The HBO Lock
        B.4.2. The HBO_GT Lock
        B.4.3. The HBO_GT_SD Lock
   B.5. Performance Evaluation
        B.5.1. Uncontested Performance
        B.5.2. Traditional Microbenchmark
        B.5.3. New Microbenchmark
        B.5.4. Application Performance
   B.6. Fairness and Sensitivity
   B.7. Conclusions

C. Removing the Overhead from Software-Based Shared Memory
   C.1. Introduction
   C.2. DSZOOM Overview
        C.2.1. Cluster Networks Model
        C.2.2. Node Model
        C.2.3. Blocking Directory Protocol Overview
        C.2.4. Protocol Details
   C.3. Implementation Details
   C.4. Performance Study
        C.4.1. Experimental Setup
        C.4.2. Applications
        C.4.3. Binary Instrumentation
        C.4.4. Parallel Performance
   C.5. Related Work
   C.6. Conclusions
   C.7. Future Work

D. THROOM — Running POSIX Multithreaded Binaries on a Cluster
   D.1. Introduction
   D.2. DSZOOM — a Fine-Grained SW-DSM
   D.3. THROOM Overview
        D.3.1. Distributing Threads
        D.3.2. Creating a Global Shared Address Space
        D.3.3. Cluster-Enabled Library Calls
        D.3.4. THROOM in a Nutshell
   D.4. Implementation Details
        D.4.1. Binary Instrumentation
        D.4.2. Modified System and Library Calls
   D.5. Performance Study
   D.6. Discussion, Conclusions, and Future Work
   D.7. Related Work

E. Latency-hiding and Optimizations of the DSZOOM Instrumentation System
   E.1. Introduction
   E.2. Target Architecture/Compiler Overview
        E.2.1. Original Proof-of-Concept Platform
        E.2.2. SPARC V8 and V9 ABI Restrictions
        E.2.3. Target Compiler Details
   E.3. SAIT: SPARC Assembler Instrumentation Tool
        E.3.1. Parsing SPARC Assembler
        E.3.2. Liveness Analysis
        E.3.3. Handling Delay Slots
        E.3.4. Using the Instrumentation Tool
   E.4. Low-Level Optimization Techniques
        E.4.1. Rewriting Snippets and Reducing the MTAG Size
        E.4.2. Straight Execution Path
        E.4.3. Avoiding Local Load/Store Instrumentation
        E.4.4. Write Permission Cache (WPC)
   E.5. Performance Study
        E.5.1. Experimental Setup
        E.5.2. Applications
        E.5.3. Performance Overview
        E.5.4. WPC Study
   E.6. Conclusions and Future Work

Bibliography
1. Introduction
During the 1990s, shared-memory UNIX servers made parallel computing popular. They are commonly used today in the commercial and technical server
markets. One of the prime reasons for the final acceptance of multiprocessors
in the wider marketplace was their shared-memory programming abstraction,
which made parallel systems easy to program and manage [15, 19, 22, 58, 65, 96].
Many programming paradigms and language primitives have been designed assuming the existence of shared-memory, including the Argonne National Laboratory (ANL) parallel macro package [14], POSIX Pthreads [43], OpenMP [26],
parallel implementations of Java [35], and Unified Parallel C (UPC) [17].
Today's servers are typically populated with processors that can only process one thread at a time and spend a majority of their time waiting for memory. In contrast, chip multiprocessor (CMP) and/or simultaneous multithreading (SMT) processors are designed to process multiple threads simultaneously [73, 106]. Modern and future processors, such as IBM's Power4 [104],
Intel’s Xeon and Pentium 4, or Sun’s upcoming UltraSPARC 4, are all capable of
running several threads in parallel. Also, Sun’s upcoming Niagara architecture is
said to run as many as 32 parallel threads on each chip [55]. When a thread must
wait for memory, the affected core will simply start processing another thread.
This radically improves chip utilization and increases application throughput.
Many sequential applications, i.e., the majority of all software written so far, will
simply be forced to “switch” to a multithreaded/parallel mode.
We are currently not far from symmetric multiprocessor (SMP) on-chip implementations (server-on-a-chip). It is even possible to create larger systems from several server-on-a-chip units, which will give the computing system several hierarchical and nonuniform properties. One of these properties is identified in Paper A as a nonuniform communication architecture (NUCA), in which the
cache-to-cache transfer from a cache in a “remote” server-on-a-chip unit (also
called a processing node) is significantly slower than a transfer from a cache in
the same node.
Almost all shared-memory applications use synchronization mechanisms to correctly coordinate thread accesses to shared-memory objects. The scalability of
these applications is often limited by contention for some critical section, for
example, to modify shared data guarded by mutual exclusion locks. While the
various optimizations of spin-lock implementations have in some instances led
to enhanced performance, most solutions do not consider or exploit the NUCA
characteristics of the computer system. In addition, many implementations have
resulted in relatively high latencies for uncontested locks. A mechanism and
methodology is therefore desirable that may exploit the NUCA nature of a multiprocessing system to optimize spin-lock operations without introducing significant
latencies for uncontested locks. The RH lock is presented in Paper A and hierarchical backoff (HBO) locks in Paper B, all exploiting NUCAs. These lock implementations prioritize handing over the lock to a thread residing as close as possible to the previous owner, even if it has not been waiting the longest. The most extreme results reported showed a greater than two-fold speedup compared with any existing software-based lock, running standard benchmarks on our NUCA hardware.
With the advent of low-latency, high-bandwidth networking hardware, clusters
of workstations, blades or entry-level servers strive to offer the same processing power as high-end servers for a fraction of the cost. In such environments,
shared memory has been traditionally limited to software page-based distributed
shared-memory (DSM) systems, which control access to shared memory using
the memory’s page protection to implement coherence protocols. Unfortunately,
false sharing and interrupt- and/or poll-based asynchronous protocol messaging
force such systems to resort to weak consistency shared-memory models, which
complicate the shared-memory programming model. DSZOOM, introduced in Paper C, is a sequentially consistent [56], fine-grained, distributed, software-based shared-memory system that differs from most other software DSM implementations. The
DSZOOM system is an all-software solution that creates the “illusion” of a single
shared memory across the entire cluster using a software run-time layer, attached
between the application and the hardware. A full DSZOOM system is operational
today and it demonstrates a slowdown in the range of 30 percent compared to an
expensive hardware implementation of shared memory, both running unoptimized
code originally written for hardware-based multiprocessors [77]. The experiments with highly optimized programs are presented in Paper E, together with several low-level DSZOOM optimization techniques, including the write permission cache
(WPC) that can hide latencies for memory-store operations.
The rest of this thesis is organized as follows. First, in chapter 2, the common experimentation environment used throughout the thesis is briefly introduced.
NUCAs are discussed in more detail in chapter 3, and a set of NUCA-aware
locks is introduced. Computer clusters in general, and DSZOOM in particular,
are discussed in chapter 4. Possible improvements to the DSZOOM system are
proposed in chapter 5. Finally, a summary of the contributions is presented in a
compact form in chapter 6.
2. Experimentation Environment
In this chapter, the common experimentation environment (both hardware and
software) that is used throughout this work is summarized. In all included papers (A through E), the same experimentation hardware has been used since the start of this research in January 2000: a 2-node Sun WildFire machine (described in section 2.1). All papers also use well-known workloads from the SPLASH-2 benchmark suite [111], briefly presented in section 2.2. In addition, the NUCA-synchronization papers (A and B) introduce several simple benchmarks that are used to evaluate many properties of the software-based locking algorithms studied.
2.1. Hardware: Sun’s WildFire Prototype
The common experimentation hardware for all papers in this thesis is an experimental prototype multiprocessor called WildFire, built by Sun Microsystems.
The WildFire is a cache-coherent nonuniform memory access (CC-NUMA) architecture with unusually large nodes (up to 28 processors per node). The individual
nodes in the WildFire design are Sun E series multiprocessors (E6x00, E5x00,
E4x00, or E3x00). The measurements in this thesis are all done with two E6000
multiprocessors as the nodes. WildFire can connect two to four multiprocessors
by replacing one dual-processor (or I/O) board with a WildFire Interface (WFI)
board, yielding up to 112 processors (4×28). The WFI board supports one coherent address space across all four multiprocessor nodes (it plugs into the bus
and sees all memory requests). Each WFI has three ports that connect to up to
three additional WildFire nodes, each with a raw bandwidth of 800 MB/sec in
each direction (Figure 2.1). This type of machine is usually referred to in the literature as a hardware DSM. The hardware configuration used in this thesis is as follows:
❏ E6000 nodes. The individual nodes in our WildFire configuration are
Sun Enterprise E6000 symmetric multiprocessors [96]. The server has 16
UltraSPARC II (250 MHz) processors and 4 GB uniformly shared memory
with an access time of 330 ns (lmbench latency [67]) and a total bandwidth
of 2.7 GB/sec. Each processor has a 16 kB on-chip instruction cache, a
16 kB on-chip data cache, and a 4 MB second-level off-chip data cache.
❏ WildFire configuration. The configuration used in this thesis is a 2-node machine, built from two E6000 nodes with 16 processors each. The
access time to local memory is the same as above, 330 ns, while accessing
data located in the other E6000 node takes about 1700 ns (lmbench latency). The entire system runs a slightly modified version of the Solaris 2.6
operating system. The WildFire system is a highly configurable machine
and is well suited for research environments. To read more on WildFire, see
Hagersten and Koster [40], Noordergraaf and van der Pas [72], or Hennessy
and Patterson [42].
Figure 2.1.: Sun's WildFire prototype connects up to four Sun E series multiprocessors by inserting one WFI board in each node. I/O boards are not shown in this figure.
2.2. SPLASH-2 Benchmark Suite
The benchmarks that are used in this thesis are well-known workloads from the
SPLASH-2 benchmark suite [111], originally developed for hardware multiprocessors. Only unmodified benchmarks from the original Stanford University distribution are used in this thesis. The applications are representative of the scientific, engineering, and computer graphics fields. A short description of each application
from the original SPLASH-2 paper by Woo et al. is given here:
❏ Barnes — This application simulates the interaction of a system of bodies
(galaxies or particles, for example) in three dimensions over a number of
time steps, using the Barnes-Hut hierarchical N-body method.
❏ Cholesky — The blocked sparse Cholesky factorization kernel factors a
matrix into the product of a lower triangular matrix and its transpose.
❏ FFT — The FFT kernel is a complex 1-D version of the radix-√n six-step FFT algorithm described in [7]. This kernel is optimized to minimize
interprocessor communication.
❏ FMM — Like Barnes, the FMM application also simulates a system of
bodies over a number of timesteps. However, it simulates interactions in
two dimensions using a different hierarchical N-body method called the
adaptive Fast Multipole Method [37].
❏ LU-c — Blocked LU decomposition with contiguous allocation of data. It
factors a dense matrix into the product of a lower triangular and upper
triangular matrix. It is a more optimized version of LU-nc. See [113] for more
details.
❏ LU-nc — A classical method for blocked LU decomposition. See [113] for
more details.
❏ Ocean-c — The Ocean application studies large-scale ocean movements
based on eddy and boundary currents. It is a more optimized version of Ocean-nc. See [112] for more details.
❏ Ocean-nc — The Ocean application studies large-scale ocean movements
based on eddy and boundary currents. See [112] for more details.
❏ Radiosity — This program computes the equilibrium of light in a scene
using the iterative hierarchical diffuse radiosity method [41].
❏ Radix — The integer radix sort kernel is based on the method described
in [13].
❏ Raytrace — This application renders a three-dimensional scene using ray
tracing. See [94] for more details.
❏ Volrend — This application renders a three-dimensional volume using a
ray casting technique. See [71] for more details.
❏ Water-Nsquared — Water simulation without spatial data structure.
This application evaluates forces and potentials that occur over time in
a system of water molecules. See [95] for more details.
❏ Water-spatial — Water simulation with spatial data structure. This application solves the same problem as Water-nsq, but uses a more efficient
algorithm. See [95] for more details.
3. NUCA-Aware Locks
Shared-memory architectures with a nonuniform memory access (NUMA) time
to the shared memory are gaining popularity [32, 40, 58, 59, 65]. This thesis
identifies another important property of NUMAs: nonuniform communication
architecture (NUCA). Most systems that form NUMA architectures also have
the characteristic of a NUCA, in which the access time from a processor to other
processors’ caches varies greatly depending on their placement (see Figure 3.1 for
an example).
Figure 3.1.: NUCA architecture. (Cache-to-cache transfers within a node have a normalized cost of 1; transfers to a cache in another node, through the switch, cost 2–10, the NUCA ratio.)
A cache-to-cache transfer from a cache in a remote node is X times slower than
a transfer from a cache in the same node; X is called the NUCA ratio. Several
NUCA examples are shown in Table 3.1 with their NUCA ratios.
Stanford DASH is one of the first cache-coherent NUCA machines [59], introduced in 1992. Each DASH node consists of four processors connected by a
snooping bus. DASH is better known as a traditional node-based NUMA machine. In node-based NUMA systems in particular, processors have much shorter access times to other caches in their group than to the rest of the caches. Large servers, built from several CMP and/or SMT chips, can therefore be expected to form NUCAs, since collocated threads will most likely share an on-chip cache at some
level [8]. It is possible that several levels of nonuniformity will be present in future
large-scale servers, e.g., one of today’s NUMA architectures populated with CMP
processors instead of traditional single-threaded processors. This would create a
hierarchical NUMA and NUCA property of the system.
Many of today’s applications exhibit a large fraction of cache-to-cache misses [9,
50]. Optimizations which consider the NUCA nature of a system may lead to
significant performance enhancements.
In general, spin-lock operations are associated with software locks that are used
by programs to ensure that only one parallel process at a time can access a critical
region of memory. In some cases, several tens of percent of the performance can be lost on contended locks.
Year     NUCA Example                NUCA Ratio
1992     Stanford DASH [59]          ≈ 4.5
1996     Sequent NUMA-Q [65]         ≈ 10
1999     Sun WildFire [40]           ≈ 6
2000     Compaq DS-320 [32]          ≈ 3.5
Future   CMP [73] and/or SMT [106]   ≈ 6–10

Table 3.1.: NUCA examples.
The synchronization-research boom took place in the late 1980s and early
1990s. A variety of lock implementations have been proposed, ranging from
simple spin-locks to advanced queue-based locks [23, 66, 68]. However, the complicated software queuing locks are less efficient for uncontested locks, which has led to the creation of even more complicated adaptive hybrid proposals in the quest for a general-purpose solution [64]. Although simple spin-locks can create very bursty traffic, they are still the most commonly used software locks within computer systems. The major drawback of the simple spin-lock implementations is that they suffer from poor performance at high contention: the more contested the critical section gets, the lower the rate at which new threads can enter it. The lock-unlock solutions presented so far do not consider or exploit the NUCA characteristics of a shared-memory computer system. Several NUCA-aware locking primitives are presented in Papers A and B: the RH lock and three hierarchical backoff (HBO) locks that are aware of the machine's NUCA property.
3.1. The RH Lock
The RH lock is our first NUCA-aware lock proposal that exploits 2-node NUCAs
(see paper A, section A.5 for pseudo code). The goal of the RH lock is (1) to
create a lock that minimizes the global traffic generated at lock handover and (2)
to maximize the node locality of NUCA architectures. In the RH lock proposal,
every node contains a copy of the lock. The total lock storage for a 2-node case
is thus 2×sizeof(lock). The global traffic at lock handover is minimized by
simply making sure that only one thread per node (the “node winner”) performs
remote-spinning if the lock is currently not owned by a thread in the node. The
node locality of NUCAs is increased by handing over the lock to another thread
running in the same node. This not only cuts down the lock-handover time,
but creates locality in the critical section work, since its data structures already
reside in the node. The RH lock is implemented on the SPARC V9 architecture
with three atomic operations: tas, swap, and cas (see section A.3.1 for details).
This lock could easily be adapted to noncoherent systems as well, assuming the
existence of remote put, get, and atomic operations in the networking hardware.
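To make the idea concrete, the sketch below shows a generic two-level lock in C11 atomics that captures the two RH goals: per-node lock state, at most one remote spinner per node (the "node winner"), and in-node handover when local waiters exist. It is an illustration only, not the RH algorithm from paper A (which uses the SPARC tas/swap/cas primitives and different lock-word encodings); the names hier_lock_t, acquire, and release are ours, and a production lock would also cap consecutive in-node handovers to limit the starvation risk discussed in section 3.2.

    #include <stdatomic.h>

    #define MAX_NODES 2

    typedef struct {
        atomic_uint ticket;      /* node-local ticket lock: next ticket to hand out */
        atomic_uint grant;       /* node-local ticket lock: ticket now being served */
        atomic_int  inherited;   /* 1 = global lock handed over within this node    */
    } node_part_t;

    typedef struct {
        atomic_int  global_held; /* 0 = free, 1 = held by some node                 */
        node_part_t node[MAX_NODES];
    } hier_lock_t;               /* zero-initialize before use                      */

    void acquire(hier_lock_t *l, int node_id)
    {
        node_part_t *n = &l->node[node_id];

        /* Compete only with threads in the same node; this spinning stays local. */
        unsigned my_ticket = atomic_fetch_add(&n->ticket, 1);
        while (atomic_load(&n->grant) != my_ticket)
            ;                                       /* spin locally */

        /* If the previous owner in this node kept the global lock for us, done. */
        if (atomic_exchange(&n->inherited, 0))
            return;

        /* Otherwise this thread is the node winner and spins on the global lock,
           the only spinning that generates remote traffic in a NUCA. */
        while (atomic_exchange(&l->global_held, 1))
            ;                                       /* spin (possibly remotely) */
    }

    void release(hier_lock_t *l, int node_id)
    {
        node_part_t *n = &l->node[node_id];
        unsigned next = atomic_load(&n->grant) + 1;

        if (atomic_load(&n->ticket) != next) {
            /* A waiter in the same node exists: keep the global lock and hand over
               locally, preserving node affinity for the critical-section data. */
            atomic_store(&n->inherited, 1);
            atomic_store(&n->grant, next);
        } else {
            atomic_store(&l->global_held, 0);       /* no local waiter: free it all */
            atomic_store(&n->grant, next);
        }
    }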
3.2. HBO Locks
There are a few drawbacks with this original RH lock proposal. First, the lock implementation is vulnerable to starvation. Second, the implementation is not particularly portable, since the allocation and physical placement of memory in different nodes is sometimes a difficult or even impossible task on many machines.
Finally, this proof-of-concept implementation supports only two NUCA nodes.
The work on NUCA-aware locks continues in Paper B, which presents a set of simple hierarchical backoff (HBO) locks based on traditional software spin-lock implementations with exponential backoff. All HBO proposals use only one atomic operation (cas), and all of them depend on per-thread/process node_id information. Currently, there are three flavors of HBO locks:
1. The basic HBO lock, see section B.4.1
2. HBO lock with global traffic throttling (HBO_GT), see section B.4.2
3. HBO_GT with starvation detection (HBO_GT_SD), see section B.4.3
In summary, the goals for HBO locks are the following:
❏ Efficiently exploit communication locality (create node affinity) in a NUCA
❏ Handle contended locks well
❏ Introduce minimal overhead for uncontested locks (the most common case)
❏ Scale to many NUCA nodes
❏ Reasonable memory space requirements
❏ Minimize the potential risk of starvation (HBO_GT_SD)
❏ Simple and portable implementation
RH and HBO locks are compared with other software-based locks using simple synchronization microbenchmarks. In both Papers A and B, the total local and
global/remote traffic in the system is significantly reduced for NUCA-aware locks
compared with other software-based locks. An application study demonstrates superior performance for applications with high lock contention (especially Raytrace) when compiled with the RH and HBO locks, and competitive performance for the other
programs.
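As an illustration of the HBO idea (not the exact code from paper B), the sketch below uses a single cas-based lock word that stores the owner's node_id; a contender that sees an owner in another node backs off longer, so threads in the owner's node statistically win the handover. The constants and the simple delay loop are placeholders, and the real HBO locks use tuned exponential backoff; the _GT and _GT_SD variants add global-traffic throttling and starvation detection on top of this skeleton.

    #include <stdatomic.h>

    #define NODE_FREE        (-1)   /* lock word value when the lock is free    */
    #define BACKOFF_SAME      64    /* short backoff: owner is in our node      */
    #define BACKOFF_REMOTE  1024    /* long backoff: owner is in another node   */
    #define BACKOFF_CAP     8192

    typedef atomic_int hbo_lock_t;  /* holds NODE_FREE or the owner's node_id   */

    static void delay(int units)
    {
        for (volatile int i = 0; i < units; i++)
            ;                       /* crude calibrated spin */
    }

    void hbo_acquire(hbo_lock_t *lock, int my_node_id)
    {
        int backoff_same = BACKOFF_SAME, backoff_remote = BACKOFF_REMOTE;

        for (;;) {
            int owner = NODE_FREE;
            /* One cas both acquires the lock and, on failure, reveals the owner. */
            if (atomic_compare_exchange_strong(lock, &owner, my_node_id))
                return;

            if (owner == my_node_id) {          /* owner is a neighbor: be eager  */
                delay(backoff_same);
                backoff_same = backoff_same * 2 > BACKOFF_CAP
                             ? BACKOFF_CAP : backoff_same * 2;
            } else {                            /* owner is remote: back off hard */
                delay(backoff_remote);
                backoff_remote = backoff_remote * 2 > BACKOFF_CAP
                               ? BACKOFF_CAP : backoff_remote * 2;
            }
        }
    }

    void hbo_release(hbo_lock_t *lock)
    {
        atomic_store(lock, NODE_FREE);
    }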
4. DSZOOM — All-Software
Fine-Grained DSM
So far in the thesis, only hardware-based cache-coherent computer systems have been considered. Today, it is possible to build really large systems in only one box/cabinet; more than 100 processors is already a reality. Sometimes, this is still not enough for time-consuming applications, such as weather simulations. Connecting several of these hardware-coherent boxes together can result in even more scalable machines, traditionally called distributed shared memory (DSM) systems. The Sun WildFire machine, introduced in section 2.1, is a great example of one such hardware-based DSM system. In this chapter, we search for a scalable and cost-effective software-based DSM solution. This will introduce yet another level in the memory hierarchy, a level between the computing nodes that is kept coherent with software. This chapter starts with a short introduction to computer clusters and the cache coherence problem in general.
4.1. Computer Clusters
There are many different types of computer clusters. Clusters have been used
since Tandem introduced them in 1975 for fault tolerance. Digital also offered
VAX clusters in 1983. Generally speaking, a computer cluster is a group of
individual systems (PCs, workstations, blades, or multiprocessors/servers) that
typically behave and perform as a single system under one management framework. A cluster environment consists of a number of computing nodes, a cluster
interconnect component, and cluster software. For example, Sun’s latest “supercluster” is defined below [103].
❏ Compute nodes: Each server in the cluster environment, sometimes called
a compute node, has one or more processors, data resources, and its own
memory, operating system, as well as application software. A node can be a
single server or a domain1 in a high-end server. SMP servers provide large
memory, processing and data movement capacity as a versatile building
block of the cluster.
1 In many Sun servers, like the Sun Fire 6800, Sun Fire 12K, and Sun Fire 15K servers, processors can be partitioned into a number of completely independent domains, each of which runs a separate copy of the Solaris Operating System and participates as a member in the cluster.
❏ Cluster interconnect: Cluster nodes are connected to each other by the
cluster interconnect. The cluster interconnect carries the message and data
traffic between the nodes. The interconnect fabric and the communication
protocol form the network.
❏ Cluster software: Distributed resources and capabilities of individual
cluster nodes are managed collectively and made available to the users
and applications as a single resource by the cluster software. The cluster
software determines the purpose and function of the cluster.
The supercluster is available in 2-, 4-, and 8-node configurations. With up to
eight Sun Fire 15K servers, clusters based on Sun Fire Link [97] offer almost 2
TFLOPS peak performance—a total of about 800 UltraSPARC III Cu processors.
However, this system does not currently support the global view of the common
shared memory; users are instead forced to use a message-passing programming interface (MPI) during the development of their parallel applications.
One major complication associated with distributed shared memory in multiprocessing computer systems relates to maintaining the coherency of program
data shared across multiple nodes in a traditional DSM implementation. In general, the system must implement an “ordering policy” that defines an order of
operations (in particular memory operations) initiated by different processors.
During the execution of a system’s workload, cache lines often move between
the nodes. This “movement” needs to be performed such that operations on
the cache line occur in a manner that is consistent with the ordering/memory
model. Without a “coordination mechanism,” one of the processors may perform
an update that is not properly reflected in another node. Maintaining a unified,
coherent view of shared-memory locations is thus essential from the standpoint
of program correctness. In other words, a memory system is coherent if any read
of a data item returns the most recently written value of that data item. For
a more formal definition of cache coherency, see Hennesey and Patterson [42].
The protocols to maintain coherence for multiprocessor systems are called cache
coherence protocols.
4.2. Beowulfs
Recently, so-called Beowulf clusters have gained popularity [99]. Beowulf systems
are built from several PCs, workstations, or blades, connected by a “standard”
interconnect. Each cluster node runs its own operating system, and there is no
globally shared data. The Beowulf model suits certain classes of applications well,
especially many independent jobs running in parallel.2 Some loosely coupled parallel algorithms can also run well on Beowulfs. Thus, Beowulf systems are here to
2 For example, Google is one of the largest and most famous installations in the world today.
stay for many years to follow. However, while this is a great price/performance
choice for many applications, it is in many ways a step backwards in terms of technology. Beowulf clusters do not efficiently support the popular abstraction of a
common memory, shared in a coherent way among the cluster nodes, and cannot execute certain common parallel programming paradigms and languages. Basically, Beowulf clusters force the applications that run across several processors to
explicitly communicate shared data. This can be a cumbersome task, especially
for dynamically scheduled and load-balanced systems. It also forces programmers
to describe the problem using a message-passing paradigm even though it may
be best described using well-established shared-memory constructs. One popular solution that can run shared-memory applications on such clusters, originally called shared virtual memory (SVM), was presented by Li and Hudak [62]. The first SVM prototype was implemented on an Apollo ring in the mid 1980s. In this thesis, such systems are called software-based distributed shared-memory systems, or SW-DSMs.
4.3. Software-Based Distributed Shared Memory
Traditional software-based shared memory relies on the page protection system
to detect when a coherence activity is needed [10, 29, 44, 45, 47, 52, 62, 60, 61, 63,
98, 100]. A page can be put in the exclusive state in a node, giving the processors
of that node write permission to the page while the processors of the other nodes
would have no access rights to the page and would trap the first time they try
to access it. The trap will wake up the software protocol engine of that node and
send a message to a dedicated home node’s protocol engine requesting a copy of
the page. The home node would forward the request to the protocol engine of the
node with the exclusive copy of the page, which would downgrade its access right to the page to read-only and send a copy of the page to the requesting
node’s protocol engine. There are three major drawbacks of this scheme: (1) the
interrupt-based asynchronous messages between the protocol engines add substantial latency; (2) the large coherence units (i.e., pages), instead of the commonly used smaller cache-line size, create ample false sharing; and (3) each protocol action requires a large quantity of data to be transferred. Another major
drawback is that the synchronization is slow because it is typically implemented
through explicit messages.
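The access-detection part of such a page-based scheme can be sketched with POSIX primitives as below; the messaging itself (the hypothetical fetch_page_from_home and the directory bookkeeping behind it) is exactly the interrupt-driven machinery whose latency the rest of this chapter tries to remove. A real SW-DSM would also distinguish read from write faults and downgrade pages on remote requests; this is only a minimal illustration of the detection mechanism.

    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL                 /* coherence unit = one VM page */

    /* Hypothetical protocol call: fetches a copy from the page's home node. */
    extern void fetch_page_from_home(void *page);

    static void dsm_fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SIZE - 1));

        fetch_page_from_home(page);                      /* message to home node */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* grant local access */
    }

    /* Shared pages start with no access rights, so the first touch traps. */
    void dsm_init_region(void *base, size_t len)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = dsm_fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(base, len, PROT_NONE);
    }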
4.4. The DSZOOM System
A software-based distributed shared-memory proposal and implementation presented in Paper C, DSZOOM [74, 75, 76, 77], differs from most other software-based
DSM implementations in five major areas:
1. Fine-grain coherence: Instead of relying on page-based coherence units,
the code instrumentation [39, 57] technique is used to expand load and store
instructions that may touch shared data into a sequence of instructions that performs in-line coherence checks (see the sketch after this list). This work is originally inspired by
DEC’s Shasta [86, 85, 83] and Wisconsin’s Blizzard-S [90, 89] proposals
from the mid 1990s, which are two cost-effective fine-grain software DSM
implementations.
2. Protocol in requesting processor: If the coherence check determines
that global coherence action is required, the entire protocol is run in the
requesting processor, which otherwise would have been idle waiting for the
coherence action to be resolved.
3. No asynchronous interrupts: Since the entire coherence protocol is run
in the requesting processor/node, there is no need to send asynchronous
interrupts to “coherence agents” in other nodes, i.e., this also removes the
overhead of asynchronous interrupts from the remote latency timing path.
4. Deterministic coherence protocol: The global coherence protocol is
designed to avoid all traditional corner cases of cache coherence protocols.
Only one thread at a time can have the exclusive right to produce global
coherence activity at each piece of data. This greatly simplifies the task of
designing an efficient and correct coherence protocol.
5. Thread-safe protocol implementation: While most other implementations have a single protocol thread per node, our scheme allows several
threads in the same node to perform protocol actions at the same time.
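The sketch below gives a C-level view of the in-line coherence check mentioned in item 1 above. DSZOOM actually inserts SPARC assembler snippets, and the names MAGIC_INVALID and dszoom_load_miss, as well as the exact bit pattern, are ours: the invalid state of a memory line is encoded as a "magic" value (an IEEE NaN in DSZOOM), so in the common case only a compare and a rarely taken branch are added after the original load.

    #include <stdint.h>

    /* Illustrative bit pattern standing in for the IEEE NaN that DSZOOM writes
       into memory lines that are in the invalid state. */
    #define MAGIC_INVALID 0x7FF0DEADBEEF0BADULL

    /* Hypothetical slow path: runs the coherence protocol in the requesting
       processor, fetches the line from its current owner, returns the value. */
    extern double dszoom_load_miss(double *addr);

    static inline double dszoom_checked_load(double *addr)
    {
        double value = *addr;                      /* the original load            */

        union { double d; uint64_t bits; } u = { .d = value };
        if (u.bits == MAGIC_INVALID)               /* fast in-line coherence check */
            value = dszoom_load_miss(addr);        /* global coherence action      */

        return value;                              /* hit: compare + branch only   */
    }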
The DSZOOM technology should be applicable to a wide range of cluster implementations, even though this specific implementation is targeting SPARC-based
Solaris systems. It also assumes a cluster built from “standard” SMP nodes. This
allows each node to contain several processors, whose caches and memory content
are kept coherent within the node by a hardware-coherence protocol. The proposal assumes a high-bandwidth, low-latency cluster interconnect, supporting put, get, and atomic operations to a remote node's memory, but with no hardware support for globally coherent shared memory. Examples of cluster interconnects
are the common Beowulf SCI interconnects, Sun’s new “supercluster” Sun Fire
Link [97], and the emerging InfiniBand standard [46].
The proof-of-concept implementation on top of the Sun WildFire machine,
DSZOOM-WF, is presented in Paper C and in several earlier work-in-progress papers [74, 75, 76]. This is the first SW-DSM implementation that consistently
demonstrates stable performance that is comparable to much more expensive
hardware-based DSM implementations for a range of well-known shared-memory
applications. On average, the demonstrated slowdown is in the range of 30 percent compared to WildFire, both running SPLASH-2 benchmarks [111] that are
compiled without any optimizations and instrumented with the Executable Editing Library (EEL) [57].
The reason why unoptimized code is used in Papers C and D is because
of EEL, which is unable to instrument all types of loads and stores that are
placed in SPARC’s delay slots. In addition, the EEL library is not maintained
anymore. This makes it impossible to use EEL with modern compilers and operating systems. To overcome this problem, a new and simple SPARC assembler
instrumentation tool (SAIT) is introduced in Paper E. The SAIT can instrument a
highly optimized assembler output from Sun’s latest compilers for the newest UltraSPARC processors. The major limitation of this approach is that the source
code must be available, which is not always the case, especially not for the system
libraries or other commercial software components that might access the shared
data. Since there currently is no optimization after our instrumentation phase
with SAIT, the overhead from our inserted (unoptimized) snippet code fragments
has increased relative to the total execution time, even though the fraction of instructions that need to be instrumented in general seems to decrease with a higher
compiler optimization level. On average, the sequential instrumentation overhead
is around 107 percent for highly optimized code, and 49 percent for unoptimized
binaries. Possible improvements of this approach are discussed in chapter 5.
4.5. THROOM: Pthreads Support for DSZOOM
The original DSZOOM system is designed to run parallel applications written
with PARMACS macros [6], which are mainly used in academia. Our attempt to
run shared-memory programs written with industry-standard parallel constructs,
POSIX threads (Pthreads), is presented in Paper D. In the Pthreads model, all process data (text, static data, heap, and stack) is “global.” This is not the case for PARMACS macros, where the shared data must be explicitly allocated with a special function call. This is why the number of instrumented loads and stores is much higher in Paper D compared with the original PARMACS implementation. The
average runtime overhead compared to DSZOOM (paper C) is 65 percent for
8-processor runs and 78 percent for 16-processor runs.
4.6. Write Permission Cache
The store snippet is responsible for about half of the instrumentation overhead.
Reducing the load overhead would expose the efficiency of store instrumentation
further. Paper E introduces a write permission cache (WPC) that significantly
lowers instrumentation overheads for store instructions (as much as 45 percent for
LU-c, running on two nodes with eight processors each). The idea is the following:
when a thread has ensured that it has the write permission for a cache line, it
holds on to that permission hoping that following stores will be to the same cache
line (spatial locality). The address/ID of the cache line is stored in a dedicated
register. If indeed the next store is to the same cache line, the store snippet is
reduced to a conditional branch operation, i.e., no extra memory instructions need
to be added. On average, the sequential instrumentation overhead for SPLASH-2
programs instrumented with SAIT and with a 1-entry WPC is around 95 percent
for optimized binaries, and around 29 percent for programs that are compiled
without optimizations.
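In C terms, the effect of a 1-entry WPC on the store snippet looks roughly as follows. In DSZOOM the WPC tag lives in a dedicated register and the check is a couple of ALU instructions plus a branch; the names, the thread-local variable, and the 64-byte line size here are illustrative. As discussed in chapter 5, the held permission must also be released at synchronization points and on protocol conflicts, which this sketch omits.

    #include <stdint.h>

    #define LINE_SIZE 64UL                    /* coherence-unit size, illustrative */

    /* Hypothetical slow path: acquires global write permission for the line. */
    extern void dszoom_acquire_write_permission(void *line);

    static _Thread_local uintptr_t wpc_tag = (uintptr_t)-1;
                                              /* line currently held with write
                                                 permission (the 1-entry WPC)     */

    static inline void dszoom_checked_store(double *addr, double value)
    {
        uintptr_t line = (uintptr_t)addr & ~(LINE_SIZE - 1);

        if (line != wpc_tag) {                /* WPC miss: run the store protocol  */
            dszoom_acquire_write_permission((void *)line);
            wpc_tag = line;                   /* hold on to the permission         */
        }
        /* WPC hit: only the compare and branch above are added to the store. */
        *addr = value;
    }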
5. Future Work
5.1. Future Experimentation Platform
Since the middle of May 2003, we have had access to a Sun “supercluster” system [103].
In this setup, we can connect several variably sized domains of the 48-processor
Sun Fire 15K server [20], with UltraSPARC III Cu 900 MHz processors, through
the Sun Fire Link cluster interconnect [97]. This has the advantage that a large
variety of configurations, ranging from a few large nodes to many smaller nodes,
can be studied using identical hardware technology. The fact that this future
platform has much faster processors, 900 MHz instead of 250 MHz in our WildFire configuration, is an interesting experiment in itself. In our preliminary study,
we have noted that the relative instrumentation overhead for sequential execution decreases for faster processors (for example, the load snippet only adds ALU
operations that are sped up more than the memory operations for each generation of processors). In the rest of this section, two possible interconnects for
the DSZOOM system are briefly introduced: Sun Fire Link and the emerging
InfiniBand proposal.
5.1.1. Sun Fire Link
Sun Fire Link is a high-performance cluster interconnect for the Sun Fire 6800,
Sun Fire 12K, and Sun Fire 15K servers. Currently, this is Sun’s highest performing cluster interconnect, which uses a Solaris OS feature called remote shared
memory (RSM), whereby memory regions on one machine can be mapped into
the address space of another machine [97]. The interconnect supports kernel
bypass messaging via RSM, where the ordinary UltraSPARC processor’s Block
Load/Store instructions must be used to read/write to remote memory. The remote shared-memory interface library (librsm) provides an interface for OS-bypass messaging for applications over the Sun Fire Link hardware and similar high-speed interconnects, such as Sun's implementation of SCI.
5.1.2. InfiniBand
The emerging InfiniBand interconnect proposal supports efficient user-level accesses to remote memory (RDMA) as well as atomic operations to smaller pieces
of data, e.g., CmpSwap (Compare and Swap) and FetchAdd (Fetch and Add) [46].
InfiniBand is designed to provide significantly higher levels of reliability, availability, performance, and scalability than alternative server I/O technologies, and
it is expected to become the interconnect of choice in the future for clustering
applications.
5.2. Possible Improvements
During the development of DSZOOM, new ideas have emerged, some of which have been prototyped in order to evaluate their potential. We believe that several of these new techniques could make DSZOOM more mature and more effective. While it is premature to prioritize and decide which improvement areas would be most beneficial, the following list of possible improvements gives an indication of the possibilities.
❏ Synchronization. We have proposed a set of NUCA-aware locks that
perform as well as the simpler locks at low contention, but perform better
and better the more contention there is [79, 78, 80]. Distributed shared
memory on a Beowulf is certainly a system where the access time to a
neighbor's thread is much shorter than to a thread in a remote node. We
would like to try out similar hierarchical locks in the DSZOOM system. Fast
and contention-free barriers proposed by Scott and Mellor-Crummey [92]
could also improve the performance of several barrier-intensive applications.
❏ Optimize instrumented code. Our instrumentation tool allows us to instrument applications compiled with industry-quality optimizing compilers.
Since there currently is no optimization after the instrumentation phase,
instruction scheduling techniques could be used to reduce the overhead of
the inserted instrumentation. Alternatively, the instrumentation could be
integrated with the higher levels of a compiler, such as the GNU’s gcc compiler, an OpenMP compiler, or the JIT code generator of a Java system,
and rely on their optimization phases.
❏ Trap-based load checks. Instead of storing the “magic” value (IEEE
floating-point NaN) in each memory line in the invalid state, one could tag
them to generate a trap when they get loaded. This could possibly be done
using the page protection system, similarly to the traditional SW-DSMs, by
writing a magic double-bit ECC error into the line, or by introducing some
kind of tagged memory, similar to the old Lisp machines. A trap would only
be generated on a load miss, which is why the overhead from load instrumentation
would be completely removed. The latency from taking a trap (in the order
of 100 ns) would only be added to the remote access time (in the order of
a microsecond). Similar techniques were tried out in the Wisconsin Wind
Tunnel project [88].
❏ Optimizing store instrumentation. We have performed some initial
experiments and shown that there is a large amount of spatial locality for
memory-store operations and that a WPC with two entries of 256 bytes each
would reduce the instrumentation overhead by several factors [38]. However, using a WPC also raises some correctness, liveness, and performance
concerns. The WPC entries have to be released at synchronization points,
on failure to acquire a directory entry or MTAG, and at thread termination.
This allowed us to correctly and efficiently run all the applications in the
SPLASH-2 benchmark suite. However, this is clearly not sufficient for more
general cases, and more attention should be given to this matter.
❏ Update-based protocols. In a shared-memory architecture, the implicit
communication, caused by several threads reading and writing the same
memory location, typically results in “cache misses” caused by the invalidates and downgrades performed by the coherence protocol. In today’s
DSZOOM, implicit communication will result in costly remote coherence
activity. Instead of implementing an invalidate protocol in DSZOOM, we
have made some initial experiments with a push-based protocol—a kind
of update-based protocol. The push-based protocol will selectively write
updated copies of data into the other node’s memories upon changes to
shared data. This will result in a shared-memory implementation that
always finds an up-to-date copy of the data in its remote memory. The
advantage of this scheme is that the latency of the interconnect becomes irrelevant for the execution time, however, if not handled with care the update
protocol may increase the interconnect traffic—trading latency-sensitivity
for bandwidth-sensitivity. Another advantage is that loads do not have
to be instrumented, since they always return up-to-date data. The WPC
is a component here as well. Normally, an update-based protocol will send
update messages for every store to shared data. Using the WPC, the entire
cache line is copied to the other nodes. It surprised us that a “quick-and-dirty” implementation of the above scheme resulted in a less than 20 percent
overhead for DSZOOM’s execution time compared with hardware shared
memory, for about one quarter of the applications studied (using optimized
compilers). There are many options to explore in such an implementation. It is interesting to note that a strong memory consistency model,
such as Total Store Ordering (TSO) [107], can be implemented efficiently
in this push-based scheme, if all remote copies of data are invalidated when
a write-permission is obtained. Another challenge would be to implement
a selective update scheme instead of the “brute-force” approach used in our
quick experiment.
❏ Dynamically adapting protocols. As shown by our rapid push-based
implementation, some applications can benefit even from a brute-force implementation of an update protocol in combination with the write permission
cache. That implementation did not recognize the fact that the threads often divide the shared data between them. That way, some of the shared data get “privatized,” such that all loads and stores to that piece of data are always performed by the same thread. The brute-force update would create global traffic also for stores to privatized data. A dynamically adapting protocol could distinguish between communication data and privatized data, and use a push-based protocol for the communication data and an invalidate protocol for the privatized data.
6. Summary of the Contributions
In summary, this work includes novel research activities in these areas:
❏ Identifying a new property of CC-NUMAs: nonuniform communication
architecture (NUCA).
❏ Presenting several locking algorithms that exploit NUCAs; including the
RH lock and three hierarchical backoff (HBO) lock proposals.
❏ Improving the traditional microbenchmark synchronization algorithm, extending a sequential microbenchmark algorithm to NUCA machines, and
introducing a new microbenchmark algorithm that better models real applications.
❏ A novel software-based shared-memory proposal: DSZOOM.
❏ An implementation proposal thereof.
❏ A performance characterization of the proposed implementation.
❏ THROOM: a novel run-time system concept that allows Pthread applications
to run on noncoherent clustered architectures on top of the DSZOOM system.
❏ A new assembler instrumentation tool: the SPARC Assembler Instrumentation
Tool (SAIT).
❏ A latency-hiding technique for memory-store operations: write permission
cache (WPC).
Paper A
A. Efficient Synchronization for
Nonuniform Communication
Architectures
Zoran Radović and Erik Hagersten
Uppsala University, Department of Information Technology
P.O. Box 337, SE-751 05 Uppsala, Sweden
E-mail: {zoranr,eh}@it.uu.se
In Proceedings of Supercomputing 2002 (SC2002), Baltimore, Maryland, USA, November 2002.
0-7695-1524-X/02 $17.00 © 2002 IEEE
Abstract
Scalable parallel computers are often nonuniform communication architectures
(NUCAs), where the access time to other processors' caches varies with their physical location. Still, few attempts at exploiting cache-to-cache communication locality have been made. This paper introduces a new kind of synchronization primitive (lock-unlock) that favors neighboring processors when a lock is released. This
improves the lock handover time as well as access time to the shared data of the
critical region.
A critical section guarded by our new RH lock takes less than half the time to
execute compared with the same critical section guarded by any other lock on our
NUCA hardware. The execution time for Raytrace with 28 processors was improved 2.23–4.68 times, while global traffic was dramatically decreased compared
with all the other locks. The average execution time was improved 7–24% while
the global traffic was decreased 8–28% for an average over the seven applications
studied.
A.1. Introduction
There are plenty of examples in academia and industry of shared-memory architectures with a nonuniform memory access time to the shared memory (NUMA).
Most of the NUMA architectures, but not all, also have nonuniform communication architectures (NUCA). This means that the access time from a processor to other processors' caches varies greatly depending on their placement. In particular, node-based NUCAs, in which a group of processors have a much shorter
access time to each other’s caches than to the other caches, are common.
Recently, technology trends have made it attractive to run more than one thread on a chip, using the chip multiprocessor (CMP) and/or the simultaneous multithreading (SMT) approach. Larger servers, built from several of
those chips, can therefore be expected to be NUCA architectures, since collocated threads will most likely share an on-chip cache at some level [8]. In our
opinion, there are strong indications that many important architectures in the
future will have a nonuniform access time to each other’s caches, as well as to
the shared memory.
NUMA optimizations have attracted much attention in the past. The migration and replication of data in NUMA systems have demonstrated a great
performance improvement in many applications [40, 72]. However, many of today’s applications show a large fraction of cache-to-cache misses [9], which is why
attention should also be given to the NUCA nature of the system.
The scalability of a shared-memory application is often limited by contention
for some critical section, often accessing some shared data, guarded by mutual
exclusion locks. The simpler, and most widely used, test&set lock implementations perform worse at high contention; i.e., the more contended the critical section gets, the worse the lock algorithm performs. This is mostly due to the
vast amount of traffic generated at the lock handover.
An application can often be rewritten to decrease the contention. This could,
however, be a complicated task. More advanced queue-based locks have been
proposed that have a slightly worse performance at light lock contention, but a
much better performance at high lock contention because less traffic is generated [68, 23, 66]. Furthermore, the queue-based locks maintain a first come, first
served order between the contenders. While queue-based locks have shown low
traffic and great scalability on many architectures, their first come, first served
property is less desirable on a NUCA architecture.
Three properties determine the average time between two threads entering the
contested critical section: lock handover time, traffic generated by the lock, and
the data locality created by the lock algorithm. We have noticed that the test&set
locks give an unfair advantage to processors in the NUCA node where the lock was last held. This will create more node locality and will partly make up for the larger amount of traffic generated by the test&set locks. The increased node locality will improve the lock handover time as well as the locality of the work in the critical section.
The goal of this work is to create a lock that minimizes the global traffic generated at lock handover, and maximizes the node locality of NUCA architectures.
The remainder of this paper is organized as follows. Section A.2 gives an
introduction to several machines with NUCA architectures. Background and
related work is presented in section A.3. The key idea behind the RH lock is given
in section A.4, and section A.5 presents the RH lock algorithm. In section A.6 we
present performance results obtained on a 32-processor Sun WildFire machine.
Finally, we conclude in section A.7.
A.2. Nonuniform Communication Architectures
Many large-scale shared-memory architectures have nonuniform access time to
the shared memory (NUMA). In order to make a key difference, the nonuniformity should be substantial, say at least a factor of two between the best case and the worst unloaded case. Most of the NUMA architectures also have a substantial difference
in latency for cache-to-cache transfer—a nonuniform communication architecture
(NUCA). A NUCA is an architecture where the unloaded latency for a processor
accessing data recently modified by another processor differs at least a factor of
two depending on where that processor is located.
DASH was the first NUCA architecture [59]. Each DASH node consists of four
processors connected by a snooping bus. A cache-to-cache transfer from a cache
in a remote node is 4.5 times slower than a transfer from a cache in the same
node. We call this the NUCA ratio. Sequent’s NUMA-Q has a similar topology,
but its NUCA ratio is closer to ten [65]. Both DASH and NUMA-Q have a remote
access cache (RAC) located in each node that simplifies the implementation of
the node-local cache-to-cache transfer.
NUCA architecture     NUCA ratio
Stanford DASH         ≈ 4.5
Sequent NUMA-Q        ≈ 10
Sun WildFire          ≈ 6
Compaq DS-320         ≈ 3.5
Future: CMP & SMT     ≈ 6–10
Sun’s WildFire system can have up to four nodes with up to 28 processors each,
totaling 112 processors [40]. Parts of each node’s memory can be turned into a
RAC using a technique called coherent memory replication (CMR). Accesses to
data allocated in a CMR cache have a NUCA ratio of about six, while accesses to
other data only have a minor latency difference between node-local and remote
cache-to-cache transfers.
Compaq’s DS-320 (which was also code-named WildFire) can connect up to
four nodes, each with four processors sharing a common DTAG and directory
controller [32]. Its NUCA ratio is roughly 3.5.
Future microprocessors can be expected to run many more threads on a chip
by a combination of CMP and SMT technology. This can already be seen in the
Pentium 4’s Hyperthreading and the IBM Power4’s dual CMP processors on a
chip. The Piranha CMP proposal expects 8 CMP threads to run on each chip [8].
Larger systems, built from many such CMPs, are expected to have a NUCA ratio
of between six and ten depending on the technology chosen.
Not all architectures are NUMAs or NUCAs. The recent SunFire 15k architecture can have up to 18 nodes, each with four processors, memory and directory
controllers [20]. The nodes are connected by a fast backplane. It has a flavor of
both NUMA and NUCA. However, both its NUMA and NUCA ratios are well
below two. The SGI Origin 2000 is a NUMA architecture with a NUMA ratio of
around three for reasonably sized systems [58]. However, it does not efficiently
support cache-to-cache transfers between adjacent processors and has a NUCA
ratio below two.
A.3. Background and Related Work
Ideally, synchronization primitives should provide high performance under both
high and low contention without requiring substantial programmer effort. Mutual
exclusion (lock-unlock) operations can be implemented in a variety of different
ways, including: atomic memory primitives; nonatomic memory primitives (load-linked/store-conditional); and explicit hardware lock-unlock primitives (CRAY's Xmp lock registers, DASH's lock-unlock operations on directory entries, or queue-on-lock-bit, QOLB). We will concentrate on implementing locks with only atomic
primitives. Explicit hardware primitives are not currently popular on modern
machines.
The five synchronization primitives we discuss and directly compare in this
paper are: test&test&set (abbreviated TATAS), test&test&set with exponential
backoff (abbreviated TATAS_EXP), queue-based locks of Mellor-Crummey and
Scott (abbreviated MCS), queue-based locks of Craig, Landin, and Hagersten
(abbreviated CLH), and RH lock (our new NUCA-aware lock). We also present
a short introduction to alternative synchronization approaches: reactive synchronization and an aggressive queue-on-lock-bit (QOLB) hardware scheme.
A.3.1. Atomic Primitives
In this paper we make reference to three atomic operations: (1) tas(address)
atomically writes a nonzero value to the address memory location and returns
its original contents; a nonzero value for the lock represents the locked condition,
while a zero value means that the lock is free; (2) swap(address, value) atomically
writes a value to the address memory location and returns its original contents;
(3) cas(address, expected_value, new_value) atomically checks the contents of
a memory location address to see if it matches an expected_value and, if so,
replaces it with a new_value.
The IBM 370 instruction set introduced cas. Sparc V9 provides tas, swap,
and cas and is our target architecture for this paper.
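As a hedged aside, the semantics of these three primitives could be expressed in portable C roughly as sketched below, using GCC's __sync builtins; this is only an illustration of their behavior, not the paper's implementation, which uses the Sparc V9 ldstub, swap, and cas instructions directly.

static inline unsigned long tas(volatile unsigned long *address)
{
    /* atomically write a nonzero value and return the old contents */
    return __sync_lock_test_and_set(address, 1UL);
}

static inline unsigned long swap(volatile unsigned long *address,
                                 unsigned long value)
{
    /* __sync_lock_test_and_set acts as an atomic exchange on most targets */
    return __sync_lock_test_and_set(address, value);
}

static inline unsigned long cas(volatile unsigned long *address,
                                unsigned long expected_value,
                                unsigned long new_value)
{
    /* return the old contents; the store happens only if they matched */
    return __sync_val_compare_and_swap(address, expected_value, new_value);
}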
A.3.2. Simple Lock Algorithms
Traditionally, the simple synchronization algorithms tend to be fast when there
is little or no contention for the lock, while more sophisticated algorithms usually
have a higher cost for low-contention cases. On the other hand, they handle
contention much better. In this section we describe two still very commonly used
busy-wait algorithms: TATAS and TATAS_EXP.
Rudolph and Segall first proposed an extension to ordinary test&set (this was
the sole synchronization primitive available on numerous early systems, such as
the IBM 360 series) that performs a read of the lock before attempting the actual
atomic tas operation [82]. A typical TATAS algorithm is shown below.
typedef unsigned long bool;
typedef volatile bool tatas_lock;

 1: void tatas_acquire(tatas_lock *L)
 2: {
 3:   if (tas(L)) {
 4:     do {
 5:       if (*L)
 6:         continue;
 7:     } while (tas(L));
 8:   }
 9: }
10:
11: void tatas_release(tatas_lock *L)
12: {
13:   *L = 0;
14: }
This is the most basic busy-wait algorithm, in which a process (or thread) repeatedly attempts to change a lock value L from false/zero to true/nonzero using an atomic hardware primitive. In the Sparc V9 instruction set, this is typically done with the load-store unsigned byte (LDSTUB) instruction. Traditional test&set-based spin locks are vulnerable to memory and interconnect contention, and do not scale well to large machines. This contention can be reduced by polling (busy-wait code) with ordinary load operations to avoid generating expensive stores to a potentially shared location (lines 4–6 in the code above). Furthermore, the burst
of refill traffic whenever a lock is released can be reduced by using an Ethernet-style exponential backoff algorithm in which, after a failure to obtain the lock,
a requester waits for successively longer periods of time before trying to issue
another lock operation [2, 68]. The delay between tas attempts should not be
too long; otherwise, processors might remain idle even when the lock becomes
free. This is the idea behind the TATAS_EXP lock. The acquire function of one
typical TATAS_EXP implementation is shown below.
 1: void tatas_exp_acquire(tatas_lock *L)
 2: {
 3:   int b = BACKOFF_BASE, i;
 4:
 5:   if (tas(L)) {
 6:     do {
 7:       for (i = b; i; i--) ; // delay
 8:       b = min(b * BACKOFF_FACTOR, BACKOFF_CAP);
 9:       if (*L)
10:         continue;
11:     } while (tas(L));
12:   }
13: }
Type definitions and release code are the same as in the TATAS example. Parameters BACKOFF_BASE, BACKOFF_FACTOR, and BACKOFF_CAP must be tuned by
trial and error for each individual architecture. We use the following settings in
our experiments, which are identical to the settings used by Scott and Scherer on
the same platform [93]:
BACKOFF_BASE      625
BACKOFF_FACTOR    2
BACKOFF_CAP       2,500
A.3.3. Queue-Based Locks
Even with exponential backoff, TATAS locks still induce significant contention.
Performance results using backoff with a real tas instruction on older machines
can be found in the literature [36, 68]. On a standard symmetric multiprocessor
(SMP) with uniform access times between all of the processors' caches in the
node, queue-based locks may eliminate these problems by letting each process
spin on a different local memory location. The first proposal for a distributed,
queue-based locking scheme in hardware was made by Goodman, Vernon, and
Woest [34] (see section A.3.4 for more details). Several researchers have proposed
locking primitives that incorporate both local spinning and queue-based locking
in software [1, 36, 68].
The acquire function of the software-based queue locks performs three basic
phases: (1) a flag variable in a shared address space is initialized to the value
BUSY; (2) the content at the lock location in memory is swapped with the address
value pointing to the flag; (3) the thread spins until the prev_flag memory
location, a pointer which was returned by the swap, contains the value FREE.
The release function of the queue-based locks writes a FREE value to the flag
location.
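A minimal CLH-style sketch of these three phases, written in the same pseudocode-like C style as the rest of the paper and using the swap primitive from section A.3.1, is shown below; the type and field names are our own illustration and are not code from any of the cited lock papers. It assumes that pointers fit in an unsigned long, as on Sparc.

typedef struct clh_node { volatile int must_wait; } clh_node;
typedef struct { clh_node *volatile tail; } clh_lock;

void clh_acquire(clh_lock *L, clh_node **my, clh_node **prev)
{
    (*my)->must_wait = 1;                         /* phase 1: flag := BUSY         */
    *prev = (clh_node *)swap((unsigned long *)    /* phase 2: enqueue by swapping  */
                             &L->tail,            /*   the lock word with a        */
                             (unsigned long)*my); /*   pointer to our own flag     */
    while ((*prev)->must_wait)                    /* phase 3: spin until the       */
        ;                                         /*   predecessor's flag is FREE  */
}

void clh_release(clh_node **my, clh_node **prev)
{
    (*my)->must_wait = 0;                         /* release: write FREE           */
    *my = *prev;                                  /* reuse the predecessor's node  */
}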
The locking primitive called MCS is one of the first software queue-based lock
implementations, originally inspired by the QOSB [34] hardware primitive proposed for the cache controllers of the Wisconsin Multicube in the late 1980s. The
MCS lock was developed by Mellor-Crummey and Scott [68]. During the acquire
request, the MCS lock inserts requesters for a held lock into a software queue
using atomic operations such as swap and cas. Mellor-Crummey and Scott also
describe another version of the MCS lock which only requires the swap operation.
Fairness in that case is no longer guaranteed, and the implementation is slightly
more complex.
Magnusson, Landin, and Hagersten proposed two software queue-based locking
primitives about three years after MCS, namely LH and M [66]. Craig independently developed a lock identical to LH [23]. In this paper we will refer to this lock
as the CLH lock. The CLH lock requires one fewer remote access to transfer a
lock than does MCS, and will usually outperform MCS when high lock contention
exists [66, 93]. The CLH lock achieves this behavior at the expense of increased
latency to acquire an uncontested lock. The M lock achieves the more efficient
lock transfer without increased uncontested lock access latency, at the expense
of significant additional complexity in the lock algorithm.
A.3.4. Alternative Approaches
The fact that some synchronization algorithms perform well under low-contention
periods and others under high-contention periods is the basic idea behind the "reactive synchronization" presented by Lim and Agarwal a couple of years after the
first proposals for queue-based locks [64]. Reactive synchronization algorithms
will dynamically switch among several software lock implementations. Typically,
spin locks (TATAS_EXP) are used during the low-contention phase, and queue-based locks (MCS) are used during the high-contention phase [49]. The goal of
reactive synchronization is to achieve both low latency lock access and efficient
lock handoff at low cost.
Very aggressive hardware support for locks has been proposed by Goodman, Vernon, and Woest [34]. They introduce the queue-on-lock-bit primitive (QOLB, originally called QOSB), which was the first proposal for a distributed, queue-based locking scheme. In this scheme, a distributed, linked list of nodes waiting
on a lock is maintained entirely in hardware, and the releaser grants the lock
to the first waiting node without affecting others. Furthermore, QOLB prevents
unnecessary network traffic or interference with the lock holder by letting the
waiting processors spin locally on a "shadow" copy. Effective collocation of the lock and the data it protects is also possible. Thus, this hardware scheme may reduce the lock handover time as well as
the interference of lock traffic with data access and coherence traffic.
Unfortunately, QOLB requires additional hardware support. Most synchronization primitives that we discussed in previous sections can be implemented
entirely in software, requiring only an atomic memory operation available in the
majority of modern processors. Detailed evaluation of all hardware requirements
for QOLB is presented by Kägi, Burger, and Goodman [49].
A.4. Key Idea Behind RH Lock
Queue-based locks implement first come, first served fairness, which is less desirable on a NUCA machine because of the potentially large fraction of node handoffs it causes. In other words, there is a risk that a contested lock
might “jump” back and forth between the nodes, creating an enormous amount
of traffic. The goal of the RH lock is to create a lock that minimizes the global
traffic generated at lock handover and maximizes the node locality of NUCA
architectures. To make this possible in our first proposal of a NUCA-aware lock,
it is necessary to know which thread is performing the acquire-release operation and in which node that thread is running.
Preliminaries. Every node contains a copy of the lock. The total lock storage
for a 2-node case is thus 2×sizeof(lock). Initially, we could decide to logically
place a lock in node 0 (mark the lock value as FREE in that node, meaning that
both threads from the local node or from another node can acquire the lock).
The copy of the lock that is allocated and placed in node 1 is then marked with
a REMOTE value, meaning that the “global” lock is in another node (node 0). At
most one node may have this local copy of the lock in state FREE. Thus, the
threads from node 1 would “see” a REMOTE tag if they tried to acquire the local
copy of the lock for the first time. The first thread that gets back the REMOTE
value is the “node winner” and is allowed to continue to spin remotely with a
larger backoff until the global lock is obtained. Other threads will spin locally
(on their local copy) until the lock is fetched and released by the node winner.
Minimizing global traffic. One way to cut down on lock traffic is to make sure
that only one thread per node (the node winner) tries to retrieve a lock which is
currently not owned by a thread in the node.
Maximizing the node locality of NUCAs. One way to increase locality is to
hand over the lock to another thread running in the same node. We will later
refer to this operation as marking the lock value with L_FREE tag. This not only
cuts down on the lock-handover time, but creates locality in the critical section
work, since its data structures already reside in the node.
Even if the first come, first served policy is too strong a requirement for a lock, it must guarantee some fairness and make sure that other nodes eventually get the lock even if there are always local requests for the lock.
A.5. The RH Lock
During the design phase of the RH lock we paid attention to several general
performance goals for locks, as given by Culler et al. [25], page 343:
❏ Low latency. If a lock is free and no other processors are trying to acquire it
at the same time, a processor should be able to acquire it with low latency.
❏ Low traffic. If many or all processors try to acquire a lock at the same time,
they should be able to acquire the lock one after the other with as little
generation of traffic or bus transactions as possible.
❏ Scalability. Neither latency nor traffic should scale quickly with the number
of processors used.
❏ Low storage cost. The information needed for a lock should be small and
should not scale quickly with the number of processors.
❏ Fairness. Ideally, processors should acquire locks in the order their requests
are issued. At the least, starvation or substantial unfairness should be
avoided. Since starvation is usually unlikely, the importance of fairness
must be traded off with its impact on performance.
In fact, we paid attention only to the first four goals and ignored the last one
(with the exception of avoiding starvation). We also paid attention to the data
locality created by our lock algorithm; in other words, our additional goal was to
maximize the node locality of NUCA architectures.
The following atomic operations are used in our current implementation: tas,
swap, and cas, which are all available in the Sparc V9 instruction set.
Our NUCA-aware lock algorithm is shown in Figures A.1 and A.2. The RH lock
algorithm supports only two nodes. my_tid is the thread identification number
(0, 1, 2, ..., maximum number of threads – 1), and my_node_id is the number of the node in which the thread is placed (0 or 1). Both my_tid and my_node_id must be thread-private values, preferably efficiently accessible from the rh_acquire and rh_release functions. In our implementation, we reserve one of the Sparc's thread-private global registers (%g2) for that particular task, which is the most efficient way to obtain my_tid and my_node_id. During the rh_acquire function,
every thread will swap its own thread identification number into the node-local
copy of the lock. If it happens that the lock is already in the node and is in the
state L_FREE or FREE (lines 6–7 in the rh_acquire function), the acquire operation
finishes and the thread can proceed with its critical section. If the lock value is
typedef volatile unsigned long rh_lock;
--------------------------------------------------------
 1: void rh_acquire(rh_lock *L)
 2: {
 3:   unsigned long tmp;
 4:
 5:   tmp = swap(L, my_tid);
 6:   if (tmp == L_FREE || tmp == FREE)
 7:     return;
 8:   if (tmp == REMOTE) {
 9:     rh_acquire_remote_lock(L);
10:     return;
11:   }
12:   rh_acquire_slowpath(L);
13: }
--------------------------------------------------------
 1: void rh_acquire_slowpath(rh_lock *L)
 2: {
 3:   unsigned long tmp;
 4:   int b = BACKOFF_BASE, i;
 5:
 6:   if ((random() % FAIR_FACTOR) == 0)
 7:     be_fair = TRUE;
 8:   else
 9:     be_fair = FALSE;
10:
11:   while (1) {
12:     for (i = b; i; i--) ; // delay
13:     b = min(b * BACKOFF_FACTOR, BACKOFF_CAP);
14:     if (*L < FREE)
15:       continue;
16:     tmp = swap(L, my_tid);
17:     if (tmp == L_FREE || tmp == FREE)
18:       break;
19:     if (tmp == REMOTE) {
20:       rh_acquire_remote_lock(L);
21:       break;
22:     }
23:   }
24: }
--------------------------------------------------------
 1: void rh_acquire_remote_lock(rh_lock *L)
 2: {
 3:   int b = REMOTE_BACKOFF_BASE, i;
 4:
 5:   L = get_remote_lock_addr(L, my_node_id);
 6:
 7:   while (1) {
 8:     if (cas(L, FREE, REMOTE) == FREE)
 9:       break;
10:     for (i = b; i; i--) ; // delay
11:     b = min(b * BACKOFF_FACTOR, REMOTE_BACKOFF_CAP);
12:   }
13: }
Figure A.1.: RH lock-acquire code.
 1: void rh_release(rh_lock *L)
 2: {
 3:   if (be_fair)
 4:     *L = FREE;
 5:   else {
 6:     if (cas(L, my_tid, FREE) != my_tid)
 7:       *L = L_FREE;
 8:   }
 9: }
Figure A.2.: RH lock-release code.
in the REMOTE state (lines 8–11), the function rh_acquire_remote_lock is called.
Otherwise, the lock is in the node and some other neighbor thread performs
the critical task. In that case, the current thread calls the rh_acquire_slowpath
function and spins locally until it succeeds with its own acquire operation. Of
course, there is one rare special case when another node is lucky enough to obtain
the lock before the current thread (line 19 in the rh_acquire_slowpath function).
Once again, in that case, the function rh_acquire_remote_lock is called.
To achieve controlled unfairness we use a thread-private be_fair variable that initially is TRUE. The random function (line 6 in the rh_acquire_slowpath function) uses a nonlinear feedback random-number generator. It returns pseudorandom numbers in the range from 0 to 2^31 − 1. If FAIR_FACTOR is equal to one, the RH lock will behave "as fairly as it can." During the rh_release operation, the thread first checks whether the be_fair variable is TRUE or not (line 3). If the thread-private be_fair is TRUE, the lock is released by writing the FREE value into the lock's place. Otherwise, the lock can be released only to local/neighbor threads if they have shown interest in it (line 7), or it can be released to the "world" by an atomic cas operation if no other thread from the same node has shown any interest in acquiring the same lock (line 6).
We use the following settings and definitions:
BACKOFF_BASE           625
BACKOFF_FACTOR         2
BACKOFF_CAP            2,500
REMOTE_BACKOFF_BASE    2,500
REMOTE_BACKOFF_CAP     10,000
FREE                   max. number of threads
REMOTE                 FREE + 1
L_FREE                 FREE + 2
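The figures also use a helper, get_remote_lock_addr, whose implementation is not shown in the paper. A possible sketch is given below, under the assumption that the two per-node copies of a lock are allocated one cache line apart, with each line homed in memory local to its node; the layout and the 64-byte line size are our assumptions, not details taken from the paper.

#define CACHE_LINE 64   /* assumed coherence unit */

/* Assumed layout: the copy for node i lives at base + i * CACHE_LINE. */
static inline rh_lock *get_remote_lock_addr(rh_lock *L, int my_node_id)
{
    unsigned long base = (unsigned long)L - (unsigned long)my_node_id * CACHE_LINE;
    return (rh_lock *)(base + (unsigned long)(my_node_id ^ 1) * CACHE_LINE);
}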
A.6. Performance Evaluation
Most experiments in this paper are performed on a Sun Enterprise E6000 SMP [96].
The server has 16 UltraSPARC II (250 MHz) processors and 4 Gbyte uniformly
shared memory with an access time of 330 ns (lmbench latency [67]) and a total
bandwidth of 2.7 Gbyte/s. Each processor has a 16 kbyte on-chip instruction
cache, a 16 kbyte on-chip data cache, and a 4 Mbyte second-level off-chip data
cache.
The hardware DSM numbers have been measured on a 2-node Sun WildFire
built from two E6000 nodes connected through a hardware-coherent interface with
a raw bandwidth of 800 Mbyte/s in each direction [40, 72].¹ The Sun WildFire
access time to local memory is the same as above, 330 ns, while accessing data
located in the other E6000 node takes about 1700 ns (lmbench latency). Accesses
to data allocated in a CMR cache have a NUCA ratio of about six, while accesses
to other data only have a minor latency difference between node-local and remote
cache-to-cache transfers. The E6000 and the WildFire DSM system are both
running a slightly modified version of the Solaris 2.6 operating system.
We have implemented the traditional TATAS lock and the RH lock using the
tas, swap, and cas operations available in the Sparc V9 instruction set. The
source code for the TATAS_EXP, CLH, and MCS locks was written by Scott and
Scherer [93]. The entire experimentation framework is compiled with GNU’s
gcc-3.0.4, optimization level -O1. The TATAS_EXP lock was previously tuned
for a Sun Enterprise E6000 machine by Scott and Scherer [93]. We use identical
values in our experiments.
A.6.1. Uncontested Performance
One important performance goal for locks is low latency [25]. In other words, if a
lock is free and no other processors are trying to acquire it at the same time, the
processor should be able to acquire it as quickly as possible. This is especially
important for applications with little or no contention for the locks, which is a
quite common case.
In this section we obtain an estimate of lock overhead in the absence of contention for three common scenarios. We evaluate the cost of the acquire-release
operation (1) if the same processor is the owner of the lock (lock is in its cache);
(2) if the lock is in the same node but the previous owner is not the current
processor (lock is in the neighbor’s cache), and (3) if the lock was owned by a
remote node (lock is in the cache of a processor that is in another node). The
pseudocode for this NUCA-aware microbenchmark is shown below.
¹ Currently, our system has 30 processors, 16 plus 14, and therefore we perform our experiments mainly on a 14 plus 14 configuration.
                      Previous Owner
Lock Type    Same Processor   Same Node   Remote Node
TATAS            135 ns          614 ns      2081 ns
TATAS_EXP        139 ns          668 ns      2014 ns
MCS              250 ns          722 ns      2192 ns
CLH              230 ns          827 ns      2623 ns
RH               178 ns          663 ns      4497 ns

Table A.1.: Uncontested performance for a single acquire-release operation for different synchronization algorithms.
 1: acquire and release all locks;
 2: BARRIER
    // case 1: previous owner: same processor
 3: if (my_tid == 0) {
 4:   acquire and release all locks;
 5:   start timer;
 6:   acquire and release all locks;
 7:   stop timer;
 8: }
 9: BARRIER
    // case 2: previous owner: same node
10: if (my_tid == 1) {
11:   start timer;
12:   acquire and release all locks;
13:   stop timer;
14: }
15: BARRIER
    // case 3: previous owner: remote node
16: if (my_tid == 2) {
17:   start timer;
18:   acquire and release all locks;
19:   stop timer;
20: }
For this experiment we need three threads. Threads with thread identification
numbers (tid) zero and one are executing inside cabinet 1 and thread two is
running inside cabinet 2. The total number of allocated locks is 2,000. The first
two lines of the microbenchmark are used to warm the TLBs and are executed
by all threads. Results are presented in Table A.1.
[Figure A.3: microbenchmark iteration time in microseconds versus number of processors for the TATAS, TATAS_EXP, MCS, CLH, and RH locks, panels (a) and (b).]
Figure A.3.: Traditional (a) and slightly modified (b) traditional microbenchmark iteration time for a single-node Sun Enterprise E6000.

Unsurprisingly, TATAS and TATAS_EXP are the fastest acquire-release operations for this simple test without contention. We observe that our low-latency design goal for the RH lock is met within reasonable margins for the same-processor and same-node cases. For the remote-node case, the RH lock performs much more poorly than the other locks. The reason for this is twofold:
(1) the rh_acquire function executed by the third thread will always acquire
the local copy of the lock before the “global” lock which is in another cabinet
(the rh_acquire_remote_lock function is always called), and (2) the rh_acquire
function generates some additional remote coherence traffic at line 5.
A.6.2. Traditional Microbenchmark
The traditional microbenchmark that is used by many researchers consists of a
tight loop containing a single acquire-release operation. The iteration time for
a single-node Sun Enterprise E6000 is shown in Figure A.3(a). The number of
iterations performed by every thread in this microbenchmark is 10,000. The
FAIR_FACTOR for the RH lock is equal to one. From Figure A.3(a) it appears that TATAS_EXP and RH perform much better than the other locks. Both
these locks have an exponential backoff, which is why waiting threads will "sample" the lock variable less often. This increases the probability that the thread releasing a lock will manage to acquire it again right away. The iteration time for these two locks indeed looks similar to their uncontested performance when the previous owner is the same processor (see Table A.1). TATAS has no backoff for the waiting threads, which is why a handoff to the same processor is less likely to
occur, especially at high processor counts. Queue-based locks on the other hand
will rarely allow the releasing processor to directly acquire the lock again.
To overcome this problem we altered the microbenchmark to initialize a global
variable last_owner inside the critical section, and force the thread to observe a
new owner before it is allowed to compete for the lock again. A single remaining
thread will be excluded from this requirement in order to run until completion.
Slightly modified traditional microbenchmark is shown below.
shared int iterations, total_threads;
shared volatile int last_owner = -1;
shared volatile int total_finished = 0;
 1: for (i = 0; i < iterations; i++) {
 2:   ACQUIRE(L);
 3:   if (my_tid != last_owner) last_owner = my_tid;
 4:   // some more statistics goes here
 5:   RELEASE(L);
 6:   while ((last_owner == my_tid) &&
 7:          (total_finished < total_threads - 1))
 8:     ; // spin
 9: }
10: atomic_increase(total_finished);
Figure A.3(b) shows the result from the modified microbenchmark (number of
iterations is still equal to 10,000 and RH’s FAIR_FACTOR is one). The iteration
time here is offset by the time it takes to perform the extra work in the critical section. The queue-based CLH and MCS perform the best. This is because only the third phase of their lock-acquire function (see section A.3.3) is performed at lock-handover time. The releasing thread will perform a single store-upgrade cache miss to its flag (its cache already contains a valid copy in shared state) and the next thread will need a single load cache miss for its prev_flag, i.e., the same memory location as the releasing thread's flag. For all the other locks the releasing thread will need to perform a store miss to L (its cache does not contain a valid copy) and the next thread will perform a load cache miss and a store-upgrade miss to L. TATAS is the only lock showing a distinct degradation
caused by increased traffic as more processors are added.
The iteration time for the modified benchmark on a 2-node Sun WildFire is
shown in Figure A.4(a). In this study, we use round-robin scheduling for thread
binding to different cabinets. The RH lock outperforms all other tested locks for
all runs with more than two threads, and is comparable to other locks for one
(no contention) and two threads. In the two-thread case the node handoff ratio is 100 percent, and all rh_acquire calls result in calls to rh_acquire_remote_lock.

[Figure A.4: panel (a) shows iteration time in microseconds and panel (b) the percentage of node handoffs, both versus number of processors, for TATAS, TATAS_EXP, MCS, CLH, and RH with FAIR_FACTOR settings of 1, 2, and 100.]
Figure A.4.: Slightly modified traditional microbenchmark iteration time (a) and locality study (b) for a 2-node Sun WildFire system.
Figure A.4(b) shows the ratio of node handoffs for each lock type, reflecting
how likely it is for a lock to migrate between nodes each time it is acquired. We
ignore the TATAS values for more than 16 processors. The graph clearly shows
the key advantage of the RH lock. The RH lock consistently shows low node
handoff numbers for all the three settings of FAIR_FACTOR.
The simple spin locks also show a node handoff ratio below 50 percent, which
could be expected since local processors can acquire a released lock much faster
than can remote processors. The queue locks are expected to show a node handoff ratio equal to (N/2)/(N − 1), i.e., just over 50 percent for the processor counts used here, since N/2 of the processors reside in the other node and we do not allow the same processor to acquire the lock twice in a row. However, the queue-based locks show unnatural behavior with large variation in the node handoff ratio. Our only explanation for this is pure luck. At 22 processors
the CLH shows a ratio of 23 percent. This also explains the varied performance
in Figure A.4, for example the good CLH performance at 12 and 22 processors.
40
A.6. Performance Evaluation
At 8–10 processors, the node handoff ratio is fairly normal for both queue-based
locks. Here we can see that a critical section guarded by the RH lock takes less
than half the time to execute compared with the same critical section guarded
by any other lock.
It appears that the simplistic regular microbenchmark that we, as well as most other lock studies, use sometimes makes processors in the same node more likely to queue up after each other, making the node handoff ratio substantially lower than expected. We also suspect that RH's job of creating locality is greatly simplified
by this highly regular benchmark.
A.6.3. New Microbenchmark
No real applications have a fixed number of processors pounding on a lock. Instead, they have a fixed number of processors spending most of their time on
noncritical work, including accesses to uncontested locks. They rarely enter the
“hot” critical section. The degree of contention is affected by the ratio of noncritical work to critical work. The unnatural node handover behavior of the
traditional lock benchmark led us to this new benchmark that we think reflects
the expected behavior of a real application better. The pseudocode of the new
benchmark is shown below.
shared int cs_work[MAX_CRITICAL_WORK];
shared int iterations;
 1: for (i = 0; i < iterations; i++) {
 2:   ACQUIRE(L);
 3:   {
 4:     int j;
 5:     for (j = 0; j < critical_work; j++)
 6:       cs_work[j]++;
 7:   }
 8:   RELEASE(L);
 9:   {
10:     int private_work[MAX_NONCRITICAL_WORK];
11:     int j, random_delay;
12:     for (j = 0; j < noncritical_work; j++)
13:       private_work[j]++;
14:     random_delay = random() % noncritical_work;
15:     for (j = 0; j < random_delay; j++)
16:       private_work[j]++;
17:   }
18: }
In the new microbenchmark, the number of processors is kept constant. They
each perform some amount of noncritical work between trying to acquire the lock,
consisting of one static delay (lines 12–13) and one random delay (lines 14–16)
of similar sizes. Initially, the length of the noncritical work is chosen such that
there is insignificant contention for the critical section (lines 3–7) and all lock algorithms perform almost identically. More contention is modeled by increasing the number of elements of a shared vector that are modified before the lock is released.

[Figure A.5: panel (a) shows iteration time in seconds and panel (b) the percentage of node handoffs, both versus critical_work, for TATAS, TATAS_EXP, MCS, CLH, and RH.]
Figure A.5.: New microbenchmark iteration time (a) and locality study (b) for a 2-node Sun WildFire system, 28-processor runs.
Figure A.5(a) shows that the two queue-based locks perform almost identically
for the new benchmark and Figure A.5(b) shows their node handoffs to be close
to the expected values of 50 percent. The number of iterations is 1,000 in this
experiment, and the noncritical_work is equal to 80,000. The FAIR_FACTOR for
the RH lock is equal to one. As the amount of critical work is increased, the time
to perform the critical work gets longer and contention for the lock is intensified.
TATAS's poor contested performance will further add to the time period the lock is held for each iteration. This results in even more contention, very much like the feedback loop of an unstable control system. This clearly demonstrates the danger of using TATAS in applications with some contention.
The TATAS values are measured for a critical_work between 0 and 400 only, because its performance is extremely poor as soon as some contention is present. The
Lock Type    Local Transactions   Global Transactions
TATAS_EXP          1.00                  1.00
MCS                0.49                  0.47
CLH                0.48                  0.46
RH                 0.48                  0.31

Table A.2.: Local and global/remote traffic generated for the new microbenchmark (critical_work = 1500, 28 processors). The performance for TATAS is extremely poor for this setting (see Figure A.5), and is excluded from the table.
simple spin locks still perform unpredictably, which is tied to their unpredictable
node handover. The RH lock performs better the more contention there is, which
can be explained by its decreasing amount of node handover. This is exactly the
behavior we want in a lock: the more contention there is, the better it should
perform.
In Table A.2 we also present the numbers for the traffic that is generated on the
machine for our new microbenchmark. The numbers are normalized against the
TATAS_EXP lock, which generates 15.145 million local transactions and 8.878 million global/remote transactions. The queue-based locks perform almost the same; MCS generates slightly more transactions than CLH. The RH lock performs best: it generates about the same amount of local transactions as the queue-based locks, but only 2.777 million global transactions, which is more than three times better than TATAS_EXP. For this setup, the execution time
is improved 1.46–1.58 times, while the global traffic is significantly decreased.
A.6.4. Application Performance
In this section we evaluate the effectiveness of our new locking mechanism using
the real SPLASH-2 applications [111]. Table A.3 shows SPLASH-2 applications
with the corresponding problem sizes and lock statistics. Total locks is the number
of allocated locks, and Lock calls is the total number of acquire-release lock
operations during the execution. We chose to further examine only applications
with more than 10,000 lock calls, namely Barnes, Cholesky, FMM,
Radiosity, Raytrace, Volrend, and Water-Nsq. For each application, we vary the
synchronization algorithm used and measure the execution time on a 2-node Sun
WildFire machine. Programs are compiled with GNU’s gcc-3.0.4 (optimization
level -O1). Table A.4 presents the execution times in seconds for 28-processor
runs for five different locking schemes: TATAS, TATAS_EXP, MCS, CLH, and
Program     Problem Size                          Total Locks   Lock Calls
Barnes      29k particles                                 130       69,193
Cholesky    tk29.O                                         67       74,284
FFT         1M points                                       1           32
FMM         32k particles                               2,052       80,528
LU-c        1024×1024 matrix, 16×16 blocks                  1           32
LU-nc       1024×1024 matrix, 16×16 blocks                  1           32
Ocean-c     514×514                                         6        6,304
Ocean-nc    258×258                                         6        6,656
Radiosity   room, -ae 5000.0 -en 0.050 -bf 0.10         3,975      295,627
Radix       4M integers, radix 1024                         1           32
Raytrace    car                                            35      366,450
Volrend     head                                           67       38,456
Water-Nsq   2197 molecules                              2,206      112,415
Water-Sp    2197 molecules                                222          510

Table A.3.: The SPLASH-2 programs. Only emphasized programs are studied further. Lock statistics are obtained for 32-processor runs.
Program     TATAS          TATAS_EXP      MCS            CLH            RH
Barnes      1.54 (0.052)   1.43 (0.010)   1.83 (0.153)   1.54 (0.099)   1.54 (0.137)
Cholesky    2.31 (0.072)   2.04 (0.043)   2.09 (0.027)   2.25 (0.107)   2.23 (0.061)
FMM         4.84 (0.333)   4.19 (0.193)   4.33 (0.057)   4.46 (0.067)   4.27 (0.134)
Radiosity   1.66 (0.059)   1.75 (0.067)   N/A            N/A            1.44 (0.068)
Raytrace    2.90 (0.914)   1.71 (0.183)   1.41 (0.284)   1.38 (0.319)   0.62 (0.011)
Volrend     1.70 (0.031)   1.57 (0.096)   1.48 (0.278)   1.75 (0.157)   1.61 (0.088)
Water-Nsq   2.37 (0.028)   2.25 (0.057)   2.20 (0.035)   2.45 (0.031)   2.21 (0.011)
Average     2.47 (0.212)   2.13 (0.093)   2.22 (0.139)   2.31 (0.130)   1.99 (0.073)

Table A.4.: Application performance for five different synchronization algorithms for 28-processor runs, 14 threads per WildFire node. Execution time is given in seconds and the variance is shown in parentheses.
[Figure A.6: normalized speedup bars for TATAS, TATAS_EXP, MCS, CLH, and RH for Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, and the average.]
Figure A.6.: Normalized speedup for 28-processor runs on a 2-node Sun WildFire.
our RH lock.² (FAIR_FACTOR is equal to one in this experiment.) The variance is given in parentheses in the same table. On average, the execution time is improved 7–24 percent with the RH lock compared with the other locks.
Normalized speedup for all lock algorithms is shown in Figure A.6. For Barnes,
the MCS lock is much worse than the ordinary TATAS_EXP, and for Volrend and
Water-Nsq that is also the case for the CLH lock. On average, the queue-based locks perform about the same as TATAS with exponential backoff. The RH
lock demonstrates quite stable performance for both uncontested and contested
applications.
We chose to further investigate only Raytrace. This application renders a
3D scene using ray tracing and is one of the most unpredictable SPLASH-2
programs [25]. Detailed analysis of Raytrace is outside the scope of this paper
(see [94, 111, 25] for more details). In this application, locks are used to protect
task queues and for some global variables that track statistics for the program.
The work between synchronization points is usually quite large. Execution time
given in seconds for five different synchronization algorithms, for single-, 28-, and
² The unmodified version of Radiosity will not execute correctly with queue-based locks. We did not investigate this any further.
[Figure A.7: Raytrace speedup versus number of processors for TATAS, TATAS_EXP, MCS, CLH, and RH.]
Figure A.7.: Speedup for Raytrace.
30-processor runs, is shown below (variance is presented in parentheses).
Lock Type    1 CPU   28 CPUs         30 CPUs
TATAS        5.02    2.90 (0.914)    2.70 (0.445)
TATAS_EXP    5.26    1.71 (0.183)    2.05 (0.257)
MCS          5.05    1.41 (0.284)    > 200 s
CLH          5.30    1.38 (0.319)    > 200 s
RH           5.08    0.62 (0.011)    0.68 (0.002)
Speedup for Raytrace is shown in Figure A.7. There is a decrease in performance for all other locks above 12 processors, while the RH lock continues to scale all the way up to 28 processors. The RH lock outperforms all other locks by a factor of 2.23–4.68 for 28-processor runs. Also, our NUCA-aware lock demonstrates the lowest measurement variance, only 0.011, compared to the second best value of 0.183 for TATAS_EXP. In the table above, we also demonstrate that the MCS and CLH locks are practically unusable for 30-processor runs. They are extremely sensitive to small disturbances produced by the operating system itself. This
Program     Local Transactions   Global Transactions
Barnes            2.339                1.779
Cholesky         14.323                4.553
FMM               6.771                3.227
Radiosity         4.142                1.876
Raytrace          7.751                2.882
Volrend           5.172                1.298
Water-Nsq         2.840                1.157
Total            43.338               16.772

Table A.5.: Local and global/remote traffic for TATAS_EXP on a 2-node Sun WildFire, 28-processor runs. The numbers are given in the millions of transactions.
Program     TATAS         TATAS_EXP     MCS           CLH           RH
Barnes      1.01 / 0.67   1.00 / 1.00   1.01 / 0.66   1.14 / 0.78   1.02 / 0.60
Cholesky    0.99 / 1.00   1.00 / 1.00   0.96 / 0.87   0.97 / 0.90   0.95 / 0.87
FMM         1.09 / 1.17   1.00 / 1.00   0.99 / 0.83   0.97 / 0.80   1.00 / 0.83
Radiosity   1.06 / 1.08   1.00 / 1.00   N/A           N/A           1.00 / 0.85
Raytrace    1.15 / 1.24   1.00 / 1.00   0.91 / 0.84   1.04 / 0.78   0.86 / 0.49
Volrend     1.02 / 1.07   1.00 / 1.00   1.02 / 1.05   1.04 / 1.17   1.01 / 1.03
Water-Nsq   1.01 / 1.03   1.00 / 1.00   1.00 / 1.04   1.07 / 1.10   1.03 / 1.02
Average     1.05 / 1.04   1.00 / 1.00   0.98 / 0.88   1.04 / 0.92   0.98 / 0.81

Table A.6.: Normalized traffic (local / global) for all synchronization algorithms for 28-processor runs, 14 threads per node.
unwanted behavior of the queue-based locks has been studied further by Scott on
the same architecture and on the Sun Enterprise 10000 multiprocessor [93, 91].
Table A.5 shows the traffic generated by all applications for the TATAS_EXP synchronization algorithm. The normalized traffic numbers for all other synchronization algorithms are shown in Table A.6. On average, the global traffic is decreased 8–28 percent by the RH lock, averaged over the seven applications
studied.
A.7. Conclusions
Three properties determine the average time between two processes/threads entering the contested critical section: lock handover time, traffic generated by the
lock, and the data locality created by the lock algorithm. This paper demonstrates that the first come, first served nature of queue-based locks makes them
less suitable for architectures with a nonuniform cache access time (NUCA), such
as NUMAs built from a few large nodes or chip multiprocessors. In contrast, the
simpler test&set locks give an unfair advantage to neighboring processors when
a lock is released, which will create a fast lock handover time as well as more
locality for the data accessed in the critical region.
We also propose the new RH lock, which exploits NUCA architectures by creating controlled unfairness and much reduced traffic compared with the
simple test&set locks. The RH lock algorithm minimizes the global traffic generated at lock handover by making sure that only one thread per node tries to
retrieve a lock which is currently not owned by the same node. Also, the RH lock
maximizes the node locality of NUCA architectures by handing over the lock to
another process/thread in the same node. This will not only cut down on the
lock handover time, but will also create locality in the critical section work, since
its data structures will already reside in the node. A critical section guarded by
the RH lock is shown to take less than half the time to execute compared with
the same critical section guarded by any other lock. We also demonstrate that
one of the most commonly used test&set locks shows extremely unstable performance for a certain microbenchmark. We highly recommend avoiding this type of synchronization algorithm in large-scale parallel applications, even when the risk of lock contention appears minimal.
Finally, we investigate the effectiveness of our new lock on a set of real SPLASH-2 applications. For example, execution time for Raytrace with 28 processors was
improved between 2.23 and 4.68 times, while the global traffic was dramatically
decreased by using the RH locks instead of any other tested locks. The average
execution time was improved 7–24 percent while the global traffic was decreased
8–28 percent for an average over the seven applications studied.
Acknowledgments
We thank Michael L. Scott and William N. Scherer III, Department of Computer Systems, University of Rochester, for providing us with the source code for
many of the tested locks. We would also like to thank Bengt Eliasson, Sverker
Holmgren, Anders Landin, Henrik Löf, Fredrik Strömberg and the Department
of Scientific Computing at Uppsala University for the use of their Sun WildFire machine. We are grateful to Karin Hagersten for her careful review of the
manuscript. Finally, we would like to thank anonymous reviewers for their comments. This work is supported in part by Sun Microsystems, Inc., and the Parallel
and Scientific Computing Institute (PSCI — ψ), Sweden.
Paper B
B. Hierarchical Backoff Locks for
Nonuniform Communication
Architectures
Zoran Radović and Erik Hagersten
Department of Information Technology, Uppsala University
P.O. Box 337, SE-751 05 Uppsala, Sweden
E-mail: {zoran.radovic,erik.hagersten}@it.uu.se
In Proceedings of the Ninth International Symposium on High Performance Computer Architecture (HPCA-9), Anaheim, California, USA, February 2003.
1530-0897/03 $17.00 © 2003 IEEE/ACM
Abstract
This paper identifies node affinity as an important property for scalable general-purpose locks. Nonuniform communication architectures (NUCAs), for example
CC-NUMAs built from a few large nodes or from chip multiprocessors (CMPs),
have a lower penalty for reading data from a neighbor’s cache than from a remote
cache. Lock implementations that encourage handing over locks to neighbors will
improve the lock handover time, as well as the access to the critical data guarded
by the lock, but will also be vulnerable to starvation.
We propose a set of simple software-based hierarchical backoff locks (HBO)
that create node affinity in NUCAs. A solution for lowering the risk of starvation
is also suggested. The HBO locks are compared with other software-based lock
implementations using simple benchmarks, and are shown to be very competitive
for uncontested locks while being more than twice as fast for contended locks. An
application study also demonstrates superior performance for applications with
high lock contention and competitive performance for other programs.
B.1. Introduction
The scalability of a shared-memory application is often limited by contention
for some critical section, such as modification of shared data guarded by mutual exclusion locks. The simplest, and most widely used, test-and-test&set lock
implementation suffers from poor performance at high contention: the more contested the critical section gets, the lower the rate at which new threads can
enter it. This very inappropriate behavior is mostly due to the vast amount of
traffic generated at each lock handover event.
Some alternative lock implementations’ traffic is less dependent on the number of contenders, which is why their lock hand-over rates do not decrease as
significantly at high contention. The simplest way to limit the contention traffic
is to apply some backoff strategy that causes the threads to access the common
lock variable less frequently the longer they have waited. The more advanced
queue-based locks instead maintain a first come, first served order between the
contending threads. Each contender will only spin on the dedicated flag set at
its predecessor’s release of the lock, and contenders ordered after it will not be
affected [23, 66, 68]. However, the complicated software queuing locks are less
efficient for uncontested locks, which has led to the creation of even more complicated adaptive hybrid proposals in the quest for a general-purpose solution
[64].
Shared-memory architectures with a nonuniform memory access time to the
shared memory (CC-NUMAs) are gaining popularity. Most systems that form
NUMA architectures also have the characteristic of a nonuniform communication
architecture (NUCA), in which the access time from a processor to other processors’ caches varies greatly depending on their placement. In node-based NUMA
systems in particular, processors have much shorter access times to other caches
in their group than to the rest of the caches. Recently, technology trends have
made it attractive to run more than one thread per chip, using either the chip
multiprocessor (CMP) and/or the simultaneous multithreading (SMT) approach.
Large servers, built from several such chips, can therefore be expected to form
NUCAs, since collocated threads will most likely share an on-chip cache at some
level [8].
Due to the popularity of NUMA systems, optimizations directed to such architectures have attracted much attention in the past. For example, optimizations
involving the migration and replication of data in NUMAs have demonstrated
a great performance improvement in many applications [40, 58, 72]. In addition, since many of today’s applications exhibit a large fraction of cache-to-cache
misses [9], optimizations which consider the NUCA nature of a system may also
lead to significant performance enhancements.
On a NUCA, it is attractive from a performance point of view to hand the
lock to a waiting neighbor, rather than to the thread that has waited the longest
time. Favoring the neighbor will improve the lock handover time, as well as the
access to the critical data that most likely reside in its cache. However, in such
a scheme attention must also be given to starvation avoidance. We have noticed
that the existing test-and-test&set locks already give some unfair advantage to
the processor neighbors in the NUCA node where the lock last was held. This will
create more node locality and will partly make up for the more traffic generated
by the test&set-based locks.
The goal of this work is to create a new set of locks that efficiently exploit communication locality in a NUCA while minimizing the potential risk of starvation.
In order to be generally usable, such a lock should scale to a large number of
nodes, handle contended locks well, have reasonable memory space requirements,
and introduce minimal overhead for the uncontested locks.
The remainder of this paper is organized as follows. Section B.2 gives an introduction to several NUCAs. Background and related work is presented in section B.3. Our new NUCA-aware locks are presented in section B.4. In section B.5
we present performance results obtained on a 2-node Sun WildFire machine. A
simple fairness and sensitivity study is performed in section B.6, and we conclude
in section B.7.
B.2. Nonuniform Communication Architectures
Many large-scale shared-memory architectures have nonuniform access time to
the shared memory. In order to make a key difference, the nonuniformity should
be substantial—let’s say at least a factor of two between best and worst unloaded
latency. Most NUMA architectures also have a substantial difference in latency
for cache-to-cache transfer—a nonuniform communication architecture. A NUCA
is an architecture in which the unloaded latency for a processor accessing data recently modified by another processor differs at least by a factor of two, depending
on where that processor is located.
DASH was the first NUCA machine [59]. Each DASH node consists of four
processors connected by a snooping bus. A cache-to-cache transfer from a cache
in a remote node is 4.5 times slower than a transfer from a cache in the same
node. We call this the NUCA ratio. Sequent’s NUMA-Q has a similar topology,
but its NUCA ratio is closer to 10 [65]. Both DASH and NUMA-Q have a remote
access cache (RAC), located in each node, that simplifies the implementation of
the node-local cache-to-cache transfer.
NUCA Example          NUCA Ratio
Stanford DASH         ≈ 4.5
Sequent NUMA-Q        ≈ 10
Sun WildFire          ≈ 6
Compaq DS-320         ≈ 3.5
Future: CMP, SMT      ≈ 6–10
Sun’s WildFire system can have up to four nodes with up to 28 processors each,
totaling 112 processors [40]. Parts of each node’s memory can be turned into an
RAC using a technique called coherent memory replication (CMR). Accesses to
data allocated to a CMR cache have a NUCA ratio of about six, while accesses
to other data only have a NUCA ratio of less than two.
Compaq’s DS-320 (which was also code-named WildFire) can connect up to
four nodes, each with four processors sharing a common DTAG and directory
controller [32]. Its NUCA ratio is roughly 3.5.
Future microprocessors can be expected to run many more threads on a chip
through a combination of CMP and SMT technology. This can already be seen
in the Pentium 4’s Hyperthreading and the IBM Power4’s dual CMP processors
on a chip. The Piranha CMP proposal expects eight CMP threads to run on
each chip [8]. Larger systems, built from many such CMPs, are expected to have
a NUCA ratio of between six and ten depending on the technology chosen.
It is possible that several levels of non-uniformity will be present in future
large-scale servers. A simple example of this would be one of today’s NUMA architectures populated with CMP processors instead of traditional single-threaded
processors. This would create a hierarchical NUMA and NUCA property of the
system.
Not all architectures are NUMAs or NUCAs. The recent SunFire 15k architecture can have up to 18 nodes, each with four processors, memory, and directory
controllers [20]. The nodes are connected by a fast backplane. It has a flavor of
both NUMA and NUCA. However, both of its NUMA and NUCA ratios are below
two. The SGI Origin 2000 is a NUMA architecture with a NUMA ratio of around
three for reasonably sized systems [58]. However, it does not efficiently support
cache-to-cache transfers between adjacent processors and also has a NUCA ratio
below two.
B.3. Background and Related Work
Ideally, synchronization primitives should provide good performance under both
high and low contention without requiring substantial programmer effort. Mutual
exclusion (lock-unlock) operations can be implemented in a variety of different
ways, including atomic memory primitives, nonatomic memory primitives (load-linked/store-conditional), and explicit hardware lock-unlock primitives (CRAY's Xmp lock registers, DASH's lock-unlock operations on directory entries, or Goodman's queue-on-lock-bit). In this paper, we will concentrate on implementing locks entirely in software using the atomic memory primitives that are available
in the majority of modern processors. The software-only locking primitives we
directly compare are the following:
1. TATAS: traditional test-and-test&set lock
2. TATAS_EXP: TATAS with exponential backoff
3. MCS: queue-based locks of Mellor-Crummey and Scott [68]
4. CLH: queue-based locks of Craig, Landin, and Hagersten [23, 66]
5. RH: our proof-of-concept NUCA-aware lock [78]
6. HBO: our new NUCA-aware spin lock with hierarchical backoff (see section B.4.1)
7. HBO_GT: HBO with global traffic throttling (see section B.4.2)
8. HBO_GT_SD: HBO_GT with starvation detection (see section B.4.3)
We also present a short introduction to alternative synchronization approaches;
reactive synchronization and several hardware locking schemes.
Atomic Primitives. In this paper we make reference to three atomic operations: (1) tas(address) atomically writes a nonzero value to the address
memory location and returns its original contents; a nonzero value for the lock
represents the locked condition, while a zero value means that the lock is free;
(2) swap(address, value) atomically writes a value to the address memory location
and returns its original contents; (3) cas(address, expected_value, new_value)
atomically checks the contents of a memory location address to see if it matches an expected_value and, if so, replaces it with a new_value; in either case, the original contents are returned.
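As a rough illustration only, the three primitives could be expressed in C on top of the GCC __sync builtins as shown below; the builtin-based mapping is an assumption of this sketch (and __sync_lock_test_and_set is only guaranteed to perform a full exchange on some targets), not the implementation used in this paper.

/* A minimal sketch of the three atomic primitives, assuming GCC __sync builtins. */
static inline unsigned long tas(volatile unsigned long *address)
{
    /* atomically store a nonzero value and return the previous contents */
    return __sync_lock_test_and_set(address, 1UL);
}

static inline unsigned long swap(volatile unsigned long *address,
                                 unsigned long value)
{
    /* atomic exchange: store value, return the previous contents */
    return __sync_lock_test_and_set(address, value);
}

static inline unsigned long cas(volatile unsigned long *address,
                                unsigned long expected_value,
                                unsigned long new_value)
{
    /* store new_value only if the current contents equal expected_value;
       the previous contents are returned in either case */
    return __sync_val_compare_and_swap(address, expected_value, new_value);
}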
Simple Locking Algorithms. Two very commonly used busy-wait algorithms are TATAS and TATAS_EXP. The contention produced by the traditional
test&set-based spin locks can be reduced by polling (busy-wait code) with ordinary load operations to avoid generating expensive stores to potentially shared
locations (TATAS algorithm). Furthermore, the burst of refill traffic whenever a
lock is released can be reduced by using the Ethernet-style exponential backoff
algorithm in which, after a failure to obtain the lock, a requester waits for successively longer periods of time before trying to issue another lock operation [2, 68].
The delay between tas attempts should not be too long; otherwise, processors
might remain idle even when the lock becomes free. This is the idea behind the
TATAS_EXP lock, and one typical implementation is shown below.
typedef volatile unsigned long tatas_lock;

void tatas_exp_acquire(tatas_lock *L)
{ if (tas(L)) tatas_exp_acquire_slowpath(L); }

void tatas_exp_acquire_slowpath(tatas_lock *L)
{
  int b = BACKOFF_BASE, i;
  do {
    for (i = b; i; i--) ; // delay
    b = min(b * BACKOFF_FACTOR, CAP);
    if (*L)
      continue;
  } while (tas(L));
}

void tatas_exp_release(tatas_lock *L)
{ *L = 0; }
In many implementations, acquire and release functions are in-lined, while the
acquire_slowpath routine is linked to the binary code. Backoff parameters must
be tuned by trial and error for each individual architecture. The storage cost for
TATAS locks is low and does not increase with the number of processors.
Software Queuing Locks. Even with exponential backoff, TATAS locks still
induce significant traffic. Software queuing locks may eliminate this problem by
letting each process spin on a different local-memory location. Many of the software queuing locks are inspired by the first distributed, hardware queue-based locking scheme, proposed for the cache controllers of the Wisconsin Multicube in the late 1980s [34].
The acquire function of the software-based queue locks performs three basic
phases: (1) a flag variable in a shared address space is initialized to the value
BUSY; (2) the content at the lock location in memory is swapped with the address
value pointing to the flag; (3) the thread spins until the prev_flag memory
location, a pointer which was returned by the swap, contains the value FREE. The
release function of the queue-based locks writes a FREE value to the flag location.
Numerous variations of software queuing lock implementations are known [2, 23,
36, 66, 68, 91].
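As a rough C sketch of these three phases (a CLH-style illustration of the convention just described, not the exact code of any of the cited implementations; node recycling and memory barriers are omitted, and the GCC exchange builtin stands in for the swap primitive):

#define FREE 0
#define BUSY 1

typedef struct qnode { volatile unsigned long state; } qnode_t;
typedef struct { qnode_t *volatile tail; } queue_lock_t;  /* tail initially points to a FREE node */

/* 'my' is the thread's own queue node; the predecessor node returned in
   '*prev' must be recycled as the thread's node for its next acquire. */
void queue_acquire(queue_lock_t *L, qnode_t *my, qnode_t **prev)
{
    my->state = BUSY;                                /* phase (1) */
    *prev = __sync_lock_test_and_set(&L->tail, my);  /* phase (2): swap */
    while ((*prev)->state != FREE) ;                 /* phase (3): spin on predecessor */
}

void queue_release(qnode_t *my)
{
    my->state = FREE;  /* the successor, if any, observes FREE and proceeds */
}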
A unique feature of software queuing locks found in many implementations is
explicit starvation avoidance and maximal fairness; in other words, first come,
first served order of lock-acquire requests is guaranteed. In addition, software
queue-based locks provide reasonable latency in the absence of contention, provide
good scalability for highly contended locks on many architectures, and can easily
be completely in-lined into the application code.
The RH Lock. The RH lock is our proof-of-concept NUCA-aware spin lock
that supports two nodes [78]. The goal for the RH lock was to create a lock
that minimizes the global traffic generated at lock handover and maximizes the
node locality of NUCAs. In the RH scheme, every node contains a copy of a
lock. Thus, the lock storage cost is twice that of simple locking algorithms.
Allocating and physically placing memory in different nodes may be a difficult or even impossible task on many machines. That is why the RH lock code is not
particularly portable.
Initially, the lock is logically placed in node 0 (the lock value is marked as FREE,
meaning that threads from both the local and the remote node are allowed to acquire
the lock). The other node (node 1) will observe a REMOTE value if it acquires its
local copy of the lock for the first time. The first thread that observes the REMOTE
tag is the “node winner” and will continue to spin remotely with a larger backoff
until the global lock is obtained. The RH scheme can exclusively hand over the
lock to another thread in the same node by marking the lock value with “local
free” tag L_FREE. This will not only cut down on the lock handover time, but will
also create more locality in the critical section work, since its data structures are
probably already in the node. The swap and cas primitives are used, and the
implementation is vulnerable to starvation.
Alternative Approaches. The fact that some synchronization algorithms
perform well under low-contention periods and others under high-contention periods is the basic idea behind the reactive synchronization presented by Lim
and Agarwal [64]. Reactive algorithms will dynamically switch among several
software lock implementations. Typically, spin locks (TATAS_EXP) are used
during the low-contention phase, and queue-based locks (MCS) are used during
the high-contention phase [49]. Reactive algorithms demonstrate modest performance gains.
The first to propose a distributed, queue-based locking scheme in hardware
were Goodman, Vernon, and Woest [34]. They introduced the queue-on-lock-bit
primitive (QOLB, originally called QOSB). In this scheme, a distributed, linked
list of nodes waiting on a lock is maintained entirely in hardware (pointers in the
processor caches are used to maintain the list of the waiting processors), and the
releaser grants the lock to the first waiting node without affecting others. Effective collocation (allocation of the protected data in the same cache line as the
lock) is possible; thus, this hardware scheme may reduce the lock hand-over time
as well as the interference of lock traffic with data access and coherence traffic.
The original QOLB proposal demonstrates good performance [49], but it requires
complex protocol support, new instructions, and recompilation of applications.
Another purely hardware-based mechanism called Implicit QOLB uses speculation and delays to transparently convert software locks to provide a hardware
queued-lock behavior without requiring any software support or new instructions
[81]. The load-linked/store-conditional instructions are used to demonstrate a
possible implementation.
Stanford DASH uses directories to indicate which processors are spinning on the
lock [59]. When the lock is released, one of the waiting nodes is chosen at random
and is granted the lock. The grant request invalidates only that node’s caches and
allows one processor in that node to acquire the lock with a local operation. This
scheme lowers both the traffic and the latency involved in releasing a processor
waiting on a lock. A time-out mechanism on the lock grant allows the grant to be
sent to another node if the spinning process has been swapped out or migrated.
B.4. Hierarchical Backoff Locks
In this section we describe a set of new NUCA-aware spin locks with hierarchical
backoffs (HBO and HBO_GT) that exploit communication locality and reduce
global traffic for contended locks while adding less overhead for uncontested locks
than any of the software queue-based lock implementations. We also suggest one
solution that lowers the risk of starvation (HBO_GT_SD). The storage cost for
all proposed locks is low (a single variable allocated in any one of the NUCA
nodes suffices) and does not increase with the number of processors. HBO_GT
and HBO_GT_SD also use one extra variable per NUCA node.
What do we need to make this possible? All three proposals only use one
common atomic operation: cas. In addition, per-thread/process node_id information is needed.
B.4.1. The HBO Lock
The goal of the HBO lock is similar to that of the RH lock, which is that the
algorithm should exploit communication locality and reduce global traffic for
contended locks. In addition, in this algorithm, we pay attention to adding as
little overhead as possible for uncontested locks. Ideally, at low contention or in
the absence of contention the algorithm should not add any overhead; it should
simply perform an atomic operation directly on the lock variable.
The idea behind the HBO lock is really simple: when a lock is acquired, the
node_id of the thread/process is cas-ed into the lock location. In other words,
if the lock-value is in the FREE state, it is atomically changed into the node_id,
otherwise it remains the same. If a busy lock is held by someone in the same
node, the cas will return the thread’s own node_id, and the thread will start
spinning with a small backoff constant (the same, for example, as the typical
TATAS_EXP configuration). If the cas returns a different node_id, the thread
will use a larger backoff constant. In this manner, a thread that is executing in the same node in which a lock has already been obtained will be more likely than contending threads executing in other nodes to subsequently acquire the lock when it is freed. Decreased migration of the lock (and the shared
critical-section data structures) from node to node is obtained, and the overall
performance is enhanced. This scheme can be expanded in a hierarchical way,
using more than two sets of constants, for a hierarchical NUCA. Figure B.1 shows
code for the HBO lock. The emphasized lines (those that reference is_spinning) are related to the HBO_GT lock and should be ignored for the HBO proposal.
Note that the “critical path” of the HBO lock (lines 6–9) does not add any
significant overhead compared with the simple spin locks (assuming that the
node_id information is easily accessible, e.g., it is stored in a thread-private
register). That is important for the performance of the lock in the absence of
contention.
As in the TATAS_EXP implementation (see section B.3), the acquire and release functions of HBO locks can be in-lined, while the acquire_slowpath routines
are linked to the binary code. Backoff parameters must also be tuned by trial
and error for each individual architecture.
B.4.2. The HBO_GT Lock
When multiple processors in the same node are executing in the slow spin loop
(HBO lock, lines 37–52, Figure B.1), the cas operations of each of the spinning processors create global coherence traffic through the network. The purpose of the HBO_GT (global traffic throttling) lock is to limit the number of processors that are spinning in the same node and attempting to gain a lock currently owned by another node, thereby reducing global traffic on the network.
Before acquiring a lock, the thread reads the per-node memory location (not
necessarily allocated in the local memory) is_spinning, compares its content
with the lock address, and keeps spinning for as long as they are equal (lines 5 and 56, Figure B.1). Then, the thread performs the atomic cas operation (lines 6 and
57). If the cas returns a node_id different from the thread’s own id, the thread
will store the lock address L in the node’s is_spinning and start to spin for
the lock with a fairly large backoff constant. This operation may thus prevent
others in the same node from performing lock-acquisition transactions to the
lock address (the cas operations) that might otherwise create global coherence
traffic on the network. As soon as the thread has acquired the lock, it writes the
“dummy value” to the node’s is_spinning, which allows any waiting neighbor
to start spinning. By using this algorithm, there is usually only one thread per
node (or a small number of threads) that is performing remote spinning.
B.4.3. The HBO_GT_SD Lock
Many of the queue-based locks guarantee starvation freedom. Even though starvation is usually unlikely [25], with multiple threads competing for a lock, it
is possible that some threads may be granted the lock repeatedly while others
may not and may become starved. Since both HBO and HBO_GT (and all
other simple spin-locking algorithms) are vulnerable to starvation, we need a solution that at least lowers the risk of potential starvation, which is especially
important for HBO locks because of their “nonuniform” nature in the locking
algorithm. For example, a count can be maintained of the number of times a
lock-request has been denied and, after a certain threshold, a thread’s priority
may be increased (a thread can start spinning without any backoff until the lock
is obtained). In addition to this simple “thread-centric” solution, HBO_GT_SD
provides a “node-centric” mechanism, described in further detail below, which
lowers the risk of node starvation. Explicit thread-centric starvation avoidance
typedef volatile unsigned long hbo_lock;

01 void hbo_acquire(hbo_lock *L)
02 {
03   unsigned long tmp;
04
05   while (L == is_spinning[my_node_id]) ; // spin
06   tmp = cas(L, FREE, my_node_id);
07   if (tmp == FREE)
08     return; // lock was free, and is now locked
09   hbo_acquire_slowpath(L, tmp);
10 }

11 void backoff(int *b, int cap)
12 {
13   int i;
14   for (i = *b; i; i--) ; // delay
15   *b = min(*b * BACKOFF_FACTOR, cap);
16 }

17 void hbo_acquire_slowpath(hbo_lock *L,
18                           unsigned long tmp)
19 {
20   int b;
21
22 start:
23
24   if (tmp == my_node_id) {        // local lock
25     b = BACKOFF_BASE;
26     while (1) {
27       backoff(&b, BACKOFF_CAP);
28       tmp = cas(L, FREE, my_node_id);
29       if (tmp == FREE)
30         return;
31       if (tmp != my_node_id) {
32         backoff(&b, BACKOFF_CAP);
33         goto restart;
34       }
35     }
36   }
37   else {                          // remote lock
38     b = REMOTE_BACKOFF_BASE;
39     is_spinning[my_node_id] = L;
40     while (1) {
41       backoff(&b, REMOTE_BACKOFF_CAP);
42       tmp = cas(L, FREE, my_node_id);
43       if (tmp == FREE) {
44         is_spinning[my_node_id] = DUMMY;
45         return;
46       }
47       if (tmp == my_node_id) {
48         is_spinning[my_node_id] = DUMMY;
49         goto restart;
50       }
51     }
52   }
53
54 restart:
55
56   while (L == is_spinning[my_node_id]) ; // spin
57   tmp = cas(L, FREE, my_node_id);
58   if (tmp == FREE)
59     return;
60   goto start;
61 }

62 void hbo_release(hbo_lock *L)
63 { *L = FREE; }
Figure B.1.: Acquire and release code for HBO and HBO_GT locks. Emphasized lines (those that reference is_spinning) are related to the HBO_GT implementation.
43       if (tmp == FREE) {
44         // release the threads from our node
45         is_spinning[my_node_id] = DUMMY;
46         // release the threads from stopped nodes, if any
47         if (stopped_nodes > 0)
48           is_spinning[for each stopped_node] = DUMMY;
49         return;
50       }
51       if (tmp == my_node_id) {
52         is_spinning[my_node_id] = DUMMY;
53         if (stopped_nodes > 0)
54           is_spinning[for each stopped_node] = DUMMY;
55         goto restart;
56       }
57       if (tmp != my_node_id) {
58         // lock is still in some remote node
59         get_angry++;
60         if (get_angry == GET_ANGRY_LIMIT) {
61           stopped_node_id[stopped_nodes++] = tmp;
62           is_spinning[tmp] = L;
63         }
64       }
Figure B.2.: Part of the HBO_GT_SD lock’s acquire function. Lines 43–50 from
the HBO_GT algorithm are replaced with the code in this figure.
can be implemented with alternative approaches (reactive synchronization [64],
see section B.3), at the expense of additional complexity in the lock algorithm.
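To illustrate the simple thread-centric counter mentioned above, the HBO slow path of Figure B.1 could, for example, be augmented roughly as follows; denied and PRIORITY_THRESHOLD are hypothetical names, and this is not the mechanism evaluated in this paper.

/* Sketch only: thread-private counter of denied attempts, using the cas and
   backoff helpers and the constants from Figure B.1. */
int denied = 0;            /* thread-private */
int b = BACKOFF_BASE;

while (1) {
    unsigned long tmp = cas(L, FREE, my_node_id);
    if (tmp == FREE)
        break;             /* lock acquired */
    if (++denied < PRIORITY_THRESHOLD)
        backoff(&b, tmp == my_node_id ? BACKOFF_CAP : REMOTE_BACKOFF_CAP);
    /* otherwise the threshold has been reached: retry without any backoff
       until the lock is obtained */
}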
The HBO_GT_SD lock is based on the HBO_GT proposal. The idea behind
the node-centric algorithm, which lowers the risk of node starvation, is the following: after a “winning” thread (or a small number of threads) has tried several
times to acquire a remote lock owned by another node, it gets “angry.” An angry
node will take two measures to get the lock more quickly: (1) it will spin more
frequently, and (2) it will set the is_spinning for the other nodes to the lock
address and thus prevent more threads in those nodes from trying to acquire the
lock.
The details of this algorithm are shown in Figure B.2 (lines 43–50 from Figure B.1 are replaced with the code from Figure B.2). Note that the initialization
of variables get_angry and stopped_nodes is excluded from the code example.
B.5. Performance Evaluation
Most of the experiments in this paper are performed on a Sun Enterprise E6000
SMP [96]. The server has 16 UltraSPARC II (250 MHz) processors and 4 Gbyte
uniformly shared memory with an access time of 330 ns (lmbench latency [67])
and a total bandwidth of 2.7 Gbyte/s. Each processor has a 16 kbyte on-chip
instruction cache, a 16 kbyte on-chip data cache, and a 4 Mbyte second-level
off-chip data cache.
The hardware DSM numbers have been measured on a 2-node Sun WildFire
built from two E6000 nodes connected through a hardware-coherent interface with
a raw bandwidth of 800 Mbyte/s in each direction [40, 72].¹ The Sun WildFire
access time to local memory is the same as above, 330 ns, while accessing data
located in the other E6000 node takes about 1700 ns (lmbench latency). Accesses
to data allocated in a CMR cache have a NUCA ratio of about six, while accesses
to other data only have a minor latency difference between node-local and remote
cache-to-cache transfers. The E6000 and the WildFire DSM system both run a
slightly modified version of the Solaris 2.6 operating system.
We have implemented the traditional TATAS lock and the RH lock using the
tas, swap, and cas operations available in the Sparc V9 instruction set. All HBO
locks are implemented with only a cas operation. The code for TATAS_EXP,
CLH, and MCS lock is written by Scott and Scherer [93], and the entire experimentation framework is compiled with GNU’s gcc-3.2, optimization level -O3.
The TATAS_EXP lock was previously tuned for a Sun Enterprise E6000 machine by Scott and Scherer [93]. We use identical values in our experiments.
By using gcc's static __inline__ construct, we explicitly in-line TATAS,
CLH, and MCS locks. All other locks have a “slowpath” routine called from the
corresponding in-line part of the acquire function. Release functions for all locks
are in-lined. Machines used for tests were otherwise unloaded.
B.5.1. Uncontested Performance
One important design goal for locks is a low latency acquisition of a free lock [25].
In other words, if a lock is free and no other processors are trying to acquire it at
the same time, the processor should be able to acquire it as quickly as possible.
This is especially important for applications with little or no contention for the
locks, which fortunately is a quite common case.
In this section we obtain an estimate of lock overhead in the absence of contention for three common scenarios. We evaluate the cost of the acquire-release
operation (1) if the same processor as the previous owner is the owner of the
lock (lock is in its cache); (2) if the lock is in the same node but the previous
owner is not the current processor (lock is in the neighbor’s cache); and (3) if the
lock was owned by a remote node (lock is in the cache of a processor that is in
another node). More details about this NUCA-aware microbenchmark are given
in [78]. Results are presented in Table B.1. We observe that our low latency
design goal for the HBO locks is fulfilled, and performance is almost identical
¹ Currently, our system has 30 processors, 16 plus 14, and therefore we perform our experiments mainly on a 14 plus 14 configuration.
                            Previous Owner
Lock Type      Same Processor    Same Node    Remote Node
TATAS              150 ns          660 ns       2050 ns
TATAS_EXP          143 ns          613 ns       2070 ns
MCS                210 ns          732 ns       2120 ns
CLH                234 ns          806 ns       2630 ns
RH                 198 ns          672 ns       4480 ns
HBO                152 ns          652 ns       2010 ns
HBO_GT             152 ns          643 ns       2010 ns
HBO_GT_SD          149 ns          638 ns       2010 ns
Table B.1.: Uncontested performance for a single acquire-release operation.
with the simplest locks: TATAS and TATAS_EXP.
B.5.2. Traditional Microbenchmark
The traditional microbenchmark we use in this paper is a slightly modified version of the code used by Scott and Scherer in [93] on the same architecture—the Sun WildFire prototype SMP cluster. The code consists of a tight loop containing a single acquire-release lock operation, plus some critical section work for gathering statistics. In
addition, we initialize last_owner, a global variable inside the critical section,
and force the thread to observe a new owner before it is allowed to contend for a
lock again. The last remaining thread is excluded from this requirement in order
to run to completion (see [78] for more details).
The microbenchmark iteration time for parallel execution on a 2-node Sun
WildFire is shown in Figure B.3 (“Iteration Time” diagram). In this study, we
use round-robin scheduling for thread binding to different cabinets. Figure B.3
(“Locality” diagram) also shows the ratio of node handoffs for each lock type, reflecting how likely it is for a lock to migrate between nodes each time it is acquired. As expected, NUCA-aware locks consistently demonstrate low node-handoff numbers. The simple spin locks (especially TATAS) also show fairly low
node handoffs, which can be expected since local processors acquire a released
lock much faster than remote processors do. The queue-based locks are expected
to show node handoffs equal to (N/2)/(N − 1), since N/2 of the processors reside
in the other node, and we do not allow the same processor to acquire the lock
twice in a row. However, the queue-based locks exhibit unnatural behavior in
the node-handoff ratio. The simplistic microbenchmark we use, which is also used in most other lock studies, makes processors in the same node more likely to “queue up” after each other, so the node handoffs are substantially lower than expected.
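For N = 28 processors, for example, this expected node-handoff ratio is (N/2)/(N − 1) = 14/27, or roughly 52 percent.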
[Figure B.3 consists of two plots: “Iteration Time” (time in microseconds versus the number of processors) and “Locality” (node handoffs in percent versus the number of processors) for TATAS, TATAS_EXP, MCS, CLH, RH, HBO, HBO_GT, and HBO_GT_SD.]
Figure B.3.: Slightly modified traditional microbenchmark on a 2-node Sun WildFire system.
This is especially true for CLH, which takes unfair advantage of the test setup.
Our only explanation for this is pure luck. At 20 processors, the CLH lock shows a node-handoff ratio of about 24 percent. This also explains the varied performance in Figure B.3,
such as the good CLH performance at 20 and 24 processors. At 8–10 processors,
the node-handoff numbers are fairly normal for both queue-based locks. Here we
can see that a critical section guarded by NUCA-aware locks (especially by RH,
HBO_GT, and HBO_GT_SD) takes about half the time to execute compared
with the same critical section guarded by any other software-based lock.
B.5.3. New Microbenchmark
No real applications have a fixed number of processors pounding on a lock. Instead, they have a fixed number of processors spending most of their time on
noncritical work, including accesses to uncontested locks. They rarely enter the
“hot” critical section. The degree of contention is affected by the ratio of noncritical work to critical work. The unnatural node-handover behavior of the
traditional lock benchmark led us to a new lock benchmark, which we feel better
reflects the expected behavior of a real application. In the new microbenchmark,
the number of processors is kept constant. Each performs some amount of noncritical work, which consists of one static delay and one random delay of similar
sizes, between attempts to acquire the lock. Initially, the length of the noncritical work is chosen such that there is little contention for the critical section and
all lock algorithms perform the same. More contention is modeled by increasing
the number of elements of a shared vector that are modified before the lock is
released. The pseudocode of the new benchmark is shown in Figure B.4.
Figure B.5 (“Iteration Time” diagram) shows that the two queue-based locks
perform almost identically for the new benchmark. Figure B.5 (“Locality” diagram) also shows their node handover to be close to the expected values. The
simple spin locks still perform unpredictably. This is tied to their unpredictable
node handover. (TATAS values are measured for a critical_work of 0–1300 because its performance is poor for higher levels of contention.) The NUCA-aware
locks perform better the more contention there is, which can be explained by their decreasing amount of node handover. This is exactly the behavior we want in a
lock: the more contention there is, the better it should perform.
In Table B.2 we present the numbers for the traffic that is generated by our
new microbenchmark. The numbers are normalized to TATAS_EXP, which generates 15.1 million local and 8.9 million global transactions. Once again, the software queuing locks perform almost identically. We can also observe that the NUCA-aware locks generate less than half as many global transactions as any of the other software-based locks on this NUCA machine. The global traffic
is reduced by a factor of 15 compared to the traditional TATAS locks.
shared int cs_work[MAX_CRITICAL_WORK];
shared int iterations;

for (i = 0; i < iterations; i++) {
  ACQUIRE(L);
  {
    int j;
    for (j = 0; j < critical_work; j++)
      cs_work[j]++;
  }
  RELEASE(L);
  {
    int non_cs_work[MAX_PRIVATE_WORK];
    int j, random_delay;
    for (j = 0; j < private_work; j++)
      non_cs_work[j]++;
    random_delay = random() % private_work;
    for (j = 0; j < random_delay; j++)
      non_cs_work[j]++;
  }
}
Figure B.4.: New microbenchmark.
Lock Type      Local Transactions    Global Transactions
TATAS                4.41                   4.70
TATAS_EXP            1.00                   1.00
MCS                  0.53                   0.65
CLH                  0.54                   0.63
RH                   0.54                   0.28
HBO                  0.60                   0.30
HBO_GT               0.60                   0.30
HBO_GT_SD            0.61                   0.29
Table B.2.: Normalized local and global traffic generated for the new microbenchmark (critical_work = 1500, 28 processors).
[Figure B.5 consists of two plots: “Iteration Time” (time in seconds versus critical_work) and “Locality” (node handoffs in percent versus critical_work) for the same eight lock algorithms.]
Figure B.5.: New microbenchmark on a 2-node Sun WildFire system, 28-processor runs.
Program      Problem Size                           Total Locks    Lock Calls
Barnes       29k particles                               130          69,193
Cholesky     tk29.O                                       67          74,284
FFT          1M points                                     1              32
FMM          32k particles                              2,052          80,528
LU-c         1024×1024 matrices, 16×16 blocks              1              32
LU-nc        1024×1024 matrices, 16×16 blocks              1              32
Ocean-c      514×514                                       6           6,304
Ocean-nc     258×258                                       6           6,656
Radiosity    room, -ae 5000.0 -en 0.050 -bf 0.10       3,975         295,627
Radix        4M integers, radix 1024                       1              32
Raytrace     car                                          35         366,450
Volrend      head                                         67          38,456
Water-Nsq    2197 molecules                            2,206         112,415
Water-Sp     2197 molecules                              222             510
Table B.3.: The SPLASH-2 programs. Only emphasized programs are studied
further. Lock statistics are obtained for 32-processor runs.
B.5.4. Application Performance
In this section we evaluate the effectiveness of our new locking mechanisms using
explicitly parallel programs from the SPLASH-2 suite [111]. Table B.3 shows
SPLASH-2 applications with the corresponding problem sizes and lock statistics
(Total Locks is the number of allocated locks, and Lock Calls is the total number
of acquire-release lock operations during the execution). Problem size is a very
important issue in this context. Generally, the larger the problem size, the lower
the frequency of synchronization relative to computation. On the one hand, using large problem sizes will make synchronization operations seem less important.
On the other hand, small problem sizes might lead to very low speedup of the
application, rendering it uninteresting on a machine of this scale, even though
we chose the fairly standard problem sizes found in many other related investigations. We also chose to further examine only applications with more than
10,000 lock calls, as in the cases of Barnes, Cholesky, FMM, Radiosity, Raytrace,
Volrend, and Water-Nsq. For each application, we vary the synchronization algorithm used and measure the execution time on a 2-node Sun WildFire machine.
Programs are compiled with GNU's gcc-3.0.4, optimization level -O1.² Table B.4 presents the execution times in seconds for 28-processor runs for all eight studied locking schemes.³
² This is the highest level of optimization that does not break the correctness of execution for all applications and lock implementations.
Program      TATAS          TATAS_EXP      MCS            CLH            RH
Barnes       1.54 (0.05)    1.43 (0.01)    1.83 (0.15)    1.54 (0.10)    1.54 (0.14)
Cholesky     2.31 (0.07)    2.04 (0.04)    2.09 (0.03)    2.25 (0.11)    2.23 (0.06)
FMM          4.84 (0.33)    4.19 (0.19)    4.33 (0.06)    4.46 (0.07)    4.27 (0.13)
Radiosity    1.66 (0.06)    1.75 (0.07)    N/A            N/A            1.44 (0.07)
Raytrace     2.90 (0.91)    1.71 (0.18)    1.41 (0.28)    1.38 (0.32)    0.62 (0.01)
Volrend      1.70 (0.03)    1.57 (0.10)    1.48 (0.28)    1.75 (0.16)    1.61 (0.09)
Water-Nsq    2.37 (0.03)    2.25 (0.06)    2.20 (0.04)    2.45 (0.03)    2.21 (0.01)
Average      2.47 (0.21)    2.13 (0.09)    2.22 (0.14)    2.31 (0.13)    1.99 (0.07)

Program      HBO            HBO_GT         HBO_GT_SD
Barnes       1.50 (0.04)    1.69 (0.06)    1.44 (0.10)
Cholesky     2.06 (0.09)    2.34 (0.03)    2.13 (0.11)
FMM          4.37 (0.09)    4.59 (0.27)    4.27 (0.09)
Radiosity    1.45 (0.09)    1.68 (0.04)    1.51 (0.03)
Raytrace     0.77 (0.01)    0.70 (0.01)    0.72 (0.01)
Volrend      1.68 (0.14)    1.33 (0.10)    1.24 (0.03)
Water-Nsq    2.14 (0.03)    2.09 (0.02)    2.14 (0.01)
Average      2.00 (0.07)    2.06 (0.08)    1.92 (0.05)
Table B.4.: Application performance for eight locking algorithms for 28-processor
runs, 14 threads per WildFire node. Execution time is given in seconds and the variance is shown in parentheses.
[Figure B.6 plot: normalized speedup for Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, and the average, comparing TATAS, TATAS_EXP, MCS, CLH, and HBO_GT_SD.]
Figure B.6.: Normalized speedup for 28-processor runs on a 2-node Sun WildFire.
Variance is given in parentheses in the same table. Normalized speedup for TATAS, TATAS_EXP, MCS, CLH, and HBO_GT_SD is
shown in Figure B.6. For Barnes, the MCS lock is much worse than the ordinary
TATAS_EXP, and for Volrend and Water-Nsq that is also the case for the CLH
lock. As expected from our new microbenchmark study, on average, queue-based
locks perform about the same as the TATAS with exponential backoff.
We chose to further investigate only Raytrace. This application renders a
3D scene using ray tracing and is one of the most unpredictable SPLASH-2
programs [25]. Detailed analysis of Raytrace is beyond the scope of this paper
(see [25, 94, 111] for more details). In this application, locks are used to protect
task queues; locks are also used for some global variables that track statistics for
the program. A large amount of work is usually done between synchronization
points. Execution time in seconds for all eight synchronization algorithms for
single-, 28-, and 30-processor runs is shown in Table B.5.
NUCA-aware locks demonstrate very low measurement variance for both 28- and 30-processor runs. In the same table, we also demonstrate that the MCS and CLH locks are practically unusable for 30-processor runs. They are extremely sensitive to small disturbances produced by the operating system itself. The original software queuing locks have traditionally exhibited poor behavior in the presence of
³ An unmodified version of Radiosity will not execute correctly with software queuing locks on any optimization level. We did not investigate this any further.
Lock Type     1 CPU     28 CPUs        30 CPUs
TATAS         5.02      2.90 (0.91)    2.70 (0.45)
TATAS_EXP     5.26      1.71 (0.18)    2.05 (0.26)
MCS           5.05      1.41 (0.28)    > 200 s
CLH           5.30      1.38 (0.32)    > 200 s
RH            5.08      0.62 (0.01)    0.68 (0.00)
HBO           5.00      0.77 (0.01)    0.78 (0.01)
HBO_GT        5.02      0.70 (0.01)    0.75 (0.00)
HBO_GT_SD     5.02      0.72 (0.01)    0.80 (0.02)
Table B.5.: Raytrace performance. Execution time is given in seconds and the
variance is presented in parentheses.
[Figure B.7 plot: speedup versus the number of processors for TATAS, TATAS_EXP, MCS, CLH, RH, HBO, HBO_GT, and HBO_GT_SD.]
Figure B.7.: Speedup for Raytrace.
Program      TATAS          TATAS_EXP                    MCS            CLH            RH
Barnes       1.01 / 0.67    1.00 (2.3)  / 1.00 (1.8)     1.01 / 0.66    1.14 / 0.78    1.02 / 0.60
Cholesky     0.99 / 1.00    1.00 (14.3) / 1.00 (4.6)     0.96 / 0.87    0.97 / 0.90    0.95 / 0.87
FMM          1.09 / 1.17    1.00 (6.8)  / 1.00 (3.2)     0.99 / 0.83    0.97 / 0.80    1.00 / 0.83
Radiosity    1.06 / 1.08    1.00 (4.1)  / 1.00 (1.9)     N/A            N/A            1.00 / 0.85
Raytrace     1.15 / 1.24    1.00 (7.8)  / 1.00 (2.9)     0.91 / 0.84    1.04 / 0.78    0.86 / 0.49
Volrend      1.02 / 1.07    1.00 (5.2)  / 1.00 (1.3)     1.02 / 1.05    1.04 / 1.17    1.01 / 1.03
Water-Nsq    1.01 / 1.03    1.00 (2.8)  / 1.00 (1.2)     1.00 / 1.04    1.07 / 1.10    1.03 / 1.02
Average      1.05 / 1.04    1.00 / 1.00                  0.98 / 0.88    1.04 / 0.92    0.98 / 0.81

Program      HBO            HBO_GT         HBO_GT_SD
Barnes       0.92 / 0.61    0.92 / 0.62    0.97 / 0.62
Cholesky     0.96 / 0.90    0.96 / 0.90    0.97 / 0.91
FMM          0.96 / 0.84    0.99 / 0.89    1.03 / 0.98
Radiosity    1.01 / 0.89    0.92 / 0.82    0.99 / 0.98
Raytrace     0.83 / 0.58    0.82 / 0.58    0.81 / 0.64
Volrend      1.01 / 0.87    1.02 / 0.87    1.01 / 0.86
Water-Nsq    0.98 / 0.97    0.96 / 0.98    0.99 / 0.98
Average      0.95 / 0.81    0.94 / 0.81    0.97 / 0.85

Table B.6.: Normalized traffic (local/global) for all locking algorithms for 28-processor runs. Number of transactions in millions for TATAS_EXP
is shown in parentheses.
multiprogramming, because a process near the end of the queue, in addition to
having to wait for any process that is preempted during its critical section, must
also wait for any preempted process ahead of it in the queue. This unwanted
behavior of the queue-based locks has been studied further by Scott on the same
architecture and on the Sun Enterprise 10000 multiprocessor [91, 93].
Speedup for Raytrace is shown in Figure B.7. There is a significant decrease in
performance for all other locks above 12 processors, while the NUCA-aware locks
continue to moderately scale all the way up to 28 processors. We also present
the normalized traffic numbers for all synchronization algorithms in Table B.6.
B.6. Fairness and Sensitivity
The task of the lock-unlock synchronization primitives is to create a serialization schedule for each critical region such that simultaneous attempts to enter
the region will be ordered in some serial way. Any serial order will result in a
correct execution, as long as starvation is avoided. Fairness is often considered
a desirable property, since it can create an even distribution of work between
threads. Without fairness, the threads may arrive unevenly at a barrier, even
though they have performed the same amount of work. This will force the early
arriving threads to wait for the last arriving thread while performing no useful
work. However, the importance of fairness must be traded off with its impact on
performance [25].
It is not clear that fairness always results in the fastest execution. Assume that
all threads are expected to enter the same critical region exactly once between
two barriers. All threads arrive at the critical section at roughly the same time. Here, the serialization scheme with the shortest time between two contenders entering the critical section would be preferable to a fairness scheme that strictly schedules the threads according to their arrival time.
The queue-based locks implement a first come, first served order between simultaneous attempts to enter a critical region and guarantee both fairness and
starvation avoidance. The TATAS locks rely on the coherence implementation
for its serialization, and can not make such guarantees. They are dependent on
such guarantees made by the underlying coherence mechanism. HBO algorithms
are based on the TATAS proposals. In addition, they maximize node affinity of
the NUCAs, and improve the lock handover time by handing over the lock to a
waiting neighbor from the same NUCA node, rather than to the thread which has
waited the longest time in the system.
A very simple fairness study is performed on the new microbenchmark by measuring the finish times of all individual threads in the benchmark. The results
for all lock algorithms are shown in Figure B.8. As expected, we can see that
the queue-based locks are the fairest locks in this experiment; the percentage
difference in completion time between the first and the last processor is only 2.1
percent. TATAS_EXP seems to be the most “unfair” lock for this benchmark, with a corresponding difference of 28.9 percent. NUCA-aware locks perform about
the same, for example, the percentage difference for HBO_GT_SD is 5.6 percent,
which is quite close to the difference of software queuing lock implementations.
The sensitivity of the HBO_GT_SD algorithm is studied by running the new
microbenchmark on a 2-node Sun WildFire machine. Results for 26-processor runs are shown in Figures B.9(a) and B.9(b). In Figure B.9(a) we perform the study by varying the REMOTE_BACKOFF_CAP parameter (see Figure B.1),
and in Figure B.9(b), the GET_ANGRY_LIMIT parameter is varied (see Figure B.2).
The MCS and the HBO_GT algorithms are used for comparison.
B.7. Conclusions
Efficient and scalable general-purpose synchronization primitives should perform
well both at high and low contention. Most existing synchronization proposals,
such as queue-based locks and locks based on various backoff strategies, reduce
[Figure B.8 plot: number of finished processors versus time in seconds for all eight lock algorithms.]
Figure B.8.: Fairness study.
the traffic generated at high contention while adding a reasonable overhead at
low contention.
Node affinity is identified as yet another important property for scalable
general-purpose locks. NUCAs, for example CC-NUMAs built from a few large
nodes or from CMPs, have a lower penalty for reading data from a neighbor’s
cache than from a remote cache. Lock implementations that encourage handing
over locks to neighbors will improve the lock handover time, as well as the access
to the critical data guarded by the lock, but will also be vulnerable to starvation.
In this paper we propose a set of new HBO locks that exploit communication
locality and reduce global traffic for contended locks while adding less overhead
for uncontested locks than any of the software queue-based lock implementations.
We also suggest one simple solution for detecting starvation and lowering the risk
of starvation.
A critical section guarded by the HBO locks is shown to take about half the
time to execute compared with the same critical section guarded by any other
software-based lock. We also investigate the effectiveness and stability of our new
locks on a set of real SPLASH-2 applications. The global traffic in the system is
significantly reduced for several microbenchmarks and applications.
[Figure B.9 consists of two plots of normalized iteration time and node handoffs: (a) varying REMOTE_BACKOFF_CAP and (b) varying GET_ANGRY_LIMIT, with MCS and HBO_GT included for comparison.]
Figure B.9.: Sensitivity study of the HBO_GT_SD algorithm. The values are
normalized, and MCS and HBO_GT are used for comparison.
Acknowledgments
We thank Michael L. Scott and William N. Scherer III, Department of Computer
Systems, University of Rochester, for providing us with the source code for many
of the tested locks. We would also like to thank the Department of Scientific
Computing at Uppsala University for the use of their Sun WildFire machine.
We are grateful to Karin Hagersten for her careful review of the manuscript.
This work is supported in part by Sun Microsystems, Inc., and the Parallel and
Scientific Computing Institute (PSCI — ψ), Sweden.
Paper C
C. Removing the Overhead from
Software-Based Shared
Memory
Zoran Radović and Erik Hagersten
Uppsala University, Information Technology
Department of Computer Systems
P.O. Box 325, SE-751 05 Uppsala, Sweden
E-mail: {zoranr,eh}@it.uu.se
In Proceedings of Supercomputing 2001 (SC2001), Denver, Colorado, USA, November 2001.
© 2001 ACM 1-58113-293-X/01/0011 $5.00
Abstract
The implementation presented in this paper—DSZOOM-WF—is a sequentially
consistent, fine-grained distributed software-based shared memory. It demonstrates a protocol-handling overhead below a microsecond for all the actions involved in a remote load operation, to be compared to the fastest implementation
to date of around ten microseconds.
The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality,
similar to the emerging InfiniBand standard. All interrupt- and/or poll-based
asynchronous protocol processing is completely removed by running the entire coherence protocol in the requesting processor. This not only removes the asynchronous overhead, but also makes use of a processor that otherwise would stall.
The technique is applicable to both page-based and fine-grain software-based shared
memory.
DSZOOM-WF consistently demonstrates performance comparable to hardware-based distributed shared-memory implementations.
C.1. Introduction
Clusters of symmetric multiprocessors (SMPs) provide a powerful platform for
executing parallel applications. To allow shared-memory applications to run
on such clusters, software distributed shared-memory (SW-DSM) systems support the illusion of shared memory across the cluster via a software run-time
layer between the application and the hardware. This approach can potentially
provide a cost-effective alternative to hardware shared-memory systems for executing certain classes of workloads. SW-DSM technology can also be used to
connect several large hardware distributed shared-memory (HW-DSM) systems
and thereby extend their upper scalability limit.
Most SW-DSM systems keep coherence between page-sized coherence units [61,
18, 52, 98, 100]. The normal per-page access privilege of the memory-management
unit offers a cheap access control mechanism for these SW-DSM systems. The
large page-size coherence units in the earlier SW-DSM systems created extra false
sharing and caused frequent transfers of large pages between nodes. In order
to avoid most of the false sharing, weaker memory models have been used to allow
many update actions to be lumped to a specific point in time, such as the lazy
release consistency (LRC) protocol [51].
Fine-grain SW-DSM systems with a more traditional cache-line-sized coherence
unit have also been implemented. Here, the access control check is either done by
altering the error-correcting codes (ECC) [89] or by in-line code snippets (small
fragments of machine code) [89, 86]. The small cache-line size reduces the false
sharing for these systems, but the explicit access-control check adds extra latency
for each load or store operation to global data. The most efficient access check
reported to date is three extra instructions adding three extra cycles for each load
to global data [88].
Today’s implementations of SW-DSM systems suffer from long remote latencies and their scalability has never reached acceptable levels for general SMP
shared-memory applications. The coherence protocol is often implemented as
communicating software agents running in the different nodes sending requests
and replies to each other. Each agent is responsible for accessing its local memory
and for keeping a directory structure for “its part” of the shared address space.
The agent where the directory structure for a specific coherence unit resides is
called its home node. The interrupt cost associated with receiving a message for asynchronous protocol processing is the single largest component of the slow
remote latency, not the actual wire delay in the network or the software actually
implementing the protocol [12, 45]. To our knowledge, the shortest SW-DSM
read latency to date is that of Shasta [85]. The 15-microsecond round-trip read
latency is roughly divided into 5 microseconds of "real" communication and 10
microseconds of interrupt and agent overhead [30]. Most other SW-DSM implementations have substantially larger interrupt overheads, and latencies closer to
100 microseconds have been reported [89].
In this paper we suggest a new efficient approach for software-based coherence protocols. While other work has proposed elaborate schemes for cutting
down on the overhead associated with interrupting and/or polling caused by the
asynchronous communication between the agents [11, 70], our implementation
has completely eliminated the protocol-agent interactions. In DSZOOM the entire coherence protocol is implemented in the protocol handler running in the
requesting processor. This also makes use of a processor that otherwise would
have been idle. Rather than relying on a “directory agent” located in the home
node, as the synchronization point for the coherence of a cache line, we use a
remote atomic fetch-and-set operation to allow for protocol handlers running in
any node, not just the home node, to temporarily acquire atomic access to the
directory structure of the cache line. We believe that the solution presented here
would be beneficial both for page-sized and fine-grain SW-DSM systems, even
though we will only concentrate on fine-grain SW-DSM in this paper.
We have implemented the DSZOOM-WF system, a sequentially consistent [56]
fine-grain SW-DSM, between the nodes of a Sun Orange (earlier referred to as
Sun WildFire) system without relying on its hardware-based coherence capabilities [40, 72]. All loads and stores are instead performed to the node’s local
“private” memory. We use the executable editing library (EEL) [57] to insert
fine-grain access control checks before shared-memory loads and stores in a fully
compiled and linked executable. Global coherence is resolved by a coherence protocol implemented in C that copies data to the node’s “private” local memory by
performing loads and stores from and to remote memory.
A total of twelve unmodified SPLASH-2 applications [111], developed for fine-grain hardware SMP multiprocessors, are studied. We compare the performance
of a DSZOOM-WF system with that of a Sun Enterprise E6000 SMP server [96]
as well as the hardware-coherent Sun Orange DSM system. We have measured
the actual protocol overhead to be less than one microsecond for a remote load
operation, in addition to the 1.7 microseconds remote latency of the Sun Orange
hardware, i.e., a perceived remote latency of 2.7 microseconds for the application.
Our approach is close to what hardware cache coherence can do on the same
platform. On average, our implementation demonstrates a relative difference for
SPLASH-2 speedups of 31.6% compared to the hardware-based cache-coherent nonuniform memory access (CC-NUMA) system.
The remainder of this paper is organized as follows. Section C.2 presents our
basic idea and an introduction to a general DSZOOM system. The DSZOOM-WF
implementation is described in Section C.3. Section C.4 presents the experimental
environment, applications used in this study, and results of our performance
study. Finally, we present related work and conclude.
C.2. DSZOOM Overview
This section contains an overview of the general DSZOOM system. More protocol details are reported for the related DSZOOM-EMU system [75], our initial proof-of-concept DSZOOM implementation that emulates fine-grain software-based DSM between "virtual nodes," modeled as processes inside a single SMP.
C.2.1. Cluster Networks Model
DSZOOM assumes a cluster interconnect with an inexpensive user-level mechanism to access memory located in other nodes, similar to the remote put/get
semantics found in the cluster version of the Scalable Coherence Interface (SCI)
implementation by Dolphin.¹ A ping-pong round-trip latency of 5 microseconds,
including MPI protocol overhead, has been demonstrated on a SCI network with
a 2 microsecond raw read latency. Some of the memory in the other nodes is
mapped into a node’s I/O space and can be accessed using ordinary load and
store operations. The different cluster nodes run different kernel instances and
do not share memory with each other in a coherent way; in other words, no invalidation messages are sent between the nodes to maintain coherence when replicated data are altered in one node. This removes the need for the complicated
coherence scheme implemented in hardware and allows the NIC to be connected
to the I/O bus, e.g., PCI or SBUS, rather than to the memory bus. In order to
prevent a “wild node” from destroying crucial parts of other nodes’ memories, the
incoming transactions are sent through a network MMU (IOMMU). Each kernel
needs to set up appropriate IOMMU mapping to the remotely accessible part of
its memory before the other nodes are accessed. Given the correct initialization
of the IOMMU, user-level accesses to remote memory are enabled. SCI-Cluster
is widely used as the high-performance cluster interconnect by Sun Microsystems
for large commercial and technical systems.
We further assume support for two new remote-access operations not supported
by the SCI-Cluster: the half-word-wide put2 and fetch-and-set2 (fas2). The fas2
operation is launched by a “normal” half-word load operation and the put2 is
launched by a half-word store to the remotely mapped I/O space. The network
interface detects the half-word load and converts it into a fetch-and-set. The fas2
operation will return the 2 bytes of data that was stored in the remote memory
and also atomically set the most significant byte of the data in the remote memory.
The fas2 primitive is used to acquire a lock and retrieve a corresponding small
data-structure in a single operation.
There are strong indications that interconnects fulfilling our assumptions will
soon be widely available. The emerging InfiniBand interconnect proposal
¹ SCI is better known for its implementation of coherent shared memory than its non-coherent internode cluster communication. In this paper we only refer to its usage as a cluster interconnect.
supports efficient user-level accesses to remote memory as well as atomic operations
to smaller pieces of data, e.g., CmpSwap (Compare and Swap) and FetchAdd
(Fetch and Add) [46]. InfiniBand’s FetchAdd can effectively implement a function similar to the fas2 functionality for a system with up to 128 nodes. The least
significant byte (LSB) of the data entity accessed is the “lock” and the remaining
part of the data entity is the payload data. A FetchAdd returning data with a
zero LSB means that the lock was acquired. The lock is released and the payload
data is updated in a single operation by writing the new payload value with a zero
byte concatenated at the LSB end to the data entity. In order to avoid mangling
the payload data for contended locks, a FetchAdd returning a LSB with a value
above 128 will require the contenders to poll the data-structure using ordinary
fetch operations until an LSB value below 128 has been observed.
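A rough sketch of this convention in C, assuming a 64-bit directory word, the thresholds quoted above, and a hypothetical fetch_add() wrapper around InfiniBand's FetchAdd:

#include <stdint.h>

/* fetch_add(addr, v): assumed wrapper that atomically adds v to the remote
   64-bit word at addr and returns its previous value. */
extern uint64_t fetch_add(volatile uint64_t *addr, uint64_t v);

uint64_t acquire_entry(volatile uint64_t *addr)   /* returns the payload */
{
    for (;;) {
        uint64_t old = fetch_add(addr, 1);        /* try to take the lock byte */
        if ((old & 0xff) == 0)
            return old >> 8;                      /* LSB was zero: lock acquired */
        if ((old & 0xff) > 128)                   /* heavily contended: fall back */
            while ((*addr & 0xff) >= 128) ;       /* to ordinary fetches */
    }
}

void release_entry(volatile uint64_t *addr, uint64_t payload)
{
    *addr = (payload << 8);                       /* new payload, zero lock byte */
}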
C.2.2. Node Model
Each DSZOOM node consists of an SMP multiprocessor, e.g., the Sun Enterprise
E6000 SMP with up to 30 processors or the Pentium Pro Quad with up to
four processors. The SMP hardware keeps coherence among the caches and the
memory within each SMP node. The InfiniBand-like interconnect, as described
above, connects the nodes. We further assume that the write order between any
two endpoints in the network is preserved.
C.2.3. Blocking Directory Protocol Overview
Most of the complexity of a coherence protocol is related to the race conditions caused by multiple simultaneous requests for the same cache line. Blocking
directory coherence protocols have been suggested to simplify the design and verification of hardware DSM systems [40]. The directory blocks new requests to a
cache line until all previous coherence activity to the cache line has ceased. The
requesting node sends a completion signal upon completion of the activity, which
releases the block for the cache line. This eliminates all the race conditions, since
each cache line can only be involved in one ongoing coherence activity at any
specific time.
The DSZOOM protocol implements a distributed version of a blocking protocol.
A processor that has detected the need for global coherence activity will first
acquire a lock associated with the cache line before starting the coherence activity.
A remote fas2 operation to the corresponding directory entry in the home node
will bring the directory entry to the processor and also atomically acquire the
cache line’s “lock.” If the most significant byte of the directory entry returned is
set, the cache line is "busy" with some other coherence activity. The fas2 operation
is repeated until the most significant byte is zero. (A random back-off scheme can
be used to avoid a live-lock situation, but has not been employed in DSZOOM
yet.)

[Figure C.1: Read data from home node — 2-hop read. The requestor issues 1a. fas2 to the home node's directory and 1b. get64 for the data from memory, then 2. put2 to release the directory. Legend: small packet (~10 bytes), large packet (~68 bytes), messages on and off the critical path.]

Now, the processor has acquired the exclusive right to perform coherence
activities on the cache line and has also retrieved the necessary information in
the directory entry using a single operation. The processor now has the same
information as, and can assume the role of, the “directory agent” in the home node
of a more traditional SW-DSM implementation. Once the coherence activity is
completed, the lock is released and the directory is updated by a single put2
transaction. No memory barrier is needed after the put2 operation since any
other processor will wait for the most significant byte of the directory entry to
become zero before the directory entry can be used. Thus, the latency of the
remote write will not be visible to the processor.
To summarize, we have enabled the requesting processor to momentarily assume the role of a traditional “directory agent,” including access to the directory
data, at the cost of one remote latency and the transfer of two small network
packets. This has the advantage of removing the need for asynchronous interrupts in foreign nodes and also allows us to execute the protocol in the requesting
processor that most likely would be idle waiting for the data. A further advantage is that the protocol execution is divided between all the processors in the
node, not just one processor at a time as suggested in some other proposals, for
example by Mukherjee et al. [70].
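The sketch below illustrates this acquire/release pattern in C. Here remote_fas2() and remote_put2() are hypothetical wrappers for the remote half-word operations, and the two-byte directory layout (coherence lock in the most significant byte, presence bits in the other byte, as detailed in the next subsection) is an assumption made for illustration.

    #include <stdint.h>

    /* Hypothetical wrappers for the remote half-word primitives: fas2
     * returns the 2-byte directory entry and atomically sets its most
     * significant byte; put2 is a remote 2-byte store.                    */
    extern uint16_t remote_fas2(uint16_t *dir_entry);
    extern void     remote_put2(uint16_t *dir_entry, uint16_t value);

    /* Acquire the coherence "lock" of a cache line's directory entry in
     * the home node and return the rest of the entry (the presence bits). */
    static uint8_t dir_acquire(uint16_t *dir_entry)
    {
        for (;;) {
            uint16_t old = remote_fas2(dir_entry);
            if ((old >> 8) == 0)        /* MSB was clear: lock acquired    */
                return (uint8_t)old;
            /* Busy with another coherence activity: retry.  A random
             * back-off could be inserted here to avoid live-lock (not
             * employed in DSZOOM yet).                                    */
        }
    }

    /* Update the presence bits and release the lock with a single put2;
     * the most significant byte is written back as zero, so no memory
     * barrier is needed after the write.                                  */
    static void dir_update_release(uint16_t *dir_entry, uint8_t presence_bits)
    {
        remote_put2(dir_entry, (uint16_t)presence_bits);
    }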
C.2.4. Protocol Details
The SMP hardware keeps the coherence within the node, on top of which the
global DSZOOM protocol has been added. All the coherence activities and state
names discussed in this paper apply to the DSZOOM protocol.
The DSZOOM protocol states, Modified, Shared and Invalid (MSI), are
explicitly represented by data structures in the node memory.

[Figure C.2: Read data modified in a third node — 3-hop read. The requestor issues 1. fas2 to the home node's directory, then 2a. fas2 to the slave node's MTAG and 2b. get64 for the data, and finally 3a. put2 and 3b. put2 to release the directory and the MTAG.]

The DSZOOM
directory entry has eight presence bits per cache line, i.e., can support up to
eight SMP nodes. The location of a cache line’s directory entry, i.e., its “home
node,” is determined by some of the cache line’s address bits. To avoid most of the
accesses to the directory caused by global load operations, all cache lines in state
Invalid store by convention a “magic” data value as independently suggested
by Schoinas et al. [89] and Scales et al. [86]. The directory only has to be
consulted if the register contains the magic value after the load. Whenever our
selected magic value is indeed the intended data value, the directory state must
be examined at the cost of some unnecessary global activities. This has, however,
proven to be a very rare event in all our studied applications.
To also avoid most of the accesses to the directory caused by global store
operations, each node has two bytes of local state (MTAG) per global cache line
(similar to the private state table found in Shasta [85]), indicating if the cache
line is locally writable. Before each global store operation, the MTAG byte is
locked by a local atomic operation before the node’s write permission to the cache
line is determined. The directory only has to be consulted if the MTAG indicates that
the node currently does not have write permission to the cache line. The home
node can access the cache line’s directory entry by a local memory access and
does not need any extra MTAG state.
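One possible C representation of these per-cache-line structures and of the home-node mapping is sketched below; the field layout, the modulo-based selection of address bits, and the 128-byte coherence unit are illustrative assumptions rather than the exact DSZOOM encoding.

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_LINE  128u   /* coherence unit used in the performance study */
    #define MAX_NODES   8u     /* eight presence bits per directory entry      */

    /* One 2-byte directory entry per global cache line, stored in the line's
     * home node: one byte for the coherence lock, one byte of presence bits. */
    typedef struct {
        uint8_t lock;          /* set by fas2 while a coherence activity runs */
        uint8_t presence;      /* one bit per SMP node holding a copy         */
    } dir_entry_t;

    /* Two bytes of node-local MTAG state per global cache line, telling a
     * store whether the line is locally writable without asking the home.   */
    typedef struct {
        uint8_t lock;          /* taken with a local atomic before each store */
        uint8_t writable;      /* nonzero: this node may write the line       */
    } mtag_entry_t;

    /* The home node of a cache line is determined by some of its address
     * bits; a simple modulo over the line number is used here.              */
    static inline unsigned home_node(uintptr_t global_addr)
    {
        return (unsigned)((global_addr / CACHE_LINE) % MAX_NODES);
    }

    /* Index of the line's directory (or MTAG) entry within that node's array. */
    static inline size_t line_index(uintptr_t global_addr)
    {
        return (size_t)((global_addr / CACHE_LINE) / MAX_NODES);
    }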
The traditional SW-DSM’s software handler running in the current processor
sends a message to the home node and busy-waits for the reply. A new software handler is invoked in the home node upon the arrival of the request. The
home handler retrieves the requested data from its local memory and modifies
the corresponding directory structure before returning the data reply to the requesting handler.

Algorithm 1 Pseudo-code for global coherence load operations. The emphasized line
is implemented as in-line assembler, while the remaining protocol is implemented
by a routine coded in C.
IF (register == MAGIC) {
    lock(dir)
    IF (presence_bits(me) == 0) {
        IF ((number(presence_bits) == 1) &&
            (remote_node != home)) {
            lock(remote_mtag)
            // Data can not be altered in the remote node now
            read_remote(data)
            update_release(remote_mtag)
        }
        ELSE {
            read_remote(data)
        }
    }
    update_release(dir)
}
The two major drawbacks of this approach are the latency
from asynchronously invoking a handler in the home node and the simultaneous
occupancy of two handlers during most of the protocol handling, i.e., occupying
two processors.
Figure C.1 illustrates the DSZOOM activity caused by a read miss in a two-node system. The software protocol handler in the DSZOOM example will acquire
exclusive access right to the directory entry through a single remote fas2 operation
to the home node. In parallel it also speculatively retrieves the data from the
home node through a remote get64 operation. The directory entry is updated
and released by a single remote put2 operation at the end of the handler. The
protocol handler is completed as soon as the put2 write operation is issued to the
write buffer, so the latency of this operation is not on the critical latency path
of the application. While the DSZOOM approach will drastically cut the latency
for retrieving remote data and will avoid using any processor time in the home
node, its major drawback is the global bandwidth consumed. To illustrate the
excess bandwidth consumed by DSZOOM, each global packet has been marked
as either a “small packet,” with a payload of less than 6 bytes, or a “large packet,”
with a payload of 64 bytes. (We have not included the implicit acknowledge packets
that may be used by the lower-level network implementation.) Each packet type is
assumed to also carry 2 bytes of cyclic redundancy code (CRC) and 2 bytes of
routing/header information, so the total packet sizes are 10 and 68 bytes,
respectively. Based on these
assumptions, DSZOOM’s four small and one large packet will transfer 108 bytes
compared with the 78 bytes used by the traditional approach, i.e., 38% more
bandwidth is used in DSZOOM.
A similar bandwidth overhead can be seen in the example in Figure C.2, showing
a three-node system performing a 3-hop read operation, i.e., a read request to
data that resides in a modified state in a node other than the home node,
called the slave node. DSZOOM will need one fas2 message to lock and acquire
the directory and determine the identity of the node holding the modified data.
A second fas2 to that node’s MTAG structure will temporarily disable write
accesses to the data. Right after the fas2 has been issued, a get64 is issued to
speculatively bring the data to the requesting node. The directory entry and the
MTAG are updated and released through two put2 write operations at the end of
the handler, i.e., off the critical path. Again, DSZOOM will need more messages
to complete its task: seven small and one large packet compared with the three
small and one large used by the traditional approach. The traditional SW-DSM
approach will need two asynchronous interrupts on the critical path before the
data is forwarded to the requesting node. Thus, DSZOOM will require 41% more
bandwidth for this particular operation. This is the worst-case protocol example
for DSZOOM which, fortunately, is not that common in the studied examples.
Algorithm 1 shows pseudo-code for global coherence load operations. The
pseudo-code for global coherence store and load-store operations is shown in Algorithm 2. Emphasized lines in both algorithms are implemented as UltraSPARC
in-line assembler, while the remaining protocol is implemented by routines coded
in C.
C.3. Implementation Details
This section describes our implementation. DSZOOM-WF is a sequentially consistent [56] fine-grain SW-DSM implemented on top of a 2-node Sun Orange prototype SMP cluster configured as a cache-coherent, non-uniform memory access
(CC-NUMA) architecture (without relying on its hardware-coherent capabilities).
Our cluster is built from two Sun Enterprise E6000 SMP machines (referred to
as cabinet 1 and cabinet 2). The DSZOOM-WF compilation process is shown in Figure C.3. The unmodified SMP application written with PARMACS macros is
first preprocessed with the m4 macro preprocessor. m4 will replace all macros with
DSZOOM-WF run-time library calls. A standard GNU gcc compiler is used to
compile and link the preprocessed file with a DSZOOM-WF run-time library.
The resulting file, the “(Un)executable,” is then passed to our binary modification tool that is based on an unmodified version of the executable editing library
(EEL) [57]. The binary modification tool inserts fine-grain access control checks
after shared-memory loads, it inserts range checks and node-local MTAG lookups
before stores, and it also adds calls to the corresponding coherence protocol
routines shown in Algorithm 1 and Algorithm 2.

Algorithm 2 Pseudo-code for global coherence store and load-store operations.
Emphasized lines are implemented as in-line assembler, while the remaining protocol is implemented by a routine coded in C.
lock(my_mtag)
IF (my_mtag == my_mask) {
    // Is only "my" bit set?
    IF (me != home) {
        // Have we already locked the dir?
        lock_test(dir)
        // Try once to lock the directory
        // Release our MTAG if dir is busy
        IF (busy(dir)) {
            release(my_mtag)
            // To avoid deadlocks
            lock(dir)
            // Now, first lock directory
            lock(my_mtag)
            // then lock MTAG
        }
    }
    // Now we have locked the dir for sure!
    IF (number(presence_bits) != 1) {
        // The data is shared by many nodes and is not writable
        IF (presence_bits(me) == 0) {
            // My data is not valid
            read_data_from_one_node
        }
        FOREACH sharer {
            store_remote(MAGIC)
            // Invalidate remote nodes
        }
    }
    ELSE IF (presence_bits(me) == 0) {
        // There is a single node with a writable copy,
        // and it is not me
        IF (me == home) {
            // The dir is already locked
            read_remote(data)
            store_remote(MAGIC)
        }
        ELSE {
            lock(remote_mtag)
            read_remote(data)
            store_remote(MAGIC)
            update_release(remote_mtag)
        }
    }
    IF (me != home) {
        update_release(dir)
    }
    update_release(my_mtag)
}
[Figure C.3: DSZOOM-WF compilation process. The unmodified SPLASH-2 application and the DSZOOM-WF implementation of the PARMACS macros are preprocessed by m4, compiled and linked with the DSZOOM-WF run-time library by GNU gcc into the "(Un)executable," which EEL instruments with the coherence protocols to produce the final a.out.]
Finally, the a.out is produced and
can be used as if it was executed inside one SMP.
The implementation of PARMACS macros is based on the System V IPC version of the shared-memory macros developed by Artiaga et al. [6, 5]. The macro
library was modified in several ways. We use user-level synchronization through
physically distributed test-and-set locks instead of System V IPC semaphore library calls. Additionally, we added support for process distribution by using
the Solaris system call pset_bind. The support for correct memory placement
and directory distribution based on Sun Orange “first-touch” policy was added
as well. We also added support for memory-mapped communication between the
processes. Address space layout and attachment of the shared-memory objects
for processes in cabinet 1 is shown in Figure C.4. The compiled code makes
global memory accesses to the G_MEM area. Shared-memory objects with shared-memory identifiers A and B represent the physically shared memory of every
node in the cluster. The shared-memory identifier P, which is physically allocated and placed in cabinet 1, is attached to the PROFILE_DATA area with
the standard Solaris system call shmat. This globally shared-memory segment is
used for profiling purposes only. Local run-time system data for every process
(e.g., the DSZOOM-WF process identifier, the UNIX process identifier, etc.) is stored
in a privately mapped PRIVATE_DATA area in the current implementation, and is
also used mainly for debugging and profiling purposes. The distributed directory is
placed at the beginning of the G_MEM area.

[Figure C.4: Address space layout and attachment of the shared-memory objects for processes running in cabinet 1. PRIVATE_DATA is mapped privately per process at 0x20000000, PROFILE_DATA (shmid P) is attached at 0x40000000, and G_MEM is attached at 0x80000000 (shmid A, aliased with Cabinet_1_G_MEM); Cabinet_2_G_MEM (shmid B) maps the physical memory of cabinet 2.]
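A simplified sketch of this setup is shown below, using the fixed attach addresses from Figure C.4. The IPC keys, segment sizes, and error handling are hypothetical, and the real run-time system additionally aliases the per-cabinet segments and relies on the first-touch placement policy.

    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Fixed attach addresses from the address-space layout in Figure C.4. */
    #define G_MEM_ADDR     ((void *) 0x80000000)
    #define PROFILE_ADDR   ((void *) 0x40000000)

    #define G_MEM_KEY      0x2001              /* hypothetical IPC keys */
    #define PROFILE_KEY    0x2002
    #define G_MEM_SIZE     (256u << 20)        /* hypothetical sizes    */
    #define PROFILE_SIZE   (16u << 20)

    /* Create (or look up) the shared segments and attach them at fixed
     * virtual addresses, so that instrumented global accesses hit G_MEM. */
    static int attach_global_memory(void)
    {
        int g_id = shmget(G_MEM_KEY, G_MEM_SIZE, IPC_CREAT | 0600);
        int p_id = shmget(PROFILE_KEY, PROFILE_SIZE, IPC_CREAT | 0600);
        if (g_id < 0 || p_id < 0)
            return -1;

        if (shmat(g_id, G_MEM_ADDR, 0) == (void *) -1)
            return -1;
        if (shmat(p_id, PROFILE_ADDR, 0) == (void *) -1)
            return -1;

        return 0;
    }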
Binary instrumentation is a technique usually described as a low-cost, medium-effort approach for inserting sequences of machine instructions into a program
in executable or object format. We decided to use the executable editing library
(EEL), a library that was successfully used in several similar projects based on the
UltraSPARC architecture, e.g., Blizzard-S [90] and Sirocco-S [88]. The following
code example shows the code snippet for one global floating-point fine-grain access
control check.
1:      ld     [addr], %reg        // original load
2:      fcmps  %fcc0, %reg, %reg
3:      nop
4:      fbe,pt %fcc0, hit
5:      // call global coherence load routine ...
hit:
The “magic” value in this case is a small integer corresponding to an IEEE
floating-point NaN. Only instructions 1–4 are executed if the loaded data is valid,
i.e., %reg holds a non-magic value. (Line 3, the nop, can be eliminated if the code
is executed on a SPARC-V9 architecture.) Thus, this access control check is
comparable to the most efficient access check reported to date: three extra
instructions adding three extra cycles for each load to global data [88]. The actual
implementation of the low-level fine-grain instrumentation is still far from optimal.
The DSZOOM-WF system requires in total between 7 and 8 instructions after every
global load and 17 instructions before every global store/load-store. The reason
why our instrumentation overhead for stores/load-stores is so high compared to
some other fine-grain SW-DSM implementations (for example, Shasta [85] and
Sirocco-S [88]) is that our local MTAG lookups are atomic in the current
implementation, i.e., the MTAG byte is locked by a local atomic operation before the node’s write permission to the cache line is determined. The characterization of the
dynamic overheads for the studied applications is presented in Section C.4.3.
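In C terms, the check emitted before every global store behaves roughly like the sketch below. The helper functions and the single "writable" flag are simplifications of the real MTAG encoding and of Algorithm 2, and the range check that precedes the lookup is omitted; the sketch is only meant to show why the store-side check is heavier than the load-side check.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers: a node-local atomic test-and-set on the MTAG
     * lock byte, its release, lookups of the MTAG entry for a global
     * address, and the slow-path coherence routine of Algorithm 2.        */
    extern bool mtag_try_lock(volatile uint8_t *mtag_lock);
    extern void mtag_unlock(volatile uint8_t *mtag_lock);
    extern volatile uint8_t *mtag_lock_for(void *global_addr);
    extern volatile uint8_t *mtag_writable_for(void *global_addr);
    extern void global_store_slowpath(double *global_addr, double value);

    /* Lock the local MTAG entry, and only enter the coherence protocol when
     * this node does not currently have write permission for the line.     */
    static inline void checked_global_store(double *global_addr, double value)
    {
        volatile uint8_t *lock     = mtag_lock_for(global_addr);
        volatile uint8_t *writable = mtag_writable_for(global_addr);

        while (!mtag_try_lock(lock))
            ;                              /* local spin, no remote traffic */

        if (*writable)
            *global_addr = value;          /* fast path: plain local store  */
        else
            global_store_slowpath(global_addr, value);

        mtag_unlock(lock);
    }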
C.4. Performance Study
This section describes the experimental setup, the applications used in this study, the sequential and parallel binary instrumentation overheads, and, finally, the DSZOOM-WF
performance results for parallel execution.
C.4.1. Experimental Setup
Most experiments in this paper are performed on a Sun Enterprise E6000 SMP
[96]. The server has 16 UltraSPARC II (250 MHz) processors and 4 Gbyte of uniformly shared memory with an access time of 330 ns (lmbench latency [67]) and
a total bandwidth of 2.7 Gbyte/s. Each processor has a 16 kbyte on-chip instruction cache, a 16 kbyte on-chip data cache, and a 4 Mbyte second-level off-chip
data cache.
The hardware DSM numbers have been measured on a 2-node Sun Orange built
from two E6000 nodes connected through a hardware-coherent interface with a
raw bandwidth of 800 Mbyte/s in each direction [40, 72]. The Orange system
has been configured as a traditional cache-coherent, non-uniform memory access
(CC-NUMA) architecture with its data migration capability activated while its
coherent memory replication (CMR) has been kept inactive. The Sun Orange
access time to local memory is the same as above, 330 ns, while accessing data
located in the other E6000 node takes about 1700 ns (lmbench latency). The
E6000 and the Orange DSM system are both running a slightly modified version
of the Solaris 2.6 operating system.
DSZOOM-WF runs in user space on the Sun Orange system. The data migration and the CMR data replication of the Orange interconnect are kept inactive.
C.4.2. Applications
The benchmarks we use in this study are well-known scientific workloads from
the SPLASH-2 benchmark suite [111]. The data-set sizes for the applications
studied are presented in Table C.1. The reason why we cannot run Volrend and
Cholesky (which are also part of the original SPLASH-2 distribution) is that the
global variables used as shared are not correctly allocated with the G_MALLOC
macro. It should be possible to manually modify those applications to solve this
problem. (Artiaga has experienced similar problems with the original fork-exec
versions of Volrend and Cholesky [4].) We began all measurements at the start of
the parallel phase to exclude DSZOOM-WF’s run-time system initialization time.
Program      Problem Size, Iterations        Uniproc Time   % Load   % Store   Instrumented Uniproc Time
FFT          1 048 576 points (48.1 MB)       15.47 s        19.0     16.5       21.28 s (1.38)
LU-c         1024×1024, block 16 (8.0 MB)     69.17 s        15.5      9.4      109.64 s (1.59)
LU-nc        1024×1024, block 16 (8.0 MB)     82.43 s        16.7     11.1      123.81 s (1.50)
Radix        4 194 304 items (36.5 MB)        28.95 s        15.6     11.6       32.68 s (1.13)
Barnes       16 384 bodies (32.8 MB)          37.16 s        23.8     31.1       38.44 s (1.03)
FMM          32 768 particles (8.1 MB)       109.76 s        17.5     13.6      116.27 s (1.06)
Ocean-c      514×514 (57.5 MB)                43.89 s        27.0     23.9       58.77 s (1.34)
Ocean-nc     258×258 (22.9 MB)                17.04 s        11.6     28.0       21.14 s (1.24)
Radiosity    room (29.4 MB)                   25.10 s        26.3     27.2       26.74 s (1.07)
Raytrace     car (32.2 MB)                     9.73 s        19.0     18.1       11.75 s (1.21)
Water-nsq    2197 mols., 2 steps (2.0 MB)     85.01 s        13.4     16.2       90.06 s (1.06)
Water-sp     2197 mols., 2 steps (1.5 MB)     22.98 s        15.7     13.9       24.95 s (1.09)

Average                                       45.56 s        18.4     18.3       56.29 s (1.22)

Table C.1.: Data-set sizes and sequential-execution times for non-instrumented
and instrumented SPLASH-2 applications. Fourth and fifth column
show percentage of statically replaced loads and stores. Binary instrumentation overhead is given in parentheses in the last column.
C.4.3. Binary Instrumentation
Efficient binary instrumentation (or compiler support) is very important for the
overall DSZOOM performance. In this section we primarily focus on characterizing the overhead of inserted fine-grain access control checks for global loads
and stores by presenting the dynamic overheads for all of the studied SPLASH-2
programs.
Sequential-execution times for non-instrumented programs, and the percentage of statically replaced loads/stores for SPLASH-2 applications are shown in
Table C.1. Currently, our binary modification tool statically replaces on average
18.4% of the loads and 18.3% of the stores during the instrumentation phase. We use numerous techniques to perform static and dynamic analysis of the unmodified binaries
in order to recognize as many global loads and stores as possible [109, 57, 86], i.e.,
the binary modification tool will not replace many of the stack and static data
accesses. Uniprocessor execution times in seconds for instrumented programs are
also shown in Table C.1. Efficiency overhead for sequential execution, presented
in parentheses in the last column, is between 1.03 and 1.59 (which averages 1.22)
for all of the studied applications. Thus, instrumented code takes between 3% and
59% longer time to execute the program after the global fine-grain access control
checks for loads and atomic/node-local MTAG lookups for stores are added.
State-of-the-art checking overheads (for example, in Shasta [85]) are in the
range of 5–35%. Unfortunately, the increased software checking overhead gives us a
poor starting point for some of the applications (e.g., FFT, LU-c, and LU-nc).
[Figure C.5: Normalized instrumentation overhead breakdown for sequential execution. For each application, the bar is divided into the E6000 sequential time and the overheads of the integer and floating-point load and store snippets (int-LD, f-p-LD, int-ST, f-p-ST).]
The normalized instrumentation overhead breakdown for sequential execution is
shown in Figure C.5. Floating-point store snippets are the major slowdown factor
for FFT, LU-c, and LU-nc. LU is one of the most store-intensive SPLASH-2
applications [111] and will typically perform much better on software-based DSM
systems with weaker memory models (for example, on GeNIMA [11] with a home-based LRC protocol).
In order to examine the instrumentation overhead for parallel execution with
8 and 16 processors, we configure DSZOOM-WF as a single-node system with
a cache-line-sized coherency unit of 128 bytes, i.e., a system without inter-node
communication. This shows the overhead introduced by the inserted run-time
in-line checks (ILCs) when there is no protocol activity. Table C.2 presents
the parallel binary instrumentation overheads for a single-node DSZOOM-WF
configuration. The instrumentation overheads for 8-processor nodes averages
1.25 (the overhead increases by 2.5% compared to the sequential execution). For
16-processor nodes, this overhead averages 1.18 (the overhead decreases by 4.6%
compared to the sequential execution).
C.4.4. Parallel Performance
This section presents the parallel performance of the applications for the DSZOOM-WF system. We report results for 2-node SMP clustering of 4 and 8 processors per node.
Program      8 Processors        16 Processors
FFT          1.29 (-8.9%)        1.08 (-29.8%)
LU-c         1.58 (-0.2%)        1.50 (-8.3%)
LU-nc        1.60 (+9.6%)        1.44 (-6.2%)
Radix        1.15 (+2.4%)        1.07 (-6.0%)
Barnes       1.15 (+11.5%)       1.05 (+1.7%)
FMM          1.03 (-3.2%)        1.02 (-3.6%)
Ocean-c      1.25 (-8.4%)        1.14 (-20.3%)
Ocean-nc     1.56 (+32.3%)       1.52 (+28.2%)
Radiosity    1.09 (+2.4%)        1.06 (-0.8%)
Raytrace     1.20 (-1.2%)        1.10 (-10.6%)
Water-nsq    1.06 (-0.4%)        1.05 (-0.6%)
Water-sp     1.09 (+0.9%)        1.09 (+0.8%)

Average      1.25 (+2.5%)        1.18 (-4.6%)

Table C.2.: Instrumentation overheads for parallel execution on a single-node system. The change in overheads for 8- and 16-processor runs compared
to the uniprocessor overheads is presented in parentheses.
We also characterize the inserted overheads compared to the unmodified SMP applications by presenting the dynamic overheads for instrumented
SPLASH-2 programs.
Figure C.6 shows execution times in seconds for 8- and 16-processor runs for
Sun Enterprise E6000, 2-node CC-NUMA, and three different DSZOOM configurations:
❏ Single-node DSZOOM-WF. This is a system without inter-node communication. It shows the effects of the inserted run-time in-line checks for
global loads and stores as described in the previous section.
❏ DSZOOM-EMU. This is a system without any “real” physical memory
and process distribution. This configuration emulates DSZOOM-WF between “virtual nodes,” modeled as processes inside a single SMP multiprocessor. It shows the effects of the protocol processing for a 2-node DSZOOM-WF system.
❏ 2-node DSZOOM-WF. This configuration is a “real” DSZOOM-WF implementation. Both the memory and the running processes are physically
distributed across the nodes. If we compare this configuration to the previous one, we can see how the Sun Orange interconnect impacts performance.
A cache-line-sized coherency unit of 128 bytes is used for all configurations. The
performance of several 16-processor runs shown in Figure C.6(b) is lower than
expected. This is due to the contention on the SMP memory bus mainly caused
by the misses from processors within that particular SMP node.
Table C.3 shows the efficiency overheads for the effects of in-line checks, global
coherence protocol processing, and the effects of the physical memory and process
distribution across the nodes. The efficiency overhead numbers are derived from
Figure C.6(a). On average, the run-time in-line checks are the largest efficiency
overhead factor for 8-processor runs.

Program      ILC     Protocol   Distribution
FFT          1.29    1.27       1.14
LU-c         1.58    1.00       1.02
LU-nc        1.60    1.01       1.00
Radix        1.15    1.16       1.10
Barnes       1.15    1.03       1.06
FMM          1.03    1.09       1.08
Ocean-c      1.25    1.08       1.08
Ocean-nc     1.56    1.09       1.03
Radiosity    1.09    1.21       1.18
Raytrace     1.20    1.23       1.20
Water-nsq    1.06    1.01       1.01
Water-sp     1.09    1.01       1.00

Average      1.25    1.10       1.08

Table C.3.: Efficiency overheads for effects of in-line checks (ILC), coherence protocol processing, and the memory and process distribution across the
nodes for 8-processor runs on a 2-node DSZOOM system.
Normalized execution time breakdowns for 8- and 16-processor runs for a 2-node DSZOOM-WF with a coherency unit of 128 bytes are shown in Figure C.7.
The execution time is divided into Task, the inserted run-time in-line checks (ILC)
for global loads and stores, the synchronization cost (Barriers and Locks), and
the cost of coherency protocol processing (Store and Load).
Our performance numbers presented so far are based on a constant cache-line-sized coherency unit of 128 bytes for all of the tested configurations. Choosing a
different coherence granularity can potentially improve the performance for many
applications (for example, see Shasta [86]). Table C.4 reports our experiments
for several of the SPLASH-2 applications that have demonstrated performance
improvements for different coherency unit sizes on a 2-node DSZOOM-WF
system with 8 processors per node. For example, if FFT is executed with a
cache-line-sized coherency unit of 2048 bytes, its overall performance is improved
by 20.1% compared to the values presented in Figure C.6(b).
[Figure C.6: Execution times in seconds for (a) 8- and (b) 16-processor runs for Sun Enterprise E6000, 2-node CC-NUMA, single-node DSZOOM-WF, 2-node DSZOOM-EMU, and 2-node DSZOOM-WF. The underlying data:

(a) 8 processors
                   FFT    LU-c   LU-nc  Radix  Barnes  FMM    Ocean-c  Ocean-nc  Radiosity  Raytrace  Water-nsq  Water-sp
E6000 8 CPUs       2.23    9.69  11.59   3.99   5.02   15.08   5.81     2.27      3.34       1.33      11.93      3.80
CC-NUMA 2x4        2.63   10.02  12.74   4.26   5.19   15.43   6.07     2.66      3.53       1.48      12.11      3.83
DSZOOM-WF 1x8      2.87   15.34  18.52   4.60   5.77   15.49   7.29     3.55      3.64       1.59      12.59      4.16
DSZOOM-EMU 2x4     3.65   15.41  18.73   5.35   5.97   16.94   7.90     3.88      4.42       1.95      12.76      4.19
DSZOOM-WF 2x4      4.17   15.79  18.81   5.91   6.33   18.29   8.57     3.98      5.20       2.34      12.95      4.21

(b) 16 processors
                   FFT    LU-c   LU-nc  Radix  Barnes  FMM    Ocean-c  Ocean-nc  Radiosity  Raytrace  Water-nsq  Water-sp
E6000 16 CPUs      1.42    5.52   7.16   2.31   2.72    8.40   3.38     1.34      1.93       0.79       6.15      2.78
CC-NUMA 2x8        1.63    5.28   6.79   2.52   2.85    8.64   3.59     1.62      2.31       1.05       6.21      2.80
DSZOOM-WF 1x16     1.53    8.29  10.31   2.47   2.86    8.60   3.84     2.04      2.04       0.87       6.48      3.04
DSZOOM-EMU 2x8     1.94    8.49  10.39   2.91   3.27    9.97   5.03     2.96      2.69       1.31       6.61      3.07
DSZOOM-WF 2x8      2.45    8.32  10.17   3.38   3.46   10.00   4.66     2.37      3.34       2.11       6.73      3.09]
[Figure C.7: Normalized execution time breakdowns for (a) 8- and (b) 16-processor runs for a 2-node DSZOOM-WF with a cache-line-sized coherency unit of 128 bytes. Each bar is divided into Task, ILC, Barriers, Locks, Load, and Store components.]
Program    Unit Size [bytes]    Time [s]
FFT        2 048                2.04 (+20.1%)
LU-c       1 024                8.22 (+1.2%)
Barnes     64                   3.42 (+1.2%)
FMM        64                   9.99 (+0.1%)
Ocean-c    256                  4.35 (+7.1%)

Table C.4.: Effects of the coherency unit variations for a 2-node DSZOOM-WF
with 8-processor nodes.
The speedup values for 16-processor runs for Sun Enterprise E6000, 2-node CC-NUMA, and 2-node DSZOOM-WF with “optimal” coherency units are shown in
Figure C.8. The speedups shown are the ratio of the execution time of the
application running on 16 processors to the execution time of the original sequential
application (with no access control checks). We can see that our all-software
solution is close to what the hardware CC-NUMA architecture can do on the same
platform. On average, our implementation demonstrates a relative difference in
SPLASH-2 speedups of 31.6% compared to the hardware DSM implementation.

[Figure C.8: Application speedups for Sun Enterprise E6000, 2-node CC-NUMA, and 2-node DSZOOM-WF.]
C.5. Related Work
Many different SW-DSM implementations have been proposed over the years:
Blizzard-S [90], Brazos [98], Cashmere-2L [100, 27], CRL [48], GeNIMA [11],
Ivy [61, 63], MGS [114], Munin [18], Shasta [86, 85, 83, 84, 27], Sirocco-S [88],
98
C.6. Conclusions
SoftFLASH [29], and TreadMarks [52]. Most of them suffer from asynchronous,
interrupt-based protocol processing. We believe that many of these implementations
would benefit from a more efficient protocol implementation, such as the one described here.
DSZOOM-WF’s basic approach is derived from several fine-grain SW-DSM systems: Shasta, Blizzard-S, and Sirocco-S. Our “magic”-value technique
for fine-grain access control checks presented in Section C.3 is similar to Shasta’s
“flag”-value and Blizzard’s “sentinel”-value optimizations. This technique was
independently introduced in Shasta [86] and Blizzard-S [90] for use with all types
of loads. There are several other systems that use compiler-generated checks
to implement a global address space (for example, Olden [16], Split-C [24], and
Midway [10]).
Regarding simple architectural support [45], the GeNIMA proposal is the closest to our work [11, 33]. GeNIMA proposes a protocol and a general network
interface mechanism to avoid some of the asynchronous overhead. A processor
starting a synchronous communication event, e.g., the requesting processor initiating some coherence actions, checks for incoming messages at the same time.
This avoids some of the asynchronous overhead in the home node, but will also
add some extra delay while waiting for a synchronous event to happen in the
node. The protocol is still implemented as communicating protocol agents.
Several other papers have suggested hardware support for fine-grain remote
write operations in the network interface [54, 53]. One of the recent implementations is the automatic update release consistency (AURC) home-based protocol
[44]. This implementation is a page-based SW-DSM which eliminates “diffs”—
the compact encoded representation of the differences between the two pages,
frequently used in many page-based SW-DSM systems—by using fine-grain remote writes for both the application data and the protocol meta-data. The AURC
approach usually performs better than all-software home-based LRC implementations.
C.6. Conclusions
In this paper we have presented the DSZOOM-WF system, an all-software (sequentially consistent) fine-grain SW-DSM implementation. We have demonstrated how asynchronous protocol processing can be completely avoided at the
cost of some extra remote transactions—trading bandwidth for efficiency. We
believe that the total round-trip SW-DSM latency can be kept below three microseconds once the raw latency of a modern interconnect has been added.
The protocol described in this paper is applicable to the emerging InfiniBand
I/O interconnect standard. We believe that a protocol such as the one we describe could speed up many of the existing SW-DSM implementations on such
interconnects.
DSZOOM-WF consistently demonstrates performance comparable to hardware-based DSM implementations. On average, the speedup difference between our
implementation and the hardware CC-NUMA system is 31.6% for the studied
SPLASH-2 applications.
C.7. Future Work
We plan to extend this work in several different directions. First, cache-coherence
protocol code optimizations will improve performance of the DSZOOM-WF system. Because EEL has problems with hand-written in-line assembly in combination with high optimization levels during the compilation (our protocol routines
written in C, and the synchronization part of our run-time system that is also
written in C, use quite a lot of in-line gcc assembly constructs), we do not use any
optimizations during the compiling phase of the coherence protocol routines and
the run-time system.
Second, in order to improve the performance of the DSZOOM-WF system,
weaker memory models, such as lazy release consistency (LRC) [51] and the
release consistency model presented by Gharachorloo et al. [31, 86], can be used
instead of the sequential consistency model that is currently implemented. This
kind of optimization will allow many update actions to be deferred and combined
into a single operation.
Third, we plan to experiment with several inter-node lock synchronization algorithms (e.g., ticket-based locks). The test-and-set locks that we are currently
using work well for small-scale SMP nodes, but they are not adequate for large-scale CC-NUMA nodes. Usually, test-and-set locks lead to poor caching performance and increased inter-node communication in many CC-NUMA systems.
We believe that we can speed up many lock-intensive applications with improved
synchronization algorithms.
Finally, to make this kind of system more usable, it is desirable to make a
POSIX-threads implementation because most of the commercial workloads are
implemented with that programming model rather than PARMACS.
Acknowledgments
We would like to thank Glen Ammons for excellent support and quick EEL updates, Ernest Artiaga for help with a couple of PARMACS applications, and Sverker
Holmgren and Henrik Löf for providing access to the Sun Orange system. We
would also like to thank Lars Albertsson, Erik Berg, Anders Landin, Larry Meadows, Thiemo Voigt, and the anonymous reviewers for comments on earlier drafts
of the paper. This work is supported in part by Sun Microsystems, Inc., and the
Parallel and Scientific Computing Institute (PSCI), Sweden.
Paper D
D. THROOM — Running
POSIX Multithreaded
Binaries on a Cluster
Henrik Löf, Zoran Radović, and Erik Hagersten
Uppsala University, Department of Information Technology
P.O. Box 337, SE-751 05 Uppsala, Sweden
Technical Report 2003-026, Department of Information Technology, Uppsala University, Sweden, April 2003. A shorter version of this paper is published in Proceedings
of the 9th International Euro-Par Conference (Euro-Par 2003), Klagenfurt, Austria,
August 2003.
Abstract
Most software distributed shared-memory systems (SW-DSMs) lack industry-standard interfaces, which limits their applicability to a small set of shared-memory applications. In order to gain general acceptance, SW-DSMs should support the
same look-and-feel of shared memory as hardware DSMs. This paper presents
a runtime system concept that enables unmodified POSIX P1003.1c (Pthreads)
compliant binaries to run transparently on clustered hardware. The key idea is to
extend the single process model of multi-threading to a multi-process model where
threads are distributed to processes executing in remote nodes. The distributed
threads execute in a global shared address space made coherent by a fine-grain
SW-DSM layer. We also present THROOM, a proof-of-concept implementation
that runs unmodified Pthread binaries on a virtual cluster modeled as standard
UNIX processes. THROOM runs on top of the DSZOOM fine-grain SW-DSM
system with limited OS support.
D.1. Introduction
Clusters built from high-volume compute nodes, such as workstations, PCs, and
small symmetric multiprocessors (SMPs), provide powerful platforms for executing large-scale parallel applications. Software distributed shared-memory (SW-DSM) systems can create the illusion of a single shared memory across the entire
cluster using a software run-time layer, attached between the application and the
hardware. In spite of several successful implementation efforts [27, 52, 77, 84, 100],
SW-DSM systems are still not widely used today. In most cases, this is due to
the relatively poor and unpredictable performance demonstrated by the SW-DSM
implementations. However, some recent SW-DSM systems have shown that this
performance gap can be narrowed by removing the asynchronous protocol overhead [11, 77], and demonstrate a performance overhead of only 30–40 percent in
comparison to hardware DSMs (HW-DSMs) [77]. One obstacle for SW-DSMs is
the fact that they often require special constructs and/or impose special programming restrictions in order to operate properly. Some SW-DSM systems further
alienate themselves from HW-DSMs by relying heavily on very weak memory
models in order to hide some of the false sharing created by their page-based
coherence strategies. This often leads to large performance variations when comparing the performance of the same applications run on HW-DSMs. SW-DSMs
should support the same look-and-feel of shared memory as the HW-DSMs. This
includes support for POSIX threads running on some standard memory model
and a performance footprint similar to that of HW-DSMs, i.e., the performance
gap should remain approximately the same for most applications. The ultimate
goal is that binaries that run on HW-DSMs will also run on SW-DSMs without
manual modifications.
In this paper we present a new runtime system concept that allows POSIX
threads (Pthreads) [43] applications to run on a non-coherent clustered architecture. Threads are distributed from their original process to other processes
running on the same or other nodes of the cluster. By letting the distributed
Pthreads access the original process’ software context in a coherent way, we can
create the illusion of a shared-memory multiprocessor. We show how to create a
global shared address space and how to distribute Pthreads using a fine-grained
SW-DSM and binary instrumentation techniques. We also show that this can be
made transparent using library pre-loading functionality present in most UNIX
dynamic linkers and program loaders.
D.2. DSZOOM — a Fine-Grained SW-DSM
Our initial implementation is based on the DSZOOM SW-DSM [77]. Each
DSZOOM node can either be a uniprocessor, an SMP, or a CC-NUMA cluster.
The node’s hardware keeps coherence among its caches and its memory. The
different cluster nodes run different kernel instances and do not share memory
with each other in a hardware-coherent way. DSZOOM assumes a cluster interconnect with an inexpensive user-level mechanism to access memory in other
nodes, similar to the remote put/get semantics found in the cluster version of the
Scalable Coherent Interface (SCI), or the emerging InfiniBand interconnect proposal that supports RDMA READ/WRITE as well as the atomic operations CmpSwap
and FetchAdd [46]. Another example is the Sun Fire (TM) Link interconnect
hardware [97] that supports kernel bypass messaging via remote shared-memory
(RSM) interface, whereby shared-memory regions on one machine can be mapped
into the address space of another.
While traditional page-based SW-DSMs rely on TLB traps to detect coherence
“violations,” fine-grained SW-DSMs like Shasta [86], Blizzard-S [90], Sirocco-S
[88], and DSZOOM [77] have to insert software coherence checks with executable
editing. In DSZOOM, this is originally done by replacing each load and store
that may reference shared data of the binary with a code snippet (short sequence
of machine code). The binary instrumentation technique adds extra latency for
each load or store operation to global data, independently if that data is locally
available or not. In DSZOOM, the largest source of overhead comes from the
in-line checks (ILC) for global loads and stores [77].
D.3. THROOM Overview
Most SW-DSM systems keep coherence inside a specified segment of the virtual
address space. We call this segment global memory (G_MEM). In most SW-DSM implementations, the different nodes of the cluster all run some daemon
or host process to maintain the G_MEM mappings and to deal with requests
for coherency actions. In this paper, we use the term user node to refer to the
cluster node in which the user executes the binary (the user process). All other
nodes are called remote nodes and their daemon processes will be called shadow
processes. This setup creates a split-execution system [105].
Transparency is achieved by using library interposition [105], which allows us
to change the default behavior of a shared library call without recompiling the
binary. Many operating systems implement the core system libraries such as
libc, libpthread, and libm as shared libraries. Using interpositioning, we can
catch a call to any shared library and redirect it to our own implementations.
The original arguments can be altered, and pre- or postprocessing of the output can be
applied. See Figure D.1 for an example.
D.3.1. Distributing Threads
Threads are distributed by catching the pthread_create() call and copying the
arguments to some shared scratch area of the address space. The call is then
executed in a shadow process using the copied arguments. When the call returns
in the shadow process, the output is written to the scratch area. The user process
then reads the shadow output from the scratch area in the user node and returns.
The new distributed thread will start to execute in the shadow process, using
arguments pointing to the context of its user process.
pthread_t pthread_self(void)
{
    static pthread_t (*func)();

    if( I_am_master ) {
        if(!func)
            func = (pthread_t(*)())dlsym(RTLD_NEXT, "pthread_self");
        return(func());
    }
    else {
        if(!func)
            func = (pthread_t(*)())dlsym(RTLD_NEXT, "pthread_self");
        return(REMOVE_NODE_ID(func(), _myid));
    }
}
Figure D.1.: Interposing agent example.
A minimal requirement
for the thread to execute correctly in a shadow process is that it must share the
address space of the user process.
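A minimal sketch of such an interposed pthread_create() is given below. The create_request layout, the run_in_shadow() and next_thread_goes_remote() helpers, and the error handling are hypothetical simplifications of what the THROOM runtime actually does.

    #define _GNU_SOURCE            /* for RTLD_NEXT on some systems */
    #include <dlfcn.h>
    #include <pthread.h>
    #include <string.h>

    /* Hypothetical request record in the shared scratch area, visible to
     * both the user process and the shadow process.                      */
    struct create_request {
        void *(*start_routine)(void *);
        void *arg;
        pthread_attr_t attr;
        int has_attr;
        pthread_t result;          /* filled in by the shadow process     */
        int status;                /* return value of the real call       */
    };

    extern struct create_request *scratch;              /* assumed to exist */
    extern int run_in_shadow(struct create_request *);  /* assumed helper   */
    extern int next_thread_goes_remote(void);           /* placement policy */

    int pthread_create(pthread_t *tid, const pthread_attr_t *attr,
                       void *(*start)(void *), void *arg)
    {
        static int (*real_create)(pthread_t *, const pthread_attr_t *,
                                  void *(*)(void *), void *);
        if (!real_create)
            real_create = (int (*)(pthread_t *, const pthread_attr_t *,
                                   void *(*)(void *), void *))
                              dlsym(RTLD_NEXT, "pthread_create");

        if (!next_thread_goes_remote())      /* keep the thread local      */
            return real_create(tid, attr, start, arg);

        /* Copy the arguments to the shared scratch area ...               */
        scratch->start_routine = start;
        scratch->arg = arg;
        scratch->has_attr = (attr != NULL);
        if (attr)
            memcpy(&scratch->attr, attr, sizeof(*attr));

        /* ... let the shadow process perform the real call ...            */
        int err = run_in_shadow(scratch);

        /* ... and hand the (node-tagged) thread ID back to the caller.    */
        *tid = scratch->result;
        return err ? err : scratch->status;
    }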
D.3.2. Creating a Global Shared Address Space
Code and global data are made accessible in a coherent way from all cluster nodes
by copying the memory containing the code and global data of the application's virtual
address space to the G_MEM and then diverting accesses to the copy. This will
make the application execute entirely in the global shared-memory segment. To
enable a multi-threaded application to run coherently, all global data referenced
by threads have to reside in G_MEM. If we copy the .text, .data, and .bss
segments to the G_MEM and modify the code to refer to these copies, we can
move a thread to a shadow process and still access the global data.
The access diversion can be made transparent in several different ways. If
the code is compiled as Position Independent Code (PIC), the Global Offset
Table (GOT) can be modified so that the copy is referred to [110]. This is often
not applicable, since most binaries are not compiled as PIC. The structure and
placement of the GOT might also be hard to find at runtime. Another approach is
to use binary instrumentation to change references to access the G_MEM instead
of their original segments. It is natural to use this approach in THROOM, since
the G_MEM is already made coherent by a fine-grain SW-DSM.
D.3.3. Cluster-Enabled Library Calls
Most application binaries use system calls and calls to shared libraries. If the arguments refer to thread-global data (call-by-reference), the access must be modified to use the G_MEM in order for modifications of the data to be coherent
across the cluster. This can be done in at least two ways:
Instrument the library code This is unfortunately a difficult task, since library
code normally is heavily optimized. This makes it hard for an instrumentation tool to rebuild the structure of the code and to produce a correctly
instrumented binary.
Use library interposition A call that references potentially thread-global variables is caught and the references are modified in the interposing library.
The referenced memory is validated to ensure that new copies of any invalidated data are used.
Instrumenting all library code is, in principle, the best way to cluster-enable library calls. However, our instrumentation tool, EEL [57], was not able to instrument all of the libraries. Instead, we had to use the library interposition method.
It enabled us to make cluster versions of system and/or library calls without recompiling or instrumenting the libraries. It also turned out to be of great use
for debugging and reverse engineering. Using an interposed library, we need two
primitives to make coherent accesses from and to the G_MEM:
coherent_mem_store() copies data from original text and data segments to
G_MEM. Generates the coherence actions needed.
coherent_mem_load() loads data from G_MEM to the original text and data
segments and resolves any invalidated copies.
An obvious disadvantage of this method is that we have to write interposing
agents for many calls. Another disadvantage is the runtime overhead associated
with data copying, especially for I/O operations. A better solution would be to
generate the coherence actions on the original arguments before the call is made
in the application binary [84]. This requires a very sophisticated instrumentation
tool, which is outside the scope of this work. Some calls also need to be totally
rewritten to work on a cluster. For example, the malloc() library call must
allocate its memory in the G_MEM instead of using the standard heap.
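For example, an interposed call that writes through a caller-supplied pointer could be wrapped as sketched below. The (address, size) signature assumed for the two primitives and the exact gettimeofday() prototype are assumptions of this sketch, not the actual THROOM interfaces; a call that instead reads global data would invoke coherent_mem_load() on its arguments before the real call.

    #define _GNU_SOURCE            /* for RTLD_NEXT on some systems */
    #include <dlfcn.h>
    #include <stddef.h>
    #include <sys/time.h>

    /* THROOM primitives described above; the (address, size) signature is
     * assumed here for illustration.                                      */
    extern void coherent_mem_load(void *addr, size_t size);
    extern void coherent_mem_store(void *addr, size_t size);

    /* The uninstrumented libc writes its result through the caller's
     * pointer into the original data segment, so the wrapper pushes the
     * result into G_MEM afterwards to make it coherent across the cluster. */
    int gettimeofday(struct timeval *tp, void *tzp)
    {
        static int (*real_gtod)(struct timeval *, void *);
        if (!real_gtod)
            real_gtod = (int (*)(struct timeval *, void *))
                            dlsym(RTLD_NEXT, "gettimeofday");

        int ret = real_gtod(tp, tzp);
        if (tp != NULL)
            coherent_mem_store(tp, sizeof(*tp));    /* result -> G_MEM */
        return ret;
    }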
D.3.4. THROOM in a Nutshell
To summarize, the following steps need to be taken to transparently
allow an unmodified POSIX binary to run on a cluster.
1. Threads need to be distributed to execute in several different processes on
different OS kernels.
2. The distributed threads should reference shared data through a global
shared-memory segment made coherent by a fine-grain SW-DSM.
3. The initial .text, .data, and .bss segments need to be copied into the
G_MEM and accesses to data in these segments need to be modified to hit
the copy using binary instrumentation.
4. To make the whole system transparent, we implement it as a shared library
to be interposed at program loading.
5. System calls or other shared library calls using pointers to global data must
validate the referenced memory before any loads and stores are made.
6. Some calls such as malloc() and synchronization primitives must be made
THROOM aware.
D.4. Implementation Details
We have implemented the THROOM system on a 2-node Sun WildFire prototype
SMP cluster [40, 42]. The cluster is running a slightly modified version of Solaris
2.6 and the hardware is configured as a standard CC-NUMA architecture. Our
cluster is built from two Sun Enterprise E6000 SMP machines, which we denote
as cabinet 1 and cabinet 2. Processors in the different cabinets have been set up
to access different node-private copies of the G_MEM, and the DSZOOM system
is used to keep these copies coherent.
The runtime system is implemented as a shared library. A user sets the
LD_PRELOAD environment variable to the path of the THROOM runtime library,
and then executes the instrumented binary.
The DSZOOM address space is set up during initialization using the .init
section and standard POSIX shared-memory primitives. Control is then given to
the application. The user process issues a fork(2) call to create a shadow process,
which will inherit its parent’s mappings by the copy-on-write semantics of Solaris.
The two processes are bound to the different cabinets using the WildFire first-touch memory initialization and the pset_bind() call. The home process then
reads its own /proc file system to locate the .text, .data, and .bss segments
and copies them to the G_MEM.
The shadow process waits on a process-shared POSIX condition variable
to create remote threads for execution in the G_MEM. Parameters are passed
through a shared-memory mapping separated from the G_MEM. Since the remote thread is created in another process, thread IDs can no longer be guaranteed
to be unique. To fix this, the remote node ID is copied into the most significant
eight bits of the thread type, which in the Solaris 2.6 implementation is an unsigned integer. Similar techniques are used for other Pthread calls.
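This tagging can be captured by a pair of macros like the ones sketched below; only REMOVE_NODE_ID appears in the paper (Figure D.1), so the exact definitions, the shift amount, and ADD_NODE_ID are guesses made for illustration.

    #include <pthread.h>

    #define NODE_ID_SHIFT 24       /* node ID lives in the top eight bits */

    /* Tag a thread ID created in a remote process with its node ID.      */
    #define ADD_NODE_ID(tid, node) \
        ((pthread_t)((unsigned)(tid) | ((unsigned)(node) << NODE_ID_SHIFT)))

    /* Strip the tag again, as done by the interposed pthread_self().     */
    #define REMOVE_NODE_ID(tid, node) \
        ((pthread_t)((unsigned)(tid) & ~((unsigned)(node) << NODE_ID_SHIFT)))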
Since the Sun WildFire is a single-system-image cluster, we can simply use
POSIX process-shared synchronization primitives. Synchronization primitives
called from within the application will not work, since we have not been able to
instrument the Solaris Pthread library using EEL. Instead, we allocate process
shared synchronization variables in a separate shared-memory area before the
application starts. When an application initializes a synchronization variable, we
catch the call and redirect the variable (by switching addresses) to one of our
pre-prepared variables. This is done by storing the address of the interposed lock
in a field of the pthread_mutex_t data structure.
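The redirection could look roughly like the sketch below. The pool helpers and the side-table lookup are hypothetical; as noted above, the actual implementation stores the address of the interposed lock in a field of pthread_mutex_t itself instead of keeping a separate table.

    #define _GNU_SOURCE            /* for RTLD_NEXT on some systems */
    #include <dlfcn.h>
    #include <pthread.h>

    /* Pool of process-shared mutexes, pre-initialized in a shared-memory
     * area before the application starts (allocation details omitted).   */
    extern pthread_mutex_t *alloc_shared_mutex(void);
    extern pthread_mutex_t *lookup_shared_mutex(pthread_mutex_t *app_mutex);
    extern void remember_mapping(pthread_mutex_t *app_mutex,
                                 pthread_mutex_t *shared_mutex);

    /* Interposed init: bind the application's mutex to a pre-prepared
     * process-shared mutex instead of initializing it in place.          */
    int pthread_mutex_init(pthread_mutex_t *mutex,
                           const pthread_mutexattr_t *attr)
    {
        (void)attr;                /* attributes ignored in this sketch   */
        remember_mapping(mutex, alloc_shared_mutex());
        return 0;
    }

    /* Interposed lock: operate on the redirected, process-shared mutex.  */
    int pthread_mutex_lock(pthread_mutex_t *mutex)
    {
        static int (*real_lock)(pthread_mutex_t *);
        if (!real_lock)
            real_lock = (int (*)(pthread_mutex_t *))
                            dlsym(RTLD_NEXT, "pthread_mutex_lock");
        return real_lock(lookup_shared_mutex(mutex));
    }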
D.4.1. Binary Instrumentation
The binary instrumentation process must be compliant with all Sparc ABI specifications, and especially with global register usage. The DSZOOM engine requires
two free global registers, at the insertion point during the instrumentation phase,
to pass parameters to the coherence routines in an efficient way from in-line code
snippets. On Sparc V8 (32-bit) and Sparc V8plus (64-bit) there are three global
thread-private registers that are saved/restored during the thread-switching by
the Solaris system libraries: %g2, %g3, and %g4; all other global registers are
thread-global and are not saved during the switch. The thread-private registers
are also called application registers. On Sparc V9 (64-bit), on the other hand,
only %g2 and %g3 are application registers, and the %g4 register is free for general use and is volatile across function calls together with %g1 and %g5. On all
targets, registers %g6 and %g7 are reserved for system software and are not used
during the binary modification process (in fact, Solaris’ Pthread library uses one
of those system registers itself).
Currently, DSZOOM also needs a fast mechanism to look up the node id for
every running thread, which led us to reserve the global register %g2 for that
task (this restriction can easily be removed if we use another convention for
MTAG lookups that is not dependent on the node id; see [77] for more details
about MTAGs). This is also why our target architecture for this proof-of-concept
implementation is Sparc V8 and/or V8plus with three thread-private application
registers. For simplicity reasons, we only instrument binaries without
application-specific registers. (Sun’s compilers can be instructed to reserve
application registers by using the -xarch=no%appl compiler flag; on GNU’s gcc,
this is done with the -mno-app-regs flag.) We have noticed that our test binaries
are only about 1% slower if application registers are not used by the compilers.

The THROOM runtime system only imposes minor modifications to the original
DSZOOM load and store snippets to divert the static data and access the G_MEM
area. At most four additional machine instructions were enough, compared to the
original snippets, to perform that task. The original DSZOOM floating-point load
snippet is shown in Figure D.2. The instrumented instruction in this example is ld [%o5], %f7.
1:      add     %o5, %g0, %ADDR_REG
2:      ld      [%ADDR_REG], %f7
3:      fcmps   %fcc1, %f7, %f7
4:      fbe,pt  %fcc1, hit
        ... call coherence routine ...
hit:

Figure D.2.: Floating-point load snippet example for the original DSZOOM system.
1:      add     %o5, %g0, %ADDR_REG
2:      srl     %ADDR_REG, 31, %g3
3:      brnz,pn %g3, L6
4:      sethi   %hi(0x80000000), %g4
5:      add     %ADDR_REG, %g4, %ADDR_REG
L6:     ld      [%ADDR_REG], %f7
7:      fcmps   %fcc1, %f7, %f7
8:      fbe,pt  %fcc1, hit
        ... call coherence routine ...
hit:

Figure D.3.: Floating-point load snippet example with support for the THROOM
runtime system.
The efficient access control check [88] is performed on
lines 3 and 4. In this example, the range check is performed inside the coherence
routine to minimize the code expansion with this in-line snippet.
If we are unable to determine the effective address of the instrumented load
during the binary modification phase, we should use the worst-case THROOM
snippet shown in Figure D.3. The THROOM-related machine instructions are
shown in lines 2 to 5. Lines 2 and 3 perform a simple range check to handle static
data accesses. If this range check evaluates as true, i.e., the effective address of
this load is below the G_MEM starting point, the same load will be performed at
%ADDR_REG + 0x80000000. If this particular load can be classified as
static by a more elaborate analysis of the binary, lines 2 and 3 can be eliminated
and the G_MEM offset can be added directly. This optimization is currently not
implemented in THROOM.
Program     Problem size, Iterations        Replaced Loads [%]   Replaced Stores [%]
FFT         1 048 576 points (48.1 MB)      44.6 (19.0)          32.8 (16.5)
LU-c        1024×1024, block 16 (8.0 MB)    48.3 (15.5)          23.0 ( 9.4)
LU-nc       1024×1024, block 16 (8.0 MB)    49.2 (16.7)          27.7 (11.1)
Radix       4 194 304 items (36.5 MB)       54.4 (15.6)          31.4 (11.6)
Barnes      16 384 bodies (8.1 MB)          56.6 (23.8)          55.4 (31.1)
Ocean-c     514×514 (57.5 MB)               50.6 (27.0)          31.2 (23.9)
Ocean-nc    258×258 (22.9 MB)               51.0 (11.6)          39.0 (28.0)
Radiosity   room (29.4 MB)                  41.1 (26.3)          35.1 (27.1)
Water-nsq   2197 mol., 2 steps (2.0 MB)     50.4 (13.4)          38.0 (16.2)
Water-sq    2197 mol., 2 steps (1.5 MB)     48.5 (15.7)          32.5 (13.9)

Table D.1.: Problem sizes and replacement ratios for the ten SPLASH-2 applications studied. Instrumented loads and stores are shown as a
percentage of the total amount of load or store instructions. The
number in parentheses shows the replacement ratio for the DSZOOM
SW-DSM without THROOM.
D.4.2. Modified System and Library Calls
The malloc() call is altered to allocate memory in G_MEM. In order to run our
benchmark suite, a number of system and library calls had to be modified (for
example, getopt, gettimeofday, gets, fgetc, scanf, fscanf, sscanf, etc.).
The calls were identified by examining the undefined symbols of the binaries and
checking, using the man-pages, whether a specific call could reference static data
or not.
D.5. Performance Study
Ten SPLASH-2 applications [111] were compiled using the GCC v2.95.2 compiler
without optimization (-O0); the code is compiled without optimization to eliminate any delay slots, which EEL cannot handle correctly. A standard Pthread PARMACS macro implementation (c.m4.pthreads.condvar_barrier) was employed. To exclude the initialization time for the runtime system,
timings are started at the beginning of the parallel phase. All timings have been
performed on the 2-node Sun WildFire [40, 42] configured as a traditional CCNUMA architecture. Each node has 16 UltraSPARC II processors running at 250
MHz. The access time to node-local memory is about 330 ns (lmbench latency
[67]). Remote memory is accessed in about 1800 ns (lmbench latency).
[Figure D.4: Runtime performance of the THROOM runtime system. A total of 8 processors were used.]

In Table D.1, we see that more instructions are replaced in the case of THROOM
since all references to static data have to be instrumented. This large difference in
replacement ratio compared to DSZOOM is explained by the fact that DSZOOM
can exploit the PARMACS programming model and use program slicing to remove accesses to static data that are not shared.
Figures D.4 and D.5 show execution times in seconds for 8- and 16-processor
runs for the following THROOM configurations:
THROOM_HO All threads are created in the home node. No threads are
scheduled in the shadow process, which means that no remote accesses are
generated.
THROOM_RO All threads are created in the remote node (the shadow process) except for the master thread which runs in the home node.
THROOM_RR The threads are scheduled over the two nodes in a round-robin
fashion.
DSZOOM Used as reference. Aggressive slicing and snippet optimizations. Optimized for a two-node fork-exec native PARMACS environment, see [77].
D.6. Discussion, Conclusions, and Future Work
A study of Figures D.4 and D.5 reveals that the present implementation is slower
than a state-of-the-art SW-DSM such as DSZOOM.

[Figure D.5: Runtime performance of the THROOM runtime system. A total of 16 processors were used.]

In some cases, THROOM_HO
is faster than DSZOOM, since in this case no remote accesses are generated as
all activity is contained within a single node. The average runtime overhead
compared to DSZOOM for THROOM_RR is 65% on 8 processors and 78%
on 16 processors. In order to put these numbers into the context of total SW
overhead, it should be noted that DSZOOM’s overhead is 32% compared to a
hardware-coherent implementation [77]. The most significant contribution to the
high overhead when comparing DSZOOM to THROOM is the increased number
of instrumentations needed to support the POSIX thread model. Another source
of overhead is the rather inefficient implementation of synchronization primitives
of the present THROOM implementation, see Section D.4.
In conclusion, we have shown that it is possible to extend a single-process
address space to a multi-process model using fine-grain instrumentation techniques. Even though the current THROOM implementation relies on some of the
WildFire’s single system image properties, we are convinced that the THROOM
concept can be extended to a pure cluster model. On a true cluster, additional
issues need to be addressed. The shadow processes need to be set up without
relying on a single system image cluster, possibly using a standard MPI runtime
system. Synchronization needs to be handled more efficiently (see [78, 80]), and
we need to create more complete and more efficient support for I/O and other
library calls.
An intermediate step could be to layer THROOM below a standard OpenMP
runtime system [26]. Most OpenMP implementations use subroutine outlining to
interface to the underlying OS, which means that PRIVATE variables are put on
the stack. (Currently, THROOM views the stack as thread-private, which is not in compliance with POSIX. We could also instrument stack accesses, but this would probably generate a large amount of overhead.) Using OpenMP, we have more information about critical sections and shared data, which can be exploited to minimize the overhead associated with the increased instrumentation of static data.
THROOM will probably also benefit from many of the optimizations proposed to make SW-DSMs more efficient, such as better instrumentation tools, coherence protocol optimizations, and new, faster hardware.
D.7. Related Work
To our knowledge, no SW-DSM system has yet been built that enables transparent and efficient execution of an unmodified POSIX binary. The Shasta system [84] comes closest to our work; it has shown that it is possible to run an Oracle database system on a cluster using a fine-grain SW-DSM technique. Shasta solves the OS functionality issues in a similar way to THROOM, although it supports a larger set of system calls and process distribution. THROOM differs from Shasta in that it relies on the DSZOOM SW-DSM, which does not suffer from asynchronous protocol processing. THROOM also supports thread distribution and a thread-enabled address space. Shasta motivates the lack of multi-threading support by claiming that the overhead associated with access checks leads to lower performance [84].
Another recently announced system is CableS [47], built on the GeNIMA page-based SW-DSM [11]. This system supports a large set of system calls, but it does not achieve binary transparency: some source code modifications must be made and the code must be recompiled for the system to operate. Another work related to THROOM is the OpenMP interface [87] to the TreadMarks page-based SW-DSM [52], where a compiler front-end translates the OpenMP pragmas into TreadMarks fork-join style primitives. The DSM-Threads system [69] provides a page-based SW-DSM interface similar to the Pthreads standard, but without binary transparency.
Library interposition techniques and a split-execution system are covered by the Multiple Bypass system [105]. Some work by Welsh et al. on how to create a global address space can be found on the web [110].
Paper E
E. Latency-hiding and Optimizations of the DSZOOM Instrumentation System
Oskar Grenholm, Zoran Radović, and Erik Hagersten
Uppsala University, Department of Information Technology
P.O. Box 337, SE-751 05 Uppsala, Sweden
Technical Report 2003-029, Department of Information Technology, Uppsala University, Sweden, May 2003.
Abstract
An efficient and robust instrumentation tool (or compiler support) is necessary for an efficient implementation of fine-grain software-based shared-memory systems (SW-DSMs). The DSZOOM system, developed by the Uppsala Architecture Research Team (UART) at Uppsala University, is a sequentially consistent fine-grained SW-DSM originally developed using the Executable Editing Library (EEL), a binary modification tool from the University of Wisconsin-Madison. In this paper, we identify several weaknesses of this original approach and present a new and simple tool for assembler instrumentation: the SPARC Assembler Instrumentation Tool (SAIT). This tool can instrument (modify) highly optimized assembler output from the compiler for the newest UltraSPARC processors. Currently, the focus of the tool is load-, store-, and load-store instrumentation.
By using the SAIT, we develop and present several low-level instrumentation optimization techniques that significantly improve the performance of the original DSZOOM system. One of the presented techniques is a write permission cache (WPC), a latency-hiding mechanism for memory-store operations that can lower the instrumentation overheads for some applications (by as much as 45% for LU-cont, running on two nodes with eight processors each).
Finally, we demonstrate that this new DSZOOM system executes faster than the old one for all 13 applications studied from the SPLASH-2 benchmark suite. Execution time improvement factors range from 1.07 to 2.82 (average 1.73).
E.1. Introduction
The DSZOOM-WF implementation [77] is a sequentially consistent [56], fine-grained, distributed software-based shared-memory system, implemented on top of the 2-node Sun WildFire [40, 42] prototype without relying on its hardware-based coherence capabilities. All loads and stores are instead performed to the node's local "private" memory. An unmodified version of the executable editing library (EEL) [57] is used to insert fine-grain access control checks before shared-memory loads and stores in a fully compiled and linked executable. Global coherence is resolved by coherence protocols (triggered by the inserted access control checks) implemented in C, which copy data to the node's local memory using the UltraSPARC processor's Block Load/Store operations to the remote memory (which is locally mapped in every node). The all-software protocol is implemented assuming some basic low-level primitives in the cluster interconnect and an operating system bypass functionality, similar to the emerging InfiniBand standard [46]. All interrupt- and/or poll-based asynchronous protocol processing, usually found in almost all traditional SW-DSM proposals so far [52, 63, 86, 85, 88, 98, 100], is completely removed by running the entire coherence protocol in the requesting processor.
Program instrumentation is a general term for techniques used to modify existing programs, typically to collect data during program execution (tracing, cache simulation, profile information, etc.). During program execution, the instrumentation code is executed together with the original program code. Code instrumentation can typically be done at several different levels: hardware, library interposing, source code, assembler code, or machine level (i.e., binary instrumentation). Currently, the DSZOOM system uses binary instrumentation [57], a technique that was also used in several similar projects in the past (e.g., Blizzard-S [90], Shasta [86, 85, 84], Sirocco-S [88]).
As mentioned above, the DSZOOM’s instrumentation system is developed with
EEL, which is unfortunately not maintained anymore. The current state of the
EEL library makes it unusable for binaries that are compiled with high optimization levels and new compilers on modern operating systems. EEL is also
unable to instrument all types of instructions that are placed in SPARC’s delay
slots. This is the main motivation for this work. To avoid those limitations we
instead instrument the assembler output from the compiler, and insert the snippets (small fragments of assembler/machine code) needed. The compiler finishes
its job of making it all into an executable. By doing the actual instrumentation
at the assembler level, we can easily analyze and re-arrange the code in a way
that eliminates loads/stores in delay slots. The problem of inserting code snippets is now reduced to inserting correct assembler code, as text, into a text file
118
E.2. Target Architecture/Compiler Overview
containing the assembler output of the program.
In this paper, we present a new instrumentation tool, the SPARC Assembler Instrumentation Tool (SAIT), a simple and efficient technique for instrumenting the assembler output from the compiler. The SAIT can instrument highly optimized assembler output from Sun's latest compilers for the newest UltraSPARC processors. Currently, the focus of the tool is load-, store-, and load-store instrumentation, because these are the most important for the DSZOOM system. (The tool can easily be extended to support other types of instrumentation as well.) The major limitation of this approach is that the source code must be available, which is not always the case, especially not for the system libraries. On the other hand, instrumenting libraries with EEL or some other binary instrumentation tool is in practice a very difficult (on some architectures even impossible) task, since the code in libraries is usually heavily optimized and can contain "data in code" and "code in data" segments that are very difficult to resolve without some additional help from the compilers.
We present several low-level instrumentation optimization techniques that significantly improve the performance of the original DSZOOM system. All new instrumentation techniques are developed with SAIT. One of the presented techniques is a write permission cache (WPC), a latency-hiding mechanism for memory-store operations. By using the WPC, the instrumentation overheads for some applications can be lowered (by as much as 45% for LU-cont, running on two nodes with eight processors each). We also demonstrate that this new DSZOOM system executes faster than the old one for all 13 applications studied from the SPLASH-2 benchmark suite. Execution time improvement factors range from 1.07 to 2.82 (average 1.73).
The rest of this paper is organized as follows. In section E.2, we give a short target architecture/compiler overview. The SAIT is introduced in section E.3. Several low-level instrumentation optimization techniques, including the WPC, are described in section E.4. In section E.5, we present the performance study, and finally, we conclude in section E.6.
E.2. Target Architecture/Compiler Overview
E.2.1. Original Proof-of-Concept Platform
The system used to host the original proof-of-concept DSZOOM was a 2-node
Sun WildFire [40, 72, 42], with Sun Enterprise E6000 SMP processing nodes
[96], with 16 UltraSPARC II (250 MHz) processors per node, running a slightly
modified version of the Solaris 2.6 operating system. The compiler used to compile
both EEL and the SPLASH-2 benchmark programs was GNU’s gcc-2.8.1. The
benchmark programs were compiled without any optimization because EEL could
not always instrument the binaries produced otherwise. The proof-of-concept
implementation was tested thoroughly on this platform. For further information
on this original implementation using EEL, see [77].
E.2.2. SPARC V8 and V9 ABI Restrictions
The instrumentation process must be compliant with all SPARC ABI specifications, and especially with global register usage [101]. Currently, the DSZOOM
engine requires two free global registers, at the insertion point during the instrumentation phase, to pass parameters to the coherence routines in an efficient
way from in-line code snippets. On SPARC V8 (32-bit) and SPARC V8plus (64-bit) there are three global thread-private registers that are saved/restored during thread switching by the Solaris system libraries: %g2, %g3, and %g4; all other global registers are thread-global and are not saved during the switch. The thread-private registers are also called application registers. On SPARC V9
(64-bit), on the other hand, only %g2 and %g3 are application registers, and the
%g4 register is free for general use and is volatile across function calls together
with %g1 and %g5. On all targets, registers %g6 and %g7 are reserved for system
software and are not used during the code instrumentation process.
As mentioned above, we need two registers to pass arguments. For that purpose, we choose registers %g3 and %g4. On SPARC V8 or V8plus, this choice leaves us one extra register to use: %g2. This register is mainly used to make snippets a bit more efficient, and also for the write permission cache implementation (see section E.4 for details). It should be possible to implement DSZOOM on the SPARC V9 architecture as well, since usually only two application registers are used in snippets.
E.2.3. Target Compiler Details
The compiler used to produce the assembler output and the executables in this
paper is Sun WorkShop 6 update 2 C 5.3 Patch 111679-08.1 A brief overview of several relevant compiler flags, taken from the Forte Developer 6 update 2 manual [102], is given here.

1 Today, the SAIT can also instrument the assembler output from Sun's Forte Developer 7 compilers (C, C++, and Fortran, version 5.4). These compilers are also known as the Sun ONE Studio compiler collection.

-S Directs the compiler to produce an assembly source file but not to assemble the program.

-xregs=r[,r...] Specifies the usage of registers for the generated code. r is a comma-separated list that consists of one or more of the following: [no%]appl, [no%]float. The -xregs values available are:
appl: Allows the use of the following registers: g2, g3, g4 (v8a, v8, v8plus, v8plusa, v8plusb); g2, g3 (v9, v9a, v9b). In the SPARC ABI, these registers are described as application registers. Using these registers can increase performance because fewer load and store instructions are needed. However, such use can conflict with some old library programs written in assembly code.
no%appl: Does not use the appl registers.
-xO[1|2|3|4|5] Optimizes the object code; note the upper-case letter O. The levels
(1, 2, 3, 4, or 5) you can use with -xO are described below.
-xO1 Does basic local optimization (peephole).
-xO2 Does basic local and global optimization. This is induction variable elimination, local and global common subexpression elimination, algebraic simplification,
copy propagation, constant propagation, loop-invariant optimization, register allocation, basic block merging, tail recursion elimination, dead code elimination,
tail call elimination, and complex expression expansion. The -xO2 level does not
assign global, external, or indirect references or definitions to registers. It treats
these references and definitions as if they were declared volatile. In general, the
-xO2 level results in minimum code size.
-xO3 Performs like -xO2, but also optimizes references or definitions for external
variables. Loop unrolling and software pipelining are also performed. This level
does not trace the effects of pointer assignments. When compiling either device
drivers, or programs that modify external variables from within signal handlers,
you may need to use the volatile type qualifier to protect the object from optimization. In general, the -xO3 level results in increased code size.
-xO4 Performs like -xO3, but also automatically inlines functions contained in
the same file; this usually improves execution speed. If you want to control which
functions are inlined, see -xinline=list. This level traces the effects of pointer
assignments, and usually results in increased code size.
-xO5 Attempts to generate the highest level of optimization. Uses optimization algorithms that take more compilation time or that do not have as high a
certainty of improving execution time. Optimization at this level is more likely
to improve performance if it is done with profile feedback. See -xprofile=p.
-fast Selects the optimum combination of compilation options for speed. This should
provide close to the maximum performance for most realistic applications. Modules compiled with fast must also be linked with fast. The fast option is unsuitable
for programs intended to run on a different target than the compilation machine.
In such cases, follow -fast with the appropriate -xtarget option. The fast option
is unsuitable for programs that require strict conformance to the IEEE 754 Standard. The following table lists the set of options selected by -fast on the SPARC
platform.
-dalign, -fns, -fsimple=2, -fsingle, -ftrap=%none, -xarch,
-xbuiltin=%all, -xlibmil, -xtarget=native, -xO5

fast acts like a macro expansion on the command line. Therefore, you can override the optimization level and code generation option aspects by following -fast with the desired optimization level or code generation option. Compiling with the -fast -xO4 pair is like compiling with the -xO2 -xO4 pair. The latter specification takes precedence. You can usually improve performance for most programs with this option. Do not use this option for programs that depend on IEEE standard exception handling; you can get different numerical results, premature program termination, or unexpected SIGFPE signals.

< STORE: ("st" | "stb" | "sth" | "std" | "stx") >
< REG:   "%" ["r","g","i","o","l","f","s","y"]
         (["0"-"9","p","o","c"])* >

Figure E.1.: Two examples of tokens in JavaCC.
E.3. SAIT: SPARC Assembler Instrumentation Tool
The SAIT is first of all created for use with the DSZOOM system. It is a SPARC assembler parser with the ability to insert code snippets at specified locations. Currently, the SAIT instruments integer loads, floating-point loads, and any type of store. Java Compiler Compiler (JavaCC) version 2.1 is used for the design of SAIT's parser [108]. JavaCC is a parser generator mainly for use with Java applications. It works much like the classical tool yacc for the C programming language. With JavaCC it is also possible to write functions and additional code in Java that decide what to do with the parsed code (tree building is also possible via a tool called JJTree, which is included with JavaCC). In this section, we briefly describe the main features/phases of the SAIT. For a more detailed description of the tool, see [38].
E.3.1. Parsing SPARC Assembler
Parsing SPARC assembler is a fairly simple task. An example of how the tokens for stores and registers are written in JavaCC is given in Figure E.1. All tokens are constructed from regular expressions.
The internal data structures of the tool consist of two different classes, implemented in Java, called Basic Block (BB) and Control Flow Graph (CFG). The formal definitions of basic blocks and control flow graphs are given below [21].
Definition 1 A basic block is a sequence of consecutive statements in which flow
of control enters at the beginning and leaves at the end without halt or
possibility of branching except at the end.
Definition 2 A control flow graph G = (N; E; h) for a program P is a connected, directed graph that satisfies the following conditions:
❏ h is the unique entry node to the graph,
❏ ∀n ∈ N; n represents a basic block of P, and
❏ ∀e = (n_i; n_j) ∈ E; e represents flow of control from basic block n_i to basic block n_j, with n_i, n_j ∈ N.
A basic block is just a collection of individual instructions. The set of instructions is always entered at the beginning and exited at the end. This means that a block starts with a label and ends with a branch, jump, or call, or ends with the appearance of a new label. The algorithm for dividing a long sequence of statements into basic blocks is quite simple. The parser just looks for a label, then creates a basic block instance and adds all the following instructions into the structure until a branch or a new label is found; then a new instance of a basic block is created and the old one is closed, and so on. This procedure is done once for every function block. At the end of a basic block, some information is stored about where the next executable basic block is, so that we know how the flow of the program will go. There are several different ways in which a basic block can usually end [21]:
one-way the last instruction in the basic block is an unconditional jump to a
label, hence, the block has one out-edge.
two-way the last instruction is a conditional jump to another label, thus, the
block has two out-edges.
call the last instruction is a call to a procedure. There are two out-edges from
this block: one to the instruction following the procedure call, and the other
to the procedure that is called. During analysis, the called procedure is normally not followed, unless inter-procedural analysis is required.
return the last instruction is a procedure return instruction. There are no out-edges from this basic block.
fall the next instruction is the target address of a branch instruction (i.e., the next instruction has a label). This node is seen as a node that falls through to the next one; thus, there is only one out-edge.
Our model differs slightly from those definitions. In our case, we do not make any distinction between conditional and unconditional branches; both are two-way. In both cases, we just store the label that the branch goes to, and the label that comes directly after the branch. (The instruction in the delay slot belongs to the basic block as well.) Moreover, since we do not try to do inter-procedural analysis, we do not let our basic blocks end with a call; we just treat calls as any other instruction. When the basic block just falls through, the label of the next basic block is stored. Finally, if the basic block ends with a return statement of some kind, this indicates that the flow for this function or program ends here, and information about this is stored instead.
After the creation of all the basic blocks, the flow of control is analyzed and a CFG is set up. A CFG is simply a tree-like list over the different execution paths of the program. One CFG is constructed for each function in the program. Each node in this tree has a basic block and pointers to the next nodes (one or two pointers, depending on whether the exit point is straight-line code or a branch), or a null pointer if it is a terminating point.
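To make the splitting step above concrete, the following Java sketch shows one way a parser could divide a list of assembler lines into basic blocks. The class and method names (BasicBlock, isLabel, isBranch, and so on) are illustrative assumptions and are not taken from the actual SAIT sources.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of splitting assembler lines into basic blocks.
    All names are illustrative; the real SAIT classes may differ. */
class BasicBlockBuilder {

    static class BasicBlock {
        String label;                                  // label that starts the block, if any
        List<String> instructions = new ArrayList<>(); // instructions in program order
        String branchTarget;                           // label the block branches to, if any
        String fallThrough;                            // label of the next block in program order
    }

    static List<BasicBlock> split(List<String> asmLines) {
        List<BasicBlock> blocks = new ArrayList<>();
        BasicBlock current = new BasicBlock();
        for (String line : asmLines) {
            if (isLabel(line)) {                       // a new label closes the current block
                if (current.label != null || !current.instructions.isEmpty()) {
                    current.fallThrough = labelName(line);
                    blocks.add(current);
                    current = new BasicBlock();
                }
                current.label = labelName(line);
            } else {
                current.instructions.add(line);
                // A branch ends the block (in SAIT the instruction in the delay
                // slot would also be appended before the block is closed).
                if (isBranch(line)) {
                    current.branchTarget = lastOperand(line);
                    blocks.add(current);
                    current = new BasicBlock();
                }
            }
        }
        if (current.label != null || !current.instructions.isEmpty()) blocks.add(current);
        return blocks;
    }

    // Very rough token tests; the real parsing is done with JavaCC in SAIT.
    static boolean isLabel(String line)    { return line.trim().endsWith(":"); }
    static boolean isBranch(String line)   { String t = line.trim();
                                             return t.startsWith("b") || t.startsWith("jmp") || t.startsWith("ret"); }
    static String  labelName(String line)  { String t = line.trim(); return t.substring(0, t.length() - 1); }
    static String  lastOperand(String line){ String[] p = line.trim().split("[\\s,]+"); return p[p.length - 1]; }
}
```

Note that, following the model described in the text, calls do not terminate a block in this sketch, and conditional and unconditional branches are treated alike.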
E.3.2. Liveness Analysis
The SAIT needs to analyze the intermediate-representation structure to determine which registers are in use at the insertion point during the instrumentation
phase. A register is live if it holds a value that may be needed in the future, so
this analysis is usually called liveness analysis.
Armed with the information in the CFGs and the basic blocks, we can calculate the liveness of the registers in each basic block. Liveness is the information about which registers hold values to be used later in the program and which can be overwritten without altering the execution of the program. To be able to do this, we need to know exactly which registers each instruction uses and defines (defs). To clarify, an assignment to a register defines that register, and an occurrence of a register on the right-hand side of an assignment (or in other expressions) uses that register. For example, in the SPARC assembly statement add %o1,%o2,%o3, registers %o1 and %o2 are used, and %o3 is defined. Knowing the different execution paths of a function, and which registers each basic block uses and defs, there is a simple algorithm for calculating the liveness at each basic block. This algorithm is given in Figure E.2. The algorithm returns a hash table for each basic block in every CFG that holds the registers that are live at the entry point and at the exit point of the block. Free registers (non-live) are not included in the table.
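As an illustration, the iteration from Figure E.2 could be written as follows in Java; the Node fields and the overall shape are assumptions for this sketch, not the actual SAIT data structures.

```java
import java.util.*;

/** Iterative liveness computation over a CFG, following Figure E.2.
    The Node shape is illustrative, not the actual SAIT class. */
class LivenessAnalysis {

    static class Node {
        Set<String> use = new HashSet<>();   // registers read before any write in this block
        Set<String> def = new HashSet<>();   // registers written in this block
        List<Node> succ = new ArrayList<>(); // successor nodes in the CFG
        Set<String> in  = new HashSet<>();   // live at block entry
        Set<String> out = new HashSet<>();   // live at block exit
    }

    /** Repeats the dataflow equations until a fixed point is reached:
        in[n]  = use[n] ∪ (out[n] − def[n])
        out[n] = union of in[s] over all successors s of n                */
    static void solve(List<Node> nodes) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Node n : nodes) {
                Set<String> out = new HashSet<>();
                for (Node s : n.succ) out.addAll(s.in);

                Set<String> in = new HashSet<>(out);
                in.removeAll(n.def);
                in.addAll(n.use);

                if (!in.equals(n.in) || !out.equals(n.out)) {
                    n.in = in;
                    n.out = out;
                    changed = true;
                }
            }
        }
    }
}
```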
Terminology. A flow-graph node has out-edges that lead to successor nodes, and in-edges that come from predecessor nodes. The set pred[n] is all the predecessors of node n, and succ[n] is the set of successors.

    for each n
        in[n] ← {}; out[n] ← {}
    repeat
        for each n
            in′[n] ← in[n]; out′[n] ← out[n]
            in[n] ← use[n] ∪ (out[n] − def[n])
            out[n] ← ∪_{s ∈ succ[n]} in[s]
    until in′[n] = in[n] and out′[n] = out[n] for all n

Figure E.2.: Computation of liveness by iteration [3].

Although liveness analysis is a great way to find out which registers can be used in the snippets, there are some downsides to this approach. First, at high compiler optimization levels, the register allocation algorithms are usually very good. This means that there are seldom any free registers to find. At lower optimization levels, or with no optimization at all, the task of finding free registers is much easier.
The second problem is that the liveness analysis is only intra-procedural because
it is done on CFGs that only contain information on one function in the program.
To know what limits this imposes on liveness analysis, it is necessary to know
more about SPARC processor’s register-convention.
On SPARC, there are 32 registers, grouped into four different classes: global, local, in, and out (%g0-7, %l0-7, %i0-7, %o0-7). The local registers are intended as scratch registers; the in registers hold a function's incoming parameters, and the out registers are used to pass parameters to called functions and to receive their return values. The global registers are a bit different: they are non-windowed, as opposed to the rest. The difference between windowed and non-windowed is that windowed registers are automatically saved across function calls, i.e., the local registers that have the same name in different functions are actually not the same registers. The way this is handled is that the local and out registers are a completely new set, while the out registers of the previous function become the in registers of the new function, and so on for every function. Non-windowed registers, on the other hand, are the same across functions. This means that, without us knowing it from our liveness analysis, all the registers except the local ones can be used in another function calling the one we are analyzing. Therefore, it is not safe to assume that registers are free just because they seem to be so within our CFG. Most of the time it is fine to make the assumption mentioned above, but one has to remember that it is not always strictly so.
In addition to the set of 32 ordinary registers, there is also a set of floating-point registers, some registers that hold integer and floating-point condition codes, as well as a number of other miscellaneous registers. At the moment, the liveness analysis does not handle the floating-point registers. This is because they are all global and we cannot really know whether they are used in other functions or not. (We do not use these registers in our snippets anyhow.)
The condition code registers are, on the other hand, used in our snippets, and the SAIT must therefore find out whether they are live or not. Since they are used and defined like ordinary registers, with the exception of implicit use and definition by some instructions, they can easily be analyzed as well. All we have to do is take special care with instructions that branch on a condition code register or that, like cmp, set a condition code register. This problem is no longer present in assembler code for SPARC V9, since the condition code register has to be explicitly named.
E.3.3. Handling Delay Slots
The SAIT must correctly handle and instrument instructions that are placed in
the delay slots of control-flow instructions. A delay slot means that the instruction after the jump or branch is executed before the transfer of control takes effect. This is traditionally done to keep the processor pipeline busy. An example of a load in a delay slot is given below.
bne  %reg1, .LABEL
ld   [addr], %reg2    !! delay slot instruction
There are three different kinds of situations that arise with regard to delay slots. They are in principle handled in the same way, but extra care has to be taken in two of the cases. This means that some extra checks have to be performed by the parser and that additional instructions have to be added. The three cases are described in more detail below.
Case #1. The trivial case, without any register-dependencies. If the instruction
after the branch is just an ordinary instruction, the branch is just written
to the basic block as usual. Otherwise, if it is a load or a store that is to be
instrumented, the load/store is written first, then the branch, and finally a
nop (to fill the delay slot) is written to the basic block.
Original code:
    bne  %reg1, .LABEL
    ld   [addr], %reg2

Replaced code:
    ld   [addr], %reg2
    bne  %reg1, .LABEL
    nop
Case #2. Sometimes it is not possible to just lift out the instruction in the delay slot and place it before the branch. If the load in the delay slot is actually loading a new value into the register used to decide whether to branch or not, then this simple strategy would alter the execution path of the program. To avoid this, the content of the register is moved to a temporary, free register (found by the tool's liveness analysis) and this temporary register is instead used in the branch instruction. If there is no free register available, a register is spilled to memory and then used. Afterwards, the original content of the register is read back from memory. An example of how this can look is given below.
Original code:
    bne  %reg, .LABEL
    ld   [addr], %reg

Replaced code:
    mov  %reg, %temp_reg
    ld   [addr], %reg
    bne  %temp_reg, .LABEL
    nop
Case #3. Another tricky thing to handle with regard to loads/stores in delay slots is annulling delay slots. Here, depending on whether the branch is taken or not, the instruction in the delay slot is executed or not. In this case, just moving the load/store is not enough; we also have to do some additional code expansion. First, the original branch is replaced with a branch of the same kind, but with a different destination. The new destination is a new label created by the SAIT. At this label, the load or store from the delay slot is instrumented and executed. If the branch is not taken, we reach another new label, but here the load or store is not executed. At the end of both new blocks, a branch is always taken: in the case where the load or store was executed, it goes to the actual label indicated by the original branch; otherwise, it goes to the label directly following the original branch.
Original code:
    bl,a,pt %icc,.L900000283
    st      %o0,[%g1+%l2]
.L77000552:

Replaced code:
    bl,a,pt %icc,.LX686
    nop
    ba      .LX687
    nop
.LX686:
    st      %o0,[%g1+%l2]
    ba      .L900000283
    nop
.LX687:
.L77000552:
E.3.4. Using the Instrumentation Tool
The SAIT is primarily designed for use with the DSZOOM system.2 In DSZOOM, there are three different kinds of snippets: one snippet to insert at global integer loads (IntLoad), one snippet for global floating-point loads (FloatLoad), and one snippet for global stores (Store). The instrumentation tool is invoked as a normal Java program with two arguments: (1) the *.s-file to instrument and (2) a text file with user-written snippets. The layout of the snippet-file is simple. Three keywords are used to separate different snippets: IntLoad, FloatLoad, and Store. To separate different parts of this file, the character "#" is also used. An example of the layout is shown in Figure E.3.

2 The tool is currently used in several other SPARC-related projects.

IntLoad
#
!! The snippet is written as SPARC assembler
#
FloatLoad
#
!! The snippet is written as SPARC assembler
#
Store
#
!! The snippet is written as SPARC assembler
#

Figure E.3.: Snippet-file layout.
To increase the flexibility of snippets, a group of special symbols is available in addition to ordinary SPARC assembler. A description of all these symbols is given in Figure E.4. Finally, certain lines of the snippet code can be prefixed with the character "*" and a number. Then, depending on which conditions are met, only the lines with one of the specific numbers will be inserted into the code. This is an easy way to make snippets a little bit more "intelligent." An example of a short snippet using this technique is shown below.
FloatLoad
#
*1 fcmps $F, $3, $3
*2 fcmpd $F, $3, $3
fbne,pn $F, .LY$L
...
In this snippet, depending on whether the instrumented instruction is a single- or a double-precision floating-point load, either a compare single (fcmps) or a compare double (fcmpd) will be inserted by the tool.
E.4. Low-Level Optimization Techniques
Besides implementing the SAIT for use with the DSZOOM system, a number of other changes have been made to the original system. Among those are some new optimizations to the snippets, as well as optimizations to the DSZOOM runtime system.

$n: Where n can be 1, 2, or 3. These symbols represent the three arguments of the instrumented instruction. For example, the instruction ld [%g1+128],%g5 will give $1=%g1, $2=128, and $3=%g5.
$I: This represents the instrumented instruction itself. For the same example as above, $I is equal to "ld [%g1+128],%g5".
$L: This symbol is replaced with an incremented digit, representing new labels inserted by the parser.
$R: This symbol either returns a free register found by the liveness analysis or spills (i.e., stores) a register to memory, and thereby makes it available for use in the snippet. The register is restored after the snippet, if any spilling occurred, to maintain the execution-correctness of the instrumented file.
$F: The same as $R above, with the exception that the registers looked for are the floating-point condition code registers; that is, %fcc0, %fcc1, %fcc2, or %fcc3.
$D: This symbol is replaced with the type of load that is instrumented, much like $I, but without any arguments.
$S: The same as $D above, but for stores instead.

Figure E.4.: A list of all special snippet-file characters available.
E.4.1. Rewriting Snippets and Reducing the MTAG Size
We have noticed that when the programs are compiled with the highest optimization levels, the liveness analysis seldom finds enough free registers for the "old" snippets written for the instrumentation system based on the EEL library. Since EEL could not be used on optimized code, this problem was not identified earlier. Spilling registers to memory at more than 50% of the instrumentation insertion points is a high price to pay.
To solve this problem, we reserve SPARC's application registers (see section E.2.2) during the compilation phase of the application. Thus, the SAIT heavily depends on the application registers reserved by the compiler, that is, registers %g2, %g3, and %g4 for our target architecture. Therefore, all snippets are rewritten to use only those three registers. In fact, all snippets can be written with only two reserved registers.
To reduce the cache pollution introduced by the DSZOOM system, we reduced
the size of the directory entries and MTAGs (see [77] for more details). One byte
(instead of two) is now used to store the information for both data structures.
This optimization also removes one load instruction from the store snippet.
E.4.2. Straight Execution Path
In most cases, most of the code in a snippet will never be executed; it only takes up space in the instruction cache. Thus, "taking away" the part of the code that is not frequently used produces a "straighter" execution path for many of the instrumented applications. Basically, we divide snippets into two parts: a fast-path part and a slow-path part (see Figure E.5 for an example). This functionality is implemented in SAIT for all types of instrumented loads/stores.
E.4.3. Avoiding Local Load/Store Instrumentation
It can be hard at instrumentation time to know which loads and stores are global/shared. If we do not know this, there is a risk that SAIT instruments even some local loads/stores, which leads to a larger instrumentation overhead than necessary. One simple method to identify local loads/stores is to find out whether the effective address of a load/store comes from a constant. On SPARC, constants are constructed using the sethi instruction, which sets the highest 22 bits of a register to a constant. Then, using the assigned register and another constant, expressed via lo(some_name), a load or store from a constant address can be done. Such an instruction looks like, e.g., ld [%g1+lo(num_rows)],%g5. Therefore, choosing not to instrument loads/stores where the argument includes a register plus a lo-construct avoids some of the local memory accesses that would otherwise be instrumented. This simple strategy is implemented in the instrumentation tool. In addition, explicit accesses to the stack are ignored by SAIT. Explicit stack references are built from two SPARC-specific registers: %sp and %fp.
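Purely as an illustration of this filtering rule (the class, method names, and the exact string matching are assumptions, not the actual SAIT code), such a check could look like this in Java:

```java
/** Sketch of a "skip local accesses" filter, under the assumptions described
    in the text: constant addresses built with a lo()-construct and explicit
    stack references through %sp/%fp are treated as local. Illustrative only. */
class LocalAccessFilter {

    /** addressOperand is the bracketed part of a load/store,
        e.g. "[%g1+lo(num_rows)]" or "[%sp+96]". */
    static boolean isLikelyLocal(String addressOperand) {
        String a = addressOperand.trim();
        if (a.contains("lo(")) return true;                       // constant (static) address
        if (a.contains("%sp") || a.contains("%fp")) return true;  // explicit stack access
        return false;
    }

    static boolean shouldInstrument(String addressOperand) {
        return !isLikelyLocal(addressOperand);
    }
}
```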
E.4.4. Write Permission Cache (WPC)
The code for a store snippet is substantially more complicated than for load snippets, as shown by this pseudocode:
Original instruction:
    ST Rx, addr

Replaced by code snippet:
    LOCK (MTAG_lock[addr]);
    LD R7, MTAG_value[addr];
    if (R7 != 1)
        call StoreProtocol;
    ST Rx, addr;
    UNLOCK (MTAG_lock[addr]);
FloatLoad
#
!! This is the fast-path part of the snippet
!! The code is placed next to the instrumented instruction
$I                         !! the original instruction
*1 fcmps $F, $3, $3
*2 fcmpd $F, $3, $3
fbne,pn $F, .LY$L          !! if (NaN), goto slow-path
add $1, $2, %g3            !! %g3 = effective address
.LQ$L:                     !! label generated by SAIT
#
FloatLoad2
#
!! This is the slow-path part of the snippet
!! The code is placed at the end of the instrumented procedure
.LY$L:                     !! label generated by SAIT
srl %g3, 28, %g4           !! range check
sub %g4, 8, %g4
brnz,pt %g4, .LQ$L         !! if (local_load) goto LQ-label
nop
save %sp, -112, %sp
mov %y, %l0                !! save %y register
mov %g1, %l1               !! save %g1 register
mov %ccr, %l2              !! save %ccr register
mov %fprs, %l3             !! save %fprs register
mov %g5, %l5               !! save %g5 register
call DSZOOM_load_coherence_routine
mov %g3, %l7               !! save the original effective address
$D [$l7], $3               !! original load
stb $g4, [$g3]             !! release and update dir_entry
mov %l0, %y                !! restore %y register
mov %l1, %g1               !! restore %g1 register
mov %l2, %ccr              !! restore %ccr register
mov %l3, %fprs             !! restore %fprs register
mov %l5, %g5               !! restore %g5 register
restore
ba .LQ$L                   !! goto end of the fast-path part
nop

Figure E.5.: The two-part snippet example.
Much of the complication comes from the fact that we allow more than one thread to run in each node. Before a store can be performed, the cache line's write permission in the local "MTAG" structure should be consulted. However, in order to avoid corner cases where the cache line gets downgraded between the MTAG consultation and the point in time where the store is performed, the MTAG entry must be locked before the consultation. The content of the MTAG may reveal the need to call the store coherence routine (StoreProtocol in the code above) in order to get write permission; after that, the store can be performed and the MTAG lock released. Note the use of a dedicated register R7 in the example above. The R7 register has to be a globally reserved register, or an unused register found by SAIT's liveness analysis.
The store snippet is currently responsible for about half of the instrumentation overhead. Reducing the load instrumentation would further expose the cost of store instrumentation. To cut the store snippet overhead, we introduce a write permission cache (WPC). The idea is the following: when a thread has ensured that it has write permission for a cache line, it holds on to that permission, hoping that the following stores will be to the same cache line (spatial locality). The address/ID of the cache line is stored in a dedicated register (in our implementation: %g2). If the next store is indeed to the same cache line, the store snippet is reduced to a conditional branch operation, i.e., no extra memory instructions need to be added, as shown by this pseudocode:
Original instruction:
    ST Rx, addr

Fast-path snippet:
    if (Rwpc != addr)
        call Slow-path;
    ST Rx, addr;

Slow-path:
    UNLOCK (MTAG_lock[Rwpc]);
    LD Rwpc, #addr;
    LOCK (MTAG_lock[addr]);
    LD R7, MTAG_value[addr];
    if (R7 != 1)
        call StoreProtocol;
Rwpc denotes a reserved register used to store the address held in the WPC. It should be noted that the fast-path of this store snippet only consists of ALU instructions for "hits" in the write permission cache. One possible UltraSPARC implementation of a 1-entry WPC is shown in Figure E.6. The code is for 64-byte cache lines, 1-byte MTAG entries, and a 32-bit address space.
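The following Java sketch models the same fast-path/slow-path decision at a conceptual level; the Mtag interface and the class are stand-ins introduced only for illustration, and the actual implementation is the SPARC snippet in Figure E.6.

```java
/** Conceptual model of a 1-entry write permission cache (WPC).
    The Mtag interface is a stand-in for DSZOOM's real MTAG structure;
    this sketch only illustrates the hit/miss logic described in the text. */
class WritePermissionCache {

    interface Mtag {
        void lock(long cacheLineId);
        void unlock(long cacheLineId);
        boolean hasWritePermission(long cacheLineId);   // MTAG value == 1 in the pseudocode
        void runStoreProtocol(long cacheLineId);        // StoreProtocol in the pseudocode
    }

    private static final int LINE_SHIFT = 6;            // 64-byte cache lines
    private long cachedLineId = -1;                     // plays the role of Rwpc
    private final Mtag mtag;

    WritePermissionCache(Mtag mtag) { this.mtag = mtag; }

    /** Called before every instrumented store to 'addr'. */
    void beforeStore(long addr) {
        long lineId = addr >>> LINE_SHIFT;
        if (lineId == cachedLineId) return;             // fast path: WPC hit, ALU work only

        // Slow path: release the previously held line, acquire the new one.
        if (cachedLineId != -1) mtag.unlock(cachedLineId);
        mtag.lock(lineId);
        if (!mtag.hasWritePermission(lineId)) mtag.runStoreProtocol(lineId);
        cachedLineId = lineId;                          // permission (and MTAG lock) now held
    }

    /** Must be called at synchronization points, at protocol lock failures,
        and at thread termination, as discussed in section E.6. */
    void release() {
        if (cachedLineId != -1) { mtag.unlock(cachedLineId); cachedLineId = -1; }
    }
}
```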
E.5. Performance Study
E.5.1. Experimental Setup
All the experiments are performed on the same system that was used to test and
develop the original implementation of DSZOOM (using EEL).
Store
#
!! This is the fast-path part of the snippet
add $2, $3, %g4            !! %g4 = effective address
srl %g4, 6, %g3            !! %g3 = cache line ID
sub %g3, %g2, %g3          !! %g2 = Rwpc
brnz,pt %g3, .LY$L_WPC_MISS
nop
.LQ$L_MSTATE:              !! label generated by SAIT
$I                         !! the original instruction
#
Store2
#
!! This is the slow-path part of the snippet
.LY$L_WPC_MISS:            !! label generated by SAIT
srl %g2, 22, %g3           !! range check
sub %g3, 8, %g3
brnz %g3, .LQ$L_FALSE_MTAG
nop
sethi %hi(0x8dc00000), %g4
add %g4, %g2, %g4
stb %g0, [$g4]             !! release and update old MTAG
.LQ$L_FALSE_MTAG:          !! label generated by SAIT
add $2, $3, %g4            !! %g4 = effective address
srl %g4, 28, %g3           !! range check
sub %g3, 8, %g3
brnz,pt %g3, .LY$L_MSTATE
srl %g4, 6, %g2            !! %g2 = new Rwpc
sethi %hi(0x8dc00000), %g4
add %g4, %g2, %g4          !! %g4 = new MTAG addr
.LQ$L_PREV:                !! label generated by SAIT
ldstub [$g4], %g2          !! lock new MTAG
sub %g2, 255, %g3
brz,pn %g3, .LQ$L_PREV
nop
add $2, $3, %g3            !! %g3 = effective address
brz,pn %g2, .LY$L_MSTATE   !! check if in the M-state
srl %g3, 6, %g2            !! %g2 = new Rwpc
... call DSZOOM_store_coherence_routine ...
ba .LQ$L_MSTATE
nop

Figure E.6.: The UltraSPARC implementation example of a 1-entry WPC.
Program     Problem Size                        Seq. Time -xO0 [sec.]   Seq. Time -fast [sec.]
FFT         1,048,576 points (48.1 MB)          14.29                   3.18
LU-c        1024×1024, block 16 (8.0 MB)        66.61                   13.56
LU-nc       1024×1024, block 16 (8.0 MB)        80.30                   30.56
Radix       4,194,304 items (36.5 MB)           30.95                   6.67
Barnes      16,384 bodies (32.8 MB)             57.02                   13.28
Cholesky    tk29.0 (25.3 MB)                    20.18                   3.45
FMM         32,768 particles (8.1 MB)           117.58                  25.03
Ocean-c     514×514 (57.5 MB)                   46.76                   14.61
Ocean-nc    258×258 (22.9 MB)                   18.32                   3.72
Radiosity   room (29.4 MB)                      28.98                   11.10
Raytrace    car (50.2 MB)                       11.28                   3.89
Water-nsq   2197 molecules, 2 steps (2.0 MB)    134.47                  26.07
Water-sp    2197 molecules, 2 steps (1.5 MB)    34.33                   7.76

Table E.1.: Data-set sizes and sequential-execution times for non-instrumented SPLASH-2 applications, compiled with -xO0 (no optimizations) and -fast (high compiler optimization) flags.
use Sun’s Forte Developer 6.2 C Patch 111679-08 compiler instead of the GNU’s
gcc-2.8.1 compiler. For hardware details, see the original DSZOOM paper by
Radović and Hagersten [77].
E.5.2. Applications
To test the performance of this new DSZOOM system, we use the well-known scientific workloads from the SPLASH-2 benchmark suite [111]. The data-set sizes used and the un-instrumented uniprocessor execution times are presented in Table E.1. The programs are compiled both without optimizations, using cc's -xO0 flag, and with the highest optimization level, using the -fast flag. The speedup for the optimized applications is over four times on average.
In this setup, we are able to execute all applications without any modifications, except for Volrend. The reason we cannot run Volrend is that its global variables are used as shared data. It should be possible to modify this application to get rid of this problem. We began all measurements at the start of the parallel phase to avoid DSZOOM's run-time system initialization.
Program     With Appl. Regs   Without Appl. Regs   Overhead   "Empty" Delay Slots   Overhead
FFT         3.36              3.37                 1.00       3.18                  0.95
LU-c        13.15             13.27                1.01       13.62                 1.04
LU-nc       29.71             30.16                1.02       30.61                 1.03
Radix       6.96              6.96                 1.00       6.66                  0.96
Barnes      12.99             13.05                1.00       13.24                 1.02
Cholesky    3.39              3.40                 1.00       3.47                  1.02
FMM         24.10             22.99                0.95       28.35                 1.18
Ocean-c     15.19             15.17                1.00       14.66                 0.97
Ocean-nc    3.83              3.86                 1.01       3.73                  0.97
Radiosity   11.27             11.04                0.98       11.03                 0.98
Raytrace    3.74              3.76                 1.01       3.97                  1.06
Water-nsq   24.22             24.52                1.01       26.35                 1.09
Water-sp    7.30              7.36                 1.01       8.01                  1.10
Average     12.25             12.22                1.000063   12.84                 1.03

Table E.2.: The original overhead built into this new system. The time is given in seconds.
E.5.3. Performance Overview
There are some initial overheads for our new instrumentation system. To avoid register spilling during the instrumentation phase, all applications are compiled with the -xregs=no%appl flag. That reserves three application registers that the compiler otherwise could have used to optimize the compiled code even further (typically to reduce the number of loads/stores in the application). The SAIT also sometimes replaces loads and stores in delay slots with our code snippets, which "works against" optimizations the compiler has already performed. Table E.2 shows the results of these two initial slow-downs. The exclusion of the application registers does not really affect the execution time, but moving the loads and stores out of the delay slots slows down the programs by about 3%.
Sequential performance after instrumentation. Table E.3 shows the performance of instrumented sequential programs for two different snippets. The first snippet is just the "normal," somewhat improved snippet that was also used in the original DSZOOM system (see section E.4.1). The second snippet uses the technique of dividing the snippet into two parts, thereby getting a straighter execution path in most cases (this is described in section E.4.2). The "straight code" optimization is positive for all applications except Water-nsq. The percentage of statically replaced loads and stores is also shown in this table. These numbers are quite high and result in a rather large number of local loads/stores being instrumented for some programs.
Program     % Loads Replaced   % Stores Replaced   Orig. Snippet (Section E.4.1)   Straight Code (Section E.4.2)   Relative Speedup
FFT         50.2%              45.1%               10.19                           9.93                            0.97
LU-c        40.6%              33.2%               51.74                           51.19                           0.99
LU-nc       43.0%              36.1%               68.18                           66.80                           0.98
Radix       32.3%              29.1%               10.95                           10.64                           0.97
Barnes      59.1%              64.4%               21.14                           20.13                           0.95
Cholesky    60.5%              43.8%               12.33                           11.83                           0.96
FMM         64.8%              51.7%               42.17                           37.45                           0.89
Ocean-c     63.5%              69.9%               30.91                           N/A                             N/A
Ocean-nc    41.7%              65.6%               8.03                            N/A                             N/A
Radiosity   69.8%              64.0%               23.32                           21.99                           0.94
Raytrace    63.5%              55.6%               10.60                           9.01                            0.85
Water-nsq   42.6%              32.6%               51.14                           52.02                           1.02
Water-sp    37.2%              27.5%               16.09                           14.96                           0.93
Average     51.4%              47.6%               –                               –                               0.95

Table E.3.: Sequential execution times in seconds for instrumented SPLASH-2 applications, compiled with cc's -fast flag. The percentage of statically replaced loads and stores is given in the second and third columns, respectively.
By profiling the DSZOOM system, we have noticed that there are many local loads and stores that are still instrumented by SAIT, even though the SAIT tries to avoid many stack/static references (see section E.4.3 for more details). To avoid local load/store instrumentation, we perform a two-phase instrumentation. This instrumentation is based on the profile data generated from the first run of the instrumented application. In the second instrumentation phase, we simply avoid instrumenting loads/stores that were classified as local in the first run. The results obtained using this technique (we call it program slicing in this paper [109]) are given in Table E.4. The number of statically replaced loads and stores is drastically reduced compared to the numbers from Table E.3. These numbers are relevant for an instrumentation system that could be integrated with the higher levels of a compiler, such as the OpenMP compiler [26], the Unified Parallel C (UPC) compiler [17, 28], or the JIT code generator of a Java system [35]. In the same table, we also present sequential overheads for the 1-entry WPC implementation from Figure E.6 (DSZOOM_store_coherence_routine is never called in this case since the sequential protocol is always in the M-state). Program slicing is an effective method for the Barnes, Radiosity, Raytrace, and Water applications. The WPC implementation, on the other hand, can lower the instrumentation overheads for the following applications: FFT, LU-c, and LU-nc.
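As a sketch of the second phase (assuming, for illustration, a profile file with one "file:line" entry per access classified as local; this is not the actual SAIT format), the filter could be as simple as:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

/** Sketch of the second instrumentation phase: instrumentation points that the
    profiling run classified as local are skipped. The profile file format is
    an assumption made only for this illustration. */
class SliceFilter {
    private final Set<String> localSites = new HashSet<>();

    SliceFilter(Path profile) throws IOException {
        localSites.addAll(Files.readAllLines(profile));
    }

    /** site identifies an instrumentation point, e.g. "fft.s:1042". */
    boolean shouldInstrument(String site) {
        return !localSites.contains(site);
    }
}
```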
Parallel performance. Figure E.7 shows the execution times in seconds for several different configurations for 8- and 16-processor runs. All applications are compiled with the -fast flag and instrumented with SAIT using two-part snippets (section E.4.2). Figure E.8 shows the execution times in seconds for sliced programs (applications with only global/shared-memory access instrumentation) and for sliced programs with a 1-entry WPC implementation.
Finally, in Figure E.9, we compare the performance of a DSZOOM system that uses SAIT and the new optimizations with that of an earlier DSZOOM system that uses EEL [77].
E.5.4. WPC Study
The write permission cache optimization is introduced in section E.4.4. In this section, we perform several experiments (on sequential code) with different WPC settings to see how much the instrumentation overhead could be reduced if there is some spatial locality for memory-store operations. We report the average number of consecutive stores to the same cache line (before the MTAG has to be released) for different cache line settings (64, 128, and 256 bytes) and for a 1- and 2-entry WPC. Results are shown in Figure E.10. There is a large amount of spatial locality for stores. A WPC with two entries of 256 bytes each could reduce the sequential instrumentation overhead by several factors, especially for LU-c, Cholesky, and Ocean-c.
-fast compiler flag (optimized code)

Program     % Loads Replaced   % Stores Replaced   Original Overhead   Program Slicing   Slicing & 1-entry WPC
FFT         15.1%              17.8%               212.9%              206.3%            164.2%
LU-c        13.5%              16.0%               276.8%              279.3%            147.1%
LU-nc       13.1%              15.1%               119.8%              122.3%            65.2%
Radix       8.8%               16.6%               59.5%               58.9%             75.4%
Barnes      19.8%              21.4%               50.8%               15.1%             19.0%
Cholesky    16.8%              13.7%               242.6%              243.2%            345.2%
FMM         N/A                N/A                 53.3%               N/A               N/A
Ocean-c     39.7%              52.5%               115.1%              116.1%            90.9%
Ocean-nc    31.9%              50.7%               117.5%              114.5%            106.2%
Radiosity   N/A                N/A                 98.8%               11.7%             15.7%
Raytrace    20.8%              14.8%               131.9%              68.9%             67.1%
Water-nsq   17.1%              15.4%               100.8%              27.7%             21.6%
Water-sp    19.0%              12.0%               90.9%               20.7%             18.6%
Average     19.6%              22.4%               128.5%              107.1%            94.7%

-xO0 compiler flag (un-optimized code)

Program     % Loads Replaced   % Stores Replaced   Original Overhead   Program Slicing   Slicing & 1-entry WPC
FFT         N/A                N/A                 87.8%               90.1%             43.3%
LU-c        N/A                N/A                 123.9%              122.5%            41.1%
LU-nc       N/A                N/A                 103.8%              103.1%            36.0%
Radix       N/A                N/A                 17.8%               12.5%             19.6%
Barnes      N/A                N/A                 28.2%               1.3%              0.7%
Cholesky    N/A                N/A                 72.2%               81.2%             65.7%
FMM         N/A                N/A                 22.6%               9.3%              8.5%
Ocean-c     N/A                N/A                 72.3%               66.7%             36.0%
Ocean-nc    N/A                N/A                 51.3%               42.6%             27.1%
Radiosity   N/A                N/A                 40.1%               16.5%             16.2%
Raytrace    N/A                N/A                 130.7%              5.3%              5.7%
Water-nsq   N/A                N/A                 111.6%              39.3%             39.7%
Water-sp    N/A                N/A                 116.1%              41.9%             39.6%
Average     N/A                N/A                 75.3%               48.6%             29.2%

Table E.4.: Sequential performance overhead for the "original snippets" from Table E.3, for instrumentation of only global/shared-memory accesses with a program slicing technique, and, in the last column, the overhead for program slicing with a 1-entry WPC (from Figure E.6). The "original overhead" is calculated with values from Table E.1.
8 processors, execution time [s]:

Program     E6000 8 CPUs   CC-NUMA 2x4   DSZOOM-WF 1x8   DSZOOM-WF 2x4
FFT         0.78           1.20          1.85            3.35
LU-c        1.87           2.28          7.08            8.02
LU-nc       4.22           5.49          10.41           10.65
Radix       1.24           1.53          2.22            3.62
Barnes      1.89           2.08          2.88            3.49
Cholesky    0.51           0.70          1.75            2.19
FMM         4.45           4.62          5.71            7.44
Ocean-c     2.40           2.69          4.45            5.85
Ocean-nc    0.64           1.04          1.76            2.21
Radiosity   1.56           1.92          3.08            4.25
Raytrace    0.58           0.83          1.26            1.94
Water-nsq   3.42           3.52          7.05            7.35
Water-sp    1.23           1.25          2.52            2.59

16 processors, execution time [s]:

Program     E6000 16 CPUs   CC-NUMA 2x8   DSZOOM-WF 1x16   DSZOOM-WF 2x8
FFT         0.63            1.02          1.04             2.40
LU-c        1.03            1.12          3.82             4.34
LU-nc       2.57            3.05          6.09             6.20
Radix       0.89            1.32          1.28             2.34
Barnes      1.04            1.23          1.64             2.29
Cholesky    0.37            0.56          0.97             1.44
FMM         2.79            3.11          3.38             4.67
Ocean-c     1.79            1.98          2.52             3.75
Ocean-nc    0.54            0.87          1.19             1.66
Radiosity   0.96            1.66          1.76             2.93
Raytrace    0.35            0.99          0.69             1.88
Water-nsq   1.77            1.85          3.68             3.91
Water-sp    0.91            0.94          1.85             1.93

Figure E.7.: Execution times in seconds for 8- and 16-processor runs for Sun Enterprise E6000, 2-node Sun WildFire (CC-NUMA), single-node DSZOOM-WF, and a 2-node DSZOOM-WF.
8 processors, execution time [s]:

Program     CC-NUMA 2x4   DSZOOM-WF 2x4 (a)   DSZOOM-WF 2x4 (b)   DSZOOM-WF 2x4 (c)
FFT         1.20          3.35                3.29                3.16
LU-c        2.28          8.02                7.89                5.38
LU-nc       5.49          10.65               10.87               8.48
Radix       1.53          3.62                3.67                3.79
Raytrace    0.83          1.94                1.83                1.81
Water-nsq   3.52          7.35                5.00                5.84
Water-sp    1.25          2.59                1.67                1.76

16 processors, execution time [s]:

Program     CC-NUMA 2x8   DSZOOM-WF 2x8 (a)   DSZOOM-WF 2x8 (b)   DSZOOM-WF 2x8 (c)
FFT         1.02          2.40                2.52                2.30
LU-c        1.12          4.34                4.27                2.95
LU-nc       3.05          6.20                6.34                4.86
Radix       1.32          2.34                2.58                2.42
Raytrace    0.99          1.88                2.04                1.90
Water-nsq   1.85          3.91                2.86                2.94
Water-sp    0.94          1.93                1.24                1.27

Figure E.8.: Parallel performance for 8- and 16-processor runs. Execution times in seconds for a 2-node Sun WildFire (CC-NUMA), (a) 2-node DSZOOM-WF (from Figure E.7), (b) 2-node DSZOOM-WF with only global/shared-memory accesses, and (c) a 2-node DSZOOM-WF with only global/shared-memory accesses and a 1-entry WPC.
[Figure E.9: bar chart of execution time in seconds per SPLASH-2 application for Sun Enterprise E6000 (16 CPUs), CC-NUMA 2x8, DSZOOM 2x8, and DSZOOM 2001 2x8.]

Figure E.9.: A comparison between Sun Enterprise E6000, a 2-node Sun WildFire, a 2-node DSZOOM-WF compiled with the -fast flag and instrumented with SAIT, and a 2-node DSZOOM-WF from 2001 [77].
E.6. Conclusions and Future Work
In this paper, we have described SAIT, a SPARC assembler instrumentation
tool. SAIT is today a crucial part of the DSZOOM instrumentation system. It
can instrument programs compiled with modern compilers and with the highest compiler optimization levels. The instrumentation overhead for sequential
execution of the SPLASH-2 benchmarks instrumented with DSZOOM-related
snippets ranges from around 30% for the un-optimized programs to around 100%
for programs with high optimization. Still, the actual execution time is lower
for the optimized programs than the un-optimized ones. The performance improvements range from 1.07 to 2.82 times (average 1.73) if we compare this new
DSZOOM instrumentation system with the old one from the original DSZOOM
paper. Writing and changing snippets with SAIT is an easy task.
Currently, SAIT does not perform any optimizations after the instrumentation phase. We would like to extend our tool to perform code scheduling and register allocation after the instrumentation. Alternatively, the instrumentation could be integrated with the higher levels of a compiler, such as GNU's gcc compiler, the OpenMP compiler, or the JIT code generator of a Java system, and rely on their optimization phases.
We also introduce a write permission cache (WPC) optimization in section E.4.4.
Initial experiments show that stores exhibit a large amount of spatial locality
and that a WPC with two entries of 256 bytes each would reduce the instrumentation
overhead severalfold. However,
[Figure E.10 consists of two bar charts, (a) 1-entry WPC and (b) 2-entry WPC, showing the Average # Stores until UNLOCK (0-32) for each SPLASH-2 benchmark at cache line sizes of 64, 128, and 256 bytes.]
Figure E.10.: The average number of consecutive stores to the same cache line
for different cache line sizes, using a WPC with (a) one entry and
(b) two entries, respectively.
using a WPC also raises correctness, liveness, and performance concerns. The WPC
entries have to be released at synchronization points, on failure to acquire a
directory entry or MTAG, and at thread termination. These rules were sufficient
to run all the applications in the SPLASH-2 benchmark suite correctly and
efficiently, but they are clearly not sufficient for more general cases. This
issue deserves further attention.
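A minimal C sketch of the WPC idea follows, assuming a hypothetical one-entry
interface; the names wpc_tag, acquire_write_permission, and wpc_flush are
illustrative only, since the real mechanism is implemented in the SAIT-emitted
SPARC code and in the DSZOOM protocol.

    #include <stdint.h>

    #define WPC_LINE_SIZE 256U                 /* assumed WPC entry granularity   */

    /* One WPC entry; in practice this would be thread-private state. */
    static uintptr_t wpc_tag = (uintptr_t)-1;  /* tag of the line currently held  */

    extern void acquire_write_permission(uintptr_t tag);  /* directory/MTAG slow path */

    /* Conceptual store snippet with a 1-entry WPC: consecutive stores to
     * the same line skip the coherence action entirely. */
    static inline void checked_store(double *addr, double value)
    {
        uintptr_t tag = (uintptr_t)addr / WPC_LINE_SIZE;

        if (tag != wpc_tag) {                  /* WPC miss                    */
            acquire_write_permission(tag);     /* run the store protocol      */
            wpc_tag = tag;                     /* remember the permission     */
        }
        *addr = value;                         /* the original store          */
    }

    /* The cached permission must be dropped at synchronization points, on
     * failure to acquire a directory entry or MTAG, and at thread
     * termination; otherwise other threads could be blocked indefinitely. */
    static inline void wpc_flush(void)
    {
        wpc_tag = (uintptr_t)-1;
    }

A two-entry variant, corresponding to Figure E.10(b), would keep two such tags and
release both of them in wpc_flush.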
Bibliography
[1] Anderson, T. E. The Performance Implications of Spin-Waiting Alternatives
for Shared-Memory Multiprocessors. In Proceedings of the 1989 International
Conference on Parallel Processing (Aug. 1989), vol. II Software, pp. 170–174.
[2] Anderson, T. E. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems
1, 1 (Jan. 1990), 6–16.
[3] Appel, A. W. Modern Compiler Implementation in ML. The Press Syndicate
of the University of Cambridge, 1998.
[4] Artiaga, E. Personal communication, Apr. 2001.
[5] Artiaga, E., Martorell, X., Becerra, Y., and Navarro, N. Experiences
on Implementing PARMACS Macros to Run the SPLASH-2 Suite on Multiprocessors. In Proceedings of the 6th Euromicro Workshop on Parallel and Distributed
Processing (Jan. 1998).
[6] Artiaga, E., Navarro, N., Martorell, X., and Becerra, Y. Implementing PARMACS Macros for Shared-Memory Multiprocessor Environments.
Tech. Rep. UPC-DAC-1997-07, Department of Computer Architecture, Polytechnic University of Catalunya, Jan. 1997.
[7] Bailey, D. H. FFT’s in External or Hierarchical Memory. Journal of Supercomputing 4, 1 (Mar. 1990), 23–35.
[8] Barroso, L., Gharachorloo, K., McNamara, R., Nowatzyk, A.,
Qadeer, S., Sano, B., Smith, S., Stets, R., and Verghese, B. Piranha:
A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of
the 27th Annual International Symposium on Computer Architecture (ISCA’00)
(June 2000), pp. 282–293.
[9] Barroso, L. A., Gharachorloo, K., and Bugnion, E. Memory System
Characterization of Commercial Workloads. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA’98) (June 1998), pp. 3–
14.
[10] Bershad, B. N., Zekauskas, M. J., and Sawdon, W. A. The Midway
Distributed Shared Memory System. In Proceedings of the 38th IEEE Computer
Society International Conference (Feb. 1993), pp. 528–537.
[11] Bilas, A., Liao, C., and Singh, J. P. Using Network Interface Support to
Avoid Asynchronous Protocol Processing in Shared Virtual Memory Systems. In
Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA’99) (May 1999).
[12] Bilas, A., and Singh, J. P. The Effects of Communication Parameters on End
Performance of Shared Virtual Memory Clusters. In Proceedings of Supercomputing ’97 (Nov. 1997).
[13] Blelloch, G. E., Leiserson, C. E., Maggs, B. M., Plaxton, C. G.,
Smith, S. J., and Zagha, M. A Comparison of Sorting Algorithms for the
Connection Machine CM-2. In ACM Symposium on Parallel Algorithms and Architectures (July 1991), pp. 3–16.
[14] Boyle, J., Butler, R., Disz, T., Glickfield, B., Lusk, E., Overbeek, R.,
Patterson, J., and Stevens, R. Portable Programs for Parallel Processors.
Holt, Rinehart and Winston, New York, NY, 1987.
[15] Brewer, T., and Astfalk, G. The Evolution of the HP/Convex Exemplar.
In Proceedings of COMPCON Spring’97: 42nd IEEE Computer Society International Conference (Feb. 1997), pp. 81–86.
[16] Carlisle, M. C., and Rogers, A. Software Caching and Computation Migration in Olden. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming (July 1995), pp. 29–38.
[17] Carlson, W. W., Draper, J. M., Culler, D. E., Yelick, K., Brooks,
E., and Warren, K. Introduction to UPC and Language Specification. Tech.
Rep. CCS-TR-99-157, The George Washington University, May 1999.
[18] Carter, J. B., Bennett, J. K., and Zwaenepoel, W. Implementation and
Performance of Munin. In Proceedings of the 13th ACM Symposium on Operating
Systems Principles (SOSP’91) (Oct. 1991), pp. 152–164.
[19] Charlesworth, A. E. STARFIRE: Extending the SMP Envelope. IEEE Micro
18 (Jan./Feb. 1998), 39–49.
[20] Charlesworth, A. E. The Sun Fireplane System Interconnect. In Proceedings
of Supercomputing 2001 (Nov. 2001).
[21] Cifuentes, C., Emerik, M. V., Ramsey, N., and Lewis, B. The University
of Queensland Binary Translator (UQBT) framework, Dec. 2001. http://www.
itee.uq.edu.au/cms/uqbt.html.
[22] Clark, R., and Alnes, K. SCI Interconnect Chipset and Adapter: Building
Large Scale Enterprise Servers with Pentium Pro SHV Nodes. In Proceedings of
IEEE Hot Interconnects (Aug. 1996), pp. 41–52.
[23] Craig, T. S. Building FIFO and Priority-Queuing Spin Locks from Atomic
Swap. Tech. Rep. TR 93-02-02, Department of Computer Science, University of
Washington, Feb. 1993.
[24] Culler, D. E., Dusseau, A., Goldstein, S. C., Krishnamurthy, A.,
Lumetta, S., von Eicken, T., and Yelick, K. Parallel Programming in
Split-C. In Proceedings of Supercomputing ’93 (Nov. 1993), pp. 262–273.
[25] Culler, D. E., Singh, J. P., and Gupta, A. Parallel Computer Architecture:
A Hardware/Software Approach. Morgan Kaufmann, 1999.
[26] Dagum, L., and Menon, R. OpenMP: An Industry-Standard API for Shared
Memory Programming. IEEE Computational Science and Engineering 5, 1 (Jan.-Mar. 1998), 46–55.
[27] Dwarkadas, S., Gharachorloo, K., Kontothanassis, L., Scales, D. J.,
Scott, M. L., and Stets, R. Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory. In Proceedings of the
5th International Symposium on High-Performance Computer Architecture (Jan.
1999), pp. 260–269.
[28] El-Ghazawi, T., and Cantonnet, F. UPC Performance and Potential: A
NPB Experimental Study. In Proceedings of Supercomputing 2002 (Baltimore,
Maryland, USA, Nov. 2002).
[29] Erlichson, A., Nuckolls, N., Chesson, G., and Hennessy, J. L. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared
Memory. In Proceedings of the 7th International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS-VII) (Oct.
1996), pp. 210–220.
[30] Gharachorloo, K. Personal communication, Oct. 2000.
[31] Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A.,
and Hennessy, J. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA’90) (May 1990), pp. 15–26.
[32] Gharachorloo, K., Sharma, M., Steely, S., and Doren, S. V. Architecture and Design of AlphaServer GS320. In Proceedings of the 9th International
Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS-IX) (Nov. 2000), pp. 13–24.
[33] Gibson, C., and Bilas, A. Performance of Shared Virtual Memory on Clusters
of DSMs. In Proceedings of the 8th International Conference on High Performance
Computing (HiPC 2001) (Dec. 2001).
[34] Goodman, J. R., Vernon, M. K., and Woest, P. J. Efficient Synchronization Primitives for Large-Scale Cache-Coherent Shared-Memory Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS-III) (Apr. 1989),
pp. 64–75.
[35] Gosling, J., Joy, B., and Steele, G. The Java Programming Language,
3rd ed. The Java Series. Addison-Wesley, 2000.
[36] Graunke, G., and Thakkar, S. Synchronization Algorithms for Shared Memory Multiprocessors. IEEE Computer 23, 6 (1990), 60–69.
[37] Greengard, L. The Rapid Evaluation of Potential Fields in Particle Systems.
ACM Press, 1987.
[38] Grenholm, O. Simple and Efficient Instrumentation for the DSZOOM System.
Master’s thesis, School of Engineering, Uppsala University, Sweden, Dec. 2002.
UPTEC F-02-096, ISSN 1401-5757.
[39] Grenholm, O., Radović, Z., and Hagersten, E. Latency-hiding and Optimizations of the DSZOOM Instrumentation System. Tech. Rep. 2003-029, Department of Information Technology, Uppsala University, May 2003.
[40] Hagersten, E., and Koster, M. WildFire: A Scalable Path for SMPs. In
Proceedings of the 5th IEEE Symposium on High-Performance Computer Architecture (Feb. 1999), pp. 172–181.
[41] Hanrahan, P., Salzman, D., and Aupperle, L. Rapid Hierarchical Radiosity
Algorithm. In Proceedings of SIGGRAPH (1991).
[42] Hennessy, J. L., and Patterson, D. A. Computer Architecture: A Quantitative Approach, 3rd ed. Morgan Kaufmann, 2003.
[43] IEEE Std 1003.1-1996, ISO/IEC 9945-1. Portable Operating System Interface
(POSIX)–Part1: System Application Programming Interface (API) [C Language],
1996.
[44] Iftode, L., Blumrich, M., Dubnicki, C., Oppenheimer, D. L., Singh,
J. P., and Li, K. Shared Virtual Memory with Automatic Update Support.
Tech. Rep. TR-575-98, Princeton University, Feb. 1998.
[45] Iftode, L., and Singh, J. P. Shared Virtual Memory: Progress and Challenges.
Proceedings of the IEEE, Special Issue on Distributed Shared Memory 87, 3 (Mar.
1999), 498–507.
[46] InfiniBand(SM) Trade Association, InfiniBand Architecture Specification, Release
1.0, Oct. 2000. Available from: http://www.infinibandta.org.
[47] Jamieson, P., and Bilas, A. CableS: Thread Control and Memory Management Extensions for Shared Virtual Memory Clusters. In Proceedings of the 8th
International Symposium on High-Performance Computer Architecture (HPCA-8)
(Feb. 2002).
[48] Johnson, K., Kaashoek, M. F., and Wallach, D. A. CRL: High-Performance All-Software Distributed Shared Memory. In Proceedings of the 15th
ACM Symposium on Operating Systems Principles (Dec. 1995).
[49] Kägi, A., Burger, D., and Goodman, J. R. Efficient Synchronization: Let
Them Eat QOLB. In Proceedings of the 24th Annual International Symposium
on Computer Architecture (ISCA’97) (June 1997), pp. 170–180.
[50] Karlsson, M., Moore, K., Hagersten, E., and Wood, D. Memory System
Behavior of Java-Based Middleware. In Proceedings of the Ninth International
Symposium on High Performance Computer Architecture (HPCA-9) (Anaheim,
California, USA, Feb. 2003), pp. 217–228.
[51] Keleher, P. Lazy Release Consistency for Distributed Shared Memory. PhD
thesis, Department of Computer Science, Rice University, Jan. 1995.
[52] Keleher, P., Cox, A. L., Dwarkadas, S., and Zwaenepoel, W. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating
Systems. In Proceedings of the Winter 1994 USENIX Conference (Jan. 1994),
pp. 115–131.
[53] Kontothanassis, L., Hunt, G., Stets, R., Hardavellas, N., Cierniak,
M., Parthasarathy, S., Meira, W., Dwarkadas, S., and Scott, M. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. In
Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA’97) (June 1997).
[54] Kontothanassis, L., and Scott, M. Using Memory-Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory. In Proceedings
of the 2nd IEEE Symposium on High Performance Computer Architecture (Feb.
1996).
[55] Krewell, K. Sun Weaves Multithreaded Future. In Microprocessor Report
(Apr. 2003).
[56] Lamport, L. How to Make a Multiprocessor Computer That Correctly Executes
Multiprocess Programs. IEEE Transactions on Computers C-28, 9 (Sept. 1979),
690–691.
[57] Larus, J. R., and Schnarr, E. EEL: Machine-Independent Executable Editing. In Proceedings of the SIGPLAN ’95 Conference on Programming Language
Design and Implementation (June 1995), pp. 291–300.
[58] Laudon, J., and Lenoski, D. The SGI Origin: A ccNUMA Highly Scalable
Server. In Proceedings of the 24th Annual International Symposium on Computer
Architecture (ISCA’97) (June 1997), pp. 241–251.
[59] Lenoski, D., Laudon, J., Gharachorloo, K., Weber, W.-D., Gupta,
A., Hennessy, J., Horowitz, M., and Lam, M. S. The Stanford Dash
Multiprocessor. IEEE Computer 25, 3 (Mar. 1992), 63–79.
[60] Li, K. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis,
Department of Computer Science, Yale University, Sept. 1986.
[61] Li, K. IVY: A Shared Virtual Memory System for Parallel Computing. In Proceedings of the 1988 International Conference on Parallel Processing (ICPP’88)
(Aug. 1988), vol. II, pp. 94–101.
[62] Li, K., and Hudak, P. Memory Coherence in Shared Virtual Memory Systems.
In Proceedings of the 5th ACM Annual Symposium on Principles of Distributed
Computing (PODC’86) (Aug. 1986), pp. 229–239.
[63] Li, K., and Hudak, P. Memory Coherence in Shared Virtual Memory Systems.
ACM Transactions on Computer Systems 7, 4 (Nov. 1989), 321–359.
[64] Lim, B.-H., and Agarwal, A. Reactive Synchronization Algorithms for Multiprocessors. In Proceedings of the 6th International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS-VI) (Oct.
1994), pp. 25–35.
[65] Lovett, T., and Clapp, R. STiNG: A CC-NUMA Computer System for
the Commercial Marketplace. In Proceedings of the 23rd Annual International
Symposium on Computer Architecture (ISCA’96) (May 1996), pp. 308–317.
[66] Magnusson, P., Landin, A., and Hagersten, E. Queue Locks on Cache
Coherent Multiprocessors. In Proceedings of the 8th International Parallel Processing Symposium (Cancun, Mexico, Apr. 1994), pp. 165–171. Extended version available as “Efficient Software Synchronization on Large Cache Coherent
Multiprocessors,” SICS Research Report T94:07, Swedish Institute of Computer
Science, February 1994.
[67] McVoy, L. W., and Staelin, C. lmbench: Portable Tools for Performance
Analysis. In Proceedings of the 1996 USENIX Annual Technical Conference (Jan.
1996), pp. 279–294.
[68] Mellor-Crummey, J., and Scott, M. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems 9, 1 (Feb. 1991), 21–65.
[69] Mueller, F. Distributed Shared-Memory Threads: DSM-Threads. In Proceedings of the Workshop on Run-Time Systems for Parallel Programming (Apr.
1997), pp. 31–40.
[70] Mukherjee, S. S., Falsafi, B., Hill, M. D., and Wood, D. A. Coherent
Network Interfaces for Fine-Grain Communication. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA’96) (Apr. 1996),
pp. 247–258.
[71] Nieh, J., and Levoy, M. Volume Rendering on Scalable Shared-Memory MIMD
Architectures. In Proceedings of the Boston Workshop on Volume Visualization
(Oct. 1992).
[72] Noordergraaf, L., and van der Pas, R. Performance Experiences on Sun’s
Wildfire Prototype. In Proceedings of Supercomputing ’99 (Nov. 1999).
[73] Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K., and Chang, K.
The Case for a Single Chip Multiprocessor. In Proceedings of the 7th International
Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS-VII) (Oct. 1996), pp. 2–11.
[74] Radović, Z. DSZOOM – Low Latency Software-Based Shared Memory. Master’s
thesis, School of Engineering, Uppsala University, Sweden, Dec. 2000. UPTEC
F-00-093, ISSN 1401-5757.
[75] Radović, Z., and Hagersten, E. DSZOOM – Low Latency Software-Based
Shared Memory. Tech. Rep. 2001:03, Parallel and Scientific Computing Institute
(PSCI), Sweden, Apr. 2001.
[76] Radović, Z., and Hagersten, E. Implementing Low Latency Distributed
Software-Based Shared Memory. In Proceedings of the Workshop on Memory
Performance Issues, held in conjunction with the 28th International Symposium
on Computer Architecture (ISCA28) (Göteborg, Sweden, June 2001).
[77] Radović, Z., and Hagersten, E. Removing the Overhead from Software-Based Shared Memory. In Proceedings of Supercomputing 2001 (Denver, Colorado, USA, Nov. 2001).
[78] Radović, Z., and Hagersten, E. Efficient Synchronization for Nonuniform
Communication Architectures. In Proceedings of Supercomputing 2002 (Baltimore, Maryland, USA, Nov. 2002).
[79] Radović, Z., and Hagersten, E. RH Lock: A Scalable Hierarchical Spin
Lock. In Proceedings of the 2nd Annual Workshop on Memory Performance Issues
(WMPI 2002), held in conjunction with the 29th International Symposium on
Computer Architecture (ISCA29) (Anchorage, Alaska, USA, May 2002).
[80] Radović, Z., and Hagersten, E. Hierarchical Backoff Locks for Nonuniform Communication Architectures. In Proceedings of the Ninth International
Symposium on High Performance Computer Architecture (HPCA-9) (Anaheim,
California, USA, Feb. 2003), pp. 241–252.
[81] Rajwar, R., Kägi, A., and Goodman, J. R. Improving the Throughput of
Synchronization by Insertion of Delays. In Proceedings of the 6th International
Symposium on High-Performance Computer Architecture (Jan. 2000), pp. 168–
179.
[82] Rudolph, L., and Segall, Z. Dynamic Decentralized Cache Schemes for
MIMD Parallel Processors. In Proceedings of the 11th Annual International Symposium on Computer Architecture (ISCA’84) (June 1984), pp. 340–347.
[83] Scales, D. J., and Gharachorloo, K. Design and Performance of the Shasta
Distributed Shared Memory Protocol. In Proceedings of the 11th ACM International Conference on Supercomputing (July 1997). Extended version available as
Technical Report 97/2, Western Research Laboratory, Digital Equipment Corporation, February 1997.
[84] Scales, D. J., and Gharachorloo, K. Towards Transparent and Efficient
Software Distributed Shared Memory. In Proceedings of the 16th ACM Symposium
on Operating System Principles, Saint-Malo, France (Oct. 1997).
[85] Scales, D. J., Gharachorloo, K., and Aggarwal, A. Fine-Grain Software
Distributed Shared Memory on SMP Clusters. Tech. Rep. 97/3, Western Research
Laboratory, Digital Equipment Corporation, Feb. 1997.
[86] Scales, D. J., Gharachorloo, K., and Thekkath, C. A. Shasta: A Low-Overhead Software-Only Approach to Fine-Grain Shared Memory. In Proceedings
of the 7th International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS-VII) (Oct. 1996), pp. 174–185.
[87] Scherer, A., Lu, H., Gross, T., and Zwaenepoel, W. Transparent Adaptive Parallelism on NOWs using OpenMP. In Proceedings of the Seventh Conference on Principles and Practice of Parallel Programming (May 1999), pp. 96–106.
[88] Schoinas, I., Falsafi, B., Hill, M., Larus, J. R., and Wood, D. A.
Sirocco: Cost-Effective Fine-Grain Distributed Shared Memory. In Proceedings
of the 6th International Conference on Parallel Architectures and Compilation
Techniques (Oct. 1998).
[89] Schoinas, I., Falsafi, B., Hill, M. D., Larus, J. R., Lucas, C. E.,
Mukherjee, S. S., Reinhardt, S. K., Schnarr, E., and Wood, D. A.
Implementing Fine-Grain Distributed Shared Memory On Commodity SMP
Workstations. Tech. Rep. 1307, Computer Sciences Department, University of
Wisconsin–Madison, Mar. 1996.
[90] Schoinas, I., Falsafi, B., Lebeck, A. R., Reinhardt, S. K., Larus, J. R.,
and Wood, D. A. Fine-grain Access Control for Distributed Shared Memory. In Proceedings of the 6th International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS-VI) (Oct. 1994),
pp. 297–306.
[91] Scott, M. L. Non-Blocking Timeout in Scalable Queue-Based Spin Locks. In
Proceedings of the 21st ACM Symposium on Principles of Distributed Computing
(PODC) (July 2002). Expanded version available as TR 773, Computer Science
Dept., University of Rochester, February 2002.
[92] Scott, M. L., and Mellor-Crummey, J. M. Fast, Contention-Free Combining Tree Barriers for Shared-Memory Multiprocessors. International Journal of
Parallel Programming 22, 4 (Aug. 1994).
[93] Scott, M. L., and Scherer, W. N. Scalable Queue-Based Spin Locks with
Timeout. In Proceedings of the ACM SIGPLAN 2001 Symposium on Principles
and Practice of Parallel Programming (PPoPP’01) (Snowbird, Utah, USA, June
2001). Source code is available for download: ftp://ftp.cs.rochester.edu/
pub/packages/scalable_synch/.
[94] Singh, J. P., Gupta, A., and Levoy, M. Parallel Visualization Algorithms:
Performance and Architectural Implications. IEEE Computer 27, 7 (July 1994),
45–55.
[95] Singh, J. P., Weber, W.-D., and Gupta, A. SPLASH: Stanford Parallel
Applications for Shared Memory. Computer Architecture News 20, 1 (Mar. 1992),
5–44.
[96] Singhal, A., Broniarczyk, D., Cerauskis, F., Price, J., Yuan, L.,
Cheng, C., Doblar, D., Fosth, S., Agarwal, N., Harvey, K., Hagersten, E., and Liencres, B. Gigaplane: A High Performance Bus for Large
SMPs. In Proceedings of IEEE Hot Interconnects IV (Aug. 1996), pp. 41–52.
[97] Sistare, S. J., and Jackson, C. J. Ultra-High Performance Communication
with MPI and the Sun Fire Link Interconnect. In Proceedings of Supercomputing
2002 (Baltimore, Maryland, USA, Nov. 2002).
[98] Speight, E., and Bennett, J. Brazos: A Third Generation DSM System. In
Proceedings of the 1st USENIX Windows NT Symposium (Aug. 1997).
[99] Sterling, T. L., Salmon, J., Becker, D. J., and Savarese, D. F. How
to Build a Beowulf: A Guide to the Implementation and Application of the PC
Clusters. MIT Press, Cambridge, MA, 1999.
[100] Stets, R., Dwarkadas, S., Hardavellas, N., Hunt, G., Kontothanassis, L., Parthasarathy, S., and Scott, M. Cashmere-2L: Software Coherent
Shared Memory on a Clustered Remote-Write Network. In Proceedings of the 16th
ACM Symposium on Operating Systems Principles (Oct. 1997).
[101] Sun Microsystems. ABI Compliance and Global Register Usage in SPARC V8
and V9 Architecture. http://soldc.sun.com.
[102] Sun Microsystems. C User’s Guide Forte Developer 6 update 2, 2001.
[103] Sun Microsystems. Sun Fire Superclusters for High Performance and Technical
Computing, Apr. 2003. White Paper.
[104] Tendler, J. M., Dodson, S., Fields, S., Le, H., and Sinharoy, B. Power4
system microarchitecture. IBM Journal of Research and Development 46, 1 (Jan.
2002), 5–25.
[105] Thain, D., and Livny, M. Multiple Bypass: Interposition Agents for Distributed Computing. Cluster Computing 4 (2001), 39–47.
[106] Tullsen, D. M., Eggers, S. J., and Levy, H. M. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA’95) (June 1995),
pp. 392–403.
[107] Weaver, D. L., and Germond, T., Eds. The SPARC Architecture Manual,
Version 9. PTR Prentice Hall, Englewood Cliffs, New Jersey, 2000.
[108] WebGain. Java Compiler Compiler Grammar File Documentation, June 2002.
http://www.webgain.com.
[109] Weiser, M. Program Slicing. IEEE Transactions on Software Engineering SE-10, 4 (July 1984), 352–357.
[110] Welsh, M., and DeCristofaro, J. Shared-Memory Multiprocessor Support
for Split-C. Tech. rep., Cornell University, May 1995.
[111] Woo, S. C., Ohara, M., Torrie, E., Singh, J. P., and Gupta, A. The
SPLASH-2 Programs: Characterization and Methodological Considerations. In
Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA’95) (June 1995), pp. 24–36.
[112] Woo, S. C., Singh, J., and Hennessy, J. The Performance Advantages
of Integrating Message Passing in Cache-Coherent Multiprocessors. Tech. Rep.
CSL-TR-93-593, Stanford University, Dec. 1993.
[113] Woo, S. C., Singh, J. P., and Hennessy, J. L. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In
Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI) (Oct. 1994), pp. 219–
229.
[114] Yeung, D., Kubiatowicz, J., and Agarwal, A. MGS: A Multigrain Shared
Memory System. In Proceedings of the 23rd Annual International Symposium on
Computer Architecture (ISCA’96) (May 1996), pp. 45–55.
Recent licentiate theses from the Department of Information Technology
2002-004
Martin Nilsson: Iterative Solution of Maxwell’s Equations in Frequency Domain
2002-005
Kaushik Mahata: Identification of Dynamic Errors-in-Variables Models
2002-006
Kalyani Munasinghe: On Using Mobile Agents for Load Balancing in High
Performance Computing
2002-007
Samuel Sundberg: Semi-Toeplitz Preconditioning for Linearized Boundary
Layer Problems
2002-008
Henrik Lundgren: Implementation and Real-world Evaluation of Routing Protocols for Wireless Ad hoc Networks
2003-001
Per Sundqvist: Preconditioners and Fundamental Solutions
2003-002
Jenny Persson: Basic Values in Software Development and Organizational
Change
2003-003
Inger Boivie: Usability and Users’ Health Issues in Systems Development
2003-004
Malin Ljungberg: Handling of Curvilinear Coordinates in a PDE Solver
Framework
2003-005
Mats Ekman: Urban Water Management - Modelling, Simulation and Control
of the Activated Sludge Process
2003-006
Tomas Olsson: Bootstrapping and Decentralizing Recommender Systems
2003-007
Maria Karlsson: Market Based Programming and Resource Allocation
2003-008
Zoran Radović: Efficient Synchronization and Coherence for Nonuniform
Communication Architectures
Department of Information Technology, Uppsala University, Sweden