
Fast Repeater Tree Construction
Dissertation
for the attainment of the doctoral degree (Dr. rer. nat.)
of the Mathematisch-Naturwissenschaftliche Fakultät
of the Rheinische Friedrich-Wilhelms-Universität Bonn

submitted by
Christoph Bartoschek
from Peiskretscham, Poland

Bonn, May 2014
Prepared with the approval of the Mathematisch-Naturwissenschaftliche
Fakultät of the Rheinische Friedrich-Wilhelms-Universität Bonn

First referee: Professor Dr. Jens Vygen
Second referee: Professor Dr. Stephan Held
Date of the doctoral examination: 11 July 2014
Year of publication: 2014
Acknowledgments
At this point, I would like to express my gratitude to my supervisors Professor Dr.
Jens Vygen and Professor Dr. Stephan Held. This work would not have been possible
without their extensive support, inspiration, and extraordinary patience.
A very special thanks goes to Professor Dr. Dr. h.c. Bernhard Korte for his
encouraging support and for creating such an excellent working environment at the
Research Institute for Discrete Mathematics at the University of Bonn.
I would further like to thank my past and present colleagues at the institute
working with me on timing optimization, especially Dr. Jens Maßberg, Professor Dr.
Dieter Rautenbach, Daniel Rotter, Dr. Christian Szegedy and Dr. Jürgen Werber.
Their support and ideas proved to be invaluable.
Special thanks go to all the students for their collaboration, especially Laura
Geisen, Nicolas Kämmerling and Philipp Ochsendorf.
I also thank all other colleagues at the institute for inspiring discussions, in
particular Dr. Ulrich Brenner, Christian Panten, Jan Schneider and Dr. Markus
Struzyna.
I am grateful that I have been able to work with all the people from IBM who
shared their knowledge and hardest chip designs with us, especially Karsten Muuss,
Dr. Matthias Ringe, and Alexander J. Suess.
But my biggest thanks go to my wife Kerstin and my little daughters Johanna and
Barbara. Without their support and endless patience, I would have never finished
this thesis.
Contents

1 Introduction
2 Timing Optimization – Basic Concepts
  2.1 Basic Notation
  2.2 Integrated Circuit Design
  2.3 Static Timing Analysis
  2.4 Repeater
  2.5 Wire Extraction
    2.5.1 Elmore Delay
    2.5.2 Higher Order Delay Models
  2.6 Slew Limit Propagation
  2.7 Required Arrival Time Functions
    2.7.1 Propagation of Required Arrival Times
3 Repeater Tree Problem
  3.1 Repeater Tree Instances
  3.2 The Repeater Tree Problem
    3.2.1 Repeater Tree Timing
    3.2.2 Feasible Solutions
    3.2.3 Objectives
  3.3 Our Repeater Tree Algorithm
4 Instance Preprocessing
  4.1 Analysis of Library and Wires
    4.1.1 Estimating Two-pin Connections
    4.1.2 Parameter dwire
    4.1.3 Buffering Modes
    4.1.4 Slew Parameters
    4.1.5 Sinkdelay
    4.1.6 Further Preprocessing
  4.2 Blockage Map and Congestion Map
    4.2.1 Grid
    4.2.2 Blockage Map
    4.2.3 Blockage Grid
    4.2.4 Congestion Map
5 Topology Generation
  5.1 A Simple Delay Model
    5.1.1 Time Tree
  5.2 Repeater Tree Topology Problem
    5.2.1 Topology Algorithm Overview
  5.3 Restricted Repeater Tree Problem
  5.4 Sink Criticality
  5.5 A Simple Topology Generation Algorithm
    5.5.1 Topology Generation Algorithm
    5.5.2 Theoretical Properties
  5.6 Topology Generation Algorithm
    5.6.1 Handling High Fanout Trees
  5.7 Blockages
  5.8 Plane Assignment
  5.9 Global Wires as Topologies
6 Repeater Insertion
  6.1 Computing Required Arrival Time Targets
    6.1.1 Linear Time-cost Tradeoff
    6.1.2 Effort Assignment Algorithm
  6.2 Repeater Insertion Algorithm
    6.2.1 Cluster
    6.2.2 Initialization
    6.2.3 Timing Model during Repeater Insertion
    6.2.4 Finding a New Repeater
    6.2.5 Buffering Algorithm
    6.2.6 Merging Operation
    6.2.7 Moving Operation
    6.2.8 Arriving at the Root
    6.2.9 Running Time
    6.2.10 Repeater Insertion – Summary
  6.3 Dynamic Programming
    6.3.1 Basic Dynamic Programming Approach
    6.3.2 Buffering Positions
    6.3.3 Extensions to Dynamic Programming
7 BonnRepeaterTree
  7.1 Repeater Library
    7.1.1 Repeater and Wire Analysis
    7.1.2 RAT and Slew Backwards Propagation
  7.2 Blockages and Congestion Map
  7.3 Processing Repeater Tree Instances
    7.3.1 Identifying Repeater Tree Instances
    7.3.2 Constructing Repeater Trees
    7.3.3 Replacing Repeater Tree Instances
  7.4 Implementation Overview
    7.4.1 Repeater Tree Construction Framework
    7.4.2 Repeater Tree API
    7.4.3 Parallelization
  7.5 BonnRepeaterTree in Global Timing Optimization
  7.6 BonnRepeaterTree Utilities
    7.6.1 Removing Existing Repeaters
    7.6.2 Postprocessing Repeater Chains
8 Experimental Results
  8.1 Comparison to an Industrial Tool
  8.2 Comparison to Bounds
    8.2.1 Running Time
    8.2.2 Wirelength
    8.2.3 Number of Inserted Inverters
    8.2.4 Timing
  8.3 Fast Buffering vs. Dynamic Programming
  8.4 Varying η
  8.5 Varying dnode
  8.6 Disabling Effort Assignment
  8.7 Disabling Parallel Mode
  8.8 Choosing Tradeoff Parameters
A Detailed Comparison Tables
B Bibliography
1 Introduction
We live in a world where computer chips can be found in nearly every device we
come into contact with. On the one hand, there is a huge demand for powerful yet
power-efficient chips that can be added to our wearables, houses or household
items. On the other hand, there is still a huge demand for fast processors that are
able to crunch the enormous amounts of data we produce daily.
Chip designers create the small miracles driving all of this in a process called
physical design. It is an area where one of our most advanced technologies meets
mathematics and computer science.
Physical design contains a lot of fascinating problems that can be tackled with
methods from combinatorial optimization. This thesis focuses on one of the problems
that arise during the optimization of the timing behaviour of a chip, the optimization
of interconnections or repeater trees. Interconnections distribute signals from a
source to one or several sinks. Figure 1.1 shows the interconnection between a source
(blue) and four sinks (red). If interconnections get too long, the speed of signals
degrades and electrical constraints get violated. To improve speed and to avoid
electrical violations, repeaters can be inserted into a chip.
Figure 1.1: Example of an interconnection. A signal has to be distributed from a root (blue)
to sinks (red).
There are two flavours of repeaters: buffers and inverters. Buffers are used to
refresh a signal. Inverters have, in addition, the property that they change the
polarity of a signal. A signal switching from a logical 0 to a 1 becomes a signal
switching from 1 to 0 and vice-versa. The logical symbols of both repeater types
and the schematic of an inverter are shown in Figure 1.2.
An interconnection distributes the signals in a tree-like fashion from a source to
Figure 1.2: The symbols of inverters and buffers and a schematic of a simple inverter. It
consists of two transistors with converse switching behaviour. GND (logical 0)
and Vdd (logical 1) are power supplies. Both repeaters have a single input A
and an output Z. If a 0 arrives, then only the gate to Vdd opens and vice-versa.
A buffer is usually constructed by two inverters in series.
the sinks via wires. A repeater is added into the interconnection by subdividing a
wire segment and connecting the ends to the repeater’s input and output. We call
an interconnection that potentially has repeaters in between a repeater tree.
In the early years of physical design, interconnection optimization between gates
was a minor task. Repeaters were only necessary for very long distances or very high
fanouts. However, with the downscaling of technology, resulting in smaller gates and
thinner wires, the ability of gates to drive long wires deteriorated and more repeaters
became necessary to meet timing requirements. Saxena et al. (2003) predicted that
for the 45 nm and 32 nm technology nodes, 35 % and 70 %, respectively, of all circuits
on typical designs would be repeaters. In the meantime, these technology nodes have
arrived, and the numbers are slightly better than predicted: on 45 nm designs 25–40 %,
on 32 nm designs 30–45 %, and on 22 nm designs 35–50 % of all circuits are repeaters.
Nevertheless, if 30 % of all gates are repeaters, repeater tree construction becomes
an important task. It affects all aspects of physical design in addition to timing
and electrical correctness. For example, repeaters represent a significant number of
circuits that have to be placed and later connected by routing. A significant part
of the power consumption can be attributed to repeaters. Modern designs have
a wide range of instances with up to hundreds of thousands of sinks. We have seen
designs with millions of instances. For large instances, running time becomes a
crucial feature of a repeater tree construction algorithm.
The algorithms we present fit into all stages of physical design. A very fast
algorithm that we call Fast Buffering can be used for rebuilding all instances globally
in early and middle design stages. We can optimize around 5.7 million instances
per hour on a single core, or within a couple of minutes when running in parallel. For later stages of
physical design, a more accurate version of our algorithm can be enabled that is
able to squeeze out the last tenth of a picosecond.
The structure of this thesis is as outlined in the following paragraphs. The basic
concepts from timing optimization in chip design that we use are explained in
Chapter 2. We then define the Repeater Tree Problem in Chapter 3. Our
problem formulation encapsulates most of the constraints that have been studied so
far. To the best of our knowledge, we also consider several aspects for the first time,
for example, slew dependent required arrival times at repeater tree sinks. These
make our formulation more adequate to the challenges of real-world repeater tree
construction.
For creating good repeater trees, one has to take the overall design environment
into account. The employed technology, the properties of available repeaters and
metal wires, the shape of the chip, the temperature, the voltages, and many other
factors highly influence the results of repeater tree construction. To take all this
into account, we first preprocess the environment to extract parameters for our
algorithms. These parameters allow us to quickly and yet quite accurately estimate
the timing of a tree before it has even been buffered. Chapter 4 shows how we
extract them from the timing environment.
The next two chapters explain our algorithm to solve the Repeater Tree
Problem. Chapter 5 shows how we construct an underlying Steiner tree for our
solution. We prove that our algorithm is able to create timing-efficient as well as
cost-efficient trees.
Chapter 6 deals with the problem of adding buffers to a given Steiner tree. The
predominantly used algorithms to solve this problem use dynamic programming.
However, they have several drawbacks. Firstly, potential repeater positions along
the Steiner tree have to be chosen upfront. Secondly, the algorithms strictly
follow the given Steiner tree and miss optimization opportunities. Finally, dynamic
programming causes high running times. We present our new buffer insertion
algorithm that overcomes these limitations. It is able to produce results with similar
quality to a dynamic programming approach but a much better running time. In
addition, we also present our improvements to the dynamic programming approach
that allow us to push the quality at the expense of a high running time.
As part of this thesis, we implemented the discussed algorithms as a module
within the BonnTools optimization suite which is developed at the Research Institute
for Discrete Mathematics at the University of Bonn in an industrial cooperation
with IBM. BonnTools are used by chip designers worldwide within IBM. Some
implementation details will be described in Chapter 7. Our algorithms are used in
practice and help engineers dealing with some of the most complex chips in the world. In the
same chapter, we also shortly describe the framework we have written that makes it
easy to implement new repeater tree construction algorithms. As an example, we
will show how routing congestion on a chip can be reduced by rerouting repeater
chains.
Our cooperation partner IBM provided us with a large number of real world chip
designs. For this thesis we have chosen a set of twelve challenging chips with a total
number of more than 3.3 million different repeater tree instances. In Chapter 8, we
present experimental results that show the quality and speed of our algorithms.
2 Timing Optimization – Basic Concepts
2.1 Basic Notation
We use the same notation for graphs as Korte and Vygen (2012).
For an arborescence (i.e. a directed tree where each node except for the root has
exactly one entering edge) $T = (V, E)$ with nodes $v, w \in V$, we denote the edges on
the path from $v$ to $w$ by $E_{[v,w]}$. The parent of a node $v \in V$ is called $parent(v)$.
For a set $X \subset \mathbb{R}^n$, we define the componentwise maximum
$$\max_{x \in X} x := \Big( \max_{x \in X} x_1, \max_{x \in X} x_2, \ldots, \max_{x \in X} x_n \Big)$$
and the componentwise minimum
$$\min_{x \in X} x := \Big( \min_{x \in X} x_1, \min_{x \in X} x_2, \ldots, \min_{x \in X} x_n \Big).$$
As current technologies prefer to route wires only in horizontal and vertical
direction, we almost exclusively use the $\ell_1$-norm:
$$\|a\| := \|a\|_1.$$
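These componentwise extrema and the $\ell_1$-norm are straightforward to compute; the following Python sketch (our own illustrative helper names, not part of the thesis) shows them on a small point set:

```python
def componentwise_max(X):
    """Componentwise maximum of a set of n-dimensional points."""
    n = len(next(iter(X)))
    return tuple(max(x[i] for x in X) for i in range(n))

def componentwise_min(X):
    """Componentwise minimum of a set of n-dimensional points."""
    n = len(next(iter(X)))
    return tuple(min(x[i] for x in X) for i in range(n))

def l1_norm(a):
    """The l1-norm, matching rectilinear (horizontal/vertical) wiring."""
    return sum(abs(ai) for ai in a)

X = {(1, 5), (4, 2), (3, 3)}
print(componentwise_max(X))  # (4, 5)
print(componentwise_min(X))  # (1, 2)
print(l1_norm((3, -4)))      # 7
```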
2.2 Integrated Circuit Design
The netlist of a chip design consists of primary pins (input or output), gates and
nets. Gates are circuits computing small Boolean functions, for example, NOT,
AND, or macros encapsulating larger functionality. Gates have pins, input pins or
output pins, as external connection points. Pins, primary pins and gate pins, are
connected by nets. Typically a net has a single source (gate output pin or primary
input pin) and a set of sinks (gate input pins or primary output pins). If a gate’s
output pin is the source of a net, then we say that the gate is the driver of the
net. Physical design is the phase in the process of creating a chip where a netlist
that has just been compiled from a hardware description language is mapped to a
chip image, typically a rectangular area with free space for gates. During physical
design, gates get placed on the chip image. Then, optimizations are performed
without changing the logical function of the design to improve the timing behaviour.
One such operation is rebuilding repeater trees. Finally, the pins of each net get
connected with wires by a routing tool. Modern chip designs and technologies are so
challenging that placement, timing optimization, and routing have to be considered
together most of the time. An introduction into the optimization steps of modern
physical design is given by Held (2008).
2.3 Static Timing Analysis
We use the concepts of static timing analysis (Hitchcock et al., 1982), which is based
on the critical path method (Kelley and Walker, 1959). A thorough introduction
to timing analysis is given by Sapatnekar (2004). In this thesis only the following
concepts are important.
The voltage at a given point of a chip compared to ground defines the logical
state. The voltage Vdd represents a logical 1 and GND (ground) represents a 0.
A signal is defined as the change of the voltage over time. A rising signal describes
a change from GND to Vdd . On the other hand, a falling signal describes the change
from Vdd to GND. We call the direction the signal’s edge. The possible edges are r
(rise) and f (fall). We define the inversion of an edge as
$$f^{-1} := r, \qquad r^{-1} := f.$$
In static timing analysis signals are measured at certain points of the design that
are also called timing points. For most of this discussion, it is sufficient to restrict
ourselves to gate pins and primary pins as timing points. In addition, some gates
have internal timing points. We only have to consider them when we work with real
chip designs (Chapter 7). A signal is estimated by a piecewise linear function that
is given by the arrival time and slew of the signal. Usually, the arrival time of a
rising or falling signal is given as the time when the voltage change reaches 50 %.
Similarly, the slew is given as the time between 10 % and 90 % of the voltage change.
Static timing analysis makes worst-case assumptions and computes at each
measurement point an early and a late signal. A real signal will arrive after the
early signal and before the late signal.
At certain timing points the arrival times of signals are compared to the arrival
times of other signals or design-specific constants. This imposes constraints on the
signals. For example, signals are not allowed to be too fast (their early arrival time
has to be high enough) or too slow (the late arrival time has to be small enough)
when they arrive at registers¹ compared to clock signals. Repeaters are used to slow
down signals that are too fast and they are also used to speed up signals. In this
thesis, we only consider the problem of speeding up signals to meet requirements on
the late arrival time. As we are not interested in the early arrival times of signals,
we will ignore them for the remainder of this thesis and only work with late signals.
Transistors show different characteristics for rising and falling signals depending
on their size or the technology. Gates show asymmetric behaviour depending on the
edge of the incoming signal. Therefore, timing analysis computes the arrival times
and slews for both signals separately and stores them in time pairs, one value for
each signal edge.
Definition 1. A time pair is a tuple $(rise, fall) \in \mathbb{R}^2$ of time values. For a given
time pair $t = (rise, fall)$ we define $t_r := rise$ and $t_f := fall$.
¹ Registers are gates that store information between clock cycles.
At each timing point, a signal is given by an arrival time time pair (we often write
arrival time pair) and a slew time pair (we often write slew pair):
$$(at_r, at_f), \qquad (slew_r, slew_f).$$
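As an illustration, a time pair and the edge inversion defined above can be modeled as follows (a minimal sketch with our own type names, not the thesis implementation):

```python
from collections import namedtuple

# A time pair holds one value per signal edge (Definition 1).
TimePair = namedtuple('TimePair', ['rise', 'fall'])

def invert_edge(edge):
    """Edge inversion: f^-1 = r and r^-1 = f."""
    return 'f' if edge == 'r' else 'r'

# At a timing point, a signal is described by an arrival time pair
# and a slew pair (example values in picoseconds).
at = TimePair(rise=120.0, fall=135.0)
slew = TimePair(rise=40.0, fall=45.0)
print(at.rise, invert_edge('r'))  # 120.0 f
```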
The measurement points are connected by directed propagation arcs. Timing
nodes and propagation arcs form the timing graph. For a net, there is a propagation
arc for each sink pin p connecting the source of the net to p. For gates, there are
technology dependent rules specifying which internal timing points and arcs are
added to the timing graph. Propagation arcs are also called propagation segments.
Repeaters normally have a single propagation arc from their input pin to their
output pin. The propagation arc within an inverter is an inverting propagation arc;
a rising (falling) signal edge at the input becomes a falling (rising) signal edge at
the output. Buffers have, similar to nets, a non-inverting propagation arc; a rising
(falling) signal edge at the input becomes a rising (falling) signal edge at the output.
The difference between the arrival time of a signal at the head of a propagation
arc and the arrival time at the tail is called delay.
The computer program that computes arrival times and slews for all signals on
all timing nodes is called the timing engine.
2.4 Repeater
A repeater $t$ is characterized by
• its logical function (buffers implement the identity function, inverters implement the negation),
• an input pin $t_a$ and an output pin $t_z$ with pin capacitances $cap_{in}(t)$ and $cap_{out}(t)$,
• its delay function $delay_t$,
• its slew function $slew_t$, and
• its leakage power consumption $pwr(t)$.
The logical function determines whether a signal propagating through a repeater
is inverted or not. Inverters change, according to their propagation arcs, a rising
signal into a falling signal and vice-versa. Buffers do not change the edge of a signal.
Given a signal edge $* \in \{r, f\}$, we define
$$t(*) := \begin{cases} *^{-1} & \text{if $t$ is an inverter} \\ * & \text{if $t$ is a buffer.} \end{cases}$$
Each repeater is the driver of the net connected to its output pin. The sum of pin
and wire capacitances visible from the output pin, including $cap_{out}(t)$, is called the load
capacitance, or load.
The functions $delay_t$ and $slew_t$ are called timing functions. The delay function
computes the delay over the internal propagation arc of a repeater. It depends on
the load at the output pin and the slew pair at the input pin. The function is given
by
$$delay_t : [0, loadlim(t)] \times [0, slewlim(t)]^2 \to \mathbb{R}^2$$
$$delay_t(l, s) := \big( delay_t^r(l, s_r), \; delay_t^f(l, s_f) \big)$$
where, for each signal edge $* \in \{r, f\}$, there is a function
$$delay_t^* : [0, loadlim(t)] \times [0, slewlim(t)] \to \mathbb{R}.$$
As inverters change the signal edge, we have to distinguish between inverters and
buffers when we want to add delays to a given arrival time pair. For time pair a,
load l and slew pair s, we use the function
$$update_t : \mathbb{R}^2 \times [0, loadlim(t)] \times [0, slewlim(t)]^2 \to \mathbb{R}^2$$
$$update_t(a, l, s) := \begin{cases} \big( a_f + delay_t^f(l, s_f), \; a_r + delay_t^r(l, s_r) \big) & \text{if $t$ is an inverter} \\ \big( a_r + delay_t^r(l, s_r), \; a_f + delay_t^f(l, s_f) \big) & \text{if $t$ is a buffer.} \end{cases}$$
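The case distinction can be sketched as follows: an inverter adds the edge-specific delays and swaps the rise and fall components, while a buffer keeps them in place. The linear delay functions below are toy stand-ins for real timing rules, purely for illustration:

```python
def update(a, l, s, delay_r, delay_f, is_inverter):
    """Add repeater delays to an arrival time pair (a_r, a_f).

    a: arrival time pair, l: load, s: input slew pair,
    delay_r/delay_f: edge-specific delay functions delay_t^r, delay_t^f.
    """
    ar, af = a
    sr, sf = s
    if is_inverter:
        # A rising output stems from a falling input and vice versa.
        return (af + delay_f(l, sf), ar + delay_r(l, sr))
    return (ar + delay_r(l, sr), af + delay_f(l, sf))

# Toy linear delay functions (assumed coefficients, not a real library).
dr = lambda l, s: 10 + 0.1 * l + 0.2 * s
df = lambda l, s: 12 + 0.1 * l + 0.2 * s
print(update((100, 110), 50, (20, 30), dr, df, is_inverter=True))
# (133.0, 119.0)
```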
Similarly, the slew function determines the slew at the output pin of the repeater
depending on the slew pair at the input pin and the load at the output pin. It is
given by
$$slew_t : [0, loadlim(t)] \times [0, slewlim(t)]^2 \to \mathbb{R}^2_{\geq 0}$$
$$slew_t(l, s) := \begin{cases} \big( slew_t^f(l, s_f), \; slew_t^r(l, s_r) \big) & \text{if $t$ is an inverter} \\ \big( slew_t^r(l, s_r), \; slew_t^f(l, s_f) \big) & \text{if $t$ is a buffer,} \end{cases}$$
and, for each input signal edge $* \in \{r, f\}$, there is the function
$$slew_t^* : [0, loadlim(t)] \times [0, slewlim(t)] \to \mathbb{R}_{\geq 0}$$
computing the slew for the output signal edge $t(*)$.
Each repeater has a pair $slewlim(t)$ of limits on the maximum rising and falling
slews, respectively, associated with the input pin, and a maximum output load limit
$loadlim(t)$ associated with the output pin. Both define the domain of the timing
functions. There are also tiny lower limits on the possible loads and slews for the
delay and slew functions. However, they are not relevant in practice. The minimum
load limit can only be violated by unconnected pins, and the lower slew limits are so
steep that it is not possible to reach them within the corresponding technology. We
therefore ignore them and use 0 as lower limits. The delay and slew functions as well as the
pin capacitances and limits are called timing rules.
Especially in the early design stages, it is often not possible to keep the slews and
loads within the domain of the delay and slew functions. Such a condition is called
electrical violation. The timing engine extrapolates the timing functions in such a
case to work with somewhat reasonable values.
Generally, we do not make many restrictive assumptions on the timing functions
of a repeater. There might be small deviations from the following, but we can
assume that both functions are strictly monotonically increasing for each input. For
example, if for a fixed load the input slews are increased, then the delays and output
slews also increase.
Figure 2.1: Delays and slews of an example repeater depending on the input slew and load.
Figures (a) and (c) show the results of $slew_t^r$ and $delay_t^r$ for a fixed input slew
and different load capacitances. Figures (b) and (d) show the results of $slew_t^r$
and $delay_t^r$ for a fixed load capacitance but different input slews. In all cases the
rise-fall transition of an inverter is shown.
We use the leakage power consumption of repeaters as costs. Repeaters also
cause dynamic power consumption that depends on the switching behaviour of the
circuit. Circuits that switch often consume more power than circuits that keep their
state. The power consumption also depends on the capacitance of the capacitors
that change voltage. The dynamic power consumption is roughly linear in the total
capacitance of a circuit. As all circuits in a repeater tree show the same switching
patterns, we can basically expect that shorter trees use less dynamic power.
2.5 Wire Extraction
Given a net, we are interested in the delay between the source of the net and each
sink. The process of computing the delays and slew changes is called wire extraction.
There are several different approaches to model the timing behaviour of nets. The
most dominant one is shown in the next section.
2.5.1 Elmore Delay
The most commonly used delay model for nets in works on interconnection optimization
is the Elmore delay because it is easy to calculate and gives good results
compared to earlier approaches².
Elmore (1948) proposed to estimate the delay of a monotonic step response of a
circuit by using the mean of the impulse response. It can be shown that the Elmore
delay is an upper bound on the actual 50 % delay to a sink. Rubinstein et al. (1983)
showed how to compute the Elmore delay on an RC-tree in linear time. An RC-tree
describes the physical properties of the wires of a net. A wire segment $e$ is modeled
using the π-model, that is, a resistance segment $r_e$ between two capacitors $c_e/2$. Here,
$r_e$ is the resistance of the wiring segment and $c_e$ is its capacitance.
At the first design stages there are often no wires that can be used for RC-tree
generation. To estimate the delay over a net, it is common to compute a Steiner
tree S and use it as the RC-tree with default resistances and capacitances for the
edges of the Steiner tree.
We assume that the tree is oriented away from the input pin of the net, that each
vertex is assigned a position in the plane, and that the edges are embedded at the
shortest paths between their nodes. On a path E(S)[a,b] from vertex a to vertex b,
the Elmore delay rc is calculated by:
rcs =
X
re
e∈E(S)[a,b]
ce
+ downcap(e) .
2
The downward capacitance downcap(e) is the sum of all wiring and pin capacitances reachable from e in the oriented tree.
Both $r_e$ and $c_e$ are proportional to the length of edge $e$. The resulting $rc$ value is
therefore quadratic in the length.
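The linear-time computation of Rubinstein et al. can be sketched with two tree traversals: a bottom-up pass accumulates the capacitance below every node, and a top-down pass accumulates $r_e(c_e/2 + downcap(e))$ along each root-to-node path. The following is a minimal Python sketch under our own data layout (dicts for children, edge values, and pin capacitances), not the thesis implementation:

```python
def elmore_delays(children, r, c, pincap, root):
    """Elmore delay from `root` to every node of a rooted RC-tree.

    children[v]: child nodes of v; r[(v,w)], c[(v,w)]: resistance and
    capacitance of edge (v,w) in the pi-model; pincap[v]: pin capacitance.
    """
    subcap = {}                       # total capacitance below each node

    def collect(v):                   # bottom-up pass
        total = pincap.get(v, 0.0)
        for w in children.get(v, []):
            collect(w)
            total += c[(v, w)] + subcap[w]
        subcap[v] = total

    collect(root)

    rc = {root: 0.0}                  # top-down pass accumulates the sum

    def push(v):
        for w in children.get(v, []):
            e = (v, w)
            # downcap(e) is everything strictly below w, i.e. subcap[w];
            # the pi-model contributes half of e's own capacitance.
            rc[w] = rc[v] + r[e] * (c[e] / 2.0 + subcap[w])
            push(w)

    push(root)
    return rc

# A tiny line: source s -- a -- sink b with a 3 fF pin capacitance.
children = {'s': ['a'], 'a': ['b']}
r = {('s', 'a'): 1.0, ('a', 'b'): 2.0}
c = {('s', 'a'): 4.0, ('a', 'b'): 2.0}
delays = elmore_delays(children, r, c, {'b': 3.0}, 's')
print(delays)  # {'s': 0.0, 'a': 7.0, 'b': 15.0}
```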
The Elmore delay approximates the response of a net to a step excitation. It
is used as a raw value, often called RC-delay, that is merged with environmental
variables into a delay function and a slew function that compute the response of
² See Pileggi (1995) for a description of some earlier models.
the net to a skewed input signal. We assume that the timing engine provides a
wiredelay function
$$wiredelay : \mathbb{R} \times \mathbb{R}_{\geq 0} \to \mathbb{R}$$
and a wireslew function
$$wireslew : \mathbb{R} \times \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}.$$
Given a wire segment with Elmore delay rc and a slew sin at the segment’s start,
wiredelay(rc, sin ) computes the delay over the wire and wireslew(rc, sin ) computes
the slew at the end. Both functions are sometimes linear, but often they cannot be
calculated by a simple expression. The functions are independent from the signal
edge. To simplify the calculation of delays and slews for time pairs, we define the
combinations $wiredelay$ and $wireslew$ acting on slew pairs:
$$wiredelay : \mathbb{R} \times \mathbb{R}^2_{\geq 0} \to \mathbb{R}^2, \qquad wiredelay(rc, (slew_r, slew_f)) := (wiredelay(rc, slew_r), \, wiredelay(rc, slew_f))$$
$$wireslew : \mathbb{R} \times \mathbb{R}^2_{\geq 0} \to \mathbb{R}^2_{\geq 0}, \qquad wireslew(rc, (slew_r, slew_f)) := (wireslew(rc, slew_r), \, wireslew(rc, slew_f)).$$
The Elmore delay has the property that subdividing edges at arbitrary points in the tree
does not change the result. For example, if we have an edge $(a, b)$ and split it with a
node $c$, then
$$rc_{(a,b)} = rc_{(a,c)} + rc_{(c,b)}.$$
However, we do not assume that the same holds for wiredelay or wireslew. The
following equations are not necessarily true for an input slew $s_{in}$:
$$wiredelay(rc_{(a,b)}, s_{in}) = wiredelay(rc_{(a,c)}, s_{in}) + wiredelay(rc_{(c,b)}, wireslew(rc_{(a,c)}, s_{in}))$$
$$wireslew(rc_{(a,b)}, s_{in}) = wireslew(rc_{(c,b)}, wireslew(rc_{(a,c)}, s_{in})).$$
This means that in general we cannot compute delays and slews segment by segment
and just stitch them together. Instead, we have to calculate the Elmore delay for
the whole net first before we ask for the total delays and output slews by using the
black-box functions wiredelay and wireslew.
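The additivity of the raw $rc$ values under edge subdivision can be checked on a single wire segment split at its midpoint (toy numbers of our own choosing; $D$ stands for the downstream capacitance below the segment):

```python
r, c, D = 2.0, 4.0, 3.0        # edge resistance, capacitance, downstream cap

# Unsplit edge (a, b): r * (c/2 + D).
rc_ab = r * (c / 2 + D)

# Split at the midpoint into two segments of r/2 and c/2 each.
# The first segment additionally sees the second half's wire capacitance.
rc_ac = (r / 2) * (c / 4 + (c / 2 + D))
rc_cb = (r / 2) * (c / 4 + D)

print(rc_ab, rc_ac + rc_cb)    # 10.0 10.0 -- splitting does not change rc
```

The same check fails in general for the black-box delay and slew functions, which is exactly why the whole-net $rc$ value has to be computed first.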
2.5.2 Higher Order Delay Models
The Elmore delay is popular for timing optimization because of its simplicity.
However, it is too pessimistic for some applications. Timing engines typically
support more accurate delay models in addition to Elmore delay.
While we focus on the Elmore delay in our discussion, the operations where we
extract a net and compute the delay and slew degradation are independent from
the delay model. With small modifications and by using black-box functions that
return delays and slews for a given net, it would be possible to create versions of our
algorithm that work on more accurate delay models. So far, we have refrained from doing so because we expect only a small improvement in the quality of our solutions, which would be paid for by a drastically increased running time.
A description of higher order delay models can be found in Sapatnekar (2004) or
Alpert et al. (2008), p. 546ff.
2.6 Slew Limit Propagation
In the context of repeater trees, only input pins of gates have slew limits. However,
if one has to choose a gate that drives a net, one often wants to know the maximum
slew that may arrive at the source of the net such that the limit is not violated for
each sink of the net. We assume that there exists a function
slewinv : R × R2≥0 → R2≥0
that computes the maximum slew limit at net sources. For a sink s with a slew
limit pair slewlim(s) and RC-delay rcs, the pair of maximum slews that are allowed at the source of the net such that the sink’s limits are not violated is
slewinv(rcs , slewlim(s)). Sometimes it is not possible to obey a sink’s slew limit
because it is too tight or the net is too long. In such cases, we assume that slewinv
returns 0 for the corresponding signal edge.
For a net with sink set S, the maximum allowable slews at the source are the
componentwise minimum values over all slew limits:
slewlim := min_{s∈S} slewinv(rcs, slewlim(s)).
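A minimal sketch of this componentwise minimum, with a toy slewinv standing in for the timing engine's function (it inverts the simple degradation model sout = sin + 0.5·rc and returns 0 for an edge whose limit cannot be met at all):

```python
def slewinv(rc, slewlim_pair):
    # toy inverse of the slew degradation; 0 marks an unsatisfiable limit
    return tuple(max(0.0, lim - 0.5 * rc) for lim in slewlim_pair)

sinks = [(2.0, (4.0, 3.5)),   # (rc to sink, (rise limit, fall limit))
         (6.0, (2.0, 5.0))]

# componentwise minimum over all sinks gives the source slew limit pair
source_limit = tuple(min(slewinv(rc, lim)[i] for rc, lim in sinks)
                     for i in range(2))
# the rise component is 0 here: the second sink's rise limit cannot be met
```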
We ask the same question for each repeater: What is the highest slew that may
arrive at the input pin such that the output slew is below a certain limit for a given
load capacitance? We assume that for each repeater t a function
slewinvt : R≥0 × R2≥0 → R2≥0
is given. For a slew limit pair slewlim at the output pin and a load capacitance
load, the pair of highest allowable slews at the input pin is slewinvt (load, slewlim).
2.7 Required Arrival Time Functions
As indicated above, there are constraints on the arrival times. If a signal is not
allowed to arrive too late at a timing point, then the latest feasible arrival time is
called required arrival time (RAT). We define required arrival times for all timing
nodes even if they have no direct arrival time constraints. For a timing point v,
the required arrival time for a signal is the latest arrival time such that for all
timing nodes reachable from v the arrival time constraints are met. As the delays
20
2.7 Required Arrival Time Functions
of subsequent propagation segments depend on the slew at v, the required arrival
time is a function of slews.
For each timing point, there is a RAT function
rat : R²≥0 → R²
rat(slew) = (ratr(slewr), ratf(slewf))
with ratr , ratf being edge-specific RAT functions. For a signal edge ∗ ∈ {r, f }, we
have
rat∗ : R≥0 → R.
Given arrival time a∗ and slew s∗ for signal edge ∗ ∈ {r, f }, the required arrival
time constraint at a point is feasible if
a∗ ≤ rat∗ (s∗ ).
The slack at the point is defined as
σ ∗ := rat∗ (s∗ ) − a∗ .
The timing of a netlist is clean if the slacks are non-negative for all constraints on
timing points.
2.7.1 Propagation of Required Arrival Times
Given a net, the RAT function at its source r depends on the net topology and the
RAT functions at the sinks. Given a sink s with RAT function rats , we can ask for
a function ratr such that arrival times and slews are feasible at s if they are feasible
at r. Delay and slew over a wire segment only depend on the Elmore delay and the
input slew. We therefore assume that there is a function

ratinv : R × (R^R≥0 × R^R≥0) → R^R≥0 × R^R≥0
ratinv(rc, (ratr, ratf)) = (ratinv_r(rc, ratr), ratinv_f(rc, ratf))

and for each edge ∗ ∈ {r, f} a function

ratinv_∗ : R × R^R≥0 → R^R≥0

such that, for every slew slew∗ and Elmore delay rcs,

ratinv_∗(rcs, rat_s^∗)(slew∗) = rat_s^∗(wireslew(rcs, slew∗)) − wiredelay(rcs, slew∗).
At the source of multi-pin nets, we have to take the minimum over all RAT functions
coming from the sinks to be sure that arrival times are feasible at all sinks. The
minimum is the function
ratr := min_{s∈S} ratinv(rcs, rats)
such that for each edge ∗ ∈ {r, f } and all slews slew ∈ R≥0
rat_r^∗(slew) = min_{s∈S} ratinv_∗(rcs, rat_s^∗)(slew).
If we have to compute the minimum over a set of RAT functions, the result is the
lower contour of input RAT functions. In practice, however, we always approximate
the lower contour by a linear function.
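A sketch of this propagation with toy wiredelay/wireslew functions in place of the timing engine's: ratinv composes the sink's RAT function with the wire's slew degradation and subtracts the wire delay, and the RAT function at a multi-sink source is the pointwise minimum over the propagated sink functions.

```python
def wiredelay(rc, s): return rc * (1.0 + 0.2 * s)   # toy wire delay
def wireslew(rc, s):  return s + 0.5 * rc           # toy slew degradation

def ratinv(rc, rat):
    # ratinv(rc, rat)(slew) = rat(wireslew(rc, slew)) - wiredelay(rc, slew)
    return lambda slew: rat(wireslew(rc, slew)) - wiredelay(rc, slew)

sinks = [(2.0, lambda s: 10.0 - 0.3 * s),   # (rc_s, RAT function at sink)
         (5.0, lambda s: 12.0 - 0.1 * s)]

def rat_root(slew):
    # pointwise minimum: feasible at the root implies feasible at all sinks
    return min(ratinv(rc, rat)(slew) for rc, rat in sinks)
```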
Similarly, we assume that for each repeater t there is a function ratinvt giving us
the RAT function at the input pin for a RAT function at the output pin. Assume
that t drives a net with load capacitance load and that the RAT function at the
output pin is rat. We then have a function
ratinv_t : R≥0 × (R^R≥0 × R^R≥0) → R^R≥0 × R^R≥0
ratinv_t(load, (ratr, ratf)) = (ratinv_t^r(load, rat^t(r)), ratinv_t^f(load, rat^t(f)))
and for each edge ∗ ∈ {r, f } a function
ratinv_t^∗ : R≥0 × R^R≥0 → R^R≥0
such that, for every slew pair slew and load capacitance load,
ratinv_t^∗(load, rat_s^∗)(slew∗) = rat_s^{t(∗)}(slew_t^∗(load, slew∗)) − delay_t^∗(load, slew∗).
Note that for inverters the rise RAT function at the input pin is determined by the
fall RAT function at the output pin and vice-versa.
3 Repeater Tree Problem
3.1 Repeater Tree Instances
An instance of the Repeater Tree Problem consists of
• a root r, its location in the plane P l(r) ∈ R2 , a root arrival time function
atr : R+ → R2 , a root slew function slewr : R+ → R2 , a pin capacitance
capout , and a capacitance limit loadlim(r),
• a set S of sinks, and for each sink s ∈ S its parity par(s) ∈ {+, −}, its location
in the plane P l(s) ∈ R2 , input capacitance capin (s), a RAT function rats , and
a pair of slew limits slewlim(s),
• a set L of repeaters with timing rules delayt , slewt , loadlim(t), slewlim(t),
capin (t) and capout (t) for each repeater t ∈ L,
• a set A of rectangles defining blocked areas,
• a global routing graph,
• a set W of wiring modes, and
• timing functions wiredelay and wireslew.
Root
The root r is typically an output pin of a circuit or a primary input pin of the
netlist. The root arrival time (resp. slew) function computes a time pair of arrival
times (resp. slews) at the root pin for a given load capacitance.
Sinks
Sinks are usually primary outputs or input pins of circuits that are not repeaters.
The parity determines how many inverting repeaters are required on root-sink paths.
The number of inversions on the path from the root to the sink must be even (odd)
if the sink has parity + (−). We say a sink is positive (negative) if it has parity +
(−).
Most formulations of the Repeater Tree Problem assume fixed required
arrival times at the sinks. In practice, required arrival times depend on the slews
that arrive at the pins1 . Higher slews cause higher delays in following stages and
¹ See Section 2.7.
reduce the required arrival times. The first propagation segment after a sink has
the highest impact on the delays introduced by higher slews. If the sink is a circuit,
its delay function often shows nearly linear behaviour that can be captured by a
linear function. Effects on subsequent propagation segments are much smaller and
can be neglected.
For sinks that are primary outputs or other nodes where required arrival times
are created, the RAT function is often constant.
The slew limit is determined by the timing rules of the sink and global parameters.
Blockages and Congestion
Information about areas of the design that are blocked for repeater insertion is given
in a blockage map. Basically, the blockage map is a set of rectangles.
Similarly, congestion on the wiring layers is passed to the repeater tree routine
via a global routing graph. Both data structures are described in Section 4.2.
Wiring Modes
We assume a fixed number of wiring modes, each of which corresponds to a
type of wire we can route on a plane. A wiring mode w is a 4-tuple (p, width(w),
wirecap(w), wireres(w)) consisting of
• a routing plane p,
• routing space consumption width(w),
• capacitance per unit length wirecap(w), and
• resistance per unit length wireres(w).
The routing space consumption depends on the wire width and the necessary
spacing to neighboring wires and is used to update the congestion map. Usually,
there is one wiring mode per usable plane of a design.
Given a wire segment ws with mode w, the total capacitance and resistance of
the wire are linear in its length l(ws):
cap(ws) := wirecap(w) · l(ws)
res(ws) := wireres(w) · l(ws).
There are two default wiring modes, one on a horizontal layer wh∗ and one on a
vertical layer wv∗ , that are used for the bulk of the wires in the design. Typically,
they are the least expensive wiring modes in terms of routing space consumption
and most expensive in terms of delay. As routers are free to choose higher planes
than the plane of the assigned wiring mode and as higher planes typically mean
better timing, the assignment of the default wiring modes is a pessimistic choice.
Timing Rules
For each repeater t, the functions delayt and slewt are given with the according
electrical limits loadlim(t) and slewlim(t). The capacitance of its input pin is
capin (t) and it is capout (t) for its output pin.
For nets, wiredelay is the timing rule that computes the delay of the timing
engine for a given RC-delay and input slew. Similarly, wireslew computes the slew
for a given RC-delay and input slew.
3.2 The Repeater Tree Problem
The Repeater Tree Problem is the task of computing a repeater tree for a given
repeater tree instance. We first define what a repeater tree is:
Definition 2. A repeater tree R for a Repeater Tree Problem instance is a
tuple (T, P l, Rt , RW ). It consists of
• an arborescence T = (V(T), E(T)) with V(T) = {r} ∪̇ S ∪̇ Ir ∪̇ Is rooted at r, leaves S, and inner nodes Ir ∪ Is (nodes corresponding to repeaters and Steiner nodes, respectively),
• an embedding of the nodes into the plane P l : V (T ) → R2 ,
• a repeater assignment function Rt : V (T ) → L ∪ {∅} with Rt (v) ∈ L iff v ∈ Ir ,
and
• a wiring mode assignment function RW : E(T ) → W .
The root and leaves of a repeater tree are the root and sinks of the corresponding
repeater tree instance. When a repeater tree is inserted into a chip design, for each
node v ∈ Ir , a new repeater gate with type Rt (v) is created and placed at P l(v).
Root, sinks, and new repeaters are connected with nets that consist of the edges
between the corresponding nodes. The nodes Is and their incident edges determine
the topology of the nets. There is exactly one net for each node in {r} ∪ Ir .
3.2.1 Repeater Tree Timing
Given a repeater tree R = (T, P l, Rt , RW ) for a repeater tree instance, we have to
compute the arrival times and slews at the sinks to compare them with required
arrival times. To achieve this, we have to extract the nets between the involved pins
and compute delays over repeaters and nets.
For each node v ∈ V (T ), let Tv be the maximal subtree rooted at v such that all
its inner nodes are in Is . We say Tv is a net iff v ∈ {r} ∪ Ir . The sinks of the net
are the leaves of Tv .
Given a sink a ∈ V (Tv ) for a net rooted at v, let p(a) := v be the root node of
the net. On the one hand, each repeater node v ∈ Ir is root of the net Tv , and on
the other hand, it is a sink in the net Tp(v) .
We first compute the Elmore delay to each net’s sink. Given an edge (v, w), its
capacitance and resistance are
cap((v, w)) := wirecap(RW((v, w))) · ||Pl(v) − Pl(w)||
res((v, w)) := wireres(RW((v, w))) · ||Pl(v) − Pl(w)||.
Given an edge (w, y) ∈ E(Tv ), the capacitance visible downwards is the sum of
all edge capacitances in the subtree rooted at y and the input pin capacitances of
reachable sinks:
downcap((w, y)) := Σ_{(a,b)∈E(Ty)} cap((a, b)) + Σ_{a∈V(Ty), δ+(a)=0} capin(R(a)).
For the root node and the internal repeaters, the load capacitance is the sum of
visible capacitances:
load : Ir ∪ {r} → R
load(x) = capout(x) + Σ_{e∈δ+(x)} downcap(e).
Now we can compute the Elmore delay rc for each sink:
rc : Ir ∪ S → R
rc(x) = Σ_{e∈E(Tp(x))[p(x),x]} res(e) · ( cap(e)/2 + downcap(e) ),

where E(Tp(x))[p(x),x] denotes the set of edges on the path from p(x) to x.
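The extraction above can be sketched on a toy net r → x → {s1, s2}, where x is a Steiner node (all capacitance and resistance values are illustrative, and the driver's capout is taken as 0 for brevity):

```python
children = {"r": ["x"], "x": ["s1", "s2"]}
cap = {("r", "x"): 2.0, ("x", "s1"): 1.0, ("x", "s2"): 3.0}   # edge caps
res = {("r", "x"): 1.0, ("x", "s1"): 2.0, ("x", "s2"): 1.0}   # edge res
cap_in = {"s1": 0.5, "s2": 1.5}   # input pin capacitances of the sinks

def downcap(edge):
    # capacitance visible below the edge: subtree wire caps + sink pin caps
    _, y = edge
    total = cap_in.get(y, 0.0)
    for c in children.get(y, []):
        total += cap[(y, c)] + downcap((y, c))
    return total

def rc(path):
    # Elmore delay along a root-sink path, given as a list of edges
    return sum(res[e] * (cap[e] / 2.0 + downcap(e)) for e in path)

load_r = sum(cap[("r", c)] + downcap(("r", c)) for c in children["r"])
rc_s1 = rc([("r", "x"), ("x", "s1")])
```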
The next step is to propagate the slews and the arrival times from the root to the
instance sinks. We define the slew recursively distinguishing between the input slew,
slewi , for input pins and output slew, slewo , for output pins:
slewo : Ir ∪ {r} → R²
slewo(v) = slewr(load(v)) if v = r, and slewo(v) = slewR(v)(load(v), slewi(v)) if v ≠ r

slewi : Ir ∪ S → R²
slewi(v) = wireslew(rc(v), slewo(p(v))).
In a similar way, the arrival times at input pins (ati) and output pins (ato) are defined as

ato : Ir ∪ {r} → R²
ato(v) = atr(load(v)) if v = r, and ato(v) = updateR(v)(ati(v), load(v), slewi(v)) if v ≠ r

ati : Ir ∪ S → R²
ati(v) = ato(p(v)) + wiredelay(rc(v), slewo(p(v))).
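This top-down propagation can be sketched on a chain r → b → s with one repeater b and one sink s. All timing functions below are toy stand-ins, rise/fall pairs are collapsed to scalars for brevity, and the repeater's arrival-time update is simplified to "input arrival plus a toy stage delay":

```python
rc_of = {"b": 2.0, "s": 3.0}    # Elmore delay from each net's root to its sink
parent = {"b": "r", "s": "b"}
load_of = {"r": 4.0, "b": 5.0}  # load capacitances of the two nets

def wiredelay(rc, s): return rc * (1.0 + 0.2 * s)
def wireslew(rc, s):  return s + 0.5 * rc
def slew_root(load):  return 0.1 * load            # root slew function
def slew_rep(load, s_in): return 0.05 * load + 0.3 * s_in
def at_root(load):    return 0.2 * load            # root arrival function
def delay_rep(load, s_in): return 1.0 + 0.1 * load + 0.2 * s_in

def slew_o(v):   # output slew at the driver of node v's net
    if v == "r":
        return slew_root(load_of[v])
    return slew_rep(load_of[v], slew_i(v))

def slew_i(v):   # input slew at node v
    return wireslew(rc_of[v], slew_o(parent[v]))

def at_o(v):     # arrival time at the output pin of v
    if v == "r":
        return at_root(load_of[v])
    return at_i(v) + delay_rep(load_of[v], slew_i(v))

def at_i(v):     # arrival time at the input pin of v
    return at_o(parent[v]) + wiredelay(rc_of[v], slew_o(parent[v]))

slew_sink, at_sink = slew_i("s"), at_i("s")
```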
The slack of the repeater tree is now
slack(T) := min_{s∈S} min{ rat_s^r(slew_i^r(s)) − at_i^r(s), rat_s^f(slew_i^f(s)) − at_i^f(s) }.
We also define the static power consumption of the tree
power(T) := Σ_{v∈Ir} pwr(R(v))
and its length
length(T) := Σ_{(v,w)∈E(T)} ||Pl(v) − Pl(w)||.
The length roughly correlates with the dynamic power consumption of a repeater
tree.
3.2.2 Feasible solutions
A repeater tree is feasible if internal nodes Ir are legally placed and connected in
such a way that the signals arrive at each sink with the correct parity. Repeaters
are placed legally if their position is not marked as blocked in the blockage map:
Pl(v) ∉ A   ∀v ∈ Ir.
Note that we ignore overlaps between repeaters and other gates.
Furthermore, we have to obey capacitance and slew limits everywhere:
load(r) ≤ loadlim(r)
load(v) ≤ loadlim(R(v))   ∀v ∈ Ir
slewi(v) ≤ slewlim(R(v))   ∀v ∈ Ir
slewi(s) ≤ slewlim(s)   ∀s ∈ S.
The timing of the repeater tree is feasible if slack(T ) ≥ 0.
3.2.3 Objectives
Among the repeater trees satisfying the above conditions, one typically searches for a tree with the smallest power consumption.
In practice, however, it is often not possible to achieve a positive slack. It might
also not be possible to get a solution that has no electrical violations. In such
cases, our first objective is to minimize the sum of electrical violations. The second
objective is to maximize min{0, slack} followed by minimizing power. In addition,
we seek solutions minimizing the use of wiring resources.
It is often desirable to balance between the two main objectives, timing and
wirelength. To this end, we introduce a parameter ξ ∈ [0, 1] indicating how timing-critical a given instance is. For ξ = 1, we primarily optimize the worst slack, and we optimize wirelength for ξ = 0. We consider the other objective only in case of ties.
In practice, however, we mainly use values of ξ that are strictly between 0 and 1.
The Repeater Tree Problem is NP-hard because it contains the Steiner
Minimum Tree Problem if one just wants to minimize ℓ1-netlength (see Garey and Johnson (1977)). In addition, delay and slew functions are non-linear, leading to further difficulties.
A good overview of existing approaches to solve the Repeater Tree Problem
can be found in Alpert et al. (2008), Chapter 24–28. For an older discussion see
Cong et al. (1996).
3.3 Our Repeater Tree Algorithm
We present our algorithm for the Repeater Tree Problem, which we call Fast Buffering, in the next chapters. Most repeater tree instances are built in the same environment with the same repeater library, blockages, global routing, wiring modes, and timing functions. We therefore spend some time preprocessing the environment and computing parameters that allow us to perform further steps efficiently. The
preprocessing step is explained in Chapter 4.
A common approach to the Repeater Tree Problem is to divide it into two
steps, Steiner tree generation (also called topology generation) and repeater insertion
(also called buffering). Our algorithm takes the same route. A main reason is that
a) for some applications the topology is fixed and we only have to do repeater
insertion, and b) other applications only need a timing-aware topology. Thus, our
topology generation algorithm can be used for different applications, for example, the
optimization of symmetric fan-in trees². Similarly, our buffering algorithm can work
on topologies from other sources (e.g. routing). The division allows us to exchange
one algorithm without touching the other. The algorithms are independent of each other, but, on the one hand, we already consider expected results from repeater
insertion during topology creation by using a delay model for estimation that tightly
matches buffering results³, and, on the other hand, we allow our buffering algorithm
to modify the input topology if it is suitable.
Chapter 5 explains how we create topologies, and in Chapter 6 we show how we
buffer them.
² Symmetric fan-in trees compute symmetric functions with n inputs. They are the reverse of repeater trees: signals from n sources are merged into a single output.
³ See Figure 5.3 and the surrounding discussion.
4 Instance Preprocessing
4.1 Analysis of Library and Wires
We compute some auxiliary data and parameters in advance, which are then used for
all instances of a design. It is possible to precompute parameters for a single global
optimization run because the environment does not change and because technology,
library, and wire types are the same for a lot of instances.
On the other hand, it is not possible to precompute useful data between several
runs because the environment changes too often. Due to different timing rules,
voltages, and temperatures between different optimization runs, it is necessary to
recompute the parameters even if the basic technology or library did not change.
The main goal is to identify some parameters that allow us to estimate the timing
of a repeater tree based on the Steiner tree. In addition, we compute some values
that guide us in the buffering step of a particular instance.
4.1.1 Estimating two-pin connections
Figure 4.1 shows the delay over a two-pin connection depending on its length
after buffering it in an approximately delay-minimal way (see Section 6.3). The
experiment was done with a 22 nm chip design using default planes and wire widths.
We see how repeater insertion linearizes the delay, which would otherwise be quadratic. The red line in the figure shows a linear approximation of the delay
function between the two inverters. The slope of the approximation is dwire . Given
this approximation, we can predict the delay for two-pin nets after buffering:
delay = dwire · length.   (4.1)
While there are closed-form solutions for buffering two-pin nets¹, we do not use
them for approximating dwire because they rely on simplifications like Elmore delay
and do not capture all environmental parameters. Instead, we search for the best
way of buffering long two-pin nets by implementing it in the design and using the
timing engine to calculate delays and slews. This way, we capture all effects that
affect timing.
We now show how we compute the constant dwire . To bridge large distances
of wire in wiring mode w using repeater t, we partition the wire equidistantly by
adding a repeater after l units of wire.
¹ See for example Alpert et al. (2008), p. 536f.
Figure 4.1: Two medium-sized inverters are placed at a given distance. The net between
them is then buffered with the highest effort. The graph shows the resulting delay
depending on the distance (black). The red line shows the linear approximation
that we compute for the delay function.
To measure the delay over the line, we add two repeaters of type t into the design
and connect them by a net such that the length of the net is l. We modify the pin
capacitance at the end of the line to be c. At the input of the first repeater, the slew
pair sin is asserted. The whole setup guarantees that most global timing parameters
are considered. Local timing parameters (e.g. coupling capacitance) are ignored.
Let d(t, sin , l, w, c), sout (t, sin , l, w, c) and p(t, sin , l, w, c) be the total delays over
the stage (through repeater and wire), the slews at the other end of the wire, and
the power consumption, respectively. We assume that all values are infinite if a load
limit or a slew limit is violated.
Let now s0 be a reasonable slew pair (we just use the minimum allowed slews for
t). We define
si+1 := sout(t, si, l, w, c)   (i ≥ 0).   (4.2)
The sequence (si )i=1,2,3,... typically converges very fast to a fixpoint or quickly
becomes ∞ due to an electrical violation. We call s∞ (t, l, w, c) := limi→∞ si the
stationary slews of (t, l, w, c). In practice, we iterate over si until we reach a fixpoint
due to the limited precision of the floating-point numbers used to represent slews.
Typically, the fixpoint is reached within ten iterations. We then use the computed
value as an approximation of s∞ (t, l, w, c). There is only a single fixpoint because
the slew function is typically contractive over a whole stage. Thus, the choice of
the initial slews does not matter as long as it is within the domain of the slew and
delay functions.
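The iteration (4.2) can be sketched with a toy contractive stage slew function standing in for the timing engine's sout; the fixpoint it converges to plays the role of the stationary slew:

```python
def s_out(s):
    # toy contractive stage slew function: Lipschitz constant 0.4 < 1,
    # so the unique fixpoint solves s = 0.4*s + 3, i.e. s = 5
    return 0.4 * s + 3.0

s = 0.5                       # some reasonable initial slew
for _ in range(100):          # safety bound; convergence is geometric
    s_next = s_out(s)
    if abs(s_next - s) < 1e-12:
        break                 # fixpoint reached within float precision
    s = s_next
```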
One might think that inverters need special treatment, because the rising (falling)
output slew does not depend on the rising (falling) input slew. Instead, the signal is
inverted internally. The stationary slews might only be achieved after propagating
over two stages in the inverter chain. There might also be different stationary slews
for odd and even numbers of stages respectively. Fortunately, one can easily show
that one has only to deal with a single stationary slew pair.
Figure 4.2: An endless chain of equidistantly distributed inverters. The slews alternate between rise and fall. Although a given rise value is not used in the computation of the following one (for example, s_i^r and s_{i+1}^r), consecutive rise values converge to a fixpoint.
Figure 4.2 shows an infinite line of equidistantly distributed inverters and the slews at the input pins. The lines below show how slews are propagated. For example, the rising slew s_1^r is propagated to the falling slew s_2^f. Let rf be the rise-fall slew function for a whole stage² of the chain and fr the fall-rise slew function. As both functions are contractive, their compositions are also contractive.
Within the chain of inverters the even and odd slews form separate sequences
with
s_{i+2} = ( fr(rf(s_i^r)), rf(fr(s_i^f)) ).
The blue and red lines indicate both sequences in the figure. The sequences only
differ by the starting point as the even sequence starts with minimum allowed slews
and the odd sequence starts with s1 . Because contractive functions only have a single
fixpoint, both sequences converge to it. This means that the combined
sequence also converges to the fixpoint. Therefore, it suffices to consider a single
stage, not only for buffers but also for inverters.
Due to the asymmetric nature of the delay and slew functions for the rising and
falling slews, we take the average for further processing and abbreviate:
d(t,l,w,c) := ½ [ d(t, s∞(t,l,w,c), l, w, c)^r + d(t, s∞(t,l,w,c), l, w, c)^f ]
s(t,l,w,c) := ½ [ s∞(t,l,w,c)^r + s∞(t,l,w,c)^f ]
p(t,l,w,c) := p(t, s∞(t,l,w,c), l, w, c).
For a given wiring mode, a repeater, and a length, we can now compute the delay
² The slew function for a whole stage combines the slew calculation from the input of a repeater through the repeater and the following net up to the input of the next repeater.
per unit distance and power consumption per unit distance
d̄(t, l, w) := d(t, l, w, capin(t)) / l
p̄(t, l, w) := p(t, l, w, capin(t)) / l

and the corresponding stationary slews

s̄(t, l, w) := s∞(t, l, w, capin(t)).
Figure 4.3 shows how the delay per unit distance typically behaves depending
on the length between two consecutive repeaters. The delay is dominated by the
repeater delay for small distances. The overall delay decreases until a delay-optimal
distance is reached. For larger distances the wire delay begins to dominate. The
curve ends as soon as the slews or loads create electrical violations. It is not shown
in the figure, but stationary slews increase monotonically with the length.
Figure 4.3: A typical curve showing d̄(t,l,w) for a given repeater t and wiring mode w over the range of valid lengths l (delay per unit distance against repeater spacing). The curve shown is from a medium-sized inverter from a 22 nm design on the third metal layer using the smallest wiring mode.
Using the same repeater and wiring mode as in the previous figure, we can see in
Figure 4.4 how power per unit distance and delay per unit distance relate to each
other. For small distances (upper right endpoint of the curve) the power consumption is high due to the large number of repeaters needed. Power consumption and delay
decrease with larger distances until we reach the optimal distance. Further power
reductions cause higher delays. The red points show possible stage lengths that
Figure 4.4: A typical curve showing (d̄(t,l,w), p̄(t,l,w)) for a given repeater t and wiring mode w parametrized over the range of valid lengths l (power per unit distance against delay per unit distance). The marked points are the minimum repeater spacing, the fastest spacing, and the maximum feasible spacing. The curve is generated for the same repeater as in Figure 4.3.
are not dominated by distances that result in cheaper configurations with the same
delay or faster configurations with the same power consumption.
We choose for each wiring mode w a repeater t∗w ∈ L and a length l∗w ∈ R>0 which minimize the linear combination

ξ·d̄(t∗w, l∗w, w) + (1 − ξ)·p̄(t∗w, l∗w, w).   (4.3)

If two different choices for t∗w and l∗w minimize the expression, then we choose the faster one. We call the parameter ξ the power-time tradeoff. For library analysis, we
choose ξ = 1 to calculate a lower bound on the achievable wire delay. The minima
can then be found by binary search over all lengths for each repeater type. The
functions we have to minimize are similar to the one shown in Figure 4.3.
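A sketch of this per-wiring-mode selection of (t∗w, l∗w) follows. The d_bar and p_bar curves below are toy stand-ins for the measured per-unit-distance values, and for simplicity the sketch sweeps a grid of candidate lengths instead of the binary search mentioned above:

```python
def d_bar(t, l):
    # toy delay per unit distance: repeater term + wire term
    return t["d0"] / l + t["k"] * l

def p_bar(t, l):
    # toy power per unit distance: larger spacing, fewer repeaters
    return t["p0"] / l

repeaters = [{"name": "invA", "d0": 4.0, "k": 0.010, "p0": 2.0},
             {"name": "invB", "d0": 9.0, "k": 0.005, "p0": 1.0}]
xi = 1.0   # library analysis: pure delay optimization

best = min((xi * d_bar(t, l) + (1 - xi) * p_bar(t, l), t["name"], l)
           for t in repeaters
           for l in range(10, 501, 10))   # candidate stage lengths
cost, t_star, l_star = best
```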
4.1.2 Parameter dwire
For delay estimation in topology generation, we do not want to distinguish between
horizontal and vertical wire segments, because we often do not want to fix the exact
embedding of path segments. Therefore, we use the default wiring modes wh∗ and
wv∗ to build an average delay value. Typically, both wiring modes have similar
electrical properties such that the resulting value is not far away from both. We
choose optimal repeater t∗ and length l∗ such that
ξ·(d̄(t∗, l∗, wh∗) + d̄(t∗, l∗, wv∗)) + (1 − ξ)·(p̄(t∗, l∗, wh∗) + p̄(t∗, l∗, wv∗))
is minimized. The resulting delay per unit distance
dwire := ( d̄(t∗, l∗, wh∗) + d̄(t∗, l∗, wv∗) ) / 2   (4.4)
is the parameter we searched for. It will be used for delay estimation during topology
generation (see Equation 4.1).
We call the stationary slew corresponding to the repeater and length choice
optslew:
optslew := ( s̄(t∗, l∗, wh∗) + s̄(t∗, l∗, wv∗) ) / 2.
The average capacitance over a stage is called maxcap:
maxcap := ( (wirecap(wh∗) + wirecap(wv∗)) / 2 ) · l∗ + capin(t∗).
4.1.3 Buffering Modes
As indicated in the previous section, we allow diagonal segments in our Steiner trees
such that it is not clear where we will eventually use horizontal or vertical wiring
segments. Thus, we assign a buffering mode that approximates the properties of
a horizontal and a vertical wiring mode to each segment. Buffering modes also
represent the effort we want to put into buffering of a segment.
A buffering mode m is a 3-tuple (mh , mv , mξ ) that consists of
• a horizontal wiring mode mh ∈ W ,
• a vertical wiring mode mv ∈ W , and
• a power-time tradeoff mξ .
During buffering of a wire segment, we will try to replicate long-distance chains.
The distance between repeaters in a chain buffered with a given mode is determined
by the target repeater and the slew targets. The stationary slews of the chain will
be slew targets.
We have to determine the set of buffering modes that we want to work with. We
assume that there is a set Wp ⊆ W × W of wiring mode pairs. Each pair consists of
a horizontal wiring mode and a vertical wiring mode. We also restrict ourselves to a
set Ξ of power-time-tradeoffs between 0.0 and 1.0.
Given a wiring mode pair (wh , wv ) with horizontal wiring mode wh ∈ W and
vertical wiring mode wv ∈ W and a power-time-tradeoff ξ ∈ Ξ, we define a buffering
mode (wh , wv , ξ).
For each buffering mode, we find the optimal repeater t ∈ L and distance l ∈ R>0
minimizing
ξ·(d̄(t, l, wh) + d̄(t, l, wv)) + (1 − ξ)·(p̄(t, l, wh) + p̄(t, l, wv)).
The optimal repeater for buffering mode m is called mt. The corresponding slew targets are

ms := ( s̄(t, l, wh) + s̄(t, l, wv) ) / 2.
The delay of a buffering mode m is defined as
md := ( d̄(t, l, wh) + d̄(t, l, wv) ) / 2.
The power consumption per length unit is
mp := ( p̄(t, l, wh) + p̄(t, l, wv) ) / 2.
The capacitance of a stage is
mcap := ( (wirecap(wh) + wirecap(wv)) / 2 ) · l + capin(t).
The average capacitance per unit length wire is
mwirecap := ( wirecap(wh) + wirecap(wv) ) / 2.
We use md and mp to estimate the delay and power consumption of an edge that is buffered using mode m.
In practice, the user creates the set Wp of reasonable wiring mode pairs. Typically, for each wiring mode pair in Wp, the horizontal and the vertical wiring mode have similar widths and spacings and lie on neighboring planes. In such a case, the delay,
power, and slew values do not differ significantly between the wiring modes of a
pair such that using the averages is not too far off. The set Ξ is also defined by the
user, but, in practice, we only use two tradeoffs: 0 and ξ. The default wiring modes
are always in Wp . Thus, there is always a buffering mode available with delay dwire
(compare to Section 4.1.2):
m∗ := (wh∗, wv∗, ξ).
Finally, given Wp , we can determine the set M of buffering modes that we will
use for buffering:
M := {(wh , wv , ξ) | (wh , wv ) ∈ Wp , ξ ∈ Ξ}.
For each buffering mode m, there is a set of alternative buffering modes Mm
containing all buffering modes with the same horizontal and vertical wiring modes
as m including m itself. The alternative buffering modes only differ by the ξ value.
We assume that Mm only contains non-dominated buffering modes. A buffering
mode dominates another one if it is at the same time not slower and not more
expensive than the other one.
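The dominance filter on alternative buffering modes can be sketched as follows; mode names and their (md, mp) values are illustrative:

```python
modes = [("fast", 1.0, 5.0),   # (name, delay/unit md, power/unit mp)
         ("mid",  1.5, 3.0),
         ("slow", 2.0, 2.0),
         ("bad",  1.8, 3.5)]   # dominated by "mid": slower and more expensive

def dominates(a, b):
    # a dominates b if a is simultaneously not slower and not more expensive
    return a is not b and a[1] <= b[1] and a[2] <= b[2]

nondominated = [m for m in modes if not any(dominates(o, m) for o in modes)]
```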
4.1.4 Slew Parameters
As described in Section 3.1, different slews at the input pins of repeater tree subtrees
have different effects on the downstream delays. To account for this where we do
not have an explicit RAT function, we introduce a parameter ν that translates slew
differences to delay differences (see also Vygen (2006)).
Let t∗ ∈ L and l∗ ∈ R>0 be the repeater and length that minimize Equation 4.4.
We define the slew pair of this optimal chain using only one of the default wiring
modes
sopt := s¯(t∗ , l∗ , wh∗ ).
We now compute

d1 := Σ_{i=0}^{N} d(t∗, si, l∗, wh∗, capin(t∗))

using the following slews:

s0 := sopt
si := sout(t∗, si−1, l∗, wh∗, capin(t∗))   (i ≥ 1).
In a second step, we compute d2 analogously to d1 by starting with s0 := 2·sopt. In practice, using 2·sopt will not lead to a violation. We set the desired parameter to

ν := (d2 − d1) / sopt.
The number N is chosen such that the stationary slew is reached for both computations. The parameter depends on the timing environment and technology. Typically, it lies between 0.10 and 0.25. We use it to define
slewdelay(s) := ν · (s − starget )
for a given target slew starget. The function slewdelay is used for two similar tasks:
1. As discussed earlier, required arrival times are associated with individual slew
requirements at sinks. To better compare RATs, we normalize them to a target
slew. For example, if a sink has the slew requirement s, then we translate the
RAT to starget by adding slewdelay(s) to it.
2. If a signal with slew s arrives at a sink with slew target starget, then we add slewdelay(s) to the arrival time before we compute the slack of the signal.
Both happen only in our buffering algorithm based on dynamic programming (see Section 6.3).
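A minimal sketch of this normalization in Python; ν and the target slew are the quantities measured above, while the function names are ours:

```python
def make_slewdelay(nu, s_target):
    """slewdelay(s) = nu * (s - s_target): converts a slew difference at
    a pin into an equivalent delay difference (nu typically 0.10-0.25)."""
    def slewdelay(s):
        return nu * (s - s_target)
    return slewdelay

def normalized_rat(rat, slew_requirement, slewdelay):
    """Task 1: translate a RAT given at `slew_requirement` to the target
    slew so that RATs of different sinks become comparable."""
    return rat + slewdelay(slew_requirement)
```

A sink that tolerates a larger slew than the target effectively gains required arrival time; a sink with a stricter slew requirement loses some.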
4.1.5 Sinkdelay
Consider two inverters connected by a net and placed far apart (as described in
Section 4.1.1). We now add repeaters to the line such that the delay between the
inputs of the inverters is minimized.
Figure 4.5: Two inverters are placed at a given distance. The net between them is buffered
with the highest effort. The graph shows the resulting delays for the same source
but three different sink capacitances at the end of the net: a small inverter
(green), a medium sized inverter (red), and a huge inverter (black).
Figure 4.5 shows the resulting delays for different distances between the boundary
inverters and for different capacitances at the end of the chain due to the different
inverter sizes. The difference between the delays remains nearly constant for longer
distances. We choose a distance where the delay differences are significant enough
and compute for different sink pin capacitances the resulting delay. Let d0 be the
delay of the chain if the capacitance at the sink is the input pin capacitance of t∗ ,
the optimal repeater minimizing Equation 4.4. For a given capacitance c at the end
of the chain and the resulting delay dc , we define
sinkdelay(c) := dc − d0
This function is used to estimate the delay difference on repeater chains due to
different sink capacitance compared to the optimal repeater.
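In an implementation, sinkdelay can be tabulated from a few measured chain delays and interpolated in between. A sketch under that assumption (the measurement data in the test is made up):

```python
import bisect

def make_sinkdelay(measurements, d0):
    """Build sinkdelay(c) = d_c - d_0 from measured (capacitance, delay)
    pairs of the fully buffered chain; linear interpolation in between,
    clamping outside the measured range."""
    points = sorted(measurements)
    caps = [c for c, _ in points]
    deltas = [d - d0 for _, d in points]

    def sinkdelay(c):
        i = bisect.bisect_left(caps, c)
        if i == 0:
            return deltas[0]
        if i == len(caps):
            return deltas[-1]
        c0, c1 = caps[i - 1], caps[i]
        t = (c - c0) / (c1 - c0)
        return deltas[i - 1] + t * (deltas[i] - deltas[i - 1])

    return sinkdelay
```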
Figure 4.6: Two inverters are placed at a given distance. The net between them is buffered
with the highest effort. The graph shows the resulting delays for the same sink
capacitance but three different sources: a small inverter (green), a medium sized
inverter (red), and a huge inverter (black).
Figure 4.6 shows the effects of changing the first gate instead of the last one in
our setup. Stronger (weaker) gates result in smaller (larger) delays on the chain.
The delay differences are also nearly constant at larger distances. In contrast to
delays introduced by sink capacitances, we do not introduce a function to estimate
the delays. Instead, we evaluate the root arrival time for the capacitance of the
optimal chain.
4.1.6 Further Preprocessing
Next, we compute a parameter dnode that is used during topology generation to
model the extra delay to be expected along a path due to additional capacitances
induced by a side branch. We determine it by adding a small repeater to the repeater
chain and measuring the additional delay.
Let inv(c, s) denote the inverter with the smallest power consumption that still
achieves a slew of at most s at its output pin if its input slew is sopt and the load is
c.
Let t1 := inv(maxcap, sopt ), and let t2 := inv(capin (t1 ), sopt ). We compute d1 as
in Section 4.1.4. Now we modify the repeater chain by adding a capacitance load of
capin (t1 ) in the middle of the first segment and consider it during the delay and slew
computation of the first stage. We then get the new delay d2 and set our branch
penalty to
dnode := d2 − d1
4.2 Blockage Map and Congestion Map
Placement blockages and wiring congestion information is given to the repeater
tree routine via a blockage map and a congestion map, respectively. The blockage
map is used to check whether a given point is blocked. The congestion map holds
the global routing information showing on which parts of the chip routing space is
sparse.
4.2.1 Grid
Both the blockage map and the congestion map share the same grid. The grid
partitions the chip area into tiles. The tiles are nodes of a grid graph.
The bounding box ca of the design area is given by
[caminx , camaxx ] × [caminy , camaxy ].
Definition 3 (Grid). A grid is a pair (xlines, ylines) of cutlines xlines = {x0 , x1 , . . . , xm } and ylines = {y0 , y1 , . . . , yn } with x0 < x1 < . . . < xm and y0 < y1 < . . . < yn and m > 1 and n > 1. We call a grid feasible for a chip area ca if x0 ≤ caminx , camaxx < xm , y0 ≤ caminy , and camaxy < yn .
Most of the time, we use a grid with equidistant cutlines. An equidistant grid is
accurate enough in the context of repeater tree insertion if it is not spaced too wide.
Sometimes, however, one wants to use a Hanan grid given by the coordinates of all
edges of significantly large blockages such that their positions are exactly captured
in the blockage map.
Definition 4 (Tile). Given a grid (xlines, ylines) with xlines = {x0 , x1 , . . . , xm }
and ylines = {y0 , y1 , . . . , yn }, we call the rectangle [xi , xi+1 ) × [yj , yj+1 ) for 0 ≤ i <
m and 0 ≤ j < n the tile(i, j) of the grid.
For a given point of the chip area, there is exactly one tile in a feasible grid that
contains the point.
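Locating that tile reduces to two binary searches over the cutlines; a small sketch:

```python
import bisect

def tile_of(point, xlines, ylines):
    """Return (i, j) with point in [x_i, x_{i+1}) x [y_j, y_{j+1})."""
    x, y = point
    i = bisect.bisect_right(xlines, x) - 1
    j = bisect.bisect_right(ylines, y) - 1
    if not (0 <= i < len(xlines) - 1 and 0 <= j < len(ylines) - 1):
        raise ValueError("point outside the grid")
    return i, j
```

Using `bisect_right` makes the tiles half-open exactly as in Definition 4: a point on a cutline belongs to the tile to its right/top.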
4.2.2 Blockage Map
The set of blocked regions for a repeater tree instance is stored in a data structure
that we call blockage map.
The most important operation that is performed on the blockage map is searching
for the nearest free location. Given a point in the plane, the blockage map is able
to give us the nearest free location in a given direction rectilinear to the grid or
the nearest free location in the whole plane with respect to the ℓ1-metric. Points on
blockage boundaries that are next to free points are considered as free.
4.2.3 Blockage Grid
For an existing blockage map and a grid, we also construct a blockage grid. The
blockage grid stores information whether a grid tile is blocked or not:
Definition 5 (Blockage Grid). Given a grid (xlines, ylines) with
xlines = {x0 , x1 , . . . , xm } and
ylines = {y0 , y1 , . . . , yn }
and a blockage map, a blockage grid is a function
bg : {0, . . . , m − 1} × {0, . . . , n − 1} → {0, 1}
where bg(x, y) = 1 iff tile(x, y) is completely blocked by the blockages of the map.
Shortest Path Searches
Typically, blockages do not block all wiring layers in a design. It is possible to
cross them on higher layers. Repeater trees are also allowed to jump over blockages.
However, the possible distance is limited by the slew and capacitance limits as it is
not possible to place repeaters on blockages. Larger distances between repeaters
caused by jumping over blockages also cost additional delay compared to an optimally
spaced repeater chain.
At one step in our topology generation algorithm, we search for delay minimal
paths between points in the design. We use a modified version of Dijkstra’s shortest
path algorithm on the blockage grid for this task.
Given two points, we first identify the tiles they belong to in the blockage grid
and then compute a shortest path between both tiles. The costs of an edge between
two neighboring tiles depends on whether the tiles are blocked or not. Crossing
unblocked space costs proportional to dwire . Costs over blocked area increase first
linearly and after a threshold quadratically with the distance the path already went
over blockages.
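A sketch of such a search on a unit-tile blockage grid (the cost constants and the threshold are placeholders, not the tuned values): the Dijkstra state is extended by the distance the path has already travelled over blocked tiles, so the penalty can grow with the length of the jump.

```python
import heapq

def shortest_path_cost(blocked, src, dst, d_wire=1.0, threshold=3):
    """Dijkstra on the blockage grid. Free steps cost d_wire; a step over
    blocked tiles is penalized linearly in the current blocked run and
    quadratically once the run exceeds `threshold`. State = (tile, run),
    where run is the length of the current run over blocked tiles.
    src and dst are assumed to lie on free tiles."""
    rows, cols = len(blocked), len(blocked[0])

    def step_cost(run):                  # run includes the current step
        if run <= threshold:
            return d_wire * (1 + run)    # linear growth
        return d_wire * (1 + run * run)  # quadratic growth past threshold

    dist = {(src, 0): 0.0}
    heap = [(0.0, src, 0)]
    while heap:
        c, (i, j), run = heapq.heappop(heap)
        if (i, j) == dst:
            return c
        if c > dist.get(((i, j), run), float("inf")):
            continue
        for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if not (0 <= ni < rows and 0 <= nj < cols):
                continue
            if blocked[ni][nj]:
                nrun = run + 1
                nc = c + step_cost(nrun)
            else:
                nrun, nc = 0, c + d_wire  # run resets on free tiles
            key = ((ni, nj), nrun)
            if nc < dist.get(key, float("inf")):
                dist[key] = nc
                heapq.heappush(heap, (nc, (ni, nj), nrun))
    return float("inf")
```

Because the run length resets on every free tile, short hops over narrow blockages stay cheap while long jumps quickly become unattractive.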
Figure 4.7: Blockage map (red) and grid (blue) on the design Julius.
4.2.4 Congestion Map
We implemented a rough global routing engine as congestion map. In contrast to a
full-fledged global router, the congestion map does not try to find a congestion-free global routing solution at all costs. Instead, we embed a short ℓ1-tree allowing
only small detours. We also limit the number of iterations spent for improving the
routing. The advantage is that we still see congestion that we then can try to avoid
during repeater tree generation. As can be seen in Table 7.2, using a full global
router would increase the running time of our algorithm significantly. Our algorithm
has recently been integrated into BonnRouteGlobal as a fast mode.
5 Topology Generation
The first step in our repeater tree algorithm is the construction of a repeater topology.
A repeater topology specifies the abstract geometric structure of the repeater tree.
Given an instance of the Repeater Tree Problem with root r and sink set S,
we can define:
Definition 6. A topology T = (V (T ), E(T )) with V (T ) = {r} ∪˙ S ∪˙ I is an
arborescence rooted at r with an embedding P l : V (T ) → R2 of the nodes into
the plane such that r has exactly one child, the internal nodes I have one or two
children each, and the sinks S are the leaves.
Figure 5.1: A topology for one root and three sinks. Steiner points like a and b are used to
route the topology around obstacles. Although we do not use directed edges in
our figures, the edges in a topology are always directed away from the root.
We often call the set I of internal points Steiner points. Internal points with only
a single child are used to force the topology to pass a certain point in the plane.
Figure 5.1 shows an example topology. We should clearly note that the internal
nodes do not represent repeaters and that the topology does not specify details
about the exact placement, routing, and types of repeaters used in the final tree.
The length of a topology is ∑_{(v,w)∈E(T)} ||Pl(v) − Pl(w)||.
We have seen that after repeater insertion delays in a repeater tree are roughly
linear in the length of the segments. Connecting a sink to the root via a long
path results in a higher delay to a sink. The required arrival times at a sink then
decide whether a path is fine or too long. Consider the example in Figure 5.2. Both
topologies have the same length, but, for example, the distance to the root is 8
for the upper right sink in the first case and 2 in the second one. This example
Figure 5.2: Two topologies with the same shortest possible length but different timing
behaviour. While all sinks are reached within 4 segments on the right side, it
takes up to 8 segments to the furthest sink on the left side.
illustrates that topologies have a high influence on the timing of a repeater tree. It
is therefore crucial to build timing-aware repeater trees.
Topologies for a root and a set of sinks that consider timing information do not
only have an application in repeater tree construction. They can also prove to be
useful in global routing. A global router internally often has to compute Steiner
trees for the nets of a design. The routing result can be better with regard to timing
if the Steiner trees are timing-aware topologies.
In this chapter, we first develop a way to estimate the timing of a topology and
then state the Repeater Tree Topology Problem. We show how our algorithm
solves the problem and prove some theoretical properties for restricted versions of
our algorithm.
The results in this chapter are joint work with Stephan Held, Jens Maßberg,
Dieter Rautenbach and Jens Vygen (Bartoschek et al., 2007a, 2010).
5.1 A Simple Delay Model
Since we want to evaluate the properties of our topologies with respect to timing, we
somehow have to compute a slack at root and sinks. It would be prohibitively slow
to insert repeaters into each topology we want to evaluate. Therefore, we propose a
simple delay model that estimates the timing from the geometric structure of the
topology. The delay model will compute arrival times and required arrival times for
all nodes of a topology giving us a slack that can be used to evaluate the topology.
The delay model mainly consists of two components: delay over wire segments and
delay due to bifurcations.
We have seen in Section 4.1 how buffering a long net linearizes the delay. Given
a buffering mode m, the estimated delay for a net between two points x and v is
given by
delay := md ||x − v||.
(5.1)
Every internal node of a topology with outdegree two is a bifurcation and thus an
additional capacitance load for the circuit driving both of the two outgoing branches
(compared to alternative direct connections). The real delay caused by bifurcations
is hard to estimate beforehand. It will depend on the strength of the driver, the
additional capacitance, and the position of the driver compared to the sinks. In
Section 4.1.6 we computed the parameter dnode estimating the average effect of a
bifurcation. It is a very rough estimation, but we will show in Section 8.5 that the
used value serves us well. To evaluate the delay through a topology, we will add the
additional delay to each outgoing edge of a node with two children.
It is reasonable to assume that the additional load capacitance will be smaller for
the less critical branch. Uncritical side paths are more likely to be buffered by a small repeater with nearly negligible capacitance. We therefore allow the distribution of
dnode between both involved edges. We denote by dnode (e) the amount assigned to
edge e. We introduce a new parameter η controlling how uneven the distribution
of dnode can be. If e is an outgoing edge of a node with outdegree 1, then we
require dnode (e) = 0. Otherwise, we require that dnode (e) ≥ ηdnode . For two edges
e, e0 leaving the same internal node we require dnode (e) + dnode (e0 ) = dnode . The
parameter η has to be between 0 and 1/2 to be able to fulfill the requirements.
Next, we have to determine the arrival time at the root node. If the edge leaving
the root has buffering mode m assigned, then we assume that the root will have to
drive the capacitance mcap.¹ We set the arrival time at the root to

atT(r) := max{atrr(mcap), atfr(mcap)}.
Note that for maximizing the worst slack, an accurate arrival time at the root is
not important because each change affects the slack at all sinks in the same way.
Finally, we have to determine the required arrival time for each sink. In our simple
delay model, we only want to handle a single RAT value and not a pair of functions.
Therefore, we evaluate the RAT function at the slew target of the incoming edge.
As shown in Section 4.1.5, the capacitance of a sink has to be taken into account
when the delay is estimated. This is done by subtracting the appropriate sinkdelay
from the resulting RAT.
Given a sink with required arrival time function rat, pin capacitance cap, and
buffering mode m at the sink’s incident edge, we define the RAT used in the delay
model as
sinkrat(rat, cap, m) := min{ratr(mrs), ratf(mfs)} − sinkdelay(cap).   (5.2)
Given a topology and a buffering mode assignment F : E(T ) → M , we can now
estimate the slack at sink s to be
σs := sinkrat(rats, capin(s), ms) − ∑_{e=(v,w)∈E(T)[r,s]} (dnode(e) + F(e)d ||Pl(v) − Pl(w)||) − atT(r)

with m being the buffering mode of the arc entering s and ms the according slew target.
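Evaluating this estimate for a single sink is a straightforward walk along its root-sink path; a sketch with our own naming:

```python
def estimated_sink_slack(path, pl, mode_delay, dnode_of, sink_rat, at_root):
    """sigma_s = sinkrat - sum over the root-sink path of
    (dnode(e) + mode delay per length * wire length) - at_T(r).
    `path` lists the edges (v, w) from root to sink, `pl` is the
    embedding, `mode_delay[e]` the per-length delay of the buffering
    mode assigned to edge e."""
    def l1(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    delay = 0.0
    for (v, w) in path:
        delay += dnode_of[(v, w)] + mode_delay[(v, w)] * l1(pl[v], pl[w])
    return sink_rat - delay - at_root
```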
Figure 5.3 shows how our delay model correlates with the slacks that are achieved
after buffering. For each instance of a 22 nm design we depict the difference between
¹ See Section 4.1.3.
Figure 5.3: Correlation between estimated slacks and exact slacks. For each instance (slightly more than 300 000) of a middle-sized 22 nm design the difference (y-axis) between the slack in our delay model and the final slack after buffering is shown. The instances are sorted by the distance (x-axis) of the most critical sink to the root.
the slack of the topology used for repeater insertion and the slack of the final result.
Although there are some outliers where we overestimate the strength of the root and are about 50 picoseconds too optimistic, the vast majority of instances is estimated correctly to within 20 picoseconds.
5.1.1 Time Tree
Algorithm 1 TimeTree
Input: A topology T , an embedding Pl, a buffering mode assignment F : E(T) → M , and parameters dnode, η
Output: Arrival time function atT , RAT function ratT , and a dnode assignment
1:  for v ∈ V(T) traversed in postorder do
2:    if v is a leaf then
3:      Let e be the incoming edge to v if v ≠ r
4:      ratT(v) := sinkrat(ratv, capin(v), F(e))
5:    else
6:      if |δ+(v)| = 1 then        ▷ v is root or Steiner point along a path
7:        Let {a} = δ+(v)
8:        ratT(v) := ratT(a) − F((v,a))d ||Pl(v) − Pl(a)||
9:        dnode((v,a)) := 0
10:     else
11:       Let {a, b} = δ+(v) with ratT(a) ≤ ratT(b)
12:       α := ratT(a) − F((v,a))d ||Pl(v) − Pl(a)||
13:       β := ratT(b) − F((v,b))d ||Pl(v) − Pl(b)||
14:       ratT(v) := max_{ηdnode ≤ d ≤ (1−η)dnode} min{α − d, β − (dnode − d)}
15:       dnode((v,a)) := ratT(a) − ratT(v)
16:       dnode((v,b)) := dnode − dnode((v,a))
17:     end if
18:   end if
19: end for
20: Let e be the outgoing edge of r
21: atT(r) := max{atrr(F(e)cap), atfr(F(e)cap)}
22: for v ∈ V(T) \ {r} traversed in preorder do
23:   Let w be the parent of v
24:   atT(v) := atT(w) + dnode((w,v)) + F((w,v))d ||Pl(w) − Pl(v)||
25: end for
During topology construction, we will only maintain the required arrival times
and update them incrementally. Arrival times will not be explicitly calculated.
However, it is often desirable to compute the delay model of a given topology. This
can be done with Algorithm 1 (TimeTree). It first traverses the topology bottom
up, computes required arrival times, and distributes dnode . Then, arrival times are
computed in a second top-down traversal. Both traversals have a running time that
is linear in the size of the topology as each update step can be done in constant
time.
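A compact Python rendition of the two traversals may clarify the structure. This is a sketch, not the production implementation: the optimal split of dnode at a bifurcation is taken in closed form (the point where the two minimum arguments meet, clamped to [η·dnode, (1−η)·dnode]), and the split value itself is stored as the edge's dnode share.

```python
def time_tree(children, root, pl, edge_d, leaf_rat, dnode, eta, at_root):
    """Bottom-up RATs with dnode distribution, then top-down arrival
    times. `children[v]` lists v's children, `edge_d[(v, w)]` is the
    per-length delay of the buffering mode on edge (v, w)."""
    def l1(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    rat, dn, at = {}, {}, {root: at_root}

    def down(v):  # postorder: required arrival times
        kids = children.get(v, [])
        if not kids:
            rat[v] = leaf_rat[v]
            return
        for w in kids:
            down(w)
        if len(kids) == 1:  # root or Steiner point along a path
            a = kids[0]
            rat[v] = rat[a] - edge_d[(v, a)] * l1(pl[v], pl[a])
            dn[(v, a)] = 0.0
        else:
            a, b = sorted(kids, key=lambda w: rat[w])
            alpha = rat[a] - edge_d[(v, a)] * l1(pl[v], pl[a])
            beta = rat[b] - edge_d[(v, b)] * l1(pl[v], pl[b])
            # maximize min(alpha - d, beta - (dnode - d)) over the
            # allowed interval for d, the share of edge (v, a)
            d = max(eta * dnode, min((1 - eta) * dnode,
                                     (alpha - beta + dnode) / 2.0))
            rat[v] = min(alpha - d, beta - (dnode - d))
            dn[(v, a)], dn[(v, b)] = d, dnode - d

    def up(v):  # preorder: arrival times
        for w in children.get(v, []):
            at[w] = at[v] + dn[(v, w)] + edge_d[(v, w)] * l1(pl[v], pl[w])
            up(w)

    down(root)
    up(root)
    return rat, at, dn
```

Both passes touch every node once, so the running time is linear as stated.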
5.2 Repeater Tree Topology Problem
The Repeater Tree Topology Problem is the task of finding a topology for an
instance of the Repeater Tree Problem, an embedding, and a buffering mode
assignment. As for the Repeater Tree Problem, we allow several objectives like
minimizing netlength or maximizing the delay model slack with minimal costs.
Minimizing the ℓ1-length is an objective for topology generation that appears in early design stages or for timing-uncritical instances. This corresponds to computing shortest rectilinear Steiner trees. Garey and Johnson (1977) showed that already this problem is NP-hard.
Previous Work on Topology Generation
Alpert et al. (2008), Chapters 24–28, give a good overview of existing topology
generation algorithms beginning with different flavours of Steiner trees and finishing
with algorithms specific for repeater tree optimization.
Okamoto and Cong (1996) proposed a repeater tree procedure using a bottom-up
clustering of the sinks and a top-down buffering of the obtained topology. Similarly,
Lillis et al. (1996b) also integrated buffer insertion and topology generation. They
introduced the P-tree algorithm, which takes the locality of sinks into account, and
explored a large solution space via dynamic programming. Hrkić and Lillis (2002)
considered the S-tree algorithm which makes better use of timing information, and
integrated timing and placement information using so-called SP-trees (Hrkić and
Lillis, 2003). In these approaches the sinks are typically partitioned according to
criticality, and the initially given topology (e.g. a shortest Steiner tree) can be
changed by partially separating critical and noncritical sinks. Whereas the results
obtained by these procedures can be good, the running times tend to be prohibitive
for realistic designs in which millions of instances have to be solved.
Alpert et al. (2002) create topologies in a two-step approach. First, sinks are clustered based on parity and criticality. Second, clusters are merged by a Prim-Dijkstra heuristic that scales between shortest path trees and minimum spanning trees.
Further approaches for the generation or appropriate modification of topologies
and their buffering were considered in Cong and Yuan (2000); Alpert et al. (2001b,
2004a); Müller-Hannemann and Zimmermann (2003); Dechu et al. (2005); Hentschke
et al. (2007); Pan et al. (2007).
Repeater topology generation loosely overlaps with the design of delay-constrained multicast networks where network traffic has to be distributed to clients. A survey can be found in Oliveira and Pardalos (2005).
can be found in Oliveira and Pardalos (2005).
5.2.1 Topology Algorithm Overview
We solve the Repeater Tree Topology Problem by splitting it into three steps.
In the first step, we restrict ourselves to the default buffering mode m∗ and compute
an initial topology ignoring blockages. During the second step, we navigate around
blockages if blocked segments in the initial topology get too long. In the final step,
buffering modes from higher layers are assigned to topology edges if their slack is
infeasible.
We start our explanation by describing a simplified version of the Repeater
Tree Topology Problem.
5.3 Restricted Repeater Tree Problem
The first step of our topology generation algorithm uses the default buffering mode
m∗ . This is the fastest mode using the default wiring modes. To simplify notation
we set d := m∗d and c := dnode/2. The bifurcation delay assigned to edge e is c(e).
We evaluate topologies with our delay model that does not distinguish between
signal edges and does not know RAT functions. We adapt the Repeater Tree
Topology Problem to this simplification. We set for each sink s ∈ S the required
arrival time to
as := sinkrat(rats, capin(s), m∗) − max{atr(m∗cap), atf(m∗cap)}.
Given a topology T, the slack for sink s ∈ S becomes

σs := as − ∑_{(v,w)∈E(T)[r,s]} (d||Pl(v) − Pl(w)|| + c((v, w))).
The slack of the whole topology is σ(T) := min_{s∈S} σs.
We call the simplified version of the topology problem Restricted Repeater
Tree Topology Problem. It is shown in Figure 5.4.
5.4 Sink Criticality
Our topology generation algorithm will insert sinks into the topology one by one,
and the resulting structure depends on the order in which the sinks are considered.
Sinks that are inserted first are favored because they will potentially be connected to the root by a shorter path. We thus want to prefer sinks that are more timing-critical. In order to quantify correctly how critical a sink s is, it is crucial to take its required arrival time as as well as its location Pl(s) into account.
from the root will, other things being equal, result in worse slack because the signal
has to traverse the distance which costs delay. Similarly, if two sinks have the same
Instance: An instance consists of
• a root r and its location P l(r),
• a set S of sinks and for each sink s ∈ S its location P l(s) and a required
arrival time as ,
• a value c = dnode/2 ∈ R≥0 , and
• a value d = dwire ∈ R≥0 .
Feasible Solution: A feasible solution is a topology over root r and sinks S.
Figure 5.4: Restricted Repeater Tree Topology Problem
distance to the root, then both will pay approximately the same delay to reach the
root but the sink with lower required arrival time will be more critical.
A good measure for the criticality of a sink s is the slack that would result from
connecting s optimally to r and disregarding all other sinks. We can estimate the
optimal connection using our delay model. The resulting slack equals:
σs = as − d||P l(r) − P l(s)||.
(5.3)
The smaller this number is, the more critical we will consider the sink to be.
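In code, this criticality order is a single sort key; a sketch with our own naming:

```python
def criticality_order(sinks_pl, root_pl, a, d):
    """Sort sinks by increasing sigma_s = a_s - d * ||Pl(r) - Pl(s)||,
    i.e. most critical first. `sinks_pl` maps sink names to positions."""
    def l1(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    return sorted(sinks_pl,
                  key=lambda s: a[s] - d * l1(root_pl, sinks_pl[s]))
```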
5.5 A Simple Topology Generation Algorithm
Before we explain the topology generation algorithm that we use to solve the
Restricted Repeater Tree Topology Problem in Section 5.6, we first look
at an algorithm that has the basic structure of our final algorithm. We show the
algorithm in the next section before we discuss some of its theoretical properties.
Our first algorithm creates topologies where each internal vertex is a bifurcation. In addition, we use η = 1/2 such that each arc but the one leaving the root has a node delay of c.
The slack of a topology T is then given by

σ(T) := min_{s∈S} ( as − c(|E(T)[r,s]| − 1) − ∑_{(v,w)∈E(T)[r,s]} d||Pl(v) − Pl(w)|| ).
The properties we show for the simple topology generation algorithm were first
published in Bartoschek et al. (2010).
5.5.1 Topology Generation Algorithm
Algorithm 2 inserts sinks into a topology one by one according to some order
s1 ,s2 , . . . ,sn starting with a tree containing only the root r and the first sink s1 .
Algorithm 2 Simple Topology Generation Algorithm
1:  Choose a sink s1 ∈ S
2:  V(T1) ← {r, s1}
3:  E(T1) ← {(r, s1)}
4:  T1 ← (V(T1), E(T1))
5:  n ← |S|
6:  for i = 2, . . . , n do
7:    Choose a sink si ∈ S \ {s1, s2, . . . , si−1},
8:      an edge ei = (u, v) ∈ E(Ti−1),
9:      and an internal vertex xi with Pl(xi) ∈ R².
10:   V(Ti) ← V(Ti−1) ∪ {xi} ∪ {si}
11:   E(Ti) ← (E(Ti−1) \ {(u, v)}) ∪ {(u, xi), (xi, v), (xi, si)}
12:   Ti ← (V(Ti), E(Ti))
13: end for
The sinks si for i ≥ 2 are inserted by subdividing an edge ei with a new internal
vertex xi located at P l(xi ) and connecting xi to si . The behaviour of the procedure
clearly depends on the choice of the order, the choice of the edge ei , and the choice
of the placement P l(xi ) ∈ R2 .
In view of the large number of instances which have to be solved in an acceptable
time, the simplicity of the above procedure is an important advantage for its practical
application. Furthermore, implementing suitable rules for the choice of si, ei, and xi allows one to pursue and balance various practical optimization goals.
We look at two variants (P1) and (P2) of the procedure corresponding to optimizing
the worst slack (P1) or minimizing the length of the topology (P2), respectively.
(P1) The sinks are inserted in an order of non-increasing criticality, where the
criticality of a sink s ∈ S is quantified by −σs as shown above.
During the i-th execution of the for-loop, the new internal vertex xi is always
chosen at the same position as r, and the edge ei is chosen such that σ(Ti ) is
maximized.
(Note that placing internal vertices at the same position means placing bifurcations at the same position. It does not mean placing several repeaters at
the same position during repeater insertion.)
(P2) The sink s1 is chosen such that ||P l(r) − P l(s1 )|| = min{||P l(r) − P l(s)|| | s ∈
S} and during the i-th execution of the for-loop, si , ei = (u,v), and P l(xi )
are chosen such that
l(Ti ) = l(Ti−1 ) + ||P l(u) − P l(xi )|| + ||P l(xi ) − P l(v)|| + ||P l(xi ) − P l(si )||
− ||P l(u) − P l(v)||
is minimized.
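Variant (P1) can be sketched directly on top of Algorithm 2. In this sketch (our own naming throughout), slacks are recomputed naively for every candidate edge, which is far slower than an incremental implementation but shows the structure:

```python
def simple_topology_p1(root_pl, sink_pl, a, c, d):
    """(P1): insert sinks by non-increasing criticality, place every new
    internal vertex at the root, choose the split edge maximizing the
    worst slack of the restricted delay model."""
    def l1(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    pl = dict(sink_pl)
    pl["r"] = root_pl

    def worst_slack(edges):
        kids = {}
        for u, v in edges:
            kids.setdefault(u, []).append(v)

        def walk(v, wire, bifs):
            ws = kids.get(v, [])
            if not ws:  # sink: a_s - d * wirelength - c per bifurcation edge
                return a[v] - d * wire - c * bifs
            return min(walk(w, wire + l1(pl[v], pl[w]),
                            bifs + (1 if len(ws) == 2 else 0))
                       for w in ws)

        return walk("r", 0.0, 0)

    order = sorted(sink_pl, key=lambda s: a[s] - d * l1(root_pl, sink_pl[s]))
    edges = [("r", order[0])]
    for k, s in enumerate(order[1:]):
        x = "x%d" % k
        pl[x] = root_pl  # bifurcations are placed at the root position
        best = None
        for (u, v) in edges:
            trial = [e for e in edges if e != (u, v)]
            trial += [(u, x), (x, v), (x, s)]
            sl = worst_slack(trial)
            if best is None or sl > best[0]:
                best = (sl, trial)
        edges = best[1]
    return edges, worst_slack(edges)
```

For four equally critical sinks at the root position with c = 1 and d = 0, the greedy edge choice builds a balanced binary subtree and reaches the optimum worst slack −2 from Theorem 1.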
5.5.2 Theoretical Properties
Theorem 1. Given an instance of the Restricted Repeater Tree Topology Problem with η = 1/2, the largest achievable worst slack σopt equals

σ∗(S) := max{ σ ∈ R | ∑_{s∈S} 2^(−⌊(1/c)(as − d||r−s|| − σ)⌋) ≤ 1 },

and (P1) generates a repeater tree topology T(P1) with σ(T(P1)) = σopt.
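This characterization can be evaluated numerically. The claim established in the proof shows that the optimal σ has the fractional part of some a′s/c, so finitely many candidates of the form a′s − c·k suffice; a small sketch (the candidate range is a crude safe bound, not tight):

```python
import math

def sigma_star(a_prime, c):
    """max sigma with sum_s 2^(-floor((a'_s - sigma)/c)) <= 1."""
    def kraft(sigma):
        return sum(2.0 ** (-math.floor((ap - sigma) / c))
                   for ap in a_prime)

    spread = int((max(a_prime) - min(a_prime)) / c) + 1
    depth = 2 * len(a_prime) + spread + 2  # crude bound on the tree depth
    cands = [ap - c * k for ap in a_prime for k in range(depth)]
    return max(s for s in cands if kraft(s) <= 1.0)
```

For instance, two sinks with a′s = 0 and c = 1 yield σ∗ = −1: both sinks end up at depth 2 below the root and pay one bifurcation delay.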
Proof: Let a′s = as − d||r − s|| for s ∈ S. Let T be an arbitrary repeater tree topology. By the definition of σ(T) and the triangle inequality for || · ||, we obtain

|E[r,s]| − 1 ≤ ⌊(1/c)(as − ∑_{(u,v)∈E[r,s]} d||u − v|| − σ(T))⌋ ≤ ⌊(1/c)(a′s − σ(T))⌋

for every s ∈ S. Since the unique child of the root r is itself the root of a binary subtree of T in which each sink s ∈ S has depth exactly |E[r,s]| − 1, Kraft's inequality (Kraft, 1949) implies

∑_{s∈S} 2^(−⌊(1/c)(a′s − σ(T))⌋) ≤ ∑_{s∈S} 2^(−|E[r,s]|+1) ≤ 1.

By the definition of σ∗(S), this implies σ(T) ≤ σ∗(S). Since T was arbitrary, we obtain σopt ≤ σ∗(S).
It remains to prove that σ(T(P1)) = σopt = σ∗(S), which we will do by induction on n = |S|. For n = 1, the statement is trivial. Now let n ≥ 2. Let sn be the last sink inserted by (P1), which means that a′sn = max{a′s | s ∈ S}. Let S′ = S \ {sn}.

Claim:

frac(σ∗(S)/c) ∈ { frac(a′s/c) | s ∈ S′ }   (5.4)

where frac(x) := x − ⌊x⌋ denotes the fractional part of x ∈ R.

Proof of the claim: If (1/c)(a′s − σ∗(S)) ∉ Z for every s ∈ S, then there is some ε > 0 such that ⌊(1/c)(a′s − σ∗(S))⌋ = ⌊(1/c)(a′s − (σ∗(S) + ε))⌋ for every s ∈ S, which immediately implies a contradiction to the definition of σ∗(S) as in the statement of the theorem. Therefore, (1/c)(a′s − σ∗(S)) is an integer for at least one s ∈ S. If (1/c)(a′s − σ∗(S)) is an integer for some s ∈ S′, then (5.4) holds. Hence, if the claim is false, then (1/c)(a′sn − σ∗(S)) ∈ Z and (1/c)(a′s − σ∗(S)) ∉ Z for every s ∈ S′. Since a′sn − σ∗(S) ≥ a′s − σ∗(S) for every s ∈ S′, this implies

⌊(1/c)(a′sn − σ∗(S))⌋ > max{ ⌊(1/c)(a′s − σ∗(S))⌋ | s ∈ S′ }.   (5.5)
By the definition of σ∗(S), we have

Σ := ∑_{s∈S} 2^(−⌊(1/c)(a′s − σ∗(S))⌋) ≤ 1.

Considering the least significant non-zero bit in the binary representation of Σ, the strict inequality (5.5) implies that this bit corresponds to 2^(−⌊(1/c)(a′sn − σ∗(S))⌋). This implies that

∑_{s∈S′} 2^(−⌊(1/c)(a′s − σ∗(S))⌋) ≤ 1 − 2^(−⌊(1/c)(a′sn − σ∗(S))⌋).

Now, for some sufficiently small ε > 0, we obtain

∑_{s∈S} 2^(−⌊(1/c)(a′s − (σ∗(S)+ε))⌋) = 2^(−⌊(1/c)(a′sn − σ∗(S))⌋+1) + ∑_{s∈S′} 2^(−⌊(1/c)(a′s − σ∗(S))⌋) ≤ 1,

which contradicts the definition of σ∗(S) and completes the proof of the claim. □

Let T′(P1) denote the tree produced by (P1) just before the insertion of the last sink sn. By induction, σ(T′(P1)) = σ∗(S′).

First, we assume that there is some sink s′ ∈ S′ such that within T′(P1)

|E[r,s′]| − 1 < ⌊(1/c)(a′s′ − σ∗(S′))⌋.

Choosing en as the edge of T′(P1) leading to s′ results in a tree T such that

σ∗(S) ≥ σopt ≥ σ(T(P1)) ≥ σ(T) = σ∗(S′) ≥ σ∗(S),

which implies σ(T(P1)) = σopt = σ∗(S).

Next, we assume that within T′(P1)

|E[r,s]| − 1 = ⌊(1/c)(a′s − σ∗(S′))⌋

for every s ∈ S′. This implies

∑_{s∈S} 2^(−⌊(1/c)(a′s − σ∗(S′))⌋) > ∑_{s∈S′} 2^(−⌊(1/c)(a′s − σ∗(S′))⌋) = ∑_{s∈S′} 2^(−|E[r,s]|+1) = 1

and hence σ∗(S) < σ∗(S′). By (5.4), we obtain

σ∗(S) ≤ max{ σ | σ < σ∗(S′), frac(σ/c) ∈ { frac(a′s/c) | s ∈ S′ } }
      = max{ σ | σ < σ∗(S′), frac((σ − σ∗(S′))/c) ∈ { frac((a′s − σ∗(S′))/c) | s ∈ S′ } }
      = σ∗(S′) + c · max{ frac((a′s − σ∗(S′))/c) − 1 | s ∈ S′ }
      = σ∗(S′) − c(1 − δ)

for

δ = max{ frac((a′s − σ∗(S′))/c) | s ∈ S′ }.

If s′ ∈ S′ is such that δ = frac((a′s′ − σ∗(S′))/c), then choosing en as the edge of T′(P1) leading to s′ results in a tree T such that

σ∗(S) ≥ σopt ≥ σ(T(P1)) ≥ σ(T) = σ∗(S′) − c(1 − δ) ≥ σ∗(S),
which implies σ(T(P1)) = σopt = σ∗(S) and completes the proof. □

Theorem 2. (P2) generates a repeater tree topology T for which l(T) is at most the total length of a minimum spanning tree on {r} ∪ S with respect to || · ||.
Proof: Let n = |S| and for i = 0, 1, . . . , n, let T^i denote the forest which is the union of the tree produced by (P2) after the insertion of the first i sinks and the remaining n − i sinks as isolated vertices. Note that T^0 has vertex set {r} ∪ S and no edge, while for 1 ≤ i ≤ n, T^i has vertex set {r} ∪ S ∪ {xj | 2 ≤ j ≤ i} and 2i − 1 edges.

Let F0 = (V(F0), E(F0)) be a spanning tree on V(F0) = {r} ∪ S such that

l(F0) = ∑_{(u,v)∈E(F0)} ||u − v||

is minimum. For i = 1, 2, . . . , n, let Fi = (V(Fi), E(Fi)) arise from (V(T^i), E(Fi−1) ∪ E(T^i)) by deleting an edge e ∈ E(Fi−1) ∩ E(F0) which has exactly one end vertex in V(T^{i−1}) such that Fi is a tree. (Note that this uniquely determines Fi.)

Since (P2) has the freedom to use the edges of F0, the specification of the insertion order and the locations of the internal vertices in (P2) imply that

l(F0) ≥ l(F1) ≥ l(F2) ≥ . . . ≥ l(Fn).

Since Fn = T^n, the proof is complete. □
For the ℓ1-norm, the well-known result of Hwang (1976) together with Theorem 2 imply that (P2) is an approximation algorithm for the ℓ1-minimum Steiner tree on the set {r} ∪ S with approximation guarantee 3/2.
We have seen in Theorem 1 and Theorem 2 that different insertion orders are
favourable for different optimization scenarios such as optimizing for worst slack or
minimum netlength.
Alon and Azar (1993) gave an example showing that for the online rectilinear
Steiner tree problem the best achievable approximation ratio is Θ(log n/ log log n),
where n is the number of terminals. Hence, inserting the sinks in an order disregarding the locations, like in (P1), can lead to long Steiner trees, no matter how we
decide where to insert the sinks.
The next example shows that inserting the sinks in an order different from the
one considered in (P1) but still choosing the edge ei as in (P1) results in a repeater
tree topology whose worst slack can be much smaller than the largest achievable
worst slack.
Example 1. Let c = 1, d = 0 and a ∈ N. We consider the following sequences of
−a's and 0's:

A(1) = (−a, 0),
A(2) = (A(1), −a, 0),
A(3) = (A(2), −a, 0, . . . , 0)  with 1 + (2¹ − 1)(a + 2) trailing 0's,
A(4) = (A(3), −a, 0, . . . , 0)  with 1 + (2² − 1)(a + 2) trailing 0's,
. . . ,

i.e. for l ≥ 2, the sequence A(l) is the concatenation of A(l − 1), one −a, and a
sequence of 0's of length 1 + (2^{l−2} − 1)(a + 2).
If the entries of A(l) are considered as the required arrival times of an instance of
the Restricted Repeater Tree Topology Problem, then Theorem 1 together
with the choice of c and d imply that the largest achievable worst slack for this
instance equals
⌊ −log2( l·2^a + (1 + Σ_{i=2}^{l} (1 + (2^{i−2} − 1)(a + 2))) · 2^0 ) ⌋.
For l = a + 1 this is at least −2 − a − log2 (a + 2).
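For c = 1 the displayed bound can be evaluated directly. The helper below is our own and assumes the reading of the formula above: the l sinks with required arrival time −a contribute 2^a each, and the 0-sinks contribute 2^0 each.

```python
from math import floor, log2

def best_achievable_slack(a, l):
    # floor(-log2( l * 2^a + (number of 0-sinks) * 2^0 )) for instance A(l),
    # with c = 1 and d = 0: A(1) and A(2) contribute one 0 each, and A(i)
    # for i >= 3 adds 1 + (2^(i-2) - 1)(a + 2) further 0's.
    zeros = 1 + sum(1 + (2 ** (i - 2) - 1) * (a + 2) for i in range(2, l + 1))
    return floor(-log2(l * 2 ** a + zeros))
```

For l = a + 1 this value stays above −2 − a − log2(a + 2), while the insertion order of the example only reaches −2a − 1, roughly a factor 2 worse.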
If we insert the sinks in the order as specified by the sequences A(l), and always
choose the edge into which we insert the next internal vertex such that the worst
slack is maximized, then the following sequence of topologies can arise: T (1) is the
topology with exactly two sinks at depth 2. The worst slack of T (1) is −(a + 1).
For l ≥ 2, T (l) arises from T (l − 1) by (a) subdividing the edge of T (l − 1) incident
with the root with a new vertex x, (b) appending an edge (x,y) to x, (c) attaching
to y a complete binary tree B of depth l − 2, (d) attaching to one leaf of B two
new leaves corresponding to sinks with required arrival times −a and 0, and (e)
attaching to each of the remaining 2^{l−2} − 1 leaves of B a binary tree ∆ which
has a + 2 leaves, all corresponding to sinks with required arrival time 0, whose depths in ∆ are
1, 2, 3, . . . , a − 1, a, a + 1, a + 1. Note that this uniquely determines T (l).
Clearly, the worst slack in T (l) equals −a − l. Hence for l = a + 1, the worst
slack equals −2a − 1, which differs approximately by a factor of 2 from the largest
achievable worst slack as calculated above.
This example, however, does not show that there is no online algorithm for
approximately maximizing the worst slack, say up to an additive constant of c.
Recently, Held and Rotter (2013) presented an O(n log n) algorithm that, given ε >
0 and an initial topology T0, solves the Restricted Repeater Tree Topology
Problem for η = 1/2 such that, if the worst slack of the instance is non-negative,

σ(T) ≥ −ε·(dnode + max_{s∈S} a_s)   and   l(T) < (1 + 2/ε)·l(T0) + (2n·dnode)/ε
with n := |S|. The length of a topology l(T ) is the sum of edge lengths in T . Here,
T0 can be derived from any Steiner tree, for instance a minimum Steiner tree or an
approximation of it.
5.6 Topology Generation Algorithm
We now extend our topology generation from Algorithm 2. The basic structure
remains the same. However, we now allow the distribution of node delays using η
between 0 and 1/2.
Algorithm 3 shows the version of the topology generation that we use to construct
repeater trees. We use the delay model but only required arrival times are updated
during the process. In terms of Algorithm 2 the choice of si , P l(xi ), and ei is as
follows:
• Similar to (P1), the sinks are ordered by non-increasing criticality.
• P l(xi ) is chosen such that the netlength increase of the topology is minimized.
In general, the new nodes do not lie on top of the root node as in (P1).
• The edge ei is chosen such that the weighted sum of resulting topology slack
and netlength increase is minimized.
We hope to reduce netlength by connecting each sink as directly as possible to the
chosen candidate edge. However, by doing so, the slack is no longer guaranteed to
be optimal, as shown in Theorem 1 for (P1).
The algorithm first sorts the sinks according to criticality. Then, it connects the
most critical sink directly to the root and initializes the topology and rat function
accordingly (lines 1–4). The next step is to iterate over all sinks and to add them
into the topology one after another. To determine ei , we compute for each existing
edge (v, w) the required arrival time at the root that we would get if we choose the
edge (lines 9–26). The Steiner point is always in the bounding box between v and
w due to the ℓ1-norm used. The netlength increase is therefore ||z − Pl(si)||. Given
the resulting RAT rat at the root, we choose the edge that maximizes
ξ(min{rat, 0}) − (1 − ξ)||z − P l(si )||.
Algorithm 3 Topology Generation Algorithm
Input: An instance of the Restricted Repeater Tree Topology Problem
Output: A topology consisting of tree T = (V, E) and embedding P l
1: For each sink s ∈ S, compute the criticality σs
2: Sort S such that σs1 ≤ σs2 ≤ · · · ≤ σs|S|
3: V := {r, s1}; E := {(r, s1)}
4: ratT(si) := asi for 1 ≤ i ≤ |S|
5: for i := 2, . . . , |S| do
6:     ei := ∅
7:     bval := −∞
8:     bdist := ∞
9:     for all (v, w) ∈ E do    ▷ Search for the best edge to connect si to
10:        Choose z ∈ R² minimizing ||z − Pl(v)|| + ||z − Pl(w)|| + ||z − Pl(si)||
11:        α1 := ratsi − dwire·||z − Pl(si)||
12:        α2 := ratw − dwire·||z − Pl(w)||
13:        rat := max_{η·dnode ≤ b ≤ (1−η)·dnode} min{α1 − b, α2 − (dnode − b)}
14:        for (x, y) ∈ E[δ⁺(r), w] traversed bottom-up do
15:            Let u be the sibling of y
16:            α1 := ratu − dwire·||Pl(x) − Pl(u)||
17:            α2 := rat − dwire·||Pl(x) − Pl(y)||
18:            rat := max_{η·dnode ≤ b ≤ (1−η)·dnode} min{α1 − b, α2 − (dnode − b)}
19:        end for
20:        val := ξ·min{rat − dwire·||Pl(r) − Pl(δ⁺(r))||, 0} − (1 − ξ)·||z − Pl(si)||
21:        if val > bval or (val = bval and ||z − Pl(si)|| < bdist) then
22:            bval := val
23:            bdist := ||z − Pl(si)||
24:            ei := (v, w)
25:        end if
26:    end for
27:    Create a new Steiner node xi
28:    V := V ∪ {si, xi}
29:    E := (E \ {(v∗, w∗)}) ∪ {(v∗, xi), (xi, w∗), (xi, si)} with ei = (v∗, w∗)
30:    Pl(xi) := z ∈ R² minimizing ||z − Pl(v∗)|| + ||z − Pl(w∗)|| + ||z − Pl(si)||
31:    for (v, w) ∈ E[δ⁺(r), si] traversed bottom-up do
32:        Let y be the sibling of w
33:        α1 := raty − dwire·||Pl(v) − Pl(y)||
34:        α2 := ratw − dwire·||Pl(v) − Pl(w)||
35:        ratT(v) := max_{η·dnode ≤ b ≤ (1−η)·dnode} min{α1 − b, α2 − (dnode − b)}
36:    end for
37: end for
If two solutions have the same value, we choose the shorter connection. The
parameter ξ allows us to scale between topologies that optimize the worst slack and
short topologies depending on the objective of the Repeater Tree Problem.
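The selection rule of lines 20–24, including the tie-break, can be sketched as follows; the edge identifiers and candidate tuples are our own illustration:

```python
def score(rat_at_root, dist_increase, xi):
    # xi * min{rat, 0} - (1 - xi) * ||z - Pl(s_i)||: xi = 1 optimizes the
    # worst slack only, xi = 0 only the netlength increase.
    return xi * min(rat_at_root, 0.0) - (1.0 - xi) * dist_increase

def pick_best(candidates, xi):
    # candidates: (edge_id, rat_at_root, dist_increase) tuples; ties in
    # score are broken in favour of the shorter connection.
    best = None
    for edge, rat, dist in candidates:
        val = score(rat, dist, xi)
        if best is None or val > best[0] or (val == best[0] and dist < best[1]):
            best = (val, dist, edge)
    return best[2]
```

Capping the root RAT at 0 means that edges with comfortable slack compete purely on netlength.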
After choosing ei, we add the new sink to the tree by splitting ei and update rat on
all affected edges (lines 27–37).
Given an internal node v with children u1 and u2, let αi := ratT(ui) − dwire·||Pl(ui) − Pl(v)|| for i ∈ {1, 2}. By setting

ratT(v) := max_{η·dnode ≤ b ≤ (1−η)·dnode} min{α1 − b, α2 − (dnode − b)}

we implicitly maintain the node delay for each edge. If, without loss of generality,
α1 ≤ α2, then ratT(v) can be computed in constant time by

ratT(v) := α1 − η·dnode − (1/2)·max{(1 − 2η)·dnode − (α2 − α1), 0}.

The additional delays of the outgoing edges,

dnode((v, u1)) = α1 − ratT(v)
dnode((v, u2)) = dnode − dnode((v, u1)) ≤ α2 − ratT(v),

satisfy our requirements on the node delay.
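The constant-time formula can be cross-checked against the defining maximization: the objective is concave and piecewise linear in b, so it suffices to evaluate the interval endpoints and the balance point where both terms agree. A small verification sketch of ours:

```python
def rat_node(alpha1, alpha2, d_node, eta):
    # Closed-form value of max_{eta*d_node <= b <= (1-eta)*d_node}
    #   min(alpha1 - b, alpha2 - (d_node - b)),  w.l.o.g. alpha1 <= alpha2.
    a1, a2 = min(alpha1, alpha2), max(alpha1, alpha2)
    return a1 - eta * d_node - 0.5 * max((1 - 2 * eta) * d_node - (a2 - a1), 0.0)

def rat_node_direct(alpha1, alpha2, d_node, eta):
    # Direct maximization: the optimum is attained at an interval endpoint
    # or at the (clamped) point where both terms of the minimum are equal.
    lo, hi = eta * d_node, (1 - eta) * d_node
    balance = (alpha1 - alpha2 + d_node) / 2.0
    cands = [lo, hi, min(max(balance, lo), hi)]
    return max(min(alpha1 - b, alpha2 - (d_node - b)) for b in cands)
```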
Theorem 3. The worst case running time of the algorithm is O(n3 ) if n = |S| and
the best case running time is Ω(n2 ).
Proof. There are n − 1 iterations of the outer loop of the algorithm (lines 5–37).
Each iteration removes one edge from the tree and adds three new ones. When ei is
searched, there are 1 + 2(i − 2) = 2i − 3 edges in the tree. The loop searching
for ei (lines 9–26) is therefore executed Σ_{i=2}^{n} (2i − 3) ∈ Θ(n²) times. The loop computing the
RAT at the root (lines 14–19) and the loop updating the RATs after sink insertion
(lines 31-36) can be stopped if the RAT does not change at a node. In such a case,
no updates will be done on the path to the root. In the best case, both loops only
perform a constant number of updates resulting in an overall best case running time
of Ω(n2 ). In the worst case, the inner loop (lines 14–19) iterates linearly in the size
of the graph resulting in an overall worst case running time of O(n3 ).
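The Θ(n²) count in this proof is in fact exactly (n − 1)², which a one-line check confirms:

```python
def edge_inspections(n):
    # Total number of candidate edges inspected over all insertions:
    # sum over i = 2..n of (2i - 3).
    return sum(2 * i - 3 for i in range(2, n + 1))
```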
5.6.1 Handling High Fanout Trees
As we will show in our experimental results, the running time of our algorithm
is extremely small for instances up to 1 000 sinks. Nevertheless, our topology
generation, as described above, has a cubic running time. There are instances with
several hundred thousand sinks on actual designs, for which this would lead to
intolerable running times.
One way to reduce the running time would be to consider only the k nearest edges
when inserting a sink, for some positive integer k. This would require storing the
edges as rectangles in a suitable geometric data structure (e.g. a k-d tree).
However, we chose a different approach.
For instances with more than 1 000 sinks, we first apply a clustering algorithm
to all sinks, except for the 100 most critical ones if ξ > 0. More precisely, we
find a partition S′ = S1 ∪̇ · · · ∪̇ Sk of the set S′ of less critical sinks, and Steiner
trees Ti for Si (i = 1, . . . , k), such that the total capacitance of Ti plus the input
capacitances of Si is at most maxcap (see Section 4.1.6). Among such solutions, we
try to minimize the total wire capacitance plus k times the input capacitance of the
repeater t∗.
For this facility location problem, we use an approximation algorithm by Maßberg
and Vygen (2008), which generates very good solutions in O(|S| log |S|) time. We
introduce an appropriate repeater for each component; its input pin constitutes a
new sink. This typically reduces the number of sinks by at least a factor of 10. If
the number of sinks is still greater than 1 000, we iterate the clustering step. Finally,
we run our topology generation algorithm as described above.
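The capacitance limit can be illustrated with a far simpler stand-in than the Maßberg–Vygen algorithm: a greedy sweep that closes a cluster as soon as its estimated capacitance (input capacitances plus a crude half-perimeter wire estimate) would exceed maxcap. All names and the wire estimate below are our own assumptions, not the thesis implementation:

```python
def greedy_clusters(sinks, maxcap, wire_cap_per_unit):
    # sinks: list of (x, y, input_cap) tuples, swept in sorted order.
    # A single sink that alone exceeds maxcap still forms its own cluster.
    clusters, cur = [], []
    for s in sorted(sinks):
        cand = cur + [s]
        xs = [p[0] for p in cand]
        ys = [p[1] for p in cand]
        # Half-perimeter of the bounding box as a cheap wire-capacitance bound.
        wire = wire_cap_per_unit * ((max(xs) - min(xs)) + (max(ys) - min(ys)))
        if cur and wire + sum(p[2] for p in cand) > maxcap:
            clusters.append(cur)
            cand = [s]
        cur = cand
    if cur:
        clusters.append(cur)
    return clusters
```

Each closed cluster then receives a repeater whose input pin becomes a new sink, just as described above.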
5.7 Blockages
The topology generation so far did not consider blockages or congestion. Nothing
prevents Steiner points from being located on blockages or topology segments from
having long intersections over them.
For the vast majority of instances, blockages do not play a role because there are
none in the bounding box of the involved points. However, some instances can only
be constructed properly if one navigates around blockages or considers congestion.
We handle blockages in the second step of our topology generation algorithm.
First, we iterate over the topology bottom-up and move each Steiner point s to a
free location. Our algorithm searches in each direction for the nearest free location
that minimizes the sum of ℓ1-distances between s and its neighbours. Then, we
search for a buffer-aware shortest path (see Section 4.2.3) within the blockage grid for
each topology edge that crosses a blockage. A similar approach is shown by Zhang
et al. (2012) and improved in Zhang and Pan (2014). They take an input topology
and calculate slew degradations over blockages. In case of violations, they formulate
an ILP and solve it to find a good replacement topology. Huang and Young (2012)
propose a similar solution. Held and Spirkl (2014) propose a fast 2-approximation
algorithm to create rectilinear Steiner trees that can cross obstacles for a limited
distance.
After performing the shortest-path search, some edges might still cross blockages
that are small enough to pass over. These edges are split at blockage boundaries
such that each topology segment is either completely blocked or completely free. Neighbouring
blocked edges on a chain are merged into a single one.
5.8 Plane Assignment
So far, we have used only one buffering mode, a choice that is often appropriate for shorter
connections. For bridging long connections, it can be better to use buffering
modes on higher planes than the default ones. Such buffering modes are often faster
and use less placement space due to larger repeater spacing.
The third and final step of our topology generation is plane assignment. We use
a simple greedy routine to assign buffering modes to a computed topology. They
are then used by our repeater insertion routine. We restrict ourselves to the fastest
buffering mode for each wire mode. For each buffering mode, we know the repeater
spacing.
The algorithm processes the nodes of the topology in a BFS traversal starting at
the root. If the slack of the node is negative and if it is not too congested according
to the congestion map, then we consider the edge between the node and its parent.
The edge gets the fastest buffering mode assigned for which repeater spacing is
smaller than half the length of the edge. After changing an assignment, the slack
values are updated.
The algorithm runs in time O(m log b) where m is the number of edges in the
topology and b is the number of buffering modes we might assign. The buffering
modes can be sorted by the repeater spacing. If there is a buffering mode with
higher spacing and worse timing than another one, then it can be removed because
it will never be inserted. The resulting list is sorted increasingly by spacing and
decreasingly by delay per length. Thus, for an edge, the best buffering mode can be
found by binary search in O(log b) time. It is not necessary to recompute the whole
timing after each assignment. It is sufficient to propagate an arrival time delta to the
children of the node, so that the timing update takes constant time per node.
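The pruning of dominated buffering modes and the binary search can be sketched like this; the (spacing, delay per length) tuples are invented for illustration:

```python
from bisect import bisect_left

def prune_modes(modes):
    # Keep only modes whose delay per length improves on every mode with
    # smaller repeater spacing; dominated modes can never be chosen.
    kept, best = [], float("inf")
    for spacing, delay in sorted(modes):
        if delay < best:
            kept.append((spacing, delay))
            best = delay
    return kept

def fastest_feasible(kept, edge_len):
    # Fastest mode whose repeater spacing is smaller than half the edge
    # length: the rightmost feasible entry of the pruned, sorted list.
    spacings = [sp for sp, _ in kept]
    i = bisect_left(spacings, edge_len / 2.0)
    return kept[i - 1] if i > 0 else None
```

Because the pruned list is sorted increasingly by spacing and decreasingly by delay, the rightmost feasible entry is automatically the fastest one.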
The greedy routine is suitable at design stages where the placement of the circuits
and the timing are premature. In later design stages, it is desirable to consider
congestion better and to optimize plane assignment for best slack. We will present
an extension to the dynamic program used for repeater insertion that maximizes
slack and takes advantage of higher planes.
5.9 Global Wires as Topologies
As already mentioned, routing congestion is one of the biggest problems for optimizing current chip designs. The approach described in Section 5.7 works locally
without considering the big picture. A global router (see for example Müller (2009)),
on the other hand, optimizes the distribution of nets over the whole routing space.
A recent trend in the industry is to use the result of global routing for repeater tree
topologies.
The advantage of using global wires is a reduced expected congestion. However,
global routing has to be adapted to yield results suitable for repeater insertion. One
has to consider timing, placement space, blockages and instance sizes.
A global routing that will later be buffered should not use wires that cross
blockages for too long because they cannot be processed in repeater insertion
without electrical and timing violations. The available placement space has also to
be considered. The number of necessary repeaters can be estimated using suitable
buffering modes for the wires.
The global router should also consider the timing criticality of nets and sinks
when it creates routes. Otherwise, uncritical nets might get short routes at the cost
of detours in critical nets. One approach might be to use the topology algorithm
presented here as a subroutine within the global router when it has to calculate
Steiner trees for nets.
The input presented to the global router is often stripped of all buffers and
unnecessary inverters. However, some inverters cannot be removed if one wants
to preserve the logical correctness of the design. The global router will then
use the placement of the inverter as a constraint to the routing. It will consider
two nets for a single instance and connect the sink of the first net with the current
inverter and the source of the second net. It is better to consider the inverter and
both nets as a single net. This is also the case if existing repeaters are not stripped
from the design (see also Section 7.6.1).
6 Repeater Insertion
Finding a topology for a repeater tree instance is the first step in our approach
to solve the Repeater Tree Problem. The second step is to insert repeaters
along the topology to create a feasible solution. For this we consider the Repeater
Insertion Problem and describe how our algorithm solves it.
Instance: An instance (I, T, Bl, M, F ) consists of
• an instance I of the Repeater Tree Problem,
• a topology T with embedding P l connecting the root and sinks of I,
• a set Bl ⊆ E(T) of edges that are blocked,
• a set of buffering modes M , and
• buffering mode assignments F : E(T ) → M .
Task: Find a feasible solution of the Repeater Tree Problem for I minimizing costs such that each repeater lies on a shortest path between the endpoints
of a topology edge and all sinks reachable from the repeater in the final tree are
also reachable from the edge in the topology.
Figure 6.1: Repeater Insertion Problem
The dominant approach to solve the Repeater Insertion Problem is dynamic
programming. An extensive survey of the dynamic programming approach can be
found in Alpert et al. (2008), Sections 26.4 – 26.6. We give a short summary of
existing work in Section 6.3.
Our main contribution is the repeater insertion of our Fast Buffering algorithm
presented in Section 6.2 that, in practice, is considerably faster than the standard
dynamic program. Our routine can be characterized as a version of the dynamic
program that keeps only one solution at a time. Several heuristics are used to choose
a solution that will lead to an overall good solution.
A substantial difference to the dynamic programming approach is that our algorithm is able to change the topology in order to reduce the number of repeaters
inserted for preserving parity constraints. Figure 6.2 shows an extreme example
where different topologies for the same sink set result in a huge difference in the
minimum number of repeaters necessary to realise each of them. Given topology a),
our algorithm will often create a solution that lies between both extremes depending
on the criticality of the instance. However, our solution will still fulfill the constraints
from the Repeater Insertion Problem.
Finally, the dynamic program depends on precomputed repeater positions. In
Figure 6.2: While topology b) requires only one inverter to realise the indicated sink parities,
topology a) would require five.
contrast, our algorithm is free to choose any position along an unblocked edge of
the input topology.
Our repeater insertion algorithm consists of two parts. In a first step, we assign
delay efforts to the edges of the topology by solving a Deadline Problem. This
allows us to buffer parts of the topology with less effort and leads to lower resource
usage. In a second step, we replace the topology by a repeater tree in a bottom-up
fashion.
We present in Section 6.3 how we use the standard dynamic programming technique to improve the solution found by the Fast Buffering algorithm.
The buffering algorithm we present here is an extension to joint work with Stephan
Held, Dieter Rautenbach and Jens Vygen (Bartoschek et al., 2009, 2007b).
6.1 Computing Required Arrival Time Targets
As shown in Section 4.1.1, it is possible to buffer a long line with different delay
and power consumption characteristics. For topology generation, we assumed that
each edge is buffered such that the fastest delay using the default wire modes can
be achieved.
After computing arrival times and required arrival times for all nodes of the
topology using our delay model, there may be sinks with non-positive slack even
if the fastest buffering mode is used on the path from the root. Obviously, we
want to buffer such paths as fast as possible to keep timing constraint violations
small. However, other sinks and subtrees might have positive slack even with
the fastest buffering mode. Each edge of such a subtree can potentially be
slowed down to reduce the overall power consumption of the resulting repeater tree.
As input to the Repeater Insertion Problem each edge has a buffering mode m
assigned. We have the possibility to choose another buffering mode. At this point
in time, we do not want to change the layer assignment of the edges. Therefore, we
restrict ourselves to the alternative buffering modes in Mm (see Section 4.1.3).
Instance: An instance consists of
• an instance I of the Repeater Tree Problem,
• a topology T for I,
• a set of buffering modes M ,
• buffering mode assignment for each edge F : E(T ) → M , and
• a maximal subtree Tz of T rooted at z ∈ V(T) such that, using our delay
model, ratT(z) − atT(z) > 0.
Task: Let E′ be the set of edges reachable from z (E(Tz)), including the edge
leading to z if z ≠ r.
Find an assignment F′ : E(T) → M with F′(e) = F(e) if e ∉ E′ and F′(e) ∈
M_{F(e)} otherwise, such that ratT(s) − atT(s) ≥ 0 for all sinks s reachable from z and the
total cost

Σ_{e=(v,w)∈E′} F′(e)_p · ||Pl(v) − Pl(w)||

is minimized.
Figure 6.3: Buffering Mode Assignment Problem
We call the problem of assigning buffering modes to a subtree Buffering Mode
Assignment Problem. It is shown in Figure 6.3. If the slack at the root of our
topology is positive, then the whole tree is an instance to buffering mode assignment.
Otherwise, each maximal subtree with positive slack is considered separately. The
initial assignment for such a subtree is a feasible assignment but probably not the
cheapest one. We find cheaper solutions, but we do not let the slack at the root
become negative. Thus, the result of buffering mode assignment does not change
RATs outside of the considered subtree. The problem can be solved independently
for each subtree.
The Buffering Mode Assignment Problem is very similar to the Discrete
Deadline Problem (see for example Skutella (1998)) as a special case of the
Time-Cost Tradeoff Problem (Kelley, 1961; Fulkerson, 1961). Figure 6.4 shows
the Discrete Deadline Problem. The graph P in an instance of the problem
is called project graph. Each edge e corresponds to a task that has to be executed
and Xe is the set of possible alternatives to finish the task with different execution
times and costs.
An instance to the Buffering Mode Assignment Problem can be transformed
to an instance of the Discrete Deadline Problem. We show how to transform a
subtree rooted at z with parent parent(z). The transformation is done by (a) using
the subtree induced by E 0 as a project graph, (b) adding a new node s and an edge
(s, parent(z)) with the single execution time atT (parent(z)) and costs 0, (c) adding
a new node t and edges (si , t) for all sinks si reachable from z with single execution
time −ratsi and costs 0, (d) setting execution times according to valid buffering
modes for all topology edges, and (e) setting the deadline to 0. Each topology edge
Instance: An instance consists of
• a directed graph P = (V, E) with two nodes s, t ∈ V such that each node
is reachable from s and t is reachable from each node,
• for each edge e ∈ E a set Xe of execution times xe with costs ce (xe ), and
• a deadline D.
Task: For each edge e ∈ E, find an execution time x∗_e and an assignment of
arrival times a : V → R such that

a(s) ≥ 0,   a(t) ≤ D,   and   a(v) + x∗_{(v,w)} ≤ a(w)   ∀(v, w) ∈ E,

and Σ_{e∈E} c_e(x∗_e) is minimized.
Figure 6.4: Discrete Deadline Problem
(v, w) has the initial buffering mode F((v, w)) assigned. We set

X_{(v,w)} := { x_m | m ∈ M_{F((v,w))} }

with

x_m := m_d · ||Pl(v) − Pl(w)||,
c_{(v,w)}(x_m) := m_p · ||Pl(v) − Pl(w)||.
If z is the root of the whole topology, then we just connect s to z during project
graph construction. As mentioned earlier, each of our instances to the Buffering
Mode Assignment Problem has a feasible solution. It follows that using the
arrival times of the delay model as arrival times a in the Discrete Deadline
Problem is a feasible solution. Conversely, any feasible solution to the
Discrete Deadline Problem yields a feasible buffering mode assignment if
we choose for each edge the buffering mode m whose execution time x_m was selected for that edge.
The Discrete Deadline Problem is NP-hard, but Halman et al. (2008) have
shown that there exists an FPTAS on series-parallel networks such as repeater tree
topologies.
6.1.1 Linear Time-cost Tradeoff
As our delay model is only a rough approximation of the reality after buffering,
it makes no sense to spend too much effort on solving the Buffering Mode
Assignment Problem exactly. Furthermore, it turns out that for our purpose it
is sufficient to solve a linear relaxation of the problem.
Figure 6.5: Piecewise-linear relaxation of buffering modes. The buffering mode m3 is
dominated by a linear combination of m2 and m4 and can be cancelled.
If there is an edge e = (v, w) with execution time x∗e between execution times xa
and xb for buffering modes a, b such that
x∗e = αxa + (1 − α)xb ,
then e can be divided into two edges such that the first edge has length α||P l(v) −
P l(w)|| and buffering mode a and the second edge has length (1−α)||P l(v)−P l(w)||
with buffering mode b.
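The split follows directly from the two execution times; a small helper of ours:

```python
def split_edge(edge_len, x_star, x_a, x_b):
    # Solve x* = alpha * x_a + (1 - alpha) * x_b for alpha and split the
    # edge into a piece of length alpha * edge_len buffered with mode a
    # and the remaining (1 - alpha) * edge_len with mode b.
    alpha = (x_b - x_star) / (x_b - x_a)
    assert 0.0 <= alpha <= 1.0
    return alpha * edge_len, (1.0 - alpha) * edge_len
```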
The linear relaxation of the problem uses piecewise linear time-cost functions
for each edge. For example, Figure 6.5 shows how delays and costs are relaxed for
an edge with four alternative buffering modes {m1 , m2 , m3 , m4 } that are sorted by
increasing delay. After removing buffering modes (m3 in the example) that are
dominated by linear combinations of others, the result is a convex piecewise-linear
time-cost function. It approximates the non-dominated part of the delay-power
tradeoff curve of a wire mode. One example is the red part of the curve shown in
Figure 4.4.
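Removing modes that are dominated by linear combinations of others amounts to computing the lower convex hull of the (delay, cost) points. A monotone-chain sketch of ours, with invented sample points:

```python
def lower_hull(points):
    # Lower convex hull of (delay, cost) points via the monotone chain:
    # a point lying on or above the segment between two cheaper
    # alternatives is dropped, as m3 is in Figure 6.5.
    hull = []
    for p in sorted(points):
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            cross = (a[0] - o[0]) * (p[1] - o[1]) - (a[1] - o[1]) * (p[0] - o[0])
            if cross <= 0:  # a lies on or above the segment o -> p
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```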
Now the Time-Cost Tradeoff Problem can be solved efficiently by solving a
Minimum-Cost Flow Problem (MCF). The construction is described in Fulkerson
(1961) or Lawler (2001). Algorithms to solve the MCF problems can be found
in Korte and Vygen (2012), Chapter 9. The input to the MCF algorithm is the
project graph in which each edge is replaced by a chain of at most |M| edges,
|M| being the number of buffering modes and thus the maximum number of sampling
points of a time-cost function. Then, each edge in the chain is doubled. After solving the MCF,
we get a node potential that corresponds to an arrival time assignment a.
Theorem 4. The effort assignment adds at most |S| new vertices and edges into
the topology.
Proof. The tree T in the spanning tree structure of any basic spanning tree solution
of the MCF spans all vertices in our original topology. For any r-s-path with s ∈ S,
it omits at most one edge. By complementary slackness, all edges in T define
integral buffering modes. Therefore there are at most |S| fractional edges which are
divided.
Note that the Network Simplex Algorithm always maintains basic solutions.
Furthermore, one can transform non-basic optimum solutions into basic ones in at
most |E| pivots.
6.1.2 Effort Assignment Algorithm
The algorithm we use to solve the Buffering Mode Assignment Problem is
outlined in Algorithm 4 and called AssignEffort. The input is a topology and
allowed buffering modes for every edge.
Algorithm 4 Buffering Mode Assignment Problem
1: procedure AssignEffort(T)
2:     Compute timing using the fastest buffering mode.
3:     for all n ∈ V(T) with slack(n) > 0 and slack(parent(n)) ≤ 0 do
4:         Create MCF instance I for the subtree rooted at n
5:         Solve I
6:         Assign buffering modes according to node potentials in I
7:     end for
8: end procedure
First, we assign the fastest buffering mode to each edge of the topology and
recompute the timing using Algorithm 1 (TimeTree). For all sinks with slack
smaller than or equal to 0, the fastest buffering mode is kept.
Then, we identify instances to the Buffering Mode Assignment Problem
using a DFS search and process each subtree separately as a Deadline Problem.
We solve the min-cost-flow formulation and compute node potentials. After having
filtered out the sinks with non-positive slacks, we know that the problem is feasible
and that the potentials are feasible.
Given potentials π : V (Tn ) → R, the delay we want to assign to an edge (v, w) is
π(v) − π(w). For non-fractional edges, the corresponding buffering mode is used. For
fractional edges, we do not subdivide the edge as outlined in the previous section.
Instead, we just round to the cheapest buffering mode faster than the fractional
solution.
After rounding, the delays on all edges correspond to a buffering mode. We
update for each edge e ∈ E(Tn ) the buffering mode assignment F accordingly.
(Note that it is possible to merge all deadline problems into a single one because they
correspond to disjoint subtrees of the topology.)
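Rounding a fractional delay to an allowed execution time is one binary search; on the convex frontier slower means cheaper, so the slowest mode not exceeding the target is the cheapest mode that is still fast enough. A helper of ours:

```python
from bisect import bisect_right

def round_to_mode(target_delay, mode_delays):
    # mode_delays: sorted execution times x_m of the allowed buffering
    # modes. Pick the largest x_m <= target_delay; if even the fastest
    # mode exceeds the target, fall back to it.
    i = bisect_right(mode_delays, target_delay)
    return mode_delays[i - 1] if i > 0 else mode_delays[0]
```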
6.2 Repeater Insertion Algorithm
We now describe the main part of our repeater insertion algorithm. The input is an
instance of the Repeater Insertion Problem (I, T in , Bl, M, F ), for example, as
it has been computed by our topology generation algorithm. The result will be a
repeater tree R = (T, Pl, Rt, RW) (see Section 3.2).
We first update the buffering assignment F of the input topology T in using
AssignEffort. Then, the algorithm traverses the topology in post-order fashion.
We create a pair of so-called clusters at each node of the input topology. During
topology traversal, leaf nodes are moved with their clusters towards their parents
(Move operation). Eventually, the clusters are merged with the clusters at the
parent of their node (Merge operation). The node and the clusters get removed
from the topology. At the same time we insert repeaters (mostly inverters) and
build up T . Thus, the topology is successively replaced by the final repeater tree.
First, we will explain clusters. Then, we explain the timing model that we use in
our algorithm. Finally, we describe the main parts of the algorithm.
6.2.1 Cluster
A cluster C is a triple (S(C), M (C), P (C)) which is assigned to a node V (C) in the
topology and consists of
• a set of sinks S(C) containing pins corresponding to sinks of the original
repeater tree instance as well as input pins of repeaters that have already been
inserted earlier,
• a buffering mode M (C) ∈ M or the empty set, and
• a so-called merge point P (C) ∈ R2 .
By an empty cluster we mean (∅, ∅, (0, 0)). The position of a cluster is always the
same as the position of the node to which it is assigned: Pl(C) = Pl(V(C)).
Definition 7. We say that a pair of clusters (C + , C − ) at a node is in parallel mode
if the sink sets S(C + ) and S(C − ) are both non-empty.
For a cluster pair (C + , C − ) in parallel mode, the merge points P (C + ) and P (C − )
are both defined. They store the last location of a cluster where the sink set was
changed. It is the location where the cluster pair entered parallel mode if the sink
set did not change since then.
Figure 6.6 shows a cluster pair in parallel mode and the merge points for both
parities. We depict cluster pairs as two stacked rectangles, a green (above) for
positive sinks and a red (below) for negative sinks in all pictures showing clusters.
As both clusters are always assigned to a node at the same position, we do not show
the node explicitly.
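The cluster triple and Definition 7 translate directly into a small data structure; this sketch is ours, with sink identifiers represented as plain strings:

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

@dataclass
class Cluster:
    sinks: Set[str] = field(default_factory=set)   # S(C)
    mode: Optional[str] = None                     # M(C), None for the empty set
    merge_point: Tuple[float, float] = (0.0, 0.0)  # P(C)

def in_parallel_mode(c_plus: "Cluster", c_minus: "Cluster") -> bool:
    # Definition 7: a cluster pair is in parallel mode iff both sink sets
    # are non-empty.
    return bool(c_plus.sinks) and bool(c_minus.sinks)
```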
Figure 6.6: Example for a cluster pair (C + ,C − ) in parallel mode. Their current position is
P l(C + ) = P l(C − ) = z. We have S(C + ) = {c}, S(C − ) = {a,b,d}, P (C + ) = x
and P (C − ) = y. Parallel mode was entered at point x; the last negative sink
entered at point y.
By moving a cluster and adding repeaters, we want to realize a repeater chain
that corresponds to the buffering mode m := M (C) of the cluster. We say a cluster
has target slews ms and a target repeater mt³. The resulting wires use either mh or
mw as wiring mode, depending on the direction.
6.2.2 Initialization
We want to move the nodes of the input topology T in but keep the instance sinks in
place. Therefore, we start the initialization by replacing each sink s ∈ S in T in
by a new node vs at the same place. Then, we assign a pair of empty clusters, one
for each parity, to each node in the modified topology. Finally, each sink is added
to the sink set of the cluster at node vs with the same parity resulting in cluster
({s}, F (e), (0, 0)) with e being the edge incident to vs .
We initialize the resulting tree by T := (S, ∅).
6.2.3 Timing Model during Repeater Insertion
Our algorithm depends on the timing model during repeater insertion to guide the
decisions. In the process of the algorithm, there are three different structures we
maintain:
• At the top, there are the remaining parts of the initial topology T in with
cluster pairs at all nodes.
• The bottom T is a set of subtrees that will be part of the final repeater tree.
At the beginning, the bottom consists of the instance sinks. Each inserted
repeater extends T possibly merging subtrees and the result is a tree rooted
at the new repeater which is then part of T .
• Clusters connect the topology with the final tree. While each cluster is
associated with a node in the topology, its sink set consists of roots in T .
³ See Section 4.1.3
6.2 Repeater Insertion Algorithm
For the resulting repeater tree, we also maintain the node placement Pl, the repeater
assignment Rt, and the wiring mode assignment Rw.
Figure 6.7: Topology, clusters, and resulting tree during repeater insertion. So far, the
forest of final trees consists of all sinks in clusters and the net n with three sinks
and a driving inverter.
Figure 6.7 shows a possible intermediate state of our buffering algorithm. At the
top, we have the remaining parts of the topology (blue edges) with cluster pairs at
all nodes. The cluster pair at the root is not shown. Dashed black lines show for
each cluster a Steiner tree between the cluster and its sinks. Solid lines show the
final repeater tree. In this example, the set of final trees consists of net n with its
driving repeater and all sinks.
We maintain additional information during the course of the algorithm:
• For each cluster sink s, we know a pair of slew limits Sl(s), a required arrival
time function rat(s), and the pin capacitance capin (s).
• For each cluster C, we maintain a pair of slew limits Sl(C), a pair of slew
targets St(C), the load capacitance cap(C), and a required arrival time function
rat(C).
• For each node v in the topology, we have an arrival time atT in (v) and a
required arrival time value ratT in (v) coming from our delay model.
Note that during buffering rat is a function that, given a cluster sink or a cluster,
returns us a required arrival time function which has to be evaluated for a slew. We
now explain the data structures in more detail:
Cluster Sinks
Cluster sinks are either instance sinks or input pins of inserted repeaters. For
instance sinks, the required arrival time function and slew limit pair are given with
the input.
Each inserted repeater (see below) drives a set of cluster sinks. A Steiner tree is
created between the repeater and the sinks forming a new net. The new net is then
extracted as described in Section 3.2.1 using a minimum Steiner tree.
Given the Elmore delay for each sink, the required arrival times and slew limits
can be propagated backwards to the source of the net (see also Section 2.6f) where
they are merged. The results can then be propagated to the repeater’s input pin.
The input pin is then treated as a new cluster sink with slew limits and required
arrival times.
Clusters
Each time a cluster is modified, for example, the cluster is moved or a sink is added,
we recompute the timing of the cluster.
The cluster and its sinks are treated as a net. A Steiner tree connecting all
pins is computed and extracted using the wiring modes of the cluster’s buffering
mode. Then, similar to the previous section, the rat functions and slew limits are
propagated backwards to the root of the Steiner tree.
In addition, the slew target of the cluster’s buffering mode is treated as a separate
slew limit for each cluster sink and propagated backwards resulting in St(C). The
capacitance of a cluster cap(C) is the sum of sink pin capacitances and the wire
capacitances of the segments in the Steiner tree.
Topology
Delays in the unbuffered topology are estimated using the delay model introduced
in Section 5.1 with parameters dnode and η. For each edge e, we have a wiring delay
F(e)d as given by the edge's buffering mode F(e).
The timing of the topology is calculated by treating each non-empty cluster as a
virtual node of the topology connected to its associated node via a zero-length edge.
For a cluster C at node v, the RAT in the topology ratTin is given by

    ratTin(C) := min{rat^r(C)(F(e)^r_s), rat^f(C)(F(e)^f_s)}    (6.1)
with e being the edge pointing to v. We assume that the topology will try to reach
the cluster with the edge’s target slew F (e)s .
For a node with two non-empty clusters (e.g. node a in Figure 6.7), we assume
that there is an additional virtual node with both virtual cluster nodes as children.
Figure 6.8: An inverter is searched for the positive cluster of pair (C+, C−). Its input pin
will become a sink in S(C−). The position of the inverter is P(C+).
The virtual node is then connected via a zero-length edge to the original node. In
the resulting virtual tree, each node has at most two outgoing edges. The required
arrival time can then be computed for each node discarding values for virtual nodes.
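Equation 6.1 simply evaluates the cluster's rise and fall RAT functions at the incoming edge's rise and fall target slews and takes the minimum. A minimal sketch with illustrative function and parameter names:

```python
def rat_in_topology(rat_rise, rat_fall, target_slew_rise, target_slew_fall):
    """Equation 6.1: RAT of a cluster within the topology delay model.

    rat_rise/rat_fall are the cluster's RAT functions (callables of slew);
    the target slews come from the incoming edge's buffering mode F(e).
    """
    return min(rat_rise(target_slew_rise), rat_fall(target_slew_fall))
```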
6.2.4 Finding a new Repeater
At certain stages of the buffering algorithm, we insert a repeater that drives the sinks
of a cluster or test the effect of such an insertion. We describe this operation for a
cluster C + , which is part of a cluster pair (C + , C − ). The operation is completely
analogous (exchanging + and −) for cluster C − . It will be applied only to non-empty
clusters.
After inserting a new repeater for C+, its input pin is a new sink that is inserted
into an existing cluster C′. This cluster can be
1. C + itself (if we insert a buffer along a path),
2. C − (if we insert an inverter along a path), or
3. a cluster from a different cluster pair (during a Merge operation).
The new repeater is going to drive all sinks in S(C+). The location of the new
repeater depends on the mode of the cluster pair (C+, C−). If the cluster pair is in
parallel mode, we insert a repeater at position P(C+). If the cluster pair is not in
parallel mode, the location of the new repeater is the current position of the cluster,
Pl(C+).
The operation is called InsertRepeater. It takes three parameters: the cluster
for which a repeater is searched, the cluster to which the new sink should be added,
and the type of repeater we want to insert (buffer or inverter).
Figure 6.8 shows the situation when an inverter is searched for the positive cluster
in parallel mode. The new inverter should be inserted into the negative cluster
C′ = C− of the cluster pair.
The routine first computes a Steiner tree for cluster C′ containing the new sink
position. Then, for a repeater t of the requested type, the required arrival time
function and slew limits are computed at the input pin using the load it has to drive:

    rat(t) := ratinvt(cap(C+), rat(C+))
    Sl(t) := slewinvt(cap(C+), Sl(C+)).
A Steiner tree is extracted using the wiring modes from buffering mode M(C′) such
that we have an Elmore delay rc_i for each sink i ∈ S(C′) ∪ {t}. Finally, the RAT
function at C′ is computed:

    rat := min_{i ∈ S(C′) ∪ {t}} ratinv(rc_i, rat(i)).
Using the resulting capacitance cap(C′) of cluster C′, we can compute a new
required arrival time for C′ (see Equation 5.2) and propagate it towards the root
(we have to rebalance dnode with side branches), where we can calculate a slack σ_t.
The weighted slack using the power-time tradeoff ξ is

    σ_t^* := ξ min{σ_t, 0} − (1 − ξ) pwr(t).

Finally, we assume that the resulting cluster C′ is driven by repeater M(C′)t with
input slews M(C′)s. For each sink i ∈ S(C′) ∪ {t}, we propagate the slews through the
Steiner tree, resulting in slew pair s_i. We then compute the sum of slew violations:

    s_t^vio := Σ_{i ∈ S(C′) ∪ {t}} max{s_i − Sl(i), 0}.
We also add possible load violations at both repeaters:

    c_t^vio := max{cap(C′) − loadlim(M(C′)t), 0} + max{cap(C+) − loadlim(t), 0}.
After processing all repeaters, we have for each of them an estimated weighted
slack σ_t^*, its power consumption pwr(t), the load violation c_t^vio, and the slew
violation s_t^vio. We choose the repeater that lexicographically minimizes

    (c_t^vio, s_t^vio, −σ_t^*, pwr(t)).
After having chosen a repeater, we update the resulting tree. The Steiner tree
behind the new repeater and the repeater itself are merged into T and Pl, the function
Rt is updated to reflect the chosen repeater, and Rw is updated to use the wiring
modes from M(C+) for the new edges.
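The selection rule at the end of InsertRepeater can be sketched as a lexicographic minimum over precomputed per-repeater values; the dictionary keys below are illustrative stand-ins for c_t^vio, s_t^vio, σ_t, and pwr(t):

```python
def choose_repeater(candidates, xi):
    """Pick the candidate minimizing (load_vio, slew_vio, -weighted_slack, power).

    Each candidate is a dict with precomputed 'load_vio', 'slew_vio',
    'slack', and 'power'; xi is the power-time tradeoff parameter.
    """
    def key(t):
        weighted_slack = xi * min(t["slack"], 0.0) - (1.0 - xi) * t["power"]
        return (t["load_vio"], t["slew_vio"], -weighted_slack, t["power"])
    return min(candidates, key=key)
```

Because violations dominate the ordering, a repeater without load or slew violations is always preferred, regardless of its slack.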
Algorithm 5 Buffering Algorithm
 1: procedure Buffering(T in)
 2:   AssignEffort(T in)
 3:   Initialize the topology for buffering T in.        ▷ See Section 6.2.2
 4:   Initialize result (T, Pl, Rt, Rw).
 5:   while |V(T in)| > 1 do
 6:     Choose leaf v ∈ V(T in)
 7:     if Pl(v) ≠ Pl(parent(v)) then
 8:       Move(v)
 9:     else
10:       Merge(v)                                       ▷ Results in the removal of v
11:     end if
12:   end while
13:   ConnectRoot
14: end procedure
6.2.5 Buffering Algorithm
The overall structure of the Buffering algorithm is described in Algorithm 5.
Input to the algorithm is a topology. After having assigned new buffering modes to
the topology, the data structures are initialized as described above. The topology
is then successively modified to create a repeater tree. Leaves are moved towards
their parent nodes and merged with them until only the root node is left. In a final
step, the last remaining sinks are connected to the root.
In the next section, we describe the Merge operation because it is also used in
the Move operation which we explain later.
6.2.6 Merging operation
When some node l and its cluster pair have been moved to the position of another
cluster pair, they will be merged. Let (Cl+, Cl−) be the cluster pair that arrives along
arc e at the cluster pair (Cr+, Cr−), which is at the tail of arc e. We compute the
merged cluster pair (C+, C−) using the Merge operation.
If |S(Cl+)| · |S(Cr+)| = 0 and |S(Cl−)| · |S(Cr−)| = 0, the merging operation is
straightforward. We set C+ := Cr+ if |S(Cl+)| = 0 and C+ := Cl+ otherwise. The
same is done for C−. If the resulting cluster pair is not in parallel mode, the merge
points are not updated. Otherwise, the merge point is set to the current cluster
position for both parities.
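The straightforward case keeps, for each parity, the cluster that carries the sinks. A hedged sketch with clusters modeled as (sinks, mode, merge_point) tuples (our own modeling, not the thesis code):

```python
def merge_straightforward(c_left, c_right):
    """For one parity with at most one non-empty cluster, keep the non-empty one."""
    sinks_left = c_left[0]
    return c_left if sinks_left else c_right
```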
In the remaining cases, there are five options: inserting an inverter driving one of the four
clusters Cl+, Cl−, Cr+, Cr−, or merging clusters of the same parity without inserting
any inverter. Note that this does not exclude the possibility of inserting an inverter
later, as merge points are (re)defined if there are sinks of both parities after merging.
Therefore, we do not evaluate possibilities that can be realised by resolving parallel
mode later. We evaluate the remaining fifteen possibilities of inserting inverters in
front of one or more clusters as shown in Figure 6.9.

Figure 6.9: The possible merge configurations that are tested during a Merge operation. In
each case, the clusters (Cl+, Cl−) arrive from the left and the clusters (Cr+, Cr−)
arrive from the right side of the resulting cluster pair.
For example, in Case 12 we would first put an inverter in front of Cl+ and put
its input pin as sink into Cl− . Then, we would add another inverter in front of Cl− .
The result would be C + . We would also put an inverter in front of Cr+ and add its
input as sink into Cr− resulting in C − .
Possibilities are only evaluated if we do not have to insert a repeater in front of
an empty cluster. Table 6.1 shows which of the fifteen cases are evaluated if a set of
clusters is empty. For example, if Cl+ and Cr+ are empty (first row in the table),
then only cases 1, 3, 5, 9, 11, 13, 15 are evaluated. In all other cases, we would try
to insert one or two repeaters either in front of Cl+ or Cr+ .
Cl+   Cl−   Cr+   Cr−   Cases
 ∅           ∅          1, 3, 5, 9, 11, 13, 15
       ∅           ∅    1, 2, 4, 8, 10, 12, 14
 ∅                      1, 3, 4, 5, 7, 9, 10, 11, 13, 15
       ∅                1, 2, 4, 5, 6, 8, 10, 11, 12, 14
             ∅          1, 2, 3, 5, 6, 8, 9, 11, 13, 15
                   ∅    1, 2, 3, 4, 7, 8, 9, 10, 12, 14
                        all

Table 6.1: Depending on which of the four clusters are empty, the table shows which of
the fifteen cases of Figure 6.9 are considered.
To evaluate a case, repeaters are tentatively inserted using the InsertRepeater
function. Inverters are used if they are available. If a case would add two inverters
in front of an input cluster without additional side sinks, we add a single buffer
instead. For example, this can happen in Case 12, as explained above, if cluster Cl−
is empty at the beginning. Then only a single buffer is added to drive Cl+, resulting
in C+.
Similarly to the InsertRepeater function, the resulting cluster pair has required
arrival time functions at the current position and load capacitances. We use them to
compute a slack in our delay model. We evaluate power consumption, slew violations,
and load violations for both clusters in the same way as in InsertRepeater and
add them up together with the values from each InsertRepeater invocation.
As a result, we get for each case a slack σ, a sum of slew violations s^vio, a sum of
load violations c^vio, and the total power consumption pwr. We compute the weighted
slack using the power-time tradeoff ξ:

    σ^* := ξ min{σ, 0} − (1 − ξ) pwr.

Among all cases, we choose the solution that lexicographically minimizes

    (c^vio, s^vio, −σ^*, pwr).
We realize the chosen case and update our data structures accordingly. At the
end, we replace (Cr+ ,Cr− ) by (C + , C − ) and remove node l together with its clusters
and incoming edge from the topology.
Handling Different Buffering Modes
During the Merge operation, it can happen that we have to merge clusters with
different buffering modes. In such a case, we treat both clusters as if they had the
wiring mode with the better wire delay. This prevents us from arbitrarily worsening
the delay to the sinks that are currently in the cluster with the better wire delay.
For example, we would not set a cluster that drives a long distance on higher planes
to the default planes.
6.2.7 Moving operation
We now describe how to move cluster pairs within the topology towards the root. A
leaf node of the current topology is moved together with its associated cluster pair
(C + , C − ) along the incoming edge. Algorithm 6 shows the Move operation.
First, we have to decide how far both clusters can be moved. Procedure RemainingDistance computes the maximum moving distance for a given cluster C. If the
sink set S(C) is empty, we return infinite distance. Otherwise, we assume that the
repeater M (C)t that is optimal for the cluster’s buffering mode M (C) drives a wire
segment and the cluster’s sinks. We maximize the length of the wire segment such
that slew targets and limits are not violated.
Algorithm 6 Move Procedure
 1: Let p = parent(v)
 2: l+ := RemainingDistance(C+(v))
 3: l− := RemainingDistance(C−(v))
 4: l := min{l+, l−}
 5: if l ≥ ||Pl(v) − Pl(p)|| then
 6:   Pl(v) := Pl(p)
 7: else
 8:   if |S(C+(v))| > 0 and |S(C−(v))| > 0 then
 9:     ResolveParallel(v)
10:   else
11:     if Bl((p, v)) = 1 then
12:       InsertRepeater(v)
13:       Pl(v) := Pl(p)
14:     else
15:       Choose z with ||Pl(v) − z|| = l minimizing ||z − Pl(p)||
16:       Pl(v) := z
17:       InsertRepeater(v)
18:     end if
19:   end if
20: end if
Figure 6.10: We search the maximum distance x by which cluster C can be moved such
that, given slew targets at the input a of an optimal repeater, slew limits and
slew targets at C are not violated.
Figure 6.10 shows the situation for positive cluster C. A wire segment of length
x is added in front of the cluster. Let m := M(C) be the buffering mode stored at
cluster C. Repeater mt drives the resulting wire, and we assume slew pair ms at
input pin a. Let rc be the Elmore delay of the wire segment. We can now compute
the slew pair arriving at C:

    s_out = wireslew(rc, slewmt(x · mwirecap + cap(C), ms)).

We then search for the maximum x such that s_out ≤ Sl(C) and s_out ≤ St(C) using
binary search.
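The RemainingDistance search can be sketched as a standard binary search over the wire length; slew_at is an assumed monotone stand-in for the composed wireslew/slewmt model above:

```python
def remaining_distance(slew_at, slew_limit, slew_target, x_max, eps):
    """Largest x with slew_at(x) within both the slew limit Sl(C) and target St(C).

    slew_at(x) models the slew arriving at the cluster after a wire of
    length x; it is assumed to be monotonically increasing in x.
    """
    bound = min(slew_limit, slew_target)
    if slew_at(0.0) > bound:
        return 0.0                      # even a zero-length wire violates the bound
    lo, hi = 0.0, x_max
    while hi - lo > eps:                # stop once the interval is small enough
        mid = (lo + hi) / 2.0
        if slew_at(mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo
```

The search stops once the interval drops below eps; the thesis uses the width of the smallest repeater as this threshold.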
After computing the remaining distance for both clusters, we take the minimum
for l. If a node and both clusters can be moved to the parent’s place, the move is
performed and both clusters are updated.
Otherwise, we have to insert a repeater. In the non-parallel mode one of the two
clusters, say C − , is empty and we insert a repeater for C + using InsertRepeater.
We use a buffer if mt is a buffer and we use an inverter otherwise.
The position of the new repeater depends on whether the edge is blocked or not.
For a blocked edge, the repeater is added at the head of the edge. The resulting
cluster is moved to the parent node, and the cluster timing is updated. For an
unblocked edge we search a point along the path to the parent node such that the
distance to the current position is l and the distance to the parent is minimized.
The cluster is then moved and the solution from InsertRepeater is realised.
Resolving Parallel Mode
Both clusters are non-empty in parallel mode, and merge points are defined for both.
We resolve such a situation by treating the cluster pair (C+, C−) as two cluster
pairs, (C+, C′) and (C″, C−), with empty dummy clusters C′, C″, and using the
Merge operation. However, we restrict the procedure to choose from Case 2 and
Case 5 in Figure 6.9, the two valid operations that directly resolve parallel mode.
Running time
We stop the binary search as soon as the difference between the upper and lower
bound gets smaller than the width ltmin of the smallest repeater. The running time
of a single invocation of RemainingDistance is in O(log(lmax/ltmin)), with lmax being
the length of the longest edge in the topology.
6.2.8 Arriving at the root
When the last leaf node arrives at the root, we have to connect the remaining cluster
pair to the root pin. As we get the load dependent arrival times from the root, we
no longer depend on the topology delay model. Instead, we enumerate all possible
solutions for connecting the root via a sequence of zero, one, or two repeaters.
In non-parallel mode all possible sequences that result in a correct parity are
connected to the cluster with non-empty sink set. If the clusters are in parallel
mode, we have to resolve it. In contrast to resolving a parallel mode in the Move
operation, we search for an inverter for the positive cluster and an inverter for the
negative cluster using InsertRepeater.
If the clusters are not in parallel mode, only one cluster has a non-empty sink set.
We then create a chain of zero, one or two repeaters at the root for all combinations
of repeaters that have the correct parity. The chain connects the root to the cluster
sinks. Arrival times and slews are propagated from the root to the cluster.
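Enumerating the chains can be sketched as follows; repeaters are modeled as (name, inverting) pairs, and all names here are illustrative:

```python
from itertools import product

def root_chains(library, need_inversion):
    """All chains of zero, one, or two repeaters with the required overall parity."""
    chains = []
    if not need_inversion:
        chains.append(())                         # direct connection without a repeater
    for r in library:
        if r[1] == need_inversion:                # a single repeater of the right parity
            chains.append((r,))
    for a, b in product(library, repeat=2):       # two repeaters in sequence
        if (a[1] != b[1]) == need_inversion:
            chains.append((a, b))
    return chains
```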
For each combination that we evaluate, we get an estimated slack σ, the sum of
slew violations s^vio, the sum of load violations c^vio, and the total power consumption
pwr. Similar to the InsertRepeater and Move operations, we compute the
weighted slack
    σ^* = ξ min{σ, 0} − (1 − ξ) pwr

and lexicographically minimize

    (c^vio, s^vio, −σ^*, pwr).
If the clusters are in parallel mode, we search for an inverter for each cluster
and then try all combinations in the same way as in the non-parallel case. The
overall best solution is then chosen using the same criteria as above, and all data
structures are updated accordingly, creating the final repeater tree.
6.2.9 Running Time
It is possible that the buffering algorithm that we presented does not terminate if it
gets stuck by making no progress in the Move operation. This could happen, for
example, if the cluster is not allowed to move but inserting any repeater in front of
the cluster creates the same cluster or one with worse constraints. To prevent this
problem, we move at least the width of the smallest repeater before we insert a new
one. This is no limitation because after legalization of the repeaters’ placement no
two repeaters are allowed to overlap.
We also limit the number of sinks allowed in a cluster by a constant as this
is often done in practice by designers. This allows us to also bound the running
time necessary to evaluate a possibility in InsertRepeater, Merge or for root
connection by a constant as Steiner trees are computed over a bounded set of
terminals.
In practice, we run the Network Simplex Algorithm to solve the Buffering
Mode Assignment Problem. However, it is not a polynomial-time algorithm.
For running time considerations, we use Orlin's algorithm (Orlin, 1993), which solves
the Minimum Cost Flow Problem in O(m log m(m + n log n)) time, with m being the
number of edges and n the number of nodes. As we work on a series-parallel graph
with roughly twice as many edges as nodes, the running time becomes O(m²(log m)²).
Converting a solution into a basic tree solution takes at most m iterations with
linear running time. Finding the AssignEffort solution therefore has a worst case
running time of O(m²(log m)²).
Theorem 5. Given an instance of the Repeater Insertion Problem with input
topology T in, set M of buffering modes, and repeater library L, the worst case
running time of the algorithm is

    O(Cm²|L| + (r + m)(log(lmax/ltmin) + Cm|L|) + C|L|² + (|M|m)²(log(|M|m))²)

with m = |E(T in)| and r being the number of inserted repeaters in the output. The
computation of each Steiner tree during the algorithm is bounded by C.
Proof. To solve the Minimum Cost Flow Problem, an input graph is constructed
with at most 2|M|m edges. The assignment itself runs in linear time. In total,
AssignEffort runs in O((|M|m)²(log(|M|m))²).
A single invocation of InsertRepeater runs in O(C|L|) time in the best case
and O(Cm|L|) in the worst case. The worst case arises if, for calculating a slack
for a solution, we have to traverse a significant number of edges of the topology.
A call to Merge makes a bounded number of calls to InsertRepeater and is
therefore also in O(Cm|L|). There are O(m) calls to Merge, resulting in a total
worst case running time of O(Cm²|L|).
Move executes the binary search in a worst case running time of O(log(lmax/ltmin)),
where lmax is the longest topology edge and ltmin is the smallest repeater width.
The binary search is followed by at most one call to Merge or InsertRepeater.
The number of calls to Move is bounded by r + m. The total worst case running
time is in O((r + m)(log(lmax/ltmin) + Cm|L|)).
Connecting to the root is in O(C|L|²), as at most two repeaters are tried and each
combination can be evaluated in constant time because there is only one edge left
to the root. Putting everything together, we get the claimed running time.
6.2.10 Repeater Insertion - Summary
As we show in the experimental results, our repeater insertion algorithm produces
good results very quickly. For repeater libraries as they appear in practice, it
generally finds a solution without capacitance violations if such a solution exists.
Slew violations are also avoided most of the time, but they appear slightly more
often than capacitance violations. This is mainly due to tighter slew limits and the
fact that the Steiner trees used for net extraction can change their topology if a sink
or the driver is moved slightly.
Strictly Following the Topology
The repeater insertion algorithm presented here does not create repeater trees that
follow the input topology strictly. This is one of its main features. Instead, some
Figure 6.11: A topology (left) and the resulting repeater tree (right). Topology detours are
removed by recomputing Steiner trees.
parts of the topology are used twice while moving clusters in parallel. Other parts
are discarded due to recomputed Steiner trees. Figure 6.11 shows an example of
how a detour is discarded during repeater insertion.
Figure 6.12: A topology (left) and the resulting repeater tree (right). By using a Steiner
tree instead of following the topology, a detour for the sink with positive parity
is avoided (orange dashed line).
In practice, recomputing Steiner trees gives better results than keeping the input
topology because delay calculations are closer to the final (pre-routing) timing.
Figure 6.12 shows how detours are avoided that would be induced by following
topologies in parallel.
There are cases where it is desirable to strictly follow a topology, for example,
if an existing routing should be buffered such that the result can use the same
routes. This can be done by changing clusters to also store a subtree of the topology
connecting the cluster to its sinks. Delay calculations are then performed on the
stored tree.
6.3 Dynamic Programming
In his groundbreaking paper, van Ginneken (1990) proposed a dynamic programming
algorithm for buffering repeater tree topologies that maximizes the slack at the root.
The algorithm worked only for a single buffer type and a single wiring mode; its
running time was O(n²), where n is the number of buffer positions. In addition to
the input of the Repeater Insertion Problem, the algorithm needs buffering
positions along the topology. The canonical approach is to add repeater positions
equidistantly along the topology. However, it makes sense to choose buffer positions
based on library and input characteristics, as shown by Alpert et al. (2004b).
Later, Lillis et al. (1996a) extended the approach to handle a library consisting of
b buffers or inverters with a running time of O(n²b²). They also proposed a way to
handle power consumption. However, the algorithm is not polynomial.
The running time of the algorithm was later improved (Shi and Li, 2005; Li
and Shi, 2006; Li et al., 2012) by using clever data structures and better pruning
techniques than in the previous papers. For instances with only a single sink, they
achieve a running time of O(b²n). For nets with m sinks, the running time becomes
O(b²n + bmn).
There are a lot of extensions to the basic version of Lillis et al. (1996a). There
are works considering higher-order delay models (Alpert et al., 1999; Chen and
Menezes, 1999), simultaneous buffer insertion and tree construction (Okamoto and
Cong, 1996; Hrkić and Lillis, 2002, 2003; Hu et al., 2003), segmenting wires (Alpert
and Devgan, 1997), or minimum buffer insertion under slew constraints (Hu et al.,
2007).
For an instance of the Repeater Insertion Problem with given repeater
positions, the task of finding a repeater tree with maximum slack can be solved
efficiently as we have just seen. If one wants to get the cheapest solution that satisfies
the slack targets, then the problem becomes NP-complete even if one ignores load
limits at the source and repeaters as shown by Shi et al. (2004). An FPTAS for the
problem was presented by Hu et al. (2009).
6.3.1 Basic Dynamic Programming Approach
We have implemented a version of the dynamic program as introduced by Lillis et
al. and added some improvements. We did not use the running time improvements
by Li, Zhou and Shi for several reasons that we give below.
The dynamic programming algorithm works with sets of candidates that are characterized by a required arrival time, a downstream capacitance, and a solution subtree.
Candidates are propagated bottom up by adding wire segments and adding repeaters
at buffering positions. A new candidate for each repeater type is created at each
buffering position as long as capacitance limits are not violated. At inner nodes of
the topology, the candidates of the left and right branch are merged together by
adding all combinations to the candidate list.
An explosion of candidates is prevented by only keeping candidates that are not
dominated (i.e. there is no other candidate with better or equal RAT and lower
or equal capacitance). At each node, we have two candidate lists for subtrees that
need a positive or negative signal to preserve parity.
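The dominance pruning can be sketched with a sort-and-sweep over (RAT, capacitance) pairs; this is a standard formulation of the rule, not the exact thesis code:

```python
def prune(candidates):
    """Keep only non-dominated (rat, cap) candidates.

    A candidate dominates another if its RAT is at least as large and its
    load capacitance is at most as large. Sorting by capacitance ascending
    (ties: RAT descending) lets a single sweep keep the Pareto frontier.
    """
    kept, best_rat = [], float("-inf")
    for rat, cap in sorted(candidates, key=lambda c: (c[1], -c[0])):
        if rat > best_rat:              # strictly better RAT than all cheaper candidates
            kept.append((rat, cap))
            best_rat = rat
    return kept
```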
Candidates are created at sinks. As the dynamic programming algorithm does not
work with RAT functions, we collapse the RAT to a single value by evaluating the
functions at optslew.
We compare the quality of the Fast Buffering algorithm with the dynamic program
algorithm in Section 8.3.
6.3.2 Buffering Positions
The running time and quality of a van Ginneken style algorithm highly depends
on the choice of repeater positions. The result of our repeater insertion algorithm
(Algorithm 5) is a complete repeater tree. Repeater nodes and Steiner nodes are
used as buffering positions. If there is a node v such that
• v is a sink and has outdegree higher than 0,
• v is a root and has outdegree higher than 1, or
• v is an inner node and has outdegree higher than 2,
then we add additional nodes at the same positions and reconnect children to them
until no node satisfies any of the above conditions. We create an additional buffering
position at each sink.
Optionally, we also split long edges in the resulting tree such that there is a
potential buffer position at most after a given length. As the result of Fast Buffering
already has near-optimal distances between consecutive repeaters on long lines, there
is little benefit in splitting the edges further.
The repeater tree from Fast Buffering already navigates around big blockages.
Thus, it is not necessary to worry much about them. However, the nodes of the
created Steiner trees can lie above blockages. Thus, we add repeater positions at
edges that cross a blockage boundary. Finally, all nodes in the interior of blockages
are marked as blocked. They are not used as repeater positions.
6.3.3 Extensions to Dynamic Programming
We describe in this section our changes to the basic dynamic program algorithm.
Our changes are motivated by timing properties as observed in practice. By using
some more accurate calculations the algorithm achieves better slacks than the basic
version by Lillis et al. (1996a). This makes the algorithm suitable as postprocessing
to Fast Buffering.
Black Box Timing Rules
As we have already discussed in Section 2.5.1, we do not exploit some properties
of the Elmore delay and work with black box wiredelay and wireslew functions.
This prevents us from using most of the techniques suggested by Shi, Li, and others
to speed up the dynamic program. Each candidate knows the position of the last
inserted repeater or the last merge of several topology branches. We call the position
the sink of a candidate. For the sink, we always keep the required arrival time and
the capacitance up-to-date. Instead of updating the sink when wire segments are
added, we just accumulate the RC-delay and compute the required arrival time
on demand. This makes the algorithm slower than assuming pure Elmore delays
but improves the result quality, because we are nearer to the final timing over wire
segments. For running time reasons, we only want to have one sink for each
candidate. Thus, we update the sink at merge points.
Slew Effects
While computing the required arrival time for a candidate after adding a wire
segment or a repeater, we do not know the slew that will arrive at the candidate. This makes
the calculations inaccurate and can lead to pruning of otherwise optimal candidates.
One solution to mitigate the problem is the use of buckets of discrete slew values
as for example proposed by Hu et al. (2007). For minimizing power consumption,
this is a viable solution. However, for optimizing slack, this leads to an explosion of
candidates. The resulting running times make such a solution impractical if one
wants to handle millions of instances.
While we still assume a prototype slew at the inputs of our candidates, we see
the resulting slew at the sink of each candidate. We use the difference between
the slew that we assumed for the required arrival time calculation at the sink and
the arriving slew to estimate the real required arrival time. Given a required arrival
time RAT at a candidate’s sink and an arriving slew s, we compute the required
arrival time that we use for further calculations as:
rat = RAT − slewdelay(s).
See Section 4.1.4 for the slewdelay function.
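A minimal sketch of this correction; the linear slewdelay model and the prototype slew value below are assumptions for illustration (the actual slewdelay function is defined in Section 4.1.4).

```python
# Sketch of the slew correction: the RAT at a candidate's sink was computed
# under a prototype slew; when the actual slew s arrives, the usable required
# arrival time is reduced by slewdelay(s). The linear penalty model and the
# prototype slew are illustrative assumptions.
def slewdelay(s: float, slope: float = 0.5, prototype_slew: float = 20.0) -> float:
    # penalty grows with the amount the arriving slew exceeds the prototype
    return max(0.0, slope * (s - prototype_slew))

def corrected_rat(sink_rat: float, arriving_slew: float) -> float:
    return sink_rat - slewdelay(arriving_slew)

better = corrected_rat(100.0, 10.0)   # slew better than prototype: no penalty
worse = corrected_rat(100.0, 30.0)    # slew worse than prototype: penalized
```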
Wiring Mode Assignment
An important extension to the dynamic program is the handling of different wiring
modes. For this, each candidate has an assigned buffering mode, of which only
the horizontal and vertical wiring modes are used. All candidates with the same
buffering mode and parity are kept in one candidate list. Only candidates with the
same buffering mode are merged during the merging step.
When we realise the net behind a candidate, all horizontal (vertical) wiring
segments of the net will get the same horizontal (vertical) wiring mode. We use the
wiring modes stored in the buffering mode of the candidate. Physically, multiple
wiring modes per net would be possible, but most industrial routers tolerate only
a single wiring mode per net and dimension. After a repeater has been inserted
into a candidate, a buffering mode change can occur. Thus, the resulting candidate
is copied into the lists of all modes. Such an extension was first proposed by
Alpert et al. (2001a). They also showed that wire tapering, that is, the continuous
assignment of widths to wire segments, only has marginal advantages compared
to assigning a single wiring mode for each dimension to the whole net if there are
enough modes available. Following this, we disabled the changing of buffering modes
at repeater positions where no repeater was inserted. With this, the number of
buffering modes affects the overall running time only linearly.
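The mode bookkeeping can be sketched as follows; the dict-of-lists layout, the placeholder merge rule, and the per-mode delays are illustrative assumptions.

```python
# Sketch: candidates are bucketed by buffering mode; merging is only allowed
# within one mode, and only a real repeater insertion lets a candidate change
# its mode (it is then copied into every mode's list).
from collections import defaultdict

MODES = ["slow", "fast"]

def merge_step(lists):
    """Merge candidates mode by mode; candidates of different buffering
    modes are never combined."""
    merged = defaultdict(list)
    for mode, cands in lists.items():
        if cands:
            # placeholder merge rule: keep the candidate with the best RAT
            merged[mode].append(max(cands, key=lambda c: c["rat"]))
    return merged

def insert_repeater(candidate, delay_by_mode):
    """After a repeater is inserted, the result may switch modes, so it is
    copied into the candidate lists of all modes."""
    out = defaultdict(list)
    for mode in MODES:
        c = dict(candidate)
        c["rat"] -= delay_by_mode[mode]   # mode-specific repeater delay
        c["mode"] = mode
        out[mode].append(c)
    return out

lists = {"slow": [{"rat": 50.0, "mode": "slow"}, {"rat": 60.0, "mode": "slow"}],
         "fast": [{"rat": 80.0, "mode": "fast"}]}
merged = merge_step(lists)
copies = insert_repeater(merged["fast"][0], {"slow": 10.0, "fast": 4.0})
```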
Our layer assignment routine will not use the same set of buffering modes uniformly.
Instead, each edge of the input topology gets a set of possible buffering modes.
Candidates that arrive at an edge with a buffering mode that is not assigned to the
edge are simply discarded. To decide which buffering modes are available at an edge, the blockage
85
6 Repeater Insertion
and congestion maps are used. We remove modes that would cause too high
congestion or are blocked. It can happen that all edges incident to a buffering
position that lies on top of a blockage have an empty mode set. As we cannot
add a repeater there and are not allowed to switch the buffering mode within a net, this
would lead to empty candidate sets on all layers. In such cases, we ignore the
congestion map and allow the lowest buffering mode on all edges incident to the
repeater position.
Distinguishing Rise/Fall
A typical instance to the Repeater Insertion Problem has different required
arrival times for rise and fall as well as different arrival times at the root. Most
implementations of the dynamic program do not distinguish between both values.
Instead, they settle for a single value such as the average or the worse of both. To
compute the delay over a repeater, also a single value is used. We have seen that
several instances can be built better if one considers both values separately.
The candidates in our algorithm have a time pair as required arrival time. When
a wire is added or a buffer is inserted, the values can be updated separately by
calculating the corresponding delays. This doubles the effort compared to
propagating only a single number.
The pruning step uses only the worse of the two required arrival times when
comparing candidates, to prevent an explosion of the number of candidates that we
have to handle. Our experiments showed that the benefit of only pruning candidates
that are dominated in both required arrival times was small, but it came at the cost
of high running times.
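A sketch of this pruning rule; the candidate representation and the capacitance-sorted dominance check are simplifying assumptions.

```python
# Sketch of pruning with rise/fall RAT pairs: candidates carry both values,
# but dominance for pruning is decided on the worse of the two, which keeps
# the candidate count small. Structures and names are illustrative.
def worst(pair):
    return min(pair)  # the smaller RAT is the more critical one

def prune(cands):
    """Keep, in order of increasing capacitance, only candidates whose worse
    RAT strictly improves on every cheaper candidate."""
    cands = sorted(cands, key=lambda c: c["cap"])
    kept, best = [], float("-inf")
    for c in cands:
        w = worst(c["rat"])
        if w > best:          # strictly better worse-RAT than all cheaper ones
            kept.append(c)
            best = w
    return kept

cands = [
    {"cap": 5.0, "rat": (40.0, 35.0)},
    {"cap": 7.0, "rat": (30.0, 35.0)},   # dominated: more cap, worse RAT
    {"cap": 9.0, "rat": (50.0, 45.0)},
]
survivors = prune(cands)
```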
Slew Limits
In addition to the required arrival times, we also propagate a slew limit backwards.
The slew limit of a candidate is the maximum slew that can arrive at the current
node such that the slew limits are not violated in the whole subtree of the candidate.
Candidates are pruned as soon as their slew limit is unreachable, unless this would
prune all candidates.
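A sketch of this backward propagation; the linear slew degradation model and the minimum reachable slew are assumed values for illustration.

```python
# Sketch of backward slew-limit propagation: the limit of a candidate is the
# largest slew that may arrive at the current node without violating any slew
# limit in its subtree. The degradation model is an illustrative assumption.
def wire_slew_degradation(length_um: float) -> float:
    return 0.1 * length_um      # assumed: slew worsens linearly with length

def propagate_limit(child_limits, length_um: float) -> float:
    """Limit at the current node = tightest child limit minus what the
    connecting wire will add to the slew."""
    return min(child_limits) - wire_slew_degradation(length_um)

MIN_REACHABLE_SLEW = 5.0        # assumed best slew any driver can produce

limit = propagate_limit([30.0, 24.0], length_um=40.0)
prunable = limit < MIN_REACHABLE_SLEW   # candidate pruned if unreachable
```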
Candidate Selection
As soon as we arrive at the root, we have to choose the best candidate to return
a solution. Instead of relying on the required arrival times of the candidates, we
recompute the whole timing of each candidate, propagating the slew accurately from
the root to the sinks. The sinks are then also evaluated using their required arrival
time functions instead of constant values.
Power-aware Dynamic Programming
Lillis et al. (1996a) showed how to extend the dynamic program to find the cheapest
solution satisfying the required arrival time constraints. For their FPTAS for
the Repeater Insertion Problem with buffering positions, Hu et al. (2009)
discretized the power consumption values of repeaters into cost buckets. We have
also extended our implementation to work with cost bins. Basically, each candidate
is now characterized by three values: cost, cap, and rat. Every time a buffer is
added or candidates get merged, the result has to be inserted into the correct bucket.
There is a limited number of buckets. All candidates using more power than the
highest-valued bucket will be merged together. Unfortunately, the running time of
this version is prohibitively high when using a number of buckets that is sufficient for
good results. In comparison to the Fast Buffering solution, the timing results are
very good and the power consumption is smaller than for the basic dynamic program
version (see Table 8.4). However, the running time prevents this version from being
used extensively in production.
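The bucketing can be sketched as follows; the bucket layout and the per-bucket pruning rule are illustrative assumptions.

```python
# Sketch of the cost-bucket extension: candidates are binned by power
# consumption into a fixed number of buckets; everything above the last
# bucket is merged together. Layout and names are assumptions.
def bucket_index(cost: float, max_cost: float, n_buckets: int) -> int:
    if cost >= max_cost:
        return n_buckets - 1        # overflow bucket collects the rest
    return int(cost / max_cost * (n_buckets - 1))

def insert(buckets, candidate, max_cost):
    """Keep, per bucket, only the candidate with the best RAT."""
    i = bucket_index(candidate["cost"], max_cost, len(buckets))
    if buckets[i] is None or candidate["rat"] > buckets[i]["rat"]:
        buckets[i] = candidate

buckets = [None] * 40               # 40 buckets, as in the experiments below
for c in [{"cost": 3.0, "rat": 50.0}, {"cost": 3.1, "rat": 60.0},
          {"cost": 120.0, "rat": 90.0}]:
    insert(buckets, c, max_cost=100.0)
```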
We have taken random instances of different characteristics from a 22 nm design
and compared the results of the power-aware version of the dynamic program to the
basic one. The values are shown in Table 6.2. All runs of the power-aware version use 40
buckets between 0 and the power consumption of the Fast Buffering solution. We
see that the sum of negative slacks and the worst slack are similar for both
runs. The basic version uses a lot of area for bigger instances, while the running
time of the power-aware version explodes. See Chapter 8 for details of the hardware
setup and the instances. A comparison between Fast Buffering and the power-aware
dynamic program on the same instance set can be found in Section 8.1.
                 Dynamic Program              Dyn. Program + Buckets
     Sinks     SNS  Slack  Area  Time       SNS  Slack  Area   Time
I01      1    −378   −378   154    11      −377   −377   160    374
I02      1     −91    −91    12     2       −91    −91    14     56
I03      2    −414   −207   138    13      −413   −206   150    446
I04      2    −161    −81     6     4      −161    −81     6    128
I05      3    −161    −61    16     5      −143    −61    11    214
I06      3    −177    −68    42     9      −162    −68    35    370
I07      4     −22    −11    25     8       −22    −11    25    351
I08      4    −137    −36    24     9      −133    −36     8    371
I09      5    −371   −111    48    10      −351   −113    40    487
I10      8    −112    −18    48    16      −117    −20    34    829
I11     10    −444    −69    51    20      −387    −68    22    886
I12     15    −694    −59   187    55      −698    −59   116   2820
I13     24     −71    −15    89    57       −60     −8    38   3660
I14     33   −3317   −188   309    70     −3167   −189   144   3711
I15     47   −2563   −105   310    96     −2922   −109   101   5372
I16     65   −1523    −96   310   181     −3852    −96    48  13125
I17     73   −3762    −75   355   144     −3424    −78   132  10078
I18    120  −10275   −104   510   345    −10701   −107   333  22624
I19    322       0     25  1427   777         0     25   851  46548
Table 6.2: Results of optimizing for slack and power on several instances from a 22 nm
design using the basic version and the power-aware version of the dynamic
program. All times are given in ps. SNS is the sum of negative slacks for all
sinks. Slack is the worst slack of the instance. Area is the space consumed by
the internal repeaters measured in placement grid steps. Time gives the running
time of the dynamic program excluding other parts in milliseconds.
7 BonnRepeaterTree
The repeater tree algorithm that we described in Chapter 5 and Chapter 6 has been
implemented together with a small framework for repeater tree optimization as part
of the BonnTools (Korte et al., 2007; Held et al., 2011) suite of physical design
optimization tools developed at the Research Institute for Discrete Mathematics,
University of Bonn in an industrial cooperation with IBM. The main algorithm
together with some utility tools and its APIs is called BonnRepeaterTree.
BonnTools are now part of the IBM electronic design automation tools. They
have to work with the requirements of industrial physical design. Thus, a huge
amount of development work is spent to cope with real-world designs.
This chapter describes the aspects one has to consider if one wants to implement
our algorithms for an existing physical optimization environment. We start with
details that are valid for a whole range of repeater tree instances, like the repeater
library or blockage map. We then show how a single instance is processed. We
take a brief look at the BonnRepeaterTree software architecture and finish with a
description of two tools that use our framework.
7.1 Repeater Library
A typical standard gate library contains many different repeaters. There are repeaters
for special purposes, such as repeaters for clock trees or repeaters that should just
add delay to signals. Then there are standard repeaters that are used within
repeater trees. They are sorted into families of similar properties.
First, there are repeaters of different Vt -levels. The threshold voltage (Vt ) is
the voltage at which the gate starts to switch. Gates with a lower Vt -level are faster
because they switch earlier, but their power consumption is much higher due to
higher leakage. The optimization flow or the designer sets the currently active
Vt -level. BonnRepeaterTree prefers repeaters from the active Vt -level.
Second, repeaters are distinguished by their beta ratio (i.e. the difference between
the fall and rise delays). Repeaters can be built such that for similar inputs the rise
and fall delays are either balanced or asymmetrical. Chains consisting of balanced
repeaters are usually slower than chains with unbalanced ones. We perform
long-distance calculations for each repeater (see Section 4.1.1) and then choose the
family with the fastest repeaters.
There are repeaters of different sizes or BHCs (Block Hardware Codes) within
each family. Smaller repeaters consume less leakage power. They have lower load
limits and are very sensitive to the load. In general, they are also slower than larger
repeaters that can drive higher load capacitances. The largest repeater is often the
gate type that can drive the highest capacitance over all gates in the whole library.
For a given Vt -level and beta ratio, we often use the whole family of BHCs and only
hide the smallest repeaters because they are too sensitive, so that small changes in
routing can result in huge timing differences.
As the running time of all repeater tree algorithms depends on the size of the
library, one might choose to work with a subset of a family. For example, Alpert
et al. (2000) proposed an algorithm to select a proper set of repeaters such that
the results do not deteriorate too much. While such an approach would certainly
improve the running time of our algorithms, the times we see in practice are fast
enough to keep all repeater sizes (except the smallest ones as described above).
The sizes of some example libraries are shown in Table 8.3. Typically, the library
has between 15–25 inverters and buffers. There are libraries that consist only of
buffers and libraries that consist only of inverters.
In practice, we distinguish between the repeaters that can be removed from the
design and the repeaters that can be inserted. While we limit the number of buffers
and inverters that are used for construction, we want to be able to remove as much
of the existing repeater trees as possible.
7.1.1 Repeater and Wire Analysis
During repeater tree construction, the timing rules are called millions of times to
evaluate intermediate solutions. Unfortunately, evaluating timing rules is quite slow.
For running time reasons, it is prohibitive to call the timing rules each time one
wants to calculate delays or slews. To speed up the calculations, we approximate the
timing rules of all repeaters. To this end, we sample the domain of both functions
equidistantly and use bilinear approximation between the sampling points, which
can be evaluated very quickly.
Using 64 sampling points in both dimensions of the rules (input slew and output
load) limits the error to about 2 picoseconds for the technologies in our testbed.
Similarly, we approximate the delay rules over net segments. Given an input
slew and an Elmore delay, the timing rules compute an output slew and a delay.
For some technologies, the timing rule is just a linear scaling. Other technologies,
however, use more complicated functions. As we want to work with all inputs, we
sample the timing rules for nets with 256 sampling points in both directions and
also use bilinear approximation.
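The table lookup can be sketched as follows; the toy delay rule is an assumption, chosen affine so that the bilinear approximation reproduces it exactly.

```python
# Sketch of the bilinear timing-rule approximation: the (input slew, load)
# domain is sampled on an equidistant grid and queries interpolate between
# the four surrounding samples. Grid size and the sample rule are assumed.
def build_table(rule, s_max, c_max, n=64):
    h_s, h_c = s_max / (n - 1), c_max / (n - 1)
    table = [[rule(i * h_s, j * h_c) for j in range(n)] for i in range(n)]
    return table, h_s, h_c

def lookup(table, h_s, h_c, slew, cap):
    n = len(table)
    i = min(int(slew / h_s), n - 2)   # grid cell containing the query
    j = min(int(cap / h_c), n - 2)
    u = slew / h_s - i                # fractional position inside the cell
    v = cap / h_c - j
    return ((1 - u) * (1 - v) * table[i][j] + u * (1 - v) * table[i + 1][j]
            + (1 - u) * v * table[i][j + 1] + u * v * table[i + 1][j + 1])

# a toy affine delay rule; bilinear interpolation is exact for affine inputs
rule = lambda s, c: 2.0 + 0.3 * s + 0.5 * c
table, h_s, h_c = build_table(rule, s_max=100.0, c_max=50.0)
approx = lookup(table, h_s, h_c, 13.7, 21.2)
```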
7.1.2 RAT and Slew Backwards Propagation
We use bilinear approximations for all flavours of the slewinv function. During
approximation a binary search is performed to find the highest slew that achieves a
given output slew and Elmore delay for nets or output slew and load capacitance
for repeaters.
We approximate required arrival times by linear functions. During repeater
insertion we are usually interested in the required arrival time for a given target slew
st. Thus, we approximate the tangent of the RAT function at st. To compute the
RAT function at the source of a net for a given sink and signal edge with Elmore
delay rc and sink RAT function rat, we compute the required arrival times for slews
st and st + ε for an appropriately small ε:
rat_st := rat(wireslew(rc, st)) − wiredelay(rc, st)
rat_st+ε := rat(wireslew(rc, st + ε)) − wiredelay(rc, st + ε).
The resulting required arrival time at the source is then the linear function going
through rat_st and rat_st+ε. To merge required arrival time functions from several
sinks, we do not compute the lower contour. Instead, we evaluate all required arrival
time functions at slew st and keep only the RAT function that attains the minimum.
Required arrival times are propagated backwards over repeaters in an analogous
way. Additionally, we have to take care of signal edge inversions through
inverters.
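A sketch of this propagation and merge step; the wireslew, wiredelay, and sink RAT models below are toy assumptions.

```python
# Sketch of propagating a linear RAT function over a net: evaluate the sink
# RAT at target slews st and st + eps, and return the line through both
# points. The wireslew/wiredelay models are illustrative assumptions.
def propagate_rat(sink_rat, wireslew, wiredelay, rc, st, eps=1.0):
    y0 = sink_rat(wireslew(rc, st)) - wiredelay(rc, st)
    y1 = sink_rat(wireslew(rc, st + eps)) - wiredelay(rc, st + eps)
    slope = (y1 - y0) / eps
    return lambda s: y0 + slope * (s - st)       # linear RAT at the source

def merge_rats(rat_fns, st):
    """Merging at a branch point: keep only the RAT function that attains
    the minimum at the target slew (no lower contour is computed)."""
    return min(rat_fns, key=lambda f: f(st))

# assumed toy models for wire slew, wire delay, and the sink RAT
wireslew = lambda rc, s: s + 0.2 * rc
wiredelay = lambda rc, s: rc + 0.05 * s
sink_rat = lambda s: 200.0 - 0.5 * s

src = propagate_rat(sink_rat, wireslew, wiredelay, rc=10.0, st=20.0)
```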
7.2 Blockages and Congestion Map
The blockage map contains the regions of the design that are blocked for repeater
insertion. The bounding box of the chip area is an enclosing rectangle of all free
space in the design. Everything outside is considered blocked. In addition, we also
consider as blocked
• regions that belong to the bounding box of the chip area but not to the chip
area itself,
• regions that are not free for gate placement within the design,
• regions that are blocked by the user,
• gates that are fixed in their location, and
• large gates that are usually difficult to legalize.
Given all blockages, we first block all free regions that are too small for placing
a repeater within. In a second step, an overlap-free set of rectangles covering the
blocked areas is computed. The rectangles are then stored in a quadtree that
supports fast nearest free location searches.
Using the blockages, an equidistant blockage grid is created. The blockage map and
blockage grid are then used to initialize free routing capacities between neighbouring
tiles for a congestion map. Finally, all nets are added into the congestion map.
7.3 Processing Repeater Tree Instances
The basic steps of optimizing a single instance are extracting the input data from
the netlist, building a new repeater tree, and replacing the original netlist with the
new one. We describe the steps in the following sections.
As our algorithms are either heuristics for the problem or simplify it, it is possible
that the new solution we compute is worse than the original one. To prevent
degradation of the design, we evaluate the original and new solutions and compute
metrics such as slack, length, and the number of electrical violations. If the new
solution is better, it is inserted into the design. Otherwise, it is discarded.
7.3.1 Identifying Repeater Tree Instances
The BonnRepeaterTree tool is designed to optimize all repeater tree instances of a
design. However, the runtime environment does not give us all instances and the
corresponding information directly. Instead, the tool has to find instances in the
netlist.
After identifying instances, we have to extract all data relevant to the instance.
We can assume that basic data like the repeater library, the wiring modes, the
blockage map, and potentially a congestion map are already given. It remains to
extract
• all nets and pins belonging to the instance and their placement,
• the arrival time functions at the root and the timing rules necessary to
compute them,
• required arrival time functions at the sinks, and
• if wiring already exists and it is requested, the existing wiring topology.
The following sections describe the steps mentioned in the list.
Identifying Roots and Inner Circuits
Generally, all nets of a design are part of repeater trees. Nets incident to repeaters
belong to the same repeater tree. However, some nets are not part of any repeater
tree because they are protected from optimization by the designer. There are several
sources of such hides:
• Nets are hidden if they are already optimized and the designer does not
want a tool to mess up the current result.
• Nets are hidden by other tools because they depend on the current solution.
• Nets that carry clock signals have to be buffered, but there are special
requirements, for example that the signal should arrive at all sinks at the same
time, so special clock tree tools buffer them.
• Nets with analog signals should not be buffered.
• Nets can have multiple inputs. In such cases, special care has to be taken that
no short-circuit is created.
It can easily be queried from the runtime environment whether a net is hidden.
BonnRepeaterTree works on a set of nets from a design. To identify the instances
corresponding to a set of nets, the following steps are performed:
1. Nets that are hidden are filtered out. The remaining nets, either a root net or
a net at a deeper level, are part of a repeater tree.
2. For each net’s source we identify its root pin and collect it. The source pin
of a net is a repeater tree root if it has no gate, if its gate is not a repeater,
or if it is not possible to remove the repeater and its incident nets without
violating a hide or a similar restriction. If a source pin is not a root, then it is
an output pin of a repeater that is part of the tree. We recursively continue
the search with the net connected to its input pin.
3. For each collected root, we start a forward search to include all repeaters and
sinks that belong to the instance. A sink pin is reached if it does not belong
to a gate or if its gate is not a repeater that can be removed.
After running the routine we have a set of instances. Each instance consists of a
root pin, a set of inner repeaters, a set of sink pins, and a set of connections between
them that are stored in a tree data structure.
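The three steps can be sketched over a toy netlist model; the dictionary-based netlist representation and all names are illustrative assumptions.

```python
# Sketch of instance identification on a tiny assumed netlist model:
# walk backwards to the root (step 2), then search forwards to collect
# inner repeaters and sinks (step 3).
def find_root(pin, netlist):
    """Walk backwards from a pin until it is no output of a removable
    repeater (step 2)."""
    while True:
        gate = netlist["gate_of"].get(pin)
        if gate is None or not netlist["removable_repeater"].get(gate, False):
            return pin
        pin = netlist["source_of"][netlist["input_net_of"][gate]]

def collect_instance(root, netlist):
    """Forward search from the root (step 3): gather inner repeaters and
    stop at sink pins that are not removable repeaters."""
    inner, sinks, stack = [], [], [root]
    while stack:
        pin = stack.pop()
        for sink_pin in netlist["sinks_of"].get(pin, []):
            gate = netlist["gate_of"].get(sink_pin)
            if gate and netlist["removable_repeater"].get(gate, False):
                inner.append(gate)
                stack.append(netlist["output_of"][gate])
            else:
                sinks.append(sink_pin)
    return inner, sinks

# toy netlist: AND-gate output "z" drives buffer B1 and sink "s1";
# B1 drives sink "s2"
netlist = {
    "gate_of": {"b_in": "B1", "b_out": "B1", "z": "AND"},
    "removable_repeater": {"B1": True, "AND": False},
    "sinks_of": {"z": ["b_in", "s1"], "b_out": ["s2"]},
    "output_of": {"B1": "b_out"},
    "input_net_of": {"B1": "n0"},
    "source_of": {"n0": "z"},
}
root = find_root("b_out", netlist)            # walks back through B1 to "z"
inner, sinks = collect_instance(root, netlist)
```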
Arrival Time Functions
After we have identified the root pin of an instance, we have to extract some
characteristics of the root. We are interested in
• the arrival time functions for all signals at the root and
• the output load limit that the root can drive.
It is possible that different signals are going through a repeater tree instance.
The timing engine creates so-called phases for different signal sources. Phases are
propagated separately such that there are several independent arrival times and
slews as well as required arrival times at the timing nodes. Slew limits are also
phase-specific.
The output load limit is used in ConnectRoot to check for electrical feasibility
of the solutions. For each phase of the instance, there is a pair of arrival time
functions, one for each transition. Given the load at the root pin, we
can compute the arrival time we are interested in.
There are several types of roots. For each type, we extract the arrival time
functions in a different manner. Within the BonnRepeaterTree framework, we might see
root pins that are
• output pins of circuits,
• primary inputs of the netlist,
• pins on hierarchy boundaries, or
• output pins of circuits that are fed by transparent segments.
We describe each type of roots in the following sections.
Figure 7.1: A repeater tree root z at the output pin of a standard AND-gate with two input
pins. The arrival times and slews at the root are computed using arrival times
and slews at the input pins a, b and the propagation segments p1 , p2 .
Outputs of Circuits The most prominent type of root is an output pin of a standard
circuit or a macro. There are propagation segments heading towards the output pin
within the circuit coming from its inputs or internal timing points. To estimate the
timing at the root, we have to extract the timing rules of the propagation segments
together with the arrival times and slews at their tails. The load limit of the root is
given by the timing rule of the circuit’s pin.
Figure 7.2: A repeater tree root at a primary input of the design. Arrival times and slews
are constant.
Primary Inputs Primary input pins are starting points for signals coming from
outside of the design. We distinguish two different types of designs, and primary
inputs are treated differently on each type. Some designs are macros that are later
included into larger designs, forming a hierarchy of designs. The primary inputs
and outputs of macro designs communicate with gates at higher hierarchy levels.
Incident nets can be optimized by the repeater tree routine.
Top-level designs are the second type. They communicate with components
outside of the chip image. Here, it is often not possible to optimize nets incident to
primary inputs because they need special treatment due to electrical constraints. In
both cases, information about timing coming from the outside is given in the form of
arrival time and slew assertions. Normally, the values are constant, independent
of the load at the pins. In addition, there is a load limit asserted to the pin.
Figure 7.3: A repeater tree root (r) at a hierarchy boundary. It is in the middle of a
net crossing the boundary. To compute arrival times and slews one has to
fetch them at the driving gate’s inputs and propagate over the gate and net.
Changing the load at pin r has effects on the timing at side pins i and z.
Pins on Hierarchy Boundaries It is possible to load hierarchical designs such that
the contents of all hierarchy levels are visible to optimization tools. Nets crossing
hierarchy boundaries could be optimized in a single step. However, tools are often
only allowed to work at a single level at a time. Thus, instances stop at hierarchy
boundaries and it is possible to get roots at virtual pins that mark the hierarchy
boundaries.
For roots at hierarchy boundaries, it is possible to take information about the
previous level into account. If one changes the load capacitance at such a root this
has an impact on the driver in the preceding net and the net itself. To calculate the
timing at the root one has to extract the driver’s timing as described in the previous
case and also the timing behaviour of the net between the driver and the root.
In our tool, we first extract the driver’s timing rules and then calculate a Steiner
tree for the preceding net. When we want to get the timing at the root, we first
calculate the final load at the driver and then compute the timing at the driver’s
output pin. In a second step, we recompute the Elmore delay for the root and
propagate the timing over the net.
The load capacitance limit of the boundary pin is the highest capacitance we can
connect to the pin without violating the capacitance limit of the preceding driver.
Changing a repeater tree at a hierarchy boundary also changes the timing at all
pins that are siblings of the root in the preceding net. Figure 7.3 shows the situation
with sibling sinks i and z. The slack can turn negative or electrical violations can
appear. However, we ignore the timing at sibling pins because we did not see any
problems in the past. If it turns out to be a problem in the future, the timing of
sibling pins can be considered by limiting the load capacitance limit of the root such
that the sibling’s timing remains feasible even in the worst case.
A different problem with siblings in the preceding net appears if they are hierarchy
boundary pins (pin z in Figure 7.3). Working on one of the pins can significantly
change the timing behaviour of the others. This can be a problem when we optimize
both instances in parallel: we first extract the timing constraints and then optimize,
so the changes made during the optimization of the first root are not visible when
we start to work on a subsequent root. A better solution would recognize that all
siblings belong together and optimize them in a single repeater tree instance.
However, this situation appears rarely in practice and has not caused problems so
far. Thus, we have ignored it.
Figure 7.4: A root at a macro boundary with transparent segments. The load capacitance
at r is visible over transparent segment tr to propagation segments p1 and p2 .
Arrival times and slews are calculated by propagating them first over p1 and p2
and then over tr . Timing at sibling pin z depends on the load capacitance at r
and vice-versa.
Transparent Segments There is a fourth type of root, similar to roots at hierarchy
boundaries, that one can find at the output pins of macros. Macros that were
processed as a separate hierarchy level are later finalized and treated as a single
block. The timing of the whole block is captured in macro-specific timing rules.
The behaviour at hierarchy boundaries is modelled using transparent segments.
Transparent segments mimic preceding nets in this situation.
Normally, propagation segments within gates shield the load capacitance at their
head such that it is not visible at preceding segments. In contrast, the capacitance
at the head of a transparent segment is visible at its tail and therefore influences
previous propagation segments. This is analogous to a net where the load at the sink
is visible at the source.
Transparent segments are preceded by propagation segments that correspond
to the driving circuit in the hierarchy boundary case. To fully catch the timing
behaviour, one has to extract the timing at the tails of the segments heading to a
transparent segment and the timing rules of all involved segments.
The timing rules of the macro give us the load capacitance limit of the root.
Transparent segments have similar problems as pins on hierarchy boundaries. It
is possible that there are sibling segments heading to other outgoing pins of the
macro. Here, both instances should be considered at once, too. However, we have
not seen such a problem in practice yet.
RAT Functions at Sinks
For each phase that arrives at a sink, we approximate the rise and fall RAT functions
by linear functions. This is a rough estimate compared to the effort we spend on
accurately estimating arrival times at the root. In practice, adding more effort into
getting better RAT functions improves the results only by a small margin. The
influence of the root is much higher.
We know the current slews and required arrival times at the sinks from the timing
engine. We then set the RAT functions for sinks at primary outputs, where required
arrival times are asserted, to constant functions. For pins that are gate inputs, we
set the RAT functions to the tangent around the current slews by evaluating the
outgoing propagation segments.
Phase shifts
Ideally, different phases are handled separately during repeater tree construction
because different sinks can be critical for different phases. Due to different slews
as well as slew limits, propagating only a single phase can be too pessimistic.
For running time reasons, however, we only handle a single phase during repeater tree
construction. The different phases have to be merged into a single one. This is done
by normalizing arrival times and RATs.
First, we assume that the root has a load capacitance of 0 and compute arrival
times. Using our delay model, we determine the criticality for each phase and signal
edge separately. The most critical phase and signal edge is used as reference. Then,
all other signals are shifted by a constant such that their arrival times match the
reference arrival time. The shift is also performed for the RAT functions at all sinks.
The arrival times and RATs used during repeater tree construction are the worst
ones after shifting. We also use the tightest limits over all phases. When it comes
to evaluate a solution, we propagate each phase independently again to avoid
pessimism.
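The normalization can be sketched as follows; the criticality measure (slack at zero root load) and the data layout are simplifying assumptions.

```python
# Sketch of phase normalization at zero root load: the most critical phase
# (smallest slack under the assumed delay model) becomes the reference, and
# every other phase is shifted by a constant so its arrival time matches the
# reference. The criticality measure (rat - at) is an assumption.
def normalize_phases(phases):
    """phases: {name: {"at": arrival_time, "rat": {sink: rat}}}.
    Returns the shifted phases; the reference phase keeps shift 0."""
    ref = min(phases,
              key=lambda p: min(phases[p]["rat"].values()) - phases[p]["at"])
    ref_at = phases[ref]["at"]
    out = {}
    for name, ph in phases.items():
        shift = ref_at - ph["at"]       # constant shift onto the reference
        out[name] = {"at": ph["at"] + shift,
                     "rat": {s: r + shift for s, r in ph["rat"].items()}}
    return out

phases = {"clkA": {"at": 10.0, "rat": {"s1": 40.0}},   # slack 30
          "clkB": {"at": 25.0, "rat": {"s1": 45.0}}}   # slack 20: reference
shifted = normalize_phases(phases)
```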
Unplaced Pins
Especially in early design stages, it is possible that sink pins or the root have no
proper placement. A reason might be that the corresponding gate is not yet placed
or the gate’s design is not yet finished such that the pin’s position within the gate
is unclear. If there is at least one placed pin in an instance, then the unplaced pins
are positioned at the center of gravity of the placed ones. If all pins are unplaced,
we treat them as if they all lie at the origin of the coordinate system.
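A minimal sketch of this rule:

```python
# Sketch: placed pins keep their coordinates, unplaced pins move to the
# centre of gravity of the placed ones, and if nothing is placed all pins
# fall back to the origin. The dict representation is an assumption.
def position_pins(pins):
    """pins: {name: (x, y) or None for unplaced}. Returns all positions."""
    placed = [p for p in pins.values() if p is not None]
    if placed:
        cx = sum(x for x, _ in placed) / len(placed)
        cy = sum(y for _, y in placed) / len(placed)
        fallback = (cx, cy)
    else:
        fallback = (0.0, 0.0)       # all pins unplaced: use the origin
    return {name: (p if p is not None else fallback)
            for name, p in pins.items()}

pos = position_pins({"a": (0.0, 0.0), "b": (4.0, 2.0), "c": None})
```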
Identifying Existing Wiring
As mentioned in Section 5.9, existing wiring can be used as a topology for the
buffering algorithm. Because wires are stored as a list of segments between two
coordinates in our optimization environment, we first have to reconstruct a graph
out of the segments. It can happen that the existing wiring is not connected or it
does not cover all pins. In such a case, we just discard the whole graph and use our
topology algorithm.
Slew Limits
In addition to the slew limits at input pins imposed by the timing rules, there might
be additional design-specific slew limits. First, there is a global slew limit that
should not be violated at any pin. Second, there are phase-specific slew limits that
are only valid for arrival times belonging to the corresponding phase.
Given an instance, we know which slew limits apply. We then modify the instance
such that the slew limits of the sinks and all repeaters in the library are not higher
than the instance-specific limits. In case of phase-specific slew limits, this means
that we choose the minimum over all slew limits for construction. This can be
too pessimistic. For example, we sometimes compare slews from uncritical phases
having higher slew limits with smaller slew limits from critical phases.
Lowering the slew limits at insertable repeaters can make some buffering modes
invalid if their slew target is above the limit. We just remove invalid buffering modes
before an instance is processed.
Capacitance Limits
Similar to phase-specific slew limits, we also see capacitance limits at output pins
that depend on the phases propagating to gates. They are imposed to lessen the
effects of electromigration. Electromigration is the movement of material in a
conductor caused by current. It decreases the reliability of integrated circuits over
time. One strategy used to cope with this is reducing the capacitance that gates are
allowed to drive. As the strength of the effect depends on several factors including
the frequency of the signals, the countermeasures also depend on the signals.
For a given instance, the load limits of all repeaters that can possibly be inserted
have to be lowered according to the signals going through the instance. This can
make some buffering modes invalid that depend on higher loads. They are just
removed for the instance.
7.3.2 Constructing Repeater Trees
For repeater tree construction, we always first construct a topology and then add
repeaters using our algorithm. The result can optionally be post-processed by the
dynamic programming repeater insertion that treats the result of Fast Buffering as
an input topology. The initial topology can be constructed using existing wiring or
by our topology algorithm.
Parameter ξ
The algorithms are controlled by the preprocessing that we presented in Chapter 4
and the ξ parameter. For our implementation, we have split the parameter into
three different ones: ξm for buffering mode creation, ξt for topology generation,
and ξr for the repeater insertion step. The parameters can be controlled by the
user independently. In Section 8.8, we give hints on which values work best for the parameters.
Currently, we only work with two buffering modes for each wire mode. The first
buffering mode is extracted using ξ = 0.0 and for the second one we use ξr . The
faster buffering mode is then used for topology generation and AssignEffort
chooses between both. As described earlier, lowering slew or capacitance limits can
render buffering modes invalid. This can only happen for the slower buffering mode
as it has higher limits. In such a case, we increase ξ until all limits are met. Thus,
we always have a choice between a slower and a faster buffering mode.
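The repair loop described above can be sketched as follows; `slewTargetFor` stands in for the actual mode extraction, and the step size is an illustrative assumption:

```cpp
#include <functional>
#include <cassert>

// If lowered slew limits invalidate the slower buffering mode, increase xi
// until the resulting mode meets the limit again, so that a choice between
// a slower and a faster mode always remains.
double repairXi(double xi, double slewLimit,
                const std::function<double(double)>& slewTargetFor) {
    const double step = 0.05;  // illustrative increment
    while (xi < 1.0 && slewTargetFor(xi) > slewLimit)
        xi += step;
    return xi;
}
```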
The parameter ξt is used within topology generation when we have to decide at
which edge we want to connect a sink.
Finally, ξr is used during repeater insertion when we have to decide which solution
to choose in InsertRepeater, Merge, and ConnectRoot.
7.3.3 Replacing Repeater Tree Instances
After a repeater tree has been constructed, we have to evaluate it. Our algorithm is
only a heuristic that cannot guarantee a good solution. We therefore compare our
solution to the existing repeater tree that was identified during instance collection.
For this, we have the choice to
• evaluate the result using our approximations of the timing rules or
• to use the timing engine for evaluation.
Evaluation using the approximations has the disadvantage that it is slightly
inaccurate and can miss effects influencing timing. However, it is much faster
than the timing engine and it can run on different instances at the same time (see
Section 7.4.3). If the new solution is not good enough to be kept, it is possible to
discard it without modifying the netlist. If we want to evaluate using the timing
engine, then we have to insert the result into the netlist first. Currently, the quick evaluation mode is used most of the time; the timing engine is used only when full accuracy is required.
The criteria used to evaluate a solution are ordered from the most important ones
to the least important: electrical violations, slack, power consumption, and length.
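This ordering is a lexicographic comparison and can be expressed compactly; the `Solution` fields are illustrative names:

```cpp
#include <tuple>
#include <cassert>

// Compare solutions by the criteria in order of importance:
// electrical violations (fewer is better), slack (higher is better),
// power consumption, and length (lower is better).
struct Solution {
    int violations;
    double slack;    // ps, larger is better
    double power;
    double length;
};

// True if a is strictly better than b.
bool better(const Solution& a, const Solution& b) {
    return std::make_tuple(a.violations, -a.slack, a.power, a.length)
         < std::make_tuple(b.violations, -b.slack, b.power, b.length);
}
```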
7.4 Implementation Overview
BonnRepeaterTree is a module in the BonnTools suite. It has been implemented
using the C++ programming language. The module consists of
• a repeater tree API,
• a framework that can be used to implement repeater tree construction algorithms,
• implementations of our repeater tree construction algorithms, and
• a layer translating between the framework and the IBM physical design tools.
All our algorithms have been implemented using the framework. To migrate them to a different physical design tool suite, one only has to reimplement the translation layer; the algorithms themselves remain untouched.
7.4.1 Repeater Tree Construction Framework
Our framework provides an algorithm with all information that belongs to an
instance as described in Chapter 3. In addition, the existing implementation of an
instance is available. For example, this is used to do post optimization of instances
or to compare new solutions with older ones.
To build a new tree, an algorithm only has to create a tree data structure that
consists of nodes for roots, sinks, repeaters, and the connections between them.
Evaluation and implementation into the design is done by the framework.
7.4.2 Repeater Tree API
BonnRepeaterTree is not only used as a standalone utility but also as a
subroutine for other tools, for example, BonnLogic (Werber, 2007; Werber et al.,
2007), a tool to restructure logic on the critical path.
There are also programs that need information about repeater tree instances. For
example, determining whether a pin is part of a repeater tree at all is a functionality
that is not provided by the timing engine.
We provide a small API to work on repeater trees using our algorithms consisting
of
• utility functions to determine whether a pin is in a repeater tree, whether a pin is a root, and to find, for a pin in a repeater tree, the corresponding root,
• a function returning a whole repeater tree instance, and
• a function to construct a repeater tree using one of the algorithms.
Instances expose the original repeater tree. The user has direct access to root and
sinks and can traverse the tree to get inner repeaters and nets.
For example, the RerouteChains tool uses the interface to fetch all repeater
trees and traverses the original tree to identify chains. Most tools, however, have a
set of pins and want to optimize the pins’ repeater trees. Such tools just iterate
over all pins, fetch an instance, and construct it.
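The iteration pattern described above might look as follows; since the actual API names are not public, `Pin`, `Instance`, and the three functions are hypothetical stand-ins with stub bodies so the sketch compiles:

```cpp
#include <vector>
#include <cassert>

struct Pin { bool inTree; };
struct Instance { int nSinks; };

static int constructedCount = 0;

bool isInRepeaterTree(const Pin& p) { return p.inTree; }        // stub
Instance fetchInstance(const Pin&) { return Instance{1}; }      // stub
void constructInstance(const Instance&) { ++constructedCount; } // stub

// Typical client: iterate over pins, fetch each pin's repeater tree
// instance, and construct it.
void optimizePins(const std::vector<Pin>& pins) {
    for (const Pin& pin : pins) {
        if (!isInRepeaterTree(pin))
            continue;                    // pin is not part of any repeater tree
        Instance inst = fetchInstance(pin);
        constructInstance(inst);         // rebuild the pin's repeater tree
    }
}
```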
7.4.3 Parallelization
About ten years ago the increase of CPU speed with each new generation slowed
down significantly. Instead, multi-core CPUs appeared with the number of cores
increasing with each generation. To fully utilize the power of modern CPUs, it is necessary to distribute work across the cores.
It rarely happens that repeater tree optimization is started on a single instance.
Typically, hundreds or thousands of instances should be calculated at the same time.
Because of this and the small running time of optimizing a single instance, it makes
little sense to parallelize parts of our algorithm. Instead, we choose the simpler
approach of parallelizing the computation of different instances.
The optimization environment does not allow modifying or even querying the netlist or timing engine from different threads at the same time because this would lead to race conditions. Therefore, we protect all calls to the environment with a single mutex. To reduce contention on the mutex, the framework first fetches all
information necessary to compute a repeater tree while the mutex is held. Then,
during the whole computation, the mutex is never acquired. Only after the decision
to insert a repeater tree has been made, the mutex is locked again to modify the
netlist.
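The locking discipline above follows a fetch–compute–commit pattern; the types and the "computation" are illustrative placeholders:

```cpp
#include <mutex>
#include <vector>
#include <cassert>

std::mutex envMutex;        // serializes all access to the environment
std::vector<int> netlist;   // stand-in for the shared netlist

void processInstance(int instanceData) {
    int fetched;
    {
        std::lock_guard<std::mutex> lock(envMutex);  // fetch phase: under lock
        fetched = instanceData;                      // read everything needed
    }
    int result = fetched * 2;                        // compute phase: lock-free
    {
        std::lock_guard<std::mutex> lock(envMutex);  // commit phase: under lock
        netlist.push_back(result);                   // modify the netlist
    }
}
```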
Testing Multithreading Running Times
We have tested the parallel execution on our testbed of chip designs using an Intel
Xeon machine with 4 processors having 8 cores each. Table 7.1 shows the running
times with different numbers of parallel threads. We achieve a speedup of around 4 using 8 cores, which is our default when we optimize all instances of a design. The speedup is limited by the serial parts: instance identification and instance insertion.
7.5 BonnRepeaterTree in Global Timing Optimization
Our algorithm is able to process millions of instances in reasonable time. It is
therefore suitable to be used in global timing optimization where all instances are
processed. BonnRepeaterTree offers two parameter sets by default that have proven
useful for global optimization (see Held (2008)).
Design         1       2       4       8    Factor     12      16      24      32
Baldassare    20 s    11 s     7 s     6 s   3.34×     6 s     6 s     6 s     6 s
Beate         35 s    18 s    11 s     8 s   4.09×     8 s     8 s     8 s     8 s
Gerben        40 s    21 s    12 s     9 s   4.46×     9 s     8 s     8 s    10 s
Wolfram       45 s    23 s    14 s    11 s   4.08×    10 s    10 s    10 s    11 s
Luciano      102 s    55 s    35 s    29 s   3.48×    28 s    27 s    28 s    28 s
Benedikt     250 s   130 s    78 s    63 s   3.95×    59 s    58 s    59 s    59 s
Renaud       259 s   140 s    89 s    74 s   3.52×    69 s    69 s    69 s    70 s
Julius       309 s   158 s    92 s    72 s   4.38×    69 s    66 s    69 s    68 s
Franziska    358 s   188 s   113 s    89 s   4.02×    84 s    82 s    84 s    85 s
Meinolf      449 s   235 s   138 s   108 s   4.16×   103 s   101 s   101 s   102 s
Iris         501 s   269 s   170 s   142 s   3.53×   135 s   135 s   136 s   138 s
Gautier     1420 s   765 s   473 s   383 s   3.70×   372 s   363 s   378 s   390 s

Table 7.1: Running times with different numbers of used threads. As the running time gains from going further than 8 threads are rather small, we use 8 threads by default. The speedup factor is then around 4.
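The observed saturation at around 4× is consistent with Amdahl's law; a serial fraction around 15 % already caps 8 threads near 4×. The 15 % figure below is an illustrative assumption, not a measured value:

```cpp
#include <cmath>
#include <cassert>

// Amdahl's law: with serial fraction s and n threads, the achievable
// speedup is 1 / (s + (1 - s) / n).
double amdahlSpeedup(double serialFraction, int threads) {
    return 1.0 / (serialFraction + (1.0 - serialFraction) / threads);
}
```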
Power Trees
The first parameter set aims to build all instances with minimal power consumption while avoiding electrical violations. We achieve this by relaxing all required
arrival time constraints to infinity. We assume that the root gate is replaced by the
strongest version from its BHC family. We mainly build short topologies, but to
prevent too long daisy-chains, the parameter ξt is slightly above 0.0.
Slack Trees
The second parameter set tries to improve the slack above the slack target for each instance. The ξ parameters are set around 0.8 to prevent excessive resource usage.
Uncritical instance parts are built with parameters similar to the power tree ones
due to the AssignEffort subroutine.
7.6 BonnRepeaterTree Utilities
Besides optimization, there are other tasks that are related to repeater trees. The
most common tasks are implemented as separate tools that can be called directly by
the user. The two existing tools are a rip out routine and repeater chain optimization.
7.6.1 Removing Existing Repeaters
It is desirable to remove all existing repeater trees to get into a clean initial state
for optimization. Our RipOut routine removes all inverters and buffers from a
design or from specified instances such that at most one parity-preserving inverter is left.
Figure 7.5: The netlength of this instance can nearly be halved by adding a second inverter and placing one of the two at each negative sink.
The inverter is placed within the bounding box of all negative sinks such that it is
nearest to the root.
Figure 7.5 shows how removing as many repeaters as possible increases the
netlength of the design. This has to be considered if one uses a physical design
flow where all repeater trees are built along topologies coming from a global router.
Typical global routers are not capable of inserting inverters on their own. The
topologies that they can generate are limited by the input, and the example above
shows that removing as many repeaters as possible is not a satisfactory solution because it can lead to unnecessarily high netlength.
We offer a mode in our RipOut routine that tries to circumvent the problem.
In this mode the routine will not insert the inverting repeater but modify logic by
connecting all sinks directly to the root. The original parity is stored for each sink
pin. As soon as the instance is touched again by one of our tools, the sink parities
are restored. The routine is dangerous because the whole design might get broken
if another tool changes the logic between rip out and restore. For global routers,
however, each repeater tree instance is presented as a single net.
7.6.2 Postprocessing Repeater Chains
Larger distances are often covered by chains of repeaters. A chain is a sequence of consecutive nets with fanout one. For larger distances, using higher layers with wider wires for the nets is likely better than using the default layers: it is often not possible to reach the slack targets without wider wires, and many short nets on lower layers increase placement congestion because they require many additional repeaters. On the other hand, it makes no sense to assign high fanout
nets to higher layers. High fanout nets often connect a lot of sinks locally due to
the pin capacitance. The benefits of wider wires are not high in such a situation.
Thus, we restrict ourselves to repeater chains.
We offer two routines for improving the placement and layer assignment of repeater
chains.
Postprocessing
The first routine tries to improve the layer assignment of repeater tree chains by
ripping them out and rebuilding them on higher layers. The routine iterates over
all repeater chains and then builds a new solution using each configured layer
assignment with the Fast Buffering routine. The best solution according to our criteria (electrical violations, slack, power consumption, and length) is kept. The routine either uses a shortest path computation in the blockage grid or the path search of the congestion map if available. In the latter case, assignments are only performed if they do not violate congestion targets on the probed layers.

Figure 7.6: All repeater chains longer than 4 mm on a large design. Distributing the chains more evenly can give us less congestion while preserving the timing.
Congestion-aware Rerouting
The second routine, RerouteChains, does not try to improve the layer assignment
of repeater tree chains. Instead, it tries to improve congestion by moving the
repeaters into less congested areas. Figure 7.6 shows all repeater chains longer than
4 mm on a design that has a lot of congestion. By distributing the chains evenly and
by avoiding congested areas, it is possible to reduce overall congestion. We have a
simple ripup-and-reroute heuristic that reroutes the chains.
This routine only works in the presence of a congestion map. In several iterations, all chains are collected that use an edge in the map above a certain congestion level.
Then, for each chain, the following steps are performed:
1. Collect the nets of the chain and the costs of their routing in the congestion
map.
2. Search for a new path from the start point to the end point of the chain in
the congestion map.
3. Distribute the internal repeaters of the chain along the new path such that
relative distances are preserved. The repeaters are then placed legally at the
free position next to their target position.
4. New routes for all incident nets are computed.
104
7.6 BonnRepeaterTree Utilities
5. If the costs of the new nets are smaller than the old costs, then the new
solution is kept. Otherwise, it is reverted.
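Step 2 above is a shortest-path search on the congestion map, restricted to a grid window (the bounding box of the chain plus some detour tiles). A minimal sketch with Dijkstra's algorithm on a tile grid; the grid and cost model are illustrative, not the actual congestion map interface:

```cpp
#include <queue>
#include <vector>
#include <limits>
#include <utility>
#include <functional>
#include <cassert>

// Cheapest path cost from tile (sr, sc) to (tr, tc), where the cost of a
// path is the sum of congestion costs of all tiles it enters (including
// the start tile).
double cheapestPathCost(const std::vector<std::vector<double>>& tileCost,
                        int sr, int sc, int tr, int tc) {
    const int R = tileCost.size(), C = tileCost[0].size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> dist(R, std::vector<double>(C, inf));
    using Node = std::pair<double, std::pair<int, int>>;
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> pq;
    dist[sr][sc] = tileCost[sr][sc];
    pq.push({dist[sr][sc], {sr, sc}});
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};
    while (!pq.empty()) {
        auto [d, rc] = pq.top(); pq.pop();
        auto [r, c] = rc;
        if (d > dist[r][c]) continue;            // stale queue entry
        for (int k = 0; k < 4; ++k) {
            int nr = r + dr[k], nc = c + dc[k];
            if (nr < 0 || nr >= R || nc < 0 || nc >= C) continue;
            double nd = d + tileCost[nr][nc];
            if (nd < dist[nr][nc]) { dist[nr][nc] = nd; pq.push({nd, {nr, nc}}); }
        }
    }
    return dist[tr][tc];
}
```

In step 5, the resulting path cost would be compared against the cost of the chain's current routing.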
The search for the new path does not take existing layer assignments into account because the congestion map does not support this. The assignments are only used
when new routes for the nets are recomputed. Similarly, the path search ignores
timing. We restrict the congestion map to the bounding box between the start and
end of the path plus some tiles for detour. Thus, we expect that the routes do not
get too long and timing is not degraded too much. Instead, the number of nets with
long detours generated by the global router should become smaller.
The routine also does not consider placement congestion directly apart from
searching for free positions. In practice, placing gates densely such that no gaps occur
results in unroutable designs. Thus, if we have a design that is considered routable
by the global router, we expect that placing the circuits legally in the corresponding
global routing tiles should be feasible without much placement distortion. This
cannot be guaranteed, but the experiments did not show problems with placement legalization. If we skip placing the repeaters legally after movement, then the results are similar to runs where the circuits are legalized.
A potentially better solution to the problem has been proposed by Janzen (2012).
He integrates the placement of repeater chains into the global router. Whole chains
are considered as a single net by the global router, and, during path search, timing
as well as placement area are considered. Compared to the method presented here,
the running time of the global router increased significantly due to the additional
work during path search. This approach is not suitable for application in practice
so far.
Experimental Results
We test the RerouteChains routine on our testbed of chip designs (see Chapter 8).
The test cases are output of a standard BonnTools optimization flow (see Section 7.5).
We compare the effects of our routine on routability and timing.
Table 7.2 and Table 7.3 summarize the experiments. BonnRouteGlobal1 is used
to measure routability. We have one run of the global router before and one run
after RerouteChains. The global router and the congestion map have nearly the
same number of global routing tiles in both dimensions.
Table 7.2 first shows the number of repeater chains considered on the test instances.
Our tool runs in five iterations over all chains and tries to reroute a chain if it uses an
edge in the congestion map with more than 85 % utilization. If a cheaper solution is
found, then it is kept. The second column shows the number of improvements found
over all five iterations. The following columns show the running times of both global
router runs (1st GR, 2nd GR), congestion map creation (CM), and RerouteChains
(RC). The congestion map proves to be very fast. The RerouteChains running
times also contain the legalization of moved repeaters. Overall, the running times are acceptable.

1 see (Gester et al., 2011; Müller et al., 2011; Müller, 2009)

Design       Chains   Reroutes   1st GR      CM       RC   2nd GR
Baldassare     7613        381      9 s     2 s      3 s     10 s
Beate          5043        251     14 s     2 s      5 s     14 s
Wolfram        3842       1473     18 s     2 s      6 s     19 s
Gerben         3493         10     14 s     1 s      5 s     15 s
Luciano       27922       4200     46 s     5 s     20 s     50 s
Benedikt      37893       3209    106 s    14 s     38 s    111 s
Renaud        14908      12379    295 s    13 s     33 s    289 s
Julius        41677       7315    178 s    24 s    185 s    150 s
Franziska     56697       3108    151 s    22 s     65 s    160 s
Meinolf      128023      68221    165 s    23 s    103 s    170 s
Iris          76153     141865    750 s    30 s    193 s    797 s
Gautier      165491     214135   6454 s   205 s   1344 s   5247 s

Table 7.2: Testbed for the RerouteChains tool. The running times are given for global routing (GR), congestion map creation (CM), and RerouteChains (RC).
Table 7.3 first shows the total overflow over all global routing edges as reported by
BonnRouteGlobal. Then, the number of nets whose length is more than twice the length of a Steiner minimum tree is reported, together with the highest relative detour. Finally,
timing quality is measured with the sum of negative slacks (SNS). Most instances are
uncritical and the tool has little effect. On Benedikt and Franziska the routability
decreases slightly due to differences in the free edge capacities seen between the global
router and the congestion map. On designs Gautier and Renaud, the congestion was reduced significantly, resulting in fewer detours. Due to the reduction of detours, the sum of negative slacks also improved on these instances. In summary, there are
instances where running the tool improves routability, while on uncritical instances
no harm is done.
Figure 7.7 and Figure 7.8 show congestion maps as reported by BonnRouteGlobal
for designs Julius and Gautier. The colors show the maximum relative edge utilization
over all layers. Especially on Gautier, we see a huge reduction of overflow, by a factor of more than 100.
Design       Step      Overflow   Nets >100 %   Max detour          SNS
Baldassare   before           0             0          8 %    −140049 ps
             after            0             0          8 %    −140039 ps
Beate        before           0             0         78 %    −151997 ps
             after            0             0         78 %    −151990 ps
Wolfram      before           0             0         25 %     −75187 ps
             after            0             0         25 %     −75210 ps
Gerben       before           0             0         30 %    −109589 ps
             after            0             0         30 %    −109626 ps
Luciano      before           0            52        524 %    −464277 ps
             after            0            24        342 %    −481492 ps
Benedikt     before        1446             0         72 %    −207397 ps
             after         2906             0        103 %    −210871 ps
Renaud       before     4140245           913        747 %   −6557941 ps
             after      3444979           825        513 %   −5822497 ps
Julius       before           0             0         99 %  −18787024 ps
             after            0             0         89 %  −18766033 ps
Franziska    before        1307             0         78 %   −1694730 ps
             after        10982             0         78 %   −1694434 ps
Meinolf      before           0             0         98 %   −4622849 ps
             after            0             0         89 %   −4715832 ps
Iris         before     2843214          4057       1662 %  −19704061 ps
             after      1891486          3032       1662 %  −15602736 ps
Gautier      before    12682825          4609        306 %   −9647137 ps
             after       123511           111        131 %   −7058513 ps

Table 7.3: Difference in routability and timing before and after RerouteChains.
Figure 7.7: Congestion on design Julius before (left) and after (right) RerouteChains.
Figure 7.8: Congestion on design Gautier before (left) and after (right) RerouteChains.
8 Experimental Results
As part of the IBM optimization tool suite, the BonnRepeaterTree tool is in daily
use to optimize the physical design of current chip designs. It has proven useful for current ASIC (application-specific integrated circuit) designs as well as for bleeding-edge processor units. Designers choose it as part of the BonnTools optimization tools to achieve timing closure.
In this chapter, we present performance metrics of our tool. We compare it to another tool used by IBM for repeater tree optimization. We also compare against lower bounds for slack, netlength, and repeater counts. We then show how parallel walk and effort assignment affect the results and give hints on the proper selection of the parameters ξ, η, and dnode.
Design       Technology   Instances   Max. Sinks
Baldassare        22 nm       20552          152
Beate             22 nm       34382          166
Wolfram           32 nm       44413          274
Gerben            32 nm       44677          194
Luciano           22 nm      101625          263
Benedikt          22 nm      241465        12061
Renaud            45 nm      268775          389
Julius            22 nm      284934         1589
Franziska         22 nm      328600          740
Meinolf           22 nm      364969         1539
Iris              22 nm      393457         3264
Gautier           45 nm     1275731         6859

Table 8.1: Designs used for experimental results. For each design, all repeater tree instances were built.
We have chosen twelve current chip designs from our industrial partner IBM for
our experiments. For each design, we build all repeater tree instances regardless of
difficulty. In total, we have more than 3.3 million instances of varying sizes with up
to 12061 sinks. Table 8.1 shows the designs we used, their codename, technology,
the number of instances, and the number of sinks in the biggest instance.
The distribution of instance sizes is very uneven, as shown by Table 8.2, which subdivides the instances by technology. The designs are dominated by single-sink instances. Instances with up to four sinks already make up more than 90 % of all instances. It is important to produce good repeater trees for
small instances to get overall acceptable results. However, it is also important to optimize instances with more sinks well because, in practice, they are often among the timing-critical instances.

Sinks          45 nm     32 nm      22 nm      Total
1             958307     58920    1307808    2325035    68.39 %
2             191709     11422     179836     382967    11.26 %
3             144054      9077      96330     249461     7.34 %
4             115818      2219      42237     160274     4.71 %
5              27687      2384      24882      54953     1.62 %
6              25420       968      20435      46823     1.38 %
7              16523       652      19334      36509     1.07 %
8              11518       521      18883      30922     0.91 %
9–20           27348      2336      44070      73754     2.17 %
21–50          14842       759      12434      28035     0.82 %
51–100          1667       239       5192       7098     0.21 %
101–250          775        46       2311       3132     0.09 %
251–500          233         1        269        503     0.01 %
501–1000          75         0         30        105     0.00 %
> 1000           110         0         28        138     0.00 %
Total        1536086     89544    1774079    3399709   100.00 %

Table 8.2: Test instances grouped by number of sinks and technology.
Table 8.3 shows the size of the repeater families used as libraries. The designs from
32 nm and 22 nm technologies do not have any buffers. Instead, two consecutive
inverters are used if necessary.
We have performed all of our experiments on a machine with two Intel Xeon X5690
processors with 6 cores each. The machine runs with a base clock speed of 3.46 GHz.
The Intel® Turbo Boost Technology is enabled with a maximum clock speed of
3.73 GHz. Hyperthreading was disabled. All experiments were run single-threaded,
but up to 12 experiments were run in parallel if not noted otherwise. The running
times reported might be higher than necessary because the machine was fully loaded
with experiments. However, in practice, designers often have to share computing
resources with others, too. The machine has 192 GiB of main memory.
The code was compiled with GCC version 4.1.2 under Red Hat Enterprise Linux
Server 5.6 at optimization level O2.
8.1 Comparison to an Industrial Tool
As a first experiment, we compare our algorithm to a repeater tree construction tool
that is used by IBM for most repeater trees. It uses a van Ginneken-style approach
with the running time improvements by Li et al. (2012). It is the default tool in
the IBM optimization suite due to its good results and tight integration with the placement tool of the suite.

Technology   Inverters   Buffers
  45 nm             18        18
  32 nm             20         0
  22 nm             22         0

Table 8.3: Repeater library sizes.
The integration makes it hard to compare both tools on all instances, because it
is not possible to run the industrial tool on a single repeater tree instance without
huge overhead. We have chosen a random sample of 19 instances from the Franziska
design with different characteristics. Instances have different numbers of sinks and
diameters. While running both tools on all instances took only seconds, the whole test took one hour due to this overhead.
Table 8.4 shows the results of running the industrial tool, our BonnRepeaterTree
Fast Buffering routine, and our BonnRepeaterTree routine with dynamic programming using 40 power buckets. All tools are configured to maximize slack. We have,
however, reduced ξ to 0.8 for our repeater insertion. The reason is that higher
values would not improve the slack anymore but cause higher area consumption.
In practice, we also seldom use higher ξ values. Both tools are configured to
obey blockages and they are only allowed to use the default wiring modes. The
instances are already optimized to give all tools reasonable arrival times at the root
and required arrival times at the sinks. In practice, the IBM tool would prune the
size of the repeater library to improve running time. We configured the tool to use
the whole library because this leads to significantly better slacks. Our tool does not
prune the library so far.
The overall result is that our dynamic program produces the best slack followed
by Fast Buffering even though we reduced the ξ parameter. Better slack, on the
other hand, costs more area. In general, we see that the industrial tool uses less area.
While the Fast Buffering algorithm finds solutions without any electrical violation,
both other tools create violations.
The experiment was performed on an otherwise empty machine to make the
running times comparable. The running time of our Fast Buffering version is
1.26 seconds. The IBM tool uses 1.78 seconds. The dynamic program needs
113.8 seconds with 40 buckets and 3.2 seconds without buckets. The results of the
version without buckets using the same setup are shown in Table 6.2. All running
times include identifying and replacing instances and not only the core algorithm.
8.2 Comparison to Bounds
It is hard to show the quality of our algorithm for the Repeater Tree Problem
because optimal solutions are not known. Despite that, for some aspects of the
solutions like netlength, number of inserted repeaters, and slack, we can compare the results to their respective bounds.

                    Industrial Tool            BonnRepeaterTrees        BonnRepeaterTrees + DP
     Sinks      SNS  Slack  Area   Vio      SNS  Slack  Area  Vio      SNS  Slack  Area  Vio
I01      1     −400   −400   106   0/0     −379   −379   160  0/0     −377   −377   160  0/0
I02      1      −91    −91    12   0/0      −91    −91    10  0/0      −91    −91    14  0/0
I03      2     −475   −243   112   0/0     −417   −213   144  0/0     −413   −206   150  0/0
I04      2     −156    −82     6   0/0     −153    −85     8  0/0     −161    −81     6  0/0
I05      3     −157    −66     3   0/0     −140    −63    11  0/0     −143    −61    11  0/0
I06      3     −167    −71    25   0/0     −183    −71    43  0/0     −162    −68    35  0/0
I07      4      −25    −16    13   0/0      −26    −14    20  0/0      −22    −11    25  0/0
I08      4     −141    −40     6   0/0     −136    −40     8  0/0     −133    −36     8  0/0
I09      5     −368   −118    23   0/0     −349   −116    38  0/0     −351   −113    40  0/0
I10      8     −190    −30    14   0/0     −127    −20    28  0/0     −117    −20    34  0/0
I11     10     −368    −74    25   0/0     −428    −68    20  0/0     −387    −68    22  0/0
I12     15     −761    −72    68   0/0     −690    −62   107  0/0     −698    −59   116  0/2
I13     24      −74    −16    31   0/0      −79    −16    23  0/0      −60     −8    38  0/0
I14     33    −3799   −238   123   0/1    −3327   −197   228  0/0    −3167   −189   144  0/0
I15     47    −2998   −140    69   0/0    −2833   −108   113  0/0    −2922   −109   101  0/0
I16     65    −3354    −99    40   0/0    −2784    −96   127  0/0    −3852    −96    48  0/0
I17     73    −3208   −102    86   0/0    −4358    −86   165  0/0    −3424    −78   132  0/0
I18    120   −10988   −147   162   0/0    −9431   −107   256  0/0   −10701   −107   333  0/0
I19    322     −273    −29   145  0/25       −6     −3   342  0/0        0     25   851  0/0

Table 8.4: Results of optimizing for slack on several instances from a 22 nm design for the industrial tool, our Fast Buffering routine, and our dynamic program implementation that considers power consumption. All times are given in ps. SNS is the sum of negative slacks over all sinks; Slack is the worst slack of the instance; Area is the space consumed by the internal repeaters, measured in placement grid steps; Vio gives the number of load and slew violations in the result.
                                    Length                 Slack Dev.
  ξ       Power   Inversions    Avg.      Max.          Avg.        Max.      Wall Time
 0.0    739.174      1504726    2.0 %    837.7 %    12.07 ps  1419.46 ps       2301 s
 0.1    797.945      1587022    1.8 %    833.1 %     9.23 ps  1037.46 ps       2347 s
 0.2    882.063      1746005    1.8 %    832.0 %     7.48 ps   781.36 ps       2302 s
 0.3    967.900      1920838    1.8 %    841.3 %     6.15 ps   666.23 ps       2291 s
 0.4   1069.397      2076479    1.8 %    852.5 %     5.16 ps   534.88 ps       2280 s
 0.5   1177.334      2265218    1.8 %    874.0 %     4.36 ps   376.63 ps       2257 s
 0.6   1274.543      2547997    1.8 %    896.3 %     3.52 ps   352.00 ps       2213 s
 0.7   1479.679      2914892    1.9 %    877.7 %     2.79 ps   250.13 ps       2194 s
 0.8   1689.111      3412856    2.3 %    904.3 %     2.16 ps   212.43 ps       2156 s
 0.9   2027.596      4163780    3.6 %    926.3 %     1.60 ps   174.89 ps       2116 s
 1.0   2970.202      6254108   14.1 %   3321.0 %     1.14 ps   178.36 ps       2119 s
Table 8.5: Results of our repeater tree algorithm for different ξ values.
We ran our algorithm on all instances for ξ values ranging from 0.0 to 1.0. The
results are shown in Table 8.5. In general, we see that power consumption and netlength increase with higher ξ values, while the slack deviation (see below) improves and even the running time decreases.
Table 8.5 is a summary of all runs over all technologies and all numbers of sinks.
We have added detailed tables in Appendix A where the data is separated by
technology and number of sinks.
8.2.1 Running Time
We did not focus on running time during testing. The results show that the
algorithm runs very fast. The average running time for a single instance is about 0.6
milliseconds, which means that we can solve about 5.7 million instances per hour.
The wall time reported in Table 8.5 only contains the time used to build topologies
and to insert repeaters. Overhead like identifying instances and adding the result
into the design is not reported. Depending on the design, the overhead is between
100 % and 150 % of the running time used to solve the Repeater Tree Problem.
The running time decreases slightly with higher ξ values. The more we try to optimize slack, the earlier repeaters are inserted to shield uncritical side paths from the critical ones. In addition, more repeaters are added along paths for timing reasons. As a result, the algorithm works with nets that have a smaller number of sinks. Due to the smaller instances, the subroutine that computes Steiner trees runs significantly faster.
8.2.2 Wirelength
A lower bound on the total wire length of a repeater tree instance is the length of
a Steiner minimum tree spanning the root and all sinks. We computed one for all
instances with fewer than 36 sinks. For bigger instances, we used the minimum of a Steiner tree heuristic guaranteeing a 3/2-approximation and the result of our routine over all test runs.
Table 8.5 shows by how many percent we exceed the optimal repeater tree
length. The optimal length can be 0 if the root is a primary input pin and a single
sink is directly below it. Let I be the set of instances with non-zero optimal length.
Given for each i ∈ I the length of our tree, length(i), and the optimal length, opt(i),
Table 8.5 shows the value of
\[ \frac{1}{|I|} \sum_{i \in I} \frac{\mathrm{length}(i) - \mathrm{opt}(i)}{\mathrm{opt}(i)}. \]
The average length increase compared to an optimal Steiner tree is quite low.
However, there are some instances with huge wirelength increases. The detailed
tables in Appendix A show that this only happens on instances that use the clustering
preprocessing or with high ξ values where we accept detours to keep bifurcations
from the critical path.
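As a small illustration of the deviation measure above, the averaging over I can be sketched as follows (a minimal sketch; the function name and list-based input are our own illustration, not part of the implementation described here):

```python
def avg_length_deviation(lengths, opts):
    """Average relative deviation (length(i) - opt(i)) / opt(i) over all
    instances i with non-zero optimal length."""
    ratios = [(l - o) / o for l, o in zip(lengths, opts) if o > 0]
    return sum(ratios) / len(ratios)

# Two instances: one 10 % above the optimum, one exactly optimal.
print(avg_length_deviation([110, 100], [100, 100]))
```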
Due to parallel walk and instances with sinks of different parities, a deviation of
almost 100 % can be optimal. Consider an instance with one negative sink close
to the root and two other sinks, one negative and one positive, at some distance.
One inverter is necessary to realize the required inversions, but we have the choice
between bridging the distance twice and adding an additional inverter at the negative
sink away from the root.
8.2.3 Number of Inserted Inverters
To obtain a lower bound on the number of inverters needed to legally buffer a
repeater tree instance, let Cap_extra arise from the sum of the wire capacitance of a
minimum Steiner tree and the input capacitances of all sinks by subtracting the
maximum capacitance that can be driven by the root with the given input slew
such that the output slew is at most optslew. Every inserted repeater of type t can
drive a certain amount loadlim(t) of this capacitance but also contributes its own
input capacitance cap_in(t). Let MaxCap(t) be the biggest load the repeater
can drive with an input slew of optslew such that the output slew is smaller than or
equal to optslew.
We may assume MaxCap(t) > cap_in(t). Therefore, if there is a legal inverter tree
using x_t inverters of type t, then
\[ Cap_{extra} + \sum_{t \in L} cap_{in}(t)\, x_t \le \sum_{t \in L} MaxCap(t)\, x_t \tag{8.1} \]
has to be satisfied.
(Part of this section is from Bartoschek et al. (2007b).)
Depending on whether we are interested in the number of inserted inverters, in
their total area, or in power consumption, we can assign a cost ct ≥ 0 to each
inverter type t ∈ L. We ask how well our algorithm minimizes this cost.
To obtain a lower bound on the cost that any inverter tree must have, we consider
the problem of minimizing the total cost
\[ \sum_{t \in L} c_t x_t \]
subject to (8.1) and x_t ≥ 0 for all t ∈ L. This is a very simple linear program (LP).
The dual LP is
\[ \text{maximize} \quad Cap_{extra}\, y \qquad \text{subject to} \quad (MaxCap(t) - cap_{in}(t))\, y \le c_t \text{ for all } t \in L, \quad y \ge 0. \]
If Cap_extra ≤ 0, then y* = 0 is the optimum solution of this LP. If Cap_extra > 0,
then
\[ y^* = \min_{t \in L} \frac{c_t}{MaxCap(t) - cap_{in}(t)} \]
is optimum. By the LP duality theorem, the optimum value of the original (primal)
LP is
\[ \sum_{t \in L} c_t x_t^* = \min_{t \in L} \frac{c_t \cdot Cap_{extra}}{MaxCap(t) - cap_{in}(t)}. \]
Of course, if we consider the number of inverters (i.e. ct = 1 for all t), we can round
up this lower bound to the next integer.
Three further modifications are possible to improve this bound in some cases:
First, if the lower bound is 0 but there is a sink of negative parity, we clearly need
at least one inverter. Moreover, if our lower bound is 1 but all sinks have positive
parity, we need at least two inverters. Finally, if there is only one sink, we can round
up the lower bound to the next even or odd integer, depending on the sink’s parity,
+ or −.
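The closed-form bound and the three parity-based corrections can be sketched as follows; this is a sketch under the assumptions above (the dict-based inputs and function names are our own), not the implementation used in the experiments:

```python
import math

def cost_lower_bound(cap_extra, max_cap, cap_in, cost):
    """Optimum value of the primal LP: min over types t of
    cost[t] * cap_extra / (max_cap[t] - cap_in[t]), or 0 if cap_extra <= 0."""
    if cap_extra <= 0:
        return 0.0
    return min(cost[t] * cap_extra / (max_cap[t] - cap_in[t]) for t in cost)

def inverter_count_bound(cap_extra, max_cap, cap_in, neg_sinks, pos_sinks):
    """Lower bound on the number of inverters (cost 1 per inverter),
    rounded up and corrected for sink parities."""
    ones = {t: 1.0 for t in max_cap}
    lb = math.ceil(cost_lower_bound(cap_extra, max_cap, cap_in, ones))
    if lb == 0 and neg_sinks > 0:
        lb = 1                       # at least one inversion is needed
    if lb == 1 and neg_sinks == 0:
        lb = 2                       # all-positive parity needs an even count
    if neg_sinks + pos_sinks == 1:   # single sink: match the sink's parity
        want_odd = neg_sinks == 1
        if (lb % 2 == 1) != want_odd:
            lb += 1
    return lb
```

For example, with a single inverter type with MaxCap = 6 and cap_in = 1 and Cap_extra = 10, the LP gives 10/5 = 2; a single positive sink keeps the bound at 2, while a single negative sink raises it to the next odd number, 3.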
The resulting minimum number of inverters that have to be inserted over all of
our instances is 1150712. Table 8.5 shows the number of inversions we have added
for different ξ values. Each buffer added on the 45 nm instances is counted as two
inversions. With ξ = 0, we are only 31 % above the bound that does not take slew
propagation over wire segments into account. Getting better timing results increases
the number of used inversions significantly such that for ξ = 1.0 we are 540 % over
the bound.
8.2.4 Timing
It is hard to obtain an upper bound on the slack that one can achieve for a single
repeater tree instance. We have chosen the following approach.
[Figure 8.1: Test setup used for computing slack upper bounds.]
For instances with a single sink, we build repeater trees with the highest effort
and different parameter sets. We use our algorithm and also the dynamic program
algorithm for post optimization. We choose the maximum slack we get over all
different runs on the same instance as an upper bound. Admittedly, it is not proven
that this value is indeed an upper bound. However, we are confident that the real
upper bound is not far away. Any provable upper bound we can conceive is too far
from the actually achievable results because it would have to be based on overly
optimistic assumptions.
For instances with two or more sinks, we construct a test instance for each
sink. Each test instance consists of the original sink and an additional sink that
corresponds to the input pin of the repeater with the smallest input capacitance.
The additional sink is located at the root pin and has infinite required arrival time.
The setup models the smallest impact that shielding off all other sinks can have on
a critical sink. Figure 8.1 shows the setup.
Similar to instances with a single sink, we then build the best possible repeater
tree we can achieve for each test instance. The maximum slack achievable over all
runs is used as the upper bound for the corresponding sink.
The minimum slack bound over all sinks is then used as the upper bound for the
achievable slack of the repeater tree instance. As in the single-sink case, it is not
guaranteed that we get a true upper bound. On the other hand, it is usually not
possible to achieve the best possible slack at all sinks simultaneously. The more
sinks with similar criticality an instance has, the worse the achievable slack is
compared to the upper bound.
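The aggregation just described boils down to a maximum over runs per sink followed by a minimum over sinks; a minimal sketch (the data layout and function name are assumptions for illustration):

```python
def instance_slack_bound(per_sink_runs):
    """per_sink_runs: for each sink, the list of slacks achieved over all
    test runs on its two-sink test instance.  The best run per sink gives
    that sink's upper bound; the most critical sink limits the whole tree."""
    return min(max(runs) for runs in per_sink_runs)

# Sink 1 reaches at best 2.0 ps, sink 2 at best 3.0 ps -> bound 2.0 ps.
print(instance_slack_bound([[1.0, 2.0], [0.5, 3.0]]))
```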
Table 8.5 shows the effect of ξ on the resulting slack deviation from the upper
bound. While going to the extreme might be undesirable due to the high increase
in netlength and power consumption, using a ξ of up to 0.9 can be justified by the
slack improvements.
8.3 Fast Buffering vs. Dynamic Programming
We use our version of the dynamic program algorithm as a post optimization step
for the most critical instances. Table 8.6 shows how the Fast Buffering algorithm
compares to the dynamic program. The Fast Buffering algorithm is run with ξ = 1.0.
The table summarizes the results on our testbed for all technologies but separated
by number of sinks.
Slack deviation and length deviation are computed in the same way as in Section 8.2. The power columns sum up the power consumption and the number of
repeaters inserted. Buffers are not counted twice here. The running times are given
for topology generation and buffering separately.

             Slack Deviation [ps]            Length Deviation [%]
             No DP            DP             No DP            DP
# Sinks      Avg.     Max.    Avg.    Max.   Avg.    Max.     Avg.    Max.
1             0.32    34.23    0.16   43.72    0.0     0.0      0.0     0.0
2             1.55    31.78    1.08   31.27    0.0     0.0      0.0     0.0
3             2.50    37.42    1.69   36.92   14.2    99.3     14.2    99.3
4             2.98    35.41    2.14   32.18   22.8   183.5     22.8   183.5
5             3.71    41.38    2.55   36.78   23.4   176.7     23.3   176.7
6             4.14    57.69    2.73   57.49   12.6   192.0     12.6   192.0
7             4.98    92.07    3.34   84.38   32.8   314.4     32.7   296.6
8             7.42   101.25    4.82   83.78   31.8   255.8     31.8   255.8
9–20          7.74    73.60    5.25   61.00   30.7   585.8     30.6   585.8
21–50        13.92    92.16    9.32   71.64   51.1  1352.1     50.8  1337.8
51–100       20.55   109.36   13.62   82.43   75.4  1755.5     74.6  1747.7
101–250      21.22    86.98   14.07   80.86   75.7  2110.0     74.8  2042.0
251–500      25.57   148.72   18.49  109.02   82.9   926.1     81.5   915.7
501–1000     51.36   178.36   35.06  132.57  221.6  1643.9    219.6  1637.6
> 1000       56.81   163.90   30.47   99.67  387.0  1066.1    376.9  1065.5
Total         1.25   178.36    0.81  132.57   16.4  2110.0     16.3  2042.0
Total >2      4.87   178.36    3.30  132.57   33.4  2110.0     33.2  2042.0

             Power                                    Running Time [s]
             No DP               DP                   No DP            DP
# Sinks      Pwr.      Rpt.      Pwr.       Rpt.      Top.    Buf.     Top.    Buf.
1            500.412   1318262   545.606    1472938   48.28   420.46   52.36   9870.21
2            136.984    507048   148.558     625075   11.12   115.85   11.91   2805.00
3            128.043    462522   145.060     580626    8.99   102.82    9.66   2393.74
4            100.239    303944   124.410     410043    6.96    83.89    7.36   1864.14
5             53.277    190835    62.546     266861    4.04    51.98    4.26   1031.52
6             71.950    189147    82.125     268407    3.56    48.01    3.79   1328.70
7             95.207    194208   106.697     284789    3.56    51.15    3.75   1623.05
8            152.175    268542   167.032     369478    4.17    59.68    4.34   2496.50
9–20         118.038    739550   131.693    1055801   16.74   225.67   17.24   3419.13
21–50        143.395    560739   156.515     776370   18.61   192.43   19.04   3187.63
51–100        36.407    363539    36.224     518275   18.43   133.07   19.00   1542.92
101–250       25.143    326046    24.841     474345   27.78   117.25   28.51   1256.83
251–500       10.403     78538    11.826     116911   20.37    36.41   20.79    394.31
501–1000      14.425     28869    15.619      44360   18.83    13.99   19.35    285.59
> 1000        18.945     50864    98.136     273687    5.22     7.55    5.41   1315.06
Total       1605.045   5582653  1856.887    7537966  216.65  1660.21  226.75  34814.33
Total >2     967.648   3757343  1162.723    5439953  157.26  1123.89  162.49  22139.11

Table 8.6: Comparison between Fast Buffering and our dynamic program.
As expected, the average slack deviation of the dynamic program is smaller. It
should be used to get the last tenth of a picosecond out of the most critical instances.
In general, its running time is more than 10× that of Fast Buffering. In addition,
the power consumption increases significantly. However, the additional power might
be necessary to get better slacks. Netlengths are very similar for both algorithms.
8.4 Varying η
η      Power      Repeaters   Length Dev. [%]    Slack Dev. [ps]
                              Avg.    Max.       Avg.    Max.
0.00   1835.132   7415542     16.0    2042.0     0.81    132.57
0.05   1611.987   5700476     15.0    1301.6     1.24    148.93
0.10   1598.349   5644917     15.4    1491.2     1.23    224.05
0.15   1596.091   5596761     15.7    1688.2     1.24    201.43
0.20   1595.128   5555896     16.0    1759.2     1.24    216.41
0.25   1592.330   5519986     16.1    2110.0     1.25    178.36
0.30   1593.975   5505832     16.4    2367.6     1.26    221.72
0.35   1592.487   5499882     16.4    2481.0     1.26    195.71
0.40   1591.764   5497906     16.4    2534.0     1.26    204.04
0.45   1590.603   5492262     16.4    2246.3     1.26    203.28
0.50   1589.791   5487923     16.3    2117.9     1.26    196.78

Table 8.7: Results of our repeater tree algorithm for different η values.
The next set of experiments compares the results of our algorithm optimizing for
slack (ξ = 1.0) but with different η parameters. Table 8.7 shows the resulting power
consumption, number of repeaters, length deviation and slack deviation.
In general, higher values of η consume less power but have higher netlength and
higher slack deviation. Our preferred value of 0.25 seems to be a reasonable choice.
Smaller values like 0.15 and 0.20 are also good candidates.
A choice of η = 0.0 is a special case. It allows assigning the whole dnode to a single
branch. Thus, all sinks whose criticality is within dnode of the most critical
sink can be connected to the critical sink without degrading the required arrival
time. This leads to the degenerate case that it is favourable to add bifurcations to
the most critical sink. During buffering, a lot of repeaters are added to reduce the
impact of the additional bifurcations. This explains the high power consumption of
this run. Somewhat surprisingly, the average slack deviation is best with η = 0.0.
8.5 Varying dnode
dnode factor   Power     Repeaters   Length Dev. [%]    Slack Dev. [ps]
                                     Avg.    Max.       Avg.    Max.
0.0            147.096   2412342     11.7     225.7     6.41     97.72
0.5            136.963   2296316     25.5    1181.6     4.71     78.19
1.0            135.092   2264650     32.9    2110.0     4.52     84.31
1.5            138.993   2321192     40.8    2154.6     4.52    115.26
2.0            144.373   2406360     48.9    2486.2     4.63    112.52
2.5            149.157   2489224     56.0    2560.3     4.73    127.15

Table 8.8: Results of our repeater tree algorithm for different dnode scaling factor values.
We only consider instances with more than two sinks.
In Chapter 5 we claimed that it is necessary to add a bifurcation delay in our delay
model to take additional capacitance on side paths into account. Table 8.8 shows
the results of scaling the precomputed dnode value by certain factors optimizing
our instances with ξ = 1.0. We only consider instances with more than two sinks,
because dnode has little effect for two-sink instances and no effect on single-sink
instances. Disabling dnode altogether has good effects on the repeater tree length
because no detours are added to avoid additional delay on paths to critical sinks.
On the other hand, the average slack deviation goes up significantly, and a lot of
repeaters are added to shield off critical repeaters.
If dnode gets too big, the algorithm tends to build more balanced topologies.
Detours are accepted to decrease the number of bifurcations on root-sink paths.
The result is high netlength and a high repeater count.
Altogether, our current choice of dnode with factor 1.0 has the lowest repeater
count and slack deviation. The additional netlength is acceptable because it leads
to fewer repeaters.
8.6 Disabling Effort Assignment
All experiments so far used the assign effort step (see Section 6.1) to reduce the
power consumption in tree parts that are above the slack target. To evaluate the
effects of this step, we compared it to runs where the step was skipped using ξ = 1.0.
Table 8.9 shows how power consumption can be saved due to AssignEffort.
We see that for some instances the power savings are significant. The potential
depends on how critical the timing of the design is. The table shows that for some
instances the power gets worse. This is due to the inexact repeater insertion. Most
instances, however, are completely below the slack threshold and AssignEffort
does not apply.
             Repeaters             pwr(A)/pwr(N)   pwr(A) − pwr(N)
Design       Assign    No Assign                   <0       =0        >0
Baldassare   28151     30687       88.87 %         1849     18108     668
Beate        53198     55264       95.16 %         1588     31938     1104
Wolfram      40469     45411       89.68 %         2694     40816     1247
Gerben       44680     45852       98.47 %         1192     42300     1295
Luciano      165743    179219      89.38 %         8619     89374     3802
Benedikt     183350    289006      60.38 %         32021    201440    6887
Renaud       155931    152752      90.73 %         7492     256103    5080
Julius       604506    666291      89.18 %         25790    250624    9164
Franziska    631607    667934      92.86 %         14515    306362    9182
Meinolf      588236    614779      93.55 %         18623    335051    12974
Iris         963296    1026381     92.85 %         18016    365939    9861
Gautier      641085    1179108     49.22 %         186116   1079702   1593

Table 8.9: Repeaters used and power consumption in runs with and without AssignEffort. pwr(A) (pwr(N)) is the power consumption of the run with (without)
AssignEffort. The last three columns show on how many instances the
power has improved (< 0), stayed the same (= 0), or degraded (> 0) due to
AssignEffort.
             Slack Degradation     slk(A) − slk(N)
Design       Assign    No Assign   <0      =0        >0
Baldassare   0.54      0.53        594     19678     353
Beate        1.27      1.28        818     33189     623
Wolfram      0.43      0.46        717     43166     874
Gerben       1.79      1.79        832     42300     676
Luciano      0.56      0.56        2539    97663     1593
Benedikt     0.08      0.08        1041    238633    674
Renaud       0.20      0.18        1533    266437    705
Julius       0.77      0.76        7049    273785    4744
Franziska    0.98      0.97        6219    320887    3553
Meinolf      0.60      0.56        8225    355124    3299
Iris         0.51      0.49        5435    385838    2543
Gautier      0.02      0.00        989     1266301   121

Table 8.10: Slack degradation in runs with and without AssignEffort. The last three
columns show for how many instances the slack below the slack target got
worse, stayed the same, or got better due to AssignEffort.
Table 8.10 shows how the slack is affected by AssignEffort. The slack degradation
is measured in the same way as in Section 8.2; the overall average degradation
is shown. The effect of AssignEffort is that for a lot of instances the slack gets
closer to the slack target. Due to the heuristic nature of our buffer insertion, some
slacks can end up worse than the slack target. This is also the reason for improvements
on some nets: a different repeater insertion on an uncritical side path can
accidentally lead to better buffering on the critical path. Overall, the vast majority of
instances does not get worse.
8.7 Disabling Parallel Mode
A large part of the complexity of our repeater insertion routine lies in handling
clusters in parallel mode. Buffering algorithms that use a variant of van Ginneken's
algorithm, on the other hand, do not change the topology during processing and
produce good results. We therefore have to ask whether using parallel clusters is
worth the effort.
We processed the test instances with a variant of our algorithm that is not allowed
to enter parallel mode. Merge solutions that would enter parallel mode are disabled.
Table 8.11 shows the result of the comparison summed up for all technologies. The
table compares the slack deviation from our upper bound, the power consumption
and number of inserted inverter stages, and the deviation from the Minimum Steiner
Tree for both runs. In general, the slack gets better if we are allowed to walk in
parallel. The static power consumption is reduced by about 20 %. The average
increase in netlength that corresponds to increase in dynamic power consumption is
quite small. It is expected that netlength increases if we are allowed to go in parallel
due to segments that are used twice. As we have disabled merge cases that result in
parallel walk, the work done during merging decreases. Accordingly, the running
time is smaller. Given the big improvements in slack and static power consumption,
we accept the small degradations in netlength and running time.
8.8 Choosing Tradeoff Parameters
So far, we used our power-slack parameter ξ uniformly for all parts of the algorithm.
However, as explained in Section 7.3.2 it is possible to use a different parameter for
topology generation (ξt ), repeater insertion (ξr ), or buffering mode selection (ξm ).
We have varied all three parameters independently from 0.0 to 1.0 in 0.1 steps and
optimized each design for each combination. The input netlists were placed, repeater
trees were optimized for power, and gate sizing was performed. For each of the
resulting 11 × 11 × 11 runs we measured the resulting sum of negative endpoint slacks
(SNS), area consumption (which correlates with power consumption), netlength,
and worst slack. The measurements are not restricted to repeater trees. Instead,
SNS is read at timing points where tests are performed. All nets are counted for
netlength. Area consumption counts all gates in the design.
             Slack Deviation [ps]             Length Deviation [%]
             Parallel         No Parallel     Parallel         No Parallel
# Sinks      Avg.     Max.    Avg.    Max.    Avg.    Max.     Avg.    Max.
1             0.28    36.20    0.28   36.20     0.0     0.0      0.0     0.0
2             1.53    41.61    2.43   43.85     0.0     0.0      0.0     0.0
3             1.83    37.42    3.20   43.55    14.7    99.8     14.7    99.8
4             2.99    60.79    5.72   60.79    15.9   183.5     15.8   183.5
5             3.40    72.62    5.80   81.64    30.1   196.6     29.8   196.6
6             4.00    62.55    7.03   62.55    39.2   198.4     39.0   198.4
7             6.27    92.07    9.49   93.68    31.8   314.4     31.4   314.4
8             7.32   101.25    9.62   88.82    27.4   255.8     27.0   307.3
9–20          7.24    73.60   10.13   76.63    25.6   585.8     24.0   585.8
21–50        12.24    92.16   15.50  112.70    44.5  1352.1     41.7  1347.0
51–100       18.17   109.36   24.55  110.56    65.5  1755.5     60.4  1755.6
101–250      19.50   105.18   28.24  114.12    70.2  2110.0     63.8  2041.7
251–500      26.48   154.40   38.84  148.21   110.3  3321.0     99.5  3196.4
501–1000     73.19   232.30   89.59  198.38   229.1  1643.9    220.9  1636.7
> 1000       59.89   163.90   74.60  177.99   380.9  1096.2    372.2  1086.4
Total         1.21   232.30    1.79  198.38    16.2  3321.0     15.7  3196.4
Total >2      4.14   232.30    6.50  198.38    31.5  3321.0     30.6  3196.4

             Power                                    Running Time [s]
             Parallel            No Parallel          Parallel         No Parallel
# Sinks      Pwr.      Rpt.     Pwr.       Rpt.       Top.    Buf.     Top.    Buf.
1            738.292   1545171  738.291    1545171    58.77   513.68   60.57   521.59
2            476.554    814674  513.288     852053    16.21   176.20   16.21   167.21
3            310.435    717013  339.320     764715    14.50   166.36   14.34   154.34
4            271.303    528539  362.414     574810    12.06   150.73   11.78   133.12
5            115.726    250945  137.936     277530     5.14    66.42    5.06    57.58
6            293.898    350513  337.360     381772     5.21    71.96    5.16    62.54
7            220.964    294115  259.513     329636     4.81    68.37    4.78    56.56
8            199.535    306462  224.720     329179     4.75    67.55    4.71    58.32
9–20         179.345    810857  225.163     912467    19.69   263.99   19.78   218.87
21–50        187.628    612392  219.563     682592    23.55   246.82   23.39   204.03
51–100        54.287    392700   71.303     469790    20.62   150.97   20.93   118.91
101–250       40.271    338506   53.588     420841    34.20   146.51   34.46   114.15
251–500       20.481     84870   25.198     112493    26.52    51.69   26.31    38.03
501–1000      32.716     39868   38.808      49175    40.53    25.18   39.83    18.70
> 1000        21.881     53972   27.224      60018     5.34     8.04    5.37     6.05
Total       3163.316   7140597 3573.690    7762242   291.89  2174.45  292.70  1929.99
Total >2    1948.470   4780752 2322.111    5365018   216.92  1484.57  215.92  1241.19

Table 8.11: Testing the effects of parallel mode.
SNS [ps]     Worst Slack [ps]   Area      Netlength [nm]   ξt    ξr    ξm
-2460784     -360.960           4264514   11421392775      1.0   0.7   1.0
-2503732     -371.797           4161722   11396585038      1.0   0.6   1.0
-2526592     -368.434           4139336   11384182618      1.0   0.7   0.9
-2529250     -361.152           4039524   11357719162      1.0   0.6   0.9
-2581487     -400.693           4003001   10823042062      0.9   0.6   0.9
-2587899     -351.772           3977094   11315124985      1.0   0.6   0.8
-2598980     -345.037           3940266   11287051361      1.0   0.6   0.7
-2661338     -357.609           3909961   11255902718      1.0   0.6   0.6
-2692252     -381.317           3882043   11247554519      1.0   0.5   0.7
-2721807     -380.967           3856202   11222038997      1.0   0.5   0.6
-2774560     -381.654           3841977   11215744785      1.0   0.4   0.7
-2812004     -378.829           3816806   11186602609      1.0   0.4   0.6
-2909451     -402.322           3815549   10683834075      0.7   0.5   0.5
-2912460     -402.322           3814486   10667397255      0.6   0.5   0.5
-2914087     -408.144           3802287   10777654190      0.9   0.4   0.6
-2933527     -422.991           3797866   11155820658      1.0   0.4   0.5
-2963781     -397.433           3795544   11146478260      1.0   0.3   0.6
-2975351     -415.343           3792524   11131954147      1.0   0.4   0.4
-2985396     -403.745           3785308   10769355282      0.9   0.4   0.5
-3006647     -387.869           3784393   10702389423      0.8   0.4   0.5
-3022317     -381.193           3780095   10764354020      0.9   0.4   0.4
-3052016     -420.933           3779169   11128458742      1.0   0.3   0.5
-3055216     -414.426           3774567   11104510839      1.0   0.3   0.4
-3096810     -387.917           3769622   10762057030      0.9   0.3   0.5
-3136457     -434.217           3765553   10756910041      0.9   0.3   0.4
-3154245     -407.107           3765336   10696614904      0.8   0.3   0.4
-3157275     -455.027           3763362   11074617288      1.0   0.3   0.0
-3200980     -411.824           3759277   10753998314      0.9   0.3   0.3
-3273230     -433.453           3759111   10697382852      0.8   0.3   0.3
-3282542     -427.694           3756085   10748223421      0.9   0.3   0.0
-3419364     -431.093           3754110   10734650521      0.9   0.2   0.0

Table 8.12: Non-dominated parameter sets on the Franziska design.
SNS [ps]     Worst Slack [ps]   Area      Netlength [nm]   ξt    ξr    ξm
-5196277     -631.933           4527479   8925535585       0.9   0.8   0.9
-5257051     -635.508           4509205   8819283865       0.8   0.8   0.9
-5269220     -635.053           4472708   8785736183       0.8   0.7   1.0
-5365769     -635.359           4348123   8906582047       0.9   0.7   0.9
-5383743     -638.916           4336342   8801360797       0.8   0.7   0.9
-5449405     -636.583           4247059   8816022521       0.8   0.7   0.8
-5503767     -653.201           4197911   8928168330       0.9   0.7   0.7
-5546630     -635.822           4157311   8907015090       0.9   0.6   0.8
-5621983     -637.161           4143570   8775993529       0.7   0.6   0.8
-5653641     -666.391           4136134   8934564389       0.9   0.7   0.5
-5666127     -649.055           4090594   8907969542       0.9   0.6   0.7
-5725306     -649.055           4082070   8817992649       0.8   0.6   0.7
-5780351     -642.621           4079167   8786841694       0.7   0.6   0.7
-5807757     -658.837           4039493   8825447381       0.8   0.6   0.6
-5860053     -663.175           4035272   8794310975       0.7   0.6   0.6
-5919674     -657.425           4031406   8777209504       0.6   0.6   0.6
-6007460     -661.243           4004925   8906715403       0.9   0.5   0.7
-6071154     -666.057           3964178   8904311050       0.9   0.5   0.6
-6130646     -677.938           3953669   9410850288       1.0   0.5   0.5
-6189021     -666.374           3948620   8776106642       0.6   0.5   0.6
-6306196     -660.793           3919318   9393366028       1.0   0.5   0.4
-6434807     -682.030           3896112   8898249360       0.9   0.5   0.4
-6512938     -695.812           3874760   8893895626       0.9   0.5   0.3
-6635688     -662.423           3872070   9372604172       1.0   0.4   0.4
-6639647     -759.335           3859165   9345686625       1.0   0.5   0.0
-6656564     -673.105           3851146   9357824853       1.0   0.4   0.3
-6674811     -725.778           3847456   8897120664       0.9   0.5   0.1
-6694420     -766.073           3838560   8897072449       0.9   0.5   0.0
-6745947     -718.396           3833113   8893540101       0.9   0.4   0.3
-6836574     -728.312           3827277   8829467006       0.8   0.4   0.3
-6864691     -737.688           3819098   8895643021       0.9   0.4   0.2
-7008116     -738.604           3807331   8893223837       0.9   0.4   0.1
-7052258     -732.529           3803516   8831553876       0.8   0.4   0.1
-7104017     -762.535           3798011   8887622641       0.9   0.4   0.0
-7165642     -741.704           3781045   9280930291       1.0   0.3   0.0
-7306835     -778.668           3769847   8875640716       0.9   0.3   0.0
-7430083     -778.668           3769836   8825179592       0.8   0.3   0.0
-7819054     -844.052           3765793   8852407160       0.9   0.2   0.1
-8071842     -797.117           3759992   8848223450       0.9   0.2   0.0

Table 8.13: Non-dominated parameter sets on the Iris design.
Table 8.12 and Table 8.13 show non-dominated parameter sets (ξt , ξr , ξm ) for two
example designs. A set is not dominated if there is no other set with better or equal
SNS and better or equal area. Area consumption is measured in units internal to
the placement engine. It can only be compared relatively.
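The dominance filter described above can be sketched as follows (SNS and area only; larger, i.e. less negative, SNS and smaller area are better; the function names are ours, for illustration):

```python
def non_dominated(points):
    """Keep the points (sns, area) for which no other point is at least as
    good in both objectives and strictly better in at least one."""
    def dominates(q, p):
        return (q[0] >= p[0] and q[1] <= p[1]
                and (q[0] > p[0] or q[1] < p[1]))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# The third run is dominated by the second (equal area, worse SNS).
runs = [(-100, 50), (-90, 60), (-95, 60)]
print(non_dominated(runs))
```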
The tables indicate that choosing ξt above 0.7 is preferable, even if one does not
care about timing. On the other hand, ξm and ξr scale nicely between power-aware
and slack-optimizing repeater trees.
A Detailed Comparison Tables
# Sinks         ξ=0.0    ξ=0.1    ξ=0.2    ξ=0.3    ξ=0.4    ξ=0.5    ξ=0.6    ξ=0.7    ξ=0.8    ξ=0.9    ξ=1.0    DP
1        Avg.   10.64     9.44     7.48     5.87     4.62     3.70     2.87     2.00     1.39     0.80     0.38     0.07
         Max.  247.71   213.35   190.29   167.15   148.34   132.12   126.82    88.91    66.38    40.60    23.81    43.72
2        Avg.   25.26    21.34    16.90    13.42    11.13     9.27     7.87     6.39     5.03     3.44     1.59     0.94
         Max.  256.22   193.40   155.58   137.14   124.63   106.57    90.57    68.01    53.13    30.26    31.78    31.27
3        Avg.   36.01    29.37    22.61    18.34    15.12    12.47    10.35     8.31     6.28     4.17     2.31     1.41
         Max.  220.36   171.63   141.15   131.27   125.30    93.83    78.46    63.28    46.90    31.32    32.90    36.92
4        Avg.   28.06    23.00    18.09    14.76    12.14    10.16     8.70     7.36     5.83     3.96     2.45     1.72
         Max.  235.76   165.66   131.52   112.36   103.72    97.22    67.57    48.54    52.09    35.72    35.41    32.18
5        Avg.   45.04    34.24    25.04    19.89    16.29    13.27    11.10     8.99     6.68     4.40     3.01     2.03
         Max.  320.20   292.29   253.06   160.18   124.47    92.84    77.20    70.00    55.70    46.37    41.38    27.92
6        Avg.   71.94    56.64    44.09    34.07    27.78    23.49    20.16    15.98    11.39     6.96     4.36     2.27
         Max.  269.21   184.60   168.54   149.68   134.47   121.58   103.94    83.45    64.64    49.41    57.69    57.49
7        Avg.   94.60    63.26    47.82    38.31    31.67    26.16    22.10    17.23    12.88     8.81     6.25     3.70
         Max.  374.31   262.16   214.23   230.03   156.98   146.81   138.78   112.25   104.10   104.44    92.07    84.38
8        Avg.  109.82    72.70    55.14    45.16    37.91    31.37    26.51    20.84    15.41    10.23     7.12     4.13
         Max.  584.53   584.53   273.53   273.53   185.21   180.72   149.05   120.49   108.67    93.22   101.25    83.78
9–20     Avg.   88.28    58.02    41.49    33.78    28.45    23.99    19.91    15.05    11.12     8.10     6.60     4.79
         Max.  557.50   557.50   227.28   227.28   188.57   159.22   107.55    82.63    71.42    50.14    73.60    61.00
21–50    Avg.  127.30    93.22    74.47    62.32    52.33    43.39    35.49    27.87    21.64    16.89    15.01     9.95
         Max.  480.46   266.21   237.06   203.11   167.58   147.85   136.01   105.62    77.62    71.61    92.16    71.64
51–100   Avg.  203.58   132.33    97.18    79.59    64.40    52.39    40.45    29.60    22.61    17.49    15.56    11.18
         Max.  693.86   350.25   339.29   284.41   266.51   208.37   182.17   153.52   135.76   120.09   109.36    77.88
101–250  Avg.  275.94   155.55   111.38    92.02    72.73    60.87    46.41    33.80    25.04    19.19    17.13    12.27
         Max.  569.89   363.89   259.88   222.04   201.31   188.48   135.90   122.70    87.87    87.92    81.82    69.03
251–500  Avg.  312.17   171.85   120.24    93.99    71.59    51.82    36.83    27.54    21.10    16.18    16.39    12.46
         Max.  701.41   414.25   266.72   273.57   229.18   186.29   131.03   116.98   106.49   113.01   148.72    95.53
501–1000 Avg.  670.98   357.14   271.76   216.80   181.17   147.32   120.71    97.80    83.03    69.00    65.18    46.11
         Max. 1329.86   696.60   567.61   511.95   418.86   355.99   352.00   236.72   207.64   174.89   178.36   132.57
> 1000   Avg.  367.51   271.57   219.91   204.83   166.20   146.71   133.56   111.10   100.55    81.25    66.35    36.37
         Max. 1082.28   849.02   736.17   666.23   364.46   300.75   294.24   232.69   212.43   141.56   117.80    83.70
Total    Avg.   21.01    16.91    13.16    10.51     8.54     7.00     5.71     4.37     3.23     2.11     1.28     0.70
         Max. 1329.86   849.02   736.17   666.23   418.86   355.99   352.00   236.72   212.43   174.89   178.36   132.57
Total >2 Avg.   53.64    39.88    30.45    24.70    20.46    17.01    14.24    11.37     8.61     5.93     4.12     2.69
         Max. 1329.86   849.02   736.17   666.23   418.86   355.99   352.00   236.72   212.43   174.89   178.36   132.57

Table A.1: Worst Slack Deviation on 45 nm Instances
# Sinks         ξ=0.0   ξ=0.1   ξ=0.2   ξ=0.3   ξ=0.4   ξ=0.5   ξ=0.6   ξ=0.7   ξ=0.8   ξ=0.9   ξ=1.0   DP
1        Avg.     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
         Max.     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
2        Avg.     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
         Max.     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
3        Avg.     0.1     0.1     0.1     0.1     0.1     0.2     0.3     0.5     1.5     4.8    16.9    16.9
         Max.    89.5    92.1    92.1    92.1    92.1    92.1    92.1    92.1    92.1    96.2    99.3    99.3
4        Avg.     0.4     0.4     0.5     0.5     0.6     0.7     0.8     1.2     2.9     8.7    27.2    27.2
         Max.    94.9    96.1    96.1    96.1    96.1    96.1    94.1    87.3    94.1   109.3   167.2   167.2
5        Avg.     2.5     2.6     2.5     2.4     2.5     2.5     2.4     2.8     4.5     8.0    26.9    26.9
         Max.    96.2    97.1    97.1    96.8    96.8    96.8    96.8    96.8    96.8   118.4   125.8   125.8
6        Avg.     0.9     0.9     0.9     0.9     0.9     0.9     0.9     1.2     2.4     4.3    11.3    11.3
         Max.    97.6    97.6    97.6    97.6    97.6    97.6    93.9   106.6   125.3   128.2   151.0   151.0
7        Avg.     3.7     2.2     2.2     2.0     1.9     1.8     1.7     1.8     2.3     3.8    34.2    34.2
         Max.    97.6    98.5    98.5    98.5    96.0    98.9    98.9   128.1   134.7   160.4   221.4   221.4
8        Avg.     4.1     3.1     2.9     2.7     2.6     2.5     2.4     2.4     2.8     4.3    34.4    34.4
         Max.    99.4    96.7    96.7    94.9    94.5    94.3    93.9   112.9   137.0   112.9   247.7   247.7
9–20     Avg.     7.7     6.1     5.9     5.9     6.0     6.0     5.9     5.9     8.0    12.0    34.6    34.6
         Max.    97.9    97.9    96.1    95.8    95.8    94.9    93.2   107.1   112.3   123.2   319.4   319.4
21–50    Avg.     8.9     8.5     8.7     8.9     9.1     9.6     9.6    10.0    12.4    17.5    51.9    51.7
         Max.    86.2    86.2    86.2    86.2    86.2    89.4    89.4    81.4    78.3   102.0  1352.1  1337.8
51–100   Avg.    20.3    19.6    19.6    19.4    19.4    19.5    18.1    15.7    15.8    19.0    52.1    51.9
         Max.    84.9    83.0    83.0    83.0    86.5    78.5    84.3    62.1    58.8    61.7   313.1   310.7
101–250  Avg.    32.5    31.9    31.6    30.9    31.8    32.4    29.4    25.2    24.5    26.9    58.1    57.8
         Max.    92.1    92.1    85.2    85.2    85.2    86.6    76.2    64.5    56.7    64.5   638.6   638.0
251–500  Avg.    18.7    21.7    22.8    24.7    24.0    25.6    26.2    23.5    24.6    27.4    58.2    57.9
         Max.    49.5    54.8    51.7    51.5    49.0    49.0    54.9    36.7    34.4    38.2   609.6   609.6
501–1000 Avg.    24.0    26.8    27.8    28.6    28.4    27.6    26.5    27.0    28.7    34.8   286.7   284.3
         Max.    58.5    60.5    57.6    57.5    57.4    52.9    42.9    37.1    40.6    43.3  1026.0  1019.6
> 1000   Avg.   373.5   381.5   380.5   384.1   385.6   388.1   387.0   391.9   396.0   399.8   440.4   432.2
         Max.   837.7   833.1   832.0   841.3   852.5   874.0   896.3   877.7   904.3   926.3   947.4   950.2
Total    Avg.     2.0     1.8     1.7     1.7     1.8     1.8     1.8     1.8     2.4     4.0    14.8    14.8
         Max.   837.7   833.1   832.0   841.3   852.5   874.0   896.3   877.7   904.3   926.3  1352.1  1337.8
Total >2 Avg.     4.3     3.7     3.6     3.6     3.7     3.7     3.6     3.8     5.0     8.2    30.8    30.7
         Max.   837.7   833.1   832.0   841.3   852.5   874.0   896.3   877.7   904.3   926.3  1352.1  1337.8

Table A.2: Length Deviation on 45 nm Instances
A Detailed Comparison Tables

Table A.3: Power Consumption on 45 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Pwr. and Rpt. values.)
Table A.4: Runtime on 45 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Top. and Buf. values.)
Table A.5: Worst Slack Deviation on 32 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Avg. and Max. values.)
Table A.6: Length Deviation on 32 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Avg. and Max. values.)
Table A.7: Power Consumption on 32 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Pwr. and Rpt. values.)
Table A.8: Runtime on 32 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Top. and Buf. values.)
Table A.9: Worst Slack Deviation on 22 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Avg. and Max. values.)
Table A.10: Length Deviation on 22 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Avg. and Max. values.)
Table A.11: Power Consumption on 22 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Pwr. and Rpt. values.)
Table A.12: Runtime on 22 nm Instances
(Table body omitted: rows group instances by sink count, 1 through > 1000, plus Total and Total >2; columns are ξ = 0.0, 0.1, …, 1.0 and DP, each with Top. and Buf. values.)
Summary
Repeaters (inverters and buffers) are the logic gates that dominate modern chip
designs. We see designs in which up to 50 % of all gates are repeaters.
Repeaters are used during physical design of chips to improve the electrical and
timing properties of interconnections. They are added along Steiner trees that
connect root gates to sinks, creating repeater trees. Repeater tree construction
has become a crucial part of chip design, with great impact on all other design
steps, for example placement and routing.
We first present an extensive version of the Repeater Tree Problem. Our
problem formulation encapsulates most of the constraints that have been studied so
far. We also consider several aspects for the first time, for example slew-dependent
required arrival times at repeater tree sinks. These make our formulation better
suited to the challenges of real-world repeater tree construction.
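To illustrate such a constraint, a slew-dependent required arrival time can be thought of as a nominal required arrival time that tightens as the input transition degrades. The linear penalty, target slew, coefficient, and all numbers below are illustrative assumptions, not the exact model of this thesis:

```python
# Hypothetical linear model of a slew-dependent required arrival time:
# a sink tolerates its nominal RAT only up to a target slew; slower
# transitions must arrive correspondingly earlier. All numbers invented.
def required_arrival_time(nominal_rat, slew, slew_target=50.0, penalty=0.5):
    """Effective RAT (ps) at a sink for a given input slew (ps)."""
    return nominal_rat - penalty * max(0.0, slew - slew_target)

print(required_arrival_time(200.0, 50.0))  # at target slew: 200.0
print(required_arrival_time(200.0, 90.0))  # degraded slew: 180.0
```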
For creating good repeater trees, one has to take the overall design environment
into account. The employed technology, the properties of available repeaters and
metal wires, the shape of the chip, the temperature, the voltages, and many other
factors highly influence the results of repeater tree construction. To take all this into
account, we extensively preprocess the environment to extract parameters for our
algorithms. These parameters allow us to quickly and yet quite accurately estimate
the timing of a tree before it has even been buffered.
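For instance, once such parameters are available, the delay to each sink can be estimated from the tree geometry alone. The sketch below assumes two extracted parameters, a delay per unit length of optimally buffered wire and a per-bifurcation penalty; both names and values are invented for illustration:

```python
# Hypothetical pre-buffering delay estimate based on two parameters that
# the preprocessing step is assumed to have extracted:
#   D_WIRE -- delay per unit length of an optimally buffered wire
#   D_NODE -- additional delay incurred at each branching point
D_WIRE = 0.05   # e.g. ps per micron (assumed value)
D_NODE = 10.0   # e.g. ps per bifurcation on the root-sink path (assumed)

def estimate_sink_delay(path_length, branch_points):
    """Estimate root-to-sink delay of a tree before any repeater exists."""
    return D_WIRE * path_length + D_NODE * branch_points

print(estimate_sink_delay(2000.0, 3))  # 2000 units away, 3 bifurcations
```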
We present an algorithm for Steiner tree creation and prove that our algorithm is
able to create timing-efficient as well as cost-efficient trees. Our algorithm is based
on a delay model that accurately describes the timing that one can achieve after
repeater insertion. This makes our algorithm suitable for creating good Steiner
trees, the input for subsequent repeater insertion algorithms.
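A minimal sketch of the kind of trade-off such a construction makes when attaching a new sink: a parameter ξ blends wirelength cost against delay cost (ξ = 0 favors short trees, ξ = 1 favors fast trees). The cost function below is an illustrative stand-in, not the exact objective of the thesis:

```python
# Illustrative trade-off when attaching a new sink to a growing topology:
# xi = 0 minimizes wirelength only, xi = 1 minimizes delay only.
def insertion_cost(extra_wirelength, delay_increase, xi):
    """Blend of the two objectives for one candidate attachment point."""
    return (1.0 - xi) * extra_wirelength + xi * delay_increase

def best_attachment(candidates, xi):
    """candidates: (extra_wirelength, delay_increase) per attachment point."""
    return min(candidates, key=lambda c: insertion_cost(c[0], c[1], xi))

cands = [(100.0, 50.0), (40.0, 120.0)]
print(best_attachment(cands, 0.0))  # wirelength only -> (40.0, 120.0)
print(best_attachment(cands, 1.0))  # timing only     -> (100.0, 50.0)
```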
Next, we deal with the problem of adding repeaters to a given Steiner tree. The
algorithms predominantly used to solve this problem are based on dynamic programming.
However, they have several drawbacks. Firstly, potential repeater positions along
the Steiner tree have to be chosen upfront. Secondly, the algorithms strictly
follow the given Steiner tree and miss optimization opportunities. Finally, dynamic
programming causes high running times. We present our new buffer insertion
algorithm, Fast Buffering, that overcomes these limitations. It produces results
of similar quality to a dynamic programming approach at a much better running
time. In addition, we present improvements to the dynamic programming approach
that allow us to push the quality further at the expense of a higher running time.
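The dynamic programming approach referred to here is van Ginneken-style candidate propagation. The toy sketch below shows the principle on a single root-to-sink path with precomputed buffer positions: candidates (downstream capacitance, required arrival time) move towards the root, a buffer may be inserted at each station, and dominated candidates are pruned. All electrical parameters are invented; real implementations handle whole trees, multiple buffer types, and slews:

```python
R_WIRE, C_WIRE = 0.1, 0.2            # wire resistance / capacitance per unit (assumed)
BUF_R, BUF_C, BUF_D = 1.0, 0.5, 5.0  # buffer drive resistance, input cap, delay (assumed)

def add_wire(cands, length):
    """Propagate (downstream cap, RAT) candidates across a wire segment."""
    out = []
    for cap, rat in cands:
        delay = R_WIRE * length * (C_WIRE * length / 2.0 + cap)  # Elmore delay
        out.append((cap + C_WIRE * length, rat - delay))
    return out

def add_buffer_option(cands):
    """At a legal position, either keep a candidate or insert a buffer."""
    out = list(cands)
    for cap, rat in cands:
        out.append((BUF_C, rat - BUF_D - BUF_R * cap))
    return prune(out)

def prune(cands):
    """Discard dominated candidates: RAT must strictly increase with cap."""
    kept, best_rat = [], float("-inf")
    for cap, rat in sorted(cands, key=lambda c: (c[0], -c[1])):
        if rat > best_rat:
            kept.append((cap, rat))
            best_rat = rat
    return kept

# One sink (cap 1.0, RAT 100.0) reached through three segments of length 10,
# with a potential buffer position after each segment:
cands = [(1.0, 100.0)]
for seg in (10.0, 10.0, 10.0):
    cands = add_buffer_option(add_wire(cands, seg))
best = max(cands, key=lambda c: c[1])  # candidate with the best RAT at the root
print(best)
```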
We have implemented our algorithms as part of the BonnTools physical design
optimization suite developed at the Research Institute for Discrete Mathematics in
cooperation with IBM. Our algorithms are used by engineers dealing with
some of the most complex chips in the world. When we released the first version
of our global optimization tools, it became possible for the first time to optimize
chips with several million gates within reasonable running times. At the same time,
designers were able to achieve compelling results. Our tools are not only used for
global optimization in early design stages. For later stages of physical design, a
more accurate version of our algorithm can be enabled that is able to squeeze out
the last tenth of a picosecond.
Our implementation deals with all the tedious details of a mature real-world chip
optimization environment. At the same time, we offer a clean framework abstracting
away the details such that new repeater tree construction algorithms can easily be
implemented.
As a side project, we implemented a blockage map that helps manage the
free/blocked information, not only for our algorithms but also for other optimization
tools within BonnTools. Recently, the congestion map that we implemented has
been added as a fast mode to BonnRouteGlobal, the global routing engine used
throughout IBM's physical design flow.
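A grid-based blockage map of this kind can be sketched as follows; the tile size and the interface are invented for illustration, not the BonnTools API:

```python
# Toy grid blockage map: the chip area is discretized into square tiles,
# and each tile records whether it is free or blocked for placement.
class BlockageMap:
    def __init__(self, width, height, tile):
        self.tile = tile
        self.nx = (width + tile - 1) // tile
        self.ny = (height + tile - 1) // tile
        self.blocked = [[False] * self.nx for _ in range(self.ny)]

    def block_rect(self, x0, y0, x1, y1):
        """Mark all tiles overlapping the rectangle as blocked."""
        for gy in range(y0 // self.tile, min(self.ny, y1 // self.tile + 1)):
            for gx in range(x0 // self.tile, min(self.nx, x1 // self.tile + 1)):
                self.blocked[gy][gx] = True

    def is_free(self, x, y):
        """Query whether the tile containing (x, y) is free."""
        return not self.blocked[y // self.tile][x // self.tile]

m = BlockageMap(1000, 1000, 10)
m.block_rect(100, 100, 200, 200)
print(m.is_free(150, 150), m.is_free(500, 500))  # False True
```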
We have created extensive experimental results on challenging real-world test cases
provided by our cooperation partner. The testbed consists of more than 3.3 million
different repeater tree instances. The average running time for a single instance is
about 0.6 milliseconds, which means that we can solve about 5.7 million instances
per hour. We also compare our implementation to a state-of-the-art industrial tool
and show that our algorithm produces better results with fewer electrical violations.
B Bibliography
Noga Alon and Yossi Azar. On-Line Steiner Trees in the Euclidean Plane. Discrete
& Computational Geometry, 10:113–121, 1993. doi: 10.1007/BF02573969.
Charles J. Alpert and Anirudh Devgan. Wire Segmenting for Improved Buffer
Insertion. In Proceedings of the 34th Annual Design Automation Conference, DAC
’97, pages 588–593, New York, NY, USA, 1997. ACM. doi: 10.1145/266021.266291.
Charles J. Alpert, Anirudh Devgan, and Stephen T. Quay. Buffer insertion with
accurate gate and interconnect delay computation. In Proceedings of the 36th
Annual ACM/IEEE Design Automation Conference, DAC ’99, pages 479–484,
1999. doi: 10.1145/309847.309983.
Charles J. Alpert, R. Gopal Gandham, Jose L. Neves, and Stephen T. Quay. Buffer
Library Selection. In Proceedings of the International Conference on Computer
Design, pages 221–226, Los Alamitos, CA, USA, 2000. IEEE Computer Society.
doi: 10.1109/ICCD.2000.878289.
Charles J. Alpert, Anirudh Devgan, John P. Fishburn, and Stephen T. Quay.
Interconnect Synthesis Without Wire Tapering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(1):90–104, 2001a. doi:
10.1109/43.905678.
Charles J. Alpert, Gopal Gandham, Jiang Hu, Jose L. Neves, Stephen T. Quay,
and Sachin S. Sapatnekar. Steiner Tree Optimization for Buffers, Blockages, and
Bays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 20(4):556–562, 2001b. doi: 10.1109/43.918213.
Charles J. Alpert, Gopal Gandham, Miloš Hrkić, Jiang Hu, Andrew B. Kahng,
John Lillis, Bao Liu, Stephen T. Quay, Sachin S. Sapatnekar, and A. J. Sullivan. Buffered Steiner Trees for Difficult Instances. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 21(1):3–14, 2002. doi:
10.1109/TCAD.2005.858348.
Charles J. Alpert, Gopal Gandham, Miloš Hrkić, Jiang Hu, Stephen T. Quay, and
C. N. Sze. Porosity-Aware Buffered Steiner Tree Construction. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 23(4):517–526,
2004a. doi: 10.1109/TCAD.2004.825864.
Charles J. Alpert, Miloš Hrkić, and Stephen T. Quay. A Fast Algorithm for
Identifying Good Buffer Insertion Candidate Locations. In Proceedings of the
2004 International Symposium on Physical design, ISPD ’04, pages 47–52, New
York, NY, USA, 2004b. ACM. doi: 10.1145/981066.981076.
Charles J. Alpert, Dinesh P. Mehta, and Sachin S. Sapatnekar, editors. Handbook
of Algorithms for Physical Design Automation. Auerbach Publications, Boston,
MA, USA, 1st edition, 2008. ISBN 9780849372421.
Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen. Efficient
Generation of Short and Fast Repeater Tree Topologies. In Proceedings of the
2006 International Symposium on Physical Design, ISPD ’06, pages 120–127, New
York, NY, USA, 2006. ACM. doi: 10.1145/1123008.1123032.
Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen. Efficient
algorithms for short and fast repeater trees. I. Topology generation. Technical
Report No. 07977, Research Institute for Discrete Mathematics, University of
Bonn, 2007a.
Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen. Efficient
algorithms for short and fast repeater trees. II. Buffering. Technical Report
No. 07978, Research Institute for Discrete Mathematics, University of Bonn,
2007b.
Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen. Fast
Buffering for Optimizing Worst Slack and Resource Consumption in Repeater Trees. In Proceedings of the 2009 International Symposium on Physical Design, ISPD ’09, pages 43–50, New York, NY, USA, 2009. ACM. doi:
10.1145/1514932.1514942.
Christoph Bartoschek, Stephan Held, Jens Maßberg, Dieter Rautenbach, and Jens
Vygen. The repeater tree construction problem. Information Processing Letters,
110(24):1079–1083, 2010. doi: 10.1016/j.ipl.2010.08.016.
Chung-Ping Chen and Noel Menezes. Noise-aware Repeater Insertion and Wire Sizing
for On-chip Interconnect Using Hierarchical Moment-Matching. In Proceedings
of the 36th Annual ACM/IEEE Design Automation Conference, DAC ’99, pages
502–506, 1999. doi: 10.1145/309847.309987.
Jason Cong and Xin Yuan. Routing Tree Construction Under Fixed Buffer Locations.
In Proceedings of the 37th Annual Design Automation Conference, DAC ’00, pages
379–384, New York, NY, USA, 2000. ACM. doi: 10.1145/337292.337502.
Jason Cong, Lei He, Cheng-Kok Koh, and Patrick H. Madden. Performance
Optimization of VLSI Interconnect Layout. Integration, the VLSI Journal, 21:1–94,
1996. doi: 10.1016/S0167-9260(96)00008-9.
Sampath Dechu, Cien Shen, and Chris Chu. An Efficient Routing Tree Construction
Algorithm With Buffer Insertion, Wire Sizing, and Obstacle Considerations. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24
(4):600–608, 2005. doi: 10.1109/TCAD.2005.844107.
William C. Elmore. The Transient Response of Damped Linear Networks with
Particular Regard to Wideband Amplifiers. Journal of Applied Physics, 19(1):
55–63, 1948. doi: 10.1063/1.1697872.
Delbert R. Fulkerson. A Network Flow Computation for Project Cost Curves.
Management Science, 7(2):167–178, 1961.
Michael R. Garey and David S. Johnson. The Rectilinear Steiner Tree Problem is
NP-Complete. SIAM Journal on Applied Mathematics, 32(4):826–834, 1977. doi:
10.1137/0132071.
Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to
the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA,
1979. ISBN 0716710447.
Michael Gester, Dirk Müller, Tim Nieberg, Christian Panten, Christian Schulte,
and Jens Vygen. BonnRoute: Algorithms and data structures for fast and good
VLSI routing. Technical Report No. 111039, Research Institute for Discrete
Mathematics, University of Bonn, 2011.
Nir Halman, Chung-Lun Li, and David Simchi-Levi. Fully polynomial time approximation schemes for time-cost tradeoff problems in series-parallel project
networks. In Proceedings of the 11th international workshop, APPROX 2008, and
12th international workshop, RANDOM 2008 on Approximation, Randomization
and Combinatorial Optimization: Algorithms and Techniques, APPROX ’08 /
RANDOM ’08, pages 91–103, Berlin, Heidelberg, 2008. Springer-Verlag. doi:
10.1007/978-3-540-85363-3_8.
Stephan Held. Timing Closure in Chip Design. PhD thesis, University of Bonn,
2008.
Stephan Held and Daniel Rotter. Shallow-Light Steiner Arborescences with Vertex
Delays. In IPCO, pages 229–241, 2013. doi: 10.1007/978-3-642-36694-9_20.
Stephan Held and Sophie Theresa Spirkl. A Fast Algorithm for Rectilinear Steiner
Trees with Length Restrictions on Obstacles. In Proceedings of the 2014 International Symposium on Physical Design, ISPD ’14, pages 37–44, New York, NY,
USA, 2014. ACM. doi: 10.1145/2560519.2560529.
Stephan Held, Bernhard Korte, Dieter Rautenbach, and Jens Vygen. Combinatorial
Optimization in VLSI Design. In Vasek Chvátal, editor, Combinatorial Optimization: Methods and Applications, volume 31 of NATO Science for Peace and
Security Series - D: Information and Communication Security, pages 33–96. IOS
Press, 2011. doi: 10.3233/978-1-60750-718-5-33.
Renato F. Hentschke, Jagannathan Narasimham, Marcelo O. Johann, and Ricardo L.
Reis. Maze Routing Steiner Trees with Effective Critical Sink Optimization. In
Proceedings of the 2007 International Symposium on Physical Design, ISPD ’07,
pages 135–142, New York, NY, USA, 2007. ACM. doi: 10.1145/1231996.1232024.
Robert B. Hitchcock, Gordon L. Smith, and David D. Cheng. Timing Analysis of
Computer Hardware. IBM Journal of Research and Development, 26(1):100–105,
1982. doi: 10.1147/rd.261.0100.
Miloš Hrkić and John Lillis. S-Tree: A Technique for Buffered Routing Tree
Synthesis. In Proceedings of the 36th ACM/IEEE Annual Design Automation
Conference, pages 578–583, 2002. doi: 10.1145/513918.514066.
Miloš Hrkić and John Lillis. Buffer Tree Synthesis With Consideration of Temporal
Locality, Sink Polarity Requirements, Solution Cost, Congestion, and Blockages.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
22(4):481–491, 2003. doi: 10.1109/TCAD.2003.809648.
Jiang Hu, Charles J. Alpert, Stephen T. Quay, and Gopal Gandham. Buffer
Insertion With Adaptive Blockage Avoidance. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 22(4):492–498, 2003. doi:
10.1109/TCAD.2003.809647.
Shiyan Hu, Charles J. Alpert, Jiang Hu, S.K. Karandikar, Zhuo Li, Weiping Shi,
and C.N. Sze. Fast Algorithms for Slew-Constrained Minimum Cost Buffering.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
26(11):2009–2022, 2007. doi: 10.1109/TCAD.2007.906477.
Shiyan Hu, Zhuo Li, and Charles J. Alpert. A fully polynomial time approximation
scheme for timing driven minimum cost buffer insertion. In Proceedings of the
46th Annual Design Automation Conference, DAC ’09, pages 424–429, New York,
NY, USA, 2009. ACM. doi: 10.1145/1629911.1630026.
Tao Huang and Evangeline F. Y. Young. Construction of rectilinear Steiner minimum
trees with slew constraints over obstacles. In Proceedings of the IEEE/ACM
International Conference on Computer-Aided Design 2012, ICCAD ’12, pages
144–151, New York, NY, USA, 2012. ACM. doi: 10.1145/2429384.2429411.
Frank Hwang. On Steiner minimal trees with rectilinear distance. SIAM Journal on
Applied Mathematics, 30(1):104–114, 1976. doi: 10.1137/0130013.
Maxim Janzen. Buffer Aware Global Routing in Chip Design. Diploma thesis,
University of Bonn, 2012.
James E. Kelley, Jr. Critical Path Planning and Scheduling: Mathematical Basis.
Operations Research, 9:296–320, 1961.
James E. Kelley, Jr and Morgan R. Walker. Critical-path planning and scheduling.
In Papers presented at the December 1-3, 1959, eastern joint IRE-AIEE-ACM
computer conference, IRE-AIEE-ACM ’59 (Eastern), pages 160–173, New York,
NY, USA, 1959. ACM. doi: 10.1145/1460299.1460318.
Bernhard Korte and Jens Vygen. Combinatorial Optimization: Theory and Algorithms. Springer Publishing Company, Incorporated, 5th edition, 2012. ISBN
9783642244872.
Bernhard Korte, Dieter Rautenbach, and Jens Vygen. Bonntools: Mathematical
Innovation for Layout and Timing Closure of Systems on a Chip. Proceedings of
the IEEE, 95(3):555–572, 2007. doi: 10.1109/JPROC.2006.889373.
Leon G. Kraft, Jr. A Device for Quantizing, Grouping, and Coding Amplitude-Modulated
Pulses. Master's thesis, Dept. of Electrical Engineering, M.I.T.,
Cambridge, Massachusetts, 1949.
Eugene L. Lawler. Combinatorial Optimization: Networks and Matroids. Dover
Books on Mathematics Series. Dover Publications, 2001. ISBN 9780486414539.
Zhuo Li and Weiping Shi. An O(bn²) Time Algorithm for Optimal Buffer Insertion
With b Buffer Types. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 25(3):484–489, 2006. doi: 10.1109/TCAD.2005.854631.
Zhuo Li, Ying Zhou, and Weiping Shi. O(mn) Time Algorithm for Optimal
Buffer Insertion of Nets With m Sinks. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 31(3):437–441, 2012. doi:
10.1109/TCAD.2011.2174639.
John Lillis, Chung-Kuan Cheng, and Ting-Ting Y. Lin. Optimal Wire Sizing and
Buffer Insertion for Low Power and a Generalized Delay Model. IEEE Journal of
Solid-State Circuits, 31(3):437–447, 1996a. doi: 10.1109/4.494206.
John Lillis, Chung-Kuan Cheng, Ting-Ting Y. Lin, and Ching-Yen Ho. New
Performance Driven Routing Techniques With Explicit Area/Delay Tradeoff and
Simultaneous Wire Sizing. In Proceedings of the 33rd Annual Design Automation
Conference, DAC ’96, pages 395–400, New York, NY, USA, 1996b. ACM. doi:
10.1145/240518.240594.
Jens Maßberg and Jens Vygen. Approximation algorithms for a facility location
problem with service capacities. ACM Transactions on Algorithms, 4(4):50:1–50:15,
2008. doi: 10.1145/1383369.1383381.
Dirk Müller. Fast Resource Sharing in VLSI Routing. PhD thesis, University of
Bonn, 2009.
Dirk Müller, Klaus Radke, and Jens Vygen. Faster min-max resource sharing in
theory and practice. Mathematical Programming Computation, 3:1–35, 2011. doi:
10.1007/s12532-011-0023-y.
Matthias Müller-Hannemann and Ute Zimmermann. Slack Optimization of Timing-Critical
Nets. In Algorithms - ESA 2003, volume 2832 of Lecture Notes in Computer
Science, pages 727–739. Springer Berlin Heidelberg, 2003. doi:
10.1007/978-3-540-39658-1_65.
Takumi Okamoto and Jason Cong. Buffered Steiner Tree Construction with Wire Sizing for Interconnect Layout Optimization. In Proceedings of the 1996 IEEE/ACM
International Conference on Computer-Aided Design, ICCAD ’96, pages 44–
49, Washington, DC, USA, 1996. IEEE Computer Society. doi: 10.1109/ICCAD.1996.568938.
Carlos A. S. Oliveira and Panos M. Pardalos. A Survey of Combinatorial Optimization Problems in Multicast Routing. Computers & Operations Research, 32(8):
1953–1981, 2005. doi: 10.1016/j.cor.2003.12.007.
James B. Orlin. A Faster Strongly Polynomial Minimum Cost Flow Algorithm.
Operations Research, 41(2):338–350, 1993. doi: 10.1287/opre.41.2.338.
Min Pan, Chris Chu, and Priyadarshan Patra. A novel performance-driven topology
design algorithm. In Proceedings of the 2007 Asia and South Pacific Design
Automation Conference, ASP-DAC ’07, pages 244–249, Washington, DC, USA,
2007. IEEE Computer Society. doi: 10.1109/ASPDAC.2007.357993.
Lawrence Pileggi. Coping with RC(L) Interconnect Design Headaches. In Proceedings
of the 1995 IEEE/ACM International Conference on Computer-Aided Design,
ICCAD ’95, pages 246–253, Washington, DC, USA, 1995. IEEE Computer Society.
doi: 10.1109/ICCAD.1995.480019.
Jorge Rubinstein, Paul Penfield, Jr., and Mark A. Horowitz. Signal Delay in RC Tree
Networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 2(3):202–211, 1983. doi: 10.1109/TCAD.1983.1270037.
Sachin S. Sapatnekar. Timing. Kluwer, 2004. ISBN 9781402076718.
Prashant Saxena, Noel Menezes, Pasquale Cocchini, and Desmond A. Kirkpatrick.
The Scaling Challenge: Can Correct-by-Construction Design Help? In Proceedings
of the 2003 International Symposium on Physical Design, ISPD ’03, pages 51–58,
New York, NY, USA, 2003. ACM. doi: 10.1145/640000.640014.
Weiping Shi and Zhuo Li. A Fast Algorithm for Optimal Buffer Insertion. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24
(6):879–891, 2005. doi: 10.1109/TCAD.2005.847942.
Weiping Shi, Zhuo Li, and Charles J. Alpert. Complexity Analysis and Speedup
Techniques for Optimal Buffer Insertion with Minimum Cost. In Proceedings of
the 2004 Asia and South Pacific Design Automation Conference, ASP-DAC ’04,
pages 609–614, 2004. doi: 10.1109/ASPDAC.2004.1337664.
Martin Skutella. Approximation Algorithms for the Discrete Time-Cost Tradeoff
Problem. Mathematics of Operations Research, 23:909–929, 1998.
Lukas P.P.P. van Ginneken. Buffer Placement in Distributed RC-Tree Networks for
Minimal Elmore Delay. In Proceedings of the 1990 IEEE International Symposium
on Circuits and Systems, volume 2, pages 865–868, 1990. doi: 10.1109/ISCAS.1990.112223.
Jens Vygen. Slack in Static Timing Analysis. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 25(9):1876–1885, 2006. doi:
10.1109/TCAD.2005.858348.
Jürgen Werber. Logic Restructuring for Timing Optimization in VLSI Design. PhD
thesis, University of Bonn, 2007.
Jürgen Werber, Dieter Rautenbach, and Christian Szegedy. Timing Optimization by
Restructuring Long Combinatorial Paths. In Proceedings of the 2007 IEEE/ACM
International Conference on Computer-Aided Design, ICCAD ’07, pages 536–543,
Piscataway, NJ, USA, 2007. IEEE Press. doi: 10.1109/ICCAD.2007.4397320.
Yilin Zhang and David Z. Pan. Timing-driven, over-the-block rectilinear Steiner
tree construction with pre-buffering and slew constraints. In Proceedings of the
2014 International Symposium on Physical Design, ISPD ’14, pages 29–36, New
York, NY, USA, 2014. ACM. doi: 10.1145/2560519.2560533.
Yilin Zhang, Ashutosh Chakraborty, Salim Chowdhury, and David Z. Pan. Reclaiming Over-the-IP-Block Routing Resources With Buffering-Aware Rectilinear
Steiner Minimum Tree Construction. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design 2012, ICCAD ’12, pages 137–143,
New York, NY, USA, 2012. ACM. doi: 10.1145/2429384.2429410.