Fast Repeater Tree Construction

Dissertation for the attainment of the doctoral degree (Dr. rer. nat.) of the Faculty of Mathematics and Natural Sciences (Mathematisch-Naturwissenschaftliche Fakultät) of the Rheinische Friedrich-Wilhelms-Universität Bonn

Submitted by Christoph Bartoschek from Peiskretscham, Poland

Bonn, May 2014

Prepared with the approval of the Faculty of Mathematics and Natural Sciences of the Rheinische Friedrich-Wilhelms-Universität Bonn

First referee: Professor Dr. Jens Vygen
Second referee: Professor Dr. Stephan Held
Date of the doctoral examination: 11 July 2014
Year of publication: 2014

Acknowledgments

At this point, I would like to express my gratitude to my supervisors Professor Dr. Jens Vygen and Professor Dr. Stephan Held. This work would not have been possible without their extensive support, inspiration, and extraordinary patience.

A very special thanks goes to Professor Dr. Dr. h.c. Bernhard Korte for his encouraging support and for creating such an excellent working environment at the Research Institute for Discrete Mathematics at the University of Bonn.

I would further like to thank my past and present colleagues at the institute working with me on timing optimization, especially Dr. Jens Maßberg, Professor Dr. Dieter Rautenbach, Daniel Rotter, Dr. Christian Szegedy and Dr. Jürgen Werber. Their support and ideas proved to be invaluable. Special thanks go to all the students for their collaboration, especially Laura Geisen, Nicolas Kämmerling and Philipp Ochsendorf. I also thank all other colleagues at the institute for inspiring discussions, in particular Dr. Ulrich Brenner, Christian Panten, Jan Schneider and Dr. Markus Struzyna.

I am grateful that I have been able to work with all the people from IBM who shared their knowledge and hardest chip designs with us, especially Karsten Muuss, Dr. Matthias Ringe, and Alexander J. Suess.

But my biggest thanks go to my wife Kerstin and my little daughters Johanna and Barbara. Without their support and endless patience, I would have never finished this thesis.
Contents

1 Introduction
2 Timing Optimization – Basic Concepts
  2.1 Basic Notation
  2.2 Integrated Circuit Design
  2.3 Static Timing Analysis
  2.4 Repeater
  2.5 Wire Extraction
    2.5.1 Elmore Delay
    2.5.2 Higher Order Delay Models
  2.6 Slew Limit Propagation
  2.7 Required Arrival Time Functions
    2.7.1 Propagation of Required Arrival Times
3 Repeater Tree Problem
  3.1 Repeater Tree Instances
  3.2 The Repeater Tree Problem
    3.2.1 Repeater Tree Timing
    3.2.2 Feasible Solutions
    3.2.3 Objectives
  3.3 Our Repeater Tree Algorithm
4 Instance Preprocessing
  4.1 Analysis of Library and Wires
    4.1.1 Estimating Two-Pin Connections
    4.1.2 Parameter dwire
    4.1.3 Buffering Modes
    4.1.4 Slew Parameters
    4.1.5 Sinkdelay
    4.1.6 Further Preprocessing
  4.2 Blockage Map and Congestion Map
    4.2.1 Grid
    4.2.2 Blockage Map
    4.2.3 Blockage Grid
    4.2.4 Congestion Map
5 Topology Generation
  5.1 A Simple Delay Model
    5.1.1 Time Tree
  5.2 Repeater Tree Topology Problem
    5.2.1 Topology Algorithm Overview
  5.3 Restricted Repeater Tree Problem
  5.4 Sink Criticality
  5.5 A Simple Topology Generation Algorithm
    5.5.1 Topology Generation Algorithm
    5.5.2 Theoretical Properties
  5.6 Topology Generation Algorithm
    5.6.1 Handling High Fanout Trees
  5.7 Blockages
  5.8 Plane Assignment
  5.9 Global Wires as Topologies
6 Repeater Insertion
  6.1 Computing Required Arrival Time Targets
    6.1.1 Linear Time-Cost Tradeoff
    6.1.2 Effort Assignment Algorithm
  6.2 Repeater Insertion Algorithm
    6.2.1 Cluster
    6.2.2 Initialization
    6.2.3 Timing Model during Repeater Insertion
    6.2.4 Finding a New Repeater
    6.2.5 Buffering Algorithm
    6.2.6 Merging Operation
    6.2.7 Moving Operation
    6.2.8 Arriving at the Root
    6.2.9 Running Time
    6.2.10 Repeater Insertion – Summary
  6.3 Dynamic Programming
    6.3.1 Basic Dynamic Programming Approach
    6.3.2 Buffering Positions
    6.3.3 Extensions to Dynamic Programming
7 BonnRepeaterTree
  7.1 Repeater Library
    7.1.1 Repeater and Wire Analysis
    7.1.2 RAT and Slew Backwards Propagation
  7.2 Blockages and Congestion Map
  7.3 Processing Repeater Tree Instances
    7.3.1 Identifying Repeater Tree Instances
    7.3.2 Constructing Repeater Trees
    7.3.3 Replacing Repeater Tree Instances
  7.4 Implementation Overview
    7.4.1 Repeater Tree Construction Framework
    7.4.2 Repeater Tree API
    7.4.3 Parallelization
  7.5 BonnRepeaterTree in Global Timing Optimization
  7.6 BonnRepeaterTree Utilities
    7.6.1 Removing Existing Repeaters
    7.6.2 Postprocessing Repeater Chains
8 Experimental Results
  8.1 Comparison to an Industrial Tool
  8.2 Comparison to Bounds
    8.2.1 Running Time
    8.2.2 Wirelength
    8.2.3 Number of Inserted Inverters
    8.2.4 Timing
  8.3 Fast Buffering vs. Dynamic Programming
  8.4 Varying η
  8.5 Varying dnode
  8.6 Disabling Effort Assignment
  8.7 Disabling Parallel Mode
  8.8 Choosing Tradeoff Parameters
A Detailed Comparison Tables
B Bibliography

1 Introduction

We live in a world where computer chips can be found in nearly every device we come into contact with. On the one hand, there is a huge demand for powerful yet power-efficient chips that can be added to our wearables, houses, or household items. On the other hand, there is still a huge demand for fast processors that are able to crunch the enormous amounts of data we produce daily. Chip designers create the small miracles driving all of this in a process called physical design. It is an area where one of our most advanced technologies meets mathematics and computer science. Physical design offers many fascinating problems that can be tackled with methods from combinatorial optimization. This thesis focuses on one of the problems that arise during the optimization of the timing behaviour of a chip: the optimization of interconnections, or repeater trees.
Interconnections distribute signals from a source to one or several sinks. Figure 1.1 shows the interconnection between a source (blue) and four sinks (red). If interconnections get too long, the speed of signals degrades and electrical constraints are violated. To improve speed and to avoid electrical violations, repeaters can be inserted into a chip.

Figure 1.1: Example of an interconnection. A signal has to be distributed from a root (blue) to sinks (red).

There are two flavours of repeaters: buffers and inverters. Buffers are used to refresh a signal. Inverters have, in addition, the property that they change the polarity of a signal: a signal switching from a logical 0 to a 1 becomes a signal switching from 1 to 0, and vice-versa. The logical symbols of both repeater types and the schematic of an inverter are shown in Figure 1.2.

Figure 1.2: The symbols of inverters and buffers and a schematic of a simple inverter. It consists of two transistors with converse switching behaviour. GND (logical 0) and Vdd (logical 1) are power supplies. Both repeaters have a single input A and an output Z. If a 0 arrives, then only the gate to Vdd opens, and vice-versa. A buffer is usually constructed from two inverters in series.

An interconnection distributes the signals in a tree-like fashion from a source to the sinks via wires. A repeater is added to the interconnection by subdividing a wire segment and connecting the ends to the repeater's input and output. We call an interconnection that potentially has repeaters in between a repeater tree.

In the early years of physical design, interconnection optimization between gates was a minor task. Repeaters were only necessary for very long distances or very high fanouts. However, with the downscaling of technology, resulting in smaller gates and thinner wires, the ability of gates to drive long wires deteriorated, and more repeaters became necessary to meet timing requirements.
Saxena et al. (2003) predicted that at the 45 nm and 32 nm technology nodes, 35 % and 70 % of all circuits on typical designs would be repeaters. In the meantime, these technology nodes have arrived, and the numbers are slightly better than predicted: on 45 nm designs 25–40 %, on 32 nm designs 30–45 %, and on 22 nm designs 35–50 % of all circuits are repeaters. Nevertheless, if 30 % of all gates are repeaters, repeater tree construction is an important task. It affects all aspects of physical design in addition to timing and electrical correctness. For example, repeaters represent a significant number of circuits that have to be placed and later connected by routing. A significant part of the power consumption can also be attributed to repeaters.

Modern designs have a wide range of instances with up to hundreds of thousands of sinks. We have seen designs with millions of instances. For large instances, running time becomes a crucial feature of a repeater tree construction algorithm. The algorithms we present fit into all stages of physical design. A very fast algorithm that we call Fast Buffering can be used for rebuilding all instances globally in early and middle design stages. It can optimize around 5.7 million instances per hour on a single core, and within a couple of minutes when run in parallel. For later stages of physical design, a more accurate version of our algorithm can be enabled that is able to squeeze out the last tenth of a picosecond.

The structure of this thesis is outlined in the following paragraphs. The basic concepts from timing optimization in chip design that we use are explained in Chapter 2. We then define the Repeater Tree Problem in Chapter 3. Our problem formulation encapsulates most of the constraints that have been studied so far. To the best of our knowledge, we also consider several aspects for the first time, for example, slew-dependent required arrival times at repeater tree sinks.
These make our formulation more adequate to the challenges of real-world repeater tree construction.

For creating good repeater trees, one has to take the overall design environment into account. The employed technology, the properties of available repeaters and metal wires, the shape of the chip, the temperature, the voltages, and many other factors highly influence the results of repeater tree construction. To take all this into account, we first preprocess the environment to extract parameters for our algorithms. These parameters allow us to estimate the timing of a tree quickly and yet quite accurately before it has even been buffered. Chapter 4 shows how we extract them from the timing environment.

The next two chapters explain our algorithm to solve the Repeater Tree Problem. Chapter 5 shows how we construct an underlying Steiner tree for our solution. We prove that our algorithm is able to create timing-efficient as well as cost-efficient trees. Chapter 6 deals with the problem of adding buffers to a given Steiner tree. The predominantly used algorithms to solve this problem use dynamic programming. However, they have several drawbacks. Firstly, potential repeater positions along the Steiner tree have to be chosen upfront. Secondly, the algorithms strictly follow the given Steiner tree and miss optimization opportunities. Finally, dynamic programming causes high running times. We present our new buffer insertion algorithm that overcomes these limitations. It produces results of similar quality to a dynamic programming approach with a much better running time. In addition, we present our improvements to the dynamic programming approach that allow us to push the quality further at the expense of a higher running time.
As part of this thesis, we implemented the discussed algorithms as a module within the BonnTools optimization suite, which is developed at the Research Institute for Discrete Mathematics at the University of Bonn in an industrial cooperation with IBM. BonnTools are used by chip designers worldwide within IBM. Some implementation details are described in Chapter 7. Our algorithms are used by, and help, engineers dealing with some of the most complex chips in the world. In the same chapter, we also briefly describe the framework we have written that makes it easy to implement new repeater tree construction algorithms. As an example, we show how routing congestion on a chip can be reduced by rerouting repeater chains.

Our cooperation partner IBM provided us with a large number of real-world chip designs. For this thesis, we have chosen a set of twelve challenging chips with a total of more than 3.3 million different repeater tree instances. In Chapter 8, we present experimental results that show the quality and speed of our algorithms.

2 Timing Optimization – Basic Concepts

2.1 Basic Notation

We use the same notation for graphs as Korte and Vygen (2012). For an arborescence (i.e. a directed tree in which each node except for the root has exactly one entering edge) $T = (V, E)$ with nodes $v, w \in V$, we denote the edges on the path from $v$ to $w$ by $E_{[v,w]}$. The parent of node $v \in V$ is called $\mathrm{parent}(v)$.

For a set $X \subset \mathbb{R}^n$, we define the componentwise maximum
$$\max_{x \in X} x := \left( \max_{x \in X} x_1, \max_{x \in X} x_2, \dots, \max_{x \in X} x_n \right)$$
and the componentwise minimum
$$\min_{x \in X} x := \left( \min_{x \in X} x_1, \min_{x \in X} x_2, \dots, \min_{x \in X} x_n \right).$$

As current technologies prefer to route wires only in horizontal and vertical directions, we almost exclusively use the $\ell_1$-norm: $\|a\| := \|a\|_1$.

2.2 Integrated Circuit Design

The netlist of a chip design consists of primary pins (input or output), gates, and nets.
Gates are circuits computing small Boolean functions, for example NOT or AND, or macros encapsulating larger functionality. Gates have pins, input pins or output pins, as external connection points. Pins, i.e. primary pins and gate pins, are connected by nets. Typically, a net has a single source (a gate output pin or a primary input pin) and a set of sinks (gate input pins or primary output pins). If a gate's output pin is the source of a net, then we say that the gate is the driver of the net.

Physical design is the phase in the process of creating a chip where a netlist that has just been compiled from a hardware description language is mapped to a chip image, typically a rectangular area with free space for gates. During physical design, gates get placed on the chip image. Then, optimizations are performed without changing the logical function of the design to improve the timing behaviour. One such operation is rebuilding repeater trees. Finally, the pins of each net get connected with wires by a routing tool. Modern chip designs and technologies are so challenging that placement, timing optimization, and routing have to be considered together most of the time. An introduction to the optimization steps of modern physical design is given by Held (2008).

2.3 Static Timing Analysis

We use the concepts of static timing analysis (Hitchcock et al., 1982), which is based on the critical path method (Kelley and Walker, 1959). A thorough introduction to timing analysis is given by Sapatnekar (2004). In this thesis, only the following concepts are important.

The voltage at a given point of a chip compared to ground defines the logical state. The voltage Vdd represents a logical 1 and GND (ground) represents a 0. A signal is defined as the change of the voltage over time. A rising signal describes a change from GND to Vdd. On the other hand, a falling signal describes the change from Vdd to GND. We call this direction the signal's edge.
The possible edges are $r$ (rise) and $f$ (fall). We define the inversion of an edge as $f^{-1} := r$ and $r^{-1} := f$.

In static timing analysis, signals are measured at certain points of the design that are also called timing points. For most of this discussion, it is sufficient to restrict ourselves to gate pins and primary pins as timing points. In addition, some gates have internal timing points. We only have to consider them when we work with real chip designs (Chapter 7).

A signal is estimated by a piecewise linear function that is given by the arrival time and slew of the signal. Usually, the arrival time of a rising or falling signal is given as the time when the voltage change reaches 50 %. Similarly, the slew is given as the time between 10 % and 90 % of the voltage change.

Static timing analysis makes worst-case assumptions and computes at each measurement point an early and a late signal. A real signal will arrive after the early signal and before the late signal. At certain timing points, the arrival times of signals are compared to the arrival times of other signals or to design-specific constants. This imposes constraints on the signals. For example, signals are not allowed to be too fast (their early arrival time has to be high enough) or too slow (their late arrival time has to be small enough) when they arrive at registers¹ compared to clock signals. Repeaters are used to slow down signals that are too fast, and they are also used to speed up signals. In this thesis, we only consider the problem of speeding up signals to meet requirements on the late arrival time. As we are not interested in the early arrival times of signals, we will ignore them for the remainder of this thesis and only work with late signals.

Transistors show different characteristics for rising and falling signals depending on their size or the technology. Gates show asymmetric behaviour depending on the edge of the incoming signal.
Therefore, timing analysis computes the arrival times and slews for both signal edges separately and stores them in time pairs, one value for each signal edge.

Definition 1. A time pair is a tuple $(rise, fall) \in \mathbb{R}^2$ of time values. For a given time pair $t = (rise, fall)$ we define $t^r := rise$ and $t^f := fall$.

¹Registers are gates that store information between clock cycles.

At each timing point, a signal is given by an arrival time time pair (we often write arrival time pair) and a slew time pair (we often write slew pair): $(at^r, at^f)$, $(slew^r, slew^f)$.

The measurement points are connected by directed propagation arcs. Timing nodes and propagation arcs form the timing graph. For a net, there is a propagation arc for each sink pin $p$, connecting the source of the net to $p$. For gates, there are technology-dependent rules specifying which internal timing points and arcs are added to the timing graph. Propagation arcs are also called propagation segments. Repeaters normally have a single propagation arc from their input pin to their output pin. The propagation arc within an inverter is an inverting propagation arc; a rising (falling) signal edge at the input becomes a falling (rising) signal edge at the output. Buffers have, similar to nets, a non-inverting propagation arc; a rising (falling) signal edge at the input becomes a rising (falling) signal edge at the output. The difference between the arrival time of a signal at the head of a propagation arc and the arrival time at the tail is called delay. The computer program that computes arrival times and slews for all signals on all timing nodes is called the timing engine.

2.4 Repeater

A repeater $t$ is characterized by
• its logical function (buffers implement the identity function, inverters implement negation),
• an input pin $t_a$ and an output pin $t_z$ with pin capacitances $cap_{in}(t)$ and $cap_{out}(t)$,
• its delay function $delay_t$,
• its slew function $slew_t$, and
• its leakage power consumption $pwr(t)$.
The logical function determines whether a signal propagating through a repeater is inverted or not. Inverters change, according to their propagation arcs, a rising signal into a falling signal and vice-versa. Buffers do not change the edge of a signal. Given a signal edge $* \in \{r, f\}$, we define
$$t(*) := \begin{cases} *^{-1} & \text{if } t \text{ is an inverter} \\ * & \text{if } t \text{ is a buffer.} \end{cases}$$

Each repeater is the driver of the net connected to its output pin. The sum of pin and wire capacitances visible from the output pin, including $cap_{out}(t)$, is called load capacitance or load.

The functions $delay_t$ and $slew_t$ are called timing functions. The delay function computes the delay over the internal propagation arc of a repeater. It depends on the load at the output pin and the slew pair at the input pin. The function is given by
$$delay_t : [0, loadlim(t)] \times [0, slewlim(t)]^2 \to \mathbb{R}^2$$
$$delay_t(l, s) := \left( delay_t^r(l, s^r),\; delay_t^f(l, s^f) \right)$$
where, for each signal edge $* \in \{r, f\}$, there is a function
$$delay_t^* : [0, loadlim(t)] \times [0, slewlim(t)] \to \mathbb{R}.$$

As inverters change the signal edge, we have to distinguish between inverters and buffers when we want to add delays to a given arrival time pair. For a time pair $a$, load $l$ and slew pair $s$, we use the function
$$update_t : \mathbb{R}^2 \times [0, loadlim(t)] \times [0, slewlim(t)]^2 \to \mathbb{R}^2$$
$$update_t(a, l, s) := \begin{cases} \left( a^f + delay_t^f(l, s^f),\; a^r + delay_t^r(l, s^r) \right) & \text{if } t \text{ is an inverter} \\ \left( a^r + delay_t^r(l, s^r),\; a^f + delay_t^f(l, s^f) \right) & \text{if } t \text{ is a buffer.} \end{cases}$$

Similarly, the slew function determines the slew at the output pin of the repeater depending on the slew pair at the input pin and the load at the output pin. It is given by
$$slew_t : [0, loadlim(t)] \times [0, slewlim(t)]^2 \to \mathbb{R}^2_{\geq 0}$$
$$slew_t(l, s) := \begin{cases} \left( slew_t^f(l, s^f),\; slew_t^r(l, s^r) \right) & \text{if } t \text{ is an inverter} \\ \left( slew_t^r(l, s^r),\; slew_t^f(l, s^f) \right) & \text{if } t \text{ is a buffer,} \end{cases}$$
and, for each input signal edge $* \in \{r, f\}$, there is the function
$$slew_t^* : [0, loadlim(t)] \times [0, slewlim(t)] \to \mathbb{R}_{\geq 0}$$
computing the slew for the output signal edge $t(*)$.
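To make the edge-swapping in $update_t$ and $slew_t$ concrete, here is a minimal Python sketch. The linear delay and slew models and all coefficients are made-up placeholders, not the timing rules of any real library; only the swapping of the rise and fall components for inverters reflects the definitions above.

```python
# Illustrative sketch (not the thesis implementation) of evaluating a
# repeater's update_t and slew_t functions on (rise, fall) time pairs.
from typing import NamedTuple

class TimePair(NamedTuple):
    rise: float
    fall: float

class Repeater:
    def __init__(self, is_inverter: bool):
        self.is_inverter = is_inverter

    # Placeholder edge-specific timing functions delay_t^*(load, slew) and
    # slew_t^*(load, slew); real ones come from the library's timing rules.
    def delay_edge(self, edge: str, load: float, slew: float) -> float:
        base = 5.0 if edge == "r" else 6.0   # invented coefficients
        return base + 2.0 * load + 0.1 * slew

    def slew_edge(self, edge: str, load: float, slew: float) -> float:
        base = 1.0 if edge == "r" else 1.5   # invented coefficients
        return base + 1.5 * load + 0.2 * slew

    def update(self, at: TimePair, load: float, slew: TimePair) -> TimePair:
        # Inverters swap edges: the new rise arrival time comes from the
        # falling input signal, and vice-versa.
        if self.is_inverter:
            return TimePair(rise=at.fall + self.delay_edge("f", load, slew.fall),
                            fall=at.rise + self.delay_edge("r", load, slew.rise))
        return TimePair(rise=at.rise + self.delay_edge("r", load, slew.rise),
                        fall=at.fall + self.delay_edge("f", load, slew.fall))

    def output_slew(self, load: float, slew: TimePair) -> TimePair:
        if self.is_inverter:
            return TimePair(rise=self.slew_edge("f", load, slew.fall),
                            fall=self.slew_edge("r", load, slew.rise))
        return TimePair(rise=self.slew_edge("r", load, slew.rise),
                        fall=self.slew_edge("f", load, slew.fall))
```

Note that for an inverter the rise component of the result is exactly the fall component of the corresponding buffer result and vice-versa, which is the content of the case distinction in $update_t$ and $slew_t$.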
Each repeater $t$ has a pair $slewlim(t)$ of limits on the maximum rising respectively falling slew, associated with the input pin, and a maximum output load limit $loadlim(t)$, associated with the output pin. Both define the domain of the timing functions. There are also tiny lower limits on the possible loads and slews in the domains of the delay and slew functions. However, they are not relevant in practice: the minimum load limit can only be violated by unconnected pins, and the lower slew limits are so steep that it is not possible to reach them within the corresponding technology. We therefore ignore them and use 0 as lower limits. The delay and slew functions as well as the pin capacitances and limits are called timing rules.

Especially in the early design stages, it is often not possible to keep the slews and loads within the domain of the delay and slew functions. Such a condition is called an electrical violation. The timing engine extrapolates the timing functions in such a case to work with somewhat reasonable values.

Generally, we do not make many restricting assumptions on the timing functions of a repeater. There might be small deviations from the following, but we can assume that both functions are strictly monotonically increasing in each input. For example, if for a fixed load the input slews are increased, then the delays and output slews also increase.

Figure 2.1: Delays and slews of an example repeater depending on the input slew and load. Figures (a) and (c) show the results of $slew_t^r$ and $delay_t^r$ for fixed input slew and different load capacitances. Figures (b) and (d) show the results of $slew_t^r$ and $delay_t^r$ for fixed load capacitance but different input slews. In all cases, the rise-fall transition of an inverter is shown.

We use the leakage power consumption of repeaters as costs.
Repeaters also cause dynamic power consumption, which depends on the switching behaviour of the circuit. Circuits that switch often consume more power than circuits that keep their state. The power consumption also depends on the capacitance of the capacitors that change voltage. The dynamic power consumption is roughly linear in the total capacitance of a circuit. As all circuits in a repeater tree show the same switching patterns, we can basically expect that shorter trees use less dynamic power.

2.5 Wire Extraction

Given a net, we are interested in the delay between the source of the net and each sink. The process of computing the delays and slew changes is called wire extraction. There are several different approaches to model the timing behaviour of nets. The most dominant one is shown in the next section.

2.5.1 Elmore Delay

The most commonly used delay model for nets in works on interconnection optimization is the Elmore delay, because it is easy to calculate and gives good results compared to earlier approaches². Elmore (1948) proposed to estimate the delay of a monotonic step response of a circuit by the mean of the impulse response. It can be shown that the Elmore delay is an upper bound on the actual 50 % delay to a sink. Rubinstein et al. (1983) showed how to compute the Elmore delay on an RC-tree in linear time.

An RC-tree describes the physical properties of the wires of a net. A wire segment $e$ is modeled using the $\pi$-model, that is, a resistance $r_e$ between two capacitors of capacitance $c_e/2$. Here, $r_e$ is the resistance of the wiring segment and $c_e$ is its capacitance.

At the first design stages, there are often no wires that can be used for RC-tree generation. To estimate the delay over a net, it is common to compute a Steiner tree $S$ and use it as the RC-tree, with default resistances and capacitances for the edges of the Steiner tree.
We assume that the tree is oriented away from the input pin of the net, that each vertex is assigned a position in the plane, and that the edges are embedded along shortest paths between their nodes. On a path $E(S)_{[a,b]}$ from vertex $a$ to vertex $b$, the Elmore delay $rc_{(a,b)}$ is calculated as
$$rc_{(a,b)} = \sum_{e \in E(S)_{[a,b]}} r_e \left( \frac{c_e}{2} + downcap(e) \right).$$
The downward capacitance $downcap(e)$ is the sum of all wiring and pin capacitances reachable from $e$ in the oriented tree. Both $r_e$ and $c_e$ are proportional to the length of edge $e$. The resulting $rc$ value is therefore quadratic in the length.

The Elmore delay approximates the response of a net to a step excitation. It is used as a raw value, often called RC-delay, that is merged with environmental variables into a delay function and a slew function that compute the response of the net to a skewed input signal. We assume that the timing engine provides a wiredelay function
$$wiredelay : \mathbb{R} \times \mathbb{R}_{\geq 0} \to \mathbb{R}$$
and a wireslew function
$$wireslew : \mathbb{R} \times \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}.$$
Given a wire segment with Elmore delay $rc$ and a slew $s_{in}$ at the segment's start, $wiredelay(rc, s_{in})$ computes the delay over the wire and $wireslew(rc, s_{in})$ computes the slew at the end. Both functions are sometimes linear, but often they cannot be calculated by a simple expression. The functions are independent of the signal edge.

To simplify the calculation of delays and slews for time pairs, we define the combinations wiredelay and wireslew on slew pairs:
$$wiredelay : \mathbb{R} \times \mathbb{R}^2_{\geq 0} \to \mathbb{R}^2$$
$$wiredelay(rc, (slew^r, slew^f)) = \left( wiredelay(rc, slew^r),\; wiredelay(rc, slew^f) \right)$$
$$wireslew : \mathbb{R} \times \mathbb{R}^2_{\geq 0} \to \mathbb{R}^2_{\geq 0}$$
$$wireslew(rc, (slew^r, slew^f)) = \left( wireslew(rc, slew^r),\; wireslew(rc, slew^f) \right).$$

²See Pileggi (1995) for a description of some earlier models.

The Elmore delay has the property that subdividing edges of the tree at arbitrary nodes does not change the result. For example, if we have an edge $(a, b)$ and split it with a node $c$, then $rc_{(a,b)} = rc_{(a,c)} + rc_{(c,b)}$.
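The computation behind the formula above can be sketched in two passes, in the spirit of the linear-time algorithm of Rubinstein et al.: a bottom-up pass accumulates the downward capacitances, and a top-down pass accumulates the $rc$ values. The tree representation and the function name `elmore_delays` below are our own illustrative choices, not code from the thesis.

```python
# Minimal sketch of Elmore delay computation on an oriented RC tree.
def elmore_delays(children, r, c, pincap, root):
    """children: dict node -> list of child nodes (tree oriented away from root)
    r, c: dicts mapping edge (parent, child) -> resistance / wire capacitance
    pincap: dict node -> pin capacitance attached at that node
    Returns a dict node -> Elmore delay rc from the root to that node."""
    # Bottom-up pass: downcap[v] is the sum of all wire and pin
    # capacitances reachable below node v.
    downcap = {}
    def subtree_cap(v):
        total = pincap.get(v, 0.0)
        for w in children.get(v, []):
            total += c[(v, w)] + subtree_cap(w)
        downcap[v] = total
        return total
    subtree_cap(root)

    # Top-down pass: each edge e = (v, w) contributes
    # r_e * (c_e / 2 + downcap below e), where c_e / 2 is the far-end
    # capacitor of the pi-model and downcap below e equals downcap[w].
    rc = {root: 0.0}
    stack = [root]
    while stack:
        v = stack.pop()
        for w in children.get(v, []):
            rc[w] = rc[v] + r[(v, w)] * (c[(v, w)] / 2.0 + downcap[w])
            stack.append(w)
    return rc
```

On a chain root–a–b with unit resistances, wire capacitances of 2, and a sink pin capacitance of 1 at b, the first edge sees half its own capacitance plus everything below a, reproducing the additivity $rc_{(root,b)} = rc_{(root,a)} + rc_{(a,b)}$ noted above.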
However, we do not assume that this additivity holds for $wiredelay$ or $wireslew$. The following equations are not necessarily true for an input slew $s_{in}$:
$$wiredelay(rc_{(a,b)}, s_{in}) = wiredelay(rc_{(a,c)}, s_{in}) + wiredelay(rc_{(c,b)}, wireslew(rc_{(a,c)}, s_{in}))$$
$$wireslew(rc_{(a,b)}, s_{in}) = wireslew(rc_{(c,b)}, wireslew(rc_{(a,c)}, s_{in})).$$
This means that in general we cannot compute delays and slews segment by segment and just stitch them together. Instead, we have to calculate the Elmore delay for the whole net first before we ask for the total delays and output slews by using the black-box functions $wiredelay$ and $wireslew$.

2.5.2 Higher Order Delay Models

The Elmore delay is popular for timing optimization because of its simplicity. However, it is too pessimistic for some applications. Timing engines typically support more accurate delay models in addition to the Elmore delay. While we focus on the Elmore delay in our discussion, the operations where we extract a net and compute the delay and slew degradation are independent of the delay model. With small modifications and by using black-box functions that return delays and slews for a given net, it would be possible to create versions of our algorithm that work with more accurate delay models. Until now, we have refrained from doing so because we expect only a small improvement in the quality of our solutions, which would be paid for by a drastically increased running time. A description of higher order delay models can be found in Sapatnekar (2004) or Alpert et al. (2008), p. 546ff.

2.6 Slew Limit Propagation

In the context of repeater trees, only input pins of gates have slew limits. However, if one has to choose a gate that drives a net, one often wants to know the maximum slew that may arrive at the source of the net such that the limit is violated at no sink of the net. We assume that there exists a function
$$slewinv : \mathbb{R} \times \mathbb{R}^2_{\geq 0} \to \mathbb{R}^2_{\geq 0}$$
that computes the maximum slew limit at net sources.
For a sink s with a slew limit pair slewlim(s) and RC-delay rc_s, the pair of maximum slews that are allowed at the source of the net such that the sink's limits are not violated is slewinv(rc_s, slewlim(s)). Sometimes it is not possible to obey a sink's slew limit because it is too tight or the net is too long. In such cases, we assume that slewinv returns 0 for the corresponding signal edge. For a net with sink set S, the maximum allowable slews at the source are the componentwise minimum over all sinks:

\[ \mathrm{slewlim} := \min_{s \in S} \{ \mathrm{slewinv}(rc_s, \mathrm{slewlim}(s)) \}. \]

We ask the same question for each repeater: what is the highest slew that may arrive at the input pin such that the output slew stays below a certain limit for a given load capacitance? We assume that for each repeater t a function slewinv_t : ℝ≥0 × ℝ²≥0 → ℝ²≥0 is given. For a slew limit pair slewlim at the output pin and a load capacitance load, the pair of highest allowable slews at the input pin is slewinv_t(load, slewlim).

2.7 Required Arrival Time Functions

As indicated above, there are constraints on the arrival times. If a signal is not allowed to arrive too late at a timing point, the latest feasible arrival time is called the required arrival time (RAT). We define required arrival times for all timing nodes, even if they have no direct arrival time constraints. For a timing point v, the required arrival time of a signal is the latest arrival time such that the arrival time constraints are met at all timing nodes reachable from v. As the delays of subsequent propagation segments depend on the slew at v, the required arrival time is a function of the slews. For each timing point, there is a RAT function

\[ \mathrm{rat} : \mathbb{R}^2_{\ge 0} \to \mathbb{R}^2, \qquad \mathrm{rat}(\mathrm{slew}) = (\mathrm{rat}^r(\mathrm{slew}^r), \mathrm{rat}^f(\mathrm{slew}^f)) \]

with rat^r, rat^f being edge-specific RAT functions. For a signal edge ∗ ∈ {r, f}, we have rat^∗ : ℝ≥0 → ℝ.
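The source slew limit computation of Section 2.6 can be sketched as follows. The root-sum-square wireslew model and the degradation factor K are illustrative assumptions; the thesis treats wireslew and slewinv as black boxes supplied by the timing engine.

```python
import math

K = 0.8  # hypothetical wire slew degradation factor

def wireslew(rc, s_in):
    """Assumed root-sum-square slew degradation over a wire."""
    return math.hypot(s_in, K * rc)

def slewinv(rc, limit):
    """Largest source slew s with wireslew(rc, s) <= limit; 0 if impossible."""
    rest = limit * limit - (K * rc) ** 2
    return math.sqrt(rest) if rest > 0.0 else 0.0

def source_slewlim(sinks):
    """Componentwise minimum over all sinks.
    sinks: list of (rc_s, (limit_rise, limit_fall))."""
    rise = min(slewinv(rc, lim[0]) for rc, lim in sinks)
    fall = min(slewinv(rc, lim[1]) for rc, lim in sinks)
    return (rise, fall)
```

Under this model, slewinv is the exact inverse of wireslew in its first argument, so a source slew at the returned limit produces exactly the sink limit at the end of the wire.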
Given arrival time a^∗ and slew s^∗ for signal edge ∗ ∈ {r, f}, the required arrival time constraint at a point is feasible if a^∗ ≤ rat^∗(s^∗). The slack at the point is defined as σ^∗ := rat^∗(s^∗) − a^∗. The timing of a netlist is clean if the slacks are non-negative for all constraints on timing points.

2.7.1 Propagation of Required Arrival Times

Given a net, the RAT function at its source r depends on the net topology and the RAT functions at the sinks. Given a sink s with RAT function rat_s, we can ask for a function rat_r such that arrival times and slews are feasible at s if they are feasible at r. Delay and slew over a wire segment only depend on the Elmore delay and the input slew. We therefore assume there is a function

\[ \mathrm{ratinv} : \mathbb{R} \times (\mathbb{R}^{\mathbb{R}_{\ge 0}} \times \mathbb{R}^{\mathbb{R}_{\ge 0}}) \to \mathbb{R}^{\mathbb{R}_{\ge 0}} \times \mathbb{R}^{\mathbb{R}_{\ge 0}}, \qquad \mathrm{ratinv}(rc, (\mathrm{rat}^r, \mathrm{rat}^f)) = (\mathrm{ratinv}^r(rc, \mathrm{rat}^r), \mathrm{ratinv}^f(rc, \mathrm{rat}^f)) \]

and for each edge ∗ ∈ {r, f} a function ratinv^∗ : ℝ × ℝ^{ℝ≥0} → ℝ^{ℝ≥0} such that, for every slew slew^∗ and Elmore delay rc_s,

\[ \mathrm{ratinv}^∗(rc_s, \mathrm{rat}^∗_s)(\mathrm{slew}^∗) = \mathrm{rat}^∗_s(\mathrm{wireslew}(rc_s, \mathrm{slew}^∗)) - \mathrm{wiredelay}(rc_s, \mathrm{slew}^∗). \]

At the source of multi-pin nets, we have to take the minimum over all RAT functions coming from the sinks to be sure that arrival times are feasible at all sinks. The minimum is the function

\[ \mathrm{rat}_r := \min_{s \in S} \mathrm{ratinv}(rc_s, \mathrm{rat}_s) \]

such that for each edge ∗ ∈ {r, f} and all slews slew ∈ ℝ≥0

\[ \mathrm{rat}^∗_r(\mathrm{slew}) = \min_{s \in S} \mathrm{ratinv}^∗(rc_s, \mathrm{rat}_s)(\mathrm{slew}). \]

If we have to compute the minimum over a set of RAT functions, the result is the lower contour of the input RAT functions. In practice, however, we always approximate the lower contour by a linear function.

Similarly, we assume that for each repeater t there is a function ratinv_t giving us the RAT function at the input pin for a RAT function at the output pin. Assume that t drives a net with load capacitance load and that the RAT function at the output pin is rat.
We then have a function

\[ \mathrm{ratinv}_t : \mathbb{R}_{\ge 0} \times (\mathbb{R}^{\mathbb{R}_{\ge 0}} \times \mathbb{R}^{\mathbb{R}_{\ge 0}}) \to \mathbb{R}^{\mathbb{R}_{\ge 0}} \times \mathbb{R}^{\mathbb{R}_{\ge 0}}, \qquad \mathrm{ratinv}_t(\mathrm{load}, (\mathrm{rat}^r, \mathrm{rat}^f)) = (\mathrm{ratinv}^r_t(\mathrm{load}, \mathrm{rat}^{t(r)}), \mathrm{ratinv}^f_t(\mathrm{load}, \mathrm{rat}^{t(f)})) \]

and for each edge ∗ ∈ {r, f} a function ratinv^∗_t : ℝ × ℝ^{ℝ≥0} → ℝ^{ℝ≥0} such that, for every slew pair slew and load capacitance load,

\[ \mathrm{ratinv}^∗_t(\mathrm{load}, \mathrm{rat}^∗_s)(\mathrm{slew}^∗) = \mathrm{rat}^{t(∗)}_s(\mathrm{slew}^∗_t(\mathrm{load}, \mathrm{slew}^∗)) - \mathrm{delay}^∗_t(\mathrm{load}, \mathrm{slew}^∗). \]

Note that for inverters the rise RAT function at the input pin is determined by the fall RAT function at the output pin and vice versa.

3 Repeater Tree Problem

3.1 Repeater Tree Instances

An instance of the Repeater Tree Problem consists of

• a root r, its location in the plane Pl(r) ∈ ℝ², a root arrival time function at_r : ℝ₊ → ℝ², a root slew function slew_r : ℝ₊ → ℝ², a pin capacitance cap_out, and a capacitance limit loadlim(r),
• a set S of sinks, and for each sink s ∈ S its parity par(s) ∈ {+, −}, its location in the plane Pl(s) ∈ ℝ², input capacitance cap_in(s), a RAT function rat_s, and a pair of slew limits slewlim(s),
• a set L of repeaters with timing rules delay_t, slew_t, loadlim(t), slewlim(t), cap_in(t) and cap_out(t) for each repeater t ∈ L,
• a set A of rectangles defining blocked areas,
• a global routing graph,
• a set W of wiring modes, and
• timing functions wiredelay and wireslew.

Root The root r is typically an output pin of a circuit or a primary input pin of the netlist. The root arrival time (resp. slew) function computes a time pair of arrival times (resp. slews) at the root pin for a given load capacitance.

Sinks Sinks are usually primary outputs or input pins of circuits that are not repeaters. The parity determines how many inverting repeaters are required on root-sink paths. The number of inversions on the path from the root to a sink must be even (odd) if the sink has parity + (−). We say a sink is positive (negative) if it has parity + (−). Most formulations of the Repeater Tree Problem assume fixed required arrival times at the sinks.
In practice, required arrival times depend on the slews that arrive at the pins.¹ Higher slews cause higher delays in the following stages and reduce the required arrival times. The first propagation segment after a sink has the highest impact on the delays introduced by higher slews. If the sink is a circuit, its delay function often shows nearly linear behaviour that can be captured by a linear function. Effects on subsequent propagation segments are much smaller and can be neglected. For sinks that are primary outputs or other nodes where required arrival times are created, the RAT function is often constant. The slew limit is determined by the timing rules of the sink and by global parameters.

Blockages and Congestion Information about areas of the design that are blocked for repeater insertion is given in a blockage map. Basically, the blockage map is a set of rectangles. Similarly, congestion on the wiring layers is passed to the repeater tree routine via a global routing graph. Both data structures are described in Section 4.2.

Wiring Modes We assume to have a fixed number of wiring modes, each of which corresponds to a type of wire we can route on a plane. A wiring mode w is a 4-tuple (p, width(w), wirecap(w), wireres(w)) consisting of

• a routing plane p,
• a routing space consumption width(w),
• a capacitance per unit length wirecap(w), and
• a resistance per unit length wireres(w).

The routing space consumption depends on the wire width and the necessary spacing to neighboring wires; it is used to update the congestion map. Usually, there is one wiring mode per usable plane of a design. Given a wire segment ws with mode w, the total capacitance and resistance of the wire are linear in its length l(ws):

\[ \mathrm{cap}(ws) := \mathrm{wirecap}(w) \cdot l(ws), \qquad \mathrm{res}(ws) := \mathrm{wireres}(w) \cdot l(ws). \]

There are two default wiring modes, one on a horizontal layer, w*_h, and one on a vertical layer, w*_v, that are used for the bulk of the wires in the design.

¹ See Section 2.7.
Typically, they are the least expensive wiring modes in terms of routing space consumption and the most expensive in terms of delay. As routers are free to choose higher planes than the plane of the assigned wiring mode, and as higher planes typically mean better timing, the assignment of the default wiring modes is a pessimistic choice.

Timing Rules For each repeater t, the functions delay_t and slew_t are given together with the corresponding electrical limits loadlim(t) and slewlim(t). The capacitance of its input pin is cap_in(t), and that of its output pin is cap_out(t). For nets, wiredelay is the timing rule that computes the delay of the timing engine for a given RC-delay and input slew. Similarly, wireslew computes the slew for a given RC-delay and input slew.

3.2 The Repeater Tree Problem

The Repeater Tree Problem is the task of computing a repeater tree for a given repeater tree instance. We first define what a repeater tree is:

Definition 2. A repeater tree R for a Repeater Tree Problem instance is a tuple (T, Pl, R_t, R_W). It consists of

• an arborescence T = (V(T), E(T)) with V(T) = {r} ∪̇ S ∪̇ I_r ∪̇ I_s rooted at r, with leaves S and inner nodes I_r ∪ I_s (nodes corresponding to repeaters and Steiner nodes),
• an embedding of the nodes into the plane Pl : V(T) → ℝ²,
• a repeater assignment function R_t : V(T) → L ∪ {∅} with R_t(v) ∈ L iff v ∈ I_r, and
• a wiring mode assignment function R_W : E(T) → W.

The root and leaves of a repeater tree are the root and sinks of the corresponding repeater tree instance. When a repeater tree is inserted into a chip design, for each node v ∈ I_r, a new repeater gate of type R_t(v) is created and placed at Pl(v). Root, sinks, and new repeaters are connected with nets that consist of the edges between the corresponding nodes. The nodes I_s and their incident edges determine the topology of the nets. There is exactly one net for each node in {r} ∪ I_r.
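A minimal in-memory representation of Definition 2 might look as follows; the class and field names are illustrative stand-ins, not the data structures of the actual implementation. The `nets` method recovers the one-net-per-driver decomposition described above by walking through Steiner nodes until it hits a repeater input or an instance sink.

```python
from dataclasses import dataclass, field

@dataclass
class RepeaterTree:
    """Arborescence T with embedding Pl, repeater assignment Rt,
    and wiring mode assignment RW (cf. Definition 2)."""
    root: str
    children: dict = field(default_factory=dict)   # v -> list of child nodes
    pl: dict = field(default_factory=dict)         # v -> (x, y) position
    rt: dict = field(default_factory=dict)         # v in Ir -> repeater type
    rw: dict = field(default_factory=dict)         # edge (v, w) -> wiring mode

    def nets(self):
        """Yield (driver, net sinks) for every node in {r} + Ir:
        traverse through Steiner nodes until the next repeater or sink."""
        for v in [self.root] + sorted(self.rt):
            stack, sinks = list(self.children.get(v, [])), []
            while stack:
                w = stack.pop()
                if w in self.rt or not self.children.get(w):
                    sinks.append(w)                # repeater input or leaf
                else:
                    stack.extend(self.children[w]) # Steiner node: continue
            yield v, sinks
```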
3.2.1 Repeater Tree Timing

Given a repeater tree R = (T, Pl, R_t, R_W) for a repeater tree instance, we have to compute the arrival times and slews at the sinks to compare them with the required arrival times. To achieve this, we have to extract the nets between the involved pins and compute delays over repeaters and nets.

For each node v ∈ V(T), let T_v be the maximal subtree rooted at v such that all its inner nodes are in I_s. We say T_v is a net iff v ∈ {r} ∪ I_r. The sinks of the net are the leaves of T_v. Given a sink a ∈ V(T_v) of a net rooted at v, let p(a) := v be the root node of the net. On the one hand, each repeater node v ∈ I_r is the root of the net T_v; on the other hand, it is a sink of the net T_{p(v)}.

We first compute the Elmore delay to each net's sinks. Given an edge (v, w), its capacitance and resistance are

\[ \mathrm{cap}((v,w)) := \mathrm{wirecap}(R_W((v,w))) \cdot \lVert Pl(v) - Pl(w) \rVert, \qquad \mathrm{res}((v,w)) := \mathrm{wireres}(R_W((v,w))) \cdot \lVert Pl(v) - Pl(w) \rVert. \]

Given an edge (w, y) ∈ E(T_v), the capacitance visible downwards is the sum of all edge capacitances in the subtree rooted at y and the input pin capacitances of the reachable sinks:

\[ \mathrm{downcap}((w,y)) := \sum_{(a,b) \in E(T_y)} \mathrm{cap}((a,b)) + \sum_{\substack{a \in V(T_y) \\ \delta^+(a) = \emptyset}} \mathrm{cap}_{in}(R(a)). \]

For the root node and the internal repeaters, the load capacitance is the sum of the visible capacitances:

\[ \mathrm{load} : I_r \cup \{r\} \to \mathbb{R}, \qquad \mathrm{load}(x) = \mathrm{cap}_{out}(x) + \sum_{e \in \delta^+(x)} \mathrm{downcap}(e). \]

Now we can compute the Elmore delay rc for each sink:

\[ \mathrm{rc} : I_r \cup S \to \mathbb{R}, \qquad \mathrm{rc}(x) = \sum_{e \in E(T_{p(x)})[p(x),x]} \mathrm{res}(e) \left( \frac{\mathrm{cap}(e)}{2} + \mathrm{downcap}(e) \right). \]

The next step is to propagate the slews and the arrival times from the root to the instance sinks. We define the slews recursively, distinguishing between the input slew, slew_i, for input pins and the output slew, slew_o, for output pins:

\[ \mathrm{slew}_o : I_r \cup \{r\} \to \mathbb{R}^2, \qquad \mathrm{slew}_o(v) = \begin{cases} \mathrm{slew}_r(\mathrm{load}(v)) & v = r \\ \mathrm{slew}_{R(v)}(\mathrm{load}(v), \mathrm{slew}_i(v)) & v \ne r \end{cases} \]

\[ \mathrm{slew}_i : I_r \cup S \to \mathbb{R}^2, \qquad \mathrm{slew}_i(v) = \mathrm{wireslew}(\mathrm{rc}(v), \mathrm{slew}_o(p(v))). \]
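The slew recursion above can be sketched as a simple traversal from the root. The flat net representation (each driver mapped to its sinks with precomputed Elmore delays and loads) and the linear example functions are illustrative assumptions, not the thesis implementation; a real timing engine would supply wireslew and the gate slew functions.

```python
# Sketch of the slew recursion:
#   slew_o(r)      = slew_r(load(r))
#   slew_o(v), v!=r = slew_{R(v)}(load(v), slew_i(v))
#   slew_i(v)      = wireslew(rc(v), slew_o(p(v)))
# Hypothetical format: nets[driver] = [(sink, rc_to_sink), ...].

def propagate_slews(nets, root, root_slew, gate_slew, load, wireslew):
    """Return the input slews slew_i for every net sink."""
    slew_i, slew_o = {}, {root: root_slew(load[root])}
    stack = [root]
    while stack:
        v = stack.pop()
        for sink, rc in nets.get(v, []):
            slew_i[sink] = wireslew(rc, slew_o[v])
            if sink in nets:                 # sink drives a net: a repeater
                slew_o[sink] = gate_slew(load[sink], slew_i[sink])
                stack.append(sink)
    return slew_i
```

Arrival times follow the same traversal pattern, replacing the slew functions by the delay functions.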
In a similar way, the arrival times at input pins (at_i) and output pins (at_o) are defined as

\[ \mathrm{at}_o : I_r \cup \{r\} \to \mathbb{R}^2, \qquad \mathrm{at}_o(v) = \begin{cases} \mathrm{at}_r(\mathrm{load}(v)) & v = r \\ \mathrm{update}_{R(v)}(\mathrm{at}_i(v), \mathrm{load}(v), \mathrm{slew}_i(v)) & v \ne r \end{cases} \]

\[ \mathrm{at}_i : I_r \cup S \to \mathbb{R}^2, \qquad \mathrm{at}_i(v) = \mathrm{at}_o(p(v)) + \mathrm{wiredelay}(\mathrm{rc}(v), \mathrm{slew}_o(p(v))). \]

The slack of the repeater tree is now

\[ \mathrm{slack}(T) := \min_{s \in S} \min \{ \mathrm{rat}^r_s(\mathrm{slew}^r_i(s)) - \mathrm{at}^r_i(s),\ \mathrm{rat}^f_s(\mathrm{slew}^f_i(s)) - \mathrm{at}^f_i(s) \}. \]

We also define the static power consumption of the tree

\[ \mathrm{power}(T) := \sum_{v \in I_r} \mathrm{pwr}(R(v)) \]

and its length

\[ \mathrm{length}(T) := \sum_{(v,w) \in E(T)} \lVert Pl(v) - Pl(w) \rVert. \]

The length roughly correlates with the dynamic power consumption of a repeater tree.

3.2.2 Feasible solutions

A repeater tree is feasible if the internal nodes I_r are legally placed and connected in such a way that the signals arrive at each sink with the correct parity. Repeaters are placed legally if their positions are not marked as blocked in the blockage map:

\[ Pl(v) \notin A \quad \forall v \in I_r. \]

Note that we ignore overlaps between repeaters and other gates. Furthermore, we have to obey capacitance and slew limits everywhere:

\[ \mathrm{load}(r) \le \mathrm{loadlim}(r) \]
\[ \mathrm{load}(v) \le \mathrm{loadlim}(R(v)) \quad \forall v \in I_r \]
\[ \mathrm{slew}_i(v) \le \mathrm{slewlim}(R(v)) \quad \forall v \in I_r \]
\[ \mathrm{slew}_i(s) \le \mathrm{slewlim}(s) \quad \forall s \in S. \]

The timing of the repeater tree is feasible if slack(T) ≥ 0.

3.2.3 Objectives

Among the repeater trees satisfying the above conditions, one typically searches for a tree with the smallest power consumption. In practice, however, it is often not possible to achieve a positive slack. It might also not be possible to find a solution without electrical violations. In such cases, our first objective is to minimize the sum of the electrical violations. The second objective is to maximize min{0, slack}, followed by minimizing power. In addition, we seek solutions minimizing the use of wiring resources. It is often desirable to balance the two main objectives, timing and wirelength. To this end, we introduce a parameter ξ ∈ [0, 1] indicating how timing-critical a given instance is.
For ξ = 1, we primarily optimize the worst slack; for ξ = 0, we optimize wirelength. We consider the respective other objective only in case of ties. In practice, however, we mainly use values of ξ that are strictly between 0 and 1.

The Repeater Tree Problem is NP-hard because it contains the Steiner Minimum Tree Problem if one just wants to minimize the ℓ₁-netlength (see Garey and Johnson (1977)). In addition, delay and slew functions are non-linear, leading to further difficulties. A good overview of existing approaches to the Repeater Tree Problem can be found in Alpert et al. (2008), Chapters 24–28. For an older discussion see Cong et al. (1996).

3.3 Our Repeater Tree Algorithm

We present our algorithm for the Repeater Tree Problem, which we call Fast Buffering, in the next chapters. Most repeater tree instances are built in the same environment with the same repeater library, blockages, global routing, wiring modes, and timing functions. We therefore spend some time to preprocess the environment and to compute parameters that allow us to perform the subsequent steps efficiently. The preprocessing step is explained in Chapter 4.

A common approach to the Repeater Tree Problem is to divide it into two steps: Steiner tree generation (also called topology generation) and repeater insertion (also called buffering). Our algorithm takes the same route. A main reason is that a) for some applications the topology is fixed and we only have to do repeater insertion, and b) other applications only need a timing-aware topology. Thus, our topology generation algorithm can be used for different applications, for example, the optimization of symmetric fan-in trees². Similarly, our buffering algorithm can work on topologies from other sources (e.g. routing). The division allows us to exchange one algorithm without touching the other.
The algorithms are independent of each other but, on the one hand, we already consider the expected results of repeater insertion during topology creation by using a delay model for estimation that tightly matches buffering results³, and, on the other hand, we allow our buffering algorithm to modify the input topology where suitable. Chapter 5 explains how we create topologies, and in Chapter 6 we show how we buffer them.

² Symmetric fan-in trees compute symmetric functions with n inputs. They are the reverse of repeater trees: signals from n sources are merged into a single output.
³ See Figure 5.3 and the surrounding discussion.

4 Instance Preprocessing

4.1 Analysis of Library and Wires

We compute some auxiliary data and parameters in advance, which are then used for all instances of a design. It is possible to precompute parameters for a single global optimization run because the environment does not change and because technology, library, and wire types are the same for a lot of instances. On the other hand, it is not possible to precompute useful data across several runs because the environment changes too often. Due to different timing rules, voltages, and temperatures between different optimization runs, it is necessary to recompute the parameters even if the basic technology or library did not change.

The main goal is to identify parameters that allow us to estimate the timing of a repeater tree based on the Steiner tree. In addition, we compute some values that guide us in the buffering step of a particular instance.

4.1.1 Estimating two-pin connections

Figure 4.1 shows the delay over a two-pin connection depending on its length after buffering it in an approximately delay-minimal way (see Section 6.3). The experiment was done with a 22 nm chip design using default planes and wire widths. We see how repeater insertion linearizes the delay, which would otherwise be quadratic.
The red line in the figure shows a linear approximation of the delay function between the two inverters. The slope of the approximation is d_wire. Given this approximation, we can predict the delay of two-pin nets after buffering:

\[ \mathrm{delay} = d_{wire} \cdot \mathrm{length}. \tag{4.1} \]

While there are closed-form solutions for buffering two-pin nets¹, we do not use them for approximating d_wire because they rely on simplifications like the Elmore delay and do not capture all environmental parameters. Instead, we search for the best way of buffering long two-pin nets by implementing it in the design and using the timing engine to calculate delays and slews. This way, we capture all effects that affect timing.

We now show how we compute the constant d_wire. To bridge large distances of wire in wiring mode w using repeater t, we partition the wire equidistantly by adding a repeater after every l units of wire.

¹ See for example Alpert et al. (2008), p. 536f.

Figure 4.1: Two medium-sized inverters are placed at a given distance. The net between them is then buffered with the highest effort. The graph shows the resulting delay depending on the distance (black). The red line shows the linear approximation that we compute for the delay function.

To measure the delay over the line, we add two repeaters of type t into the design and connect them by a net of length l. We modify the pin capacitance at the end of the line to be c. At the input of the first repeater, the slew pair s_in is asserted. This setup guarantees that most global timing parameters are considered; local timing parameters (e.g. coupling capacitances) are ignored. Let d(t, s_in, l, w, c), s_out(t, s_in, l, w, c), and p(t, s_in, l, w, c) be the total delays over the stage (through repeater and wire), the slews at the other end of the wire, and the power consumption, respectively.
We assume that all values are infinite if a load limit or a slew limit is violated. Let now s_0 be a reasonable slew pair (we just use the minimum allowed slews for t). We define

\[ s_{i+1} := s_{out}(t, s_i, l, w, c) \quad (i \ge 0). \tag{4.2} \]

The sequence (s_i)_{i=1,2,3,...} typically converges very quickly to a fixpoint or quickly becomes ∞ due to an electrical violation. We call s_∞(t, l, w, c) := lim_{i→∞} s_i the stationary slews of (t, l, w, c). In practice, we iterate over s_i until we reach a fixpoint, which happens due to the limited precision of the floating-point numbers used to represent slews. Typically, the fixpoint is reached within ten iterations. We then use the computed value as an approximation of s_∞(t, l, w, c). There is only a single fixpoint because the slew function is typically contractive over a whole stage. Thus, the choice of the initial slews does not matter as long as it is within the domain of the slew and delay functions.

One might think that inverters need special treatment because the rising (falling) output slew does not depend on the rising (falling) input slew; instead, the signal is inverted internally. The stationary slews might only be achieved after propagating over two stages in the inverter chain. There might also be different stationary slews for odd and even numbers of stages, respectively. Fortunately, one can easily show that one only has to deal with a single stationary slew pair.

Figure 4.2: An endless chain of equidistantly distributed inverters with slews s_i^r, s_i^f at the input pins. The slews alternate between rise and fall. Although a given rise value is not used in the computation of the following one, like, for example, s_i^r and s_{i+1}^r, consecutive rise values converge to a fixpoint.

Figure 4.2 shows an infinite line of equidistantly distributed inverters and the slews at the input pins. The lines below show how the slews are propagated. For example, the rising slew s_1^r is propagated to the falling slew s_2^f.
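The iteration (4.2) can be sketched as follows. The contractive stage slew function used in the example is made up for illustration; in the thesis, s_out is measured with the timing engine.

```python
import math

def stationary_slew(s_out, s0, max_iter=50):
    """Iterate s_{i+1} = s_out(s_i) (Equation 4.2) until the sequence
    stops moving; return None if a limit violation makes it infinite."""
    s = s0
    for _ in range(max_iter):
        nxt = s_out(s)
        if math.isinf(nxt):
            return None                          # electrical violation
        if math.isclose(nxt, s, rel_tol=1e-12):
            return nxt                           # numerical fixpoint
        s = nxt
    return s                                     # good enough after max_iter

# hypothetical contractive stage: output slew = 0.4 * input slew + 30
fix = stationary_slew(lambda s: 0.4 * s + 30.0, s0=10.0)
```

Because the example stage is a contraction with factor 0.4, the iteration converges to the unique fixpoint 50 regardless of the starting slew, mirroring the argument in the text.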
Let rf be the rise-fall slew function for a whole stage² of the chain and fr the fall-rise slew function. As both functions are contractive, their compositions are also contractive. Within the chain of inverters, the even and odd slews form separate sequences with

\[ s_{i+2} = \left( fr(rf(s^r_i)),\ rf(fr(s^f_i)) \right). \]

The blue and red lines in the figure indicate the two sequences. The sequences differ only in their starting points, as the even sequence starts with the minimum allowed slews and the odd sequence starts with s_1. Because a contractive function has only a single fixpoint, both sequences converge to it. This means that the combined sequence also converges to the fixpoint. Therefore, it suffices to consider a single stage, not only for buffers but also for inverters.

Due to the asymmetric nature of the delay and slew functions for rising and falling slews, we take the average for further processing and abbreviate:

\[ d(t,l,w,c) := \tfrac{1}{2}\left[ d(t, s_{\infty}(t,l,w,c), l, w, c)^r + d(t, s_{\infty}(t,l,w,c), l, w, c)^f \right] \]
\[ s(t,l,w,c) := \tfrac{1}{2}\left[ s_{\infty}(t,l,w,c)^r + s_{\infty}(t,l,w,c)^f \right] \]
\[ p(t,l,w,c) := p(t, s_{\infty}(t,l,w,c), l, w, c). \]

For a given wiring mode, a repeater, and a length, we can now compute the delay per unit distance and the power consumption per unit distance,

\[ \bar{d}(t,l,w) := \frac{d(t,l,w,\mathrm{cap}_{in}(t))}{l}, \qquad \bar{p}(t,l,w) := \frac{p(t,l,w,\mathrm{cap}_{in}(t))}{l}, \]

and the corresponding stationary slews

\[ \bar{s}(t,l,w) := s_{\infty}(t,l,w,\mathrm{cap}_{in}(t)). \]

² The slew function for a whole stage combines the slew calculation from the input of a repeater, through the repeater and the following net, up to the input of the next repeater.

Figure 4.3 shows how the delay per unit distance typically behaves depending on the length between two consecutive repeaters. The delay is dominated by the repeater delay for small distances. The overall delay per unit distance decreases until a delay-optimal distance is reached. For larger distances, the wire delay begins to dominate.
The curve ends as soon as the slews or loads create electrical violations. It is not shown in the figure, but the stationary slews increase monotonically with the length.

Figure 4.3: A typical curve showing d̄(t,l,w) for a given repeater t and wiring mode w over the range of valid lengths l. The curve shown is from a medium-sized inverter of a 22 nm design on the third metal layer using the smallest wiring mode.

Using the same repeater and wiring mode as in the previous figure, we can see in Figure 4.4 how power per unit distance and delay per unit distance relate to each other. For small distances (upper right endpoint of the curve), the power consumption is high due to the high number of repeaters needed. Power consumption and delay decrease with larger distances until we reach the optimal distance. Further power reductions cause higher delays. The red points show possible stage lengths that are not dominated by distances that result in cheaper configurations with the same delay or faster configurations with the same power consumption.

Figure 4.4: A typical curve showing (d̄(t,l,w), p̄(t,l,w)) for a given repeater t and wiring mode w, parametrized over the range of valid lengths l (from the minimum repeater spacing through the fastest spacing to the maximum feasible spacing). The curve is generated for the same repeater as in Figure 4.3.

We choose for each wiring mode w a repeater t*_w ∈ L and a length l*_w ∈ ℝ>0 which minimize the linear combination

\[ \xi \, \bar{d}(t^*_w, l^*_w, w) + (1 - \xi) \, \bar{p}(t^*_w, l^*_w, w). \tag{4.3} \]

If two different choices for t*_w and l*_w minimize the expression, we choose the faster one. We call the parameter ξ the power-time tradeoff. For library analysis, we choose ξ = 1 to calculate a lower bound on the achievable wire delay. The minima can then be found by binary search over all lengths for each repeater type.
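The selection of t*_w and l*_w by minimizing (4.3) can be sketched with a plain grid search over candidate lengths; the per-unit-distance delay and power functions below are hypothetical stand-ins for the measured d̄ and p̄, and the simple sweep replaces the binary search used in the thesis.

```python
def pick_chain(repeaters, lengths, xi, d_bar, p_bar):
    """Return (t, l) minimizing xi*d_bar(t,l) + (1-xi)*p_bar(t,l);
    on a tie in cost, prefer the faster (smaller d_bar) choice."""
    best = None
    for t in repeaters:
        for l in lengths:
            d, p = d_bar(t, l), p_bar(t, l)
            key = (xi * d + (1 - xi) * p, d)   # lexicographic tie-break
            if best is None or key < best[0]:
                best = (key, (t, l))
    return best[1]
```

A usage sketch with made-up library data: per unit distance, delay is modeled as intrinsic/l + r·l (amortized repeater delay plus wire contribution) and power as pwr/l, so larger spacings trade delay for power as in Figure 4.4.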
The functions we have to minimize are similar to the one shown in Figure 4.3.

4.1.2 Parameter d_wire

For delay estimation during topology generation, we do not want to distinguish between horizontal and vertical wire segments, because we often do not want to fix the exact embedding of path segments. Therefore, we use the default wiring modes w*_h and w*_v to build an average delay value. Typically, both wiring modes have similar electrical properties such that the resulting value is not far from either. We choose the optimal repeater t* and length l* such that

\[ \xi \left( \bar{d}(t^*,l^*,w^*_h) + \bar{d}(t^*,l^*,w^*_v) \right) + (1 - \xi) \left( \bar{p}(t^*,l^*,w^*_h) + \bar{p}(t^*,l^*,w^*_v) \right) \]

is minimized. The resulting delay per unit distance

\[ d_{wire} := \frac{\bar{d}(t^*,l^*,w^*_h) + \bar{d}(t^*,l^*,w^*_v)}{2} \tag{4.4} \]

is the parameter we searched for. It will be used for delay estimation during topology generation (see Equation 4.1). We call the stationary slew corresponding to this choice of repeater and length optslew:

\[ \mathrm{optslew} := \frac{\bar{s}(t^*,l^*,w^*_h) + \bar{s}(t^*,l^*,w^*_v)}{2}. \]

The average capacitance over a stage is called maxcap:

\[ \mathrm{maxcap} := \frac{\mathrm{wirecap}(w^*_h) + \mathrm{wirecap}(w^*_v)}{2} \, l^* + \mathrm{cap}_{in}(t^*). \]

4.1.3 Buffering Modes

As indicated in the previous section, we allow diagonal segments in our Steiner trees, so it is not clear where we will eventually use horizontal or vertical wiring segments. Thus, we assign to each segment a buffering mode that approximates the properties of a horizontal and a vertical wiring mode. Buffering modes also represent the effort we want to put into the buffering of a segment. A buffering mode m is a 3-tuple (m_h, m_v, m_ξ) that consists of

• a horizontal wiring mode m_h ∈ W,
• a vertical wiring mode m_v ∈ W, and
• a power-time tradeoff m_ξ.

During the buffering of a wire segment, we will try to replicate such long-distance chains. The distance between the repeaters of a chain buffered with a given mode is determined by the target repeater and the slew targets.
The stationary slews of the chain will be the slew targets. We have to determine the set of buffering modes that we want to work with. We assume that there is a set W_p ⊆ W × W of wiring mode pairs. Each pair consists of a horizontal wiring mode and a vertical wiring mode. We also restrict ourselves to a set Ξ of power-time tradeoffs between 0.0 and 1.0. Given a wiring mode pair (w_h, w_v) with horizontal wiring mode w_h ∈ W and vertical wiring mode w_v ∈ W and a power-time tradeoff ξ ∈ Ξ, we define a buffering mode (w_h, w_v, ξ).

For each buffering mode, we find the optimal repeater t ∈ L and distance l ∈ ℝ>0 minimizing

\[ \xi \left( \bar{d}(t,l,w_h) + \bar{d}(t,l,w_v) \right) + (1 - \xi) \left( \bar{p}(t,l,w_h) + \bar{p}(t,l,w_v) \right). \]

The optimal repeater for buffering mode m is called m_t. The corresponding slew targets are

\[ m_s := \frac{\bar{s}(t,l,w_h) + \bar{s}(t,l,w_v)}{2}. \]

The delay of a buffering mode m is defined as

\[ m_d := \frac{\bar{d}(t,l,w_h) + \bar{d}(t,l,w_v)}{2}. \]

The power consumption per unit length is

\[ m_p := \frac{\bar{p}(t,l,w_h) + \bar{p}(t,l,w_v)}{2}. \]

The capacitance of a stage is

\[ m_{cap} := \frac{\mathrm{wirecap}(w_h) + \mathrm{wirecap}(w_v)}{2} \, l + \mathrm{cap}_{in}(t). \]

The average capacitance per unit length of wire is

\[ m_{wirecap} := \frac{\mathrm{wirecap}(w_h) + \mathrm{wirecap}(w_v)}{2}. \]

We use m_d and m_p to estimate the delay and power of an edge that is buffered using mode m.

In practice, the user creates the set W_p of reasonable wiring mode pairs. Typically, for each wiring mode pair in W_p, the horizontal and the vertical wiring mode have similar widths and spacings and lie on neighboring planes. In such a case, the delay, power, and slew values do not differ significantly between the wiring modes of a pair, so using the averages is not too far off. The set Ξ is also defined by the user, but, in practice, we only use two tradeoffs: 0 and ξ. The default wiring modes are always in W_p.
Thus, there is always a buffering mode available with delay d_wire (compare Section 4.1.2):

\[ m^* := (w^*_h, w^*_v, \xi). \]

Finally, given W_p, we can determine the set M of buffering modes that we will use for buffering:

\[ M := \{ (w_h, w_v, \xi) \mid (w_h, w_v) \in W_p,\ \xi \in \Xi \}. \]

For each buffering mode m, there is a set of alternative buffering modes M_m containing all buffering modes with the same horizontal and vertical wiring modes as m, including m itself. The alternative buffering modes only differ in their ξ values. We assume that M_m contains only non-dominated buffering modes. A buffering mode dominates another one if it is at the same time not slower and not more expensive than the other one.

4.1.4 Slew Parameters

As described in Section 3.1, different slews at the input pins of repeater tree subtrees have different effects on the downstream delays. To account for this where we do not have an explicit RAT function, we introduce a parameter ν that translates slew differences into delay differences (see also Vygen (2006)). Let t* ∈ L and l* ∈ ℝ>0 be the repeater and length that minimize Equation 4.4. We define the slew pair of this optimal chain using only one of the default wiring modes:

\[ s_{opt} := \bar{s}(t^*, l^*, w^*_h). \]

We now compute

\[ d_1 := \sum_{i=0}^{N} d(t^*, s_i, l^*, w^*_h, \mathrm{cap}_{in}(t^*)) \]

using the following slews:

\[ s_0 := s_{opt}, \qquad s_i := s_{out}(t^*, s_{i-1}, l^*, w^*_h, \mathrm{cap}_{in}(t^*)). \]

In a second step, we compute d_2 analogously to d_1 by starting with s_0 := 2 · s_opt. In practice, using 2 · s_opt will not lead to a violation. We set the desired parameter to

\[ \nu := \frac{d_2 - d_1}{s_{opt}}. \]

The number N is chosen such that the stationary slew is reached in both computations. The parameter depends on the timing environment and technology. Typically, it lies between 0.10 and 0.25. We use it to define

\[ \mathrm{slewdelay}(s) := \nu \cdot (s - s_{target}) \]

for a given target slew s_target. The function slewdelay is used for two similar tasks:

1.
As discussed earlier, required arrival times are associated with individual slew requirements at the sinks. To compare RATs better, we normalize them to a target slew. For example, if a sink has the slew requirement s, then we translate its RAT to s_target by adding slewdelay(s) to it.

2. If a signal with slew s arrives at a sink with slew target s_target, then we add slewdelay(s) to the arrival time before we compute the slack of the signal.

Both happen only in our buffering algorithm based on dynamic programming (see Section 6.3).

4.1.5 Sinkdelay

Consider two inverters connected by a net and placed far apart (as described in Section 4.1.1). We now add repeaters to the line such that the delay between the inputs of the inverters is minimized.

Figure 4.5: Two inverters are placed at a given distance. The net between them is buffered with the highest effort. The graph shows the resulting delays for the same source but three different sink capacitances at the end of the net: a small inverter (green), a medium-sized inverter (red), and a huge inverter (black).

Figure 4.5 shows the resulting delays for different distances between the boundary inverters and for different capacitances at the end of the chain due to the different inverter sizes. The difference between the delays remains nearly constant for longer distances. We choose a distance where the delay differences are significant enough and compute the resulting delay for different sink pin capacitances. Let d_0 be the delay of the chain if the capacitance at the sink is the input pin capacitance of t*, the optimal repeater minimizing Equation 4.4. For a given capacitance c at the end of the chain and the resulting delay d_c, we define

\[ \mathrm{sinkdelay}(c) := d_c - d_0. \]

This function is used to estimate the delay difference on repeater chains caused by a sink capacitance that differs from that of the optimal repeater.
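The two slewdelay tasks described above can be sketched in a few lines; the values of ν and the target slew below are made-up example numbers, not library data.

```python
NU = 0.15          # example slew-to-delay translation factor nu
S_TARGET = 40.0    # example target slew

def slewdelay(s):
    """Translate a slew difference into a delay difference."""
    return NU * (s - S_TARGET)

def normalized_rat(rat, slew_requirement):
    """Task 1: translate a RAT given at slew_requirement to S_TARGET."""
    return rat + slewdelay(slew_requirement)

def penalized_at(at, arriving_slew):
    """Task 2: charge a slow signal before computing its slack."""
    return at + slewdelay(arriving_slew)
```

A RAT attached to a looser slew requirement becomes larger after normalization, and a signal arriving with a slew above the target is penalized, so both quantities are comparable at the common target slew.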
Figure 4.6: Two inverters are placed at a given distance (250 µm to 750 µm on the x-axis). The net between them is buffered with the highest effort. The graph shows the resulting delays for the same sink capacitance but three different sources: a small inverter (green), a medium sized inverter (red), and a huge inverter (black).

Figure 4.6 shows the effects of changing the first gate instead of the last one in our setup. Stronger (weaker) gates result in smaller (larger) delays on the chain. The delay differences are also nearly constant at larger distances. In contrast to the delays introduced by sink capacitances, we do not introduce a function to estimate these delays. Instead, we evaluate the root arrival time at the capacitance of the optimal chain.

4.1.6 Further Preprocessing

Next, we compute a parameter dnode that is used during topology generation to model the extra delay to be expected along a path due to additional capacitances induced by a side branch. We determine it by adding a small repeater to the repeater chain and measuring the additional delay.

Let inv(c, s) denote the inverter with the smallest power consumption that still achieves a slew of at most s at its output pin if its input slew is sopt and the load is c. Let t1 := inv(maxcap, sopt), and let t2 := inv(capin(t1), sopt). We compute d1 as in Section 4.1.4. Now we modify the repeater chain by adding a capacitance load of capin(t1) in the middle of the first segment and consider it during the delay and slew computation of the first stage. We then get the new delay d2 and set our branch penalty to

dnode := d2 − d1.

4.2 Blockage Map and Congestion Map

Placement blockages and wiring congestion information are given to the repeater tree routine via a blockage map and a congestion map, respectively. The blockage map is used to check whether a given point is blocked.
The congestion map holds the global routing information showing on which parts of the chip routing space is scarce.

4.2.1 Grid

Both the blockage map and the congestion map share the same grid. The grid partitions the chip area into tiles. The tiles are nodes of a grid graph. The bounding box ca of the design area is given by [caminx, camaxx] × [caminy, camaxy].

Definition 3 (Grid). A grid is a pair (xlines, ylines) of cutlines xlines = {x0, x1, …, xm} and ylines = {y0, y1, …, yn} with x0 < x1 < … < xm and y0 < y1 < … < yn and m > 1 and n > 1. We call a grid feasible for a chip area ca if x0 ≤ caminx, camaxx < xm, y0 ≤ caminy, and camaxy < yn.

Most of the time, we use a grid with equidistant cutlines. An equidistant grid is accurate enough in the context of repeater tree insertion if it is not spaced too widely. Sometimes, however, one wants to use a Hanan grid given by the coordinates of all edges of significantly large blockages, such that their positions are exactly captured in the blockage map.

Definition 4 (Tile). Given a grid (xlines, ylines) with xlines = {x0, x1, …, xm} and ylines = {y0, y1, …, yn}, we call the rectangle [xi, xi+1) × [yj, yj+1) for 0 ≤ i < m and 0 ≤ j < n the tile(i, j) of the grid.

For a given point of the chip area, there is exactly one tile in a feasible grid that contains the point.

4.2.2 Blockage Map

The set of blocked regions for a repeater tree instance is stored in a data structure that we call the blockage map. The most important operation performed on the blockage map is the search for the nearest free location. Given a point in the plane, the blockage map can return the nearest free location in a given direction rectilinear to the grid, or the nearest free location in the whole plane with respect to the ℓ1-metric. Points on blockage boundaries that are adjacent to free points are considered free.
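Because the tiles of Definition 4 are half-open rectangles, the tile containing a point can be found with two binary searches over the cutlines. A small sketch (function name is ours):

```python
from bisect import bisect_right

def tile_of(point, xlines, ylines):
    """Return the index (i, j) of the half-open tile [x_i, x_{i+1}) x [y_j, y_{j+1})
    of Definition 4 that contains `point`, for sorted cutline lists."""
    x, y = point
    i = bisect_right(xlines, x) - 1   # largest i with xlines[i] <= x
    j = bisect_right(ylines, y) - 1   # largest j with ylines[j] <= y
    if not (0 <= i < len(xlines) - 1 and 0 <= j < len(ylines) - 1):
        raise ValueError("point outside the grid")
    return i, j

xlines = [0, 10, 20, 30]   # m = 3
ylines = [0, 10, 20]       # n = 2
assert tile_of((15, 5), xlines, ylines) == (1, 0)
assert tile_of((10, 19), xlines, ylines) == (1, 1)  # left-closed boundary
```

The half-open convention makes the boundary cases unambiguous: a point lying exactly on a cutline belongs to the tile on its right (or above), matching Definition 4.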
4.2.3 Blockage Grid

For an existing blockage map and a grid, we also construct a blockage grid. The blockage grid stores the information whether a grid tile is blocked or not:

Definition 5 (Blockage Grid). Given a grid (xlines, ylines) with xlines = {x0, x1, …, xm} and ylines = {y0, y1, …, yn} and a blockage map, a blockage grid is a function bg : {0, …, m − 1} × {0, …, n − 1} → {0, 1} where bg(x, y) = 1 iff tile(x, y) is completely blocked by the blockages of the map.

Shortest Path Searches

Typically, blockages do not block all wiring layers in a design; it is possible to cross them on higher layers. Repeater trees are therefore also allowed to jump over blockages. However, the possible distance is limited by the slew and capacitance limits, as it is not possible to place repeaters on blockages. Larger distances between repeaters caused by jumping over blockages also cost additional delay compared to an optimally spaced repeater chain.

At one step in our topology generation algorithm, we search for delay-minimal paths between points in the design. We use a modified version of Dijkstra's shortest path algorithm on the blockage grid for this task. Given two points, we first identify the tiles they belong to in the blockage grid and then compute a shortest path between both tiles. The cost of an edge between two neighboring tiles depends on whether the tiles are blocked or not. Crossing unblocked space costs proportional to dwire. Costs over blocked area increase first linearly and, after a threshold, quadratically with the distance the path has already traveled over blockages.

Figure 4.7: Blockage map (red) and grid (blue) on the design Julius.

4.2.4 Congestion Map

We implemented a rough global routing engine as the congestion map. In contrast to a full-fledged global router, the congestion map does not try to find a congestion-free global routing solution by all means.
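Because the cost of crossing blocked area depends on how far the path has already traveled over blockages, the Dijkstra state must carry the current blocked-run length in addition to the tile. A simplified sketch — the exact penalty shape and the threshold value are our assumptions; the thesis only states "first linearly, then quadratically":

```python
import heapq

def blocked_step_cost(run_len, tile_size, d_wire, threshold=3):
    """Cost of one step over blocked area: constant-slope at first, then a
    slope growing with the run length, so the cumulative cost of a run
    becomes quadratic beyond `threshold` tiles."""
    if run_len <= threshold:
        return d_wire * tile_size
    return d_wire * tile_size * (run_len - threshold + 1)

def min_delay_path(bg, src, dst, tile_size=1.0, d_wire=1.0):
    """Dijkstra on the blockage grid bg (bg[x][y] == 1 iff tile blocked);
    states are (tile, blocked_run_length) pairs."""
    m, n = len(bg), len(bg[0])
    start = (src, 0)
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, (tile, run) = heapq.heappop(heap)
        if tile == dst:
            return d
        if d > dist[(tile, run)]:
            continue                          # stale heap entry
        x, y = tile
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nx < m and 0 <= ny < n):
                continue
            if bg[nx][ny]:
                nrun = run + 1
                step = blocked_step_cost(nrun, tile_size, d_wire)
            else:                             # leaving blockage resets the run
                nrun = 0
                step = d_wire * tile_size
            state = ((nx, ny), nrun)
            if d + step < dist.get(state, float("inf")):
                dist[state] = d + step
                heapq.heappush(heap, (d + step, state))
    return float("inf")

free = [[0] * 3 for _ in range(3)]
assert min_delay_path(free, (0, 0), (2, 2)) == 4.0  # 4 unit steps, no blockage
```

Carrying the run length in the state keeps the edge costs well-defined for Dijkstra at the price of a larger state space; bounding the maximum run (via the capacitance limit) keeps it small in practice.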
Instead, we embed a short ℓ1-tree, allowing only small detours. We also limit the number of iterations spent on improving the routing. The advantage is that we still see congestion, which we can then try to avoid during repeater tree generation. As can be seen in Table 7.2, using a full global router would increase the running time of our algorithm significantly. Our algorithm has recently been integrated into BonnRouteGlobal as a fast mode.

5 Topology Generation

The first step in our repeater tree algorithm is the construction of a repeater topology. A repeater topology specifies the abstract geometric structure of the repeater tree. Given an instance of the Repeater Tree Problem with root r and sink set S, we can define:

Definition 6. A topology T = (V(T), E(T)) with V(T) = {r} ∪̇ S ∪̇ I is an arborescence rooted at r with an embedding Pl : V(T) → R² of the nodes into the plane such that r has exactly one child, the internal nodes I have one or two children each, and the sinks S are the leaves.

Figure 5.1: A topology for one root r and three sinks. Steiner points like a and b are used to route the topology around obstacles.

Although we do not use directed edges in our figures, the edges in a topology are always directed away from the root. We often call the set I of internal points Steiner points. Internal points with only a single child are used to force the topology to pass through a certain point in the plane. Figure 5.1 shows an example topology. We should clearly note that the internal nodes do not represent repeaters and that the topology does not specify details about the exact placement, routing, and types of repeaters used in the final tree.

The length of a topology is

Σ_{(v,w)∈E(T)} ||Pl(v) − Pl(w)||.

We have seen that, after repeater insertion, delays in a repeater tree are roughly linear in the length of the segments. Connecting a sink to the root via a long path results in a higher delay to that sink.
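As a concrete illustration of Definition 6 and the length formula above, here is a minimal topology sketch; the class layout and names are illustrative, not the thesis data structure:

```python
class Topology:
    """Arborescence per Definition 6: the root has exactly one child,
    internal nodes have one or two children, sinks are leaves; `pl`
    embeds every node into the plane."""

    def __init__(self, root, pl):
        self.root = root
        self.pl = dict(pl)                       # node -> (x, y)
        self.children = {v: [] for v in self.pl}

    def add_edge(self, v, w):
        self.children[v].append(w)
        limit = 1 if v == self.root else 2       # degree constraints of Def. 6
        assert len(self.children[v]) <= limit, "degree constraint violated"

    def length(self):
        """Length of the topology: sum of l1-lengths over all edges."""
        return sum(
            abs(self.pl[v][0] - self.pl[w][0]) + abs(self.pl[v][1] - self.pl[w][1])
            for v, ws in self.children.items() for w in ws
        )

# Root r, one Steiner point a, two sinks (a shape like Figure 5.1, simplified):
t = Topology("r", {"r": (0, 0), "a": (2, 0), "s1": (2, 3), "s2": (4, 0)})
t.add_edge("r", "a")
t.add_edge("a", "s1")
t.add_edge("a", "s2")
assert t.length() == 7   # 2 + 3 + 2
```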
The required arrival times at the sinks then decide whether a path is fine or too long. Consider the example in Figure 5.2. Both topologies have the same length, but, for example, the distance to the root is 8 for the upper right sink in the first case and 2 in the second one.

Figure 5.2: Two topologies with the same shortest possible length but different timing behaviour. While all sinks are reached within 4 segments on the right side, it takes up to 8 segments to reach the furthest sink on the left side.

This example illustrates that topologies have a high influence on the timing of a repeater tree. It is therefore crucial to build timing-aware repeater trees.

Topologies for a root and a set of sinks that consider timing information do not only have an application in repeater tree construction. They can also prove useful in global routing. A global router internally often has to compute Steiner trees for the nets of a design. The routing result can be better with regard to timing if the Steiner trees are timing-aware topologies.

In this chapter, we first develop a way to estimate the timing of a topology and then state the Repeater Tree Topology Problem. We show how our algorithm solves the problem and prove some theoretical properties for restricted versions of our algorithm. The results in this chapter are joint work with Stephan Held, Jens Maßberg, Dieter Rautenbach and Jens Vygen (Bartoschek et al., 2007a, 2010).

5.1 A Simple Delay Model

Since we want to evaluate the properties of our topologies with respect to timing, we somehow have to compute a slack at the root and the sinks. It would be prohibitively slow to insert repeaters into each topology we want to evaluate. Therefore, we propose a simple delay model that estimates the timing from the geometric structure of the topology.
The delay model computes arrival times and required arrival times for all nodes of a topology, giving us a slack that can be used to evaluate the topology. The delay model mainly consists of two components: delay over wire segments and delay due to bifurcations.

We have seen in Section 4.1 how buffering a long net linearizes the delay. Given a buffering mode m, the estimated delay for a net between two points x and v is given by

delay := md · ||x − v||.   (5.1)

Every internal node of a topology with outdegree two is a bifurcation and thus an additional capacitance load for the circuit driving both of the two outgoing branches (compared to alternative direct connections). The real delay caused by bifurcations is hard to estimate beforehand. It depends on the strength of the driver, the additional capacitance, and the position of the driver relative to the sinks. In Section 4.1.6 we computed the parameter dnode estimating the average effect of a bifurcation. It is a very rough estimate, but we will show in Section 8.5 that the used value serves us well. To evaluate the delay through a topology, we add this additional delay to each outgoing edge of a node with two children.

It is reasonable to assume that the additional load capacitance will be smaller for the less critical branch. Uncritical side paths are more likely to be buffered by a small repeater with nearly negligible capacitance. We therefore allow the distribution of dnode between both involved edges. We denote by dnode(e) the amount assigned to edge e. We introduce a new parameter η controlling how unevenly dnode can be distributed. If e is an outgoing edge of a node with outdegree 1, then we require dnode(e) = 0. Otherwise, we require that dnode(e) ≥ η·dnode. For two edges e, e′ leaving the same internal node we require dnode(e) + dnode(e′) = dnode. The parameter η has to be between 0 and 1/2 for these requirements to be satisfiable.
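The constraints just stated leave one degree of freedom per bifurcation: how much of dnode to charge to each branch. When the required arrival times α and β of the two branches are merged, the best admissible split maximizes the smaller of the two resulting values, and this one-dimensional optimum has a closed form. A sketch (the function name is ours):

```python
def merge_rat(alpha, beta, d_node, eta):
    """Compute max over d in [eta*d_node, (1-eta)*d_node] of
    min(alpha - d, beta - (d_node - d)), together with the maximizing d.
    The unconstrained optimum balances both terms (alpha - d = beta - (d_node - d));
    since min(...) is concave in d, clamping that balance point into the
    admissible interval yields the constrained optimum."""
    assert 0.0 <= eta <= 0.5
    d = (alpha - beta + d_node) / 2.0
    d = min(max(d, eta * d_node), (1.0 - eta) * d_node)
    return min(alpha - d, beta - (d_node - d)), d

# Symmetric branches with eta = 1/2: the split is forced to be even.
assert merge_rat(10.0, 10.0, 2.0, 0.5) == (9.0, 1.0)
# Very uneven branches with eta = 0: all of d_node goes to the uncritical side.
assert merge_rat(10.0, 0.0, 2.0, 0.0) == (0.0, 2.0)
```

This closed form makes each merge a constant-time operation, which is what keeps the bottom-up traversals of the delay model linear in the topology size.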
Next, we have to determine the arrival time at the root node. If the edge leaving the root has buffering mode m assigned, then we assume that the root will have to drive the capacitance mcap (see Section 4.1.3). We set the arrival time at the root to

atT(r) := max{atrr(mcap), atfr(mcap)}.

Note that for maximizing the worst slack, an accurate arrival time at the root is not important because each change affects the slack at all sinks in the same way.

Finally, we have to determine the required arrival time for each sink. In our simple delay model, we only want to handle a single RAT value and not a pair of functions. Therefore, we evaluate the RAT function at the slew target of the incoming edge. As shown in Section 4.1.5, the capacitance of a sink has to be taken into account when the delay is estimated. This is done by subtracting the appropriate sinkdelay from the resulting RAT. Given a sink with required arrival time function rat, pin capacitance cap, and buffering mode m at the sink's incident edge, we define the RAT used in the delay model as

sinkrat(rat, cap, m) := min{ratr(mrs), ratf(mfs)} − sinkdelay(cap).   (5.2)

Given a topology and a buffering mode assignment F : E(T) → M, we can now estimate the slack at sink s to be

σs := sinkrat(rats, capin(s), m) − Σ_{e=(v,w)∈E(T)[r,s]} (dnode(e) + F(e)d · ||Pl(v) − Pl(w)||) − atT(r)

with m being the buffering mode of the arc entering s and ms the according slew target.

Figure 5.3: Correlation between estimated slacks and exact slacks. For each instance (slightly more than 300 000) of a middle-sized 22 nm design the difference (y-axis) between the slack in our delay model and the final slack after buffering is shown. The instances are sorted by the distance (x-axis) of the most critical sink to the root.

Figure 5.3 shows how our delay model correlates with the slacks that are achieved after buffering. For each instance of a 22 nm design we depict the difference between
the slack of the topology used for repeater insertion and the slack of the final result. Although there are some outliers where we overestimate the strength of the root and are about 50 picoseconds too optimistic, the vast majority of instances are estimated correctly to within 20 picoseconds.

5.1.1 Time Tree

Algorithm 1 TimeTree
Input: A topology T, an embedding Pl, a buffering mode assignment F : E(T) → M, and parameters dnode, η
Output: Arrival time function atT, RAT function ratT, and a dnode assignment
 1: for v ∈ V(T) traversed in postorder do
 2:   if v is a leaf then
 3:     Let e be the incoming edge to v if v ≠ r
 4:     ratT(v) := sinkrat(ratv, capin(v), F(e))
 5:   else if |δ+(v)| = 1 then    ▷ v is root or Steiner point along a path
 6:     Let {a} = δ+(v)
 7:     ratT(v) := ratT(a) − F((v,a))d · ||Pl(v) − Pl(a)||
 8:     dnode((v,a)) := 0
 9:   else
10:     Let {a, b} = δ+(v) with ratT(a) ≤ ratT(b)
11:     α := ratT(a) − F((v,a))d · ||Pl(v) − Pl(a)||
12:     β := ratT(b) − F((v,b))d · ||Pl(v) − Pl(b)||
13:     ratT(v) := max_{ηdnode ≤ d ≤ (1−η)dnode} min{α − d, β − (dnode − d)}
14:     dnode((v,a)) := ratT(a) − ratT(v)
15:     dnode((v,b)) := dnode − dnode((v,a))
16:   end if
17: end for
18: Let e be the outgoing edge of r
19: atT(r) := max{atrr(F(e)cap), atfr(F(e)cap)}
20: for v ∈ V(T) \ {r} traversed in preorder do
21:   Let w be the parent of v
22:   atT(v) := atT(w) + dnode((w,v)) + F((w,v))d · ||Pl(w) − Pl(v)||
23: end for

During topology construction, we will only maintain the required arrival times and update them incrementally. Arrival times are not explicitly calculated. However, it is often desirable to compute the delay model of a given topology. This can be done with Algorithm 1 (TimeTree). It first traverses the topology bottom-up, computes required arrival times, and distributes dnode. Then, arrival times are computed in a second top-down traversal.
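A runnable sketch of this two-pass computation on a toy tree. The dictionary-based layout, the single wire-delay slope replacing the buffering mode assignment, and the closed-form dnode split are our simplifications of the listing, not its exact semantics:

```python
def l1(p, q):
    """l1-distance between two embedded points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def time_tree(root, children, pl, d_edge, sink_rat, at_root, d_node, eta):
    """Two-pass delay-model evaluation in the spirit of TimeTree:
    postorder RAT propagation with d_node distribution, then preorder
    arrival times. children[v] lists v's children (empty for sinks)."""
    rat, at, dn = {}, {}, {}

    def postorder(v):
        kids = children.get(v, [])
        if not kids:                              # sink
            rat[v] = sink_rat[v]
            return
        for w in kids:
            postorder(w)
        if len(kids) == 1:                        # root or path Steiner point
            a = kids[0]
            rat[v] = rat[a] - d_edge * l1(pl[v], pl[a])
            dn[(v, a)] = 0.0
        else:                                     # bifurcation
            a, b = kids
            alpha = rat[a] - d_edge * l1(pl[v], pl[a])
            beta = rat[b] - d_edge * l1(pl[v], pl[b])
            # optimal split of d_node: clamp the balance point into the range
            d = min(max((alpha - beta + d_node) / 2.0, eta * d_node),
                    (1.0 - eta) * d_node)
            rat[v] = min(alpha - d, beta - (d_node - d))
            dn[(v, a)], dn[(v, b)] = d, d_node - d

    def preorder(v):
        for w in children.get(v, []):
            at[w] = at[v] + dn[(v, w)] + d_edge * l1(pl[v], pl[w])
            preorder(w)

    postorder(root)
    at[root] = at_root
    preorder(root)
    return rat, at

children = {"r": ["x"], "x": ["s1", "s2"]}
pl = {"r": (0, 0), "x": (1, 0), "s1": (1, 2), "s2": (3, 0)}
rat, at = time_tree("r", children, pl, d_edge=1.0,
                    sink_rat={"s1": 10.0, "s2": 10.0},
                    at_root=0.0, d_node=1.0, eta=0.5)
assert rat["r"] == 6.5            # 10 - 2 (wire) - 0.5 (d_node share) - 1 (root edge)
assert at["s1"] == 3.5 and at["s2"] == 3.5
```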
Both traversals have a running time that is linear in the size of the topology, as each update step can be done in constant time.

5.2 Repeater Tree Topology Problem

The Repeater Tree Topology Problem is the task of finding, for an instance of the Repeater Tree Problem, a topology together with an embedding and a buffering mode assignment. As for the Repeater Tree Problem, we allow several objectives, like minimizing netlength or maximizing the delay model slack with minimal costs. Minimizing the ℓ1-length is an objective for topology generation that appears in early design stages or for timing-uncritical instances. This corresponds to computing shortest rectilinear Steiner trees. Garey and Johnson (1977) showed that this problem is already NP-hard.

Previous Work on Topology Generation

Alpert et al. (2008), Chapters 24–28, give a good overview of existing topology generation algorithms, beginning with different flavours of Steiner trees and finishing with algorithms specific to repeater tree optimization. Okamoto and Cong (1996) proposed a repeater tree procedure using a bottom-up clustering of the sinks and a top-down buffering of the obtained topology. Similarly, Lillis et al. (1996b) also integrated buffer insertion and topology generation. They introduced the P-tree algorithm, which takes the locality of sinks into account, and explored a large solution space via dynamic programming. Hrkić and Lillis (2002) considered the S-tree algorithm, which makes better use of timing information, and integrated timing and placement information using so-called SP-trees (Hrkić and Lillis, 2003). In these approaches the sinks are typically partitioned according to criticality, and the initially given topology (e.g. a shortest Steiner tree) can be changed by partially separating critical and noncritical sinks. Whereas the results obtained by these procedures can be good, the running times tend to be prohibitive for realistic designs in which millions of instances have to be solved.
Alpert et al. (2002) create topologies in a two-step approach. First, sinks are clustered based on parity and criticality. Second, clusters are merged by a Prim-Dijkstra heuristic that scales between shortest path trees and minimum spanning trees. Further approaches for the generation or appropriate modification of topologies and their buffering were considered in Cong and Yuan (2000); Alpert et al. (2001b, 2004a); Müller-Hannemann and Zimmermann (2003); Dechu et al. (2005); Hentschke et al. (2007); Pan et al. (2007). Repeater topology generation loosely overlaps with the design of delay-constrained multicast networks, where network traffic has to be distributed to clients. A survey can be found in Oliveira and Pardalos (2005).

5.2.1 Topology Algorithm Overview

We solve the Repeater Tree Topology Problem by splitting it into three steps. In the first step, we restrict ourselves to the default buffering mode m∗ and compute an initial topology ignoring blockages. During the second step, we navigate around blockages if blocked segments in the initial topology get too long. In the final step, buffering modes from higher layers are assigned to topology edges if their slack is infeasible. We start our explanation by describing a simplified version of the Repeater Tree Topology Problem.

5.3 Restricted Repeater Tree Problem

The first step of our topology generation algorithm uses the default buffering mode m∗. This is the fastest mode using the default wiring modes. To simplify notation, we set d := m∗d and c := dnode/2. The bifurcation delay assigned to edge e is c(e). We evaluate topologies with our delay model, which does not distinguish between signal edges and does not know RAT functions. We adapt the Repeater Tree Topology Problem to this simplification. We set for each sink s ∈ S the required arrival time to

as := sinkrat(rats, capin(s), m∗) − max{atr(m∗cap), atf(m∗cap)}.
Given a topology T, the slack for sink s ∈ S becomes

σs := as − Σ_{(v,w)∈E(T)[r,s]} (d · ||Pl(v) − Pl(w)|| + c((v,w))).

The slack of the whole topology is

σ(T) := min_{s∈S} σs.

We call this simplified version of the topology problem the Restricted Repeater Tree Topology Problem. It is shown in Figure 5.4.

5.4 Sink Criticality

Our topology generation algorithm inserts the sinks into the topology one by one, and the resulting structure depends on the order in which the sinks are considered. Sinks that are inserted first are favored because they will potentially be connected to the root by shorter paths. We thus want to prefer sinks that are more timing-critical. In order to quantify correctly how critical a sink s is, it is crucial to take its required arrival time as as well as its location Pl(s) into account. A sink that is further away from the root will, other things being equal, result in a worse slack because the signal has to traverse the distance, which costs delay. Similarly, if two sinks have the same distance to the root, then both pay approximately the same delay to reach the root, but the sink with the lower required arrival time is more critical.

Instance: An instance consists of
• a root r and its location Pl(r),
• a set S of sinks and for each sink s ∈ S its location Pl(s) and a required arrival time as,
• a value c = dnode/2 ∈ R≥0, and
• a value d = dwire ∈ R≥0.
Feasible Solution: A feasible solution is a topology over root r and sinks S.
Figure 5.4: Restricted Repeater Tree Topology Problem

A good measure for the criticality of a sink s is the slack that would result from connecting s optimally to r while disregarding all other sinks. We can estimate the optimal connection using our delay model. The resulting slack equals

σs = as − d · ||Pl(r) − Pl(s)||.   (5.3)

The smaller this number is, the more critical we consider the sink to be.
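Equation 5.3 directly yields the insertion order: compute σs for every sink and sort ascending, so the most critical sink comes first. A small sketch (the dictionary fields are illustrative):

```python
def criticality_order(sinks, root_pos, d_wire):
    """Order sinks by non-decreasing estimated slack per Equation 5.3:
    sigma_s = a_s - d_wire * ||Pl(r) - Pl(s)||_1."""
    rx, ry = root_pos

    def sigma(s):
        (x, y), a = s["pos"], s["rat"]
        return a - d_wire * (abs(x - rx) + abs(y - ry))

    return sorted(sinks, key=sigma)

sinks = [
    {"name": "near_late", "pos": (1, 0), "rat": 5.0},   # sigma = 5 - 1 = 4
    {"name": "far_early", "pos": (8, 0), "rat": 6.0},   # sigma = 6 - 8 = -2
]
order = criticality_order(sinks, root_pos=(0, 0), d_wire=1.0)
assert [s["name"] for s in order] == ["far_early", "near_late"]
```

Note how the distant sink is more critical despite its later required arrival time: the delay to traverse the distance dominates.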
5.5 A Simple Topology Generation Algorithm

Before we explain the topology generation algorithm that we use to solve the Restricted Repeater Tree Topology Problem in Section 5.6, we first look at an algorithm that has the basic structure of our final algorithm. We show the algorithm in the next section before we discuss some of its theoretical properties.

Our first algorithm creates topologies in which each internal vertex is a bifurcation. In addition, we use η = 1/2, so that each arc except the one leaving the root has a node delay of c. The slack of a topology T is then given by

σ(T) := min_{s∈S} ( as − c · (|E(T)[r,s]| − 1) − Σ_{(v,w)∈E(T)[r,s]} d · ||Pl(v) − Pl(w)|| ).

The properties we show for the simple topology generation algorithm were first published in Bartoschek et al. (2010).

5.5.1 Topology Generation Algorithm

Algorithm 2 inserts sinks into a topology one by one according to some order s1, s2, …, sn, starting with a tree containing only the root r and the first sink s1.

Algorithm 2 Simple Topology Generation Algorithm
 1: Choose a sink s1 ∈ S
 2: V(T1) ← {r, s1}
 3: E(T1) ← {(r, s1)}
 4: T1 ← (V(T1), E(T1))
 5: n ← |S|
 6: for i = 2, …, n do
 7:   Choose a sink si ∈ S \ {s1, s2, …, si−1},
 8:   an edge ei = (u,v) ∈ E(Ti−1),
 9:   and an internal vertex xi with Pl(xi) ∈ R².
10:   V(Ti) ← V(Ti−1) ∪ {xi} ∪ {si}
11:   E(Ti) ← (E(Ti−1) \ {(u,v)}) ∪ {(u,xi), (xi,v), (xi,si)}
12:   Ti ← (V(Ti), E(Ti))
13: end for

The sinks si for i ≥ 2 are inserted by subdividing an edge ei with a new internal vertex xi located at Pl(xi) and connecting xi to si. The behaviour of the procedure clearly depends on the choice of the order, the choice of the edge ei, and the choice of the placement Pl(xi) ∈ R². In view of the large number of instances which have to be solved in an acceptable time, the simplicity of the above procedure is an important advantage for its practical application.
Furthermore, implementing suitable rules for the choice of si, ei, and xi allows us to pursue and balance various practical optimization goals. We look at two variants (P1) and (P2) of the procedure, corresponding to optimizing the worst slack (P1) or minimizing the length of the topology (P2), respectively.

(P1) The sinks are inserted in an order of non-increasing criticality, where the criticality of a sink s ∈ S is quantified by −σs as shown above. During the i-th execution of the for-loop, the new internal vertex xi is always chosen at the same position as r, and the edge ei is chosen such that σ(Ti) is maximized. (Note that placing internal vertices at the same position means placing bifurcations at the same position. It does not mean placing several repeaters at the same position during repeater insertion.)

(P2) The sink s1 is chosen such that ||Pl(r) − Pl(s1)|| = min{||Pl(r) − Pl(s)|| | s ∈ S}, and during the i-th execution of the for-loop, si, ei = (u,v), and Pl(xi) are chosen such that

l(Ti) = l(Ti−1) + ||Pl(u) − Pl(xi)|| + ||Pl(xi) − Pl(v)|| + ||Pl(xi) − Pl(si)|| − ||Pl(u) − Pl(v)||

is minimized.

5.5.2 Theoretical Properties

Theorem 1. Given an instance of the Restricted Repeater Tree Topology Problem with η = 1/2, the largest achievable worst slack σopt equals

σ∗(S) := max{ σ ∈ R | Σ_{s∈S} 2^(−⌊(1/c)(as − d||r−s|| − σ)⌋) ≤ 1 },

and (P1) generates a repeater tree topology T(P1) with σ(T(P1)) = σopt.

Proof: Let a′s = as − d||r − s|| for s ∈ S. Let T be an arbitrary repeater tree topology. By the definition of σ(T) and the triangle inequality for ||·||, we obtain

|E[r,s]| − 1 ≤ ⌊(1/c)(as − Σ_{(u,v)∈E[r,s]} d||u − v|| − σ(T))⌋ ≤ ⌊(1/c)(a′s − σ(T))⌋

for every s ∈ S. Since the unique child of the root r is itself the root of a binary subtree of T in which each sink s ∈ S has depth exactly |E[r,s]| − 1, Kraft's inequality (Kraft, 1949) implies

Σ_{s∈S} 2^(−⌊(1/c)(a′s − σ(T))⌋) ≤ Σ_{s∈S} 2^(−|E[r,s]|+1) ≤ 1.
By the definition of σ∗(S), this implies σ(T) ≤ σ∗(S). Since T was arbitrary, we obtain σopt ≤ σ∗(S).

It remains to prove that σ(T(P1)) = σopt = σ∗(S), which we will do by induction on n = |S|. For n = 1, the statement is trivial. Now let n ≥ 2. Let sn be the last sink inserted by (P1), which means that a′sn = max{a′s | s ∈ S}. Let S′ = S \ {sn}.

Claim:

frac(σ∗(S)/c) ∈ { frac(a′s/c) | s ∈ S′ },   (5.4)

where frac(x) := x − ⌊x⌋ denotes the fractional part of x ∈ R.

Proof of the claim: If (1/c)(a′s − σ∗(S)) ∉ Z for every s ∈ S, then there is some ε > 0 such that ⌊(1/c)(a′s − σ∗(S))⌋ = ⌊(1/c)(a′s − (σ∗(S) + ε))⌋ for every s ∈ S, which immediately implies a contradiction to the definition of σ∗(S) as in the statement of the theorem. Therefore, (1/c)(a′s − σ∗(S)) is an integer for at least one s ∈ S. If (1/c)(a′s − σ∗(S)) is an integer for some s ∈ S′, then (5.4) holds. Hence, if the claim is false, then (1/c)(a′sn − σ∗(S)) ∈ Z and (1/c)(a′s − σ∗(S)) ∉ Z for every s ∈ S′. Since a′sn − σ∗(S) ≥ a′s − σ∗(S) for every s ∈ S′, this implies

⌊(1/c)(a′sn − σ∗(S))⌋ > max{ ⌊(1/c)(a′s − σ∗(S))⌋ | s ∈ S′ }.   (5.5)

By the definition of σ∗(S), we have

Σ := Σ_{s∈S} 2^(−⌊(1/c)(a′s − σ∗(S))⌋) ≤ 1.

Considering the least significant non-zero bit in the binary representation of Σ, the strict inequality (5.5) implies that this bit corresponds to 2^(−⌊(1/c)(a′sn − σ∗(S))⌋). This implies that

Σ_{s∈S} 2^(−⌊(1/c)(a′s − σ∗(S))⌋) ≤ 1 − 2^(−⌊(1/c)(a′sn − σ∗(S))⌋).

Now, for some sufficiently small ε > 0, we obtain

Σ_{s∈S} 2^(−⌊(1/c)(a′s − (σ∗(S)+ε))⌋) = 2^(−⌊(1/c)(a′sn − σ∗(S))⌋+1) + Σ_{s∈S′} 2^(−⌊(1/c)(a′s − σ∗(S))⌋) ≤ 1,

which contradicts the definition of σ∗(S) and completes the proof of the claim.

Let T′(P1) denote the tree produced by (P1) just before the insertion of the last sink sn. By induction, σ(T′(P1)) = σ∗(S′).
First, we assume that there is some sink s′ ∈ S′ such that within T′(P1)

|E[r,s′]| − 1 < ⌊(1/c)(a′s′ − σ∗(S′))⌋.

Choosing en as the edge of T′(P1) leading to s′ results in a tree T such that

σ∗(S) ≥ σopt ≥ σ(T(P1)) ≥ σ(T) = σ∗(S′) ≥ σ∗(S),

which implies σ(T(P1)) = σopt = σ∗(S).

Next, we assume that within T′(P1)

|E[r,s]| − 1 = ⌊(1/c)(a′s − σ∗(S′))⌋

for every s ∈ S′. This implies

Σ_{s∈S′} 2^(−⌊(1/c)(a′s − σ∗(S′))⌋) = 1 and hence Σ_{s∈S} 2^(−⌊(1/c)(a′s − σ∗(S′))⌋) > 1,

so σ∗(S) < σ∗(S′). By (5.4), we obtain

σ∗(S) ≤ max{ σ ∈ R | σ < σ∗(S′), frac(σ/c) ∈ { frac(a′s/c) | s ∈ S′ } } = σ∗(S′) − c(1 − δ)

for

δ = max{ frac((a′s − σ∗(S′))/c) | s ∈ S′ }.

If s′ ∈ S′ is such that δ = frac((a′s′ − σ∗(S′))/c), then choosing en as the edge of T′(P1) leading to s′ results in a tree T such that

σ∗(S) ≥ σopt ≥ σ(T(P1)) ≥ σ(T) = σ∗(S′) − c(1 − δ) ≥ σ∗(S),

which implies σ(T(P1)) = σopt = σ∗(S) and completes the proof.

Theorem 2. (P2) generates a repeater tree topology T for which l(T) is at most the total length of a minimum spanning tree on {r} ∪ S with respect to ||·||.

Proof: Let n = |S| and for i = 0, 1, …, n, let T_i denote the forest which is the union of the tree produced by (P2) after the insertion of the first i sinks and the remaining n − i sinks as isolated vertices. Note that T_0 has vertex set {r} ∪ S and no edges, while for 1 ≤ i ≤ n, T_i has vertex set {r} ∪ S ∪ {xj | 2 ≤ j ≤ i} and 2i − 1 edges. Let F0 = (V(F0), E(F0)) be a spanning tree on V(F0) = {r} ∪ S such that

l(F0) = Σ_{(u,v)∈E(F0)} ||u − v||

is minimum. For i = 1, 2, . . .
, n, let Fi = (V(Fi), E(Fi)) arise from (V(T_i), E(Fi−1) ∪ E(T_i)) by deleting an edge e ∈ E(Fi−1) ∩ E(F0) which has exactly one end vertex in V(Ti−1) such that Fi is a tree. (Note that this uniquely determines Fi.) Since (P2) has the freedom to use the edges of F0, the specification of the insertion order and the locations of the internal vertices in (P2) imply that

l(F0) ≥ l(F1) ≥ l(F2) ≥ … ≥ l(Fn).

Since Fn = Tn, the proof is complete.

For the ℓ1-norm, the well-known result of Hwang (1976) together with Theorem 2 implies that (P2) is an approximation algorithm for the ℓ1-minimum Steiner tree on the set {r} ∪ S with approximation guarantee 3/2.

We have seen in Theorems 1 and 2 that different insertion orders are favourable for different optimization scenarios, such as optimizing for worst slack or minimum netlength. Alon and Azar (1993) gave an example showing that for the online rectilinear Steiner tree problem the best achievable approximation ratio is Θ(log n / log log n), where n is the number of terminals. Hence, inserting the sinks in an order disregarding the locations, as in (P1), can lead to long Steiner trees, no matter how we decide where to insert the sinks.

The next example shows that inserting the sinks in an order different from the one considered in (P1), but still choosing the edge ei as in (P1), results in a repeater tree topology whose worst slack can be much smaller than the largest achievable worst slack.

Example 1. Let c = 1, d = 0 and a ∈ N. We consider the following sequences of −a's and 0's:

A(1) = (−a, 0),
A(2) = (A(1), −a, 0),
A(3) = (A(2), −a, 0, …, 0), where the trailing block of 0's has length 1 + (2¹ − 1)(a + 2),
A(4) = (A(3), −a, 0, …, 0), where the trailing block of 0's has length 1 + (2² − 1)(a + 2),
…,

i.e. for l ≥ 2, the sequence A(l) is the concatenation of A(l − 1), one −a, and a sequence of 0's of length 1 + (2^(l−2) − 1)(a + 2).
If the entries of A(l) are considered as the required arrival times of an instance of the Restricted Repeater Tree Topology Problem, then Theorem 1 together with the choice of c and d implies that the largest achievable worst slack for this instance equals

⌊ −log₂( l·2^a + (1 + Σ_{i=2}^{l} (1 + (2^(i−2) − 1)(a + 2))) · 2^0 ) ⌋.

For l = a + 1 this is at least −2 − a − log₂(a + 2).

If we insert the sinks in the order specified by the sequences A(l), and always choose the edge into which we insert the next internal vertex such that the worst slack is maximized, then the following sequence of topologies can arise: T(1) is the topology with exactly two sinks at depth 2. The worst slack of T(1) is −(a + 1). For l ≥ 2, T(l) arises from T(l − 1) by

(a) subdividing the edge of T(l − 1) incident with the root with a new vertex x,
(b) appending an edge (x,y) to x,
(c) attaching to y a complete binary tree B of depth l − 2,
(d) attaching to one leaf of B two new leaves corresponding to sinks with required arrival times −a and 0, and
(e) attaching to each of the remaining 2^(l−2) − 1 leaves of B a binary tree ∆ which has a + 2 leaves, all corresponding to sinks with required arrival time 0, whose depths in ∆ are 1, 2, 3, …, a − 1, a, a + 1, a + 1.

Note that this uniquely determines T(l). Clearly, the worst slack in T(l) equals −a − l. Hence for l = a + 1, the worst slack equals −2a − 1, which differs by approximately a factor of 2 from the largest achievable worst slack as calculated above.

This example, however, does not show that there is no online algorithm for approximately maximizing the worst slack, say up to an additive constant of c. Recently, Held and Rotter (2013) presented an O(n log n) algorithm that, given ε > 0 and an initial topology T0, solves the Restricted Repeater Tree Topology Problem for η = 1/2 such that, if the worst slack of the instance is non-negative,

σ(T) ≥ −dnode − ε · max_{s∈S} as and l(T) < (1 + 2/ε) · l(T0) + 2n · dnode

with n := |S|.
The length l(T) of a topology is the sum of its edge lengths. Here, T_0 can be derived from any Steiner tree, for instance a minimum Steiner tree or an approximation of it.

5.6 Topology Generation Algorithm

We now extend our topology generation from Algorithm 2. The basic structure remains the same. However, we now allow the distribution of node delays using η between 0 and 1/2. Algorithm 3 shows the version of the topology generation that we use to construct repeater trees. We use the delay model, but only required arrival times are updated during the process. In terms of Algorithm 2, the choice of s_i, Pl(x_i), and e_i is as follows:

• Similar to (P1), the sinks are ordered by non-increasing criticality.
• Pl(x_i) is chosen such that the netlength increase of the topology is minimized. In general, the new nodes do not lie on top of the root node as in (P1).
• The edge e_i is chosen such that the weighted sum of the resulting topology slack and the netlength increase is minimized.

We hope to reduce netlength by connecting each sink to the chosen candidate edge over as short a distance as possible. However, by doing so, the slack is no longer guaranteed to be optimal, as shown in Theorem 1 for (P1).

The algorithm first sorts the sinks according to criticality. Then, it connects the most critical sink directly to the root and initializes the topology and the rat function accordingly (lines 1–4). The next step is to iterate over all sinks and to add them to the topology one after another. To determine e_i, we compute for each existing edge (v, w) the required arrival time at the root that we would get if we chose the edge (lines 9–26). The Steiner point z is always in the bounding box between v and w due to the ℓ1-norm used. The netlength increase is therefore ||z − Pl(s_i)||. Given the resulting RAT rat at the root, we choose the edge that maximizes ξ · min{rat, 0} − (1 − ξ) · ||z − Pl(s_i)||.
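As a sketch of this selection criterion (function and parameter names are ours; `rat_at_root` stands for the root RAT computed for the candidate edge):

```python
def edge_score(rat_at_root, detour_length, xi):
    """Weighted combination of topology slack and netlength increase.

    Only negative root RATs are penalized (min{rat, 0}), so xi trades
    worst slack against the extra wiring ||z - Pl(s_i)|| needed to
    attach the sink to the candidate edge.
    """
    return xi * min(rat_at_root, 0.0) - (1.0 - xi) * detour_length
```

With ξ = 1 the choice depends only on slack; with ξ = 0 only on the netlength increase.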
Algorithm 3 Topology Generation Algorithm
Input: An instance of the Restricted Repeater Tree Topology Problem
Output: A topology consisting of tree T = (V, E) and embedding Pl
1:  For each sink s ∈ S, compute the criticality σ_s
2:  Sort S such that σ_{s_1} ≤ σ_{s_2} ≤ ⋯ ≤ σ_{s_|S|}
3:  V := {r, s_1}, E := {(r, s_1)}
4:  rat_T(s_i) := a_{s_i} for 1 ≤ i ≤ |S|
5:  for i := 2, …, |S| do
6:    e_i := ∅
7:    bval := −∞
8:    bdist := ∞
9:    for all (v, w) ∈ E do    ▷ Search for the best edge to connect s_i to
10:     Choose z ∈ ℝ² minimizing ||z − Pl(v)|| + ||z − Pl(w)|| + ||z − Pl(s_i)||
11:     α_1 := rat_{s_i} − d_wire · ||z − Pl(s_i)||
12:     α_2 := rat_w − d_wire · ||z − Pl(w)||
13:     rat := max_{η·d_node ≤ b ≤ (1−η)·d_node} min{α_1 − b, α_2 − (d_node − b)}
14:     for (x, y) ∈ E_[δ⁺(r), w] traversed bottom-up do
15:       Let u be the sibling of y
16:       α_1 := rat_u − d_wire · ||Pl(x) − Pl(u)||
17:       α_2 := rat − d_wire · ||Pl(x) − Pl(y)||
18:       rat := max_{η·d_node ≤ b ≤ (1−η)·d_node} min{α_1 − b, α_2 − (d_node − b)}
19:     end for
20:     val := ξ · min{rat − d_wire · ||Pl(r) − Pl(δ⁺(r))||, 0} − (1 − ξ) · ||z − Pl(s_i)||
21:     if val > bval or (val = bval ∧ ||z − Pl(s_i)|| < bdist) then
22:       bval := val
23:       bdist := ||z − Pl(s_i)||
24:       e_i := (v, w)
25:     end if
26:   end for
27:   Create a new Steiner node x_i
28:   V := V ∪ {s_i, x_i}
29:   E := (E \ {(v∗, w∗)}) ∪ {(v∗, x_i), (x_i, w∗), (x_i, s_i)} with e_i = (v∗, w∗)
30:   Pl(x_i) := z ∈ ℝ² minimizing ||z − Pl(v∗)|| + ||z − Pl(w∗)|| + ||z − Pl(s_i)||
31:   for (v, w) ∈ E_[δ⁺(r), s_i] traversed bottom-up do
32:     Let y be the sibling of w
33:     α_1 := rat_y − d_wire · ||Pl(v) − Pl(y)||
34:     α_2 := rat_w − d_wire · ||Pl(v) − Pl(w)||
35:     rat_T(v) := max_{η·d_node ≤ b ≤ (1−η)·d_node} min{α_1 − b, α_2 − (d_node − b)}
36:   end for
37: end for

If two solutions have the same value, we choose the shorter connection. The parameter ξ allows us to scale between topologies that optimize the worst slack and short topologies, depending on the objective of the Repeater Tree Problem.
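The inner maximization over the delay split b that appears in the rat updates of Algorithm 3 has a simple closed form, since the objective is concave in b. A minimal sketch (function name is ours):

```python
def merge_rat(a1, a2, d_node, eta):
    """Compute max over b in [eta*d_node, (1-eta)*d_node] of
    min(a1 - b, a2 - (d_node - b)).

    The unconstrained optimum balances both terms; clamping b to the
    allowed interval then yields the constrained optimum.
    """
    b = (d_node + a1 - a2) / 2.0  # balance point: a1 - b = a2 - (d_node - b)
    b = min(max(b, eta * d_node), (1.0 - eta) * d_node)  # clamp to allowed split
    return min(a1 - b, a2 - (d_node - b))
```

For η = 1/2 this reduces to min{α_1, α_2} − d_node/2; for η = 0 the full node delay may be pushed onto one branch.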
After choosing e_i, we add the new sink to the tree, splitting e_i, and update rat on all affected edges (lines 27–37).

Given an internal node v with children u_1 and u_2, let α_i := rat_T(u_i) − d · ||Pl(u_i) − Pl(v)|| for i ∈ {1, 2}. By setting

rat_T(v) := max_{η·d_node ≤ b ≤ (1−η)·d_node} min{α_1 − b, α_2 − (d_node − b)}

we implicitly maintain the node delay for each edge. If, without loss of generality, α_1 < α_2, then rat_T(v) can be computed in constant time by

rat_T(v) := α_1 − η·d_node − ½ · max{(1 − 2η)·d_node − (α_2 − α_1), 0}.

The additional delays of the outgoing edges,

d_node((v, u_1)) = α_1 − rat_T(v)
d_node((v, u_2)) = d_node − d_node((v, u_1)) ≤ α_2 − rat_T(v),

satisfy our requirements on the node delay.

Theorem 3. The worst case running time of the algorithm is O(n³) for n = |S|, and the best case running time is Ω(n²).

Proof. There are n − 1 iterations of the outer loop of the algorithm (lines 5–37). Each iteration removes one edge from the tree and adds three new ones. When e_i is searched, there are 1 + 2(i − 2) = 2i − 3 edges in the tree. The loop searching for e_i (lines 9–26) is therefore executed Σ_{i=2}^{n} (2i − 3) ∈ Θ(n²) times. The loop computing the RAT at the root (lines 14–19) and the loop updating the RATs after sink insertion (lines 31–36) can be stopped if the RAT does not change at a node. In such a case, no updates will be done on the path to the root. In the best case, both loops only perform a constant number of updates, resulting in an overall best case running time of Ω(n²). In the worst case, the inner loop (lines 14–19) performs a number of iterations linear in the size of the tree, resulting in an overall worst case running time of O(n³).

5.6.1 Handling High Fanout Trees

As we will show in our experimental results, the running time of our algorithm is extremely small for instances with up to 1 000 sinks. Nevertheless, our topology generation, as described above, has a cubic running time.
There are instances with several hundred thousand sinks on actual designs, for which this would lead to intolerable running times. One way to reduce the running time would be to consider only the nearest k edges while inserting a sink, where k is some positive integer. This would require storing the edges as rectangles in a suitable geometric data structure (e.g. a k-d-tree). However, we chose a different approach.

For instances with more than 1 000 sinks, we first apply a clustering algorithm to all sinks, except for the 100 most critical ones if ξ > 0. More precisely, we find a partition S′ = S_1 ∪̇ ⋯ ∪̇ S_k of the set S′ of less critical sinks, and Steiner trees T_i for S_i (i = 1, …, k), such that the total capacitance of T_i plus the input capacitances of S_i is at most maxcap (see Section 4.1.6). Among such solutions we try to minimize the total wire capacitance plus k times the input capacitance of the repeater t∗. For this facility location problem, we use an approximation algorithm by Maßberg and Vygen (2008), which generates very good solutions in O(|S| log |S|) time. We introduce an appropriate repeater for each component; its input pin constitutes a new sink. This typically reduces the number of sinks by at least a factor of 10. If the number of sinks is still greater than 1 000, we iterate the clustering step. Finally, we run our topology generation algorithm as described above.

5.7 Blockages

The topology generation so far did not consider blockages or congestion. Nothing prevents Steiner points from being located on blockages or topology segments from having long intersections with them. For the vast majority of instances, blockages do not play a role because there are none in the bounding box of the involved points. However, some instances can only be constructed properly if one navigates around blockages or considers congestion. We handle blockages in the second step of our topology generation algorithm.
First, we iterate over the topology bottom-up and move each Steiner point s to a free location. Our algorithm searches for the nearest free location in each direction that minimizes the sum of ℓ1-distances between s and its neighbours. Then, we search for a buffer-aware shortest path (see Section 4.2.3) within the blockage grid for each topology edge that crosses a blockage.

A similar approach is described by Zhang et al. (2012) and improved in Zhang and Pan (2014). They take an input topology and calculate slew degradations over blockages. In case of violations, they formulate an ILP and solve it to find a good replacement topology. Huang and Young (2012) propose a similar solution. Held and Spirkl (2014) propose a fast 2-approximation algorithm to create rectilinear Steiner trees that may cross obstacles only for a limited distance.

After performing the shortest path search, some edges might still cross blockages that are small enough to pass over. These edges are split at blockage boundaries such that each topology segment is either completely blocked or free. Neighbouring blocked edges on a chain are merged into a single one.

5.8 Plane Assignment

So far, we have used only one buffering mode, a choice that is often appropriate for shorter connections. For bridging long connections, it might be better to use buffering modes on higher planes than the default ones. Such buffering modes are often faster and use less placement space due to larger repeater spacing.

The third and final step of our topology generation is plane assignment. We use a simple greedy routine to assign buffering modes to a computed topology. They are then used by our repeater insertion routine. We restrict ourselves to the fastest buffering mode for each wire mode. For each buffering mode, we know the repeater spacing. The algorithm processes the nodes of the topology in a BFS traversal starting at the root.
If the slack at the node is negative and the node is not too congested according to the congestion map, then we consider the edge between the node and its parent. The edge is assigned the fastest buffering mode whose repeater spacing is smaller than half the length of the edge. After changing an assignment, the slack values are updated.

The algorithm runs in time O(m log b), where m is the number of edges in the topology and b is the number of buffering modes we might assign. The buffering modes can be sorted by their repeater spacing. If there is a buffering mode with higher spacing and worse timing than another one, then it can be removed because it will never be inserted. The resulting list is sorted increasingly by spacing and decreasingly by delay per length. Thus, for an edge, the best buffering mode can be found by binary search in O(log b) time. It is not necessary to recompute the whole timing after each assignment. It is sufficient to propagate an arrival time delta to the children of the node, so that the timing update takes constant time per node.

The greedy routine is suitable at design stages where the placement of the circuits and the timing are still preliminary. In later design stages, it is desirable to consider congestion better and to optimize the plane assignment for best slack. We will present an extension to the dynamic program used for repeater insertion that maximizes slack and takes advantage of higher planes.

5.9 Global Wires as Topologies

As already mentioned, routing congestion is one of the biggest problems for optimizing current chip designs. The approach described in Section 5.7 works locally without considering the big picture. A global router (see for example Müller (2009)), on the other hand, optimizes the distribution of nets over the whole routing space. A recent trend in the industry is to use the result of global routing for repeater tree topologies. The advantage of using global wires is a reduced expected congestion.
However, global routing has to be adapted to yield results suitable for repeater insertion. One has to consider timing, placement space, blockages and instance sizes.

A global routing that will later be buffered should not use wires that cross blockages over long distances, because such wires cannot be processed in repeater insertion without electrical and timing violations. The available placement space also has to be considered. The number of necessary repeaters can be estimated using suitable buffering modes for the wires. The global router should also consider the timing criticality of nets and sinks when it creates routes. Otherwise, non-critical nets might get short routes at the cost of detours in critical nets. One approach might be to use the topology algorithm presented here as a subroutine within the global router whenever it has to calculate Steiner trees for nets.

The input presented to the global router is often stripped of all buffers and unnecessary inverters. However, some inverters cannot be removed if one wants to preserve the logical correctness of the design. The global router will then use the placement of such an inverter as a constraint on the routing. It will consider two nets for a single instance, connecting the sink of the first net and the source of the second net to the inverter. It is better to consider the inverter and both nets as a single net. This is also the case if existing repeaters are not stripped from the design (see also Section 7.6.1).

6 Repeater Insertion

Finding a topology for a repeater tree instance is the first step in our approach to solve the Repeater Tree Problem. The second step is to insert repeaters along the topology to create a feasible solution. For this we consider the Repeater Insertion Problem and describe how our algorithm solves it.
Instance: An instance (I, T, Bl, M, F) consists of
• an instance I of the Repeater Tree Problem,
• a topology T with embedding Pl connecting the root and sinks of I,
• a set Bl ⊆ E(T) of edges that are blocked,
• a set of buffering modes M, and
• a buffering mode assignment F : E(T) → M.

Task: Find a feasible solution of the Repeater Tree Problem for I minimizing costs such that each repeater lies on a shortest path between the endpoints of a topology edge and all sinks reachable from the repeater in the final tree are also reachable from that edge in the topology.

Figure 6.1: Repeater Insertion Problem

The dominant approach to solving the Repeater Insertion Problem is dynamic programming. An extensive survey of the dynamic programming approach can be found in Alpert et al. (2008), Sections 26.4–26.6. We give a short summary of existing work in Section 6.3. Our main contribution is the repeater insertion of our Fast Buffering algorithm presented in Section 6.2, which, in practice, is considerably faster than the standard dynamic program. Our routine can be characterized as a version of the dynamic program that keeps only one solution at a time. Several heuristics are used to choose a solution that will lead to an overall good solution.

A substantial difference to the dynamic programming approach is that our algorithm is able to change the topology in order to reduce the number of repeaters inserted for preserving parity constraints. Figure 6.2 shows an extreme example where different topologies for the same sink set result in a huge difference in the minimum number of repeaters necessary to realise each of them. Given topology a), our algorithm will often create a solution that lies between both extremes, depending on the criticality of the instance. However, our solution will still fulfill the constraints of the Repeater Insertion Problem. Finally, the dynamic program depends on precomputed repeater positions.
In contrast, our algorithm is free to choose any position along an unblocked edge of the input topology.

Figure 6.2: While topology b) requires only one inverter to realise the indicated sink parities, topology a) would require five.

Our repeater insertion algorithm consists of two parts. In a first step, we assign delay efforts to the edges of the topology by solving a Deadline Problem. This allows us to buffer parts of the topology with less effort and leads to lower resource usage. In a second step, we replace the topology by a repeater tree in a bottom-up fashion. We present in Section 6.3 how we use the standard dynamic programming technique to improve the solution found by the Fast Buffering algorithm. The buffering algorithm we present here is an extension of joint work with Stephan Held, Dieter Rautenbach and Jens Vygen (Bartoschek et al., 2009, 2007b).

6.1 Computing Required Arrival Time Targets

As shown in Section 4.1.1, it is possible to buffer a long line with different delay and power consumption characteristics. For topology generation, we assumed that each edge is buffered such that the fastest delay using the default wire modes can be achieved. After computing arrival times and required arrival times for all nodes of the topology using our delay model, there are sinks that have non-positive slack even if the fastest buffering mode is used on the path from the root. It is obvious that we want to buffer these paths as fast as possible to keep timing constraint violations small. However, other sinks and subtrees, respectively, might have positive slack using the fastest buffering mode. Each edge of such a subtree can potentially be slowed down to reduce the overall power consumption of the resulting repeater tree. In the input to the Repeater Insertion Problem, each edge has a buffering mode m assigned. We have the possibility to choose another buffering mode.
At this point, we do not want to change the layer assignment of the edges. Therefore, we restrict ourselves to the alternative buffering modes in M_m (see Section 4.1.3).

Instance: An instance consists of
• an instance I of the Repeater Tree Problem,
• a topology T for I,
• a set of buffering modes M,
• a buffering mode assignment F : E(T) → M for each edge, and
• a maximal subtree T_z of T rooted at z ∈ V(T) such that, using our delay model, rat_T(z) − at_T(z) > 0.

Task: Let E′ be the set E(T_z) of edges reachable from z, including the edge leading to z if z ≠ r. Find an assignment F′ : E(T) → M with F′(e) = F(e) if e ∉ E′ and F′(e) ∈ M_{F(e)} otherwise, and with rat_T(s) − at_T(s) ≥ 0 for all sinks s reachable from z, such that the total cost

Σ_{e=(v,w)∈E′} F′(e)_p · ||Pl(v) − Pl(w)||

is minimized.

Figure 6.3: Buffering Mode Assignment Problem

We call the problem of assigning buffering modes to a subtree the Buffering Mode Assignment Problem. It is shown in Figure 6.3. If the slack at the root of our topology is positive, then the whole tree is an instance of buffering mode assignment. Otherwise, each maximal subtree with positive slack is considered separately. The initial assignment for such a subtree is a feasible assignment, but probably not the cheapest one. We look for cheaper solutions, but we do not let the slack at the root become negative. Thus, the result of buffering mode assignment does not change RATs outside of the considered subtree. The problem can be solved independently for each subtree.

The Buffering Mode Assignment Problem is very similar to the Discrete Deadline Problem (see for example Skutella (1998)), a special case of the Time-Cost Tradeoff Problem (Kelley, 1961; Fulkerson, 1961). Figure 6.4 shows the Discrete Deadline Problem. The graph P in an instance of the problem is called the project graph.
Each edge e corresponds to a task that has to be executed, and X_e is the set of possible alternatives to finish the task with different execution times and costs. An instance of the Buffering Mode Assignment Problem can be transformed into an instance of the Discrete Deadline Problem. We show how to transform a subtree rooted at z with parent parent(z). The transformation is done by (a) using the subtree induced by E′ as a project graph, (b) adding a new node s and an edge (s, parent(z)) with the single execution time at_T(parent(z)) and costs 0, (c) adding a new node t and edges (s_i, t) for all sinks s_i reachable from z with single execution time −rat_{s_i} and costs 0, (d) setting execution times according to valid buffering modes for all topology edges, and (e) setting the deadline to 0.

Instance: An instance consists of
• a directed graph P = (V, E) with two nodes s, t ∈ V such that each node is reachable from s and t is reachable from each node,
• for each edge e ∈ E a set X_e of execution times x_e with costs c_e(x_e), and
• a deadline D.

Task: For each edge e ∈ E find an execution time x∗_e and an assignment of arrival times a : V → ℝ with

a(s) ≥ 0,
a(t) ≤ D, and
a(v) + x∗_{(v,w)} ≤ a(w) for all (v, w) ∈ E,

such that Σ_{e∈E} c_e(x∗_e) is minimized.

Figure 6.4: Discrete Deadline Problem

Each topology edge (v, w) has the initial buffering mode F((v, w)) assigned. We set

X_{(v,w)} := { x_m | m ∈ M_{F((v,w))} }

with

x_m := m_d · ||Pl(v) − Pl(w)||,
c_{(v,w)}(x_m) := m_p · ||Pl(v) − Pl(w)||.

If z is the root of the whole topology, then we just connect s to z during project graph construction. As mentioned earlier, each of our instances of the Buffering Mode Assignment Problem has a feasible solution. It follows that using the arrival times of the delay model as arrival times a in the Discrete Deadline Problem yields a feasible solution.
On the other hand, any feasible solution to the Discrete Deadline Problem results in a feasible buffering mode assignment if we choose, for each edge with execution time x_m, the buffering mode m. The Discrete Deadline Problem is NP-hard, but Halman et al. (2008) have shown that there exists an FPTAS on series-parallel networks such as repeater tree topologies.

6.1.1 Linear Time-cost Tradeoff

As our delay model is only a rough approximation of the reality after buffering, it makes no sense to spend too much effort on solving the Buffering Mode Assignment Problem exactly. Furthermore, it turns out that for our purpose, it is sufficient to solve a linear relaxation of the problem.

Figure 6.5: Piecewise-linear relaxation of buffering modes. The buffering mode m_3 is dominated by a linear combination of m_2 and m_4 and can be cancelled.

If there is an edge e = (v, w) with execution time x∗_e between the execution times x_a and x_b of buffering modes a, b such that x∗_e = α·x_a + (1 − α)·x_b, then e can be divided into two edges such that the first edge has length α·||Pl(v) − Pl(w)|| and buffering mode a, and the second edge has length (1 − α)·||Pl(v) − Pl(w)|| and buffering mode b.

The linear relaxation of the problem uses piecewise linear time-cost functions for each edge. For example, Figure 6.5 shows how delays and costs are relaxed for an edge with four alternative buffering modes {m_1, m_2, m_3, m_4} that are sorted by increasing delay. After removing buffering modes (m_3 in the example) that are dominated by linear combinations of others, the result is a convex piecewise-linear time-cost function. It approximates the non-dominated part of the delay-power tradeoff curve of a wire mode. One example is the red part of the curve shown in Figure 4.4. Now the Time-Cost Tradeoff Problem can be solved efficiently by solving a Minimum-Cost Flow Problem (MCF).
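The dominance pruning can be sketched as a lower-convex-hull computation (an illustrative sketch under our own representation: each mode is a (delay, cost) pair per unit length, sorted by increasing delay with decreasing cost):

```python
def prune_dominated(modes):
    """Keep only modes on the lower convex hull of (delay, cost) points.

    modes: list of (delay, cost) pairs sorted by increasing delay.
    A mode is dropped if it lies on or above the segment between its
    neighbours, i.e. it is dominated by a linear combination of them.
    """
    hull = []
    for d, c in modes:
        while len(hull) >= 2:
            (d1, c1), (d2, c2) = hull[-2], hull[-1]
            # cross product of (p2 - p1) x (p - p1); <= 0 means the
            # middle point p2 is on or above the chord p1 -> p, so drop it
            if (d2 - d1) * (c - c1) - (d - d1) * (c2 - c1) <= 0:
                hull.pop()
            else:
                break
        hull.append((d, c))
    return hull
```

The surviving breakpoints define the convex piecewise-linear time-cost function used in the relaxation.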
The construction is described in Fulkerson (1961) or Lawler (2001). Algorithms to solve MCF problems can be found in Korte and Vygen (2012), Chapter 9. The input to the MCF algorithm is the project graph where each edge is replaced by a chain of at most |M| edges, |M| being the number of buffering modes and thus the maximum number of sampling points of a time-cost function. Then, each edge in the chain is doubled. After solving the MCF, we get a node potential that corresponds to an arrival time assignment a.

Theorem 4. The effort assignment adds at most |S| new vertices and edges into the topology.

Proof. The tree T in the spanning tree structure of any basic spanning tree solution of the MCF spans all vertices of our original topology. For any r-s-path with s ∈ S, it omits at most one edge. By complementary slackness, all edges in T define integral buffering modes. Therefore, there are at most |S| fractional edges which are divided.

Note that the Network Simplex Algorithm always maintains basic solutions. Furthermore, one can transform non-basic optimum solutions into basic ones in at most |E| pivots.

6.1.2 Effort Assignment Algorithm

The algorithm we use to solve the Buffering Mode Assignment Problem is outlined in Algorithm 4 and called AssignEffort. The input is a topology and the allowed buffering modes for every edge.

Algorithm 4 Buffering Mode Assignment Problem
1: procedure AssignEffort(T)
2:   Compute timing using the fastest buffering mode.
3:   for all n ∈ V(T) with slack(n) > 0 and slack(parent(n)) ≤ 0 do
4:     Create MCF instance I for subtree rooted at n
5:     Solve I
6:     Assign buffering modes according to node potentials in I
7:   end for
8: end procedure

First, we assign the fastest buffering mode to each edge of the topology and recompute the timing using Algorithm 1 (TimeTree). For all sinks with slack smaller than or equal to 0, the fastest buffering mode is kept.
Then, we identify instances of the Buffering Mode Assignment Problem using a DFS search and process each subtree separately as a Deadline Problem¹. We solve the min-cost-flow formulation and compute node potentials. After having filtered out the sinks with non-positive slacks, we know that the problem is feasible and that the potentials are feasible. Given potentials π : V(T_n) → ℝ, the delay we want to assign to an edge (v, w) is π(v) − π(w). For non-fractional edges, the corresponding buffering mode is used. For fractional edges, we do not subdivide the edge as outlined in the previous section. Instead, we just round to the cheapest buffering mode faster than the fractional solution. After rounding, the delays on all edges correspond to a buffering mode. We update the buffering mode assignment F accordingly for each edge e ∈ E(T_n).

¹ Note that it is possible to merge all deadline problems into a single one because they correspond to disjoint subtrees of the topology.

6.2 Repeater Insertion Algorithm

We now describe the main part of our repeater insertion algorithm. The input is an instance of the Repeater Insertion Problem (I, T_in, Bl, M, F), for example as computed by our topology generation algorithm. The result will be a repeater tree R = (T, Pl, R_t, R_w) (see Section 3.2). We first update the buffering mode assignment F of the input topology T_in using AssignEffort. Then, the algorithm traverses the topology in post-order fashion. We create a pair of so-called clusters at each node of the input topology. During the topology traversal, leaf nodes are moved with their clusters towards their parents (Move operation). Eventually, the clusters are merged with the clusters at the parent of their node (Merge operation). The node and its clusters are then removed from the topology. At the same time, we insert repeaters (mostly inverters) and build up T. Thus, the topology is successively replaced by the final repeater tree.
First, we will explain clusters. Then, we explain the timing model that we use in our algorithm. Finally, we describe the main parts of the algorithm.

6.2.1 Cluster

A cluster C is a triple (S(C), M(C), P(C)) which is assigned to a node V(C) in the topology and consists of
• a set of sinks S(C) containing pins corresponding to sinks of the original repeater tree instance as well as input pins of repeaters that have already been inserted earlier,
• a buffering mode M(C) ∈ M or the empty set, and
• a so-called merge point P(C) ∈ ℝ².

By an empty cluster we mean (∅, ∅, (0, 0)). The position of a cluster is always the same as the position of the node to which it is assigned: Pl(C) = Pl(V(C)).

Definition 7. We say that a pair of clusters (C⁺, C⁻) at a node is in parallel mode if the sink sets S(C⁺) and S(C⁻) are both non-empty.

For a cluster pair (C⁺, C⁻) in parallel mode, the merge points P(C⁺) and P(C⁻) are both defined. They store the last location of the cluster at which its sink set was changed. It is the location where the cluster pair entered parallel mode if the sink set has not changed since then. Figure 6.6 shows a cluster pair in parallel mode and the merge points for both parities. In all pictures showing clusters, we depict cluster pairs as two stacked rectangles, a green one (above) for positive sinks and a red one (below) for negative sinks. As both clusters are always assigned to a node at the same position, we do not show the node explicitly.

Figure 6.6: Example for a cluster pair (C⁺, C⁻) in parallel mode. Their current position is Pl(C⁺) = Pl(C⁻) = z. We have S(C⁺) = {c}, S(C⁻) = {a, b, d}, P(C⁺) = x and P(C⁻) = y. Parallel mode was entered at point x; the last negative sink entered at point y.

By moving a cluster and adding repeaters, we want to realize a repeater chain that corresponds to the buffering mode m := M(C) of the cluster.
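As a sketch of the data structure (names are ours and hypothetical, mirroring the triple above):

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    """A cluster (S(C), M(C), P(C)) attached to a topology node."""
    sinks: set = field(default_factory=set)  # S(C): instance sinks or repeater input pins
    mode: object = None                      # M(C): buffering mode, or None for the empty set
    merge_point: tuple = (0.0, 0.0)          # P(C): last location where S(C) changed

def in_parallel_mode(c_plus: Cluster, c_minus: Cluster) -> bool:
    # A cluster pair is in parallel mode iff both sink sets are non-empty.
    return bool(c_plus.sinks) and bool(c_minus.sinks)
```

An empty cluster corresponds to `Cluster()`, i.e. (∅, ∅, (0, 0)).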
We say that a cluster has target slews m_s and a target repeater m_t (see Section 4.1.3). The resulting wires use either m_h or m_w as wiring mode, depending on the direction.

6.2.2 Initialization

We want to move the nodes of the input topology T_in but keep the instance sinks in place. Therefore, we start the initialization by replacing each sink s ∈ S in T_in by a new node v_s at the same place. Then, we assign a pair of empty clusters, one for each parity, to each node in the modified topology. Finally, each sink is added to the sink set of the cluster at node v_s with the same parity, resulting in the cluster ({s}, F(e), (0, 0)) with e being the edge incident to v_s. We initialize the resulting tree by T := (S, ∅).

6.2.3 Timing Model during Repeater Insertion

Our algorithm depends on the timing model during repeater insertion to guide its decisions. In the process of the algorithm, there are three different structures we maintain:
• At the top, there are the remaining parts of the initial topology T_in with cluster pairs at all nodes.
• The bottom T is a set of subtrees that will be part of the final repeater tree. At the beginning, the bottom consists of the instance sinks. Each inserted repeater extends T, possibly merging subtrees; the result is a tree rooted at the new repeater, which is then part of T.
• Clusters connect the topology with the final tree. While each cluster is associated with a node in the topology, its sink set consists of roots in T.

For the resulting repeater tree, we also maintain the node placement Pl, the repeater assignment R_t and the wiring mode assignment R_w.

Figure 6.7: Topology, clusters, and resulting tree during repeater insertion. So far, the final tree forest consists of all sinks in clusters and the net n with three sinks and a driving inverter.

Figure 6.7 shows a possible intermediate state of our buffering algorithm.
At the top, we have the remaining parts of the topology (blue edges) with cluster pairs at all nodes. The cluster pair at the root is not shown. Dashed black lines show for each cluster a Steiner tree between the cluster and its sinks. Solid lines show the final repeater tree. In this example, the set of final trees consists of the net n with its driving repeater and all sinks.

We maintain additional information during the course of the algorithm:
• For each cluster sink s, we know a pair of slew limits Sl(s), a required arrival time function rat(s), and the pin capacitance cap_in(s).
• For each cluster C, we maintain a pair of slew limits Sl(C), a pair of slew targets St(C), the load capacitance cap(C), and a required arrival time function rat(C).
• For each node v in the topology, we have an arrival time at_{T_in}(v) and a required arrival time value rat_{T_in}(v) coming from our delay model.

Note that during buffering, rat is a function that, given a cluster sink or a cluster, returns a required arrival time function which has to be evaluated for a slew. We now explain the data structures in more detail:

Cluster Sinks. Cluster sinks are either instance sinks or input pins of inserted repeaters. For instance sinks, the required arrival time function and the slew limit pair are given with the input. Each inserted repeater (see below) drives a set of cluster sinks. A Steiner tree is created between the repeater and the sinks, forming a new net. The new net is then extracted as described in Section 3.2.1 using a minimum Steiner tree. Given the Elmore delay for each sink, the required arrival times and slew limits can be propagated backwards to the source of the net (see also Section 2.6f), where they are merged. The results can then be propagated to the repeater's input pin. The input pin is then treated as a new cluster sink with slew limits and required arrival times.
Clusters

Each time a cluster is modified, for example when the cluster is moved or a sink is added, we recompute the timing of the cluster. The cluster and its sinks are treated as a net. A Steiner tree connecting all pins is computed and extracted using the wiring modes of the cluster's buffering mode. Then, similarly to the previous section, the rat functions and slew limits are propagated backwards to the root of the Steiner tree. In addition, the slew target of the cluster's buffering mode is treated as a separate slew limit for each cluster sink and propagated backwards, resulting in St(C). The capacitance cap(C) of a cluster is the sum of the sink pin capacitances and the wire capacitances of the segments in the Steiner tree.

Topology

Delays in the unbuffered topology are estimated using the delay model introduced in Section 5.1 with parameters d_node and η. For each edge e, we have a wiring delay F(e)_d as it is given by the edge's buffering mode F(e). The timing of the topology is calculated by treating each non-empty cluster as a virtual node of the topology connected to its associated node via a zero-length edge. For a cluster C at node v, the RAT in the topology rat_{T^in} is given by

    rat_{T^in}(C) := min{rat^r(C)(F(e)^r_s), rat^f(C)(F(e)^f_s)}    (6.1)

with e being the edge pointing to v. We assume that the topology will try to reach the cluster with the edge's target slew F(e)_s. For a node with two non-empty clusters (e.g. node a in Figure 6.7), we assume that there is an additional virtual node with both virtual cluster nodes as children.

Figure 6.8: An inverter is searched for the positive cluster of pair (C+, C−). Its input pin will become a sink in S(C−). The position of the inverter is P(C+).

The virtual node is then connected via a zero-length edge to the original node. In the resulting virtual tree, each node has at most two outgoing edges.
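Equation 6.1 evaluates the rise and fall RAT functions at the edge's rise and fall target slews and takes the worse of the two. A minimal sketch, where the linear RAT functions are purely illustrative assumptions:

```python
# Sketch of Equation 6.1: the RAT of a cluster C in the topology is the worse
# (minimum) of its rise and fall RAT functions, each evaluated at the
# corresponding target slew of the edge e pointing to the cluster's node.
# The linear RAT functions below are purely illustrative.

def rat_topology(rat_rise, rat_fall, target_slew_rise, target_slew_fall):
    return min(rat_rise(target_slew_rise), rat_fall(target_slew_fall))

r = rat_topology(lambda s: 100.0 - 0.5 * s,   # rat^r(C)
                 lambda s: 95.0 - 0.4 * s,    # rat^f(C)
                 target_slew_rise=20.0, target_slew_fall=20.0)
```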
The required arrival time can then be computed for each node, discarding the values for virtual nodes.

6.2.4 Finding a new Repeater

At certain stages of the buffering algorithm, we insert a repeater that drives the sinks of a cluster or test the effect of such an insertion. We describe this operation for a cluster C+, which is part of a cluster pair (C+, C−). The operation is completely analogous (exchanging + and −) for cluster C−. It will be applied only to non-empty clusters. After inserting a new repeater for C+, its input pin is a new sink that is inserted into an existing cluster C'. This cluster can be

1. C+ itself (if we insert a buffer along a path),
2. C− (if we insert an inverter along a path), or
3. a cluster from a different cluster pair (during a Merge operation).

The new repeater is going to drive all sinks in S(C+). The location of the new repeater depends on the mode of the cluster pair (C+, C−). If the cluster pair is in parallel mode, we insert a repeater at position P(C+). If the cluster pair is not in parallel mode, the location of the new repeater is the current position Pl(C+) of the cluster. The operation is called InsertRepeater. It takes three parameters: the cluster for which a repeater is searched, the cluster to which the new sink should be added, and the type of repeater we want to insert (buffer or inverter).

Figure 6.8 shows the situation when an inverter is searched for the positive cluster in parallel mode. The new inverter should be inserted into the negative cluster C' = C− of the cluster pair. The routine first computes a Steiner tree for cluster C' containing the new sink position. Then, for a repeater t of the requested type, the required arrival time function and slew limits are computed at the input pin using the load it has to drive, C+:

    rat(t) := ratinv_t(cap(C+), rat(C+))
    Sl(t) := slewinv_t(cap(C+), Sl(C+)).
A Steiner tree is extracted using the wiring modes from buffering mode M(C') such that we have an Elmore delay rc_i for each sink i ∈ S(C') ∪ {t}. Finally, the RAT function at C' is computed as

    rat := min_{i ∈ S(C') ∪ {t}} ratinv(rc_i, rat(i)).

Using the resulting capacitance cap(C') of cluster C', we can compute a new required arrival time for C' (see Equation 5.2) and propagate it towards the root (we have to rebalance d_node with side branches), where we can calculate a slack σ_t. The weighted slack using the power-time tradeoff ξ is

    σ*_t := ξ min{σ_t, 0} − (1 − ξ) pwr(t).

Finally, we assume that the resulting cluster C' is driven by repeater M(C')_t with input slews M(C')_s. For each sink i ∈ S(C') ∪ {t}, we propagate the slews through the Steiner tree, resulting in slew pair s_i. We then compute the sum of slew violations:

    s^vio_t := Σ_{i ∈ S(C') ∪ {t}} max{s_i − Sl(i), 0}.

We also add possible load violations at both repeaters:

    c^vio_t := max{cap(C') − loadlim(M(C')_t), 0} + max{cap(C+) − loadlim(t), 0}.

After processing all repeaters, we have for each of them an estimated weighted slack σ*_t, its power consumption pwr(t), the load violation c^vio_t, and the slew violation s^vio_t. We choose the repeater that lexicographically minimizes

    (c^vio_t, s^vio_t, −σ*_t, pwr(t)).

After having chosen a repeater, we update the resulting tree. The Steiner tree behind the new repeater and the repeater itself are merged into T and Pl, the function R_t is updated to reflect the chosen repeater, and R_W is updated to use the wiring modes from M(C+) for the new edges.

Algorithm 5 Buffering Algorithm
 1: procedure Buffering(T^in)
 2:     AssignEffort(T^in)
 3:     Initialize the topology for buffering T^in.    ▷ See Section 6.2.2
 4:     Initialize result (T, Pl, R_t, R_W).
 5:     while |V(T^in)| > 0 do
 6:         Choose leaf v ∈ V(T^in)
 7:         if Pl(v) ≠ Pl(parent(v)) then
 8:             Move(v)
 9:         else
10:             Merge(v)    ▷ Results in the removal of v
11:         end if
12:     end while
13:     ConnectRoot
14: end procedure
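The control flow of Algorithm 5 can be sketched in runnable form. Move and Merge are reduced to stubs here (a full jump to the parent position and plain leaf removal), which is a deliberate simplification of the operations described in the following sections:

```python
# Runnable skeleton of Algorithm 5: leaves of the topology are moved towards
# their parents and merged until only the root remains. Move and Merge are
# stubs; the real operations are those of Sections 6.2.6 and 6.2.7.

def buffering(parent, positions, root):
    """parent: child -> parent map of the topology; positions: node -> point."""
    nodes = set(parent) | {root}
    while len(nodes) > 1:
        # A leaf is a non-root node that is no remaining node's parent.
        leaf = next(v for v in nodes if v != root and v not in parent.values())
        p = parent[leaf]
        if positions[leaf] != positions[p]:
            positions[leaf] = positions[p]   # Move stub: jump to the parent
        else:
            nodes.remove(leaf)               # Merge removes the leaf
            del parent[leaf]
    return positions
```

Each iteration either moves a leaf or removes it, so the loop terminates once only the root is left.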
6.2.5 Buffering Algorithm

The overall structure of the Buffering algorithm is described in Algorithm 5. The input to the algorithm is a topology. After having assigned new buffering modes to the topology, the data structures are initialized as described above. The topology is then successively modified to create a repeater tree. Leaves are moved towards their parent nodes and merged with them until only the root node is left. In a final step, the last remaining sinks are connected to the root. In the next section, we describe the Merge operation first because it is also used in the Move operation, which we explain later.

6.2.6 Merging operation

When some node l and its cluster pair have been moved to the position of another cluster pair, they will be merged. Let (C_l+, C_l−) be the cluster pair that arrives along arc e at the cluster pair (C_r+, C_r−), which is at the tail of arc e. We compute the merged cluster pair (C+, C−) using the Merge operation. If |S(C_l+)| · |S(C_r+)| = 0 and |S(C_l−)| · |S(C_r−)| = 0, the merging operation is straightforward. We set C+ := C_l+ if |S(C_r+)| = 0 and C+ := C_r+ otherwise. The same is done for C−. If the resulting cluster is not in parallel mode, the merge point is not updated. Otherwise, the merge point is set to the current cluster position for both parities. In the other cases, we have five options: inserting an inverter driving one of the four clusters C_l+, C_l−, C_r+, C_r−, or merging clusters of the same parity without inserting any inverter. Note that this does not exclude the possibility of inserting an inverter later, as merge points are (re)defined if there are sinks of both parities after merging. Therefore, we do not evaluate possibilities that can be realised by resolving parallel mode later.

Figure 6.9: The possible merge configurations (Cases 1 to 15) that are tested during a Merge operation.
In each case, the clusters (C_l+, C_l−) arrive from the left and the clusters (C_r+, C_r−) arrive from the right side of the resulting cluster pair. We evaluate the remaining fifteen possibilities of inserting inverters in front of one or more clusters as shown in Figure 6.9. For example, in Case 12 we would first put an inverter in front of C_l+ and put its input pin as a sink into C_l−. Then, we would add another inverter in front of C_l−. The result would be C+. We would also put an inverter in front of C_r+ and add its input as a sink into C_r−, resulting in C−. Possibilities are only evaluated if we do not have to insert a repeater in front of an empty cluster. Table 6.1 shows which of the fifteen cases are evaluated if a set of clusters is empty. For example, if C_l+ and C_r+ are empty (first row in the table), then only Cases 1, 3, 5, 9, 11, 13, 15 are evaluated. In all other cases, we would try to insert one or two repeaters either in front of C_l+ or C_r+.

    C_l+  C_l−  C_r+  C_r−   Cases
    ∅           ∅           1, 3, 5, 9, 11, 13, 15
          ∅           ∅     1, 2, 4, 8, 10, 12, 14
    ∅                       1, 3, 4, 5, 7, 9, 10, 11, 13, 15
          ∅                 1, 2, 4, 5, 6, 8, 10, 11, 12, 14
                ∅           1, 2, 3, 5, 6, 8, 9, 11, 13, 15
                      ∅     1, 2, 3, 4, 7, 8, 9, 10, 12, 14
                            all

Table 6.1: Evaluated cases depending on which of the four clusters are empty (∅). The table shows which of the fifteen cases of Figure 6.9 are considered.

To evaluate a case, repeaters are tentatively inserted using the InsertRepeater function. Inverters are used if they are available. In the case that we would be adding two inverters in front of an input cluster without additional side sinks, we add a single buffer instead. For example, this can happen if, in Case 12 as explained above, cluster C_l− is empty at the beginning. Then only a single buffer is added to drive C_l+, resulting in C+. Similarly to the InsertRepeater function, the resulting cluster pair has required arrival time functions at the current position and load capacitances.
We use them to compute a slack in our delay model. We evaluate power consumption, slew violations, and load violations for both clusters in the same way as in InsertRepeater and add the values up, together with the values for each InsertRepeater invocation. As a result, we get for each case a slack σ, a sum of slew violations s^vio, a sum of load violations c^vio, and a total power consumption pwr. We compute the weighted slack using the power-time tradeoff ξ:

    σ* := ξ min{σ, 0} − (1 − ξ) pwr.

Among all cases, we choose the solution that lexicographically minimizes

    (c^vio, s^vio, −σ*, pwr).

We realise the chosen case and update our data structures accordingly. At the end, we replace (C_r+, C_r−) by (C+, C−) and remove node l together with its clusters and incoming edge from the topology.

Handling Different Buffering Modes

During the Merge operation, it can happen that we have to merge clusters with different buffering modes. In such a case, we treat both clusters as if they had the wiring mode with the better wire delay. This prevents us from arbitrarily worsening the delay to the sinks that are currently in the cluster with the better wire delay. For example, we would not set a cluster that drives a long distance on higher planes to the default planes.

6.2.7 Moving operation

We now describe how to move cluster pairs within the topology towards the root. A leaf node of the current topology is moved together with its associated cluster pair (C+, C−) along the incoming edge. Algorithm 6 shows the Move operation. First, we have to decide how far both clusters can be moved. The procedure RemainingDistance computes the maximum moving distance for a given cluster C. If the sink set S(C) is empty, we return an infinite distance. Otherwise, we assume that the repeater M(C)_t that is optimal for the cluster's buffering mode M(C) drives a wire segment and the cluster's sinks. We maximize the length of the wire segment such that slew targets and limits are not violated.
Algorithm 6 Move Procedure
 1: Let p = parent(v)
 2: l+ := RemainingDistance(C+(v))
 3: l− := RemainingDistance(C−(v))
 4: l := min{l+, l−}
 5: if l ≥ ||Pl(v) − Pl(p)|| then
 6:     Pl(v) := Pl(p)
 7: else
 8:     if |S(C+(v))| > 0 and |S(C−(v))| > 0 then
 9:         ResolveParallel(v)
10:     else
11:         if Bl((p, v)) = 1 then
12:             InsertRepeater(v)
13:             Pl(v) := Pl(p)
14:         else
15:             Choose z with ||Pl(v) − z|| = l minimizing ||z − Pl(p)||
16:             Pl(v) := z
17:             InsertRepeater(v)
18:         end if
19:     end if
20: end if

Figure 6.10: The maximum distance x is searched by which cluster C can be moved such that, given slew targets at the input a of an optimal repeater, slew limits and slew targets at C are not violated.

Figure 6.10 shows the situation for a positive cluster C. A wire segment with length x is added in front of the cluster. Let m := M(C) be the buffering mode stored at cluster C. Repeater m_t drives the resulting wire, and we assume slew pair m_s at input pin a. Let rc be the Elmore delay of the wire segment. We can now compute the slew pair arriving at C:

    s_out = wireslew(rc, slew_{m_t}(x · m_wirecap + cap(C), m_s)).

We then search the maximum x such that s_out ≤ Sl(C) and s_out ≤ St(C) using binary search. After computing the remaining distance for both clusters, we take the minimum for l. If a node and both clusters can be moved to the parent's place, the move is performed and both clusters are updated. Otherwise, we have to insert a repeater. In non-parallel mode, one of the two clusters, say C−, is empty and we insert a repeater for C+ using InsertRepeater. We use a buffer if m_t is a buffer and we use an inverter otherwise. The position of the new repeater depends on whether the edge is blocked or not. For a blocked edge, the repeater is added at the head of the edge. The resulting cluster is moved to the parent node, and the cluster timing is updated.
For an unblocked edge, we search a point along the path to the parent node such that the distance to the current position is l and the distance to the parent is minimized. The cluster is then moved and the solution from InsertRepeater is realised.

Resolving Parallel Mode

Both clusters are non-empty in parallel mode, and merge points are defined for both. We resolve such a situation by treating the cluster pair (C+, C−) as two cluster pairs, (C+, C') and (C'', C−), with empty dummy clusters C', C'' and using the Merge operation. However, we restrict the procedure to choose from Case 2 and Case 5 in Figure 6.9, the two valid operations that directly resolve parallel mode.

Running time

We stop the binary search as soon as the difference between the upper and the lower bound gets smaller than the width lt_min of the smallest repeater. The running time of a single invocation of RemainingDistance is in O(log(l_max/lt_min)) with l_max being the length of the longest edge in the topology.

6.2.8 Arriving at the root

When the last leaf node arrives at the root, we have to connect the remaining cluster pair to the root pin. As we get the load-dependent arrival times from the root, we no longer depend on the topology delay model. Instead, we enumerate all possible solutions for connecting the root via a sequence of zero, one, or two repeaters. If the clusters are not in parallel mode, only one cluster has a non-empty sink set, and all sequences that result in a correct parity are connected to it. We then create a chain of zero, one, or two repeaters at the root for all combinations of repeaters that have the correct parity.
The chain connects the root to the cluster sinks. Arrival times and slews are propagated from the root to the cluster. For each combination that we evaluate, we get an estimated slack σ, the sum of slew violations s^vio, the sum of load violations c^vio, and the total power consumption pwr. Similarly to the InsertRepeater and Move operations, we compute the weighted slack σ* = ξ min{σ, 0} − (1 − ξ) pwr and lexicographically minimize

    (c^vio, s^vio, −σ*, pwr).

If the clusters are in parallel mode, we first search for an inverter for each of the two clusters using InsertRepeater (in contrast to resolving parallel mode in the Move operation) and then try all combinations in the same way as in the non-parallel case. The overall best solution is then chosen using the same criteria as above, and all data structures are updated accordingly, creating the final repeater tree.

6.2.9 Running Time

It is possible that the buffering algorithm we presented does not terminate if it gets stuck making no progress in the Move operation. This could happen, for example, if the cluster is not allowed to move but inserting any repeater in front of the cluster creates the same cluster or one with worse constraints. To prevent this problem, we move at least the width of the smallest repeater before we insert a new one. This is no limitation because, after legalization of the repeaters' placement, no two repeaters are allowed to overlap. We also limit the number of sinks allowed in a cluster by a constant, as is often done in practice by designers. This also allows us to bound the running time necessary to evaluate a possibility in InsertRepeater, in Merge, or for the root connection by a constant, as Steiner trees are computed over a bounded set of terminals. In practice, we run the Network Simplex Algorithm to solve the Buffering Mode Assignment Problem. However, it is not a polynomial-time algorithm.
For running time considerations, we use Orlin's algorithm (Orlin, 1993), which solves the Minimum Cost Flow Problem in O(m log m(m + n log n)) with m being the number of edges and n the number of nodes. As we work on a series-parallel graph with roughly twice as many edges as nodes, the running time becomes O(m² (log m)²). Converting a solution into a basic tree solution takes at most m iterations with linear running time. Finding the AssignEffort solution therefore has a worst-case running time of O(m² (log m)²).

Theorem 5. Given an instance of the Repeater Insertion Problem with input topology T^in, set M of buffering modes, and repeater library L, the worst-case running time of the algorithm is

    O(C m² |L| + (r + m)(log(l_max/lt_min) + C m |L|) + C |L|² + (|M| m)² (log(|M| m))²)

with m = |E(T^in)| and r being the number of inserted repeaters in the output. The computation of each Steiner tree during the algorithm is bounded by C.

Proof. To solve the Minimum Cost Flow Problem, an input graph is constructed with at most 2|M|m edges. The assignment itself runs in linear time. In total, AssignEffort runs in O((|M| m)² (log(|M| m))²). A single invocation of InsertRepeater runs in O(C |L|) time in the best case and O(C m |L|) in the worst case. The worst case arises if, for calculating a slack for a solution, we have to traverse a significant number of edges of the topology. A call to Merge makes a bounded number of calls to InsertRepeater and is therefore also in O(C m |L|). There are O(m) calls to Merge, resulting in a total worst-case running time of O(C m² |L|). Move executes the binary search in a worst-case running time of O(log(l_max/lt_min)), where l_max is the longest topology edge and lt_min is the smallest repeater width. The binary search is followed by at most one call to Merge or InsertRepeater. The number of calls to Move is bounded by r + m. The total worst-case running time is in O((r + m)(log(l_max/lt_min) + C m |L|)).
Connecting to the root is in O(C |L|²) as at most two repeaters are tried, and each combination can be evaluated in constant time because there is only one edge left to the root. Putting everything together, we get the claimed running time.

6.2.10 Repeater Insertion - Summary

As we show in the experimental results, our repeater insertion algorithm produces good results very quickly. For repeater libraries as they appear in practice, it generally finds a solution without capacitance violations if such a solution exists. Slew violations are also avoided most of the time, but they appear slightly more often than capacitance violations. This is mainly due to tighter slew limits and the fact that the Steiner trees used for net extraction can change their topology if a sink or the driver is slightly moved.

Strictly Following the Topology

The repeater insertion algorithm presented here does not create repeater trees that follow the input topology strictly. This is one of its main features. Instead, some parts of the topology are used twice while moving clusters in parallel. Other parts are discarded due to recomputed Steiner trees. Figure 6.11 shows an example of how a detour is discarded during repeater insertion.

Figure 6.11: A topology (left) and the resulting repeater tree (right). Topology detours are removed by recomputing Steiner trees.

Figure 6.12: A topology (left) and the resulting repeater tree (right). By using a Steiner tree instead of following the topology, a detour for the sink with positive parity is avoided (orange dashed line).

In practice, recomputing Steiner trees gives better results than keeping the input topology because delay calculations are closer to the final (pre-routing) timing. Figure 6.12 shows how detours are avoided that would be induced by following topologies in parallel.
There are cases where it is desirable to strictly follow a topology, for example if an existing routing should be buffered such that the result can use the same routes. This can be done by changing clusters to also store a subtree of the topology connecting the cluster to its sinks. Delay calculations are then performed on the stored tree.

6.3 Dynamic Programming

In his groundbreaking paper, van Ginneken (1990) proposed a dynamic programming algorithm for buffering repeater tree topologies that maximizes the slack at the root. The algorithm worked only for a single buffer type and a single wiring mode, and its running time was O(n²), where n is the number of buffer positions. In addition to the input of the Repeater Insertion Problem, the algorithm needs buffering positions along the topology. The canonical approach is to add repeater positions equidistantly along the topology. However, it makes sense to choose buffer positions based on library and input characteristics, as shown by Alpert et al. (2004b). Later, Lillis et al. (1996a) extended the approach to handle a library consisting of b buffers or inverters with a running time of O(n² b²). They also proposed a way to handle power consumption. However, that algorithm is not polynomial. The running time was later improved (Shi and Li, 2005; Li and Shi, 2006; Li et al., 2012) by using clever data structures and better pruning techniques than in the previous papers. For instances with only a single sink, they achieve a running time of O(b² n). For nets with m sinks, the running time becomes O(b² n + b m n). There are a lot of extensions to the basic version of Lillis et al. (1996a).
There are works considering higher-order delay models (Alpert et al., 1999; Chen and Menezes, 1999), simultaneous buffer insertion and tree construction (Okamoto and Cong, 1996; Hrkić and Lillis, 2002, 2003; Hu et al., 2003), segmenting wires (Alpert and Devgan, 1997), or minimum buffer insertion under slew constraints (Hu et al., 2007). For an instance of the Repeater Insertion Problem with given repeater positions, the task of finding a repeater tree with maximum slack can be solved efficiently as we have just seen. If one wants to get the cheapest solution that satisfies the slack targets, then the problem becomes NP-complete even if one ignores load limits at the source and repeaters as shown by Shi et al. (2004). An FPTAS for the problem was presented by Hu et al. (2009). 6.3.1 Basic Dynamic Programming Approach We have implemented a version of the dynamic program as introduced by Lillis et al. and added some improvements. We did not use the running time improvements by Li, Zhou and Shi for several reasons that we give below. The dynamic program algorithm works with sets of candidates that are characterized by a required arrival time, a downstream capacitance, and a solution subtree. Candidates are propagated bottom up by adding wire segments and adding repeaters at buffering positions. A new candidate for each repeater type is created at each buffering position as long as capacitance limits are not violated. At inner nodes of the topology, the candidates of the left and right branch are merged together by adding all combinations to the candidate list. An explosion of candidates is prevented by only keeping candidates that are not dominated (i.e. there is no other candidate with better or equal RAT and lower or equal capacitance). At each node, we have two candidate lists for subtrees that need a positive or negative signal to preserve parity. Candidates are created at sinks. 
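The dominance pruning just described can be sketched as follows. The representation of candidates as (RAT, capacitance) pairs follows the text; the function name and the example values are illustrative:

```python
# Sketch of the candidate pruning: a candidate (rat, cap) is dominated if
# another candidate has a RAT at least as good and a capacitance at most as
# high. The surviving list is increasing in both capacitance and RAT.

def prune(candidates):
    """candidates: list of (rat, cap) pairs; higher rat, lower cap is better."""
    kept = []
    for rat, cap in sorted(candidates, key=lambda c: (c[1], -c[0])):
        # Scanning in order of increasing capacitance, a candidate survives
        # only if it improves on the best RAT seen so far.
        if not kept or rat > kept[-1][0]:
            kept.append((rat, cap))
    return kept

front = prune([(90, 5), (80, 7), (95, 9), (85, 3)])
```

Here (80, 7) is dropped because (90, 5) has a better RAT at a lower capacitance.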
As the dynamic program does not work with RAT functions, we collapse the RAT to a single value and evaluate the functions for optslew. We compare the quality of the Fast Buffering algorithm with the dynamic program in Section 8.3.

6.3.2 Buffering Positions

The running time and quality of a van Ginneken style algorithm highly depend on the choice of repeater positions. The result of our repeater insertion algorithm (Algorithm 5) is a complete repeater tree. Repeater nodes and Steiner nodes are used as buffering positions. If there is a node v such that

• v is a sink and has outdegree higher than 0,
• v is a root and has outdegree higher than 1, or
• v is an inner node and has outdegree higher than 2,

then we add additional nodes at the same positions and reconnect children to them until no node satisfies any of the above conditions. We create an additional buffering position at each sink. Optionally, we also split long edges in the resulting tree such that there is a potential buffer position at most after a given length. As the result of Fast Buffering already has quite optimal distances between consecutive repeaters on long lines, it does not make much sense to split the edges further. The repeater tree from Fast Buffering already navigates around big blockages. Thus, it is not necessary to worry much about them. However, the nodes of the created Steiner trees can lie above blockages. Thus, we add repeater positions at edges that cross a blockage boundary. Finally, all nodes in the interior of blockages are marked as blocked. They are not used as repeater positions.

6.3.3 Extensions to Dynamic Programming

In this section, we describe our changes to the basic dynamic program. Our changes are motivated by timing properties as observed in practice. By using some more accurate calculations, the algorithm achieves better slacks than the basic version by Lillis et al. (1996a).
This makes the algorithm suitable as a postprocessing step after Fast Buffering.

Black Box Timing Rules

As we have already discussed in Section 2.5.1, we do not exploit some properties of the Elmore delay and work with black box wiredelay and wireslew functions. This prevents us from using most of the techniques suggested by Shi, Li, and others to speed up the dynamic program. Each candidate knows the position of the last inserted repeater or the last merge of several topology branches. We call this position the sink of a candidate. For the sink, we always keep the required arrival time and the capacitance up to date. Instead of updating the sink when wire segments are added, we just accumulate the RC delay and compute the required arrival time on demand. This makes the algorithm slower than assuming pure Elmore delays but improves the result quality, because we are closer to the final timing over wire segments. For running time reasons, we only want to have one sink for each candidate. Thus, we update the sink on merge points.

Slew Effects

While computing the required arrival time for a candidate after adding wire or a repeater, we do not know the slew that will arrive at the candidate. This makes the calculations inaccurate and can lead to the pruning of otherwise optimal candidates. One solution to mitigate the problem is the use of buckets of discrete slew values, as for example proposed by Hu et al. (2007). For minimizing power consumption, this is a viable solution. However, for optimizing slack, this leads to an explosion of candidates. The resulting running times make such a solution impractical if one wants to handle millions of instances. While we still assume a prototype slew at the inputs of our candidates, we see the resulting slew at the sink of each candidate. We use the difference between the slew that we assumed for the required arrival time calculation at the sink and the arriving slew to estimate the real required arrival time.
Given a RAT at a candidate’s sink and slew s, we can compute the required arrival time that we use for further calculations: rat = RAT − slewdelay(s). See Section 4.1.4 for the slewdelay function. Wiring Mode Assignment An important extension to the dynamic program is the handling of different wiring modes. For this, each candidate has a buffering mode assigned of which only the horizontal and vertical wiring modes are used. All candidates with the same buffering mode and parity are kept in a candidate list. Only candidates with the same buffering mode are merged during the merging step. When we realise the net behind a candidate, all horizontal (vertical) wiring segments of the net will get the same horizontal (vertical) wiring mode. We use the wiring modes stored in the buffering mode of the candidate. Physically, multiple wiring modes per net would be possible, but most industrial routers tolerate only a single wiring mode per net and dimension. After a repeater has been inserted into a candidate, a buffering mode change can occur. Thus, the resulting candidate is copied into the lists of all modes. Such an extension was first proposed by Alpert et al. (2001a). They also showed that wire tapering, that is, the continuous assignment of widths to wire segments, only has marginal advantages compared to assigning a single wiring mode for each dimension to the whole net if there are enough modes available. Following this, we disabled changing of buffering modes at repeater positions if no repeater was inserted. With this, the number of buffering modes changes the overall running time only linearly. Our layer assignment routine will not use the same set of buffering modes uniformly. Instead, each edge of the input topology gets a set of possible buffering modes. Candidates that arrive at an edge with a not assigned buffering mode are just discarded. 
To decide which buffering modes are available at an edge, the blockage and congestion maps are used. We remove modes that would cause too high congestion or are blocked. It can happen that all edges incident to a buffering position that lies on top of a blockage have an empty set of buffering modes. As we cannot add a repeater and are not allowed to switch the buffering mode within a net, this would lead to empty candidate sets on all layers. In such cases, we ignore the congestion map and allow the lowest buffering mode on all edges incident to the repeater position.

Distinguishing Rise/Fall

A typical instance of the Repeater Insertion Problem has different required arrival times for rise and fall as well as different arrival times at the root. Most implementations of the dynamic program do not distinguish between both values. Instead, they settle on a single value like the average or the worst of both. To compute the delay over a repeater, also a single value is used. We have seen that several instances can be built better if one considers both values separately. The candidates in our algorithm have a time pair as required arrival time. When a wire is added or a buffer is inserted, the values can be updated separately by calculating the corresponding delays. This causes twice the effort of propagating only a single number. The pruning step uses only the worse of both required arrival times when comparing candidates, to prevent an explosion of candidates that we have to handle. Our experiments showed that the benefit of only pruning candidates that are dominated in both required arrival times was small, but it was paid for with high running times.

Slew Limits

In addition to the required arrival times, we also propagate a slew limit backwards. The slew limit of a candidate is the maximum slew that can arrive at the current node such that the slew limits are not violated in the whole subtree of the candidate.
Candidates are pruned as soon as their slew limit is not reachable, unless all candidates would have to be pruned. Candidate Selection As soon as we arrive at the root, we have to choose the best candidate to return a solution. Instead of relying on the required arrival times of the candidates, we recompute the whole timing of each candidate, propagating the slew accurately from the root to the sinks. The sinks are then also evaluated using their required arrival time functions instead of constant values. Power-aware Dynamic Programming Lillis et al. (1996a) showed how to extend the dynamic program to find the cheapest solution satisfying the required arrival time constraints. For their FPTAS for the Repeater Insertion Problem with buffering positions, Hu et al. (2009) discretized the power consumption values of repeaters into cost buckets. We have also extended our implementation to work with cost buckets. Each candidate is now characterized by three values: cost, cap, and rat. Every time a buffer is added or candidates are merged, the result has to be inserted into the correct bucket. There is a limited number of buckets; all candidates using more power than the highest-valued bucket are merged together. Unfortunately, the running time of this version is prohibitively high when using a number of buckets that is sufficient for good results. In comparison to the Fast Buffering solution, the timing results are very good, and the power consumption is smaller than for the basic dynamic program version (see Table 8.4). However, the running time prevents this version from being used extensively in production. We have taken random instances of different characteristics from a 22 nm design and compared the results of the power-aware version of the dynamic program to the basic one. The values are shown in Table 6.2. All runs of the power-aware version use 40 buckets between 0 and the power consumption of the Fast Buffering solution.
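The bucket discretization might be sketched as follows, assuming linearly spaced buckets; the bucket count matches the 40 used in the experiments, but the spacing is an illustrative choice:

```cpp
// Map a candidate's power cost to one of kBuckets linearly spaced buckets
// between 0 and max_cost (e.g. the cost of the Fast Buffering solution).
constexpr int kBuckets = 40;

constexpr int bucket_of(double cost, double max_cost) {
    if (cost >= max_cost) return kBuckets - 1;  // everything above the top bucket is merged
    if (cost <= 0.0) return 0;
    return static_cast<int>(cost / max_cost * (kBuckets - 1));
}
```

Within each bucket, the usual (cap, rat) domination pruning would still apply; the buckets only keep candidates of different cost from eliminating each other.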
We see that the sum of negative slacks and the worst slack are similar for both runs. The basic version uses a lot of area for bigger instances, while the running time of the power-aware version explodes. See Chapter 8 for details of the hardware setup and the instances. A comparison between Fast Buffering and the power-aware dynamic program on the same instance set can be found in Section 8.1.

                        Dyn. Program                Dyn. Program + Buckets
       Sinks      SNS  Slack   Area   Time        SNS  Slack   Area    Time
I01        1     −378   −378    154     11       −377   −377    160     374
I02        1      −91    −91     12      2        −91    −91     14      56
I03        2     −414   −207    138     13       −413   −206    150     446
I04        2     −161    −81      6      4       −161    −81      6     128
I05        3     −161    −61     16      5       −143    −61     11     214
I06        3     −177    −68     42      9       −162    −68     35     370
I07        4      −22    −11     25      8        −22    −11     25     351
I08        4     −137    −36     24      9       −133    −36      8     371
I09        5     −371   −111     48     10       −351   −113     40     487
I10        8     −112    −18     48     16       −117    −20     34     829
I11       10     −444    −69     51     20       −387    −68     22     886
I12       15     −694    −59    187     55       −698    −59    116    2820
I13       24      −71    −15     89     57        −60     −8     38    3660
I14       33    −3317   −188    309     70      −3167   −189    144    3711
I15       47    −2563   −105    310     96      −2922   −109    101    5372
I16       65    −1523    −96    310    181      −3852    −96     48   13125
I17       73    −3762    −75    355    144      −3424    −78    132   10078
I18      120   −10275   −104    510    345     −10701   −107    333   22624
I19      322        0     25   1427    777          0     25    851   46548

Table 6.2: Results of optimizing for slack and power on several instances from a 22 nm design using the basic version and the power-aware version of the dynamic program. SNS is the sum of negative slacks over all sinks and Slack is the worst slack of the instance; both are given in ps. Area is the space consumed by the internal repeaters, measured in placement grid steps. Time gives the running time of the dynamic program, excluding other parts, in milliseconds.
7 BonnRepeaterTree The repeater tree algorithm that we described in Chapter 5 and Chapter 6 has been implemented, together with a small framework for repeater tree optimization, as part of the BonnTools (Korte et al., 2007; Held et al., 2011) suite of physical design optimization tools developed at the Research Institute for Discrete Mathematics, University of Bonn, in an industrial cooperation with IBM. The main algorithm, together with some utility tools and its APIs, is called BonnRepeaterTree. BonnTools are now part of the IBM electronic design automation tools. They have to meet the requirements of industrial physical design; thus, a huge amount of development work is spent on coping with real-world designs. This chapter describes the aspects one has to consider when implementing our algorithms in an existing physical optimization environment. We start with details that are valid for a whole range of repeater tree instances, like the repeater library or the blockage map. We then show how a single instance is processed. We take a brief look at the BonnRepeaterTree software architecture and finish with a description of two tools that use our framework. 7.1 Repeater Library A typical standard gate library contains many different repeaters. There are repeaters for special purposes, like repeaters for clock trees or repeaters that should just add delay to signals. Then there are standard repeaters that are used within repeater trees. They are sorted into families of similar properties. First, there are repeaters of different Vt -levels. The voltage threshold (Vt ) is the voltage at which the gate starts to switch. Gates with a lower Vt -level are faster because they switch earlier, but their power consumption is much higher due to higher leakage power. The optimization flow or the designer sets the currently active Vt -level, and BonnRepeaterTree prefers repeaters from the active Vt -level. Second, repeaters are distinguished by their beta ratio (i.e.
the difference between the fall and rise delays). Repeaters can be built such that for similar inputs the rise and fall delays are either balanced or asymmetrical. Chains consisting of balanced repeaters are usually slower than chains of unbalanced ones. We perform long-distance calculations for each repeater (see Section 4.1.1) and then choose the family with the fastest repeaters. Within each family, there are repeaters of different sizes, or BHCs (Block Hardware Codes). Smaller repeaters consume less leakage power. They have lower load limits and are very sensitive to the load. In general, they are also slower than larger repeaters, which can drive higher load capacitances. The largest repeater is often the gate type that can drive the highest capacitance among all gates in the whole library. For a given Vt -level and beta ratio, we often use the whole family of BHCs and only hide the smallest repeaters, because they are so sensitive that small changes in routing can result in huge timing differences. As the running time of all repeater tree algorithms depends on the size of the library, one might choose to work with a subset of a family. For example, Alpert et al. (2000) proposed an algorithm to select a proper set of repeaters such that the results do not deteriorate too much. While such an approach would certainly improve the running time of our algorithms, the times we see in practice are fast enough to keep all repeater sizes (except the smallest ones, as described above). The sizes of some example libraries are shown in Table 8.3. Typically, a library has between 15 and 25 inverters and buffers. There are libraries that consist only of buffers and libraries that consist only of inverters. In practice, we distinguish between the repeaters that can be removed from the design and the repeaters that can be inserted.
While we limit the number of buffers and inverters that are used for construction, we want to be able to remove as much of the existing repeater trees as possible. 7.1.1 Repeater and Wire Analysis During repeater tree construction, the timing rules are called millions of times to evaluate intermediate solutions. Unfortunately, evaluating timing rules is quite slow; it is prohibitive, for running time reasons, to call the timing rules each time one wants to calculate delays or slews. To speed up the calculations, we approximate the timing rules of all repeaters. To this end, we sample the domain of both functions equidistantly and use bilinear approximation between the sampling points, which can be evaluated very quickly. Using 64 sampling points in both dimensions of the rules (input slew and output load) limits the error to about 2 picoseconds for the technologies in our testbed. Similarly, we approximate the delay rules over net segments. Given an input slew and an Elmore delay, the timing rules compute an output slew and a delay. For some technologies, the timing rule is just a linear scaling; other technologies, however, use more complicated functions. As we want to work with all inputs, we sample the timing rules for nets with 256 sampling points in both directions and also use bilinear approximation. 7.1.2 RAT and Slew Backwards Propagation We use bilinear approximations for all flavours of the slewinv function. During approximation, a binary search is performed to find the highest slew that achieves a given output slew and Elmore delay for nets, or a given output slew and load capacitance for repeaters. We approximate required arrival times by linear functions. During repeater insertion we are usually interested in the required arrival time for a given target slew st . Thus, we approximate the tangent of the RAT function at st .
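A minimal sketch of this construction: the sink RAT is evaluated through the wire at st and at st + ε, and the secant through both values is returned as the linear approximation. The linear wiredelay/wireslew models below are made-up stand-ins for the tabulated rules:

```cpp
// Linear function a + b*s; stands in for the linearized RAT functions.
struct Linear {
    double a, b;
    constexpr double operator()(double s) const { return a + b * s; }
};

// Hypothetical stand-ins for the tabulated wire rules (NOT the real models):
constexpr double wiredelay(double rc, double s) { return rc + 0.1 * s; }
constexpr double wireslew(double rc, double s)  { return s + 2.0 * rc; }

// Secant of the source RAT function at target slew st, obtained by
// evaluating the sink RAT through the wire at st and st + eps.
constexpr Linear source_rat(double rc, Linear rat, double st, double eps = 1.0) {
    const double r0 = rat(wireslew(rc, st)) - wiredelay(rc, st);
    const double r1 = rat(wireslew(rc, st + eps)) - wiredelay(rc, st + eps);
    const double slope = (r1 - r0) / eps;
    return Linear{r0 - slope * st, slope};
}
```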
To compute the RAT function at the source of a net for a given sink and signal edge with Elmore delay rc and sink RAT function rat, we compute the required arrival time for the slews st and st + ε for an appropriately small ε:

rat_st := rat(wireslew(rc, st)) − wiredelay(rc, st)
rat_st+ε := rat(wireslew(rc, st + ε)) − wiredelay(rc, st + ε).

The resulting required arrival time at the source is then the linear function through rat_st and rat_st+ε. To merge required arrival times from several sinks, we do not compute the lower contour. Instead, we evaluate all required arrival times for the slew st and keep only one RAT function that attains the minimum. Required arrival times are propagated backwards over repeaters in an analogous way. Additionally, we have to take care of signal edge inversions through inverters. 7.2 Blockages and Congestion Map The blockage map contains the regions of the design that are blocked for repeater insertion. The bounding box of the chip area is an enclosing rectangle of all free space in the design; everything outside is considered blocked. In addition, we also consider as blocked
• regions that belong to the bounding box of the chip area but not to the chip area itself,
• regions that are not free for gate placement within the design,
• regions that are blocked by the user,
• gates that are fixed in their location, and
• large gates that are usually difficult to legalize.
Given all blockages, we first block all free regions that are too small to place a repeater within. In a second step, an overlap-free set of rectangles covering the blocked areas is computed. The rectangles are then stored in a quadtree that supports fast nearest-free-location searches. Using the blockages, an equidistant blockage grid is created. The blockage map and blockage grid are then used to initialize free routing capacities between neighbouring tiles for a congestion map. Finally, all nets are added into the congestion map.
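Marking an equidistant blockage grid from the overlap-free rectangle cover might be sketched as follows; this is a toy version (in practice the rectangles also feed the quadtree and the congestion map):

```cpp
#include <array>

// A blocked rectangle in grid units, half-open in both dimensions.
struct Rect { int x0, y0, x1, y1; };

// W x H blockage grid: tile (x, y) is blocked iff some rectangle covers it.
template <int W, int H>
constexpr std::array<bool, W * H> blockage_grid(const Rect* rects, int n) {
    std::array<bool, W * H> g{};  // all tiles initially free
    for (int i = 0; i < n; ++i)
        for (int y = rects[i].y0; y < rects[i].y1; ++y)
            for (int x = rects[i].x0; x < rects[i].x1; ++x)
                if (0 <= x && x < W && 0 <= y && y < H)
                    g[y * W + x] = true;
    return g;
}
```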
7.3 Processing Repeater Tree Instances The basic steps of optimizing a single instance are extracting the input data from the netlist, building a new repeater tree, and replacing the original netlist with the new one. We describe these steps in the following sections. As our algorithms are heuristics or solve a simplified problem, it is possible that the new solution we compute is worse than the original one. To prevent degradation of the design, we evaluate the original and the new solution and compute metrics like slack, length, and the number of electrical violations. If the new solution is better, it is inserted into the design; otherwise, it is discarded. 7.3.1 Identifying Repeater Tree Instances The BonnRepeaterTree tool is designed to optimize all repeater tree instances of a design. However, the runtime environment does not give us the instances and the corresponding information directly; instead, the tool has to find the instances in the netlist. After identifying the instances, we have to extract all data relevant to each instance. We can assume that basic data like the repeater library, the wiring modes, the blockage map, and potentially a congestion map are already given. It remains to extract
• all nets and pins belonging to the instance and their placement,
• the arrival time functions at the root and the timing rules necessary to compute them,
• the required arrival time functions at the sinks, and
• if wiring already exists and its use is requested, the existing wiring topology.
The following sections describe the steps mentioned in the list. Identifying Roots and Inner Circuits Generally, all nets of a design are part of repeater trees. Nets incident to repeaters belong to the same repeater tree. However, some nets are not part of any repeater tree because they are protected from optimization by the designer.
There are several sources of such hides:
• Nets are hidden if they are already optimized and the designer does not want a tool to mess with the current result.
• Nets are hidden by other tools because these tools depend on the current solution.
• Nets that carry clock signals have to be buffered, but there are special requirements (for example, the signal should arrive at all sinks at the same time), so special clock tree tools buffer them.
• Nets with analog signals should not be buffered.
• Nets can have multiple inputs. In such cases, special care has to be taken that no short circuit is created.
Whether a net is hidden can easily be queried from the runtime environment. BonnRepeaterTree works on a set of nets from a design. To identify the instances corresponding to a set of nets, the following steps are performed: 1. Nets that are hidden are filtered out. Each remaining net, either a root net or a net at a deeper level, is part of a repeater tree. 2. For each net’s source we identify its root pin and collect it. The source pin of a net is a repeater tree root if it has no gate, if its gate is not a repeater, or if it is not possible to remove the repeater and its incident nets without violating a hide or a similar restriction. If a source pin is not a root, then it is an output pin of a repeater that is part of the tree, and we recursively continue the search with the net connected to its input pin. 3. For each collected root, we start a forward search to include all repeaters and sinks that belong to the instance. A sink pin is reached if it does not belong to a gate or if its gate is not a repeater that can be removed. After running this routine we have a set of instances. Each instance consists of a root pin, a set of inner repeaters, a set of sink pins, and a set of connections between them that are stored in a tree data structure.
Arrival Time Functions After we have identified the root pin of an instance, we have to extract some characteristics of the root. We are interested in
• the arrival time functions for all signals at the root and
• the output load limit that the root can drive.
It is possible that different signals go through a repeater tree instance. The timing engine creates so-called phases for different signal sources. Phases are propagated separately, so that there are several independent arrival times and slews as well as required arrival times at the timing nodes. Slew limits are also phase-specific. The output load limit is used in ConnectRoot to check the electrical feasibility of solutions. For each phase of the instance, there is a pair of arrival time functions, one for each transition. Given the load at the root pin, we can compute the arrival time we are interested in. There are several types of roots, and for each type we extract the arrival time functions in a different manner. Within the BonnRepeaterTree framework, we might see root pins that are
• output pins of circuits,
• primary inputs of the netlist,
• pins on hierarchy boundaries, or
• output pins of circuits that are fed by transparent segments.
We describe each type of root in the following sections. Figure 7.1: A repeater tree root z at the output pin of a standard AND-gate with two input pins. The arrival times and slews at the root are computed using arrival times and slews at the input pins a, b and the propagation segments p1, p2. Outputs of Circuits The most prominent type of root is an output pin of a standard circuit or a macro. There are propagation segments heading towards the output pin within the circuit, coming from its inputs or internal timing points. To estimate the timing at the root, we have to extract the timing rules of the propagation segments together with the arrival times and slews at their tails.
The load limit of the root is given by the timing rule of the circuit’s pin. Figure 7.2: A repeater tree root at a primary input of the design. Arrival times and slews are constant. Primary Inputs Primary input pins are starting points for signals coming from outside of the design. We distinguish two types of designs, and primary inputs are treated differently for each type. Some designs are macros that are later included into larger designs, forming a hierarchy of designs. The primary inputs and outputs of macro designs communicate with gates at higher hierarchy levels; their incident nets can be optimized by the repeater tree routine. Top-level designs are the second type. They communicate with components outside of the chip image. Here, it is often not possible to optimize nets incident to primary inputs because they need special treatment due to electrical constraints. In both cases, information about the timing coming from the outside is given in the form of arrival time and slew assertions. Normally, the values are constant, independent of the load at the pins. In addition, there is a load limit asserted at the pin. Figure 7.3: A repeater tree root (r) at a hierarchy boundary. It is in the middle of a net crossing the boundary. To compute arrival times and slews one has to fetch them at the driving gate’s inputs and propagate them over the gate and net. Changing the load at pin r has effects on the timing at the side pins i and z. Pins on Hierarchy Boundaries It is possible to load hierarchical designs such that the contents of all hierarchy levels are visible to optimization tools. Nets crossing hierarchy boundaries could then be optimized in a single step. However, tools are often only allowed to work at a single level at a time. Thus, instances stop at hierarchy boundaries, and it is possible to get roots at virtual pins that mark the hierarchy boundaries.
For roots at hierarchy boundaries, it is possible to take information about the previous level into account. If one changes the load capacitance at such a root, this has an impact on the driver of the preceding net and on the net itself. To calculate the timing at the root, one has to extract the driver’s timing as described in the previous case and also the timing behaviour of the net between the driver and the root. In our tool, we first extract the driver’s timing rules and then calculate a Steiner tree for the preceding net. When we want to get the timing at the root, we first calculate the final load at the driver and then compute the timing at the driver’s output pin. In a second step, we recompute the Elmore delay for the root and propagate the timing over the net. The load capacitance limit of the boundary pin is the highest capacitance we can connect to the pin without violating the capacitance limit of the preceding driver. Changing a repeater tree at a hierarchy boundary also changes the timing at all pins that are siblings of the root in the preceding net. Figure 7.3 shows the situation with sibling sinks i and z. The slack can turn negative or electrical violations can appear. However, we ignore the timing at sibling pins because we did not see any problems in the past. If it turns out to be a problem in the future, the timing of sibling pins can be taken into account by limiting the load capacitance limit of the root such that the siblings’ timing remains feasible even in the worst case. A different problem with siblings in the preceding net appears if they are hierarchy boundary pins themselves (pin z in Figure 7.3). Working on one of the pins can significantly change the timing behaviour of the others. This is a problem when we optimize both instances in parallel, first extracting the timing constraints and then optimizing: when we start to work on a subsequent root, we do not see the changes made during the optimization of the first root.
A better solution would recognize that all siblings belong together and optimize them in a single repeater tree instance. However, the situation appears rarely in practice and has not posed a problem so far; thus, we ignore it. Figure 7.4: A root at a macro boundary with transparent segments. The load capacitance at r is visible over the transparent segment tr to the propagation segments p1 and p2. Arrival times and slews are calculated by propagating them first over p1 and p2 and then over tr. The timing at sibling pin z depends on the load capacitance at r and vice versa. Transparent Segments There is a fourth type of root, similar to roots at hierarchy boundaries, that one can find at the output pins of macros. Macros that were processed as a separate hierarchy level are later finalized and treated as a single block. The timing of the whole block is captured in macro-specific timing rules, and the behaviour at hierarchy boundaries is modelled using transparent segments. Transparent segments mimic preceding nets in this situation. Normally, propagation segments within gates shield the load capacitance at their head such that it is not visible at preceding segments. In contrast, the capacitance at the head of a transparent segment is visible at its tail and therefore influences previous propagation segments. This is analogous to a net, where the load at the sink is visible at the source. Transparent segments are preceded by propagation segments that correspond to the driving circuit in the hierarchy boundary case. To fully capture the timing behaviour, one has to extract the timing at the tails of the segments heading to a transparent segment and the timing rules of all involved segments. The timing rules of the macro give us the load capacitance limit of the root. Transparent segments have problems similar to those of pins on hierarchy boundaries: it is possible that there are sibling segments heading to other output pins of the macro.
Here, too, both instances should be considered at once. However, we have not seen such a problem in practice yet. RAT Functions at Sinks For each phase that arrives at a sink, we approximate the rise and fall RAT functions by linear functions. This is a rough estimate compared to the effort we spend on accurately estimating arrival times at the root. In practice, putting more effort into better RAT functions improves the results only by a small margin; the influence of the root is much higher. We know the current slews and required arrival times at the sinks from the timing engine. We set the RAT functions of sinks at primary outputs, where required arrival times are asserted, to constant functions. For pins that are gate inputs, we set the RAT functions to the tangent around the current slew by evaluating the outgoing propagation segments. Phase Shifts Ideally, different phases are handled separately during repeater tree construction, because different sinks can be critical for different phases and, due to different slews and slew limits, propagating only a single phase can be too pessimistic. For running time reasons, however, we only handle a single phase during repeater tree construction; the different phases have to be merged into a single one. This is done by normalizing arrival times and RATs. First, we assume that the root has a load capacitance of 0 and compute the arrival times. Using our delay model, we determine the criticality for each phase and signal edge separately. The most critical phase and signal edge is used as the reference. Then, all other signals are shifted by a constant such that their arrival times match the reference arrival time. The same shift is applied to the RAT functions at all sinks. The arrival times and RATs used during repeater tree construction are the worst ones after shifting. We also use the tightest limits over all phases.
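The normalization step can be sketched as a constant shift per phase. The Phase record with a single constant RAT is a simplification of the RAT functions used in practice:

```cpp
#include <array>
#include <cstddef>

// A phase with its arrival time at the root and a (constant) RAT; illustrative.
struct Phase { double at, rat; };

// Shift every phase so that its root arrival time matches the reference
// (most critical) phase; RATs move by the same constant, so slacks are kept.
template <std::size_t N>
constexpr std::array<Phase, N> normalize(std::array<Phase, N> ph, std::size_t ref) {
    const double ref_at = ph[ref].at;
    for (std::size_t i = 0; i < N; ++i) {
        const double d = ref_at - ph[i].at;
        ph[i].at  += d;  // now equal to ref_at
        ph[i].rat += d;  // the sink RATs are shifted by the same constant
    }
    return ph;
}
```

Because arrival time and RAT move together, the slack (rat − at) of every phase is unchanged, which is exactly what makes the merged single-phase view sound.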
When it comes to evaluating a solution, we propagate each phase independently again to avoid pessimism. Unplaced Pins Especially in early design stages, it is possible that sink pins or the root have no proper placement. A reason might be that the corresponding gate is not yet placed, or that the gate’s design is not yet finished, such that the pin’s position within the gate is unclear. If there is at least one placed pin in an instance, then the unplaced pins are positioned at the center of gravity of the placed ones. If all pins are unplaced, we treat them as if they all lay at the origin of the coordinate system. Identifying Existing Wiring As mentioned in Section 5.9, existing wiring can be used as a topology for the buffering algorithm. Because wires are stored in our optimization environment as a list of segments between two coordinates, we first have to reconstruct a graph out of the segments. It can happen that the existing wiring is not connected or does not cover all pins. In such a case, we discard the whole graph and use our topology algorithm. Slew Limits In addition to the slew limits at input pins imposed by the timing rules, there might be additional design-specific slew limits. First, there is a global slew limit that should not be violated at any pin. Second, there are phase-specific slew limits that are only valid for arrival times belonging to the according phase. Given an instance, we know which slew limits apply. We then modify the instance such that the slew limits of the sinks and of all repeaters in the library are not higher than the instance-specific limits. In the case of phase-specific slew limits, this means that we choose the minimum over all slew limits for construction. This can be too pessimistic: for example, we sometimes compare slews from uncritical phases, which have higher slew limits, against the smaller slew limits of critical phases.
Lowering the slew limits at insertable repeaters can make some buffering modes invalid if their slew target is above the limit. We simply remove invalid buffering modes before an instance is processed. Capacitance Limits Similar to phase-specific slew limits, we also see capacitance limits at output pins that depend on the phases propagating to the gates. They are imposed to lessen the effects of electromigration. Electromigration is the movement of material in a conductor caused by current; it decreases the reliability of integrated circuits over time. One strategy used to cope with it is reducing the capacitance that gates are allowed to drive. As the strength of the effect depends on several factors, including the frequency of the signals, the countermeasures also depend on the signals. For a given instance, the load limits of all repeaters that can possibly be inserted have to be lowered according to the signals going through the instance. This can make some buffering modes that depend on higher loads invalid; they are simply removed for the instance. 7.3.2 Constructing Repeater Trees For repeater tree construction, we always first construct a topology and then add repeaters using our algorithm. The result can optionally be post-processed by the dynamic programming repeater insertion, which treats the result of Fast Buffering as an input topology. The initial topology can be constructed from existing wiring or by our topology algorithm. Parameter ξ The algorithms are controlled by the preprocessing that we presented in Chapter 4 and by the ξ parameter. For our implementation, we have split the parameter into three different ones: ξm for buffering mode creation, ξt for topology generation, and ξr for the repeater insertion step. The parameters can be controlled by the user independently. In Section 8.8, we give hints on which values work best for the parameters. Currently, we only work with two buffering modes for each wire mode.
The first buffering mode is extracted using ξ = 0.0, and for the second one we use ξr. The faster buffering mode is then used for topology generation, and AssignEffort chooses between both. As described earlier, lowering slew or capacitance limits can render buffering modes invalid. This can only happen for the slower buffering mode, as it has higher limits. In such a case, we increase ξ until all limits are met. Thus, we always have a choice between a slower and a faster buffering mode. The parameter ξt is used within topology generation when we have to decide at which edge to connect a sink. Finally, ξr is used during repeater insertion when we have to decide which solution to choose in InsertRepeater, Merge, and ConnectRoot. 7.3.3 Replacing Repeater Tree Instances After a repeater tree has been constructed, we have to evaluate it. Our algorithm is only a heuristic that cannot guarantee a good solution. We therefore compare our solution to the existing repeater tree that was identified during instance collection. For this, we have the choice to
• evaluate the result using our approximations of the timing rules or
• use the timing engine for evaluation.
Evaluation using the approximations has the disadvantage that it is slightly inaccurate and can miss effects influencing the timing. However, it is much faster than the timing engine, and it can run on different instances at the same time (see Section 7.4.3). If the new solution is not good enough to be kept, it can be discarded without modifying the netlist. If we want to evaluate using the timing engine, we have to insert the result into the netlist first. Currently, the quick evaluation mode is used most of the time; the timing engine is used only when we need full accuracy. The criteria used to evaluate a solution are ordered from most important to least important: electrical violations, slack, power consumption, and length.
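The ordered comparison might be sketched as a lexicographic test over the four criteria; `Eval` and `better` are hypothetical names for illustration:

```cpp
// Evaluation of one solution; criteria ordered from most to least important.
struct Eval {
    int    violations;  // number of electrical violations (fewer is better)
    double slack;       // worst slack (larger is better)
    double power;       // power consumption (smaller is better)
    double length;      // total wire length (smaller is better)
};

// Lexicographic comparison: the first differing criterion decides.
constexpr bool better(const Eval& a, const Eval& b) {
    if (a.violations != b.violations) return a.violations < b.violations;
    if (a.slack != b.slack)           return a.slack > b.slack;
    if (a.power != b.power)           return a.power < b.power;
    return a.length < b.length;
}
```

The new tree would replace the old one exactly when `better(new_eval, old_eval)` holds; otherwise the new solution is discarded.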
7.4 Implementation Overview BonnRepeaterTree is a module in the BonnTools suite. It has been implemented in the C++ programming language. The module consists of
• a repeater tree API,
• a framework that can be used to implement repeater tree construction algorithms,
• implementations of our repeater tree construction algorithms, and
• a layer translating between the framework and the IBM physical design tools.
All our algorithms have been implemented using the framework. To migrate them to a different physical design tool suite, one only has to reimplement the translation layer; it is not necessary to touch the algorithms. 7.4.1 Repeater Tree Construction Framework Our framework provides an algorithm with all the information that belongs to an instance as described in Chapter 3. In addition, the existing implementation of an instance is available. This is used, for example, for post-optimization of instances or to compare new solutions with older ones. To build a new tree, an algorithm only has to create a tree data structure consisting of nodes for the root, sinks, and repeaters, and the connections between them. Evaluation and insertion into the design are done by the framework. 7.4.2 Repeater Tree API BonnRepeaterTree is not only used as a standalone utility but also as a subroutine of other tools, for example BonnLogic (Werber, 2007; Werber et al., 2007), a tool to restructure logic on the critical path. There are also programs that need information about repeater tree instances. For example, determining whether a pin is part of a repeater tree at all is a functionality that is not provided by the timing engine.
We provide a small API to work on repeater trees using our algorithms, consisting of
• utility functions to determine whether a pin is in a repeater tree, whether a pin is a root, and, for a pin in a repeater tree, the corresponding root,
• a function returning a whole repeater tree instance, and
• a function to construct a repeater tree using one of the algorithms.
Instances expose the original repeater tree. The user has direct access to the root and sinks and can traverse the tree to reach inner repeaters and nets. For example, the RerouteChains tool uses the interface to fetch all repeater trees and traverses each original tree to identify chains. Most tools, however, have a set of pins and want to optimize the pins' repeater trees. Such tools just iterate over all pins, fetch an instance, and construct it.

7.4.3 Parallelization

About ten years ago, the increase of CPU speed with each new generation slowed down significantly. Instead, multi-core CPUs appeared, with the number of cores increasing with each generation. To fully utilize the power of modern CPUs, it is necessary to distribute work across the cores. It rarely happens that repeater tree optimization is started on a single instance. Typically, hundreds or thousands of instances should be computed at the same time. Because of this and the small running time of optimizing a single instance, it makes little sense to parallelize parts of our algorithm. Instead, we choose the simpler approach of parallelizing the computation of different instances. The optimization environment does not allow modifying or even querying the netlist or timing engine from different threads at the same time because this would lead to race conditions. Therefore, we protect all calls to the environment by a single mutex. To reduce contention on the mutex, the framework first fetches all information necessary to compute a repeater tree while the mutex is held.
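This locking discipline is a fetch-compute-commit pattern: query the environment under the mutex, compute lock-free, and lock again only to commit the result. A minimal sketch, with hypothetical types standing in for the real netlist and timing-engine interfaces:

```cpp
#include <mutex>
#include <vector>

// Sketch of the fetch-compute-commit locking discipline; all types and
// names are hypothetical stand-ins, not the real environment interfaces.
namespace sketch {

std::mutex env_mutex;  // the single mutex protecting the design environment

struct InstanceData { int root_pin; };
struct Tree { int root_pin; int repeaters; };

Tree optimize_instance(int root_pin, std::vector<Tree>& netlist) {
    InstanceData data{};
    {
        std::lock_guard<std::mutex> lock(env_mutex);  // fetch: query netlist/timing engine
        data.root_pin = root_pin;
    }
    Tree result{data.root_pin, data.root_pin % 3 + 1};  // compute: no lock held
    {
        std::lock_guard<std::mutex> lock(env_mutex);  // commit: modify the netlist
        netlist.push_back(result);
    }
    return result;
}

}  // namespace sketch
```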
Then, during the whole computation, the mutex is never acquired. Only after the decision to insert a repeater tree has been made is the mutex locked again to modify the netlist.

Testing Multithreaded Running Times

We have tested the parallel execution on our testbed of chip designs using an Intel Xeon machine with four processors having eight cores each. Table 7.1 shows the running times with different numbers of parallel threads. We achieve a speedup of around 4 using 8 threads, which is our default when we optimize all instances of a design. The speedup is limited by the serial parts, instance identification and instance insertion.

Design       1 thread       2       4       8   Factor      12      16      24      32
Baldassare       20 s     11 s     7 s     6 s   3.34×     6 s     6 s     6 s     6 s
Beate            35 s     18 s    11 s     8 s   4.09×     8 s     8 s     8 s     8 s
Gerben           40 s     21 s    12 s     9 s   4.46×     9 s     8 s     8 s    10 s
Wolfram          45 s     23 s    14 s    11 s   4.08×    10 s    10 s    10 s    11 s
Luciano         102 s     55 s    35 s    29 s   3.48×    28 s    27 s    28 s    28 s
Benedikt        250 s    130 s    78 s    63 s   3.95×    59 s    58 s    59 s    59 s
Renaud          259 s    140 s    89 s    74 s   3.52×    69 s    69 s    69 s    70 s
Julius          309 s    158 s    92 s    72 s   4.38×    69 s    66 s    69 s    68 s
Franziska       358 s    188 s   113 s    89 s   4.02×    84 s    82 s    84 s    85 s
Meinolf         449 s    235 s   138 s   108 s   4.16×   103 s   101 s   101 s   102 s
Iris            501 s    269 s   170 s   142 s   3.53×   135 s   135 s   136 s   138 s
Gautier        1420 s    765 s   473 s   383 s   3.70×   372 s   363 s   378 s   390 s

Table 7.1: Running times with different numbers of used threads. As the running time gains from going beyond 8 threads are rather small, we use 8 threads by default. The speedup factor (1 thread vs. 8 threads) is then around 4.

7.5 BonnRepeaterTree in Global Timing Optimization

Our algorithm is able to process millions of instances in reasonable time. It is therefore suitable for global timing optimization, where all instances are processed. BonnRepeaterTree offers two parameter sets by default that have proven useful for global optimization (see Held (2008)).
Power Trees

The first parameter set aims to build all instances with minimal power consumption while avoiding electrical violations. We achieve this by relaxing all required arrival time constraints to infinity. We assume that the root gate is replaced by the strongest version from its BHC family. We mainly build short topologies, but to prevent overly long daisy-chains, the parameter ξt is set slightly above 0.0.

Slack Trees

The second parameter set tries to raise the slack of each instance above its slack target. The ξ parameters are set around 0.8 to prevent excessive resource usage. Uncritical instance parts are built with parameters similar to the power tree ones due to the AssignEffort subroutine.

7.6 BonnRepeaterTree Utilities

Besides optimization, there are other tasks related to repeater trees. The most common tasks are implemented as separate tools that can be called directly by the user. The two existing tools are a rip-out routine and repeater chain optimization.

7.6.1 Removing Existing Repeaters

It is desirable to remove all existing repeater trees to get into a clean initial state for optimization. Our RipOut routine removes all inverters and buffers from a design or from specified instances such that at most one parity-preserving inverter is left.

Figure 7.5: The netlength of this instance can nearly be halved by adding a second inverter and placing one of the two at each negative sink. With only one inverter, it is placed within the bounding box of all negative sinks such that it is nearest to the root.

Figure 7.5 shows how removing as many repeaters as possible can increase the netlength of the design. This has to be considered if one uses a physical design flow where all repeater trees are built along topologies coming from a global router. Typical global routers are not capable of inserting inverters on their own.
The topologies that they can generate are limited by the input, and the example above shows that removing as many repeaters as possible is not a satisfactory solution because it can lead to unnecessarily high netlength. We offer a mode in our RipOut routine that tries to circumvent this problem. In this mode, the routine does not insert the inverting repeater but modifies the logic by connecting all sinks directly to the root. The original parity is stored for each sink pin. As soon as the instance is touched again by one of our tools, the sink parities are restored. This mode is dangerous because the whole design might break if another tool changes the logic between rip-out and restore. For global routers, however, each repeater tree instance is then presented as a single net.

7.6.2 Postprocessing Repeater Chains

Larger distances are often covered by chains of repeaters. A chain is a sequence of consecutive nets with fanout one. For larger distances, it is often better to use higher layers with wider wires for the nets instead of the default layers. Reaching the slack targets is often impossible without wider wires, and many short nets on lower layers increase placement congestion due to the many additional repeaters. On the other hand, it makes no sense to assign high-fanout nets to higher layers. High-fanout nets often connect many sinks locally, so their delay is dominated by pin capacitance, and the benefit of wider wires is small in such a situation. Thus, we restrict ourselves to repeater chains. We offer two routines for improving the placement and layer assignment of repeater chains.

Postprocessing

The first routine tries to improve the layer assignment of repeater tree chains by ripping them out and rebuilding them on higher layers. The routine iterates over all repeater chains and builds a new solution using each configured layer assignment with the Fast Buffering routine.
The best solution according to our criteria (electrical violations, slack, power consumption, and length) is kept. The routine either uses a shortest path computation in the blockage grid or uses the path search of the congestion map if available. In the latter case, assignments are only performed if they do not violate congestion targets on the probed layers.

Congestion-aware Rerouting

The second routine, RerouteChains, does not try to improve the layer assignment of repeater tree chains. Instead, it tries to improve congestion by moving the repeaters into less congested areas. Figure 7.6 shows all repeater chains longer than 4 mm on a design that has a lot of congestion. By distributing the chains evenly and by avoiding congested areas, it is possible to reduce overall congestion.

Figure 7.6: All repeater chains longer than 4 mm on a large design. Distributing the chains more evenly can give us less congestion while preserving the timing.

We have a simple ripup-and-reroute heuristic that reroutes the chains. This routine only works in the presence of a congestion map. In several iterations, all chains are collected that use an edge in the map above a certain congestion level. Then, for each chain, the following steps are performed:
1. Collect the nets of the chain and the costs of their routing in the congestion map.
2. Search for a new path from the start point to the end point of the chain in the congestion map.
3. Distribute the internal repeaters of the chain along the new path such that relative distances are preserved. The repeaters are then placed legally at the nearest free positions next to their target positions.
4. Compute new routes for all incident nets.
5. If the costs of the new nets are smaller than the old costs, keep the new solution. Otherwise, revert it.
The search for the new path does not take existing layer assignments into account because this is not supported by the congestion map.
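The accept-if-cheaper structure of steps 1-5 can be sketched as follows; the chain and cost types are hypothetical stand-ins for the real routing data structures:

```cpp
#include <functional>

// Hypothetical stand-in for a repeater chain and its congestion-map cost.
struct Chain { double routing_cost; };

// Tries to reroute one chain: computes a candidate cost via the supplied
// reroute function and keeps the new solution only if it is cheaper;
// otherwise the change is reverted (here: simply not applied).
bool try_reroute(Chain& chain,
                 const std::function<double(const Chain&)>& reroute_cost) {
    const double old_cost = chain.routing_cost;   // step 1
    const double new_cost = reroute_cost(chain);  // steps 2-4
    if (new_cost < old_cost) {                    // step 5
        chain.routing_cost = new_cost;
        return true;
    }
    return false;  // revert
}
```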
The assignments are only used when the new routes for the nets are computed. Similarly, the path search ignores timing. We restrict the congestion map to the bounding box between the start and the end of the path plus some tiles for detours. Thus, we expect that the routes do not get too long and that timing is not degraded too much. Instead, the number of nets with long detours generated by the global router should become smaller. The routine also does not consider placement congestion directly, apart from searching for free positions. In practice, placing gates so densely that no gaps occur results in unroutable designs. Thus, if we have a design that is considered routable by the global router, we expect that placing the circuits legally in the corresponding global routing tiles is feasible without much placement distortion. This cannot be guaranteed, but the experiments did not show problems with placement legalization. If we skip placing the repeaters legally after movement, the results are similar to runs where the circuits are legalized. A potentially better solution to the problem has been proposed by Janzen (2012). He integrates the placement of repeater chains into the global router. Each whole chain is considered a single net by the global router, and, during path search, timing as well as placement area is considered. Compared to the method presented here, the running time of the global router increases significantly due to the additional work during path search. So far, this approach is not suitable for application in practice.

Experimental Results

We test the RerouteChains routine on our testbed of chip designs (see Chapter 8). The test cases are the output of a standard BonnTools optimization flow (see Section 7.5). We compare the effects of our routine on routability and timing. Table 7.2 and Table 7.3 summarize the experiments. BonnRouteGlobal (see Gester et al., 2011; Müller et al., 2011; Müller, 2009) is used to measure routability. We have one run of the global router before and one run after RerouteChains.
The global router and the congestion map have nearly the same number of global routing tiles in both dimensions. Table 7.2 first shows the number of repeater chains considered on the test instances. Our tool runs five iterations over all chains and tries to reroute a chain if it uses an edge in the congestion map with more than 85 % utilization. If a cheaper solution is found, it is kept. The second column shows the number of improvements found over all five iterations. The following columns show the running times of both global router runs (1st GR, 2nd GR), congestion map creation (CM), and RerouteChains (RC).

Design        Chains   Reroutes   1st GR      CM       RC   2nd GR
Baldassare      7613        381      9 s     2 s      3 s     10 s
Beate           5043        251     14 s     2 s      5 s     14 s
Wolfram         3842       1473     18 s     2 s      6 s     19 s
Gerben          3493         10     14 s     1 s      5 s     15 s
Luciano        27922       4200     46 s     5 s     20 s     50 s
Benedikt       37893       3209    106 s    14 s     38 s    111 s
Renaud         14908      12379    295 s    13 s     33 s    289 s
Julius         41677       7315    178 s    24 s    185 s    150 s
Franziska      56697       3108    151 s    22 s     65 s    160 s
Meinolf       128023      68221    165 s    23 s    103 s    170 s
Iris           76153     141865    750 s    30 s    193 s    797 s
Gautier       165491     214135   6454 s   205 s   1344 s   5247 s

Table 7.2: Testbed for the RerouteChains tool. The running times are given for the global routing (GR), congestion map creation (CM), and RerouteChains (RC).

The congestion map creation proves to be very fast. The RerouteChains running times also contain the legalization of moved repeaters. Overall, the running times are acceptable. Table 7.3 first shows the total overflow over all global routing edges as reported by BonnRouteGlobal. Then, the number of nets whose length is more than twice the length of a Steiner minimum tree is reported, together with the highest relative detour. Finally, timing quality is measured by the sum of negative slacks (SNS). Most instances are uncritical, and the tool has little effect.
On Benedikt and Franziska, routability decreases slightly due to differences in the free edge capacities seen by the global router and the congestion map. On the designs Gautier and Renaud, the congestion is reduced significantly, resulting in fewer detours. Due to the reduction of detours, the sum of negative slacks also improves on these instances. In summary, there are instances where running the tool improves routability, while on uncritical instances no harm is done. Figure 7.7 and Figure 7.8 show congestion maps as reported by BonnRouteGlobal for the designs Julius and Gautier. The colors show the maximum relative edge utilization over all layers. Especially on Gautier, we see the huge reduction of overflow by a factor of more than 100.

Design      Step      Overflow   Nets > 100 %   Max detour   SNS
Baldassare  before           0         0             8 %       −140049 ps
            after            0         0             8 %       −140039 ps
Beate       before           0         0            78 %       −151997 ps
            after            0         0            78 %       −151990 ps
Wolfram     before           0         0            25 %        −75187 ps
            after            0         0            25 %        −75210 ps
Gerben      before           0         0            30 %       −109589 ps
            after            0         0            30 %       −109626 ps
Luciano     before           0        52           524 %       −464277 ps
            after            0        24           342 %       −481492 ps
Benedikt    before        1446         0            72 %       −207397 ps
            after         2906         0           103 %       −210871 ps
Renaud      before     4140245       913           747 %      −6557941 ps
            after      3444979       825           513 %      −5822497 ps
Julius      before           0         0            99 %     −18787024 ps
            after            0         0            89 %     −18766033 ps
Franziska   before        1307         0            78 %      −1694730 ps
            after        10982         0            78 %      −1694434 ps
Meinolf     before           0         0            98 %      −4622849 ps
            after            0         0            89 %      −4715832 ps
Iris        before     2843214      4057          1662 %     −19704061 ps
            after      1891486      3032          1662 %     −15602736 ps
Gautier     before    12682825      4609           306 %      −9647137 ps
            after       123511       111           131 %      −7058513 ps

Table 7.3: Difference in routability and timing before and after RerouteChains. "Nets > 100 %" counts nets with a relative detour of more than 100 %, i.e., nets longer than twice a Steiner minimum tree.

Figure 7.7: Congestion on design Julius before (left) and after (right) RerouteChains. The color scale ranges from 30 % to 120 % of relative edge utilization.
Figure 7.8: Congestion on design Gautier before (left) and after (right) RerouteChains. The color scale ranges from 30 % to 120 % of relative edge utilization.

8 Experimental Results

As part of the IBM optimization tool suite, the BonnRepeaterTree tool is in daily use to optimize the physical design of current chip designs. It has proven useful for current ASIC (application-specific integrated circuit) designs as well as for bleeding-edge processor units. Designers choose it as part of the BonnTools optimization tools to achieve timing closure. In this chapter, we want to show performance metrics of our tool. We compare it to another tool used by IBM for repeater tree optimization. We also compare against lower bounds for slack, netlength, and repeater counts. We then show how parallel walk and effort assignment affect the results and give hints on the proper selection of the parameters ξ, η, and dnode.

Design      Technology   Instances   Max. Sinks
Baldassare     22 nm        20552        152
Beate          22 nm        34382        166
Wolfram        32 nm        44413        274
Gerben         32 nm        44677        194
Luciano        22 nm       101625        263
Benedikt       22 nm       241465      12061
Renaud         45 nm       268775        389
Julius         22 nm       284934       1589
Franziska      22 nm       328600        740
Meinolf        22 nm       364969       1539
Iris           22 nm       393457       3264
Gautier        45 nm      1275731       6859

Table 8.1: Designs used for experimental results. For each design, all repeater tree instances were built.

We have chosen twelve current chip designs from our industrial partner IBM for our experiments. For each design, we build all repeater tree instances regardless of difficulty. In total, we have more than 3.3 million instances of varying sizes with up to 12061 sinks. Table 8.1 shows the designs we used: their codenames, technologies, the number of instances, and the number of sinks in the biggest instance. The distribution of instance sizes is very uneven, as shown by Table 8.2, which subdivides the instances by technology. The designs are dominated by single-sink instances.
Instances with up to four sinks already make up more than 90 % of all instances. It is important to produce good repeater trees for small instances to get acceptable overall results. However, it is also important to optimize instances with more sinks well because, in practice, they are often among the timing-critical instances.

Sinks        45 nm     32 nm     22 nm      Total     Share
1           958307     58920   1307808    2325035   68.39 %
2           191709     11422    179836     382967   11.26 %
3           144054      9077     96330     249461    7.34 %
4           115818      2219     42237     160274    4.71 %
5            27687      2384     24882      54953    1.62 %
6            25420       968     20435      46823    1.38 %
7            16523       652     19334      36509    1.07 %
8            11518       521     18883      30922    0.91 %
9–20         27348      2336     44070      73754    2.17 %
21–50        14842       759     12434      28035    0.82 %
51–100        1667       239      5192       7098    0.21 %
101–250        775        46      2311       3132    0.09 %
251–500        233         1       269        503    0.01 %
501–1000        75         0        30        105    0.00 %
> 1000         110         0        28        138    0.00 %
Total      1536086     89544   1774079    3399709  100.00 %

Table 8.2: Test instances grouped by number of sinks and technology.

Table 8.3 shows the sizes of the repeater families used as libraries. The designs from the 32 nm and 22 nm technologies do not have any buffers. Instead, two consecutive inverters are used if necessary. We have performed all of our experiments on a machine with two Intel Xeon X5690 processors with 6 cores each. The machine runs at a base clock speed of 3.46 GHz. Intel Turbo Boost Technology is enabled with a maximum clock speed of 3.73 GHz. Hyperthreading was disabled. All experiments were run single-threaded, but up to 12 experiments were run in parallel if not noted otherwise. The running times reported might be higher than necessary because the machine was fully loaded with experiments. However, in practice, designers often have to share computing resources with others, too. The machine has 192 GiB of main memory. The code was compiled with GCC version 4.1.2 under Red Hat Enterprise Linux Server 5.6 at optimization level O2.
8.1 Comparison to an Industrial Tool

As a first experiment, we compare our algorithm to a repeater tree construction tool that is used by IBM for most repeater trees. It uses a van Ginneken-style approach with the running time improvements by Li et al. (2012). It is the default tool in the IBM optimization suite due to its good results and tight integration with the placement tool of the suite.

Technology   Inverters   Buffers
45 nm           18          18
32 nm           20           0
22 nm           22           0

Table 8.3: Repeater library sizes.

The integration makes it hard to compare both tools on all instances because it is not possible to run the industrial tool on a single repeater tree instance without huge overhead. We have chosen a random sample of 19 instances from the Franziska design with different characteristics. The instances have different numbers of sinks and diameters. While it took seconds to run both tools on all instances, testing took one hour due to the overhead. Table 8.4 shows the results of running the industrial tool, our BonnRepeaterTree Fast Buffering routine, and our BonnRepeaterTree routine with dynamic programming using 40 power buckets. All tools are configured to maximize slack. We have, however, reduced ξ to 0.8 for our repeater insertion. The reason is that higher values would not improve the slack anymore but would cause higher area consumption. In practice, we also seldom use higher ξ values. Both tools are configured to obey blockages, and they are only allowed to use the default wiring modes. The instances are already optimized to give all tools reasonable arrival times at the root and required arrival times at the sinks. In practice, the IBM tool would prune the size of the repeater library to improve running time. We configured the tool to use the whole library because this leads to significantly better slacks. Our tool does not prune the library so far.
The overall result is that our dynamic program produces the best slack, followed by Fast Buffering, even though we reduced the ξ parameter. Better slack, on the other hand, costs more area. In general, we see that the industrial tool uses less area. While the Fast Buffering algorithm finds solutions without any electrical violation, both other tools create violations. The experiment was performed on an otherwise empty machine to make the running times comparable. The running time of our Fast Buffering version is 1.26 seconds. The IBM tool uses 1.78 seconds. The dynamic program needs 113.8 seconds with 40 buckets and 3.2 seconds without buckets. The results of the version without buckets using the same setup are shown in Table 6.2. All running times include identifying and replacing instances, not only the core algorithm.

                 Industrial Tool             BonnRepeaterTrees           BonnRepeaterTrees + DP
     Sinks    SNS    Slack  Area  Vio      SNS    Slack  Area  Vio      SNS    Slack  Area  Vio
I01      1   −400    −400    106  0/0     −379    −379    160  0/0     −377    −377    160  0/0
I02      1    −91     −91     12  0/0      −91     −91     10  0/0      −91     −91     14  0/0
I03      2   −475    −243    112  0/0     −417    −213    144  0/0     −413    −206    150  0/0
I04      2   −156     −82      6  0/0     −153     −85      8  0/0     −161     −81      6  0/0
I05      3   −157     −66      3  0/0     −140     −63     11  0/0     −143     −61     11  0/0
I06      3   −167     −71     25  0/0     −183     −71     43  0/0     −162     −68     35  0/0
I07      4    −25     −16     13  0/0      −26     −14     20  0/0      −22     −11     25  0/0
I08      4   −141     −40      6  0/0     −136     −40      8  0/0     −133     −36      8  0/0
I09      5   −368    −118     23  0/0     −349    −116     38  0/0     −351    −113     40  0/0
I10      8   −190     −30     14  0/0     −127     −20     28  0/0     −117     −20     34  0/0
I11     10   −368     −74     25  0/0     −428     −68     20  0/0     −387     −68     22  0/0
I12     15   −761     −72     68  0/0     −690     −62    107  0/0     −698     −59    116  0/2
I13     24    −74     −16     31  0/0      −79     −16     23  0/0      −60      −8     38  0/0
I14     33  −3799    −238    123  0/1    −3327    −197    228  0/0    −3167    −189    144  0/0
I15     47  −2998    −140     69  0/0    −2833    −108    113  0/0    −2922    −109    101  0/0
I16     65  −3354     −99     40  0/0    −2784     −96    127  0/0    −3852     −96     48  0/0
I17     73  −3208    −102     86  0/0    −4358     −86    165  0/0    −3424     −78    132  0/0
I18    120 −10988    −147    162  0/0    −9431    −107    256  0/0   −10701    −107    333  0/0
I19    322   −273     −29    145  0/25      −6      −3    342  0/0        0      25    851  0/0

Table 8.4: Results of optimizing for slack on several instances from a 22 nm design for the industrial tool, our Fast Buffering routine, and our dynamic program implementation that considers power consumption. All times are given in ps. SNS is the sum of negative slacks over all sinks. Slack is the worst slack of the instance. Area is the space consumed by the internal repeaters measured in placement grid steps. Vio gives the number of load and slew violations in the result.

8.2 Comparison to Bounds

It is hard to show the quality of our algorithm for the Repeater Tree Problem because optimal solutions are not known. Despite that, for some aspects of the solutions, such as netlength, number of inserted repeaters, and slack, we can compare the results to their respective bounds. We ran our algorithm on all instances for ξ values ranging from 0.0 to 1.0. The results are shown in Table 8.5.

  ξ      Length    Inversions   Length dev.           Slack dev.                Wall time
                                Avg.      Max.        Avg.        Max.
 0.0    739.174      1504726    2.0 %     837.7 %    12.07 ps    1419.46 ps      2301 s
 0.1    797.945      1587022    1.8 %     833.1 %     9.23 ps    1037.46 ps      2347 s
 0.2    882.063      1746005    1.8 %     832.0 %     7.48 ps     781.36 ps      2302 s
 0.3    967.900      1920838    1.8 %     841.3 %     6.15 ps     666.23 ps      2291 s
 0.4   1069.397      2076479    1.8 %     852.5 %     5.16 ps     534.88 ps      2280 s
 0.5   1177.334      2265218    1.8 %     874.0 %     4.36 ps     376.63 ps      2257 s
 0.6   1274.543      2547997    1.8 %     896.3 %     3.52 ps     352.00 ps      2213 s
 0.7   1479.679      2914892    1.9 %     877.7 %     2.79 ps     250.13 ps      2194 s
 0.8   1689.111      3412856    2.3 %     904.3 %     2.16 ps     212.43 ps      2156 s
 0.9   2027.596      4163780    3.6 %     926.3 %     1.60 ps     174.89 ps      2116 s
 1.0   2970.202      6254108   14.1 %    3321.0 %     1.14 ps     178.36 ps      2119 s

Table 8.5: Results of our repeater tree algorithm for different ξ values.

In general, we see that power consumption and netlength increase with higher ξ values. The slack deviation (see below) gets better, and even the running time decreases.
Table 8.5 summarizes all runs over all technologies and all numbers of sinks. We have added detailed tables in Appendix A where the data is separated by technology and number of sinks.

8.2.1 Running Time

We did not focus on running time during testing. The results show that the algorithm runs very fast. The average running time for a single instance is about 0.6 milliseconds, which means that we can solve about 5.7 million instances per hour. The wall time reported in Table 8.5 only contains the time used to build topologies and to insert repeaters. Overhead like identifying instances and adding the result into the design is not reported. Depending on the design, the overhead is between 100 % and 150 % of the running time used to solve the Repeater Tree Problem. The running time decreases slightly with higher ξ values. The more we try to optimize slack, the earlier repeaters are inserted to shield uncritical side paths from the critical ones. In addition, more repeaters are added along paths for timing reasons. The result is that the algorithm works with nets that have a smaller number of sinks. Due to the smaller instances, the subroutine that computes Steiner trees runs significantly faster.

8.2.2 Wirelength

A lower bound on the total wire length of a repeater tree instance is the length of a Steiner minimum tree spanning the root and all sinks. We computed one for all instances with fewer than 36 sinks. For bigger instances, we used the minimum of a Steiner tree heuristic guaranteeing a 3/2-approximation and the result of our routine over all test runs. Table 8.5 shows by how many percent we deviate from the optimal repeater tree length. The optimal length can be 0 if the root is a primary input pin and a single sink is directly below it. Let I be the set of instances with non-zero optimal length.
Given for each i ∈ I the length of our tree length(i) and the optimal length opt(i), Table 8.5 shows the result of

\[ \frac{1}{|I|} \sum_{i \in I} \frac{length(i) - opt(i)}{opt(i)}. \]

The average length increase compared to an optimal Steiner tree is quite low. However, there are some instances with huge wirelength increases. The detailed tables in Appendix A show that this only happens on instances that use the clustering preprocessing or with high ξ values, where we accept detours to keep bifurcations off the critical path. Due to parallel walk and instances with sinks of different parities, a deviation of almost 100 % can be optimal. Consider an instance with one negative sink close to the root and two other sinks, one negative and one positive, at some distance. At least one inverter is necessary to realize the negative parities, but we have the choice between bridging the distance twice or adding an additional inverter at the negative sink away from the root.

8.2.3 Number of Inserted Inverters

Part of this section is from Bartoschek et al. (2007b). To obtain a lower bound on the number of inverters needed to legally buffer a repeater tree instance, let Capextra arise from the sum of the wire capacitance of a minimum Steiner tree and the input capacitances of all sinks by subtracting the maximum capacitance that can be driven by the root with the given input slew such that the output slew is at most optslew. Every inserted repeater of type t can drive a certain amount loadlim(t) of this capacitance but also contributes its own input capacitance capin(t). Let MaxCap(t) be the biggest load the repeater can drive with an input slew of optslew such that the output slew is at most optslew. We may assume MaxCap(t) > capin(t). Therefore, if there is a legal inverter tree using xt inverters of type t, then

\[ Cap_{extra} + \sum_{t \in L} cap_{in}(t)\, x_t \le \sum_{t \in L} MaxCap(t)\, x_t \tag{8.1} \]

has to be satisfied.
Depending on whether we are interested in the number of inserted inverters, in their total area, or in power consumption, we can assign a cost ct ≥ 0 to each inverter type t ∈ L. We ask how well our algorithm minimizes this cost. To obtain a lower bound on the cost that any inverter tree must have, we consider the problem of minimizing the total cost

\[ \sum_{t \in L} c_t x_t \]

subject to (8.1) and xt ≥ 0 for all t ∈ L. This is a very simple linear program (LP). The dual LP is

\[ \max \; Cap_{extra}\, y \quad \text{subject to} \quad (MaxCap(t) - cap_{in}(t))\, y \le c_t \;\; \text{for all } t \in L, \quad y \ge 0. \]

If Capextra ≤ 0, then y* = 0 is the optimum solution of this LP. If Capextra > 0, then

\[ y^* = \min_{t \in L} \frac{c_t}{MaxCap(t) - cap_{in}(t)} \]

is optimum. By the LP duality theorem, the optimum value of the original (primal) LP is

\[ \sum_{t \in L} c_t x_t^* = Cap_{extra} \cdot \min_{t \in L} \frac{c_t}{MaxCap(t) - cap_{in}(t)}. \]

Of course, if we consider the number of inverters (i.e. ct = 1 for all t), we can round this lower bound up to the next integer. Three further modifications can improve this bound in some cases: First, if the lower bound is 0 but there is a sink of negative parity, we clearly need at least one inverter. Moreover, if the lower bound is 1 but all sinks have positive parity, we need at least two inverters. Finally, if there is only one sink, we can round the lower bound up to the next even or odd integer, depending on the sink's parity, + or −. The resulting minimum number of inverters that have to be inserted, summed over all of our instances, is 1150712. Table 8.5 shows the number of inversions we have added for different ξ values. Each buffer added on the 45 nm instances is counted as two inversions. With ξ = 0.0, we are only 31 % above the bound, which does not take slew propagation over wire segments into account. Getting better timing results increases the number of used inversions significantly, such that for ξ = 1.0 we are 540 % over the bound.
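The closed-form bound obtained from LP duality is straightforward to evaluate. The following sketch computes it for a given Capextra and repeater library; the type and function names are illustrative, not from the actual implementation:

```cpp
#include <algorithm>
#include <vector>

// Sketch of the lower bound derived above via LP duality: the bound equals
// Cap_extra * min_t c_t / (MaxCap(t) - cap_in(t)), and 0 if Cap_extra <= 0.
// Type and function names are illustrative stand-ins.
struct InverterType {
    double max_cap;  // MaxCap(t): largest load drivable at slew optslew
    double cap_in;   // capin(t): input capacitance
    double cost;     // c_t, e.g. 1 for counting inverters
};

double inverter_cost_lower_bound(double cap_extra,
                                 const std::vector<InverterType>& lib) {
    if (cap_extra <= 0.0 || lib.empty()) return 0.0;  // dual optimum y* = 0
    double y = lib.front().cost / (lib.front().max_cap - lib.front().cap_in);
    for (const InverterType& t : lib)  // y* = min_t c_t / (MaxCap(t) - capin(t))
        y = std::min(y, t.cost / (t.max_cap - t.cap_in));
    return cap_extra * y;              // optimum primal value by LP duality
}
```

For unit costs the result can be rounded up to the next integer, with the parity corrections described above applied afterwards.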
8.2.4 Timing

It is hard to obtain an upper bound on the slack that one can achieve for a single repeater tree instance. We have chosen the following approach. For instances with a single sink, we build repeater trees with the highest effort and different parameter sets. We use our algorithm and also the dynamic program algorithm for post optimization. We choose the maximum slack we get over all different runs on the same instance as an upper bound. Admittedly, it is not proven that this bound is indeed an upper bound. However, we are confident that the real upper bound is not far away. Any provable upper bound we can conceive is too far away from the actually achievable results because it has to be based on overly optimistic assumptions. For instances with two or more sinks, we construct a test instance for each sink. Each test instance consists of the original sink and an additional sink that corresponds to the input pin of the repeater with the smallest input capacitance. The additional sink is located at the root pin and has infinite required arrival time. The setup models the smallest impact that shielding off all other sinks can have on a critical sink. Figure 8.1 shows the setup.

Figure 8.1: Test setup used for computing slack upper bounds.

Similar to instances with a single sink, we then build the best possible repeater tree we can achieve for each test instance. The maximum slack achievable over all runs is used as the upper bound for the corresponding sink. The minimum slack bound over all sinks is then used as the upper bound for the achievable slack of the repeater tree instance. As in the single sink case, it is not guaranteed that we get a real upper bound. However, it is not possible to achieve the best possible slack at all sinks simultaneously. The more sinks with similar criticality an instance has, the worse the achievable slack is compared to the upper bound.
Table 8.5 shows the effect of ξ on the resulting slack deviation from the upper bound. While going to the extreme might be undesirable due to the large increase in netlength and power consumption, using a ξ of up to 0.9 can be justified by the slack improvements.

8.3 Fast Buffering vs. Dynamic Programming

We use our version of the dynamic programming algorithm as a post-optimization step for the most critical instances. Table 8.6 shows how the Fast Buffering algorithm compares to the dynamic program. The Fast Buffering algorithm is run with ξ = 1.0. The table summarizes the results on our testbed for all technologies, separated by the number of sinks. Slack deviation and length deviation are computed in the same way as in Section 8.2. The power columns sum up the power consumption and the number of repeaters inserted. Buffers are not counted twice here.

          | Slack Deviation (ps)         | Length Deviation (%)
          | No DP         | DP           | No DP         | DP
# Sinks   | Avg.    Max.  | Avg.   Max.  | Avg.   Max.   | Avg.   Max.
1         |  0.32  34.23  |  0.16  43.72 |   0.0    0.0  |   0.0    0.0
2         |  1.55  31.78  |  1.08  31.27 |   0.0    0.0  |   0.0    0.0
3         |  2.50  37.42  |  1.69  36.92 |  14.2   99.3  |  14.2   99.3
4         |  2.98  35.41  |  2.14  32.18 |  22.8  183.5  |  22.8  183.5
5         |  3.71  41.38  |  2.55  36.78 |  23.4  176.7  |  23.3  176.7
6         |  4.14  57.69  |  2.73  57.49 |  12.6  192.0  |  12.6  192.0
7         |  4.98  92.07  |  3.34  84.38 |  32.8  314.4  |  32.7  296.6
8         |  7.42 101.25  |  4.82  83.78 |  31.8  255.8  |  31.8  255.8
9–20      |  7.74  73.60  |  5.25  61.00 |  30.7  585.8  |  30.6  585.8
21–50     | 13.92  92.16  |  9.32  71.64 |  51.1 1352.1  |  50.8 1337.8
51–100    | 20.55 109.36  | 13.62  82.43 |  75.4 1755.5  |  74.6 1747.7
101–250   | 21.22  86.98  | 14.07  80.86 |  75.7 2110.0  |  74.8 2042.0
251–500   | 25.57 148.72  | 18.49 109.02 |  82.9  926.1  |  81.5  915.7
501–1000  | 51.36 178.36  | 35.06 132.57 | 221.6 1643.9  | 219.6 1637.6
> 1000    | 56.81 163.90  | 30.47  99.67 | 387.0 1066.1  | 376.9 1065.5
Total     |  1.25 178.36  |  0.81 132.57 |  16.4 2110.0  |  16.3 2042.0
Total > 2 |  4.87 178.36  |  3.30 132.57 |  33.4 2110.0  |  33.2 2042.0

          | Power                                | Running Time
          | No DP            | DP                | No DP          | DP
# Sinks   | Pwr.     Rpt.    | Pwr.     Rpt.     | Top.   Buf.    | Top.    Buf.
1         |  500.412 1318262 |  545.606 1472938  |  48.28  420.46 |  52.36  9870.21
2         |  136.984  507048 |  148.558  625075  |  11.12  115.85 |  11.91  2805.00
3         |  128.043  462522 |  145.060  580626  |   8.99  102.82 |   9.66  2393.74
4         |  100.239  303944 |  124.410  410043  |   6.96   83.89 |   7.36  1864.14
5         |   53.277  190835 |   62.546  266861  |   4.04   51.98 |   4.26  1031.52
6         |   71.950  189147 |   82.125  268407  |   3.56   48.01 |   3.79  1328.70
7         |   95.207  194208 |  106.697  284789  |   3.56   51.15 |   3.75  1623.05
8         |  152.175  268542 |  167.032  369478  |   4.17   59.68 |   4.34  2496.50
9–20      |  118.038  739550 |  131.693 1055801  |  16.74  225.67 |  17.24  3419.13
21–50     |  143.395  560739 |  156.515  776370  |  18.61  192.43 |  19.04  3187.63
51–100    |   36.407  363539 |   36.224  518275  |  18.43  133.07 |  19.00  1542.92
101–250   |   25.143  326046 |   24.841  474345  |  27.78  117.25 |  28.51  1256.83
251–500   |   10.403   78538 |   11.826  116911  |  20.37   36.41 |  20.79   394.31
501–1000  |   14.425   28869 |   15.619   44360  |  18.83   13.99 |  19.35   285.59
> 1000    |   18.945   50864 |   98.136  273687  |   5.22    7.55 |   5.41  1315.06
Total     | 1605.045 5582653 | 1856.887 7537966  | 216.65 1660.21 | 226.75 34814.33
Total > 2 |  967.648 3757343 | 1162.723 5439953  | 157.26 1123.89 | 162.49 22139.11

Table 8.6: Comparison between Fast Buffering and our dynamic program.
The running times are given separately for topology generation and buffering. As expected, the average slack deviation of the dynamic program is smaller. It should be used to squeeze the last tenth of a picosecond out of the most critical instances. In general, its running time is more than ten times that of Fast Buffering. In addition, the power consumption increases significantly; however, the additional power might be necessary to obtain the better slacks. Netlengths are very similar for both algorithms.

8.4 Varying η

  η    | Power    | Repeaters | Length Deviation   | Slack Deviation
       |          |           | Avg.      Max.     | Avg.      Max.
 0.00  | 1835.132 | 7415542   | 16.0 %   2042.0 %  | 0.81 ps  132.57 ps
 0.05  | 1611.987 | 5700476   | 15.0 %   1301.6 %  | 1.24 ps  148.93 ps
 0.10  | 1598.349 | 5644917   | 15.4 %   1491.2 %  | 1.23 ps  224.05 ps
 0.15  | 1596.091 | 5596761   | 15.7 %   1688.2 %  | 1.24 ps  201.43 ps
 0.20  | 1595.128 | 5555896   | 16.0 %   1759.2 %  | 1.24 ps  216.41 ps
 0.25  | 1592.330 | 5519986   | 16.1 %   2110.0 %  | 1.25 ps  178.36 ps
 0.30  | 1593.975 | 5505832   | 16.4 %   2367.6 %  | 1.26 ps  221.72 ps
 0.35  | 1592.487 | 5499882   | 16.4 %   2481.0 %  | 1.26 ps  195.71 ps
 0.40  | 1591.764 | 5497906   | 16.4 %   2534.0 %  | 1.26 ps  204.04 ps
 0.45  | 1590.603 | 5492262   | 16.4 %   2246.3 %  | 1.26 ps  203.28 ps
 0.50  | 1589.791 | 5487923   | 16.3 %   2117.9 %  | 1.26 ps  196.78 ps

Table 8.7: Results of our repeater tree algorithm for different η values.

The next set of experiments compares the results of our algorithm optimizing for slack (ξ = 1.0) but with different η parameters. Table 8.7 shows the resulting power consumption, number of repeaters, length deviation, and slack deviation. In general, higher values of η consume less power but lead to higher netlength and higher slack deviation. Our preferred value of 0.25 seems to be a reasonable choice; smaller values like 0.15 and 0.20 are also good candidates.

A choice of η = 0.0 is a special case. It allows the whole dnode to be assigned to a single branch. Thus, all sinks whose criticality, compared to the most critical sink, is better than dnode can be connected to the critical sink without degrading the required arrival time.
This leads to the degenerate case that it is favourable to add bifurcations at the most critical sink. During buffering, a lot of repeaters are added to reduce the impact of these additional bifurcations, which explains the high power consumption of this run. Somewhat surprisingly, the average slack deviation is best with η = 0.0.

8.5 Varying dnode

 dnode-factor | Power   | Repeaters | Length Deviation  | Slack Deviation
              |         |           | Avg.     Max.     | Avg.     Max.
 0.0          | 147.096 | 2412342   | 11.7 %   225.7 %  | 6.41 ps   97.72 ps
 0.5          | 136.963 | 2296316   | 25.5 %  1181.6 %  | 4.71 ps   78.19 ps
 1.0          | 135.092 | 2264650   | 32.9 %  2110.0 %  | 4.52 ps   84.31 ps
 1.5          | 138.993 | 2321192   | 40.8 %  2154.6 %  | 4.52 ps  115.26 ps
 2.0          | 144.373 | 2406360   | 48.9 %  2486.2 %  | 4.63 ps  112.52 ps
 2.5          | 149.157 | 2489224   | 56.0 %  2560.3 %  | 4.73 ps  127.15 ps

Table 8.8: Results of our repeater tree algorithm for different dnode scaling factors. We only consider instances with more than two sinks.

In Chapter 5 we claimed that it is necessary to add a bifurcation delay to our delay model to take additional capacitance on side paths into account. Table 8.8 shows the results of scaling the precomputed dnode value by various factors, optimizing our instances with ξ = 1.0. We only consider instances with more than two sinks, because dnode has little effect on two-sink instances and no effect on single-sink instances.

Disabling dnode altogether is good for the repeater tree length because no detours are added to avoid additional delay on paths to critical sinks. On the other hand, the average slack deviation goes up significantly, and a lot of repeaters are added to shield off critical sinks. If dnode gets too big, the algorithm tends to build more balanced topologies: detours are accepted to decrease the number of bifurcations on root–sink paths, resulting in high netlength and a high repeater count. Altogether, our current choice of dnode with factor 1.0 has the lowest repeater count and slack deviation.
The additional netlength is acceptable because it leads to fewer repeaters.

8.6 Disabling Effort Assignment

All experiments so far used the assign-effort step (see Section 6.1) to reduce the power consumption in tree parts that are above the slack target. To evaluate the effects of this step, we compare against runs where the step was skipped, using ξ = 1.0. Table 8.9 shows how much power can be saved due to AssignEffort. We see that for some designs the power savings are significant; the potential depends on how critical the timing of the design is. The table also shows that on some instances the power gets worse, which is due to the inexact repeater insertion. Most instances, however, are completely below the slack threshold, so AssignEffort does not apply.

            | Repeaters           |               | pwr(A) − pwr(N)
 Design     | Assign   No Assign  | pwr(A)/pwr(N) | < 0     = 0      > 0
 Baldassare |  28151     30687    | 88.87 %       |   1849    18108    668
 Beate      |  53198     55264    | 95.16 %       |   1588    31938   1104
 Wolfram    |  40469     45411    | 89.68 %       |   2694    40816   1247
 Gerben     |  44680     45852    | 98.47 %       |   1192    42300   1295
 Luciano    | 165743    179219    | 89.38 %       |   8619    89374   3802
 Benedikt   | 183350    289006    | 60.38 %       |  32021   201440   6887
 Renaud     | 155931    152752    | 90.73 %       |   7492   256103   5080
 Julius     | 604506    666291    | 89.18 %       |  25790   250624   9164
 Franziska  | 631607    667934    | 92.86 %       |  14515   306362   9182
 Meinolf    | 588236    614779    | 93.55 %       |  18623   335051  12974
 Iris       | 963296   1026381    | 92.85 %       |  18016   365939   9861
 Gautier    | 641085   1179108    | 49.22 %       | 186116  1079702   1593

Table 8.9: Repeaters used and power consumption in runs with and without AssignEffort. pwr(A) (pwr(N)) is the power consumption of the run with (without) AssignEffort. The last three columns show on how many instances the power has improved (< 0), stayed the same (= 0), or degraded (> 0) due to AssignEffort.
            | Slack Degradation   | slk(A) − slk(N)
 Design     | Assign   No Assign  | < 0     = 0      > 0
 Baldassare |  0.54      0.53     |   594     19678    353
 Beate      |  1.27      1.28     |   818     33189    623
 Wolfram    |  0.43      0.46     |   717     43166    874
 Gerben     |  1.79      1.79     |   832     42300    676
 Luciano    |  0.56      0.56     |  2539     97663   1593
 Benedikt   |  0.08      0.08     |  1041    238633    674
 Renaud     |  0.20      0.18     |  1533    266437    705
 Julius     |  0.77      0.76     |  7049    273785   4744
 Franziska  |  0.98      0.97     |  6219    320887   3553
 Meinolf    |  0.60      0.56     |  8225    355124   3299
 Iris       |  0.51      0.49     |  5435    385838   2543
 Gautier    |  0.02      0.00     |   989   1266301    121

Table 8.10: Slack degradation in runs with and without AssignEffort. The last three columns show for how many instances the slack below the slack target got worse, stayed the same, or got better due to AssignEffort.

Table 8.10 shows how the slack is affected by AssignEffort. The slack degradation is measured in the same way as in Section 8.2; the overall average degradation is shown. The effect of AssignEffort is that for a lot of instances the slack gets closer to the slack target. Due to the heuristic nature of our buffer insertion, some slacks can get worse than the slack target. This is also the reason for improvements on some nets: a different repeater insertion on an uncritical side path can accidentally lead to a better buffering of the critical path. Overall, the vast majority of instances does not get worse.

8.7 Disabling Parallel Mode

A large part of the complexity of our repeater insertion routine lies in handling clusters in parallel mode. Buffering algorithms that use a variant of van Ginneken's algorithm, on the other hand, do not change the topology during processing and still produce good results. We have to ask whether using parallel clusters is worth the effort. We processed the test instances with a variant of our algorithm that is not allowed to enter parallel mode; merge solutions that would enter parallel mode are disabled. Table 8.11 shows the results of the comparison, summed up over all technologies.
The table compares the slack deviation from our upper bound, the power consumption and number of inserted inverter stages, and the deviation from the minimum Steiner tree length for both runs. In general, the slack gets better if we are allowed to walk in parallel. The static power consumption is reduced by about 20 %. The average increase in netlength, which corresponds to an increase in dynamic power consumption, is quite small; an increase is expected if we are allowed to go in parallel, due to segments that are used twice. As we have disabled the merge cases that result in a parallel walk, the work done during merging decreases, and accordingly the running time is smaller. Given the big improvements in slack and static power consumption, we accept the small degradations in netlength and running time.

          | Slack Deviation (ps)          | Length Deviation (%)
          | Parallel      | No Parallel   | Parallel       | No Parallel
# Sinks   | Avg.    Max.  | Avg.    Max.  | Avg.    Max.   | Avg.    Max.
1         |  0.28  36.20  |  0.28  36.20  |   0.0     0.0  |   0.0     0.0
2         |  1.53  41.61  |  2.43  43.85  |   0.0     0.0  |   0.0     0.0
3         |  1.83  37.42  |  3.20  43.55  |  14.7    99.8  |  14.7    99.8
4         |  2.99  60.79  |  5.72  60.79  |  15.9   183.5  |  15.8   183.5
5         |  3.40  72.62  |  5.80  81.64  |  30.1   196.6  |  29.8   196.6
6         |  4.00  62.55  |  7.03  62.55  |  39.2   198.4  |  39.0   198.4
7         |  6.27  92.07  |  9.49  93.68  |  31.8   314.4  |  31.4   314.4
8         |  7.32 101.25  |  9.62  88.82  |  27.4   255.8  |  27.0   307.3
9–20      |  7.24  73.60  | 10.13  76.63  |  25.6   585.8  |  24.0   585.8
21–50     | 12.24  92.16  | 15.50 112.70  |  44.5  1352.1  |  41.7  1347.0
51–100    | 18.17 109.36  | 24.55 110.56  |  65.5  1755.5  |  60.4  1755.6
101–250   | 19.50 105.18  | 28.24 114.12  |  70.2  2110.0  |  63.8  2041.7
251–500   | 26.48 154.40  | 38.84 148.21  | 110.3  3321.0  |  99.5  3196.4
501–1000  | 73.19 232.30  | 89.59 198.38  | 229.1  1643.9  | 220.9  1636.7
> 1000    | 59.89 163.90  | 74.60 177.99  | 380.9  1096.2  | 372.2  1086.4
Total     |  1.21 232.30  |  1.79 198.38  |  16.2  3321.0  |  15.7  3196.4
Total > 2 |  4.14 232.30  |  6.50 198.38  |  31.5  3321.0  |  30.6  3196.4

          | Power                                | Running Time
          | Parallel         | No Parallel       | Parallel       | No Parallel
# Sinks   | Pwr.     Rpt.    | Pwr.     Rpt.     | Top.   Buf.    | Top.   Buf.
1         |  738.292 1545171 |  738.291 1545171  |  58.77  513.68 |  60.57  521.59
2         |  476.554  814674 |  513.288  852053  |  16.21  176.20 |  16.21  167.21
3         |  310.435  717013 |  339.320  764715  |  14.50  166.36 |  14.34  154.34
4         |  271.303  528539 |  362.414  574810  |  12.06  150.73 |  11.78  133.12
5         |  115.726  250945 |  137.936  277530  |   5.14   66.42 |   5.06   57.58
6         |  293.898  350513 |  337.360  381772  |   5.21   71.96 |   5.16   62.54
7         |  220.964  294115 |  259.513  329636  |   4.81   68.37 |   4.78   56.56
8         |  199.535  306462 |  224.720  329179  |   4.75   67.55 |   4.71   58.32
9–20      |  179.345  810857 |  225.163  912467  |  19.69  263.99 |  19.78  218.87
21–50     |  187.628  612392 |  219.563  682592  |  23.55  246.82 |  23.39  204.03
51–100    |   54.287  392700 |   71.303  469790  |  20.62  150.97 |  20.93  118.91
101–250   |   40.271  338506 |   53.588  420841  |  34.20  146.51 |  34.46  114.15
251–500   |   20.481   84870 |   25.198  112493  |  26.52   51.69 |  26.31   38.03
501–1000  |   32.716   39868 |   38.808   49175  |  40.53   25.18 |  39.83   18.70
> 1000    |   21.881   53972 |   27.224   60018  |   5.34    8.04 |   5.37    6.05
Total     | 3163.316 7140597 | 3573.690 7762242  | 291.89 2174.45 | 292.70 1929.99
Total > 2 | 1948.470 4780752 | 2322.111 5365018  | 216.92 1484.57 | 215.92 1241.19

Table 8.11: Testing the effects of parallel mode.

8.8 Choosing Tradeoff Parameters

So far, we used our power-slack parameter ξ uniformly in all parts of the algorithm. However, as explained in Section 7.3.2, it is possible to use different parameters for topology generation (ξt), repeater insertion (ξr), and buffering mode selection (ξm). We varied all three parameters independently from 0.0 to 1.0 in steps of 0.1 and optimized each design for each combination. The input netlists were placed, repeater trees were optimized for power, and gate sizing was performed. For each of the resulting 11 × 11 × 11 runs we measured the resulting sum of negative endpoint slacks (SNS), area consumption (which correlates with power consumption), netlength, and worst slack. The measurements are not restricted to repeater trees: SNS is read at the timing points where tests are performed, all nets are counted for the netlength, and area consumption counts all gates in the design.

 SNS         | Worst Slack  | Area    | Netlength      | ξt   ξr   ξm
 -2460784 ps | -360.960 ps  | 4264514 | 11421392775 nm | 1.0  0.7  1.0
 -2503732 ps | -371.797 ps  | 4161722 | 11396585038 nm | 1.0  0.6  1.0
 -2526592 ps | -368.434 ps  | 4139336 | 11384182618 nm | 1.0  0.7  0.9
 -2529250 ps | -361.152 ps  | 4039524 | 11357719162 nm | 1.0  0.6  0.9
 -2581487 ps | -400.693 ps  | 4003001 | 10823042062 nm | 0.9  0.6  0.9
 -2587899 ps | -351.772 ps  | 3977094 | 11315124985 nm | 1.0  0.6  0.8
 -2598980 ps | -345.037 ps  | 3940266 | 11287051361 nm | 1.0  0.6  0.7
 -2661338 ps | -357.609 ps  | 3909961 | 11255902718 nm | 1.0  0.6  0.6
 -2692252 ps | -381.317 ps  | 3882043 | 11247554519 nm | 1.0  0.5  0.7
 -2721807 ps | -380.967 ps  | 3856202 | 11222038997 nm | 1.0  0.5  0.6
 -2774560 ps | -381.654 ps  | 3841977 | 11215744785 nm | 1.0  0.4  0.7
 -2812004 ps | -378.829 ps  | 3816806 | 11186602609 nm | 1.0  0.4  0.6
 -2909451 ps | -402.322 ps  | 3815549 | 10683834075 nm | 0.7  0.5  0.5
 -2912460 ps | -402.322 ps  | 3814486 | 10667397255 nm | 0.6  0.5  0.5
 -2914087 ps | -408.144 ps  | 3802287 | 10777654190 nm | 0.9  0.4  0.6
 -2933527 ps | -422.991 ps  | 3797866 | 11155820658 nm | 1.0  0.4  0.5
 -2963781 ps | -397.433 ps  | 3795544 | 11146478260 nm | 1.0  0.3  0.6
 -2975351 ps | -415.343 ps  | 3792524 | 11131954147 nm | 1.0  0.4  0.4
 -2985396 ps | -403.745 ps  | 3785308 | 10769355282 nm | 0.9  0.4  0.5
 -3006647 ps | -387.869 ps  | 3784393 | 10702389423 nm | 0.8  0.4  0.5
 -3022317 ps | -381.193 ps  | 3780095 | 10764354020 nm | 0.9  0.4  0.4
 -3052016 ps | -420.933 ps  | 3779169 | 11128458742 nm | 1.0  0.3  0.5
 -3055216 ps | -414.426 ps  | 3774567 | 11104510839 nm | 1.0  0.3  0.4
 -3096810 ps | -387.917 ps  | 3769622 | 10762057030 nm | 0.9  0.3  0.5
 -3136457 ps | -434.217 ps  | 3765553 | 10756910041 nm | 0.9  0.3  0.4
 -3154245 ps | -407.107 ps  | 3765336 | 10696614904 nm | 0.8  0.3  0.4
 -3157275 ps | -455.027 ps  | 3763362 | 11074617288 nm | 1.0  0.3  0.0
 -3200980 ps | -411.824 ps  | 3759277 | 10753998314 nm | 0.9  0.3  0.3
 -3273230 ps | -433.453 ps  | 3759111 | 10697382852 nm | 0.8  0.3  0.3
 -3282542 ps | -427.694 ps  | 3756085 | 10748223421 nm | 0.9  0.3  0.0
 -3419364 ps | -431.093 ps  | 3754110 | 10734650521 nm | 0.9  0.2  0.0

Table 8.12: Non-dominated parameter sets on the Franziska design.

 SNS         | Worst Slack  | Area    | Netlength     | ξt   ξr   ξm
 -5196277 ps | -631.933 ps  | 4527479 | 8925535585 nm | 0.9  0.8  0.9
 -5257051 ps | -635.508 ps  | 4509205 | 8819283865 nm | 0.8  0.8  0.9
 -5269220 ps | -635.053 ps  | 4472708 | 8785736183 nm | 0.8  0.7  1.0
 -5365769 ps | -635.359 ps  | 4348123 | 8906582047 nm | 0.9  0.7  0.9
 -5383743 ps | -638.916 ps  | 4336342 | 8801360797 nm | 0.8  0.7  0.9
 -5449405 ps | -636.583 ps  | 4247059 | 8816022521 nm | 0.8  0.7  0.8
 -5503767 ps | -653.201 ps  | 4197911 | 8928168330 nm | 0.9  0.7  0.7
 -5546630 ps | -635.822 ps  | 4157311 | 8907015090 nm | 0.9  0.6  0.8
 -5621983 ps | -637.161 ps  | 4143570 | 8775993529 nm | 0.7  0.6  0.8
 -5653641 ps | -666.391 ps  | 4136134 | 8934564389 nm | 0.9  0.7  0.5
 -5666127 ps | -649.055 ps  | 4090594 | 8907969542 nm | 0.9  0.6  0.7
 -5725306 ps | -649.055 ps  | 4082070 | 8817992649 nm | 0.8  0.6  0.7
 -5780351 ps | -642.621 ps  | 4079167 | 8786841694 nm | 0.7  0.6  0.7
 -5807757 ps | -658.837 ps  | 4039493 | 8825447381 nm | 0.8  0.6  0.6
 -5860053 ps | -663.175 ps  | 4035272 | 8794310975 nm | 0.7  0.6  0.6
 -5919674 ps | -657.425 ps  | 4031406 | 8777209504 nm | 0.6  0.6  0.6
 -6007460 ps | -661.243 ps  | 4004925 | 8906715403 nm | 0.9  0.5  0.7
 -6071154 ps | -666.057 ps  | 3964178 | 8904311050 nm | 0.9  0.5  0.6
 -6130646 ps | -677.938 ps  | 3953669 | 9410850288 nm | 1.0  0.5  0.5
 -6189021 ps | -666.374 ps  | 3948620 | 8776106642 nm | 0.6  0.5  0.6
 -6306196 ps | -660.793 ps  | 3919318 | 9393366028 nm | 1.0  0.5  0.4
 -6434807 ps | -682.030 ps  | 3896112 | 8898249360 nm | 0.9  0.5  0.4
 -6512938 ps | -695.812 ps  | 3874760 | 8893895626 nm | 0.9  0.5  0.3
 -6635688 ps | -662.423 ps  | 3872070 | 9372604172 nm | 1.0  0.4  0.4
 -6639647 ps | -759.335 ps  | 3859165 | 9345686625 nm | 1.0  0.5  0.0
 -6656564 ps | -673.105 ps  | 3851146 | 9357824853 nm | 1.0  0.4  0.3
 -6674811 ps | -725.778 ps  | 3847456 | 8897120664 nm | 0.9  0.5  0.1
 -6694420 ps | -766.073 ps  | 3838560 | 8897072449 nm | 0.9  0.5  0.0
 -6745947 ps | -718.396 ps  | 3833113 | 8893540101 nm | 0.9  0.4  0.3
 -6836574 ps | -728.312 ps  | 3827277 | 8829467006 nm | 0.8  0.4  0.3
 -6864691 ps | -737.688 ps  | 3819098 | 8895643021 nm | 0.9  0.4  0.2
 -7008116 ps | -738.604 ps  | 3807331 | 8893223837 nm | 0.9  0.4  0.1
 -7052258 ps | -732.529 ps  | 3803516 | 8831553876 nm | 0.8  0.4  0.1
 -7104017 ps | -762.535 ps  | 3798011 | 8887622641 nm | 0.9  0.4  0.0
 -7165642 ps | -741.704 ps  | 3781045 | 9280930291 nm | 1.0  0.3  0.0
 -7306835 ps | -778.668 ps  | 3769847 | 8875640716 nm | 0.9  0.3  0.0
 -7430083 ps | -778.668 ps  | 3769836 | 8825179592 nm | 0.8  0.3  0.0
 -7819054 ps | -844.052 ps  | 3765793 | 8852407160 nm | 0.9  0.2  0.1
 -8071842 ps | -797.117 ps  | 3759992 | 8848223450 nm | 0.9  0.2  0.0

Table 8.13: Non-dominated parameter sets on the Iris design.

Table 8.12 and Table 8.13 show the non-dominated parameter sets (ξt, ξr, ξm) for two example designs. A set is non-dominated if there is no other set with better or equal SNS and better or equal area. Area consumption is measured in units internal to the placement engine; it can only be compared relatively. The tables indicate that choosing ξt above 0.7 is preferable even if one does not care about timing. On the other hand, ξm and ξr scale nicely between power-aware and slack-optimizing repeater trees.

A Detailed Comparison Tables

1 2 3 4 5 6 7 8 9–20 21–50 51–100 101–250 251–500 501–1000 > 1000 Total Total >2 ξ = 0.0 ξ = 0.1 ξ = 0.2 ξ = 0.3 ξ = 0.4 ξ = 0.5 ξ = 0.6 ξ = 0.7 ξ = 0.8 ξ = 0.9 ξ = 1.0 DP Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max.
10.64 247.71 25.26 256.22 36.01 220.36 28.06 235.76 45.04 320.20 71.94 269.21 94.60 374.31 109.82 584.53 88.28 557.50 127.30 480.46 203.58 693.86 275.94 569.89 312.17 701.41 670.98 1329.86 367.51 1082.28 9.44 213.35 21.34 193.40 29.37 171.63 23.00 165.66 34.24 292.29 56.64 184.60 63.26 262.16 72.70 584.53 58.02 557.50 93.22 266.21 132.33 350.25 155.55 363.89 171.85 414.25 357.14 696.60 271.57 849.02 7.48 190.29 16.90 155.58 22.61 141.15 18.09 131.52 25.04 253.06 44.09 168.54 47.82 214.23 55.14 273.53 41.49 227.28 74.47 237.06 97.18 339.29 111.38 259.88 120.24 266.72 271.76 567.61 219.91 736.17 5.87 167.15 13.42 137.14 18.34 131.27 14.76 112.36 19.89 160.18 34.07 149.68 38.31 230.03 45.16 273.53 33.78 227.28 62.32 203.11 79.59 284.41 92.02 222.04 93.99 273.57 216.80 511.95 204.83 666.23 4.62 148.34 11.13 124.63 15.12 125.30 12.14 103.72 16.29 124.47 27.78 134.47 31.67 156.98 37.91 185.21 28.45 188.57 52.33 167.58 64.40 266.51 72.73 201.31 71.59 229.18 181.17 418.86 166.20 364.46 3.70 132.12 9.27 106.57 12.47 93.83 10.16 97.22 13.27 92.84 23.49 121.58 26.16 146.81 31.37 180.72 23.99 159.22 43.39 147.85 52.39 208.37 60.87 188.48 51.82 186.29 147.32 355.99 146.71 300.75 2.87 126.82 7.87 90.57 10.35 78.46 8.70 67.57 11.10 77.20 20.16 103.94 22.10 138.78 26.51 149.05 19.91 107.55 35.49 136.01 40.45 182.17 46.41 135.90 36.83 131.03 120.71 352.00 133.56 294.24 2.00 88.91 6.39 68.01 8.31 63.28 7.36 48.54 8.99 70.00 15.98 83.45 17.23 112.25 20.84 120.49 15.05 82.63 27.87 105.62 29.60 153.52 33.80 122.70 27.54 116.98 97.80 236.72 111.10 232.69 1.39 66.38 5.03 53.13 6.28 46.90 5.83 52.09 6.68 55.70 11.39 64.64 12.88 104.10 15.41 108.67 11.12 71.42 21.64 77.62 22.61 135.76 25.04 87.87 21.10 106.49 83.03 207.64 100.55 212.43 0.80 40.60 3.44 30.26 4.17 31.32 3.96 35.72 4.40 46.37 6.96 49.41 8.81 104.44 10.23 93.22 8.10 50.14 16.89 71.61 17.49 120.09 19.19 87.92 16.18 113.01 69.00 174.89 81.25 141.56 0.38 23.81 1.59 31.78 2.31 32.90 2.45 35.41 3.01 41.38 4.36 57.69 6.25 92.07 7.12 
101.25 6.60 73.60 15.01 92.16 15.56 109.36 17.13 81.82 16.39 148.72 65.18 178.36 66.35 117.80 0.07 43.72 0.94 31.27 1.41 36.92 1.72 32.18 2.03 27.92 2.27 57.49 3.70 84.38 4.13 83.78 4.79 61.00 9.95 71.64 11.18 77.88 12.27 69.03 12.46 95.53 46.11 132.57 36.37 83.70 Avg. Max. Avg. Max. 21.01 1329.86 53.64 1329.86 16.91 849.02 39.88 849.02 13.16 736.17 30.45 736.17 10.51 666.23 24.70 666.23 8.54 418.86 20.46 418.86 7.00 355.99 17.01 355.99 5.71 352.00 14.24 352.00 4.37 236.72 11.37 236.72 3.23 212.43 8.61 212.43 2.11 174.89 5.93 174.89 1.28 178.36 4.12 178.36 0.70 132.57 2.69 132.57 Table A.1: Worst Slack Deviation on 45 nm Instances 126 # Sinks 1 2 3 4 5 6 7 8 9–20 21–50 51–100 101–250 251–500 501–1000 > 1000 Total Total >2 ξ = 0.0 ξ = 0.1 ξ = 0.2 ξ = 0.3 ξ = 0.4 ξ = 0.5 ξ = 0.6 ξ = 0.7 ξ = 0.8 ξ = 0.9 ξ = 1.0 DP Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. Avg. Max. 0.0 0.0 0.0 0.0 0.1 89.5 0.4 94.9 2.5 96.2 0.9 97.6 3.7 97.6 4.1 99.4 7.7 97.9 8.9 86.2 20.3 84.9 32.5 92.1 18.7 49.5 24.0 58.5 373.5 837.7 0.0 0.0 0.0 0.0 0.1 92.1 0.4 96.1 2.6 97.1 0.9 97.6 2.2 98.5 3.1 96.7 6.1 97.9 8.5 86.2 19.6 83.0 31.9 92.1 21.7 54.8 26.8 60.5 381.5 833.1 0.0 0.0 0.0 0.0 0.1 92.1 0.5 96.1 2.5 97.1 0.9 97.6 2.2 98.5 2.9 96.7 5.9 96.1 8.7 86.2 19.6 83.0 31.6 85.2 22.8 51.7 27.8 57.6 380.5 832.0 0.0 0.0 0.0 0.0 0.1 92.1 0.5 96.1 2.4 96.8 0.9 97.6 2.0 98.5 2.7 94.9 5.9 95.8 8.9 86.2 19.4 83.0 30.9 85.2 24.7 51.5 28.6 57.5 384.1 841.3 0.0 0.0 0.0 0.0 0.1 92.1 0.6 96.1 2.5 96.8 0.9 97.6 1.9 96.0 2.6 94.5 6.0 95.8 9.1 86.2 19.4 86.5 31.8 85.2 24.0 49.0 28.4 57.4 385.6 852.5 0.0 0.0 0.0 0.0 0.2 92.1 0.7 96.1 2.5 96.8 0.9 97.6 1.8 98.9 2.5 94.3 6.0 94.9 9.6 89.4 19.5 78.5 32.4 86.6 25.6 49.0 27.6 52.9 388.1 874.0 0.0 0.0 0.0 0.0 0.3 92.1 0.8 94.1 2.4 96.8 0.9 93.9 1.7 98.9 2.4 93.9 5.9 93.2 9.6 89.4 18.1 84.3 29.4 76.2 26.2 54.9 26.5 42.9 387.0 896.3 0.0 0.0 0.0 0.0 0.5 92.1 1.2 87.3 2.8 96.8 1.2 
106.6 1.8 128.1 2.4 112.9 5.9 107.1 10.0 81.4 15.7 62.1 25.2 64.5 23.5 36.7 27.0 37.1 391.9 877.7 0.0 0.0 0.0 0.0 1.5 92.1 2.9 94.1 4.5 96.8 2.4 125.3 2.3 134.7 2.8 137.0 8.0 112.3 12.4 78.3 15.8 58.8 24.5 56.7 24.6 34.4 28.7 40.6 396.0 904.3 0.0 0.0 0.0 0.0 4.8 96.2 8.7 109.3 8.0 118.4 4.3 128.2 3.8 160.4 4.3 112.9 12.0 123.2 17.5 102.0 19.0 61.7 26.9 64.5 27.4 38.2 34.8 43.3 399.8 926.3 0.0 0.0 0.0 0.0 16.9 99.3 27.2 167.2 26.9 125.8 11.3 151.0 34.2 221.4 34.4 247.7 34.6 319.4 51.9 1352.1 52.1 313.1 58.1 638.6 58.2 609.6 286.7 1026.0 440.4 947.4 0.0 0.0 0.0 0.0 16.9 99.3 27.2 167.2 26.9 125.8 11.3 151.0 34.2 221.4 34.4 247.7 34.6 319.4 51.7 1337.8 51.9 310.7 57.8 638.0 57.9 609.6 284.3 1019.6 432.2 950.2 Avg. Max. Avg. Max. 2.0 837.7 4.3 837.7 1.8 833.1 3.7 833.1 1.7 832.0 3.6 832.0 1.7 841.3 3.6 841.3 1.8 852.5 3.7 852.5 1.8 874.0 3.7 874.0 1.8 896.3 3.6 896.3 1.8 877.7 3.8 877.7 2.4 904.3 5.0 904.3 4.0 926.3 8.2 926.3 14.8 1352.1 30.8 1352.1 14.8 1337.8 30.7 1337.8 Table A.2: Length Deviation on 45 nm Instances 127 # Sinks 1 2 3 4 5 6 7 8 9–20 21–50 51–100 A Detailed Comparison Tables 101–250 251–500 501–1000 > 1000 Total Total >2 ξ = 0.0 ξ = 0.1 ξ = 0.2 ξ = 0.3 ξ = 0.4 ξ = 0.5 ξ = 0.6 ξ = 0.7 ξ = 0.8 ξ = 0.9 ξ = 1.0 DP Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. Pwr. Rpt. 
141.378 80207 35.435 26199 25.558 28425 12.538 18647 8.873 11293 21.505 14298 25.252 15677 41.883 23675 15.400 19572 13.661 15082 1.605 1987 0.362 663 0.314 332 0.454 414 1.042 1515 147.234 90849 37.535 28785 26.851 30466 13.908 20776 9.619 12938 22.933 15964 27.492 17828 44.377 26320 17.629 22702 16.881 17837 2.035 2289 0.507 776 0.408 379 0.616 486 1.154 1625 162.069 111514 41.529 35294 30.084 36236 16.832 26546 11.013 15787 24.317 17974 29.111 20178 45.992 29053 19.565 26283 19.065 20688 2.269 2661 0.601 946 0.476 461 0.714 551 1.235 1727 179.284 138754 46.462 42839 33.203 42246 19.660 31550 12.198 17741 25.338 19963 29.750 21990 46.531 31618 20.870 29546 21.014 23561 2.498 3078 0.718 1100 0.540 550 0.797 632 1.315 1839 200.144 165707 51.158 48828 37.621 47224 23.012 35162 13.658 19586 26.962 21838 31.153 23623 49.262 34004 22.722 33796 23.858 27241 2.819 3738 0.854 1444 0.647 713 0.952 770 1.450 1997 223.675 194912 56.820 54944 42.734 53043 26.202 39926 15.302 21759 29.007 24019 32.811 25673 51.807 37235 24.571 37720 26.962 31799 3.221 4451 1.037 1786 0.765 893 1.103 951 1.585 2182 238.193 231389 59.704 61542 47.177 59630 29.267 44840 16.729 24364 29.875 26542 33.499 28088 52.388 40672 26.477 43845 30.415 38133 3.669 5835 1.262 2514 0.897 1106 1.334 1264 1.706 2405 287.045 281488 70.059 69676 54.582 67186 33.483 50020 18.940 27716 34.028 29856 38.156 31656 59.482 45849 30.710 52812 35.743 45762 4.381 7906 1.530 3341 1.021 1341 1.516 1384 1.984 2676 321.524 322208 78.370 79428 63.747 78419 40.067 60109 21.934 33213 38.036 35317 42.116 36564 65.767 52848 35.626 65158 43.316 57477 5.188 10107 1.862 4359 1.268 1840 1.914 1783 2.287 3013 371.676 371683 90.067 92518 77.255 95953 52.097 78893 26.998 41799 45.235 43514 48.643 43244 75.756 62585 44.132 83375 57.159 77829 6.605 13217 2.450 5773 1.644 2674 2.532 2565 2.836 3680 450.346 437870 115.505 131093 108.964 138495 89.763 131243 45.021 66649 63.878 64404 78.558 65219 121.124 94480 74.457 145130 112.488 157571 
12.945 26476 5.080 11548 3.796 7010 7.938 6913 4.362 4633 494.524 472559 126.756 151748 126.653 170192 113.599 172242 53.831 85141 73.540 85815 88.351 87882 134.596 126580 88.898 209131 128.392 234372 15.368 40344 6.056 18294 4.628 10606 8.589 10743 23.114 40911 Pwr. Rpt. Pwr. Rpt. 345.260 257986 168.447 151580 369.176 290020 184.407 170386 404.873 345899 201.275 199091 440.179 407007 214.433 225414 486.271 465671 234.969 251136 537.601 531293 257.106 281437 572.593 612169 274.695 319238 672.660 718669 315.556 367505 763.023 841843 363.129 440207 905.085 1019302 443.342 555101 1294.228 1488734 728.377 919771 1486.895 1916560 865.615 1292253 Table A.3: Power Consumption on 45 nm Instances 128 # Sinks 1 2 3 4 5 6 7 8 9–20 21–50 51–100 101–250 251–500 501–1000 > 1000 Total Total >2 ξ = 0.0 ξ = 0.1 ξ = 0.2 ξ = 0.3 ξ = 0.4 ξ = 0.5 ξ = 0.6 ξ = 0.7 ξ = 0.8 ξ = 0.9 ξ = 1.0 DP Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. Top. Buf. 
A Detailed Comparison Tables

[Tables A.4 through A.12 appear here as full numeric tables in the original; only their captions and layout survive this extraction. Each table breaks the results down by instance size class (# sinks: 1, 2, ..., 8, 9–20, 21–50, 51–100, 101–250, 251–500, 501–1000, > 1000, plus totals) and compares the topology parameter settings ξ = 0.0, 0.1, ..., 1.0 with the dynamic programming variant DP.]

Table A.4: Runtime on 45 nm Instances (columns: Top. / Buf.)
Table A.5: Worst Slack Deviation on 32 nm Instances (columns: Avg. / Max.)
Table A.6: Length Deviation on 32 nm Instances (columns: Avg. / Max.)
Table A.7: Power Consumption on 32 nm Instances (columns: Pwr. / Rpt.)
Table A.8: Runtime on 32 nm Instances (columns: Top. / Buf.)
Table A.9: Worst Slack Deviation on 22 nm Instances (columns: Avg. / Max.)
Table A.10: Length Deviation on 22 nm Instances (columns: Avg. / Max.)
Table A.11: Power Consumption on 22 nm Instances (columns: Pwr. / Rpt.)
Table A.12: Runtime on 22 nm Instances (columns: Top. / Buf.)

Summary

Repeaters, that is, inverters and buffers, are the logic gates that dominate modern chip designs. We see designs in which up to 50 % of all gates are repeaters. Repeaters are used during the physical design of chips to improve the electrical and timing properties of interconnections. They are added along Steiner trees that connect root gates to sinks, creating repeater trees. Their construction has become a crucial part of chip design and has great impact on all other parts, for example placement and routing.

We first present an extensive version of the Repeater Tree Problem. Our problem formulation encapsulates most of the constraints that have been studied so far. We also consider several aspects for the first time, for example slew-dependent required arrival times at repeater tree sinks. These make our formulation better suited to the challenges of real-world repeater tree construction.

To create good repeater trees, one has to take the overall design environment into account. The employed technology, the properties of the available repeaters and metal wires, the shape of the chip, the temperature, the voltages, and many other factors strongly influence the results of repeater tree construction. To take all this into account, we extensively preprocess the environment to extract parameters for our algorithms. These parameters allow us to estimate the timing of a tree quickly and yet quite accurately before it has even been buffered.
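A minimal sketch of how such a pre-buffering estimate could look, assuming the delay achievable after buffering grows roughly linearly in the wire length of a root-sink path plus a fixed penalty per branching point passed on the way. The parameter names DWIRE and DNODE echo the preprocessing chapter, but the numeric values and the tree encoding are invented for illustration, not the thesis's actual values or interfaces.

```python
# Hypothetical delay parameters in the spirit of the dwire/dnode preprocessing
# described above; the numbers are invented for illustration only.
DWIRE = 0.12   # estimated delay per unit of (to-be-buffered) wire length
DNODE = 8.0    # estimated extra delay per branching point on a root-sink path

def estimate_sink_delays(tree, root):
    """Estimate post-buffering delays for every sink of a topology given as
    {node: [(child, edge_length), ...]}; nodes without children are sinks."""
    delays = {}

    def walk(node, length, branches):
        children = tree.get(node, [])
        if not children:
            delays[node] = DWIRE * length + DNODE * branches
            return
        penalty = 1 if len(children) > 1 else 0  # each branching point costs DNODE
        for child, edge_len in children:
            walk(child, length + edge_len, branches + penalty)

    walk(root, 0.0, 0)
    return delays

# A root driving two sinks: one close (100 units), one far (300 units).
print(estimate_sink_delays({"r": [("a", 100.0), ("b", 300.0)], "a": [], "b": []}, "r"))
# → {'a': 20.0, 'b': 44.0}
```

Such a model can rank candidate topologies before any repeater has been placed, which is exactly what makes the timing estimate cheap.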
We present an algorithm for Steiner tree creation and prove that it is able to create timing-efficient as well as cost-efficient trees. Our algorithm is based on a delay model that accurately describes the timing one can achieve after repeater insertion. This makes it well suited for creating good Steiner trees as input for subsequent repeater insertion algorithms.

Next, we deal with the problem of adding repeaters to a given Steiner tree. The predominant algorithms for this problem use dynamic programming, but they have several drawbacks. Firstly, potential repeater positions along the Steiner tree have to be chosen upfront. Secondly, the algorithms strictly follow the given Steiner tree and miss optimization opportunities. Finally, dynamic programming causes high running times. We present our new buffer insertion algorithm, Fast Buffering, which overcomes these limitations. It produces results of similar quality to a dynamic programming approach with a much better running time. In addition, we present improvements to the dynamic programming approach that allow us to push the quality further at the expense of a higher running time.

We have implemented our algorithms as part of the BonnTools physical design optimization suite developed at the Research Institute for Discrete Mathematics in cooperation with IBM. Our algorithms are in production use and help engineers dealing with some of the most complex chips in the world. When we released the first version of our global optimization tools, it became possible for the first time to optimize chips with several million gates within reasonable running times, and designers were able to achieve compelling results. Our tools are not only used for global optimization in early design stages; for later stages of physical design, a more accurate version of our algorithm can be enabled that squeezes out the last tenth of a picosecond.
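The dynamic programming approach referred to here goes back to van Ginneken's classic buffer insertion algorithm. The following is a minimal sketch of its core idea, restricted to a single root-to-sink path with fixed candidate positions, one buffer type, and Elmore wire delay: each candidate solution is a (downstream capacitance, required arrival time) pair, and dominated pairs are pruned while walking from the sink toward the root. All electrical parameters are hypothetical placeholders, not values from the thesis.

```python
# Hypothetical electrical parameters, invented for illustration.
RW, CW = 0.1, 0.2           # wire resistance / capacitance per unit length
RB, CB, DB = 1.0, 0.5, 2.0  # buffer drive resistance, input cap, intrinsic delay

def prune(cands):
    """Keep only non-dominated (cap, rat) pairs: less load and more slack wins."""
    cands.sort(key=lambda p: (p[0], -p[1]))   # ascending cap, best rat first
    kept, best_rat = [], float("-inf")
    for cap, rat in cands:
        if rat > best_rat:
            kept.append((cap, rat))
            best_rat = rat
    return kept

def buffer_path(segments, sink_cap, sink_rat):
    """segments: wire lengths from the sink back to the root; a buffer may be
    inserted at the upstream end of each segment. Returns the best required
    arrival time at the root (root driver modeled with resistance RB)."""
    cands = [(sink_cap, sink_rat)]
    for length in segments:
        r, c = RW * length, CW * length
        # propagate every candidate through the wire segment (Elmore delay)
        cands = [(cap + c, rat - r * (c / 2.0 + cap)) for cap, rat in cands]
        # option: insert a buffer here, decoupling the downstream load
        buffered = [(CB, rat - DB - RB * cap) for cap, rat in cands]
        cands = prune(cands + buffered)
    return max(rat - RB * cap for cap, rat in cands)

print(buffer_path([5.0, 5.0], sink_cap=1.0, sink_rat=100.0))  # → 95.0
```

The drawbacks listed above show up directly in the sketch: the candidate positions (segment boundaries) are fixed upfront, the tree itself is never changed, and the candidate lists grow along the way, which drives the running time.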
Our implementation deals with all tedious details of a grown real-world chip optimization environment. At the same time, we offer a clean framework abstracting away the details such that new repeater tree construction algorithms can easily be implemented. As a side project, we implemented a blockage map that helps managing the free/blocked information not only for our algorithms but also other optimization tools within BonnTools. Recently, the congestion map that we implemented has been added as a fast mode to BonnRouteGlobal, the global routing engine used throughout the whole IBM physical design. We have created extensive experimental results on challenging real-world test cases provided by our cooperation partner. The testbed consists of more than 3.3 million different repeater tree instances. The average running time for a single instance is about 0.6 milliseconds, which means that we can solve about 5.7 million instances per hour. We also compare our implementation to an state-of-the-art industrial tool and show that our algorithm produces better results with less electrical violations. 140 B Bibliography Noga Alon and Yossi Azar. On-Line Steiner Trees in the Euclidean Plane. Discrete & Computational Geometry, 10:113–121, 1993. doi: 10.1007/BF02573969. Charles J. Alpert and Anirudh Devgan. Wire Segmenting for Improved Buffer Insertion. In Proceedings of the 34th Annual Design Automation Conference, DAC ’97, pages 588–593, New York, NY, USA, 1997. ACM. doi: 10.1145/266021.266291. Charles J. Alpert, Anirudh Devgan, and Stephen T. Quay. Buffer insertion with accurate gate and interconnect delay computation. In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, DAC ’99, pages 479–484, 1999. doi: 10.1145/309847.309983. Charles J. Alpert, R. Gopal Gandham, Jose L. Neves, and Stephen T. Quay. Buffer Library Selection. In Proceedings of the International Conference on Computer Design, pages 221–226, Los Alamitos, CA, USA, 2000. 
IEEE Computer Society. doi: 10.1109/ICCD.2000.878289. Charles J. Alpert, Anirudh Devgan, John P. Fishburn, and Stephen T. Quay. Interconnect Synthesis Without Wire Tapering. IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, 20(1):90–104, 2001a. doi: 10.1109/43.905678. Charles J. Alpert, Gopal Gandham, Jiang Hu, Jose L. Neves, Stephen T. Quay, and Sachin S. Sapatnekar. Steiner Tree Optimization for Buffers, Blockages, and Bays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(4):556–562, 2001b. doi: 10.1109/43.918213. Charles J. Alpert, Gopal Gandham, Miloš Hrkić, Jiang Hu, Andrew B. Kahng, John Lillis, Bao Liu, Stephen T. Quay, Sachin S. Sapatnekar, and A. J. Sullivan. Buffered Steiner Trees for Difficult Instances. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21(1):3–14, 2002. doi: 10.1109/TCAD.2005.858348. Charles J. Alpert, Gopal Gandham, Miloš Hrkić, Jiang Hu, Stephen T. Quay, and C. N. Sze. Porosity-Aware Buffered Steiner Tree Construction. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(4):517–526, 2004a. doi: 10.1109/TCAD.2004.825864. Charles J. Alpert, Miloš Hrkić, and Stephen T. Quay. A Fast Algorithm for Identifying Good Buffer Insertion Candidate Locations. In Proceedings of the 141 B Bibliography 2004 International Symposium on Physical design, ISPD ’04, pages 47–52, New York, NY, USA, 2004b. ACM. doi: 10.1145/981066.981076. Charles J. Alpert, Dinesh P. Mehta, and Sachin S. Sapatnekar, editors. Handbook of Algorithms for Physical Design Automation. Auerbach Publications, Boston, MA, USA, 1st edition, 2008. ISBN 9780849372421. Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen. Efficient Generation of Short and Fast Repeater Tree Topologies. In Proceedings of the 2006 International Symposium on Physical Design, ISPD ’06, pages 120–127, New York, NY, USA, 2006. ACM. doi: 10.1145/1123008.1123032. 
Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen. Efficient algorithms for short and fast repeater trees. I. Topology generation. Technical Report No. 07977, Research Institute for Discrete Mathematics, University of Bonn, 2007a. Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen. Efficient algorithms for short and fast repeater trees. II. Buffering. Technical Report No. 07978, Research Institute for Discrete Mathematics, University of Bonn, 2007b. Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen. Fast Buffering for Optimizing Worst Slack and Resource Consumption in Repeater Trees. In Proceedings of the 2009 International Symposium on Physical Design, ISPD ’09, pages 43–50, New York, NY, USA, 2009. ACM. doi: 10.1145/1514932.1514942. Christoph Bartoschek, Stephan Held, Jens Maßberg, Dieter Rautenbach, and Jens Vygen. The repeater tree construction problem. Information Processing Letters, 110(24):1079–1083, 2010. doi: 10.1016/j.ipl.2010.08.016. Chung-Ping Chen and Noel Menezes. Noise-aware Repeater Isertion and Wire Sizing for On-chip Interconnect Using Hierarchical Moment-Matching. In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, DAC ’99, pages 502–506, 1999. doi: 10.1145/309847.309987. Jason Cong and Xin Yuan. Routing Tee Construction Under Fixed Buffer Locations. In Proceedings of the 37th Annual Design Automation Conference, DAC ’00, pages 379–384, New York, NY, USA, 2000. ACM. doi: 10.1145/337292.337502. Jason Cong, Lei He, Cheng-Kok Koh, and Patrick H. Madden. Performance Optimization of Vlsi Interconnect Layout. Integration, the VLSI Journal, 21:1–94, 1996. doi: 10.1016/S0167-9260(96)00008-9. Sampath Dechu, Cien Shen, and Chris Chu. An Efficient Routing Tree Construction Algorithm With Buffer Insertion, Wire Sizing, and Obstacle Considerations. IEEE 142 B Bibliography Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24 (4):600–608, 2005. 
doi: 10.1109/TCAD.2005.844107. William C. Elmore. The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers. Journal of Applied Physics, 19(1): 55–63, 1948. doi: 10.1063/1.1697872. Delbert R. Fulkerson. A Network Flow Computation for Project Cost Curves. Management Science, 7(2):167–178, 1961. Michael R. Garey and David S. Johnson. The Rectilinear Steiner Tree Problem is NP-Complete. SIAM Journal on Applied Mathematics, 32(4):826–834, 1977. doi: 10.1137/0132071. Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979. ISBN 0716710447. Michael Gester, Dirk Müller, Tim Nieberg, Christian Panten, Christian Schulte, and Jens Vygen. BonnRoute: Algorithms and data structures for fast and good VLSI routing. Technical Report No. 111039, Research Institute for Discrete Mathematics, University of Bonn, 2011. Nir Halman, Chung-Lun Li, and David Simchi-Levi. Fully polynomial time approximation schemes for time-cost tradeoff problems in series-parallel project networks. In Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop, RANDOM 2008 on Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques, APPROX ’08 / RANDOM ’08, pages 91–103, Berlin, Heidelberg, 2008. Springer-Verlag. doi: 10.1007/978-3-540-85363-3_8. Stephan Held. Timing Closure in Chip Design. PhD thesis, University of Bonn, 2008. Stephan Held and Daniel Rotter. Shallow-Light Steiner Arborescences with Vertex Delays. In IPCO, pages 229–241, 2013. doi: 10.1007/978-3-642-36694-9_20. Stephan Held and Sophie Theresa Spirkl. A Fast Algorithm for Rectilinear Steiner Trees with Length Restrictions on Obstacles. In Proceedings of the 2014 International Symposium on Physical Design, ISPD ’14, pages 37–44, New York, NY, USA, 2014. ACM. doi: 10.1145/2560519.2560529. 
Stephan Held, Bernhard Korte, Dieter Rautenbach, and Jens Vygen. Combinatorial Optimization in VLSI Design. In Vasek Chvátal, editor, Combinatorial Optimization: Methods and Applications, volume 31 of NATO Science for Peace and Security Series - D: Information and Communication Security, pages 33–96. IOS Press, 2011. doi: 10.3233/978-1-60750-718-5-33. 143 B Bibliography Renato F. Hentschke, Jagannathan Narasimham, Marcelo O. Johann, and Ricardo L. Reis. Maze Routing Steiner Trees with Effective Critical Sink Optimization. In Proceedings of the 2007 International Symposium on Physical Design, ISPD ’07, pages 135–142, New York, NY, USA, 2007. ACM. doi: 10.1145/1231996.1232024. Robert B. Hitchcock, Gordon L. Smith, and David D. Cheng. Timing Analysis of Computer Hardware. IBM Journal of Research and Development, 26(1):100–105, 1982. doi: 10.1147/rd.261.0100. Miloš Hrkić and John Lillis. S-Tree: A Technique for Buffered Routing Tree Synthesis. In Proceedings of the 36th ACM/IEEE Annual Design Automation Conference, pages 578–583, 2002. doi: 10.1145/513918.514066. Miloš Hrkić and John Lillis. Buffer Tree Synthesis With Consideration of Temporal Locality, Sink Polarity Requirements, Solution Cost, Congestion, and Blockages. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(4):481–491, 2003. doi: 10.1109/TCAD.2003.809648. Jiang Hu, Charles J. Alpert, Stephen T. Quay, and Gopal Gandham. Buffer Insertion With Adaptive Blockage Avoidance. IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, 22(4):492–498, 2003. doi: 10.1109/TCAD.2003.809647. Shiyan Hu, Charles J. Alpert, Jiang Hu, S.K. Karandikar, Zhuo Li, Weiping Shi, and C.N. Sze. Fast Algorithms for Slew-Constrained Minimum Cost Buffering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(11):2009–2022, 2007. doi: 10.1109/TCAD.2007.906477. Shiyan Hu, Zhuo Li, and Charles J. Alpert. 
A fully polynomial time approximation scheme for timing driven minimum cost buffer insertion. In Proceedings of the 46th Annual Design Automation Conference, DAC ’09, pages 424–429, New York, NY, USA, 2009. ACM. doi: 10.1145/1629911.1630026.
Tao Huang and Evangeline F. Y. Young. Construction of rectilinear Steiner minimum trees with slew constraints over obstacles. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design 2012, ICCAD ’12, pages 144–151, New York, NY, USA, 2012. ACM. doi: 10.1145/2429384.2429411.
Frank Hwang. On Steiner minimal trees with rectilinear distance. SIAM Journal on Applied Mathematics, 30(1):104–114, 1976. doi: 10.1137/0130013.
Maxim Janzen. Buffer Aware Global Routing im Chip Design. Diploma thesis, University of Bonn, 2012.
James E. Kelley, Jr. Critical Path Planning and Scheduling: Mathematical Basis. Operations Research, 9:296–320, 1961.
James E. Kelley, Jr. and Morgan R. Walker. Critical-path planning and scheduling. In Papers presented at the December 1–3, 1959, eastern joint IRE-AIEE-ACM computer conference, IRE-AIEE-ACM ’59 (Eastern), pages 160–173, New York, NY, USA, 1959. ACM. doi: 10.1145/1460299.1460318.
Bernhard Korte and Jens Vygen. Combinatorial Optimization: Theory and Algorithms. Springer Publishing Company, Incorporated, 5th edition, 2012. ISBN 9783642244872.
Bernhard Korte, Dieter Rautenbach, and Jens Vygen. BonnTools: Mathematical Innovation for Layout and Timing Closure of Systems on a Chip. Proceedings of the IEEE, 95(3):555–572, 2007. doi: 10.1109/JPROC.2006.889373.
Leon G. Kraft, Jr. A Device for Quantizing, Grouping, and Coding Amplitude-Modulated Pulses. Master’s thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, Massachusetts, 1949.
Eugene L. Lawler. Combinatorial Optimization: Networks and Matroids. Dover Books on Mathematics Series. Dover Publications, 2001. ISBN 9780486414539.
Zhuo Li and Weiping Shi.
An O(bn²) Time Algorithm for Optimal Buffer Insertion With b Buffer Types. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(3):484–489, 2006. doi: 10.1109/TCAD.2005.854631.
Zhuo Li, Ying Zhou, and Weiping Shi. O(mn) Time Algorithm for Optimal Buffer Insertion of Nets With m Sinks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 31(3):437–441, 2012. doi: 10.1109/TCAD.2011.2174639.
John Lillis, Chung-Kuan Cheng, and Ting-Ting Y. Lin. Optimal Wire Sizing and Buffer Insertion for Low Power and a Generalized Delay Model. IEEE Journal of Solid-State Circuits, 31(3):437–447, 1996a. doi: 10.1109/4.494206.
John Lillis, Chung-Kuan Cheng, Ting-Ting Y. Lin, and Ching-Yen Ho. New Performance Driven Routing Techniques With Explicit Area/Delay Tradeoff and Simultaneous Wire Sizing. In Proceedings of the 33rd Annual Design Automation Conference, DAC ’96, pages 395–400, New York, NY, USA, 1996b. ACM. doi: 10.1145/240518.240594.
Jens Maßberg and Jens Vygen. Approximation algorithms for a facility location problem with service capacities. ACM Transactions on Algorithms, 4(4):50:1–50:15, 2008. doi: 10.1145/1383369.1383381.
Dirk Müller. Fast Resource Sharing in VLSI Routing. PhD thesis, University of Bonn, 2009.
Dirk Müller, Klaus Radke, and Jens Vygen. Faster min-max resource sharing in theory and practice. Mathematical Programming Computation, 3:1–35, 2011. doi: 10.1007/s12532-011-0023-y.
Matthias Müller-Hannemann and Ute Zimmermann. Slack Optimization of Timing-Critical Nets. In Algorithms – ESA 2003, volume 2832 of Lecture Notes in Computer Science, pages 727–739. Springer Berlin Heidelberg, 2003. doi: 10.1007/978-3-540-39658-1_65.
Takumi Okamoto and Jason Cong. Buffered Steiner Tree Construction with Wire Sizing for Interconnect Layout Optimization. In Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’96, pages 44–49, Washington, DC, USA, 1996.
IEEE Computer Society. doi: 10.1109/ICCAD.1996.568938.
Carlos A. S. Oliveira and Panos M. Pardalos. A Survey of Combinatorial Optimization Problems in Multicast Routing. Computers & Operations Research, 32(8):1953–1981, 2005. doi: 10.1016/j.cor.2003.12.007.
James B. Orlin. A Faster Strongly Polynomial Minimum Cost Flow Algorithm. Operations Research, 41(2):338–350, 1993. doi: 10.1287/opre.41.2.338.
Min Pan, Chris Chu, and Priyadarshan Patra. A novel performance-driven topology design algorithm. In Proceedings of the 2007 Asia and South Pacific Design Automation Conference, ASP-DAC ’07, pages 244–249, Washington, DC, USA, 2007. IEEE Computer Society. doi: 10.1109/ASPDAC.2007.357993.
Lawrence Pileggi. Coping with RC(L) Interconnect Design Headaches. In Proceedings of the 1995 IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’95, pages 246–253, Washington, DC, USA, 1995. IEEE Computer Society. doi: 10.1109/ICCAD.1995.480019.
Jorge Rubinstein, Paul Penfield, Jr., and Mark A. Horowitz. Signal Delay in RC Tree Networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2(3):202–211, 1983. doi: 10.1109/TCAD.1983.1270037.
Sachin S. Sapatnekar. Timing. Kluwer, 2004. ISBN 9781402076718.
Prashant Saxena, Noel Menezes, Pasquale Cocchini, and Desmond A. Kirkpatrick. The Scaling Challenge: Can Correct-by-Construction Design Help? In Proceedings of the 2003 International Symposium on Physical Design, ISPD ’03, pages 51–58, New York, NY, USA, 2003. ACM. doi: 10.1145/640000.640014.
Weiping Shi and Zhuo Li. A Fast Algorithm for Optimal Buffer Insertion. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(6):879–891, 2005. doi: 10.1109/TCAD.2005.847942.
Weiping Shi, Zhuo Li, and Charles J. Alpert. Complexity Analysis and Speedup Techniques for Optimal Buffer Insertion with Minimum Cost. In Proceedings of the 2004 Asia and South Pacific Design Automation Conference, ASP-DAC ’04, pages 609–614, 2004.
doi: 10.1109/ASPDAC.2004.1337664.
Martin Skutella. Approximation Algorithms for the Discrete Time-Cost Tradeoff Problem. Mathematics of Operations Research, 23:909–929, 1998.
Lukas P. P. P. van Ginneken. Buffer Placement in Distributed RC-Tree Networks for Minimal Elmore Delay. In Proceedings of the 1990 IEEE International Symposium on Circuits and Systems, volume 2, pages 865–868, 1990. doi: 10.1109/ISCAS.1990.112223.
Jens Vygen. Slack in Static Timing Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(9):1876–1885, 2006. doi: 10.1109/TCAD.2005.858348.
Jürgen Werber. Logic Restructuring for Timing Optimization in VLSI Design. PhD thesis, University of Bonn, 2007.
Jürgen Werber, Dieter Rautenbach, and Christian Szegedy. Timing Optimization by Restructuring Long Combinatorial Paths. In Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’07, pages 536–543, Piscataway, NJ, USA, 2007. IEEE Press. doi: 10.1109/ICCAD.2007.4397320.
Yilin Zhang and David Z. Pan. Timing-driven, over-the-block rectilinear Steiner tree construction with pre-buffering and slew constraints. In Proceedings of the 2014 International Symposium on Physical Design, ISPD ’14, pages 29–36, New York, NY, USA, 2014. ACM. doi: 10.1145/2560519.2560533.
Yilin Zhang, Ashutosh Chakraborty, Salim Chowdhury, and David Z. Pan. Reclaiming Over-the-IP-Block Routing Resources With Buffering-Aware Rectilinear Steiner Minimum Tree Construction. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design 2012, ICCAD ’12, pages 137–143, New York, NY, USA, 2012. ACM. doi: 10.1145/2429384.2429410.
