Department of Electrical and Information Technology, Faculty of Engineering, LTH

Master of Science Thesis

Hardware Implementation of the Logarithm Function using Improved Parabolic Synthesis

Author: Jingou Lai
Supervisors: Peter Nilsson, Erik Hertz, Rakesh Gangarajaiah

Lund, September 2, 2013

The Department of Electrical and Information Technology
Lund University
Box 118, S-221 00 LUND
SWEDEN

This thesis is set in Computer Modern 11pt, with the LaTeX Documentation System

© Jingou Lai 2013

Abstract

This thesis presents a design that approximates the fractional part of the base-two logarithm function using Improved Parabolic Synthesis, together with its CMOS VLSI implementation. Improved Parabolic Synthesis is a novel methodology for implementing unary functions, e.g. trigonometric functions, logarithms, and square roots, in hardware. It evolved from Parabolic Synthesis by combining it with Second-Degree Interpolation. The design explores a simple, parallel architecture for fast timing and optimizes the wordlengths of the computing stages to keep the design small. The error behavior of the design is described and characterized to meet the desired error metrics. The implementation is compared to other approaches, e.g. Parabolic Synthesis and CORDIC, using 65 nm standard cell libraries, and it is shown to perform better in terms of smaller chip area, lower dynamic power, and a shorter critical path.

Acknowledgments

This thesis would not have been possible without the support of many people. I would like to express my deepest gratitude to my supervisors, Peter Nilsson, Erik Hertz, and Rakesh Gangarajaiah, who guided me through this thesis work with patience and inspiration. My special thanks go to my family and dearest friends, who always have faith in me and encourage me. Finally, my appreciation extends to my SoC classmates for all their bright ideas and generous help.
Jingou Lai
Lund, September 2013

Contents

Abstract
List of Tables
List of Figures
List of Acronyms
1 Introduction
1.1 Thesis Outlines
2 Parabolic Synthesis
2.1 First Subfunction
2.2 Second Subfunction
2.3 Subfunction, $s_n(x)$, for n > 2
3 Improved Parabolic Synthesis
3.1 First Subfunction
3.2 Second Subfunction
3.2.1 Second Degree Interpolation
3.3 Develop two subfunctions concurrently
4 Hardware Architecture
4.1 Preprocessing
4.2 Processing
4.2.1 Architecture of Parabolic Synthesis
4.2.2 Architecture of Improved Parabolic Synthesis
4.2.3 Floating-Point Operations
4.2.4 Algorithm for squarer
4.2.5 Truncation and Optimization
4.3 Postprocessing
5 Error Analysis
5.1 Error Behavior Metrics
5.1.1 Maximum/Minimum Error
5.1.2 Median Error
5.1.3 Mean Error
5.1.4 Standard Deviation
5.1.5 RMS (Root Mean Square)
5.2 Error Distribution
6 Implementation of Logarithm
6.1 Development of Subfunctions
6.1.1 Development of $c_1$
6.1.2 First Subfunction
6.1.3 Second Subfunction
6.2 Hardware Architecture
6.2.1 Preprocessing
6.2.2 Processing
6.2.3 Postprocessing
6.3 Error Behavior
7 Implementation Results
7.1 Area Information
7.2 Timing Information
7.3 Power and Energy Estimation
7.3.1 Power analysis
7.4 Physical Design
8 Conclusion
8.1 Comparison
8.2 Future Work
A Logarithm Implementation Results
A.1 Area
A.2 Timing
A.3 Power Estimation for LPHVT

List of Tables

6.1 The optimized 8 coefficients $l_{2,i}$ in the second subfunction, $s_2(x)$
6.2 The optimized 8 coefficients $j_{2,i}$ in the second subfunction, $s_2(x)$
6.3 The optimized 8 coefficients $c_{2,i}$ in the second subfunction, $s_2(x)$
6.4 The error metrics of the truncated and optimized implementation of the logarithm
7.1 ASIC synthesis result for the Improved Parabolic Synthesis when the least-area constraint is applied
7.2 ASIC synthesis area for the 3 methods
7.3 ASIC synthesis timing result for the fastest constraint
7.4 ASIC synthesis timing results for two methods
A.1 ASIC synthesis area results using LPLVT
A.2 ASIC synthesis area results using LPHVT
A.3 ASIC synthesis area results using GPSVT
A.4 ASIC synthesis timing results in LPLVT
A.5 ASIC synthesis timing results in LPHVT
A.6 ASIC synthesis timing results in GPSVT
A.7 PrimeTime power estimation for the CORDIC method using the LPHVT library
A.8 PrimeTime power estimation for the Parabolic Synthesis method using the LPHVT library
A.9 PrimeTime power estimation for the Improved Parabolic Synthesis method using the LPHVT library

List of Figures

2.1 An example of two original functions, $f_{org1}(x)$ and $f_{org2}(x)$, compared with a straight line, $y = x$
2.2 Two help functions resulting from the quotient between the original functions, $f_{org1}(x)$ and $f_{org2}(x)$ in Fig. 2.1, and a straight line, $y = x$
2.3 An example of the first help function, $f_1(x)$, shown to be a strictly convex curve
2.4 An example of the second subfunction, $s_2(x)$, compared with the first help function, $f_1(x)$
2.5 An example of the second help function, $f_2(x)$, a pair of opposite concave and convex functions
3.1 The first help function, $f_1(x)$, divided into 4 intervals
3.2 The bit precision as a function of $c_1$ with 1 interval in the second subfunction, $s_2(x)$
3.3 The bit precision depending on the value of $c_1$ with 1, 2, 4, 8, 16, 32, and 64 intervals in the second subfunction, $s_2(x)$
4.1 Three-stage hardware architecture for both Parabolic Synthesis and Improved Parabolic Synthesis, shown in hierarchical view
4.2 The parallel hardware architecture for Parabolic Synthesis
4.3 Processing stage architecture for Parabolic Synthesis with 4 subfunctions
4.4 The processing stage architecture for Improved Parabolic Synthesis
4.5 The algorithm of the optimized architecture of the squarer unit, in which reduced logical operations are used to calculate the partial products
5.1 The error functions before and after truncation
5.2 The error functions before and after truncation and optimization
5.3 The error histogram of the error function in Fig. 5.2
6.1 Normalization of the binary logarithm for x in the range from 1 to 2; the dashed line is the function before normalization and the solid line is the original function, $f_{org}(x)$
6.2 The output precision as a function of the coefficient $c_1$ in the first subfunction, $s_1(x)$, with 1, 2, 4, 8, 16, 32, and 64 intervals in the second subfunction, $s_2(x)$
6.3 Hardware architecture of the logarithm in hierarchical view
6.4 Hardware architecture of the logarithm in the processing stage
6.5 The error function before and after truncation and optimization
6.6 The absolute error function of Fig. 6.5 expressed in dB
6.7 The error histogram before the optimization of the coefficients of the second subfunction, $s_2(x)$, and of the wordlengths between operations
7.1 Power estimation for the 3 designs at different frequencies
7.2 The GDSII result for the binary logarithm realized by the Improved Parabolic Synthesis

List of Acronyms

ASIC: Application Specific Integrated Circuit
CORDIC: COordinate Rotation DIgital Computer
DSP: Digital Signal Processor or Digital Signal Processing
EDA: Electronic Design Automation
FPGA: Field Programmable Gate Array
GDS: Graphic Database System
LUT: Look-Up Table
MOS: Metal Oxide Semiconductor
RMS: Root Mean Square
SDF: Standard Delay Format
SPEF: Standard Parasitic Exchange Format
VCD: Value Change Dump
VHDL: Very High Speed Integrated Circuit Hardware Description Language
VLSI: Very Large Scale Integration
VT: Threshold Voltage

Chapter 1 Introduction

The binary logarithm is widely applied in fields such as graphics processing and communication systems. It can replace computationally complex operations, e.g. multiplication and division [1]. For high-speed processing, relying on software solutions alone is not efficient. Instead, owing to the rapidly decreasing feature size of Metal Oxide Semiconductor (MOS) transistors, realizing the function in Very Large Scale Integration (VLSI) has become feasible.

Among the methodologies for computing a binary logarithm in hardware, a rudimentary implementation simply employs a direct Look-Up Table (LUT) [2] [3]. However, the table size becomes problematic both for a large number of input patterns and for high precision. An alternative method, polynomial approximation, can reduce the design size to some extent, but far from enough, because its architecture is computationally intensive [4].
To address these problems, the COordinate Rotation DIgital Computer (CORDIC), an algorithm that offers a time-multiplexed architecture, is often used [5] [6]. However, it suffers from a long processing delay due to its iterative nature. Moreover, regarding error behavior, since CORDIC determines the output bit by bit, its error statistics are unbalanced around zero [7].

An innovative methodology, Parabolic Synthesis, recently proposed by Erik Hertz and Peter Nilsson, suggests new hardware architectures to approximate unary functions, e.g. sine, logarithm, exponential, and square root [8] [9] [10]. The methodology develops several parabolic functions, called subfunctions, and recombines them to approximate each normalized function, called the original function. The idea exploits parallelism to achieve high speed and uses simple hardware for each subfunction to reduce the overall area.

An evolved version of this methodology, Improved Parabolic Synthesis, also proposed by Erik Hertz and Peter Nilsson, combines Parabolic Synthesis with second-degree interpolation and employs only two subfunctions to approximate the original function. The accuracy depends both on the first subfunction and on the number of intervals in the second subfunction. Thereby, the first subfunction can be chosen for low hardware complexity, while the second subfunction is developed in each subinterval. The error behavior is also an important factor in characterizing the design: the desired error is symmetric and concentrated around 0. The coefficients of the second subfunction are optimized to shape the error behavior.

The fractional part of the binary logarithm function has previously been approximated using Parabolic Synthesis by Peyman P. The results were compared to CORDIC on a Field Programmable Gate Array (FPGA) and an Application Specific Integrated Circuit (ASIC), and the Parabolic Synthesis design proved to be faster and to consume less power than CORDIC [11].
In this thesis, a binary logarithm that computes an output with 15-bit precision from inputs with a 14-bit mantissa in the range from 1 to 2 is implemented using Improved Parabolic Synthesis. The result is simpler hardware, thanks to the advantages of Improved Parabolic Synthesis and the optimized wordlength in each stage. In addition, refined coefficients give a characterized error behavior that is symmetric and concentrated around zero. Both the Parabolic Synthesis and the Improved Parabolic Synthesis implementations consist of three hierarchical stages. In the processing stage of the logarithm implementation, the squarer unit is implemented using a hardware-efficient algorithm [12]. The design is written in Very High Speed Integrated Circuit Hardware Description Language (VHDL) and implemented in 65 nm CMOS technology with low-power low-VT, low-power high-VT, and general-purpose standard-VT cell libraries at a 1.2 V supply voltage. Area and timing results from synthesis and place and route (P&R) are reported, and power and energy are estimated at different operating frequencies. These results are compared to Parabolic Synthesis and CORDIC.

1.1 Thesis Outlines

The remaining chapters are outlined below:

Chapter 2 introduces the Parabolic Synthesis theory and its methods for developing subfunctions.
Chapter 3 explains the Improved Parabolic Synthesis theory and its methods for developing subfunctions.
Chapter 4 introduces the overall hardware architecture and the specific architectures of Parabolic Synthesis and Improved Parabolic Synthesis.
Chapter 5 explains the error behavior analysis and its metrics with examples.
Chapter 6 describes the implementation of the binary logarithm using Improved Parabolic Synthesis.
Chapter 7 lists and analyzes the implementation results, including area, timing, and power, as well as the results from physical design.
Chapter 8 concludes with the advantages of Improved Parabolic Synthesis in comparison with earlier methods and discusses future improvements.
Appendix A lists the area and timing results using three different cell libraries: 1.2 V LPHVT, 1.2 V LPLVT, and 1.2 V GPSVT. The power estimation results using the 1.2 V LPHVT cell library are shown as well.

Chapter 2 Parabolic Synthesis

The Parabolic Synthesis methodology is based on calculating several second-order functions, called subfunctions, denoted $s_1(x), s_2(x), \dots, s_n(x)$, and recombining them to approximate the original function, $f_{org}(x)$, as defined in (2.1). The original function, $f_{org}(x)$, is the target function to be approximated. Notice that as the number of subfunctions goes to infinity, the product of all the subfunctions equals the original function.

$$f_{org}(x) = s_1(x) \cdot s_2(x) \cdot s_3(x) \cdots s_\infty(x) \tag{2.1}$$

To aid the development of the subfunctions, help functions are used. As shown in (2.2), the first help function, $f_1(x)$, is defined as the quotient of the original function, $f_{org}(x)$, and the first subfunction, $s_1(x)$, since $s_1(x)$ is designed to approximate $f_{org}(x)$.

$$f_1(x) = \frac{f_{org}(x)}{s_1(x)} = s_2(x) \cdot s_3(x) \cdots s_\infty(x) \tag{2.2}$$

The second subfunction, $s_2(x)$, is designed to approximate the first help function, $f_1(x)$, which makes the overall error smaller. This results in a second help function, defined recursively in (2.3).

$$f_2(x) = \frac{f_1(x)}{s_2(x)} = s_3(x) \cdot s_4(x) \cdots s_\infty(x) \tag{2.3}$$

In general, for $n \ge 2$, the $n$th subfunction, $s_n(x)$, is designed to approximate the $(n-1)$th help function, $f_{n-1}(x)$, and the $n$th help function is defined in (2.4).

$$f_n(x) = \frac{f_{n-1}(x)}{s_n(x)} = s_{n+1}(x) \cdot s_{n+2}(x) \cdots s_\infty(x) \tag{2.4}$$

As $n$ increases, the deviation of the help function, $f_n(x)$, from 1 decreases. Hence, a larger number of subfunctions results in higher accuracy of the output.

2.1 First Subfunction

The original function, $f_{org}(x)$, must pass through the two points (0,0) and (1,1), as shown in Fig.
2.1, where a convex curve, $f_{org1}(x)$, and a concave curve, $f_{org2}(x)$, are shown.

[Figure: $f_{org1}(x)$, $f_{org2}(x)$, and $y = x$ plotted as $f_{org}(x)$ versus $x$ on [0, 1]]
Fig. 2.1: An example of two original functions, $f_{org1}(x)$ and $f_{org2}(x)$, compared with a straight line, $y = x$.

The first subfunction, $s_1(x)$, a second-order function, is defined in (2.5).

$$s_1(x) = l_1 + k_1 x + c_1 (x - x^2) \tag{2.5}$$

Since the first subfunction, $s_1(x)$, passes through (0,0), the constant term $l_1$ in (2.5) equals 0. Since the start point is (0,0) and the end point is (1,1), the linear coefficient $k_1$ equals 1. The first subfunction is thereby reduced to (2.6).

$$s_1(x) = x + c_1 (x - x^2) \tag{2.6}$$

To develop the first subfunction, $s_1(x)$, the original function, $f_{org}(x)$, is first divided by the straight line $f(x) = x$. The help function, $f_{help}(x)$, is therefore defined as

$$f_{help}(x) = \frac{f_{org}(x)}{x} \tag{2.7}$$

Applying (2.7) to the two original functions, $f_{org1}(x)$ and $f_{org2}(x)$, in Fig. 2.1 gives the two curves in Fig. 2.2.

[Figure: $f_{org}(x)/x$ versus $x$ on [0, 1], one curve with $1 < f_{org}(x)/x < 2$ and one with $0 < f_{org}(x)/x < 1$]
Fig. 2.2: Two help functions resulting from the quotient between the original functions, $f_{org1}(x)$ and $f_{org2}(x)$ in Fig. 2.1, and a straight line, $y = x$.

Additional criteria on the original function, $f_{org}(x)$, when developing $c_1$ are:

1. The original function, $f_{org}(x)$, must be strictly convex or strictly concave; it cannot be both convex and concave.
2. The function $f_{org}(x)/x$ must have a limit value as $x$ goes to 0.
3. The limit value in criterion 2, after subtracting 1, must lie between -1 and 1.

Dividing the first subfunction (2.6) by $x$ gives $s_1(x)/x = 1 + c_1(1 - x)$, which starts at $1 + c_1$ for $x = 0$ and ends at 1 for $x = 1$. Matching this start value to the limit of $f_{help}(x)$ at $x = 0$, the coefficient $c_1$ in the first subfunction, $s_1(x)$, is therefore
The coefficient c1 in the first subfunction, s1 (x) is therefore 8 Parabolic Synthesis defined as in (2.8). c1 = lim x→0 2.2 forg (x) −1 x (2.8) Second Subfunction The second subfunction, s2 (x), is a second order function, developed to approximate the first help function, f1 (x). The first help function, f1 (x), is strictly concave or convex. As the example, shown in Fig. 2.3, the first help function, f1 (x), derived from the upper original function in Fig. 2.1 using (2.2). 1.1 f1(x) 1.08 1.06 1.04 1.02 1 0 0.2 0.4 x 0.6 0.8 1 Fig. 2.3: An example of the first help function, f1 (x), shown to be a strict convex curve. The second subfunction, s2 (x), is defined in (2.9). s2 (x) = l2 + k2 x + c2 (x − x2 ) (2.9) Since the second subfunction, s2 (x), starts from the point (0,1) and ends at the point (1,1), the constant part in (2.9), l2 , is calculated as 1. The linear part in (2.9), k2 , is the gradient from (0,1) to (1,1) and therefore equals to 0. In (2.10), the reduced second subfunction, s2 (x), is shown. s2 (x) = 1 + c2 (x − x2 ) (2.10) 2.3 Subfunction, sn (x), for n > 2 1.1 9 s2(x) fhelp(x) 1.08 1.06 1.04 1.02 10 0.2 0.4 x 0.6 0.8 1 Fig. 2.4: An example of the second subfunction, s2 (x), compared with first help function, f1 (x). The desired second subfunctions, s2 (x), needs to cross the same points, including the start point, middle point, and end point, as the help function, shown in Fig. 2.4. To calculate c2 in (2.9), the middle point of the first help function, f1 (0.5), is used according to (2.9) c2 = 2.3 f1 (0.5) − 0.5 · k2 − l2 = 4 · (f1 (0.5) − 1) 0.25 (2.11) Subfunction, sn (x), for n > 2 When further developing subfunctions, sn (x), for n > 2, the same methodology is applied, which is designed to approach the help function, fn−1 (x), as stated in (2.2) and (2.4). However, all the help functions, from f2 (x) to fn (x) for n > 1, are no more strictly convex or concave from 0 to 1 on the x axis. As an example shown in Fig. 
2.5, the second help function, $f_2(x)$, consists of a pair of convex and concave functions: the first in the interval $0 \le x < 0.5$ and the second in the interval $0.5 \le x \le 1$. The second help function, $f_2(x)$, is expressed in (2.12).

$$f_2(x) = \begin{cases} f_{2,0}(x) & 0 \le x < \frac{1}{2} \\ f_{2,1}(x) & \frac{1}{2} \le x \le 1 \end{cases} \tag{2.12}$$

To approximate the second help function, $f_2(x)$, which is composed of two parabola-like curves, each curve is normalized to the range 0 to 1 on the $x$ axis. Notice that $x$ is substituted with $x'$ in order to map the input $x$ onto the normalized curve, according to (2.13).

[Figure: $f_2(x)$ versus $x$ on [0, 1]]
Fig. 2.5: An example of the second help function, $f_2(x)$, a pair of opposite concave and convex functions.

$$x' = \mathrm{fract}(2 \cdot x) \tag{2.13}$$

Each of the curves is approximated using the method of Section 2.2 illustrated in Fig. 2.4. For the third subfunction, $s_3(x)$, $s_{3,0}(x')$ is calculated when $0 \le x < \frac{1}{2}$, and $s_{3,1}(x')$ when $\frac{1}{2} \le x \le 1$, as expressed in (2.14).

$$s_3(x) = \begin{cases} s_{3,0}(x') & 0 \le x < \frac{1}{2} \\ s_{3,1}(x') & \frac{1}{2} \le x \le 1 \end{cases} \tag{2.14}$$

In general, the $n$th help function, $f_n(x)$, for $n > 1$, consists of pairs of concave and convex functions, as defined in (2.15). A larger $n$ results in a higher number of pairs.

$$f_n(x) = \begin{cases} f_{n,0}(x) & 0 \le x < \frac{1}{2^{n-1}} \\ f_{n,1}(x) & \frac{1}{2^{n-1}} \le x < \frac{2}{2^{n-1}} \\ \quad \vdots & \\ f_{n,2^{n-1}-1}(x) & \frac{2^{n-1}-1}{2^{n-1}} \le x \le 1 \end{cases} \tag{2.15}$$

Consequently, the $n$th subfunction, $s_n(x)$, for $n > 2$, is defined as shown in (2.16).

$$s_n(x) = \begin{cases} s_{n,0}(x_n) & 0 \le x < \frac{1}{2^{n-2}} \\ s_{n,1}(x_n) & \frac{1}{2^{n-2}} \le x < \frac{2}{2^{n-2}} \\ \quad \vdots & \\ s_{n,2^{n-2}-1}(x_n) & \frac{2^{n-2}-1}{2^{n-2}} \le x \le 1 \end{cases} \tag{2.16}$$

Similarly, to map the input onto the normalized curve, the input $x$ is substituted by $x_n$, the fractional part of the product of $x$ and $2^{n-2}$, as shown in (2.17).

$$x_n = \mathrm{fract}(2^{n-2} x) \tag{2.17}$$

Each parabolic curve in the $n$th subfunction, $s_n(x)$, is denoted $s_{n,m}(x_n)$.
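To make Sections 2.1 to 2.3 concrete, the following Python sketch develops the first three subfunctions for an assumed example original function, $f_{org}(x) = \sin(\pi x / 2)$, which passes through (0,0) and (1,1). It takes $c_1$ from (2.8), $c_2$ from (2.11), and builds $s_3(x)$ from two normalized segments following (2.13) and (2.14), fitting each segment through its midpoint in the same way as $c_2$. The example function, the sampling grid, and all names are illustrative assumptions, not the thesis implementation.

```python
import math

# Assumed example original function: sin(pi*x/2) passes through (0,0) and (1,1).
def f_org(x):
    return math.sin(math.pi / 2 * x)

# (2.8): c1 = lim_{x->0} f_org(x)/x - 1; for sin(pi*x/2) the limit is pi/2.
c1 = math.pi / 2 - 1

def s1(x):
    return x + c1 * (x - x * x)                  # (2.6)

def f1(x):
    # First help function (2.2); at x = 0 the limit equals 1 for this choice of c1.
    return 1.0 if x == 0 else f_org(x) / s1(x)

c2 = 4 * (f1(0.5) - 1)                           # (2.11)

def s2(x):
    return 1 + c2 * (x - x * x)                  # (2.10)

def f2(x):
    return f1(x) / s2(x)                         # (2.3)

# Third subfunction (2.14): one parabola per half, each fitted through the
# midpoint of its segment of f2 in the same way as c2 in (2.11).
c3 = [4 * (f2(0.25) - 1), 4 * (f2(0.75) - 1)]

def s3(x):
    m = min(int(2 * x), 1)                       # segment index; x = 1 stays in segment 1
    xp = 2 * x - m                               # x' = fract(2x) as in (2.13); x = 1 gives x' = 1
    return 1 + c3[m] * (xp - xp * xp)

def max_error(approx, samples=4096):
    return max(abs(approx(i / samples) - f_org(i / samples)) for i in range(samples + 1))

e1 = max_error(s1)
e2 = max_error(lambda x: s1(x) * s2(x))
e3 = max_error(lambda x: s1(x) * s2(x) * s3(x))
print(f"max error with 1, 2, 3 subfunctions: {e1:.2e}, {e2:.2e}, {e3:.2e}")
```

Each added subfunction only has to correct the remaining help function, whose deviation from 1 shrinks, so the printed maximum error should decrease from one subfunction to the next.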
$s_{n,m}(x_n)$ is defined in (2.18),

$$s_{n,m}(x_n) = 1 + c_{n,m} \cdot (x_n - x_n^2) \tag{2.18}$$

where the coefficient $c_{n,m}$ of each $s_{n,m}$ is calculated as in Section 2.2, i.e. from the middle point of the corresponding segment of the help function, as defined in (2.19):

$$c_{n,m} = 4 \cdot \left( f_{n-1,m}\!\left( \frac{2 \cdot (m+1) - 1}{2^{n-1}} \right) - 1 \right) \tag{2.19}$$

Chapter 3 Improved Parabolic Synthesis

The extended method, called Improved Parabolic Synthesis, uses only two subfunctions to approximate the original function, $f_{org}(x)$, as shown in (3.1).

$$f_{org}(x) = s_1(x) \cdot s_2(x) \tag{3.1}$$

The first subfunction, $s_1(x)$, is developed as a parabolic function. The second subfunction, $s_2(x)$, combines parabolic functions with second-degree interpolation. In this methodology, the first subfunction, $s_1(x)$, and the second subfunction, $s_2(x)$, are developed concurrently.

To apply the methodology, the original function, $f_{org}(x)$, needs to pass through (0,0) and (1,1), as shown in Fig. 2.1. Additional constraints on the original function, $f_{org}(x)$, are:

1. The original function, $f_{org}(x)$, must be strictly convex or strictly concave; it cannot be both convex and concave.
2. The function $f_{org}(x)/x$ must have a limit value as $x$ goes to 0.

3.1 First Subfunction

The first subfunction, $s_1(x)$, is developed to approximate the original function, $f_{org}(x)$, similar to Section 2.1, as defined in (2.6). Depending on whether the original function, $f_{org}(x)$, is convex or concave, a suitable $c_1$ in (2.6) is chosen by sweeping it over the range 0 to 1.2 or -1.2 to 0, respectively. For every value of $c_1$, the first subfunction, $s_1(x)$, is developed.

3.2 Second Subfunction

The second subfunction, $s_2(x)$, is developed to approximate the first help function, $f_1(x)$, as in Section 2.2.

3.2.1 Second Degree Interpolation

In Improved Parabolic Synthesis, as shown in Fig. 3.1, the first help function, $f_1(x)$, is divided into a power-of-two number of intervals. To develop the second subfunction, $s_2(x)$, in each interval, the curve segment is normalized.
[Figure: $f_{help}(x)$ versus $x$ on [0, 1], divided into 4 intervals]
Fig. 3.1: The first help function, $f_1(x)$, divided into 4 intervals.

The general equation of the second subfunction in the $i$th interval is shown in (3.2):

$$s_{2,i}(x_\omega) = l_{2,i} + k_{2,i} \cdot x_\omega + c_{2,i} \cdot (x_\omega - x_\omega^2) \tag{3.2}$$

where $i$ is an integer from 0 to $I - 1$, and $I$ is the number of intervals. The second subfunction in each interval, $s_{2,i}(x_\omega)$, is developed to approximate the help function in that interval, $f_{1,i}(x)$, by passing through three points: the start point, the middle point, and the end point of $f_{1,i}(x)$. Over all intervals, the second subfunction, $s_2(x)$, can be expanded as

$$s_2(x) = \begin{cases} s_{2,0}(x_\omega) & 0 \le x < \frac{1}{2^\omega} \\ s_{2,1}(x_\omega) & \frac{1}{2^\omega} \le x < \frac{2}{2^\omega} \\ \quad \vdots & \\ s_{2,I-1}(x_\omega) & \frac{I-1}{2^\omega} \le x \le 1 \end{cases} \tag{3.3}$$

Note that $x$ is substituted by $x_\omega$. The input $x$ is multiplied by $2^\omega$ and its fractional part is taken, which transfers the inputs in each interval to the normalized domain, so that the second subfunction, $s_2(x)$, approximates the first help function, $f_1(x)$, interval by interval:

$$x_\omega = \mathrm{fract}(2^\omega x) \tag{3.4}$$

As shown in (3.5), the number of intervals, $I$, is a power of two. Choosing a radix-2 number of intervals further benefits the hardware architecture, since the interval index is then given directly by the $\omega$ most significant bits of $x$.

$$I = 2^\omega \tag{3.5}$$

Similar to Parabolic Synthesis, the offset of the second subfunction in the $i$th interval, $l_{2,i}$, is simply the value at the start point of that interval:

$$l_{2,i} = f_1(\text{start}, i) \tag{3.6}$$

The gradient of the second subfunction in the $i$th interval, $k_{2,i}$, is the difference between the values at the end point and the start point of that interval, so that $s_{2,i}(1) = l_{2,i} + k_{2,i}$ equals the end value:

$$k_{2,i} = f_1(\text{end}, i) - f_1(\text{start}, i) \tag{3.7}$$

The second-degree coefficient, $c_{2,i}$, is developed with the help of the middle point of each interval, as shown in (3.8).

$$c_{2,i} = 4 \cdot \left( f_{1,i}(0.5) - l_{2,i} - k_{2,i} \cdot 0.5 \right) \tag{3.8}$$

Rewriting (3.2) as $s_{2,i}(x_\omega) = l_{2,i} + (k_{2,i} + c_{2,i}) \cdot x_\omega - c_{2,i} \cdot x_\omega^2$ shows that the two linear terms can be merged. To save an adder in hardware, $j_{2,i}$ is therefore precomputed as in (3.9).
$$j_{2,i} = k_{2,i} + c_{2,i} \tag{3.9}$$

3.3 Develop two subfunctions concurrently

To find the coefficient $c_1$ of the first subfunction, $s_1(x)$, the following steps are performed:

1. Choose a value of the coefficient $c_1$ and develop the first subfunction, $s_1(x)$, according to (2.6).
2. With the first subfunction, $s_1(x)$, fixed, develop the second subfunction, $s_2(x)$, from the help function $f_1(x) = f_{org}(x)/s_1(x)$ implied by (3.1), as described in Section 3.2.
3. With both subfunctions developed, calculate the output precision.

By changing the value of the coefficient $c_1$ and repeating Steps 1 to 3, the output precision is obtained as a function of $c_1$.

In Step 3, the output precision is calculated from the error function, $e(x)$, defined in (3.10):

$$e(x) = \left| s_1(x) \cdot s_2(x) - f_{org}(x) \right| \tag{3.10}$$

It gives the difference between the recombined result, $s_1(x) \cdot s_2(x)$, and the original function, $f_{org}(x)$, and is expressed in decibels (dB). Since one bit of precision corresponds to a factor of 0.5, which is approximately -6 dB, as shown in (3.11),

$$20 \cdot \log_{10}(0.5) \approx 20 \cdot (-0.301) \approx -6\ \text{dB} \tag{3.11}$$

the bit precision is the quotient between the maximum of the error function, $\max\{e(x)\}$, in dB and -6, as shown in (3.12).

$$\text{bit precision} = \frac{20 \cdot \log_{10}(\max\{e(x)\})}{-6} \tag{3.12}$$

In contrast to the Parabolic Synthesis method, the coefficient $c_1$ is developed by considering the overall precision of the output. The preferred $c_1$ is the value that fulfills the precision requirement and results in a simple architecture.

The bit precision is therefore a function of $c_1$, as shown in Fig. 3.2 for an example approximation of the sine function using this approach. Two peaks are found, one around 0.3 and the other around 1.1.
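The three steps above, together with (3.2) to (3.12), can be sketched in Python as a sweep over $c_1$. The sketch assumes a sine example in the spirit of Fig. 3.2, taken here to be $f_{org}(x) = \sin(\pi x / 2)$; because $f_1(x)$ is 0/0 at $x = 0$, the limit of $f_{org}(x)/x$ (criterion 2) is passed in as `limit_at_0`. The error is evaluated on a uniform grid of sample points rather than exhaustively over all input words, and the names `bit_precision` and `sine` and the sampling density are illustrative assumptions; the printed numbers are not claimed to reproduce Fig. 3.2 exactly.

```python
import math

def bit_precision(f_org, limit_at_0, c1, omega, samples=2048):
    """Bit precision (3.12) of s1(x)*s2(x) for a given c1 and I = 2**omega intervals."""
    intervals = 2 ** omega                               # (3.5)

    def s1(x):
        return x + c1 * (x - x * x)                      # (2.6), step 1

    def f1(x):
        # Help function f_org/s1; at x = 0 use lim f_org(x)/x divided by lim s1(x)/x = 1 + c1.
        return limit_at_0 / (1 + c1) if x == 0 else f_org(x) / s1(x)

    # Step 2: one parabola per interval through its start, middle and end points.
    table = []
    for i in range(intervals):
        start, end = i / intervals, (i + 1) / intervals
        l = f1(start)                                    # (3.6)
        k = f1(end) - l                                  # (3.7)
        c = 4 * (f1((start + end) / 2) - l - 0.5 * k)    # (3.8)
        table.append((l, k + c, c))                      # j = k + c, (3.9)

    def s2(x):
        i = min(int(intervals * x), intervals - 1)       # interval index
        xw = intervals * x - i                           # x_omega = fract(2**omega * x), (3.4)
        l, j, c = table[i]
        return l + j * xw - c * xw * xw                  # (3.2) with the merged linear term

    # Step 3: maximum of the error function (3.10), converted to bits with (3.12).
    # (3.12) uses -6 dB per bit; dividing by 20*log10(0.5) instead gives -log2(max_err).
    max_err = max(abs(s1(n / samples) * s2(n / samples) - f_org(n / samples))
                  for n in range(samples + 1))
    return 20 * math.log10(max_err) / -6


def sine(x):
    return math.sin(math.pi / 2 * x)

for omega in (0, 2):
    sweep = [(k / 20, bit_precision(sine, math.pi / 2, k / 20, omega)) for k in range(29)]
    best_c1, best_bits = max(sweep, key=lambda item: item[1])
    print(f"I = {2 ** omega}: best c1 = {best_c1:.2f}, {best_bits:.1f} bits")
```

The sweep covers $c_1$ from 0 to 1.4 in steps of 0.05. In a real design flow, the candidates would then be filtered for hardware-friendly values such as 0, 0.5, or 1.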
[Figure: bit precision versus $c_1$ with 1 interval]
Fig. 3.2: The bit precision as a function of $c_1$ with 1 interval in the second subfunction, $s_2(x)$.

After combining with Second-Degree Interpolation, the bit precision depends not only on the coefficient $c_1$ but also on the number of intervals. As shown in Fig. 3.3, a higher number of intervals gives a higher output precision curve.

[Figure: bit precision versus $c_1$ for 1, 2, 4, 8, 16, 32, and 64 intervals]
Fig. 3.3: The bit precision depending on the value of $c_1$ with 1, 2, 4, 8, 16, 32, and 64 intervals in the second subfunction, $s_2(x)$.

Typical values for $c_1$ are 1, 0.5, and 0, since these values simplify the hardware. With more intervals, the freedom in choosing $c_1$ increases, so one of these typical values can be chosen while the output precision still increases.

Chapter 4 Hardware Architecture

The hardware architecture is shown in Fig. 4.1. It consists of three stages: preprocessing, processing, and postprocessing. The preprocessing and postprocessing stages are transformation stages, while the processing stage approximates the original function, $f_{org}(x)$.

[Figure: block diagram $v$ → Preprocessing → $x$ → Processing → $y$ → Postprocessing → $z$]
Fig. 4.1: Three-stage hardware architecture for both Parabolic Synthesis and Improved Parabolic Synthesis, shown in hierarchical view.

This approach is applicable to both Parabolic Synthesis and Improved Parabolic Synthesis.

4.1 Preprocessing

In the preprocessing stage, the input $v$ is mapped from the input domain to the output $x$ in the interval from 0 to 1. As an example, the preprocessing stage of a sine implementation maps the input $v$ from its domain, 0 to $\frac{\pi}{2}$, to $x$ in the interval from 0 to 1 by multiplying by $\frac{2}{\pi}$.

4.2 Processing

The processing stage maps the input $x$ to the output $y$, which is the approximation of the processing stage function, i.e. the original function, $f_{org}(x)$.
Parabolic Synthesis or Improved Parabolic Synthesis is applied in this stage to approximate the original function, $f_{org}(x)$. As shown in Fig. 4.2, the architecture carries out the calculation of (2.1) and (3.1).

[Figure: $s_1(x)$, $s_2(x)$, $s_3(x)$, $s_4(x)$, ... computed in parallel from $x$ and multiplied to form $y$]
Fig. 4.2: The parallel hardware architecture for Parabolic Synthesis.

4.2.1 Architecture of Parabolic Synthesis

The unrolled architecture of (2.1) with $n = 4$ subfunctions is shown in Fig. 4.3.

[Figure: processing-stage datapath for $s_1(x)$ to $s_4(x)$ with squarer outputs $x^2$, $x_3^2$, $x_4^2$, coefficient tables $c_{3,i}$ and $c_{4,i}$, and output $y_{ps}$]
Fig. 4.3: Processing stage architecture for Parabolic Synthesis with 4 subfunctions.

For the first subfunction, $s_1(x)$, and the second subfunction, $s_2(x)$, $x$ is the direct input. For the third subfunction, $s_3(x)$, $x$ is divided into two parts: an interval part and an input part. The interval part consists of the most significant bits, which select the coefficient $c_{3,i}$, and the remaining bits form the input $x_3$. The fourth subfunction, $s_4(x)$, is computed similarly, except that one more significant bit is used for the interval part and one bit fewer for the input. The squarer unit produces the square $x^2$ and the partial products $x_3^2$ and $x_4^2$ for the subfunctions. Note that in the first subfunction, $s_1(x)$, $k_1$ equals 1 and $l_1$ equals 0, while for the $n$th subfunction, $s_n(x)$, with $n > 1$, $k_n$ equals 0 and $l_n$ equals 1.

4.2.2 Architecture of Improved Parabolic Synthesis

The architecture of (3.1), shown in Fig. 4.4, only computes and combines two subfunctions, $s_1(x)$ and $s_2(x)$.

[Figure: datapath with squarer outputs $x^2$ and $x_\omega^2$, coefficient $c_1$, table of $l_{2,i}$, $j_{2,i}$, $c_{2,i}$ indexed by $i$, and output $y_{ips}$]
Fig. 4.4: The processing stage architecture for Improved Parabolic Synthesis.

In the second subfunction, $s_2(x)$, the most significant bits of the input $x$ are used to select the set of coefficients $l_{2,i}$, $j_{2,i}$, and $c_{2,i}$, where $j_{2,i}$ replaces $k_{2,i}$ as in (3.9).
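As a small illustration of this bit split, the Python sketch below treats $x$ as an unsigned $N$-bit fraction: the $\omega$ most significant bits give the interval index $i$ that addresses the coefficient table, and the remaining $N - \omega$ bits are $x_\omega$ of (3.4) without any arithmetic. The 14-bit width matches the mantissa width mentioned in Chapter 1, while $\omega = 3$, the function names, and the floating-point coefficient table are assumptions for illustration only; the hardware uses truncated fixed-point coefficients.

```python
def split_input(x_word, n_bits, omega):
    """Split an n_bits unsigned fraction x = x_word / 2**n_bits into (i, x_omega word)."""
    low_bits = n_bits - omega
    i = x_word >> low_bits                       # omega MSBs: interval index
    xw_word = x_word & ((1 << low_bits) - 1)     # remaining LSBs: x_omega * 2**low_bits
    return i, xw_word


def s2_from_word(x_word, n_bits, omega, table):
    """Evaluate s2 = l + j*x_w - c*x_w**2 with coefficients selected by the MSBs.

    `table` is a hypothetical list of (l, j, c) tuples, one per interval.
    """
    i, xw_word = split_input(x_word, n_bits, omega)
    xw = xw_word / 2 ** (n_bits - omega)         # x_omega as a real number in [0, 1)
    l, j, c = table[i]
    return l + j * xw - c * xw * xw


# Example: a 14-bit input with I = 2**3 = 8 intervals.
i, xw_word = split_input(0b10110011110001, n_bits=14, omega=3)
print(i, xw_word)   # prints: 5 1265
```

Because $I$ is a power of two, selecting the interval and normalizing the input cost only wiring, which is the hardware benefit of the radix-2 interval count noted in Section 3.2.1.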
The number of most significant bits used to select the coefficients, ω in (3.5), is determined by the number of intervals, I. The squarer unit produces x² and the partial square xω², which are used in the first subfunction, s1(x), and the second subfunction, s2(x), respectively.

4.2.3 Floating-Point Operations

In hardware, a floating-point number is represented by a fixed-point number together with its fractional length, i.e. its number of fractional bits. For addition (or subtraction), the operands must first be aligned to the same fractional length. For multiplication, the numbers are simply multiplied, and the fractional length of the product is the sum of the fractional lengths of the multiplier and the multiplicand. With this representation, the wordlengths of the coefficients c2,i and j2,i can be reduced, since the leading zeros of small-magnitude coefficients do not need to be stored.

4.2.4 Algorithm for Squarer

The squarer units, x² and xω², in Fig. 4.4 can be implemented with the algorithm shown in Fig. 4.5.

Fig. 4.5: The algorithm of the optimized architecture of the squarer unit, in which reduced logic operations are used to calculate the partial products. [Partial-product matrix for a 5-bit input x4 ... x0: row p holds x0x0; row q holds x1x0, x1x1, x0x1; row r holds x2x0, x2x1, x2x2, x1x2, x0x2; rows s and t continue the pattern for x3 and x4.]

The algorithm calculates and adds the partial products p, q, r, s, t, etc. to produce the final result. With this algorithm, the number of partial products follows directly from the number of bits of the parameterized input x. The advantage of implementing the squarer with this algorithm is that both its chip area and its latency are about half of those of the corresponding multiplier.
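The row structure of Fig. 4.5 can be modelled in software. The sketch below assumes the usual folding of a squarer's partial products: a diagonal term xk·xk is simply xk, and the two symmetric terms xk·xj and xj·xk (j < k) merge into one AND term shifted one place left. It illustrates why a squarer needs roughly half the partial-product bits of a multiplier (n(n+1)/2 versus n²); it is not a transcription of the thesis hardware, and the function names are hypothetical.

```python
from typing import List, Tuple


def squarer_rows(x: int, n: int) -> List[int]:
    """Rows (p, q, r, ... for k = 0, 1, 2, ...) of a folded squarer for an n-bit x.

    Row k collects the contributions of bit x_k:
      x_k * x_k = x_k                  at weight 2^(2k)     (no AND gate needed)
      x_k*x_j + x_j*x_k = 2*x_k*x_j    at weight 2^(k+j+1)  for j < k (one AND gate)
    """
    bits = [(x >> i) & 1 for i in range(n)]
    rows = []
    for k in range(n):
        row = bits[k] << (2 * k)
        for j in range(k):
            row += (bits[k] & bits[j]) << (k + j + 1)
        rows.append(row)
    return rows


def square(x: int, n: int) -> int:
    """Sum of all rows equals x * x."""
    return sum(squarer_rows(x, n))


def partial_product_bits(n: int) -> Tuple[int, int]:
    """(folded squarer, plain n x n multiplier) partial-product bit counts."""
    return n * (n + 1) // 2, n * n


if __name__ == "__main__":
    print(square(0b10110, 5), 0b10110 ** 2)
    print(partial_product_bits(14))
```

Row k has k + 1 bits, so the rows grow like the triangle in Fig. 4.5, and the total count is n(n+1)/2.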
4.2.5 Truncation and Optimization

To represent the coefficients c2,i, j2,i, and l2,i in hardware, they are truncated to feasible binary numbers and fractional lengths, as described in Section 4.2.3. Optimizing the coefficients c2,i, j2,i, and l2,i shapes the error behavior, as described in Chapter 5. The wordlengths can be reduced to some extent while the system still maintains the required output precision. Under the constraint of meeting the output precision, different wordlengths between the operations have been simulated, and the combination that results in the smallest architecture and the best error behavior is chosen.

4.3 Postprocessing

The Postprocessing stage maps the output y, which is the approximated result in the range from 0 to 1, to the output z in the range of the target function, which completes the approximation. In the sin(x) example, the range of y is from 0 to 1, which is the same as the range of the target function, sin(x), for x from 0 to π/2. Therefore, the postprocessing stage function is z = y.

Chapter 5 Error Analysis

After the development of the first subfunction, s1(x), and the second subfunction, s2(x), in Chapter 3, and of the architecture in Chapter 4, the next step is to determine the wordlengths to be used in the architecture. The wordlengths affect the error behavior, as shown in Fig. 5.1, where the black curve is the error function before truncation and the grey curve is the error function after truncation.

Fig. 5.1: The error functions before and after truncation. [Plot: error (−1.5·10^−5 to 1.5·10^−5) versus x (0 to 1).]

As shown in Fig. 5.1, after truncation the error function has a negative offset compared to the error function before truncation. To neutralize this effect, the coefficients in the second subfunction, s2(x), are adjusted so that the resulting error is approximately normally distributed around zero.
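The negative offset in Fig. 5.1 is what truncation of two's-complement values produces: dropping the low bits always rounds towards minus infinity, so each truncation lowers a value by between 0 and one LSB, about half an LSB on average. The sketch below (hypothetical helper names; evenly spread test inputs assumed) measures this bias and compares it with rounding, i.e. adding half an LSB before truncating. It only illustrates where the offset comes from; the thesis compensates it by adjusting the coefficients of s2(x) instead.

```python
import math
from typing import Callable, Sequence


def truncate(value: float, frac_bits: int) -> float:
    """Keep frac_bits fractional bits by dropping the rest (round toward -inf)."""
    scale = 1 << frac_bits
    return math.floor(value * scale) / scale


def round_half_up(value: float, frac_bits: int) -> float:
    """Add half an LSB before truncating, i.e. round to nearest."""
    scale = 1 << frac_bits
    return math.floor(value * scale + 0.5) / scale


def mean_error(quantize: Callable[[float, int], float],
               values: Sequence[float], frac_bits: int) -> float:
    """Average of quantize(v) - v over all test values."""
    return sum(quantize(v, frac_bits) - v for v in values) / len(values)


if __name__ == "__main__":
    samples = [k / 2**16 for k in range(2**16)]   # evenly spread inputs in [0, 1)
    f = 8
    print("truncation mean error:", mean_error(truncate, samples, f))   # close to -2^-(f+1)
    print("rounding   mean error:", mean_error(round_half_up, samples, f))
```

The truncation bias scales with the LSB, which is why it grows as wordlengths are reduced and must be compensated before the wordlengths can be cut further.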
Adjusting the coefficients of s2(x) to center the error helps to reduce the wordlengths of the architecture in Fig. 4.4, since the margin between the maximum error and the required precision becomes larger. The error function after truncation and optimization is shown in Fig. 5.2, where the black curve is the error function before truncation and optimization and the grey curve is the error function after truncation and optimization.

Fig. 5.2: The error functions before and after truncation and optimization. [Plot: error versus x (0 to 1).]

As shown in Fig. 5.1 and Fig. 5.2, the error function after truncation and optimization is more evenly distributed around 0 than the error function after truncation only.

5.1 Error Behavior Metrics

When developing the architecture towards a normally distributed error behavior, the following metrics complement the error function: the maximum error, the minimum error, the median error, the mean error, the standard deviation, and the Root Mean Square (RMS) error.

5.1.1 Maximum/Minimum Error

The maximum or minimum error is the maximum or minimum value of the error function, respectively, as shown in (5.1) and (5.2).

$e_{\max} = \max\{e(x)\}$ (5.1)

$e_{\min} = \min\{e(x)\}$ (5.2)

The maximum and minimum errors determine the precision bottleneck of the design.

5.1.2 Median Error

The median error is the middle value of all the error samples. For an odd number of samples, it is the value with an equal number of samples larger and smaller than it. For an even number of samples, it is the mean of the two central values. Compared with the mean error, the median error indicates the skewness of the error distribution.

5.1.3 Mean Error

The mean error is the average value of the error function, as shown in (5.3),

$e_{mean} = \frac{1}{n}\sum_{k=1}^{n} e(x_k)$ (5.3)

where n is the number of samples.
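The metrics of this section, together with the standard deviation and RMS defined in (5.4) and (5.5) below, can be computed from a list of error samples as in the following sketch. The function names are hypothetical; the standard deviation uses the 1/n normalization of (5.4), and a bit-precision helper, −log2|e|, is included because Table 6.4 reports errors in bits.

```python
import math
import statistics
from typing import Dict, Sequence


def error_metrics(errors: Sequence[float]) -> Dict[str, float]:
    """Metrics of Section 5.1 for error samples e(x_k), k = 1..n."""
    n = len(errors)
    mean = sum(errors) / n                                        # (5.3)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)     # (5.4), 1/n as written
    rms = math.sqrt(sum(e * e for e in errors) / n)               # (5.5)
    return {
        "max": max(errors),                                       # (5.1)
        "min": min(errors),                                       # (5.2)
        "median": statistics.median(errors),  # middle value, or mean of the two central values
        "mean": mean,
        "std": std,
        "rms": rms,
    }


def precision_bits(error: float) -> float:
    """Bit precision of a single error value, -log2|e|."""
    return -math.log2(abs(error))


if __name__ == "__main__":
    print(error_metrics([-2.0, -1.0, 0.0, 1.0, 2.0]))
    print(f"{precision_bits(-0.000015692179454):.2f} bits")   # min error of Table 6.4
```

With this normalization, RMS² = σ² + e_mean², so σ and RMS coincide exactly when the mean error is zero.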
5.1.4 Standard Deviation

The standard deviation is the square root of the average of the squared differences between the error and the mean error, as shown in (5.4).

$\sigma = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left[e(x_k) - e_{mean}\right]^2}$ (5.4)

The standard deviation indicates the dispersion of the error around the mean.

5.1.5 RMS (Root Mean Square)

The root mean square is the square root of the average of the squared errors, as shown in (5.5).

$e_{RMS} = \sqrt{\frac{1}{n}\sum_{k=1}^{n} e(x_k)^2}$ (5.5)

The RMS error gives the equivalent magnitude of a varying error. Since $e_{RMS}^2 = \sigma^2 + e_{mean}^2$, the standard deviation and the RMS are equal exactly when the mean error is zero. The optimization therefore aims to make them equal, as a step towards an error that is normally distributed around zero.

5.2 Error Distribution

Another tool for analyzing the error is the histogram of the error function. A histogram indicates the distribution of the error: it gives the number of errors at each specific value, as shown in Fig. 5.3.

Fig. 5.3: The error histogram of the error function in Fig. 5.2. [Plot: number of errors (0 to 1400) versus error (−1.5·10^−5 to 1.5·10^−5).]

Fig. 5.3 shows that the errors of Fig. 5.2 are evenly distributed around 0.

Chapter 6 Implementation of Logarithm

The base-2 logarithm of numbers from 1 to 2, with a 14-bit mantissa input and an output precision of 15 bits, is implemented in hardware using the Improved Parabolic Synthesis methodology described in Chapter 3. The subfunctions are developed with the approach from Section 3.3, in which the coefficient c1 is chosen for the simplest hardware. For the hardware implementation, the architecture and optimization from Chapter 4 are used, and the error metrics from Chapter 5 are listed.

6.1 Development of Subfunctions

To derive the original function, forg(x), the binary logarithm function with the input x ranging from 1 to 2 is simply shifted 1 to the left, as shown in Fig. 6.1.

Fig. 6.1: Normalization of the binary logarithm for x from 1 to 2, in which the dashed line, y = log2(x), is the function before normalization and the solid line, y = log2(x + 1), is the original function, forg(x).

Therefore, the original function, forg(x), is

$f_{org}(x) = \log_2(x + 1)$ (6.1)

6.1.1 Development of c1

Different values of the coefficient c1 in the first subfunction, s1(x), are combined with Second-Degree Interpolation using different numbers of intervals in the second subfunction, s2(x), following the method described in Section 3.3. The result is shown in Fig. 6.2, which plots the precision as a function of c1 from 0 to 1.4 for 1 to 64 intervals in the second subfunction.

Fig. 6.2: The output precision as a function of the coefficient c1 in the first subfunction, s1(x), with 1, 2, 4, 8, 16, 32, and 64 intervals in the second subfunction, s2(x). [Plot: bit precision (5 to 30 bits) versus c1 (0 to 1.4).]

To achieve simple hardware, the coefficient c1 is chosen to be 0. To compensate for the output precision, the number of intervals is chosen to be 8.

6.1.2 First Subfunction

Since the coefficient c1 is set to 0, the first subfunction, s1(x), is defined in (6.2).

$s_1(x) = x$ (6.2)

With the first subfunction, s1(x), developed, the first help function, f1(x), is defined in (6.3).

$f_1(x) = \frac{f_{org}(x)}{s_1(x)} = \frac{\log_2(x + 1)}{x}$ (6.3)

The first help function, f1(x), is used to develop the second subfunction, s2(x), as shown in Section 3.2.

6.1.3 Second Subfunction

The second subfunction, s2(x), is developed with the methodology from Section 3.2. Following (3.3) to (3.9), the 8 sets of coefficients l2,i, j2,i, and c2,i of (3.2), after truncation and optimization, are listed in Tables 6.1 to 6.3, respectively.

Table 6.1: The 8 optimized coefficients l2,i in the second subfunction, s2(x).
coefficient | value
l2,0 | 1.44268798828125000
l2,1 | 1.35939788818359375
l2,2 | 1.28771209716796875
l2,3 | 1.22514343261718750
l2,4 | 1.16991424560546875
l2,5 | 1.12069702148437500
l2,6 | 1.07646942138671875
l2,7 | 1.03644561767578125

Table 6.2: The 8 optimized coefficients j2,i in the second subfunction, s2(x).
coefficient | value
j2,0 | -0.089294433593750
j2,1 | -0.076629638671875
j2,2 | -0.066589355468750
j2,3 | -0.058471679687500
j2,4 | -0.051849365234375
j2,5 | -0.046447753906250
j2,6 | -0.041900634765625
j2,7 | -0.038085937500000

Table 6.3: The 8 optimized coefficients c2,i in the second subfunction, s2(x).
coefficient | value
c2,0 | -0.0060424804687500
c2,1 | -0.0049438476562500
c2,2 | -0.0040435791015625
c2,3 | -0.0032501220703125
c2,4 | -0.0026397705078125
c2,5 | -0.0022277832031250
c2,6 | -0.0018920898437500
c2,7 | -0.0016479492187500

The effective wordlengths representing the coefficients l2,i, j2,i, and c2,i in the second subfunction, s2(x), are 18, 12, and 8 bits, respectively. The optimized wordlengths between the operations are described in Section 6.2.2.

6.2 Hardware Architecture

The architecture for the implementation of the binary logarithm is shown in Fig. 6.3.

Fig. 6.3: Hardware architecture of the logarithm in hierarchy view: preprocessing x = v − 1, processing y = s1(x)·s2(x), and postprocessing z = y.

It is divided into 3 stages, preprocessing, processing, and postprocessing, as in Fig. 4.1. The Preprocessing stage subtracts 1 from its operand. The Processing stage uses the Improved Parabolic Synthesis method to approximate the original function, log2(x + 1). In the Postprocessing stage, the output is directly equal to the input.

6.2.1 Preprocessing

Since the Improved Parabolic Synthesis approximates the original function, forg(x), in (6.1), the input v of the binary logarithm, log2(v), with v from 1 to 2, is reduced by 1 to obtain x normalized to the interval from 0 to 1.
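The three stages of Fig. 6.3 can be sketched end to end with the coefficients of Tables 6.1 to 6.3. Equation (3.2) is not reproduced in this chapter, so the form of s2(x) used below is an assumption: s2 = l2,i + j2,i·xω − c2,i·xω², with xω in [0, 1) the 11-bit local input. This reading is consistent with the tables in that l2,i ≈ f1(i/8) and l2,i + j2,i − c2,i ≈ f1((i + 1)/8). Intermediate results are kept in double precision rather than truncated to the wordlengths of Fig. 6.4, so the error of this sketch is not expected to match Table 6.4 exactly. The function names are hypothetical.

```python
import math

# Coefficients copied from Tables 6.1 to 6.3 (8 intervals, selected by the 3 MSBs of x).
L2 = [1.44268798828125000, 1.35939788818359375, 1.28771209716796875, 1.22514343261718750,
      1.16991424560546875, 1.12069702148437500, 1.07646942138671875, 1.03644561767578125]
J2 = [-0.089294433593750, -0.076629638671875, -0.066589355468750, -0.058471679687500,
      -0.051849365234375, -0.046447753906250, -0.041900634765625, -0.038085937500000]
C2 = [-0.0060424804687500, -0.0049438476562500, -0.0040435791015625, -0.0032501220703125,
      -0.0026397705078125, -0.0022277832031250, -0.0018920898437500, -0.0016479492187500]

IN_BITS, INTERVAL_BITS = 14, 3
FRAC_BITS = IN_BITS - INTERVAL_BITS          # 11-bit local input x_omega


def log2_ips(x_bits: int) -> float:
    """Approximate log2(1 + x) for a 14-bit mantissa x = x_bits / 2^14.

    ASSUMPTION: s2 = l + j*xw - c*xw^2 with xw in [0, 1); this reading of (3.2)
    is inferred from the coefficient tables, not quoted from the thesis.
    """
    i = x_bits >> FRAC_BITS                                   # interval index (3 MSBs)
    xw = (x_bits & ((1 << FRAC_BITS) - 1)) / (1 << FRAC_BITS)  # local input
    s1 = x_bits / (1 << IN_BITS)                              # (6.2): s1(x) = x since c1 = 0
    s2 = L2[i] + J2[i] * xw - C2[i] * xw * xw
    return s1 * s2                                            # y; postprocessing z = y (6.5)


def log2_v(v: float) -> float:
    """Fig. 6.3 for v in [1, 2): x = v - 1 (6.4), truncated to 14 bits, then y = s1*s2."""
    x_bits = int((v - 1.0) * (1 << IN_BITS))
    return log2_ips(x_bits)


def max_abs_error() -> float:
    """Largest |approximation - log2(1 + x)| over all 2^14 inputs."""
    return max(abs(log2_ips(k) - math.log2(1 + k / (1 << IN_BITS)))
               for k in range(1 << IN_BITS))


if __name__ == "__main__":
    e = max_abs_error()
    print(f"max |error| = {e:.3e} ({-math.log2(e):.2f} bits, {20 * math.log10(e):.1f} dB)")
```

Under this reading the maximum absolute error over all 2^14 inputs stays below 2^−15, i.e. below about −90 dB, in line with the 15-bit requirement.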
The Preprocessing stage function is

$x = v - 1$ (6.4)

In hardware, the input to the next stage therefore represents only the mantissa of the number between 1 and 2.

6.2.2 Processing

In the Processing stage, the fractional part of the logarithm is approximated using Improved Parabolic Synthesis with two subfunctions: the first subfunction is a parabolic function, which reduces to s1(x) = x for c1 = 0, and the second subfunction is a Second-Degree Interpolation. The architecture with optimized wordlengths is shown in Fig. 6.4.

Fig. 6.4: Hardware architecture of the logarithm in the processing stage. The 14-bit input x is split into a 3-bit interval index, which selects l2,i, j2,i, and c2,i, and an 11-bit fractional input; the remaining annotations give the optimized wordlengths between the operations.

The optimization of the architecture affects the error behavior, which is described in Section 6.3.

6.2.3 Postprocessing

Since the normalization only shifts the function along the x-axis, the output of the Postprocessing stage is simply equal to its input, as shown in (6.5).

$z = y$ (6.5)

6.3 Error Behavior

The error behavior of the implementation is shown in Fig. 6.5, where the black curve is the error function before truncation and optimization and the grey curve is the error function after truncation and optimization.

Fig. 6.5: The error function before and after truncation and optimization. [Plot: error (−1.5·10^−5 to 1.5·10^−5) versus x (0 to 1).]

Fig. 6.5 indicates that the error function after truncation and optimization is evenly distributed around 0. The absolute error function expressed in dB is shown in Fig. 6.6.

Fig. 6.6: Absolute error function of Fig. 6.5 expressed in dB. [Plot: dB (−160 to −100) versus x (0 to 1).]

As described in (3.11), since all the errors are below −90 dB (an error of 2^−15 corresponds to 20·log10(2^−15) ≈ −90.3 dB), the precision requirement of 15 bits is satisfied.

The histogram of the error in Fig. 6.5, shown in Fig. 6.7, is two-sided, symmetric, and close to a normal distribution. As shown in Fig. 6.7, the errors of the optimized design lie between about −1.5·10^−5 and 1.5·10^−5, with most of the errors around 0 and the number of errors gradually decreasing towards the sides.

Fig. 6.7: The error histogram before the optimization on the coefficients of the second subfunction, s2(x), and the wordlength between operations. [Plot: number of errors (0 to 1200) versus error (−1.5·10^−5 to 1.5·10^−5).]

The error metrics from Section 5.1 for the implementation are listed in Table 6.4.

Table 6.4: The error metrics of the truncated and optimized implementation of the logarithm.
metric | value | bits
Min error | -0.000015692179454 | 15.96 bits
Max error | 0.000015897615491 | 15.94 bits
Mean | 0.000000019483566 | 25.61 bits
Median | -0.000000025005258 | -
Standard Deviation | 0.000004737796437 | -
RMS | 0.000004737691911 | -

The min and max errors show good symmetry of the error behavior. The near equality of the standard deviation and the RMS indicates that the error is centered close to zero, which is consistent with the symmetric, approximately normal histogram.

Chapter 7 Implementation Results

In this chapter, the implementation results, including area, timing, and power estimates using a 65 nm Low Power High VT (LPHVT) CMOS technology, are listed and compared. Full results using the Low Power Low VT (LPLVT) and General Purpose Standard VT (GPSVT) cell libraries are given in Appendix A.

The binary logarithm is also implemented using Parabolic Synthesis and CORDIC. For the Parabolic Synthesis implementation, 4 subfunctions are used and the wordlengths of the design are not optimized. For the CORDIC implementation, 20 iterations are used: 16 (15 + 1) of them provide the accuracy, and iterations 4, 7, 10, and 13 are repeated to ensure convergence [13]. None of the implementations is pipelined.
The results are compared with the Improved Parabolic Synthesis approach and discussed in Chapter 8.

7.1 Area Information

The synthesis tool estimates the ASIC area of the logic gates. The smallest area is obtained by applying a least-area constraint. As shown in Table 7.1, the binary logarithm then occupies less than 4800 µm².

Table 7.1: ASIC synthesis result for the Improved Parabolic Synthesis when the least-area constraint is applied.
constraint | Area (µm²)
least area | 4785

Under normal conditions, where no constraints are applied to the synthesis, the results for the 3 approaches are listed in Table 7.2.

Table 7.2: ASIC synthesis area for the 3 methods.
Design | Area (µm²)
CORDIC | 12893
Parabolic Synthesis | 16258
Improved Parabolic Synthesis | 4865

The Improved Parabolic Synthesis approach occupies much less area than the other two methods.

7.2 Timing Information

The timing information of a design determines the highest possible clock frequency. Table 7.3 lists the timing information when the fastest design is required.

Table 7.3: ASIC synthesis timing result for the fastest-design constraint.
constraint | timing path (ns) | frequency
fastest design | 1.71 | 584 MHz

Under the normal constraint, Table 7.4 compares the timing results of the 3 implementations.

Table 7.4: ASIC synthesis timing results for the three methods.
Design | timing path (ns) | frequency
CORDIC | 86.96 | 11.5 MHz
Parabolic Synthesis | 21.03 | 47.5 MHz
Improved Parabolic Synthesis | 6.96 | 144 MHz

As the results show, the binary logarithm function designed with Improved Parabolic Synthesis can be used in a system with a local clock frequency of 1/6.96 ns ≈ 144 MHz. With Parabolic Synthesis, it can be used with a clock frequency of 1/21.03 ns ≈ 47.5 MHz. For the CORDIC implementation, due to its iterative character, the equivalent frequency is 1/86.96 ns ≈ 11.5 MHz [11].
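The maximum clock frequency follows directly from the combinational critical path, f = 1/t. A small helper (hypothetical name; path values copied from Tables 7.3 and 7.4) reproduces the frequencies above:

```python
def max_clock_mhz(critical_path_ns: float) -> float:
    """Highest clock frequency in MHz for a critical path given in ns (f = 1/t)."""
    return 1000.0 / critical_path_ns


# Critical paths from Tables 7.3 and 7.4.
PATHS_NS = {
    "Improved Parabolic Synthesis (fastest constraint)": 1.71,
    "CORDIC": 86.96,
    "Parabolic Synthesis": 21.03,
    "Improved Parabolic Synthesis": 6.96,
}

if __name__ == "__main__":
    for name, t in PATHS_NS.items():
        print(f"{name:50s} {t:6.2f} ns -> {max_clock_mhz(t):6.1f} MHz")
```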
7.3 Power and Energy Estimation

7.3.1 Power Analysis

The power dissipation of a CMOS circuit has two components, dynamic power and static power:

$P_{total} = P_{dynamic} + P_{static}$ (7.1)

The static power is caused by leakage current while the power is on. The dynamic power consists of switching power and internal power:

$P_{dynamic} = P_{switching} + P_{internal}$ (7.2)

where the switching power comes from charging and discharging the load capacitances, and the internal power comes from the short-circuit current that flows during transitions, when the pull-up and pull-down networks briefly conduct at the same time. The dynamic power is proportional to the switching activity, α, the clock frequency, f, the equivalent capacitance, C, and the square of the supply voltage, VDD:

$P_{dynamic} = \alpha f C V_{DD}^2$ (7.3)

The switching activity, α, is estimated from a Value Change Dump (VCD) file generated by the simulation tools. The power estimates of the binary logarithm using Improved Parabolic Synthesis, Parabolic Synthesis, and CORDIC at different frequencies are plotted in Fig. 7.1.

Fig. 7.1: Power estimation for the 3 designs at different frequencies. [Log-log plot: P (nW, 10 to 10^7) versus frequency (Hz, 0.001 to 10^7); curves: Improved P.S. logarithm, P.S. logarithm, CORDIC logarithm.]

For the same cell library, the static power depends on the number of transistors, and it is the dominant source of dissipation at low frequencies. As the frequency increases, the dynamic power grows and becomes the main source of dissipation. Due to its smaller area and lower switching activity, the binary logarithm implemented with Improved Parabolic Synthesis consumes much less static and dynamic power than the other 2 approaches.

7.4 Physical Design

In the physical design, the Electronic Design Automation (EDA) tools combine the netlist and the library files, resulting in a Graphic Database System (GDSII) file for fabrication.
The layout of the binary logarithm is shown in Fig. 7.2.

Fig. 7.2: The GDSII result for the binary logarithm realized with Improved Parabolic Synthesis.

The floor plan is specified as 80 µm × 80 µm. Two metal layers are assigned to the power supply and ground. The Standard Delay Format (SDF) file, which contains the timing delay information from the physical design, is combined with the netlist to perform post-layout simulation. The Standard Parasitic Exchange Format (SPEF) file is also extracted after placement; it contains the parasitic data used for post-layout power estimation. The post-layout simulation is performed to verify the functional correctness of the design before fabrication.

Chapter 8 Conclusion

For the Improved Parabolic Synthesis, the first and second subfunctions, s1(x) and s2(x), are developed concurrently, and the coefficient c1 is chosen as a value that results in both high accuracy and low complexity. An increased number of intervals in the Second-Degree Interpolation of the second subfunction, s2(x), can compensate the output precision when simple hardware is chosen. Truncation gives an offset to the error behavior, and the optimization of the coefficients c2,i, j2,i, and l2,i balances it. Note that it is beneficial for the development of the second subfunction, s2(x), if the difference between the first help function, f1(x), and the second subfunction, s2(x), decreases gradually along the x-axis.

8.1 Comparison

Compared to Parabolic Synthesis and CORDIC, the Improved Parabolic Synthesis implementation is much smaller, faster, and consumes much less power. The implementations using Improved Parabolic Synthesis and Parabolic Synthesis have an advantage in error behavior over the CORDIC implementation; in particular, after optimization, the error of the Improved Parabolic Synthesis can be characterized as close to a normal distribution.
The Improved Parabolic Synthesis approach is suitable for high-frequency, low-power solutions.

8.2 Future Work

The three approaches could be implemented and prototyped on an FPGA to compare their resource utilization. The design could be made faster by pipelining it to shorten the critical path. Alternatively, increasing the number of intervals in the second subfunction, s2(x), may make the design faster. The tactic of optimizing the design to characterize the error behavior should be studied and standardized. The Improved Parabolic Synthesis can also realize other unary functions, e.g. trigonometric, exponential, and square root functions.

Bibliography

[1] J. N. Mitchell, "Computer multiplication and division using binary logarithms," IRE Transactions on Electronic Computers, vol. EC-11, no. 4, pp. 512–517, 1962.
[2] P. Tang, "Table-lookup algorithms for elementary functions and their error analysis," in Proceedings of the 10th IEEE Symposium on Computer Arithmetic, 1991, pp. 232–236.
[3] P.-T. P. Tang, "Table-driven implementation of the logarithm function in IEEE floating-point arithmetic," ACM Transactions on Mathematical Software, vol. 16, no. 4, pp. 378–400, Dec. 1990. Available: http://doi.acm.org/10.1145/98267.98294
[4] J. Hormigo, J. Villalba, and M. Schulte, "A hardware algorithm for variable-precision logarithm," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, 2000, pp. 215–224.
[5] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. EC-8, no. 3, pp. 330–334, 1959.
[6] A. Boudabous, F. Ghozzi, M. Kharrat, and N. Masmoudi, "Implementation of hyperbolic functions using CORDIC algorithm," in Proceedings of the 16th International Conference on Microelectronics (ICM 2004), 2004, pp. 738–741.
[7] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers," in Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, ACM, 1998, pp. 191–200.
[8] E. Hertz and P. Nilsson, "A methodology for parabolic synthesis," book chapter in VLSI, In-Tech, ISBN 978-3-902613-50-9.
[9] E. Hertz and P. Nilsson, "A methodology for parabolic synthesis of unary functions for hardware implementation," in 2nd International Conference on Signals, Circuits and Systems (SCS 2008), 2008, pp. 1–6.
[10] E. Hertz and P. Nilsson, "Parabolic synthesis methodology implemented on the sine function," in IEEE International Symposium on Circuits and Systems (ISCAS 2009), 2009, pp. 253–256.
[11] P. Pouyan, E. Hertz, and P. Nilsson, "A VLSI implementation of logarithmic and exponential functions using a novel parabolic synthesis methodology compared to the CORDIC algorithm," in 20th European Conference on Circuit Theory and Design (ECCTD 2011), 2011, pp. 709–712.
[12] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Transactions on Algorithms.
[13] P. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, "50 years of CORDIC: Algorithms, architectures, and applications," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 9, pp. 1893–1907, 2009.
Appendix A Logarithm Implementation Results

A.1 Area

Table A.1: ASIC synthesis area results using LPLVT.
Design | Area (µm²)
CORDIC | 12991
Parabolic Synthesis | 16368
Improved Parabolic Synthesis | 4855

Table A.2: ASIC synthesis area results using LPHVT.
Design | Area (µm²)
CORDIC | 12893
Parabolic Synthesis | 16258
Improved Parabolic Synthesis | 4865

Table A.3: ASIC synthesis area results using GPSVT.
Design | Area (µm²)
CORDIC | 13061
Parabolic Synthesis | 16378
Improved Parabolic Synthesis | 4853

A.2 Timing

Table A.4: ASIC synthesis timing results in LPLVT.
Design | timing path (ns)
CORDIC | 86.96
Parabolic Synthesis | 21.03
Improved Parabolic Synthesis | 6.96

Table A.5: ASIC synthesis timing results in LPHVT.
Design | timing path (ns)
CORDIC | not simulated
Parabolic Synthesis | 38.93
Improved Parabolic Synthesis | 11.90

Table A.6: ASIC synthesis timing results in GPSVT.
Design | timing path (ns)
CORDIC | not simulated
Parabolic Synthesis | 8.09
Improved Parabolic Synthesis | 4.89

A.3 Power Estimation for LPHVT

Table A.7: PrimeTime power estimation for the CORDIC method using the LPHVT library. The static power is 43.8 nW at every frequency.
frequency (Hz) | dynamic power (nW) | total power (nW)
0.001 | 0.0001585 | 44
1 | 0.1585 | 44
1.77827941004 | 0.281857286 | 44
3.16227766017 | 0.501221009 | 44
5.6234132519 | 0.891311 | 45
10 | 1.585 | 45
17.7827941004 | 2.818572865 | 47
31.6227766017 | 5.012210091 | 49
56.234132519 | 8.913110004 | 53
100 | 15.85 | 60
177.827941004 | 28.185728649 | 72
316.227766017 | 50.122100914 | 94
562.34132519 | 89.131100043 | 133
1000 | 158.5 | 202
1778.27941004 | 281.857286491 | 326
3162.27766017 | 501.221009137 | 545
5623.4132519 | 891.311000427 | 935
10000 | 1585 | 1629
17782.7941004 | 2818.57286491 | 2862
31622.7766017 | 5012.21009137 | 5056
56234.132519 | 8913.11000427 | 8957
100000 | 15850 | 15894
177827.941004 | 28185.7286491 | 28230
316227.766017 | 50122.1009137 | 50166
562341.32519 | 89131.1000427 | 89175
1000000 | 158500 | 158544
1778279.41004 | 281857.286491 | 281901
3162277.66017 | 501221.009137 | 501265
5623413.2519 | 891311.000427 | 891355
10000000 | 1585000 | 1585044
17782794.1004 | 2818572.86491 | 2818617
31622776.6017 | 5012210.09137 | 5012254

Table A.8: PrimeTime power estimation for the Parabolic Synthesis method using the LPHVT library. The static power is 56.5 nW at every frequency.
frequency (Hz) | dynamic power (nW) | total power (nW)
0.001 | 6.54e-05 | 57
1 | 0.0654 | 57
1.77827941004 | 0.116299473 | 57
3.16227766017 | 0.206812959 | 57
5.6234132519 | 0.367771227 | 57
10 | 0.654 | 57
17.7827941004 | 1.162994734 | 58
31.6227766017 | 2.06812959 | 59
56.234132519 | 3.677712267 | 60
100 | 6.54 | 63
177.827941004 | 11.629947342 | 68
316.227766017 | 20.681295898 | 77
562.34132519 | 36.777122667 | 93
1000 | 65.4 | 122
1778.27941004 | 116.299473417 | 173
3162.27766017 | 206.812958975 | 263
5623.4132519 | 367.771226674 | 424
10000 | 654 | 711
17782.7941004 | 1162.99473416 | 1219
31622.7766017 | 2068.12958975 | 2125
56234.132519 | 3677.71226674 | 3734
100000 | 6540 | 6597
177827.941004 | 11629.9473417 | 11686
316227.766017 | 20681.2958975 | 20738
562341.32519 | 36777.1226674 | 36834
1000000 | 65400 | 65457
1778279.41004 | 116299.473417 | 116356
3162277.66017 | 206812.958975 | 206869
5623413.2519 | 367771.226674 | 367828
10000000 | 654000 | 654057
17782794.1004 | 1162994.73417 | 1163051
31622776.6017 | 2068129.58975 | 2068186

Table A.9: PrimeTime power estimation for the Improved Parabolic Synthesis method using the LPHVT library. The static power is 16.7 nW at every frequency.
frequency (Hz) | dynamic power (nW) | total power (nW)
0.001 | 6.12e-06 | 17
1 | 0.00612 | 17
1.77827941004 | 0.01088307 | 17
3.16227766017 | 0.0193531393 | 17
5.6234132519 | 0.0344152891 | 17
10 | 0.0612 | 17
17.7827941004 | 0.1088306999 | 17
31.6227766017 | 0.1935313928 | 17
56.234132519 | 0.344152891 | 17
100 | 0.612 | 17
177.827941004 | 1.0883069989 | 18
316.227766017 | 1.935313928 | 19
562.34132519 | 3.4415289102 | 20
1000 | 6.12 | 23
1778.27941004 | 10.8830699894 | 28
3162.27766017 | 19.3531392802 | 36
5623.4132519 | 34.4152891016 | 51
10000 | 61.2 | 78
17782.7941004 | 108.830699894 | 126
31622.7766017 | 193.531392802 | 210
56234.132519 | 344.152891017 | 361
100000 | 612 | 629
177827.941004 | 1088.30699894 | 1105
316227.766017 | 1935.31392802 | 1952
562341.32519 | 3441.52891016 | 3458
1000000 | 6120 | 6137
1778279.41004 | 10883.0699894 | 10900
3162277.66017 | 19353.1392802 | 19370
5623413.2519 | 34415.2891016 | 34432
10000000 | 61200 | 61217
17782794.1004 | 108830.699894 | 108847
31622776.6017 | 193531.392802 | 193548
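The tabulated values follow the model of (7.1) and (7.3): the static power is constant, and the dynamic power is proportional to the frequency, so each design is described by its static power and its dynamic power at 1 Hz. The sketch below (hypothetical names; parameters read off Tables A.7 to A.9) evaluates that model and the frequency at which dynamic power overtakes static power.

```python
from typing import Dict, Tuple

# (dynamic power at 1 Hz in nW/Hz, static power in nW), read off Tables A.7 to A.9.
DESIGNS: Dict[str, Tuple[float, float]] = {
    "CORDIC": (0.1585, 43.8),
    "Parabolic Synthesis": (0.0654, 56.5),
    "Improved Parabolic Synthesis": (0.00612, 16.7),
}


def total_power_nw(design: str, freq_hz: float) -> float:
    """P_total = P_static + k * f, i.e. (7.1) with P_dynamic proportional to f as in (7.3)."""
    k, p_static = DESIGNS[design]
    return p_static + k * freq_hz


def crossover_hz(design: str) -> float:
    """Frequency at which the dynamic power equals the static power."""
    k, p_static = DESIGNS[design]
    return p_static / k


if __name__ == "__main__":
    for name in DESIGNS:
        print(f"{name:30s} P(10 kHz) = {total_power_nw(name, 1e4):8.1f} nW, "
              f"crossover = {crossover_hz(name):7.1f} Hz")
```

For the three designs the crossover lies at roughly 276 Hz (CORDIC), 864 Hz (Parabolic Synthesis), and 2.7 kHz (Improved Parabolic Synthesis); below these frequencies the static power dominates, as stated in Section 7.3.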
