Hardware Implementation of Number Inversion Function

Master's Thesis

Jiajun Chen
Jianan Shi

Department of Electrical and Information Technology
Faculty of Engineering, LTH, Lund University
Box 118, SE-221 00 Lund, Sweden
2016

This thesis is set in Times New Roman 11pt,
with the WORD Documentation System

© Jiajun Chen and Jianan Shi, 2015
Abstract
Number inversion is an arithmetic operation of great significance and considerable complexity, especially when implemented in hardware, where it offers higher speed and lower power consumption.

Initially, the inputs are converted to floating-point numbers consisting of sign, exponent, and mantissa parts. The scope of the project is to approximate the number inversion function. In this thesis, the number inversion function is implemented with two methods: one based on Harmonized Parabolic Synthesis and the other on an unrolled Newton-Raphson algorithm. Both methods are implemented as an Application Specific Integrated Circuit using a 65 nm Complementary Metal Oxide Semiconductor technology with Low Power High Threshold Voltage transistors. Furthermore, the two methods are investigated and compared with respect to accuracy, error behavior, chip area, power consumption, and performance.

Keywords: harmonized parabolic synthesis, unrolled Newton-Raphson, power consumption.
Acknowledgments
This Master's thesis would not exist without the support and guidance of our primary supervisors, Erik Hertz and Peter Nilsson, who gave us considerable, useful suggestions on this thesis work and comments on the draft. We are deeply grateful for their help in the completion of this thesis.

We also owe our sincere gratitude to our supervisor, Rakesh Gangarajaiah, who gave us effective, unstinting help and walked us through tough difficulties when we were stuck on challenging problems.

Finally, our thanks go to our beloved families for their continuous support and encouragement all these years.

Peter Nilsson
in memoriam

Peter Nilsson was our supervisor during the whole project. He gave us tremendous help when we confronted challenging problems and left us comments on the draft before he passed away. We feel great sorrow that he is absent and unable to share the achievement of the completion of this thesis.
Jiajun Chen
Jianan Shi
Contents

Abstract
Acknowledgments
List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Thesis Outline
2 Algorithm for Number Inversion
  2.1 Number inversion function
3 Parabolic Synthesis
  3.1 First sub-function
  3.2 Second sub-function
  3.3 Nth sub-function when n > 2
4 Harmonized Parabolic Synthesis
  4.1 First sub-function
  4.2 Second Sub-function
  4.3 Selection of a proper c1
  4.4 Bit Precision
    4.4.1 Definition
    4.4.2 Combination of Decibel and Binary Number
  4.5 An example of c1 selection
5 Newton-Raphson
  5.1 An example of NR iterations
  5.2 Reciprocal, square root, square root reciprocal
6 Hardware Architecture Using HPS
  6.1 Pre-processing
  6.2 Processing
    6.2.1 Architecture of HPS
    6.2.2 Squarer
    6.2.3 Truncation and Optimization
  6.3 Post-processing
7 Hardware Architecture Using NR
8 Error Analysis
  8.1 Error Behavior Metrics
    8.1.1 Maximum positive and negative error
    8.1.2 Median error
    8.1.3 Mean error
    8.1.4 Standard deviation
    8.1.5 RMS
  8.2 Error Distribution
9 Implementation of Number Inversion applying HPS and NR iteration
  9.1 HPS
    9.1.1 Development of Sub-functions
      9.1.1.1 Pre-processing
      9.1.1.2 Processing
      9.1.1.3 Post-processing
      9.1.1.4 First Sub-function
      9.1.1.5 Selection of a proper c1
      9.1.1.6 Second sub-function
    9.1.2 Hardware architecture
      9.1.2.1 The optimized hardware architectures
  9.2 NR Iteration
    9.2.1 Initial Guess
    9.2.2 Hardware Architecture
10 Implementation Results
  10.1 Error Behavior
  10.2 Error metrics
  10.3 Synthesis Constraints
  10.4 Chip Area
  10.5 Timing
  10.6 Power estimation
  10.7 Physical Layout
11 Conclusions
  11.1 Comparison
  11.2 Future Work
Appendix A
  A.1 The coefficients in the second sub-function using HPS method
  A.2 The coefficients in LUTs of NR Iteration
  A.3 The power consumptions at various frequencies
References
List of Acronyms

CORDIC   COordinated Rotation DIgital Computer
CMOS     Complementary Metal Oxide Semiconductor
dB       Decibel
GPSVT    General Purpose Standard Threshold Voltage
HPS      Harmonized Parabolic Synthesis
LPHVT    Low Power High Threshold Voltage
LPLVT    Low Power Low Threshold Voltage
LUT      Look-Up Table
NR       Newton-Raphson algorithm
P&R      Place and Route
RMS      Root Mean Square
VHDL     Very high speed integrated circuit Hardware Description Language
List of Figures

Fig 2.1  The block diagram of the approximation.
Fig 2.2  The curve of the number inversion function.
Fig 2.3  The block diagram of number inversion.
Fig 3.1  An example of an original function, forg(x).
Fig 3.2  An example of a first help function, f1(x).
Fig 3.3  A second sub-function, s2(x), compared to a first help function, f1(x).
Fig 3.4  An example of a second help function, f2(x).
Fig 3.5  The segmented and renormalized second help function, f2(x).
Fig 4.1  The first help function, divided evenly.
Fig 4.2  c1 selection with 1 interval in the second sub-function, s2(x).
Fig 5.1  The curve of the equation, f(x).
Fig 6.1  The procedure of HPS.
Fig 6.2  The architecture of the processing step using HPS with 16 intervals.
Fig 6.3  The squarer unit algorithm.
Fig 6.4  The optimized squarer unit.
Fig 7.1  The general architecture of NR iterations.
Fig 8.1  An example of error distribution.
Fig 9.1  The graphs of the original function, forg(x), and the target function, f(x).
Fig 9.2  The precision function for c1, combined with 2, 4, 8, 16, 32, 64 intervals in the second sub-function.
Fig 9.3  The hardware architecture of the number inversion function.
Fig 9.4  The block diagram of the procedure of the number inversion function.
Fig 9.5  The hardware architecture for the 16-interval structure.
Fig 9.6  The hardware architecture for the 32-interval structure.
Fig 9.7  The hardware architecture for the 64-interval structure.
Fig 9.8  The hardware architecture for the one-stage NR iteration.
Fig 9.9  The hardware architecture for the two-stage NR iteration.
Fig 10.1  The error behavior before and after truncation and optimization applying the 16-interval structure.
Fig 10.2  The error of Fig 10.1 in bit unit.
Fig 10.3  The error distribution based on Fig 10.1.
Fig 10.4  The error distribution for the 32-interval structure.
Fig 10.5  The error distribution for the 64-interval structure.
Fig 10.6  The error behavior after truncation and optimization applying one-stage NR iteration.
Fig 10.7  The error behavior before truncation and optimization applying one-stage NR iteration.
Fig 10.8  The error in Fig 10.6 in bit unit.
Fig 10.9  The error distribution based on Fig 10.6.
Fig 10.10  The error distribution of two-stage NR iteration.
Fig 10.11  The power consumption for five architectures at multiple clock frequencies.
Fig 10.12  The zoomed-in view of Fig 10.11.
Fig 10.13  The switching power, internal power, static power and total power for the 64-interval structure applying HPS after post-synthesis simulation at multiple clock frequencies.
Fig 10.14  The switching power, internal power, static power and total power applying one-stage NR iteration after post-synthesis simulation at multiple clock frequencies.
Fig 10.15  The switching power, internal power, static power and total power for the 64-interval structure applying HPS after post-layout simulation at multiple clock frequencies.
Fig 10.16  The switching power, internal power, static power and total power applying one-stage NR iteration after post-layout simulation at multiple clock frequencies.
Fig 10.17  The physical layout of the 64-interval structure applying the HPS method.
Fig 10.18  The physical layout of the one-stage NR iteration architecture.
List of Tables

Table 2.1  The transformation of a fixed-point number to a floating-point number in number base 2.
Table 6.1  The remaining factors in the corresponding vectors.
Table 10.1  Error metrics for five architectures after truncation and optimization.
Table 10.2  Chip area when the minimum chip area constraint is set.
Table 10.3  Chip area when the maximum speed constraint is set.
Table 10.4  Timing report when the maximum speed constraint is set.
Table 10.5  Timing report when the minimum chip area constraint is set.
Table A.1  The coefficients l2,i of the 16-interval structure.
Table A.2  The coefficients j2,i of the 16-interval structure.
Table A.3  The coefficients c2,i of the 16-interval structure.
Table A.4  The coefficients l2,i of the 32-interval structure.
Table A.5  The coefficients j2,i of the 32-interval structure.
Table A.6  The coefficients c2,i of the 32-interval structure.
Table A.7  The coefficients l2,i of the 64-interval structure.
Table A.8  The coefficients j2,i of the 64-interval structure.
Table A.9  The coefficients c2,i of the 64-interval structure.
Table A.10  The initial guess values of one-stage NR iteration.
Table A.11  The initial guess values of two-stage NR iteration.
Table A.12  The total power consumptions at various frequencies.
Table A.13  The switching power, internal power, static power and total power for the 64-interval architecture after post-synthesis simulation.
Table A.14  The switching power, internal power, static power and total power for the 1NR architecture after post-synthesis simulation.
Table A.15  The switching power, internal power, static power and total power for the 64-interval architecture after post-layout simulation.
Table A.16  The switching power, internal power, static power and total power for the 1NR architecture after post-layout simulation.
1 Introduction
Unary functions, especially non-linear operations such as the sine, logarithmic, exponential and number inversion functions, have been widely applied in computer graphics, wireless systems and digital signal processing [1]. These functions are simple to implement in software for low-precision cases. However, for high-precision and high-speed applications, software implementations are insufficient. Therefore, hardware implementations of these functions are worth considering.
To implement these non-linear operations in hardware, there are several different methods to consider, e.g. Look-Up Tables (LUT), the COordinated Rotation DIgital Computer (CORDIC) algorithm, Harmonized Parabolic Synthesis (HPS) and the unrolled Newton-Raphson (NR) algorithm. In this thesis, LUT and CORDIC are briefly introduced, while the HPS and NR methods are implemented both in software and in hardware. Furthermore, the two methods are investigated and compared, which is of great interest, with respect to factors such as accuracy, error behavior, chip area, power consumption, and performance.
For accuracies up to about 12 bits, a LUT is a simple, fast and straightforward method to implement approximations in hardware. Reading a value from a table or matrix is much faster than calculating the number with an algorithm or using a trigonometric function. However, the primary drawback of a LUT is its memory usage. When higher precision is required, the size of the LUT increases exponentially, which is not always feasible.
The CORDIC algorithm is a simple and efficient method to calculate hyperbolic and trigonometric functions. The basic idea is vector rotation with a number of stages to improve accuracy. Adders/subtractors are required to determine the direction of each rotation. The primary advantage is that hardware multiplications are avoided, which decreases the complexity of the design. Based on the CORDIC algorithm, only simple shift-and-add operations are required for tasks such as real and complex multiplications, divisions, square root calculations, etc. However, the latency of the implementation is an inherent drawback of the conventional CORDIC algorithm [2]. Furthermore, it requires n+1 iterations to obtain n-bit precision, which leads to low-speed carry-propagate additions, a low throughput rate and area-consuming shift operations.
Erik Hertz and Peter Nilsson [3] have recently proposed the parabolic synthesis method, an approximation method for implementing unary functions with high-speed performance. It consists of pre-processing, processing, and post-processing parts. By combining a synthesis of parabolic functions, the approximation of a target function, such as the sine, exponential, logarithm and number inversion functions, is realizable. The improved method based on parabolic synthesis, HPS, requires only two sub-functions, combined with second-order interpolation, which results in better speed performance.
The NR iteration algorithm is also applicable. It is a method, named after Isaac Newton and Joseph Raphson, to find approximations to the roots of a real-valued function. This method is hardware-friendly. In this algorithm, a LUT is often required to select a start value, also called an initial guess. It is important for the initial guess to be as close as possible to the true result. To meet the bit precision requirement, several iterations are necessary. It is worth mentioning that more iterations allow smaller LUTs. Therefore, there is a trade-off between the number of iterations and the size of the LUTs.
In this thesis, we focus on the HPS and NR methods. In total, five architectures are implemented in hardware based on the HPS and NR methods. The hardware designs are compared with respect to various aspects such as accuracy, error behavior, chip area, power consumption and speed performance. The power consumption, both static and dynamic, is estimated for different clock frequencies. When applying the HPS method, the number of intervals in the second sub-function has a large influence on the error behavior, critical path, and chip area. Hence, three different architectures are built using 16, 32, and 64 intervals in the second sub-function. Two different Matlab models are designed based on the HPS and NR methods. All implementations are designed in the Very High Speed Integrated Circuit Hardware Description Language (VHDL) and synthesized using a 65 nm Complementary Metal Oxide Semiconductor (CMOS) technology. Various synthesis libraries such as Low Power High Threshold Voltage (LPHVT), Low Power Low Threshold Voltage (LPLVT), and General Purpose Standard Threshold Voltage (GPSVT) are considered.

As a result, chip area, timing and power consumption at different clock frequencies are reported. Furthermore, the physical layouts for both the HPS and NR methods after Place and Route (P&R) are demonstrated as well.
1.1 Thesis Outline

The remaining chapters are organized as follows:

Chapter 2 introduces the general algorithm for the number inversion function.
Chapter 3 describes the Parabolic Synthesis theory in detail.
Chapter 4 demonstrates the HPS theory.
Chapter 5 illustrates the NR theory.
Chapter 6 describes the general hardware architecture using the HPS method.
Chapter 7 explains the general hardware architecture using the NR method.
Chapter 8 expounds the error behavior analysis in Matlab.
Chapter 9 introduces three different implementations with three different numbers of intervals using HPS, and two different implementations with respect to two different numbers of NR iterations. All five detailed architectures are included.
Chapter 10 lists and analyzes the results of Chapter 9, including error behavior, error metrics, chip area, timing and power consumption.
Chapter 11 contains the conclusions and future work of this thesis.
Appendix A lists the coefficients in the second sub-function using the HPS method, the coefficients in the LUTs of the NR iteration method, and the power consumptions at various frequencies for all architectures.
2 Algorithm for Number Inversion
In this thesis, the number inversion function is computed using floating-point numbers [4], containing a mantissa and an exponent. It is worth mentioning that the exponent scales the mantissa [5]. The computations for the mantissa and the exponent can be separated. This separation reduces the computational complexity, since the approximation operates on a limited mantissa range. In addition, the computation of the exponent is simple in hardware. Table 2.1 describes the transformation of a fixed-point number to a floating-point number in number base 2.
Table 2.1 The transformation of a fixed-point number to a floating-point number in number base 2.

Base 10:               158
Fixed-point base 2:    0  0  0  0  0  0  0  0  1  0  0  1  1  1  1  0
Index:                 15 14 13 12 11 10 9  8  7  6  5  4  3  2  1  0
Floating-point base 2: 1.001111000000000 · 2^7
The exponent depends on the position of the most significant set bit of the fixed-point number; in floating-point representation the exponent thus scales the mantissa. The implementation computes the function of the mantissa as an approximation. Figure 2.1 demonstrates the block diagram of the approximation.
Fig 2.1 The block diagram of the approximation.
As shown in Figure 2.1, the input v is the mantissa part of the floating-point number, and z is the output of the approximation.
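As a minimal illustration of this conversion step, the following Python sketch (ours, not from the thesis; the function name and word length are chosen for illustration) normalizes a positive fixed-point integer into a mantissa in [1, 2) and an exponent:

```python
def to_float_base2(value, width=16):
    """Normalize a positive fixed-point integer to (mantissa, exponent).

    The exponent is the index of the most significant set bit; dividing
    by 2**exponent scales the mantissa into [1, 2), so the approximation
    only has to handle a limited input range.
    """
    assert 0 < value < 2 ** width
    exponent = value.bit_length() - 1   # index of the MSB
    mantissa = value / 2 ** exponent    # in [1, 2)
    return mantissa, exponent

# 158 = 0b10011110 -> (1.234375, 7), i.e. 1.001111 * 2^7 as in Table 2.1
print(to_float_base2(158))
```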
2.1 Number inversion function
As mentioned in Table 2.1, a fixed-point number can be transformed to a floating-point number. When computing the number inversion function, the output z is shown in (2.2):

z = 1 / (V0.v-1v-2...v-15 · 2^index)    (2.2)

where V0 is the integer bit of v and v-1, v-2, ..., v-15 are the fractional bits of v. In (2.3), (2.2) is rewritten:

z = (1 / V0.v-1v-2...v-15) · 2^-index    (2.3)

which shows why the index of the result is negated.
Figure 2.2 illustrates the inverse function in the range from 1 to 2.
Fig 2.2 The curve of number inversion function.
The algorithm of the number inversion function is illustrated in Figure 2.3.
Fig 2.3 The block diagram of number inversion.
The input is a floating-point number with an exponent part and a mantissa part. As shown in Figure 2.3, the computation for the mantissa and the exponent is separate. For the exponent, the sign is negated in the result. Concerning the mantissa, if the mantissa of the result is less than 1, it is multiplied by 2 and the exponent is decreased by 1 to renormalize the result. Note that, in this thesis, we focus on the approximation part, performed by the HPS and NR methods, which are described in further chapters.
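A hedged Python sketch of this flow (illustrative only; the approximation block is idealized here as an exact division, and the function name invert_float is ours):

```python
def invert_float(mantissa, exponent):
    """Invert a number given as mantissa in [1, 2) and exponent, base 2.

    Sketch of the flow in Fig 2.3: the mantissa is inverted by the
    approximation block, the exponent sign is negated, and the result
    is renormalized into [1, 2) if needed.
    """
    inv = 1.0 / mantissa        # in hardware: the approximation block
    new_exp = -exponent         # the exponent sign is negated
    if inv < 1.0:               # renormalize: multiply by 2, exponent - 1
        inv *= 2.0
        new_exp -= 1
    return inv, new_exp

# 1/158: invert (1.234375, 7) -> (~1.62025, -8), i.e. ~0.0063291
m, e = invert_float(1.234375, 7)
print(m * 2.0 ** e)             # ~0.0063291 = 1/158
```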
3 Parabolic Synthesis
This chapter contains the description of the Parabolic Synthesis algorithm, an approximation method for implementing unary functions. Parabolic Synthesis approximates the original function, denoted forg(x), as the product of a series of second-order sub-functions [6], denoted s1(x), s2(x), ..., sn(x), as shown in (3.1):

forg(x) = s1(x) · s2(x) · s3(x) · ... · sn(x)    (3.1)

The number of sub-functions affects the accuracy. When the number of sub-functions, n, goes to infinity, the original function, forg(x), is fully satisfied. In order to develop these sub-functions, help functions are used. The first help function, f1(x), is defined as the quotient of the original function, forg(x), and the first sub-function, s1(x), as shown in (3.2):

f1(x) = forg(x) / s1(x) = s2(x) · s3(x) · ... · s∞(x)    (3.2)

In the same manner, the following help functions, fn(x), are defined as shown in (3.3). Note that the nth help function is used to develop the (n+1)th sub-function.

fn(x) = fn-1(x) / sn(x) = sn+1(x) · sn+2(x) · ... · s∞(x)    (3.3)
3.1 First sub-function

In order to apply the Parabolic Synthesis method, the target function must be normalized to the range 0 ≤ x ≤ 1 on the x-axis and 0 ≤ y ≤ 1 on the y-axis. This normalization results in the original function, forg(x). Figure 3.1 demonstrates an example of an original function, forg(x).
Fig 3.1 An example of an original function, forg(x).
In addition, the original function, forg(x), should satisfy the following requirements:

1. The original function, forg(x), should be strictly concave or convex, since the sub-functions are second-order polynomials, which are strictly concave or convex.
2. The first help function, f1(x), must have a limit value when x goes to 0. If the original function, forg(x), does not have such a limit value, the first help function, f1(x), is not defined.
3. The limit value in requirement 2, after 1 is subtracted from it, should be smaller than 1 and larger than -1. If it is not inside this interval, the first sub-function, s1(x), is not deployable since the gradient is not positive in the interval.

The first sub-function, s1(x), is a second-order polynomial, as shown in (3.4):

s1(x) = l1 + k1·x + c1·(x - x^2)    (3.4)

As shown in Figure 3.1, the first sub-function, s1(x), should pass through the same starting and ending points as the original function, forg(x), since it is to approach forg(x). The starting point is (0, 0), which gives l1 = 0. The coefficient k1 is calculated to be 1, based on both the starting point and the ending point. The first sub-function, s1(x), is thus simplified according to (3.5):

s1(x) = x + c1·(x - x^2)    (3.5)
The coefficient c1 is calculated according to (3.6):

1 = lim(x→0) forg(x) / s1(x) = lim(x→0) forg(x) / (x + c1·(x - x^2))    (3.6)

Note that, since x^2 goes towards 0 faster than x, the limit in (3.6) can be modified as shown in (3.7):

1 = lim(x→0) forg(x) / ((1 + c1)·x)    (3.7)

The coefficient c1 is thus calculated according to (3.8):

c1 = lim(x→0) forg(x)/x - 1    (3.8)
3.2 Second sub-function

The second sub-function, s2(x), is designed to approach the first help function, f1(x). Figure 3.2 demonstrates an example of the first help function, f1(x).

Fig 3.2 An example of a first help function, f1(x).

The second sub-function, s2(x), is a second-order polynomial as well, defined in (3.9):

s2(x) = l2 + k2·x + c2·(x - x^2)    (3.9)
Figure 3.3 demonstrates an example of a comparison between a first help function, f1(x), and a second sub-function, s2(x).

Fig 3.3 A second sub-function, s2(x), compared to a first help function, f1(x).

Since the second sub-function, s2(x), is to approach the first help function, f1(x), both functions should pass through the same starting point (0, 1) and the same ending point (1, 1). The coefficient l2 is calculated to be 1, based on the starting point, and the coefficient k2 is calculated to be 0, based on both the starting point and the ending point. The second sub-function, s2(x), is simplified as shown in (3.10):

s2(x) = 1 + c2·(x - x^2)    (3.10)

The coefficient c2 ensures that the quotient of f1(x) and s2(x) is 1 when x equals 0.5. In (3.11), the coefficient c2 is calculated:

c2 = 4·(f1(1/2) - 1)    (3.11)

Similarly, by dividing the first help function, f1(x), by the second sub-function, s2(x), a second help function, f2(x), is obtained; an example is illustrated in Figure 3.4.
Fig 3.4 An example of a second help function, f2(x).
As shown in Figure 3.4, the second help function, f2(x), contains both convex
and concave curves. To apply this methodology, the second help function, f2(x),
should be segmented into two intervals and both intervals should be renormalized.
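To make the construction of s1(x) and s2(x) concrete, here is a small Python sketch (ours, not from the thesis) that builds the first two sub-functions for a normalized original function and reports the approximation error; NumPy and the sine example from Chapter 6 are used for illustration:

```python
import numpy as np

def two_subfunction_synthesis(f_org, c1):
    """Build s1 and s2 per (3.5), (3.10), (3.11) and return their product.

    f_org must be normalized as in Section 3.1 (f_org(0) = 0, f_org(1) = 1,
    strictly concave or convex); c1 follows from the limit in (3.8).
    """
    s1 = lambda x: x + c1 * (x - x ** 2)                 # (3.5)

    def f1(x):                                           # (3.2)
        x = np.asarray(x, dtype=float)
        out = np.ones_like(x)                            # limit value at x = 0
        nz = x > 0
        out[nz] = f_org(x[nz]) / s1(x[nz])
        return out

    c2 = 4.0 * (f1(np.array([0.5]))[0] - 1.0)            # (3.11)
    s2 = lambda x: 1.0 + c2 * (x - x ** 2)               # (3.10)
    return lambda x: s1(x) * s2(x)

# Example: f_org(x) = sin(pi/2 * x); (3.8) gives c1 = pi/2 - 1
f_org = lambda x: np.sin(np.pi / 2 * x)
approx = two_subfunction_synthesis(f_org, np.pi / 2 - 1)
x = np.linspace(0.0, 1.0, 1001)
print(np.abs(approx(x) - f_org(x)).max())   # error with two sub-functions
```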
3.3 Nth sub-function when n > 2

When developing the third sub-function, s3(x), the second help function, f2(x), must be segmented into two intervals due to requirement 1 in Section 3.1. As shown in Figure 3.4, the second help function contains both convex and concave curves. By dividing the range into the two intervals 0 ≤ x ≤ 0.5 and 0.5 ≤ x ≤ 1, the curve is strictly convex or concave again. Hence, the second help function, f2(x), is as shown in (3.12):

f2(x) = { f2,0(x),  0 ≤ x ≤ 0.5
        { f2,1(x),  0.5 ≤ x ≤ 1    (3.12)

The second help function, f2(x), consists of a pair of concave and convex curves. Both of these curves must be renormalized to the interval 0 ≤ x ≤ 1 on the x-axis. Figure 3.5 partially demonstrates the segmented and renormalized second help function, f2(x).
Fig 3.5 The segmented and renormalized second help function, f2(x).
It is worth mentioning that x3 replaces x, as defined in (3.13). The purpose of this replacement is to map the input x to the normalized parabolic curve.

x3 = fract(2·x)    (3.13)

The third sub-function, s3(x), for each interval, is shown in (3.14):

s3(x) = { s3,0(x3),  0 ≤ x ≤ 0.5
        { s3,1(x3),  0.5 ≤ x ≤ 1    (3.14)
In the same manner, the nth help function, fn(x), can be developed according to (3.15). As n grows, the number of pairs of concave and convex curves increases.

fn(x) = { fn,0(x),          0 ≤ x ≤ 1/2^(n-1)
        { fn,1(x),          1/2^(n-1) ≤ x ≤ 2/2^(n-1)
        { ...
        { fn,2^(n-1)-1(x),  (2^(n-1)-1)/2^(n-1) ≤ x ≤ 1    (3.15)
Hence, the nth sub-function, sn(x), can be developed as shown in (3.16):

sn(x) = { sn,0(xn),          0 ≤ x ≤ 1/2^(n-2)
        { sn,1(xn),          1/2^(n-2) ≤ x ≤ 2/2^(n-2)
        { ...
        { sn,2^(n-2)-1(xn),  (2^(n-2)-1)/2^(n-2) ≤ x ≤ 1    (3.16)

It is worth mentioning that xn replaces x, as shown in (3.17). Similar to the replacement in the third sub-function, s3(x), this replacement maps the input x to the normalized parabolic curves.

xn = fract(2^(n-2)·x)    (3.17)

The nth sub-function, sn(x), is thus partially described as sn,m(xn), shown in (3.18):

sn,m(xn) = 1 + cn,m·(xn - xn^2)    (3.18)

The coefficient cn,m can be calculated in the same manner as in (3.11), as shown in (3.19):

cn,m = 4·(fn-1,m((2·(m+1) - 1) / 2^(n-1)) - 1)    (3.19)
4 Harmonized Parabolic Synthesis
Based on Parabolic Synthesis, an improved methodology, called Harmonized Parabolic Synthesis (HPS), requires only two sub-functions, as described in (4.1). Both sub-functions are second-order polynomials. Note that the second sub-function, s2(x), is combined with second-degree interpolation. The output accuracy thus depends on the number of intervals in the interpolation.

forg(x) = s1(x) · s2(x)    (4.1)

Similar to Parabolic Synthesis, the original function, forg(x), must be normalized to the range 0 ≤ x ≤ 1 on the x-axis and 0 ≤ y ≤ 1 on the y-axis. Additional requirements for the original function, forg(x), are listed as follows:

1. The original function, forg(x), should be strictly concave or convex, since the sub-functions are second-order polynomials, which are strictly concave or convex.
2. The first help function, f1(x), must have a limit value when x goes to 0. If the original function, forg(x), does not have such a limit value, the first help function, f1(x), is not defined.
4.1 First sub-function

The first sub-function, s1(x), is determined to approach the original function, forg(x), as defined in (4.2):

s1(x) = l1 + k1·x + c1·(x - x^2)    (4.2)

Since the first sub-function, s1(x), has the same starting point and ending point as the original function, forg(x), the coefficient l1 is calculated based on the starting point, and the coefficient k1 is calculated based on both the starting point and the ending point. The coefficient c1 is selected by scanning the interval (-1.5, 0) or (0, 1.5), depending on the concavity or convexity of the original function, forg(x).

4.2 Second Sub-function

The second sub-function, s2(x), is determined to approach the first help function, f1(x), as explained in Section 3.2. As an example, the first help function, f1(x), is divided into 4 intervals, as shown in Figure 4.1.
Fig 4.1 The first help function, divided evenly.
The number of intervals is 2^n. As n grows, the precision improves. For the second sub-function, s2(x), in the ith interval, the general formula is given in (4.3):

s2,i(x) = l2,i + k2,i·x' + c2,i·(x' - x'^2)    (4.3)

where i is from 0 to 2^n - 1.
The first help function in each interval, denoted f1,i(x), is approached by the second sub-function, s2,i(x), in the corresponding interval. To calculate all the coefficients, three specific coordinates in each interval are needed: the starting point, the middle point, and the ending point. The general formula of the second sub-function, s2(x), over the intervals is given in (4.4):

s2(x) = { s2,0(x'),       0 ≤ x ≤ 1/2^n
        { s2,1(x'),       1/2^n ≤ x ≤ 2/2^n
        { ...
        { s2,2^n-1(x'),   (2^n-1)/2^n ≤ x ≤ 1    (4.4)

It is worth mentioning that x' replaces x. The purpose of this replacement is to maintain the normalization within each interval by multiplying by 2^n, according to (4.5):

x' = fract(2^n·x)    (4.5)
In order to compute the coefficient l2,i in each interval, the starting point of each interval is used, as shown in (4.6):

l2,i = f1(xstart,i)    (4.6)

The coefficient k2,i, which is the slope in each interval, is expressed according to (4.7):

k2,i = f1(xend,i) - f1(xstart,i)    (4.7)

Computing the coefficient c2,i requires the middle point of each interval, as shown in (4.8):

c2,i = 4·(f1,i(1/2) - l2,i - (1/2)·k2,i)    (4.8)

The second sub-function in each interval, s2,i(x), can be rewritten based on (4.3), as shown in (4.9):

s2,i(x) = l2,i + (k2,i + c2,i)·x' - c2,i·x'^2    (4.9)

This modification saves an adder in hardware by adding k2,i and c2,i in advance, defining j2,i according to (4.10):

j2,i = k2,i + c2,i    (4.10)

According to (4.9) and (4.10), the second sub-function in each interval becomes (4.11):

s2,i(x) = l2,i + j2,i·x' - c2,i·x'^2    (4.11)
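The following Python sketch (ours; hps_coefficients and s2 are illustrative names) generates the per-interval coefficients from (4.6)-(4.10) and evaluates the interpolated second sub-function (4.11):

```python
import numpy as np

def hps_coefficients(f1, n):
    """Coefficient tables l2,i, j2,i, c2,i for 2**n intervals,
    following (4.6)-(4.10); f1 is the first help function on [0, 1]."""
    edges = np.arange(2 ** n + 1) / 2 ** n
    l2 = f1(edges[:-1])                          # (4.6): start points
    k2 = f1(edges[1:]) - f1(edges[:-1])          # (4.7): slopes
    mid = f1((edges[:-1] + edges[1:]) / 2)
    c2 = 4 * (mid - l2 - k2 / 2)                 # (4.8): midpoint fit
    j2 = k2 + c2                                 # (4.10): saves an adder
    return l2, j2, c2

def s2(x, l2, j2, c2, n):
    """Evaluate the interpolated second sub-function (4.11)."""
    x = np.asarray(x, dtype=float)
    i = np.minimum((x * 2 ** n).astype(int), 2 ** n - 1)  # interval index
    xp = x * 2 ** n - i                                   # x' = fract(2^n x)
    return l2[i] + j2[i] * xp - c2[i] * xp ** 2
```

In hardware, i is simply the top n bits of x and x' the remaining bits, so no division is needed; the tables correspond to the LUT contents before word-length truncation.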
4.3 Selection of a proper c1

The selection of the coefficient c1 is essential, since c1 must not only meet the precision requirement of the output, but also allow a hardware-friendly architecture. The steps of developing c1, combined with the number of intervals in the second sub-function, s2(x), are listed as follows:

First, candidate values are swept over the interval (-1.5, 0) or (0, 1.5), depending on the convexity or concavity of the original function, forg(x). Each candidate value of c1 settles the first sub-function, s1(x). Second, the second sub-function, s2(x), is developed based on (4.1). Finally, the precision of the output is computed from these two sub-functions.

The error e(x) is described according to (4.12):

e(x) = |forg(x) - s1(x)·s2(x)|    (4.12)

When presenting the accuracy of a function, a logarithmic measure provides good resolution. For this, the decibel (dB) [8] is used.
4.4 Bit Precision

4.4.1 Definition

There are two equivalent ways of defining a result in decibels. With respect to power measurements, it is described according to (4.13):

PdB = 10·log10(p / p0)    (4.13)

where p is the measured power and p0 the reference power. With respect to measurements of amplitude, it is described according to (4.14):

AdB = 20·log10(a / a0)    (4.14)

where a is the measured amplitude and a0 the reference amplitude.

4.4.2 Combination of Decibel and Binary Number

This combination provides a convenient result. In binary representation, the transformation of numbers in base 2 to decibels is shown in (4.15):

20·log10(2) ≈ 6 dB    (4.15)

Since 6 dB represents 1 bit of resolution, this transformation leads to a simple interpretation of the result. For instance, if the error equals 0.001, the transformation is described according to (4.16):

20·log10(0.001) = -60 dB    (4.16)

which represents 10 bits of resolution.
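As a quick check of this 6 dB-per-bit rule, a short Python helper (ours, illustrative):

```python
import math

def error_to_bits(err):
    """Translate an absolute error to bit resolution via decibels:
    20*log10(err) dB at ~6.02 dB per bit (Section 4.4.2)."""
    return -20 * math.log10(err) / (20 * math.log10(2))

print(error_to_bits(0.001))   # ~9.97, i.e. about 10 bits of resolution
```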
4.5 An example of c1 selection

As a result, the bit precision is a function of c1 combined with the number of intervals in the second sub-function, s2(x). Figure 4.2 [9] demonstrates an example of c1 selection.

Fig 4.2 c1 selection with 1 interval in the second sub-function, s2(x).

There are two peak values, at approximately 0.3 and 1.1 respectively, where the precision is greater than 11 bits. However, to realize a hardware-friendly architecture, these peak values are not always feasible. A simple hardware architecture is realized by selecting c1 as 0, 0.5, or 1. Note that if the graph of the original function is convex, the sweeping range of c1 is from -1.5 to 0. Another way to increase the freedom in selecting c1 is to increase the number of intervals in the second sub-function, s2(x).
5 Newton-Raphson
In general, the NR method [10] is an iterative process to approximate a root of a function. The NR iteration is described in (5.1):

xn+1 = xn - f(xn) / f'(xn)    (5.1)

In (5.1), xn is the result of the previous iteration, f(xn) is the function value at xn, and the derivative f'(xn) is the slope of f(x) at xn. The number n is the iteration number minus 1. Hence, the first iteration is expounded in (5.2):

x1 = x0 - f(x0) / f'(x0)    (5.2)

where x0 is the initial guess for a single root σ of the equation.
5.1 An example of NR iterations

For instance, assume that a root of the function f(x) in (5.3) is to be approximated, with the precision requirement that the error is less than 0.000000001:

f(x) = 3x^2 - 6x + 2    (5.3)

Figure 5.1 demonstrates the graph of the equation, f(x).

Fig 5.1 The curve of the equation, f(x).

The derivative of f(x), f'(x), is shown in (5.4):

f'(x) = 6x - 6    (5.4)

There are two roots, one in the interval (0, 1) and one in (1, 2). To calculate the root in the interval (0, 1), assume that the initial guess x0 equals 0.5 and that 15 decimal places are kept. The first iteration is shown in (5.5):

x1 = x0 - f(x0) / f'(x0) = 0.416666666666667    (5.5)

The second iteration is described in (5.6):

x2 = x1 - f(x1) / f'(x1) = 0.422619047619048    (5.6)

Furthermore, the third iteration is shown in (5.7):

x3 = x2 - f(x2) / f'(x2) = 0.422649729995091    (5.7)

The true value of the root σ in the interval (0, 1) is approximately 0.422649730810374. Therefore, the error e is as shown in (5.8):

e = |x3 - σ| = 0.000000000815283 < 0.000000001    (5.8)

After only three iterations, the precision requirement is fully satisfied. Hence, it is critical for x0 to be as close as possible to the real root. Note that if f(x) has a continuous derivative, the iterations converge to σ provided that the initial guess x0 is close enough to σ.
In the example above, three iterations are required to meet the precision requirement. However, iterating until convergence is not applicable in most hardware cases. In 1991, Paolo Montuschi and Marco Mezzalama discussed the possibility of using the absolute error instead of the relative error in order to apply a fixed number of iterations, specifically only one or two [11]. That leads to an optimization that minimizes the absolute error with a good initial guess. In 1994, Michael J. Schulte, Earl E. Swartzlander and J. Omar discussed how to optimize the initial guess for reciprocal functions [12]. They optimized the initial guess interval by interval to minimize the absolute error. The general formula is illustrated in (5.9):

σn = (p^(2^-n) + q^(2^-n)) / (p^(2^-n)·q + q^(2^-n)·p)    (5.9)

where σn is the initial guess, n is the number of iterations, and p and q are the bounds of the interval. This formula is realized by applying LUTs to select a proper initial guess σn that is close enough to the true value σ, which speeds up the NR convergence.
5.2 Reciprocal, square root, square root reciprocal

The NR method provides approximations of several unary functions, such as the reciprocal (number inversion), square root, and square root reciprocal, by using different setups of the function f(x), described as follows:

If f(x) is set as in (5.10),

f(x) = 1/x - a    (5.10)

the general iteration in (5.1) turns into (5.11):

xn+1 = xn·(2 - a·xn)    (5.11)

These iterations go to 1/a, so the reciprocal can be calculated.
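A brief Python sketch (ours; optimal_x0 and nr_reciprocal are illustrative names) combining (5.9) and (5.11) for the mantissa range [1, 2):

```python
def optimal_x0(p, q, n):
    """Initial guess for 1/a on [p, q] minimizing the maximum absolute
    error after n iterations, per (5.9)."""
    r = 2.0 ** (-n)
    return (p ** r + q ** r) / (p ** r * q + q ** r * p)

def nr_reciprocal(a, x0, n):
    """n multiplicative NR iterations for 1/a, per (5.11)."""
    x = x0
    for _ in range(n):
        x = x * (2.0 - a * x)
    return x

# One iteration over the whole interval [1, 2]: x0 = 1/sqrt(2)
x0 = optimal_x0(1.0, 2.0, 1)
print(nr_reciprocal(1.5, x0, 1))   # ~0.66421 vs 1/1.5 ~ 0.66667
```

The residual error here is large because a single guess covers the whole interval; in the implementations of Chapter 9, a LUT supplies a separate guess per subinterval, which shrinks the error quadratically with each iteration.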
If f(x) is set as in (5.12),

f(x) = x^2 - a    (5.12)

the general iteration in (5.1) turns into (5.13):

xn+1 = (1/2)·(xn + a/xn)    (5.13)

These iterations go to √a, so the square root can be calculated. If f(x) is set as in (5.14),

f(x) = 1/x^2 - a    (5.14)

the general iteration in (5.1) turns into (5.15):

xn+1 = (xn/2)·(3 - a·xn^2)    (5.15)

These iterations go to 1/√a, so the square root reciprocal can be calculated. It is worth mentioning that this setup can also be used to calculate √a, by simply multiplying the result by a.
6 Hardware Architecture Using HPS
The HPS procedure consists of three steps: pre-processing, processing, and post-processing, as shown in Figure 6.1.

Fig 6.1 The procedure of HPS.

Both the pre-processing part and the post-processing part are conversion steps. The processing part is the approximation step.

6.1 Pre-processing

In order to ensure that the output x is in the interval (0, 1), the input v must be converted. Take the sine function described in (6.1) as an example:

f(v) = sin(v)    (6.1)

The input domain of v is from 0 to π/2. In order to normalize x to the interval (0, 1), the input v is multiplied by 2/π. Hence, the original function is as shown in (6.2):

forg(x) = sin((π/2)·x)    (6.2)
6.2 Processing

6.2.1 Architecture of HPS

According to (4.1), (4.2), and (4.11), the architecture of HPS is demonstrated in Figure 6.2.

Fig 6.2 The architecture of the processing step using HPS with 16 intervals.

For the first sub-function, s1(x), and the second sub-function, s2(x), a squarer unit is applied. It is worth mentioning that if the coefficient c1 is selected as 0, the squarer unit in the first sub-function, s1(x), is no longer required.
6.2.2 Squarer

As mentioned in Section 6.2.1, a squarer unit is required. Figure 6.3 illustrates the squarer unit.

Fig 6.3 The squarer unit algorithm.

As shown in Figure 6.3, the partial product p is computed by squaring the least significant bit of x, and the partial product q by squaring the two least significant bits of x. The remaining partial products r and s are computed in the same manner. By applying this algorithm, the partial products of x^2 are implemented directly, which leads to a reduction of chip area compared to a general multiplier.

However, the algorithm in Figure 6.3 requires plenty of adders. To save adders, Erik Hertz [13] put forward an optimized algorithm. The ellipses containing x0x0, x1x1, x2x2, and x3x3 reduce to x0, x1, x2, and x3 respectively, since a bit multiplied by itself is the bit itself. Moreover, all the squared bits can be moved to the next column. The modified version of the algorithm in Figure 6.3 is shown in Figure 6.4.
Fig 6.4 The optimized squarer unit.

The factor p0 in the partial product p is optimized according to (6.4):

p0 = x0·x0 = x0    (6.4)

Since p0 cannot carry into p1, p1 equals 0. The value of q1 is shown in (6.5):

q1 = p1·2^1 + x1x0·2^1 + x0x1·2^1 = 0·2^1 + x1x0·2^2    (6.5)

The value of q2 is illustrated in (6.6):

q2 = x1x0·2^2 + x1x1·2^2 = x1·2^2 + x1x0·2^2    (6.6)

The value of r2 is shown in (6.7):

r2 = q2·2^2 + x2x0·2^2 + x0x2·2^2 = q2·2^2 + x2x0·2^3    (6.7)

The remaining factors in the corresponding vectors are computed in the same manner, as shown in Table 6.1.
Table 6.1 The remaining factors in the corresponding vectors.

factor | value
r3     | q3·2^3 + x2x0·2^3
r4     | x2·2^4 + x2x1·2^4
s3     | r3·2^3
s4     | r4·2^4 + x3x0·2^4
s5     | r5·2^5 + x3x1·2^5
s6     | x3·2^6 + x3x2·2^6
By applying this simplified squarer unit, plenty of adders are saved, which optimizes the hardware architecture.
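A bit-level Python sketch of this squarer (ours, illustrative): the squared terms xi·xi collapse to xi, and each symmetric pair of cross terms xi·xj + xj·xi is merged and moved one column to the left:

```python
def square_partial_products(x_bits):
    """Square a number from its bits using the optimized partial
    products of Fig 6.4: x_i*x_i = x_i at weight 2^(2i), and the merged
    cross terms x_i*x_j at weight 2^(i+j+1)."""
    n = len(x_bits)                     # x_bits[i] has weight 2^i
    acc = 0
    for i in range(n):
        acc += x_bits[i] << (2 * i)     # squared bit, moved column
        for j in range(i + 1, n):
            acc += (x_bits[i] & x_bits[j]) << (i + j + 1)
    return acc

x = 0b1011                                  # 11
bits = [(x >> i) & 1 for i in range(4)]
print(square_partial_products(bits))        # 121 = 11^2
```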
6.2.3 Truncation and Optimization

Since the coefficients l2,i, j2,i and c2,i are represented with finite word lengths, they are truncated. By simulating different word lengths, each set of coefficients, l2,i, j2,i, and c2,i, can be optimized to meet the required output precision.

6.3 Post-processing

The input of the post-processing step, y, ranges from 0 to 1. The output of the post-processing step, z, should be in the same range as the target function. Taking sin(x) as the target function, for instance, the output range of the target function is (0, 1), which already equals the range of y; hence, the equation in the post-processing step is simply z = y.
7 Hardware Architecture Using NR
According to Chapter 5, a LUT is essential to select a proper initial guess σn that is close enough to the true root value σ. This speeds up the NR convergence, which results in fewer iterations. A general architecture of the NR algorithm is shown in Figure 7.1.

Fig 7.1 The general architecture of NR iterations.

However, in order to meet a specific precision requirement, reducing the number of iterations results in a larger LUT. Hence, there is a trade-off between the size of the LUT and the number of iterations. It is worth mentioning that the function inside every iteration block differs when the target function is altered: if the target function changes, the function inside each iteration block must be modified. This is the main drawback of unrolled NR iterations.
8 Error Analysis
8.1 Error Behavior Metrics

In order to measure error behavior, five major factors are considered, namely the maximum positive and negative error, the median error, the mean error, the standard deviation [14] and the root mean square (RMS) error.

8.1.1 Maximum positive and negative error

The maximum positive and negative errors are the largest positive and the largest negative differences, respectively, between the approximated value and the actual value.
8.1.2 Median error

The median error is the value of the sample that lies in the middle position of a set of samples. For an odd number of samples, the median error is the value of the middle sample; that is, the number of larger samples equals the number of smaller samples. For an even number of samples, the median error is the average value of the two central samples.
8.1.3 Mean error

The mean error is the average of all the error samples, where each error sample is the difference between the approximated value and the actual value. Assume that the approximated values are y1, y2, ..., yn and the actual values are x1, x2, ..., xn. The mean error can then be described according to (8.1):

emean = (1/n)·Σ(i=1..n) (yi - xi)    (8.1)

where n is the number of samples.
8.1.4 Standard deviation

The standard deviation is a measure of how spread out the samples are. It is the square root of the variance, where the variance is defined as the average of the squared differences from the mean error. The standard deviation is thus as shown in (8.2):

σ = sqrt( (1/n)·Σ(i=1..n) ((yi - xi) - emean)^2 )    (8.2)
8.1.5 RMS

The RMS error is the square root of the mean square error, where the mean square error is the average of the squares of the differences between the approximated and actual values, as described in (8.3):

erms = sqrt( (1/n)·Σ(i=1..n) (yi - xi)^2 )    (8.3)
8.2 Error Distribution

The error distribution is an alternative factor to consider. It illustrates how often each value of the error occurs. Figure 8.1 demonstrates an example of an error distribution.

Fig 8.1 An example of error distribution.

As shown in Figure 8.1, the x-axis represents the value of the error and the y-axis the number of occurrences. Figure 8.1 is close to a Gaussian distribution, which means that errors near the center are frequent and the frequency decreases towards the sides.
9 Implementation of Number Inversion applying HPS and NR iteration
In this chapter, the implementation of the number inversion architecture with the two methods is described. For the HPS method, three architectures with different numbers of intervals are implemented in hardware. For the NR method, implementations with one and two iterations are realized. The common input to be converted is in the interval (1, 2) with a 15-bit mantissa, and the output precision is 16 bits.
9.1 HPS

Based on the HPS method described in Chapter 4, the number inversion function is implemented in hardware. The coefficients of the sub-functions, s1(x) and s2(x), are selected according to the methods in Section 4.1 and Section 4.2. Furthermore, the hardware implementation consists of three architectures with different numbers of intervals, namely 16, 32, and 64 intervals, in the second sub-function, s2(x).

9.1.1 Development of Sub-functions

As mentioned in Section 4.1, the original function, forg(x), must be normalized to the range 0 ≤ x ≤ 1 on the x-axis and 0 ≤ y ≤ 1 on the y-axis. The number inversion function is shown in (9.1):

f(x) = 1/x    (9.1)

9.1.1.1 Pre-processing

As mentioned in Chapter 2, the mantissa, v, ranges from 1 to 2. In the pre-processing part, the input to the processing part, x, is obtained by subtracting 1 from v, as shown in (9.2):

x = v - 1    (9.2)
9.1.1.2 Processing

The original function, forg(x), normalized to the range 0 ≤ x ≤ 1 on the x-axis and 0 ≤ y ≤ 1 on the y-axis, is shown in (9.3):

forg(x) = 2/(1 + x) - 1 = (1 - x)/(1 + x)    (9.3)

Figure 9.1 demonstrates the graphs of the original function, forg(x), and the target function, f(x).

Fig 9.1 The graphs of the original function, forg(x), and the target function, f(x).
9.1.1.3 Post-processing

In the post-processing part, the output y from the processing part, which is in the range 0 ≤ y ≤ 1, is mapped to the output z in the range 0.5 ≤ z ≤ 1, as illustrated in (9.4):

z = (y + 1)/2    (9.4)

Together, the pre-processing, processing, and post-processing parts realize the target function, f(x).
9.1.1.4 First Sub-function

The first sub-function, s1(x), is a second-order polynomial, defined in (9.5):

s1(x) = l1 + k1·x + c1·(x - x^2)    (9.5)

As shown in Figure 9.1, the first sub-function, s1(x), should pass through the same starting point and ending point as the original function, forg(x), since it is to approach forg(x). The starting point is (0, 1), which gives l1 = 1. The coefficient k1 is calculated to be -1, based on both the starting point and the ending point. The first sub-function, s1(x), is thus simplified according to (9.6):

s1(x) = 1 - x + c1·(x - x^2)    (9.6)
9.1.1.5 Selection of a proper c1

Following Sections 4.3 and 4.5, the combination of c1 and the number of intervals in the second sub-function, s2(x), determines the achievable precision. Figure 9.2 demonstrates the precision as a function of c1, swept from -0.8 to 0, combined with different numbers of intervals in the second sub-function, s2(x).

Fig 9.2 The precision function for c1, combined with 2, 4, 8, 16, 32, 64 intervals in the second sub-function.

In order to obtain a hardware-friendly architecture, c1 is selected as 0. Furthermore, to meet the output precision requirement of 16 bits, the number of intervals in the second sub-function, s2(x), is selected as 16, 32, or 64. Since c1 is selected as 0, the first sub-function, s1(x), is simplified according to (9.7):

s1(x) = 1 - x    (9.7)

The first help function, f1(x), is the quotient of the original function, forg(x), and the first sub-function, s1(x), as described in (9.8):

f1(x) = forg(x)/s1(x) = 1/(1 + x)    (9.8)
9.1.1.6 Second sub-function

According to Section 4.2, the sets of coefficients, l2,i, j2,i, and c2,i, of the second sub-function, s2(x), are developed for each number of intervals. They are listed in Tables A.1 to A.9 in Appendix A.1.
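Putting Chapters 4 and 9 together, a full-precision Python model of the number inversion flow might look as follows (ours, illustrative; it reuses hps_coefficients and s2 from the sketch in Section 4.2 and omits the word-length truncation of Section 6.2.3):

```python
import numpy as np

f1 = lambda x: 1.0 / (1.0 + x)          # first help function (9.8)

def invert_hps(v, n=6):
    """Model of the HPS number inversion flow for v in [1, 2):
    pre-processing (9.2), processing with s1(x) = 1 - x and a
    2**n-interval s2(x), and post-processing (9.4)."""
    l2, j2, c2 = hps_coefficients(f1, n)
    x = v - 1.0                              # (9.2)
    y = (1.0 - x) * s2(x, l2, j2, c2, n)     # s1(x) * s2(x), c1 = 0
    return (y + 1.0) / 2.0                   # (9.4)

v = np.linspace(1.0, 2.0, 2 ** 15, endpoint=False)
print(np.abs(invert_hps(v) - 1.0 / v).max())   # float-model error
```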
9.1.2 Hardware architecture
Figure 9.3 demonstrates the hardware architecture of the HPS methodology for
the number inversion function.
Fig 9.3 The hardware architecture of the number inversion function.
It consists of three steps: pre-processing, processing, and post-processing. The pre-processing step is the normalization part of the algorithm. In the processing step, two sub-functions are required to approximate the target function. In the post-processing step, the values (with normalized input) obtained from the previous step are translated into true values. These three steps are described in detail in Sections 9.1.1.1, 9.1.1.2, and 9.1.1.3.
Based on Figure 9.3, a detailed block diagram of the different parts of the implementation of the number inversion function is shown in Figure 9.4. As described in Chapter 2, the input is a floating-point number with an exponent part and a mantissa part. The mantissa and the exponent are computed separately. For the exponent, the sign bit changes depending on the result. Concerning the mantissa, if the mantissa of the result is less than 1, the exponent has to be decremented by 1 and the mantissa multiplied by 2. In the approximation part, the three steps mentioned above are used.
Fig 9.4 The block diagram of the procedure of number inversion function.
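A behavioral sketch of this exponent and mantissa handling follows; field widths, rounding, and the sign encoding are abstracted away, and recip_mantissa is a placeholder for the three-step approximation above.

```python
# Behavioral sketch of the exponent/mantissa handling in Figure 9.4;
# widths and encodings are abstracted away, so this is a model only.

def fp_invert(sign, e, v, recip_mantissa):
    """Invert (-1)^sign * v * 2^e, with the mantissa v in [1, 2).

    recip_mantissa(v) is assumed to return 1/v in (0.5, 1],
    e.g. the three-step approximation of Section 9.1.1."""
    m = recip_mantissa(v)   # mantissa of the result, in (0.5, 1]
    e_out = -e              # inversion negates the exponent
    if m < 1.0:             # result mantissa less than 1:
        m *= 2.0            # multiply the mantissa by 2 ...
        e_out -= 1          # ... and subtract 1 from the exponent
    return sign, e_out, m
```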
9.1.2.1 The optimized hardware architectures
The optimized hardware architectures for the three different numbers of intervals in the second sub-function, s2(x), are demonstrated in Figure 9.5, Figure 9.6, and Figure 9.7. They are the HPS architectures with 16, 32, and 64 intervals. In the corresponding s2 parts in the figures, the size of the LUT (l2,i, j2,i, and c2,i in the figures) increases with the number of intervals. At the same time, the word lengths needed to meet the required precision become smaller, resulting in smaller multipliers, adders, and squarer. Among these three architectures, the 64-interval structure has the largest LUT and the smallest multipliers, adders, and squarer.
Fig 9.5 The hardware architecture for 16-interval structure.
Fig 9.6 The hardware architecture for 32-interval structure.
Fig 9.7 The hardware architecture for 64-interval structure.
9.2 NR Iteration
This section introduces the one-stage and two-stage NR iteration architectures.
9.2.1 Initial Guess
As mentioned in Chapter 7, a good initial guess is crucial to achieve quick convergence. If only one stage is used, a large LUT containing different initial guesses for the different intervals is required. Furthermore, as the number of stages grows, the size of the LUT is reduced significantly. Based on equation (5.9), the two LUTs after truncation and optimization, for one-stage and two-stage iteration, are listed in Tables A.10 and A.11 in Appendix A.2.
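Equation (5.9) is not reproduced here; the sketch below assumes the standard NR recurrence for the reciprocal, xn+1 = xn·(2 - v·xn), seeded from one of these LUTs. The function name nr_reciprocal and the LUT indexing scheme are illustrative assumptions.

```python
# Sketch of the unrolled NR reciprocal, assuming the standard recurrence
# x_{n+1} = x_n * (2 - v * x_n); lut is a list of the initial guesses of
# Table A.10 (one stage, 256 entries) or Table A.11 (two stages, 16 entries).

def nr_reciprocal(v, lut, stages):
    """Approximate 1/v for v in [1, 2) with `stages` unrolled iterations."""
    n = len(lut)
    i = int((v - 1.0) * n)        # index from the leading mantissa bits
    x = lut[min(i, n - 1)]        # initial guess x0 from the LUT
    for _ in range(stages):       # unrolled as cascaded stages in hardware
        x = x * (2.0 - v * x)     # one NR iteration
    return x
```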
9.2.2 Hardware Architecture
The diagrams in Figure 9.8 and Figure 9.9 demonstrate the architectures using the NR method for a single iteration and for two iterations (unrolled), with the word lengths needed to meet the required output precision. Compared to the architecture for two iterations, the one-stage iteration architecture has a larger LUT but a smaller number of components.
Fig 9.8 The hardware architecture for the one-stage NR iteration.
Fig 9.9 The hardware architecture for the two-stage NR iteration.
CHAPTER 10
Implementation Results
The implementation results consist of error behavior, error metrics, chip area, timing, and power estimation. The designs are implemented in a 65 nm LPHVT CMOS technology, and the results cover these aspects for all hardware architectures. All results are obtained at the nominal process corner and at room temperature.
10.1 Error Behavior
Figure 10.1 represents the error behavior for the 16-interval structure in the
second sub-function, s2(x).
Fig 10.1 The error behavior before and after truncation and optimization applying
16-interval structure.
As shown in Figure 10.1, the black and grey curves represent the error before and after the truncation and optimization, respectively. The errors lie symmetrically around 0.
Figure 10.2 shows the error in Figure 10.1 expressed in bits.
Fig 10.2 The error of Fig 10.1 expressed in bits.
According to Figure 10.2, the 16-bit target precision requirement is satisfied.
Figure 10.3 describes the error distribution in Figure 10.1.
Fig 10.3 The error distribution based on Fig 10.1.
As shown in Figure 10.3, the error lies in the interval (-1.5·10^-5, 1.5·10^-5). In general, it resembles a Gaussian distribution.
The error distributions of the 32- and 64-interval structures in the second sub-function, s2(x), behave in the same manner, as shown in Figure 10.4 and Figure 10.5.
Fig 10.4 The error distribution for 32-interval structure.
Fig 10.5 The error distribution for 64-interval structure.
According to Figure 10.3, Figure 10.4, and Figure 10.5, as the number of intervals in the second sub-function, s2(x), grows, the number of errors clustered around zero decreases slightly.
Figure 10.6 demonstrates the error behavior after truncation and optimization applying one-stage NR iteration. The errors lie symmetrically around 0. Figure 10.7 illustrates the error behavior before truncation and optimization; these errors do not lie around 0, and placing the two curves in the same figure would make it unreadable.
Fig 10.6 The error behavior after truncation and optimization applying one-stage
NR iteration.
Fig 10.7 The error behavior before truncation and optimization applying one-stage
NR iteration.
Figure 10.8 shows the error in Figure 10.6 expressed in bits.
Fig 10.8 The error in Fig 10.6 expressed in bits.
According to Figure 10.8, the 16-bit target precision requirement is satisfied. The error distribution based on Figure 10.6 is described in Figure 10.9.
Fig 10.9 The error distribution based on Fig 10.6.
As shown in Figure 10.9, the error lies in the interval (-1.5·10^-5, 1.5·10^-5). In general, it resembles a Gaussian distribution.
The error distribution of the two-stage NR iteration behaves in the same manner, as shown in Figure 10.10.
Fig 10.10 The error distribution of two-stage NR iteration
According to Figure 10.9 and Figure 10.10, the size of the LUT is reduced significantly as the number of iterations grows. As a result, the error distribution becomes wider. As mentioned in Chapter 7, there is a trade-off between the size of the LUT and the number of NR iterations.
To close this section, the HPS architectures exhibit a better error distribution than the NR iteration architectures, whose error distributions are much wider. Most of the errors for the 16-interval structure applying HPS lie in the interval (-0.5·10^-5, 0.5·10^-5) according to Figure 10.1. However, with respect to Figure 10.6, most of the errors for the one-stage NR iteration architecture lie in the interval (-1·10^-5, 1·10^-5). This indicates that the 16-interval structure has a better error behavior than the one-stage NR iteration architecture.
10.2 Error metrics
Table 10.1 illustrates the error metrics for five architectures after truncation and
optimization.
Table 10.1 Error metrics for five architectures after truncation and optimization.
Error type               16HPS         32HPS         64HPS         1NR           2NR
(10^-6/bits)
Maximum positive error   13.5/16.24    14.0/16.17    13.5/16.24    15.7/16.06    11.8/16.43
Maximum negative error   13.4/16.24    13.0/16.29    13.3/16.26    15.2/16.01    11.7/16.43
Mean error               2.40/18.73    2.52/18.66    2.53/18.66    4.65/17.77    3.96/18.01
Median error             2.11/18.92    2.18/18.87    2.19/18.87    4.12/17.95    3.77/18.08
Standard deviation       2.99          3.15          3.15          5.67          4.66
RMS                      3.00          3.16          3.16          5.69          4.66
According to Table 10.1, the differences between the architectures are slight for each error metric. Since the error distributions are approximately Gaussian with a mean close to zero, the differences between the standard deviation and the RMS are small as well. In general, the HPS architectures exhibit the smaller errors for all of the error types above.
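The paired entries of Table 10.1 appear to be related by a base-2 logarithm; the conversion sketched below reproduces the listed bit values to within roughly 0.1 bit, so the exact normalization used in the thesis scripts may differ slightly. The function name error_in_bits is illustrative.

```python
import math

# Assumed conversion behind the paired entries of Table 10.1; it matches
# the listed bit values only to within about 0.1 bit, so the thesis may
# apply a slightly different normalization.
def error_in_bits(err):
    return -math.log2(err)

print(round(error_in_bits(13.5e-6), 2))  # 16.18 (listed as 16.24)
```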
10.3 Synthesis Constraints
There are two synthesis constraints: a high-speed constraint and a low-area constraint. To realize a high-speed circuit, a small clock period (a specific high clock frequency) and a zero-area constraint are set in the synthesis script. To realize an area-optimized circuit, the maximum chip area in the synthesis script is likewise set to zero, while a low clock frequency (a long clock period) is specified.
10.4 Chip Area
Table 10.2 demonstrates the minimum chip area for the five hardware architectures. To be clear, 16HPS stands for 16 intervals in the second sub-function, s2(x), using the HPS method, and 1NR stands for one-stage iteration using the NR method. The remaining architectures are denoted in the same manner.
Table 10.2 Chip area when minimum chip area constraint is set.
Architecture                      16HPS   32HPS   64HPS   1NR     2NR
Chip area (combinational, μm2)    5599    4956    4759    4106    5710
(%)                               98      87      83      72      100
Chip area (total, μm2)            5857    5213    5016    4364    5968
(%)                               98      87      84      73      100
To synthesize the design, registers are added. This accounts for the difference between the combinational chip area and the total chip area.
When no chip area constraint is set, the maximum chip areas for the five architectures are as listed in Table 10.3.
Table 10.3 Chip area when maximum speed constraint is set.
Architecture                      16HPS   32HPS   64HPS   1NR     2NR
Chip area (combinational, μm2)    9328    7996    7828    7079    11455
(%)                               81      70      68      62      100
Chip area (total, μm2)            9590    8257    8086    7337    11729
(%)                               82      70      69      63      100
Based on Table 10.2 and Table 10.3, the two-stage NR iteration architecture occupies the largest chip area, whereas the one-stage NR iteration architecture occupies the smallest. As the number of intervals in the second sub-function, s2(x), grows in the HPS architectures, the chip area tends to decline.
10.5 Timing
Table 10.4 presents the timing report for the high-speed circuits. The general idea is to locate the minimum clock period while maintaining non-negative slack. The minimum clock period thus corresponds to the highest possible clock frequency.
Table 10.4 Timing report when maximum speed constraint is set.
Architecture                                      16HPS      32HPS      64HPS      1NR        2NR
Critical path/frequency (combinational, ns/MHz)   4.53/221   3.92/255   3.46/289   3.55/282   6.25/160
Frequency (%)                                     138        159        180        176        100
Critical path/frequency (total, ns/MHz)           4.85/206   4.27/234   3.77/265   3.88/258   6.56/152
Frequency (%)                                     136        154        174        170        100
Table 10.5 presents the timing report for the low-area circuits, obtained by setting the maximum chip area in the synthesis script to 0.
Table 10.5 Timing report when minimum chip area constraint is set.
Architecture                                      16HPS      32HPS      64HPS      1NR        2NR
Critical path/frequency (combinational, ns/MHz)   11.15/87   10.00/100  8.91/112   9.60/104   15.77/63
Frequency (%)                                     138        159        178        165        100
Critical path/frequency (total, ns/MHz)           11.61/86   10.39/96   9.30/108   10.02/100  16.21/62
Frequency (%)                                     139        155        174        161        100
According to Table 10.4 and Table 10.5, the two-stage NR iteration architecture has the longest critical path, whereas the 64-interval HPS architecture has the shortest. As the number of intervals in the second sub-function, s2(x), grows in the HPS architectures, the critical path also tends to decline.
10.6 Power Estimation
The power in CMOS transistors consists of dynamic power and static power, as
shown in (10.1).
Ptotal = Pdynamic + Pstatic                                   (10.1)
The static power is the power consumed when there is no circuit activity. It is due to sub-threshold currents, reverse-bias junction leakage, gate leakage, and a few other leakage currents [15]. The static power consumption is the product of the device leakage current and the supply voltage. The dynamic power, on the contrary, is the power consumed when the inputs are active. It is due to dynamic switching events and consists of switching power and internal power, as shown in (10.2).

Pdynamic = Pswitching + Pinternal                             (10.2)
According to the tool vendor, “The switching power is determined by the capacitive load and the frequency of the logic transitions on a cell output” [16]. Moreover, “The internal power is caused by the charging of internal loads as well as by the short-circuit current between the N and P transistors of a gate when both are on” [16]. The dynamic power can thus be expressed according to (10.3).

Pdynamic = a·C·V²·f                                           (10.3)

where a is the activity factor, C is the switched capacitance, V is the supply voltage, and f is the clock frequency.
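For orientation, (10.3) can be evaluated with illustrative numbers; the values below are hypothetical and are not taken from the synthesis or layout reports.

```python
# Illustrative evaluation of (10.3); all numbers are hypothetical
# and are not taken from the synthesis or layout reports.
a = 0.1           # activity factor
C = 1e-12         # switched capacitance: 1 pF
V = 1.2           # supply voltage: 1.2 V
f = 100e6         # clock frequency: 100 MHz

P_dynamic = a * C * V ** 2 * f
print(P_dynamic)  # ~1.44e-05 W, i.e. about 14.4 uW
```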
Figure 10.11 demonstrates the power consumption for 16, 32, and 64 intervals in the second sub-function, s2(x), applying HPS, as well as for the one-stage and two-stage NR iteration architectures, at multiple clock frequencies. All the power consumption values are listed in Table A.12 to Table A.16 in Appendix A.3.
Fig 10.11 The power consumption for five architectures at multiple clock
frequencies.
Figure 10.12 shows a zoomed-in view of Figure 10.11.
Fig 10.12 The zoomed-in view of Fig 10.11.
In Figure 10.11 and Figure 10.12, the two-stage NR iteration architecture has the highest power consumption, while the 64-interval structure applying the HPS method has the lowest. All five curves are close in Figure 10.11; the last point of each curve is at the highest possible frequency. However, zooming in on Figure 10.11 reveals slight differences. When applying HPS, the power consumption decreases as the number of intervals in the second sub-function, s2(x), grows. When applying NR iterations, the power consumption increases with the number of iterations, despite the large reduction of the LUT size. In general, the one-stage NR iteration consumes less power and the two-stage NR iteration consumes more.
Figure 10.13 illustrates the switching power, internal power, static power and
total power for 64 intervals in the second sub-function, s2(x), applying HPS after
post synthesis simulation at multiple clock frequencies.
Fig 10.13 The switching power, internal power, static power and total power for
64 interval structure applying HPS after post synthesis simulation at multiple clock
frequencies.
Figure 10.14 illustrates the switching power, internal power, static power and
total power for one-stage NR iteration architecture after post synthesis simulation
at multiple clock frequencies.
Fig 10.14 The switching power, internal power, static power and total power
applying one-stage NR iteration after post synthesis simulation at multiple clock
frequencies.
The green curve is the switching power, the red curve the internal power, the blue curve the static power, and the grey curve the total power. The dynamic power is the difference between the total power and the static power; it is absent from the figures because it lies too close to the total power. Comparing Figure 10.13 to Figure 10.14, all power curves are close. However, the switching power and the internal power of the 64-interval HPS architecture are slightly lower than those of the one-stage NR iteration.
Figure 10.15 and Figure 10.16 demonstrate the switching power, internal power, static power, and total power for the 64-interval HPS architecture and the one-stage NR iteration architecture, respectively, after post-layout simulation at multiple clock frequencies.
Fig 10.15 The switching power, internal power, static power and total power for
64 interval structure applying HPS after post layout simulation at multiple clock
frequencies.
Fig 10.16 The switching power, internal power, static power and total power
applying one-stage NR iteration after post layout simulation at multiple clock
frequencies.
For both architectures, the power consumption after post-layout simulation is larger than that after post-synthesis simulation at corresponding frequencies.
10.7 Physical Layout
Figure 10.17 and Figure 10.18 demonstrate the physical layouts of two different architectures.
Fig 10.17 The physical layout of 64 interval structure applying HPS method.
After place and route, the core size is 86.3 μm (width) × 83.2 μm (height), which is 7180 μm2, and the die size is 126.3 μm (width) × 123.2 μm (height), which is 15560 μm2. The die size is, however, not relevant since there are no pads. According to Table 10.2, the minimum combinational chip area of the 64HPS architecture is 4759 μm2. The core utilization is 99.86%.
Fig 10.18 The physical layout of one-stage NR iteration architecture.
The core size is 83.3 μm (width) × 75.4 μm (height), which is 6281 μm2, and the die size is 123.3 μm (width) × 115.4 μm (height), which is 14229 μm2. The theoretical minimum chip area of this architecture in Table 10.2 is smaller than that of the 64HPS architecture, which also holds for the actual core size. The core utilization is 99.90%.
CHAPTER 11
Conclusions
11.1 Comparison
In this thesis, three different architectures applying HPS and two different architectures applying NR iteration have been implemented. For the HPS method, increasing the number of intervals in the second sub-function, s2(x), results in less chip area, a shorter critical path, and lower power consumption. In order to realize simple hardware with maintained accuracy, fixed coefficients are essential. For NR iteration, a larger LUT gives fewer iterations and thereby smaller chip area, a shorter critical path, and significantly lower power consumption, since a significantly larger LUT achieves significantly quicker convergence.
The major merits of the HPS method compared to NR iterations are listed as follows:
Multiple target functions: Many unary functions, such as sine, logarithm, exponential, reciprocal, and even square root reciprocal, can be realized with the HPS method. In contrast, the range of functions that can be implemented with NR iterations is far smaller.
High flexibility: With the HPS method, different target functions share a similar hardware architecture; only the sets of coefficients differ. This does not apply to NR iterations, where changing the target function requires modifying the hardware architecture.
11.2 Future Work
For the HPS method, the output precision inevitably improves as the number of intervals in the second sub-function, s2(x), grows. Moreover, the coefficients in the second sub-function can be optimized further. Finally, the architectures can be modified to achieve better performance.
For the NR iteration method, it is possible to scale down the size of the LUT of the one-stage NR iteration by half.
Appendix A
A.1 The coefficients in the second sub-function using the HPS method.
Table A.1 The coefficients l2,i of 16-interval structure.
Coefficient  Value                Coefficient  Value
l2,0         1.000000000000000    l2,8         0.666679382324219
l2,1         0.941184997558594    l2,9         0.640014648437500
l2,2         0.888893127441406    l2,10        0.615409851074219
l2,3         0.842117309570313    l2,11        0.592613220214844
l2,4         0.800010681152344    l2,12        0.571456909179688
l2,5         0.761917114257813    l2,13        0.551773071289063
l2,6         0.727287292480469    l2,14        0.533416748046875
l2,7         0.695663452148438    l2,15        0.516357421875000
Table A.2 The coefficients j2,i of 16-interval structure.
Coefficient  Value                Coefficient  Value
j2,0         0.062347412109375    j2,8         0.027740478515625
j2,1         0.055267333984375    j2,9         0.025573730468750
j2,2         0.049285888671875    j2,10        0.023651123046875
j2,3         0.044250488281250    j2,11        0.021911621093750
j2,4         0.039947509765625    j2,12        0.020385742187500
j2,5         0.036224365234375    j2,13        0.019012451171875
j2,6         0.033020019531250    j2,14        0.017761230468750
j2,7         0.030212402343750    j2,15        0.016632080078125
Table A.3 The coefficients c2,i of 16-interval structure.
Coefficient  Value                Coefficient  Value
c2,0         0.003555297851563    c2,8         0.001083374023438
c2,1         0.002975463867188    c2,9         0.000961303710938
c2,2         0.002517700195313    c2,10        0.000854492187500
c2,3         0.002151489257813    c2,11        0.000762939453125
c2,4         0.001846313476563    c2,12        0.000686645507813
c2,5         0.001602172851563    c2,13        0.000610351562500
c2,6         0.001388549804688    c2,14        0.000549316406250
c2,7         0.001235961914063    c2,15        0.000503540039063
Table A.4 The coefficients l2,i of 32-interval structure.
Coefficient  Value                Coefficient  Value
l2,0         1.000000000000000    l2,16        0.666679382324219
l2,1         0.969696044921875    l2,17        0.653076171875000
l2,2         0.941192626953125    l2,18        0.639999389648438
l2,3         0.914283752441406    l2,19        0.627471923828125
l2,4         0.888885498046875    l2,20        0.615394592285156
l2,5         0.864868164062500    l2,21        0.603790283203125
l2,6         0.842109680175781    l2,22        0.592605590820313
l2,7         0.820518493652344    l2,23        0.581840515136719
l2,8         0.800018310546875    l2,24        0.571479797363281
l2,9         0.780479431152344    l2,25        0.561447143554688
l2,10        0.761901855468750    l2,26        0.551757812500000
l2,11        0.744194030761719    l2,27        0.542419433593750
l2,12        0.727279663085938    l2,28        0.533393859863281
l2,13        0.711112976074219    l2,29        0.524642944335938
l2,14        0.695648193359375    l2,30        0.516288757324219
l2,15        0.516357421875000    l2,31        0.508323669433594
Table A.5 The coefficients j2,i of 32-interval structure.
Coefficient  Value                Coefficient  Value
j2,0         0.031188964843750    j2,16        0.013854980468750
j2,1         0.029357910156250    j2,17        0.013305664062500
j2,2         0.027648925781250    j2,18        0.012756347656250
j2,3         0.026062011718750    j2,19        0.012268066406250
j2,4         0.024658203125000    j2,20        0.011779785156250
j2,5         0.023315429687500    j2,21        0.011352539062500
j2,6         0.022094726562500    j2,22        0.010925292968750
j2,7         0.020996093750000    j2,23        0.010559082031250
j2,8         0.019958496093750    j2,24        0.010192871093750
j2,9         0.018981933593750    j2,25        0.009826660156250
j2,10        0.018066406250000    j2,26        0.009460449218750
j2,11        0.017272949218750    j2,27        0.009155273437500
j2,12        0.016479492187500    j2,28        0.008850097656250
j2,13        0.015747070312500    j2,29        0.008483886718750
j2,14        0.015075683593750    j2,30        0.008300781250000
j2,15        0.016632080078125    j2,31        0.008056640625000
Table A.6 The coefficients c2,i of 32-interval structure.
Coefficient  Value                Coefficient  Value
c2,0         0.000915527343750    c2,16        0.000244140625000
c2,1         0.000854492187500    c2,17        0.000244140625000
c2,2         0.000732421875000    c2,18        0.000244140625000
c2,3         0.000671386718750    c2,19        0.000183105468750
c2,4         0.000671386718750    c2,20        0.000183105468750
c2,5         0.000549316406250    c2,21        0.000183105468750
c2,6         0.000488281250000    c2,22        0.000183105468750
c2,7         0.000488281250000    c2,23        0.000183105468750
c2,8         0.000427246093750    c2,24        0.000122070312500
c2,9         0.000427246093750    c2,25        0.000122070312500
c2,10        0.000366210937500    c2,26        0.000122070312500
c2,11        0.000366210937500    c2,27        0.000122070312500
c2,12        0.000305175781250    c2,28        0.000122070312500
c2,13        0.000305175781250    c2,29        0.000122070312500
c2,14        0.000305175781250    c2,30        0.000122070312500
c2,15        0.000305175781250    c2,31        0.000122070312500
Table A.7 The coefficients l2,i of 64-interval structure.
Coefficient  Value                Coefficient  Value
l2,0         1.000000000000000    l2,32        0.666664123535156
l2,1         0.984611511230469    l2,33        0.659797668457031
l2,2         0.969703674316406    l2,34        0.653076171875000
l2,3         0.955223083496094    l2,35        0.646453857421875
l2,4         0.941169738769531    l2,36        0.639999389648438
l2,5         0.927543640136719    l2,37        0.633659362792969
l2,6         0.914283752441406    l2,38        0.627464294433594
l2,7         0.901405334472656    l2,39        0.621368408203125
l2,8         0.888908386230469    l2,40        0.615394592285156
l2,9         0.876708984375000    l2,41        0.609558105468750
l2,10        0.864868164062500    l2,42        0.603797912597656
l2,11        0.853340148925781    l2,43        0.598159790039063
l2,12        0.842102050781250    l2,44        0.592597961425781
l2,13        0.831153869628906    l2,45        0.587188720703125
l2,14        0.820510864257813    l2,46        0.581832885742188
l2,15        0.810142517089844    l2,47        0.576614379882813
l2,16        0.800003051757813    l2,48        0.571456909179688
l2,17        0.790130615234375    l2,49        0.566413879394531
l2,18        0.780487060546875    l2,50        0.561431884765625
l2,19        0.771080017089844    l2,51        0.556564331054688
l2,20        0.761909484863281    l2,52        0.551750183105469
l2,21        0.752967834472656    l2,53        0.547042846679688
l2,22        0.744194030761719    l2,54        0.542427062988281
l2,23        0.735641479492188    l2,55        0.537887573242188
l2,24        0.727287292480469    l2,56        0.533386230468750
l2,25        0.719116210937500    l2,57        0.528991699218750
l2,26        0.711120605468750    l2,58        0.524673461914063
l2,27        0.703300476074219    l2,59        0.520439147949219
l2,28        0.695648193359375    l2,60        0.516273498535156
l2,29        0.688186645507813    l2,61        0.512199401855469
l2,30        0.680847167968750    l2,62        0.508331298828125
l2,31        0.673698425292969    l2,63        0.504714965820313
Table A.8 The coefficients j2,i of 64-interval structure.
Coefficient  Value                Coefficient  Value
j2,0         0.015563964843750    j2,32        0.006896972656250
j2,1         0.015075683593750    j2,33        0.006774902343750
j2,2         0.014648437500000    j2,34        0.006652832031250
j2,3         0.014221191406250    j2,35        0.006469726562500
j2,4         0.013793945312500    j2,36        0.006347656250000
j2,5         0.013427734375000    j2,37        0.006225585937500
j2,6         0.013000488281250    j2,38        0.006103515625000
j2,7         0.012634277343750    j2,39        0.005981445312500
j2,8         0.012329101562500    j2,40        0.005859375000000
j2,9         0.011962890625000    j2,41        0.005798339843750
j2,10        0.011657714843750    j2,42        0.005676269531250
j2,11        0.011352539062500    j2,43        0.005554199218750
j2,12        0.011047363281250    j2,44        0.005432128906250
j2,13        0.010742187500000    j2,45        0.005371093750000
j2,14        0.010498046875000    j2,46        0.005249023437500
j2,15        0.010253906250000    j2,47        0.005187988281250
j2,16        0.010009765625000    j2,48        0.005065917968750
j2,17        0.009704589843750    j2,49        0.005004882812500
j2,18        0.009460449218750    j2,50        0.004882812500000
j2,19        0.009216308593750    j2,51        0.004821777343750
j2,20        0.009033203125000    j2,52        0.004699707031250
j2,21        0.008850097656250    j2,53        0.004638671875000
j2,22        0.008605957031250    j2,54        0.004577636718750
j2,23        0.008422851562500    j2,55        0.004516601562500
j2,24        0.008239746093750    j2,56        0.004394531250000
j2,25        0.008056640625000    j2,57        0.004333496093750
j2,26        0.007873535156250    j2,58        0.004272460937500
j2,27        0.007690429687500    j2,59        0.004211425781250
j2,28        0.007507324218750    j2,60        0.004150390625000
j2,29        0.007385253906250    j2,61        0.004089355468750
j2,30        0.007202148437500    j2,62        0.004028320312500
j2,31        0.007080078125000    j2,63        0.003967285156250
Table A.9 The coefficients c2,i of 64-interval structure.
Coefficient  Value                Coefficient  Value
c2,0         0.000183105468750    c2,32        0.000061035156250
c2,1         0.000183105468750    c2,33        0.000061035156250
c2,2         0.000183105468750    c2,34        0.000061035156250
c2,3         0.000183105468750    c2,35        0.000061035156250
c2,4         0.000183105468750    c2,36        0.000061035156250
c2,5         0.000183105468750    c2,37        0.000061035156250
c2,6         0.000122070312500    c2,38        0.000000000000000
c2,7         0.000122070312500    c2,39        0.000000000000000
c2,8         0.000122070312500    c2,40        0.000000000000000
c2,9         0.000122070312500    c2,41        0.000000000000000
c2,10        0.000122070312500    c2,42        0.000000000000000
c2,11        0.000122070312500    c2,43        0.000000000000000
c2,12        0.000122070312500    c2,44        0.000000000000000
c2,13        0.000122070312500    c2,45        0.000000000000000
c2,14        0.000122070312500    c2,46        0.000000000000000
c2,15        0.000122070312500    c2,47        0.000000000000000
c2,16        0.000122070312500    c2,48        0.000000000000000
c2,17        0.000061035156250    c2,49        0.000000000000000
c2,18        0.000061035156250    c2,50        0.000000000000000
c2,19        0.000061035156250    c2,51        0.000000000000000
c2,20        0.000061035156250    c2,52        0.000000000000000
c2,21        0.000061035156250    c2,53        0.000000000000000
c2,22        0.000061035156250    c2,54        0.000000000000000
c2,23        0.000061035156250    c2,55        0.000000000000000
c2,24        0.000061035156250    c2,56        0.000000000000000
c2,25        0.000061035156250    c2,57        0.000000000000000
c2,26        0.000061035156250    c2,58        0.000000000000000
c2,27        0.000061035156250    c2,59        0.000000000000000
c2,28        0.000061035156250    c2,60        0.000000000000000
c2,29        0.000061035156250    c2,61        0.000000000000000
c2,30        0.000061035156250    c2,62        0.000000000000000
c2,31        0.000061035156250    c2,63        0.000000000000000
A.2 The coefficients in LUTs of NR Iteration
Table A.10 The initial guess values of one-stage NR iteration.
Initial guess  Value          Initial guess  Value          Initial guess  Value          Initial guess  Value
1    0.998046875    65   0.798828125    129  0.666015625    193  0.572265625
2    0.994140625    66   0.794921875    130  0.6640625      194  0.572265625
3    0.990234375    67   0.79296875     131  0.6640625      195  0.568359375
4    0.986328125    68   0.79296875     132  0.66015625     196  0.568359375
5    0.982421875    69   0.787109375    133  0.66015625     197  0.56640625
6    0.978515625    70   0.787109375    134  0.65625        198  0.564453125
7    0.974609375    71   0.78515625     135  0.65625        199  0.564453125
8    0.970703125    72   0.78125        136  0.65234375     200  0.560546875
9    0.966796875    73   0.779296875    137  0.65234375     201  0.560546875
10   0.962890625    74   0.775390625    138  0.650390625    202  0.55859375
11   0.958984375    75   0.775390625    139  0.6484375      203  0.556640625
12   0.958984375    76   0.771484375    140  0.6484375      204  0.556640625
13   0.953125       77   0.771484375    141  0.64453125     205  0.5546875
14   0.94921875     78   0.767578125    142  0.64453125     206  0.5546875
15   0.947265625    79   0.765625       143  0.642578125    207  0.5546875
16   0.943359375    80   0.76171875     144  0.640625       208  0.55078125
17   0.939453125    81   0.76171875     145  0.640625       209  0.55078125
18   0.9375         82   0.7578125      146  0.63671875     210  0.55078125
19   0.93359375     83   0.755859375    147  0.63671875     211  0.548828125
20   0.927734375    84   0.755859375    148  0.6328125      212  0.546875
21   0.927734375    85   0.751953125    149  0.6328125      213  0.546875
22   0.923828125    86   0.748046875    150  0.6328125      214  0.546875
23   0.91796875     87   0.74609375     151  0.62890625     215  0.54296875
24   0.916015625    88   0.74609375     152  0.62890625     216  0.54296875
25   0.912109375    89   0.7421875      153  0.626953125    217  0.541015625
26   0.908203125    90   0.7421875      154  0.625          218  0.541015625
27   0.904296875    91   0.73828125     155  0.625          219  0.5390625
28   0.904296875    92   0.736328125    156  0.625          220  0.5390625
29   0.900390625    93   0.734375       157  0.62109375     221  0.537109375
30   0.896484375    94   0.732421875    158  0.62109375     222  0.537109375
31   0.89453125     95   0.732421875    159  0.619140625    223  0.533203125
32   0.888671875    96   0.728515625    160  0.6171875      224  0.533203125
33   0.888671875    97   0.724609375    161  0.6171875      225  0.53125
34   0.884765625    98   0.72265625     162  0.61328125     226  0.53125
35   0.880859375    99   0.72265625     163  0.61328125     227  0.53125
36   0.876953125    100  0.71875        164  0.61328125     228  0.529296875
37   0.876953125    101  0.71875        165  0.609375       229  0.529296875
38   0.87109375     102  0.71484375     166  0.609375       230  0.529296875
39   0.869140625    103  0.71484375     167  0.607421875    231  0.52734375
40   0.865234375    104  0.7109375      168  0.60546875     232  0.525390625
41   0.865234375    105  0.7109375      169  0.60546875     233  0.5234375
42   0.859375       106  0.70703125     170  0.6015625      234  0.521484375
43   0.857421875    107  0.70703125     171  0.6015625      235  0.521484375
44   0.853515625    108  0.703125       172  0.6015625      236  0.51953125
45   0.853515625    109  0.703125       173  0.59765625     237  0.51953125
46   0.84765625     110  0.69921875     174  0.59765625     238  0.51953125
47   0.84765625     111  0.69921875     175  0.595703125    239  0.517578125
48   0.841796875    112  0.6953125      176  0.59375        240  0.515625
49   0.841796875    113  0.6953125      177  0.59375        241  0.515625
50   0.837890625    114  0.69140625     178  0.591796875    242  0.515625
51   0.833984375    115  0.69140625     179  0.58984375     243  0.513671875
52   0.833984375    116  0.689453125    180  0.58984375     244  0.51171875
53   0.830078125    117  0.689453125    181  0.587890625    245  0.509765625
54   0.826171875    118  0.685546875    182  0.5859375      246  0.509765625
55   0.826171875    119  0.681640625    183  0.5859375      247  0.509765625
56   0.8203125      120  0.681640625    184  0.583984375    248  0.5078125
57   0.8203125      121  0.6796875      185  0.58203125     249  0.5078125
58   0.814453125    122  0.6796875      186  0.58203125     250  0.5078125
59   0.814453125    123  0.67578125     187  0.578125       251  0.50390625
60   0.8125         124  0.67578125     188  0.578125       252  0.50390625
61   0.80859375     125  0.671875       189  0.578125       253  0.501953125
62   0.806640625    126  0.671875       190  0.57421875     254  0.501953125
63   0.802734375    127  0.66796875     191  0.57421875     255  0.501953125
64   0.802734375    128  0.66796875     192  0.57421875     256  0.501953125
Table A.11 The initial guess values of two-stage NR iteration.
Initial guess  Value          Initial guess  Value
1              0.96875        9              0.626953125
2              0.91796875     10             0.626953125
3              0.87890625     11             0.580078125
4              0.83984375     12             0.580078125
5              0.779296875    13             0.580078125
6              0.763671875    14             0.5234375
7              0.69921875     15             0.5234375
8              0.69921875     16             0.5234375
A.3 The power consumptions at various frequencies.
Table A.12 The total power consumptions at various frequencies.
Frequency (Hz)   16HPS       32HPS       64HPS       1NR         2NR
(power in W)
1                1.840e-08   1.684e-08   1.681e-08   1.463e-08   1.906e-08
10               1.853e-08   1.697e-08   1.691e-08   1.475e-08   1.932e-08
100              1.980e-08   1.821e-08   1.789e-08   1.595e-08   2.193e-08
1000             3.247e-08   3.068e-08   2.770e-08   2.792e-08   4.803e-08
10000            1.592e-07   1.553e-07   1.258e-07   1.484e-07   3.099e-07
100000           1.427e-06   1.402e-06   1.106e-06   1.366e-06   2.918e-06
1000000          1.410e-05   1.387e-05   1.091e-05   1.346e-05   2.902e-05
10000000         1.409e-04   1.385e-04   1.090e-04   1.346e-04   2.901e-04
100000000        1.661e-03   1.436e-03   1.090e-03   1.363e-03   4.977e-03
Table A.13 The switching power, internal power, static power and total power for
64-interval architecture after post synthesis simulation
Frequency (Hz)   Switching power   Internal power   Static power   Total power
(power in W)
1                5.735e-12         5.169e-12        1.680e-08      1.681e-08
10               5.735e-11         5.169e-11        1.680e-08      1.691e-08
100              5.735e-10         5.169e-10        1.680e-08      1.789e-08
1000             5.733e-09         5.164e-09        1.680e-08      2.770e-08
10000            5.733e-08         5.164e-08        1.680e-08      1.258e-07
100000           5.733e-07         5.164e-07        1.680e-08      1.106e-06
1000000          5.733e-06         5.164e-06        1.680e-08      1.091e-05
10000000         5.733e-05         5.164e-05        1.680e-08      1.090e-04
100000000        5.733e-04         5.166e-04        1.680e-08      1.090e-03
Table A.14 The switching power, internal power, static power and total power for
1NR architecture after post synthesis simulation
Frequency (Hz)   Switching power   Internal power   Static power   Total power
(power in W)
1                7.188e-12         6.140e-12        1.462e-08      1.463e-08
10               7.188e-11         6.140e-11        1.462e-08      1.475e-08
100              7.188e-10         6.140e-10        1.462e-08      1.595e-08
1000             7.166e-09         6.121e-09        1.463e-08      2.792e-08
10000            7.205e-08         6.168e-08        1.463e-08      1.484e-07
100000           7.275e-07         6.237e-07        1.462e-08      1.366e-06
1000000          7.233e-06         6.209e-06        1.464e-08      1.346e-05
10000000         7.241e-05         6.216e-05        1.465e-08      1.346e-04
100000000        7.355e-04         6.273e-04        1.462e-08      1.363e-03
Table A.15 The switching power, internal power, static power and total power for
64-interval architecture after post layout simulation
Frequency (Hz)   Switching power   Internal power   Static power   Total power
(power in W)
1                1.182e-11         1.106e-11        1.714e-08      1.717e-08
10               1.182e-10         1.106e-10        1.714e-08      1.737e-08
100              1.171e-09         1.096e-09        1.705e-08      1.932e-08
1000             1.178e-08         1.103e-08        1.712e-08      3.992e-08
10000            1.185e-07         1.108e-07        1.711e-08      2.464e-07
100000           1.174e-06         1.098e-06        1.710e-08      2.290e-06
1000000          1.175e-05         1.100e-05        1.699e-08      2.276e-05
10000000         1.180e-04         1.104e-04        1.700e-08      2.284e-04
100000000        1.225e-03         1.150e-03        1.765e-08      2.376e-03
Table A.16 The switching power, internal power, static power, total power for
1NR architecture after post layout simulation
Frequency (Hz)   Switching power   Internal power   Static power   Total power
(power in W)
1                1.293e-11         1.104e-11        1.476e-08      1.479e-08
10               1.293e-10         1.104e-10        1.476e-08      1.500e-08
100              1.293e-09         1.104e-09        1.476e-08      1.716e-08
1000             1.296e-08         1.104e-08        1.475e-08      3.875e-08
10000            1.322e-07         1.120e-07        1.473e-08      2.589e-07
100000           1.335e-06         1.131e-06        1.470e-08      2.480e-06
1000000          1.318e-05         1.114e-05        1.476e-08      2.434e-05
10000000         1.350e-04         1.145e-04        1.472e-08      2.496e-04
100000000        1.359e-03         1.180e-03        1.506e-08      2.539e-03
References
[1] E. Hertz, “Parabolic Synthesis,” Series of Licentiate and Doctoral Theses at the Department of Electrical and Information Technology, Faculty of Engineering, Lund University, Sweden, January 2011, ISBN 9789174730692, http://www.eit.lth.se/fileadmin/eit/news/747/lic_14.5_mac.pdf
[2] P.K. Meher, J. Valls, T.B. Juang, K. Sridharan, and K. Maharatna, “50 years of CORDIC: Algorithms, architectures, and applications,” IEEE Transactions on Circuits & Systems. Part I: Regular Papers, Vol. 56, Issue 9, pp. 1893-1907, ISSN 15498328, September 2009.
[3] E. Hertz and P. Nilsson, “A Methodology for Parabolic Synthesis,” a book chapter in VLSI, In-Tech, ISBN 9783902613509, pp. 199-220, Vienna, Austria, February 2010.
[4] D. Zuras, M. Cowlishaw, and others, “IEEE standard for floating-point arithmetic,” IEEE Std 754-2008, pp. 1-70, doi: 10.1109/IEEESTD.2008.4610935, IEEE, 2008.
[5] E. Hertz, “Floating-point representation,” Private communication, January 2015.
[6] P. Pouyan, E. Hertz, and P. Nilsson, “A VLSI implementation of logarithmic and exponential functions using a novel parabolic synthesis methodology compared to the CORDIC algorithm,” in the proceedings of the 20th European Conference on Circuit Theory & Design (ECCTD), ISBN 9781457706172, pp. 709-712, Konsert & Kongress, Linkoping, Sweden, August 2011.
[7] E. Hertz and P. Nilsson, “Parabolic synthesis methodology implemented on the sine function,” in the proceedings of the IEEE International Symposium on Circuits and Systems, ISBN 9781424438273, pp. 253-256, Taipei, Taiwan, May 2009.
[8] K.S. Johnson, “Transmission Circuits for Telephonic Communication: Methods of Analysis and Design,” D. Van Nostrand Company, New York, USA, 1947.
[9] J. Lai, “Hardware Implementation of the Logarithm Function using Improved Parabolic Synthesis,” Master of Science Thesis at the Department of Electrical and Information Technology, Faculty of Engineering, Lund University, Sweden, September 2013, http://www.eit.lth.se/sprapport.php?uid=751.
[10] P. Kornerup and J.M. Muller, “Choosing starting values for Newton-Raphson computation of reciprocals, square roots and square root reciprocals,” in the proceedings of the 5th Conference on Real Numbers and Computers, Lyon, France, September 2003.
[11] P. Montuschi and M. Mezzalama, “Optimal absolute error starting values for Newton-Raphson calculation of square root,” Computing, Vol. 46, Issue 1, pp. 67-86, ISSN 0010485X, Springer-Verlag, 1991.
[12] M.J. Schulte, J. Omar, and E.E. Swartzlander, “Optimal initial approximation for the Newton-Raphson division algorithm,” Computing, Vol. 53, Issue 3/4, pp. 233-242, ISSN 0010485X, Springer-Verlag, 1994.
[13] E. Hertz and P. Nilsson, “A methodology for parabolic synthesis of unary functions for hardware implementation,” in the proceedings of the 2008 International Conference on Signals, Circuits & Systems (SCS08), ISBN 9781424426270, pp. 30-35, Hammamet, Tunisia, November 2008.
[14] A.A. Giunta and L.T. Watson, “A Comparison of Approximation Modeling Techniques: Polynomial Versus Interpolating Models,” in the proceedings of the 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, paper AIAA-98-4758, pp. 392-404, American Institute of Aeronautics and Astronautics, Blacksburg, USA, September 1998.
[15] A. Agarwal, S. Mukhopadhyay, A. Raychowdhury, K. Roy, and C.H. Kim, “Leakage power analysis and reduction for nanoscale circuits,” IEEE Micro, Vol. 26, Issue 2, pp. 68-80, ISSN 02721732, October 2006.
[16] G. Yip, “Expanding the Synopsys PrimeTime Solution with Power Analysis,” Synopsys, Inc., http://www.synopsys.com/, June 2006.
Series of Master’s theses
Department of Electrical and Information Technology
LU/LTH-EIT 2016-487
http://www.eit.lth.se