Thesis for the degree of Licentiate in Engineering
Parabolic Synthesis
by
Erik Hertz
Department of Electrical and Information Technology
Faculty of Engineering, LTH, Lund University
SE-221 00 Lund, Sweden
The Department of Electrical and Information Technology
Lund University
P.O. Box 118, S-221 00 LUND
ISBN 978-91-7473-069-2
ISSN 1654-790X
No. 28
© Erik Hertz 2011
Printed in Sweden by Tryckeriet E-huset, Lund.
January 2011
Abstract
Many consumer products, in areas such as computer graphics, digital signal processing, communication systems, robotics, navigation, astrophysics, and fluid physics, demand high computational performance as a consequence of the increasingly advanced algorithms used in these applications. Until recently, the downscaling of hardware technology was able to satisfy these higher demands with higher clock rates on the chips. Now that the performance development of hardware technology has stagnated, interest has shifted toward the implementation of algorithms in hardware. Especially within wireless communication, the desire for higher transmission rates has increased the interest in algorithm implementation methodologies.
The scope of this thesis is mainly the developed methodology of parabolic synthesis. Parabolic synthesis is a methodology for implementing approximations of unary functions in hardware. The methodology is described together with the criteria that have to be fulfilled to perform an approximation of a unary function. The hardware architecture of the methodology is described, together with a special hardware unit that performs the squaring operation.
The outcome of the presented research is a novel methodology for implementing approximations of unary functions, such as trigonometric functions and logarithmic functions, as well as square root and division functions. The architecture of the processing part automatically gives a high degree of parallelism. The methodology is founded on operations that are simple to implement in hardware, such as additions, shifts, and multiplications, which makes the hardware implementation simple to perform. The hardware architecture is characterized by a high degree of parallelism that gives a short critical path and fast computation. The structure of the methodology also assures an area efficient hardware implementation.
Acknowledgments
Very few things in life you do by yourself, and this thesis is just another example of how dependent one is on the support of a wide range of people: near relatives, friends, and colleagues. Therefore, in commemoration, my deepest thanks go to my mother and father for their high beliefs in me. In memory I also want to thank my uncle Karl-Martin Linderoth for his unreserved interest in my work and his thoughtful guidance.
I also feel the greatest gratitude to my supervisor, professor Peter Nilsson, who made it possible for me to restart my Ph.D. studies. In my research he has always had a devoted presence, with continuously evolving and encouraging dialogs. I am also grateful for his determined search for new angles and applications for my research. Beyond that, I am overwhelmed by his understanding and consideration, knowing the difficulties of doing Ph.D. studies in your spare time.
To associate professor Clas Agnvall I also want to show my greatest appreciation for the all-out effort he put in to reestablish me as a Ph.D. student, finding loopholes in the rules that made it possible for me to start my Ph.D. studies again after many years of letting them lie fallow.
My greatest appreciation goes to the administrative and technical staff at the department of Electrical and Information Technology, who have gone far beyond expectations in assisting me with numerous things in my research, travels, and much else. I am especially grateful for the excellent and foresighted help that I received in various ways from the administrative coordinator Pia Bruhn.
To present and past members of the Digital ASIC research group, the Radio research group, many others at the department, and others at other departments: I am very grateful for their help and support in all that surrounds the life of a student.
To all my friends: I thank them for their support and for never giving up on our friendship, even though I have been invisible for long periods.
Lund, January 28, 2011
Erik Hertz
The Lyric of the Numbers

A half
is,
now think how odd,
two thirds of three quarters.

Piet Hein
Table of Contents
Abstract
Acknowledgments
Table of Contents
Preface
1 Introduction
2 Hardware Approximation Methods
   2.1 Lookup Table
      2.1.1 Direct and Indirect Table Lookup
      2.1.2 Lookup Table Reduction by Auxiliary Function
      2.1.3 Interpolation
   2.2 Polynomial Approximation
   2.3 Piecewise Approximation
   2.4 Sum of Bit-Product Approximation
   2.5 CORDIC
3 Parabolic Synthesis
   3.1 Normalizing
   3.2 Developing the Hardware Architecture
   3.3 Methodology for developing sub-functions
   3.4 Hardware Implementation
      3.4.1 Preprocessing
      3.4.2 Processing
      3.4.3 Postprocessing
4 Implementation of the sine function
   4.1.1 Preprocessing
   4.1.2 Processing
   4.1.3 Optimization
   4.1.4 Architecture
   4.1.5 Optimization of Word Length
   4.1.6 Precision
5 Using the Methodology
   5.1.1 The Sine Function
   5.1.2 The Cosine Function
   5.1.3 The Arcsine Function
   5.1.4 The Arccosine Function
   5.1.5 The Tangent Function
   5.1.6 The Arctangent Function
   5.1.7 The Logarithmic Function
   5.1.8 The Exponential Function
   5.1.9 The Division Function
   5.1.10 The Square Root Function
6 Results
7 Conclusions
8 Future Work
References
List of Figures
List of Tables
List of Acronyms
Preface
Papers contributing to this thesis
Erik Hertz and Peter Nilsson, “Parabolic Synthesis Methodology”, In the
proceedings of the 2010 GigaHertz Symposium, pp 35, March 9-10, 2010,
Lund, Sweden.
Erik Hertz and Peter Nilsson, "A Methodology for Parabolic Synthesis", a
book chapter in VLSI, pp 199-220, In-Tech, ISBN 978-953-307-049-0,
Vienna, Austria, February 2010.
E. Hertz and P. Nilsson, “Parabolic Synthesis Methodology Implemented on the Sine Function”, in Proceedings of the 2009 IEEE International Symposium on Circuits and Systems, pp 253-256, Taipei, May 24-27, 2009.
E. Hertz and P. Nilsson, “A Methodology for Parabolic Synthesis of Unary Functions for Hardware Implementation”, in Proceedings of the 2008 IEEE Conference on Signals, Circuits and Systems, pp 1-6, Hammamet, November 7-9, 2008.
E. Hertz and M. Torkelson, “Aritmetik anpassad för Digitala Signalbehandlingskretsar”, in Proceedings of RadioVetenskap och Kommunikation 93, pp 9-12, Lund, April 5-7, 1993.
E. Hertz, “En höghastighets metod för digital arithmetik”, in Proceedings of
the first GigaHertz Symposium 1992, pp p2:k, Linköping, March 25-26,
1992.
E. Hertz and M. Torkelson, “Radar DSP-chip using Approximate Arithmetic”, in Proceedings of the sixth NORSILC/NORCHIP seminar, pp 16.0-16.8, Copenhagen, October 23-24, 1991.
Patents contributing to this thesis
A Device for Conversion of a Binary Floating-Point Number into a Binary
2-Logarithm or the Opposite
Publication number: WO9533308
Publication date: 1995-12-07
Inventor(s): Hertz Erik [SE]
Applicant(s): Försvarets Forskningsanstalt [SE]; Hertz Erik [SE]
A Device for Conversion of a Binary Floating-Point Number into a Binary
2-Logarithm or the Opposite
Publication number: WO9414245
Publication date: 1994-06-23
Inventor(s): Hertz Erik [SE]
Applicant(s): Försvarets Forskningsanstalt [SE]; Hertz Erik [SE]
Papers not included in this thesis
Ulf Sjöström and E. Hertz, “VLSI Implementation of a Digital Array
Antenna”, in Proceedings of the second GigaHertz Symposium 1994, pp
ps-20, Linköping, March 22-23, 1994.
E. Hertz and M. Torkelson, “Comparison Between Two Design Concepts for a Radar DSP Chip”, in Proceedings of the 14th Nordic Semiconductor Meeting, pp. 462-465, Aarhus, Denmark, June 17-20, 1990.
E. Hertz and M. Torkelson, “En VLSI-krets för radar applikation”, in Proceedings of RadioVetenskap och Kommunikation 90, pp 455-458, Gothenburg, April 9-11, 1990.
Patents not included in this thesis
Radio Receiver
Publication number: AU2002237325
Publication date: 2002-09-24
Inventor(s): Hertz Erik [SE]; Canovas Joaquin [SE]; He Shousheng [SE];
Vicent Antonio [SE]
Applicant(s): L M Ericsson Telefon AB [SE]
Radio Receiver
Publication number: WO02073797
Publication date: 2002-09-19
Inventor(s): Hertz Erik [SE]; Canovas Joaquin [SE]; He Shousheng [SE];
Vicent Antonio [SE]
Applicant(s): L M Ericsson Telefon AB [SE]
Radio Receiver and Digital Filter Therefor
Publication number: GB2373385
Publication date: 2002-09-18
Inventor(s): Hertz Erik [SE]; Canovas Joaquin [SE]; He Shousheng [SE];
Vicent Antonio [SE]
Applicant(s): L M Ericsson Telefon AB [SE]
1 Introduction
Relatively recent research on the history of interpolation theory, in particular in mathematical astronomy, has revealed that rudimentary solutions of interpolation problems date back to early antiquity [1]. Examples of
interpolation techniques originally conceived by ancient Babylonian as well
as early-medieval Chinese, Indian, and Arabic astronomers and
mathematicians can be linked to the classical interpolation techniques
developed in Western countries from the 17th until the 19th century. The
available historical material has not yet given a reason to suspect that the
earliest known contributors to classical interpolation theory were influenced
in any way by the mentioned ancient and medieval Eastern works. For classical interpolation theory it is justified to say that no single person did as much for this field as Newton. Therefore, Newton deserves the credit for having put classical interpolation theory on a firm foundation. In the course of the 18th and 19th centuries Newton's theories
[2] were further studied by many others, including Stirling [3], Gauss [4],
Waring [5], Euler [6], Lagrange [7], Bessel [8], Laplace [9] [10], and
Everett [11] [12]. Whereas the developments until the end of the 19th century had been impressive, the developments in the past century have been explosive. Another important development from the late 1800s is the rise of
approximation theory. In 1885, Weierstrass [13] justified the use of
approximations by establishing the so-called approximation theorem, which
states that every continuous function on a closed interval can be
approximated uniformly to any prescribed accuracy by a polynomial. In the
20th century, two major extensions of classical interpolation theory were introduced: firstly the concept of the cardinal function, mainly due to E. T.
Whittaker [14], but also studied before him by Borel [15] and others, and
eventually leading to the sampling theorem for band-limited functions as
found in the works of J. M. Whittaker [16] [17] [18], Kotel'nikov [19],
Shannon [20], and several others, and secondly the concept of oscillatory
interpolation, researched by many and eventually resulting in Schoenberg's
theory [21] [22] of mathematical splines.
The parabolic synthesis methodology
Unary functions, e.g. trigonometric functions, logarithms as well as square
root and division functions are extensively used in computer graphics,
digital signal processing, communication systems, robotics, navigation,
astrophysics, fluid physics, etc. For these high-speed applications, software
solutions are in many cases not sufficient and a hardware implementation is
therefore needed. Implementing a numerical function f(x) by a single lookup table [23] is simple and fast, which is straightforward for low-precision computations of f(x), i.e., when x only has a few bits. However, when performing high-precision computations, a single lookup table implementation is impractical due to the huge table size and the long execution time.
Approximations using only polynomials have the advantage of being ROM-less, but they can impose large computational complexities and delays [24]. By combining table-based methods with polynomial methods, the computational complexity can be reduced and the delays can also be decreased to some extent [24].
The CORDIC (COordinate Rotation DIgital Computer) algorithm [25] [26] has been used for these applications since it is faster than a software approach. However, CORDIC is traditionally an iterative method and therefore slow, which makes it insufficient for these kinds of applications.
This thesis proposes the methodology of parabolic synthesis [27], which develops functions that perform an approximation of original functions in hardware. The architecture of the processing part of the methodology uses parallelism to reduce the execution time. For the development of approximations of unary functions, a parabolic synthesis methodology is applied. Only low complexity operations that are simple to implement in hardware are used.
2 Hardware Approximation Methods
An unavoidable problem when implementing signal and image processing algorithms in hardware is to find realizations of elementary function approximations in hardware. With increasing digitalization there is a growing demand for hardware realization methods for approximations of elementary functions. In view of this increasing relevance, it is natural that the subject of developing such methods gets more and more attention. This chapter gives a brief overview of existing hardware realization methods for approximations of mathematically defined elementary functions.
2.1 Lookup Table
Computation by lookup table is an attractive VLSI realization method because memory is much denser than random logic in hardware [28]. Also, with improving memory density, multimegabit lookup tables are becoming practical in some applications. A benefit of lookup tables is that they reduce the costs of hardware development when it comes to design, validation, and testing. They also provide more flexibility for design changes and reduce the number of different building blocks or modules required when implementing arithmetic system designs.
When storing tables in read-only memories, a further benefit is improved reliability, since memories are more robust than combinational logic circuits. Exchanging these for read/write memories and reconfigurable peripheral logic facilitates simpler evaluation of different functions, as well as simplifying the maintenance and repair of a design.
2.1.1 Direct and Indirect Table Lookup
The straightforward usage of a lookup table is direct table lookup. Given an m-variable function f(x_{m-1}, x_{m-2}, …, x_1, x_0), where the input values together form a u-bit word and the desired result is v bits wide, the required table size is 2^u × v bits. The concatenated u-bit string is used as an address into the table, and the v-bit value read out from the table is directly forwarded to the output. This arrangement gives a high degree of flexibility but is not practical in most cases. When implementing unary functions, the size of the table can be manageable when the input operand is up to 12 to 16 bits, which gives a table size of 4K to 65K words. To implement binary functions, such as x·y, x mod y, or x^y, in lookup tables, each input operand has to be shortened to half the operand size of the previous case to remain manageable. This is because the table size grows exponentially, which becomes intolerable when implementing in hardware. With increasing size, the access latency of the memory will also become a problem for many applications.
To reduce the consequences of the exponential growth of the table size
preprocessing steps of the operands and postprocessing steps [23] of the
values read out from the tables can be introduced and thereby perform an
indirect table lookup. When implementing this hybrid scheme the elements
in the pre- and postprocessing parts are both simpler and faster and may
also be more cost-effective than either a pure table lookup approach or a
pure logic circuit implementation based on an algorithm.
2.1.2 Lookup Table Reduction by Auxiliary Function
An approach to reducing the table size is to express a binary function through an auxiliary unary function. Depending on the complexity of the auxiliary function, the size of the lookup table can be reduced to different extents. These auxiliary functions are used in both the pre- and postprocessing steps [23] of the lookup table. The reduction of the table size for a given resolution can be rather significant and can also allow pipelining of the design, which increases the throughput.
2.1.3 Interpolation
A simple method to reduce the size of the lookup table is to use linear interpolation when implementing a function [28]. When a function f(x) is known for x = x_lo and x = x_hi, where x_lo < x_hi, the values for x in the interval [x_lo, x_hi] can be computed from f(x_lo) and f(x_hi) by interpolation. The simplest method is linear interpolation, where f(x) for x in [x_lo, x_hi] is computed according to (1).

f(x) = f(x_lo) + (x − x_lo) · [f(x_hi) − f(x_lo)] / (x_hi − x_lo)    (1)
When implementing linear interpolation in hardware, two lookup tables are needed: one for the starting point of the interpolation, a = f(x_lo), and one for the direction coefficient, b = [f(x_hi) − f(x_lo)]/(x_hi − x_lo), as shown in Figure 1 for the initial linear approximation. When choosing x_lo and x_hi it is beneficial to express them as powers of two, since the addressing of the lookup tables and the computation of the linear interpolation can then be simplified. The addressing of the lookup tables then uses the most significant part of x, and the subtraction Δx = (x − x_lo) involves only the least significant part of x. The hardware realization of the approximation can thus be simplified to a + b·Δx, where a and b are fetched from memories for the interval in which the computation is to be performed.
Figure 1. Linear interpolation for computing f(x).
To minimize the absolute and the relative error in the worst case, an improved linear approximation strategy, as shown in Figure 1, can be chosen. There, the starting point a and the direction coefficient b are chosen to minimize the error of the interpolation over the interval, as illustrated by the software sketch below.
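The following is a minimal software sketch of the two-table scheme a + b·Δx described above. The choice of function, sin((π/2)·x), and the segment count of 32 are illustrative assumptions, not values taken from the thesis.

```python
import math

# A minimal sketch of table-based linear interpolation with one table for
# the starting points a = f(x_lo) and one for the direction coefficients
# b = (f(x_hi) - f(x_lo)) / (x_hi - x_lo). Function and table size are
# illustrative assumptions.
SEGMENT_BITS = 5
SEGMENTS = 1 << SEGMENT_BITS          # a power of two, as recommended above
STEP = 1.0 / SEGMENTS

def f(x):
    return math.sin(math.pi / 2 * x)

table_a = [f(i * STEP) for i in range(SEGMENTS)]
table_b = [(f((i + 1) * STEP) - f(i * STEP)) / STEP for i in range(SEGMENTS)]

def interpolate(x):
    i = int(x * SEGMENTS)             # most significant part of x: address
    dx = x - i * STEP                 # least significant part: dx = x - x_lo
    return table_a[i] + table_b[i] * dx

print(interpolate(0.3), f(0.3))       # the values should agree to a few bits
```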
2.2 Polynomial Approximation
Since polynomials only involve additions, subtractions, multiplications, and comparisons, it is natural to approximate elementary functions with polynomials. To ensure computational efficiency, the multiplier is the most crucial part of the implementation; choosing a fast multiplier is therefore important for the efficiency of the polynomial approximation. There are various polynomial schemes available, and the chosen scheme affects the number of terms needed for a given precision and thus the computational complexity.
TABLE I. APPROXIMATION SCHEMES.
Taylor Polynomial
Maclaurin Polynomial
Legendre Polynomial
Chebyshev Polynomial
Jacobi Polynomial
Laguerre Polynomial
When developing polynomial approximations, the challenge is to develop an efficient approximation that conforms to the function to be approximated in the desired interval. Two development strategies are available: one that minimizes the average error, called least squares approximation, and one that minimizes the worst case error, called least maximum approximation [24]. Which strategy to choose depends on the requirements of the design. When the design requires the approximation to have the best overall fit to the function being approximated, least squares approximation is favorable; an example is when the approximation is used in a series of computations. Least maximum approximation is favorable when it is important to keep the maximum error small; an example is when the error of the approximation has to stay within a given limit from the function being approximated.
A list of commonly known approximation schemes is given in Table I.
For additional information the reader can consult Muller’s book [24].
2.3 Piecewise Approximation
Piecewise polynomial approximation [29] is a more flexible method to approximate a function f(x), since the interval over which the approximation is performed is divided into a number of subintervals, each implemented as a piecewise approximation by a polynomial of low degree. These piecewise-polynomial approximations are called splines, and the endpoints of the subintervals are known as knots. An example of a linear spline approximation is shown in Figure 2.

Figure 2. Example of a linear spline approximation.

A spline of degree n, n ≥ 1, is defined as a polynomial function of degree n or less in each subinterval, with a prescribed degree of smoothness. The spline is expected to be continuous and also to have continuous derivatives of order up to k, 0 ≤ k < n.
In real-time applications performing elementary functions in dedicated hardware, software routines are often too slow and computationally intensive [30]. When using piecewise approximation methods in hardware implementations, the degree of the splines is therefore often limited to first order approximations, also called linear approximations.
When developing a hardware design, much effort is put into reducing the delay of the computation and the chip area. Excluding multipliers is therefore interesting, since it reduces both the execution time and the chip area of the implementation. An example of this is shown in [31], where the multipliers have been replaced with configurable shifts.
2.4 Sum of Bit-Product Approximation
For implementation of elementary functions in hardware, the approximation method of sums of bit-products [32] can be beneficial, since it can give an area efficient implementation with a high throughput and reasonable accuracy.
The approach is formulated for a function f(X), where X is composed of N bits, as shown in (2).

X = x_{N−1}·2^{N−1} + … + x_1·2^1 + x_0·2^0    (2)

This gives 2^N function values f(X), where 0 ≤ X ≤ 2^N − 1. The function values are then computed as a weighted sum of bit-products, as shown in (3).

f(X) = Σ_{j=0}^{2^N − 1} c_j · p_j    (3)

In (3), c_j corresponds to the weight and p_j to the bit-product, where each bit-product p_j is composed of the bits x_i that are one in the binary representation of j.
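To make the formulation in (2) and (3) concrete, the sketch below computes weights c_j for an assumed 3-bit input and evaluates the sum of bit-products. With all 2^N weights kept the reconstruction is exact; a hardware approximation would retain only the significant products. The function f and word length N are illustrative assumptions.

```python
import math

# A minimal sketch of the sum of bit-products approximation from (2)-(3).
N = 3

def f(X):                              # example function on 0 <= X < 2**N
    return math.sqrt(X)

def bit_product(j, bits):
    # p_j is the AND of the input bits that are one in the binary
    # representation of j (p_0 is the constant 1).
    return all(bits[i] for i in range(N) if (j >> i) & 1)

# The weights follow by inclusion-exclusion over the bit subsets of each j:
# c_j = sum over subsets k of j of (-1)**(popcount(j)-popcount(k)) * f(k).
c = [0.0] * (1 << N)
for j in range(1 << N):
    for k in range(1 << N):
        if k & j == k:
            sign = (-1) ** (bin(j).count("1") - bin(k).count("1"))
            c[j] += sign * f(k)

def approx(X):
    bits = [(X >> i) & 1 for i in range(N)]
    return sum(c[j] for j in range(1 << N) if bit_product(j, bits))

print([round(approx(X), 6) for X in range(1 << N)])  # equals f(X) here
```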
2.5 CORDIC
The CORDIC (COordinate Rotation DIgital Computer) algorithm is an iterative algorithm capable of evaluating a number of elementary functions using only shifts and additions. The algorithm is therefore very appealing for hardware implementation compared to software approaches [33] [26] that use multiplications and additions. The CORDIC algorithm was introduced by Volder [25] and later generalized by Walther [34].
The CORDIC algorithm approximates the sine, cosine, etc. of an angle iteratively, using only simple operations such as additions, shifts, and table lookups. The algorithm is traditionally an iterative process consisting of micro-rotations, where the initial vector is rotated by predetermined step angles. However, unrolled architectures can be beneficial for hardware as well [35] [36]. Any angle can be represented to a certain accuracy by a set of predetermined angle steps.
Figure 3. Vector rotation.
In Figure 3, the principal idea of CORDIC is shown. Rotation of a vector (X_1, Y_1) by the angle θ to (X_2, Y_2) is computed according to (4).
X_2 = X_1·cos(θ) − Y_1·sin(θ)
Y_2 = X_1·sin(θ) + Y_1·cos(θ)    (4)
The functions in (4) can be rearranged according to (5).
X_2 = cos(θ)·[X_1 − Y_1·tan(θ)]
Y_2 = cos(θ)·[X_1·tan(θ) + Y_1]    (5)
To generalize (5) it can be further rearranged according to (6).
X_{i+1} = cos(θ_i)·[X_i − Y_i·tan(θ_i)]
Y_{i+1} = cos(θ_i)·[X_i·tan(θ_i) + Y_i]    (6)
The methodology is to compute the rotation angle θ in steps i, where each step size is tan(θ_i) = ±2^{−i}. Through iterative rotations, the vector rotates in one direction or the other by decreasing steps until the desired angle is achieved. The iterative rotation method is based on the rotation from θ_i to θ_{i+1} being done in carefully chosen steps whose values imply that the only operations used are shifts and additions. The values for θ_i are therefore chosen such that tan(θ_i) is a fractional number expressed in trivial numbers, i.e. powers of two. The multiplication by tan(θ_i) can thus be replaced by a simple right-shift operation.
If the iterations of (6) are analyzed, the cosine factors fall out as the iterative product series K shown in (7).

K = ∏_{i=0}^{∞} cos(θ_i) = ∏_{i=0}^{∞} 1/√(1 + 2^{−2i})    (7)
Equation (7) is referred to as the scale factor, which represents the decrease in magnitude of the vector during the rotation process. When the number of iterations/rotations approaches infinity, the scale factor approaches the value 0.607253. The cos(θ_i) term in (6) can therefore be replaced with the scale factor K_i, as shown in (8).
X_{i+1} = K_i·[X_i − Y_i·d_i·2^{−i}]
Y_{i+1} = K_i·[X_i·d_i·2^{−i} + Y_i]    (8)
As previously described, in the iterative process the vector rotates in one direction or the other, which means that the direction of rotation, d_i, must be introduced in (8).
To keep track of the angle that has been rotated, (9) is introduced.
θ_{i+1} = θ_i − d_i·arctan(2^{−i})    (9)
The direction of the rotation, d_i, is defined according to (10).

d_i = −1 if θ_i < 0, and d_i = 1 otherwise    (10)
Since the first rotation rotates the vector 45° counterclockwise, it is convenient to set the initial angle θ_0 to 0.
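The sketch below shows one common software rendering of the rotation-mode iteration in (8) to (10), here tracking the residual angle rather than the accumulated one and applying the scale factor from (7) up front. The iteration count of 16 is an illustrative assumption.

```python
import math

# A minimal sketch of rotation-mode CORDIC computing sine and cosine.
ITERATIONS = 16
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
K = 1.0
for i in range(ITERATIONS):
    K /= math.sqrt(1.0 + 2.0 ** (-2 * i))     # scale factor, approaches 0.607253

def cordic_sin_cos(theta):
    x, y = K, 0.0                             # pre-scaled start vector (1, 0)
    for i in range(ITERATIONS):
        d = -1.0 if theta < 0 else 1.0        # rotation direction, cf. (10)
        # The 2**-i factors correspond to right shifts in hardware, cf. (8).
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        theta -= d * ANGLES[i]                # remaining angle, cf. (9)
    return y, x                               # (sin, cos)

print(cordic_sin_cos(math.pi / 6)[0], math.sin(math.pi / 6))
```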
3 Parabolic Synthesis
Unary functions, e.g. trigonometric functions, logarithms as well as square
root and division functions are extensively used in computer graphics,
digital signal processing, communication systems, robotics, navigation,
astrophysics, fluid physics, etc. For these high-speed applications, software
solutions are in many cases not sufficient and a hardware implementation is
therefore needed.
The proposed methodology of parabolic synthesis [27] develops functions that perform an approximation of original functions in hardware. The architecture of the processing part of the methodology uses parallelism to reduce the execution time. For the development of approximations of unary functions, a parabolic synthesis methodology is applied. Only low complexity operations that are simple to implement in hardware are used.
The methodology is developed for implementing approximations of unary functions in hardware. The approximation part is of course the important part of this work, but two other steps are sometimes necessary: a preprocessing normalization step and a postprocessing transformation step, as described in [23] [24]. The computation is therefore divided into three steps: normalizing, approximating, and transforming.
3.1 Normalizing
The purpose of the normalization is to facilitate the hardware implementation by limiting the numerical range.
The normalization has to ensure that the values are in the interval 0 ≤ x < 1 on the x-axis and 0 ≤ y < 1 on the y-axis. The coordinates of the starting point shall be (0,0). Furthermore, the ending point shall have coordinates smaller than (1,1), and the function must be strictly concave or strictly convex throughout the interval. An example of such a function, here called an original function, f_org(x), is shown in Figure 4.
Figure 4. Example of a normalized function, in this case sin((π/2)·x).
3.2 Developing the Hardware Architecture
When developing a hardware architecture that approximates an original function, only low complexity operations are used. Operations such as shifts, additions, and multiplications are efficient to implement in hardware and are therefore preferred. The downscaling of semiconductor technologies and the development of efficient multiplier architectures have made the multiplication operation efficient in both size and execution time when implemented in hardware. The multiplier is therefore commonly used in this methodology when developing the hardware.
As in Fourier analysis [37], the proposed methodology is based on decomposition into basic functions. The proposed methodology is not, as in Fourier analysis, a decomposition in terms of sinusoidal functions, but in terms of second order parabolic functions. Second order parabolic functions are used since they can be implemented using low complexity operations. The proposed methodology also differs from the Fourier synthesis process in that it is synthesized using multiplications in the recombination process, and not additions as in the Fourier case.
The proposed methodology is founded on second order parabolic functions called sub-functions, s_n(x), which when recombined, as shown in (11), reproduce the original function f_org(x). When developing the approximate function, the accuracy depends on the number of sub-functions used.

f_org(x) = s_1(x) · s_2(x) · … · s_∞(x)    (11)
The procedure when developing sub-functions is to divide the original function f_org(x) with the first sub-function s_1(x). This division generates a parabolic looking function called the first help-function, f_1(x), as shown in (12).

f_1(x) = f_org(x) / s_1(x)    (12)
The first sub-function s_1(x) is chosen to be feasible for hardware, according to the methodology. In the same manner, the following help-functions f_n(x) are generated, as shown in (13).

f_{n+1}(x) = f_n(x) / s_{n+1}(x)    (13)

It is important that the sub-functions s_n(x) are chosen to be feasible for hardware realization.
3.3 Methodology for developing sub-functions
The methodology for developing sub-functions is founded on decomposition of the original function f_org(x) in terms of second order parabolic functions, for the interval 0 ≤ x < 1.0 and the subintervals within it. The second order parabolic function is chosen as decomposition function since its structure is reasonably simple to implement in hardware, i.e. only low complexity operations such as additions and multiplications are used.
The first sub-function

The sub-function s_1(x) is developed by dividing the original function f_org(x) with x as a first order approximation. As shown in Figure 5, there are two possible results after dividing the original function with x: one where f(x) > 1 and one where f(x) < 1.

Figure 5. Two possible results after dividing an original function with x.

To approximate these functions, the expression 1 + (c_1·(1 − x)) is used. The first sub-function s_1(x) is given by a multiplication of x and 1 + (c_1·(1 − x)), resulting in a second order parabolic function according to (14).

s_1(x) = x·(1 + (c_1·(1 − x))) = x + (c_1·(x − x^2))    (14)

In (14), the coefficient c_1 is determined as the limit, as x approaches 0, of the original function divided by x, minus 1, according to (15).
c_1 = lim_{x→0} f_org(x)/x − 1    (15)
The second sub-function

The first help-function f_1(x) is calculated according to (12). Dividing two functions that are both continuously convex or concave, with the same starting and ending points, results in a function with an appearance similar to a parabolic function, as shown in Figure 6.

Figure 6. Example of the first help-function f_1(x) compared with the sub-function s_2(x).
The second sub-function s_2(x) is chosen, according to the methodology, as a second order parabolic function, see (16).

s_2(x) = 1 + (c_2·(x − x^2))    (16)
In (16), the coefficient c_2 is chosen to satisfy that the quotient between the help-function f_1(x) and the second sub-function s_2(x) equals 1 when x = 0.5, which gives (17).

c_2 = 4·(f_1(1/2) − 1)    (17)
Developing the second help-function f_2(x) results in a function that can be divided into a pair of parabolic looking functions, as shown in Figure 7, where the first interval is 0 ≤ x < 0.5 and the second interval is 0.5 ≤ x < 1.0.

Figure 7. Example of the second help-function f_2(x).

The sizes of the subintervals are chosen as powers of two, since the normalization of an interval can then be performed as a left shift of x, where the fractional part is the normalized coordinate within the new interval and the integer part addresses the coefficients for the interval, as used in the hardware implementation in section 4.1.4 Architecture.
Sub-functions when n > 2

For help-functions f_n(x) when n > 2, the functions are characterized by the form of one or more pairs of parabolic looking functions. When developing the higher order sub-functions, each pair of parabolic looking functions is divided into two parabolic help-functions. For each subinterval, a parabolic sub-function is developed as an approximation of the help-function f_n(x) in the subinterval. To show which subinterval a sub-help-function is valid for, the subscript is extended with the index m, which gives the sub-help-functions the appearance f_{n,m}(x).

Equation (18) shows how the help-function f_n(x) is divided into sub-help-functions f_{n,m}(x) when n > 2.

f_n(x) = f_{n,m}(x),  m/2^{n−1} ≤ x < (m+1)/2^{n−1},  m = 0, 1, …, 2^{n−1} − 1    (18)
As shown in (18), the number of sub-help-functions doubles for each order n > 1, i.e. the number of sub-help-functions is 2^{n−1}. From these sub-help-functions, the corresponding sub-sub-functions are developed. In analogy with the help-function f_n(x), the sub-function s_{n+1}(x) will have sub-sub-functions s_{n+1,m}(x). Equation (19) shows how the sub-function s_n(x) is divided into sub-sub-functions when n > 2.

s_n(x) = s_{n,m}(x_n),  m/2^{n−2} ≤ x < (m+1)/2^{n−2},  m = 0, 1, …, 2^{n−2} − 1    (19)
Note that in (19), in the sub-sub-functions, x has been changed to x_n. The variable x_n is normalized to the corresponding interval, which simplifies the hardware implementation of the parabolic function. To simplify the normalization, x_n is selected as a power-of-two multiple of x with the integer part removed. The normalization of x is therefore done by multiplying x with 2^{n−2}, which in hardware is n−2 left shifts, after which the integer part is dropped. This gives x_n as a fractional part (frac( )) of x, as shown in (20).

x_n = frac(2^{n−2}·x)    (20)
As for the second sub-function s_2(x), a second order parabolic function is used as an approximation of each interval of the help-function f_{n−1}(x), as shown in (21).

s_{n,m}(x) = 1 + (c_{n,m}·(x_n − x_n^2))    (21)

The coefficients c_{n,m} are computed according to (22), i.e. from the value of the sub-help-function at the midpoint of subinterval m.

c_{n,m} = 4·(f_{n−1,m}((2·(m+1) − 1)/2^{n−1}) − 1)    (22)
After the approximation part the result is transformed into its desired form.
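As a concrete illustration of the procedure in (12) to (22), the sketch below develops the first three sub-functions for the thesis's running example f_org(x) = sin((π/2)·x). The resulting c_3 values can be compared with (39) in chapter 4; the choice of three levels is an illustrative assumption.

```python
import math

# A minimal sketch of sub-function development by parabolic synthesis.
def f_org(x):
    return math.sin(math.pi / 2 * x)

c1 = math.pi / 2 - 1                 # (15): lim_{x->0} f_org(x)/x - 1

def s1(x):                           # (14)
    return x + c1 * (x - x * x)

def f1(x):                           # (12)
    return f_org(x) / s1(x)

c2 = 4 * (f1(0.5) - 1)               # (17)

def s2(x):                           # (16)
    return 1 + c2 * (x - x * x)

def f2(x):                           # (13)
    return f1(x) / s2(x)

def coefficients(f_prev, n):
    # (22): evaluate help-function n-1 at the midpoint of each subinterval.
    return [4 * (f_prev((2 * (m + 1) - 1) / 2 ** (n - 1)) - 1)
            for m in range(2 ** (n - 2))]

print(c1, c2, coefficients(f2, 3))   # c3 is roughly [-0.0122, 0.0106]
```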
3.4 Hardware Implementation

Figure 8. The hardware architecture of the methodology.

For the hardware implementation, two's complement representation [28] is used. The implementation is divided into three hardware parts, preprocessing, processing, and postprocessing, as shown in Figure 8. This division was introduced by P.T.P. Tang [24].
3.4.1 Preprocessing
In this part, the incoming operand v is normalized to prepare the input to the processing part, according to section 3.1. If the approximation is implemented as a block in a larger system, the preprocessing can be taken into consideration in the previous blocks, which implies that the preprocessing part can be reduced or even excluded.
3.4.2 Processing
In the processing part, the approximation of the original function is directly computed in either an iterative or a parallel hardware architecture. The three equations (14), (16), and (21) have the same structure, which means that the approximation can be implemented as an iterative architecture, as shown in Figure 9.
Figure 9. The principle of iterative hardware architecture.
The benefit of the iterative architecture is the small chip area, whereas the disadvantage is a longer computation time. The advantage of a parallel hardware architecture is that it gives a short critical path and fast computation, at the price of a larger chip area. The principle of the parallel hardware architecture for four sub-functions is shown in Figure 10.
Figure 10. The architecture principle for four sub-functions.
To increase the throughput even more, pipeline stages can be inserted in the parallel hardware architecture.
In the sub-functions (14), (16), and (21), x^2 and x_n^2 are recurring operations. Since the squaring operation x_n^2 in the parallel hardware architecture is a partial result of x^2, a dedicated squarer has been developed. Figure 11 describes the algorithm that performs the squaring and delivers the partial products for x_n^2.
Figure 11. Squaring algorithm for the partial products x_n^2.
The squaring algorithm for the partial products x_n^2 can be simplified as shown in Figure 12.

Figure 12. Simplified squaring algorithm for the partial products x_n^2.
When simplifying the squaring algorithm in Figure 11, the result of the component p_0 in the vector p is simplified according to (23).

p_0 = x_0·x_0 = x_0    (23)
The result of the component p_1 in the vector p is equal to 0, since the result of p_0 can never contribute to p_1.
The result of the component q_1 in the vector q is simplified according to (24), where the carry term x_1·x_0·2^2 moves up to q_2.

q_1 = p_1·2^1 + x_1·x_0·2^1 + x_0·x_1·2^1 = 0·2^1 + x_1·x_0·2^2 = 0·2^1    (24)
The result of the component q_2 in the vector q is simplified according to (25).

q_2 = x_1·x_0·2^2 + x_1·x_1·2^2 = x_1·2^2 + x_1·x_0·2^2    (25)
The result of the component r_2 in the vector r is simplified according to (26).

r_2 = q_2·2^2 + x_2·x_0·2^2 + x_0·x_2·2^2 = q_2·2^2 + x_2·x_0·2^3 = q_2·2^2    (26)
The result of the component r_3 in the vector r is simplified according to (27).

r_3 = q_3·2^3 + x_2·x_1·2^3 + x_1·x_2·2^3 + x_2·x_0·2^3 = q_3·2^3 + x_2·x_1·2^4 + x_2·x_0·2^3 = q_3·2^3 + x_2·x_0·2^3    (27)
The result of the component r_4 in the vector r is simplified according to (28).

r_4 = x_2·x_1·2^4 + x_2·x_2·2^4 = x_2·2^4 + x_2·x_1·2^4    (28)
The result of the component s_3 in the vector s is simplified according to (29).

s_3 = r_3·2^3 + x_3·x_0·2^3 + x_0·x_3·2^3 = r_3·2^3 + x_3·x_0·2^4 = r_3·2^3    (29)
The result of the component s_4 in the vector s is simplified according to (30).

s_4 = r_4·2^4 + x_3·x_0·2^4 + x_3·x_1·2^4 + x_1·x_3·2^4 = r_4·2^4 + x_3·x_0·2^4 + x_3·x_1·2^5 = r_4·2^4 + x_3·x_0·2^4    (30)
The result of the component s_5 in the vector s is simplified according to (31).

s_5 = r_5·2^5 + x_3·x_1·2^5 + x_3·x_2·2^5 + x_2·x_3·2^5 = r_5·2^5 + x_3·x_1·2^5 + x_3·x_2·2^6 = r_5·2^5 + x_3·x_1·2^5    (31)
The result of the component s_6 in the vector s is simplified according to (32).

s_6 = x_3·x_2·2^6 + x_3·x_3·2^6 = x_3·2^6 + x_3·x_2·2^6    (32)
In Figure 11 and Figure 12, the squaring algorithm that produces the partial products for x_n^2 is shown. The first partial product, p, is the square of the least significant bit of x. The second partial product, q, is the square of the two least significant bits of x. The partial product r is the square of the three least significant bits of x, and s is the square of x. The squaring operation is performed with unsigned numbers. When analyzing the squarer in Figure 11 and Figure 12, it was found that it closely resembles a bit-serial squarer [38] [39]. By introducing registers in the design of the bit-serial squarer, the partial results for x_n^2 are easily extracted. The squaring algorithm can thus be simplified to one addition, since only an addition is needed when computing each new partial product.
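The sketch below mimics this simplified scheme in software: the square of the i least significant bits of x is updated with a single addition per bit, so every partial square x_n^2 falls out along the way. The 4-bit word length is an illustrative assumption.

```python
# A minimal sketch of iterative partial squaring, cf. (23)-(32).
def partial_squares(x, bits=4):
    """Return the squares of the 1, 2, ..., `bits` LSBs of x, cf. p, q, r, s."""
    results = []
    acc = 0                                  # square of the bits seen so far
    low = 0                                  # value of the bits seen so far
    for i in range(bits):
        b = (x >> i) & 1
        # (low + b*2**i)**2 = low**2 + b*(2**(i+1)*low + 2**(2*i)),
        # i.e. one addition per step, since b is 0 or 1.
        if b:
            acc += (low << (i + 1)) + (1 << (2 * i))
            low += 1 << i
        results.append(acc)
    return results

print(partial_squares(0b1011))   # [1, 9, 9, 121] = 1^2, 3^2, 3^2, 11^2
```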
From (14), (16), and (21) it is found that only the coefficient values differ when implementing different unary functions. This implies that different unary functions can be realized in the same processing hardware, just by using different sets of coefficients.
Since the methodology calculates an approximation of the original function, the error at the desired precision can be both positive and negative. In particular, if the precision of the approximation falls short of the desired precision, an increased word length, compared with the word length otherwise needed to accomplish the desired precision, might be necessary. If the order of the last used sub-function is n > 1, the precision can be improved by optimizing one or more of the coefficients c_2 in (17) or c_{n,m} in (22). The optimization of the coefficients will minimize the error in the last used sub-function and can thereby reduce the word length needed to accomplish the desired accuracy. Such coefficient optimization is performed numerically in computer simulations.
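A minimal sketch of such a numerical optimization is shown below: the coefficient of the last sub-function is scanned around its analytic value from (17), and the value that minimizes the maximum relative error is kept. The scan range, step count, and sample grid are illustrative assumptions, not the thesis's actual optimization procedure.

```python
import math

# A minimal sketch of numerical coefficient optimization.
def max_rel_error(approx, exact, xs):
    return max(abs((approx(x) - exact(x)) / exact(x)) for x in xs)

def optimize(c_init, build, exact, xs, span=0.01, steps=201):
    # Scan c in [c_init - span, c_init + span] and keep the best value.
    candidates = (c_init - span + 2 * span * k / (steps - 1) for k in range(steps))
    return min(candidates, key=lambda c: max_rel_error(build(c), exact, xs))

# Example: tune c2 for a two-sub-function approximation of sin(pi/2 * x).
f_org = lambda x: math.sin(math.pi / 2 * x)
c1 = math.pi / 2 - 1
build = lambda c2: (lambda x: (x + c1 * (x - x * x)) * (1 + c2 * (x - x * x)))
xs = [i / 256 for i in range(1, 256)]     # avoid x = 0, where f_org = 0
print(optimize(0.400858, build, f_org, xs))
```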
3.4.3 Postprocessing
The postprocessing part transforms the value to the output result z, as shown in Figure 8. If the approximation is implemented as a block in a system, the postprocessing can be taken into consideration in the following blocks, which implies that the postprocessing part can be reduced or excluded. The main reason for the postprocessing is thus to transform the output into a feasible form for the succeeding parts of the system.
4 Implementation of the sine function
An implementation of the function sin(v) using the proposed methodology (Hertz & Nilsson, 2009) is described in this chapter as an example.
4.1.1 Preprocessing
Figure 13. The function f(v) before normalization and the original function f_org(x).

To ensure that the values of the normalized operand x are in the interval 0 ≤ x < 1, the incoming operand v is expressed as a multiple of π/2, as shown in (33).

v = (π/2)·x    (33)
To normalize the function f(v) = sin(v), v is substituted according to (33), which gives the original function f_org(x) in (34).

f_org(x) = sin((π/2)·x)    (34)
In Figure 13, the function f(v) is shown together with the original function f_org(x).
4.1.2 Processing
For the processing part, sub-functions are developed according to the proposed methodology. For the first sub-function s_1(x), the coefficient c_1 is defined according to (15), here repeated for the sine function.

c_1 = lim_{x→0} f_org(x)/x − 1 = π/2 − 1    (15)

The first sub-function, using the determined value of the c_1 coefficient, is shown in (35).

s_1(x) = x + (c_1·(x − x^2)) = x + ((π/2 − 1)·(x − x^2)),  where π/2 − 1 ≈ 0.570796    (35)
The first help-function f_1(x) is computed as shown in (36).

f_1(x) = f_org(x) / s_1(x)    (36)
To develop the second sub-function s_2(x), the coefficient c_2 is defined according to (17), here repeated for the sine function.

c_2 = 4·(f_1(1/2) − 1) = 4·(f_org(1/2)/s_1(1/2) − 1) = 0.400858    (17)
The second sub-function, using the determined value of the coefficient, is shown in (37).

s_2(x) = 1 + (c_2·(x − x^2)) = 1 + (0.400858·(x − x^2))    (37)
The second help-function f_2(x) is computed as shown in (38).

f_2(x) = f_1(x) / s_2(x)    (38)
To develop the third sub-function s_3(x), the second help-function f_2(x) is divided into its two sub-help-functions as shown in (18). The third order of sub-functions is thereby divided into two sub-sub-functions, where s_{3,0}(x_3) is restricted to the interval 0 ≤ x < 0.5 and s_{3,1}(x_3) is restricted to the interval 0.5 ≤ x < 1.0, according to (19). A normalization of x to x_3 is done to simplify the hardware implementation, as described in (20).
For each sub-sub-function, the corresponding coefficient, c_{3,0} and c_{3,1}, is determined according to (22). The determined values of the coefficients are shown in (39).
s_{3,0}(x_3) = 1 + (c_{3,0}·(x_3 − x_3^2)) = 1 + (−0.0122452·(x_3 − x_3^2)),  0 ≤ x < 0.5
s_{3,1}(x_3) = 1 + (c_{3,1}·(x_3 − x_3^2)) = 1 + (0.0105947·(x_3 − x_3^2)),  0.5 ≤ x < 1    (39)
The third help-function f_3(x) is computed as shown in (40).

f_3(x) = f_2(x) / s_3(x)    (40)
To develop the fourth sub-function s_4(x), the third help-function f_3(x) is divided into its four sub-help-functions as shown in (18). The fourth order of sub-functions is thereby divided into four sub-sub-functions, where s_{4,0}(x_4) is restricted to the interval 0 ≤ x < 0.25, s_{4,1}(x_4) to the interval 0.25 ≤ x < 0.5, s_{4,2}(x_4) to the interval 0.5 ≤ x < 0.75, and s_{4,3}(x_4) to the interval 0.75 ≤ x < 1.0, according to (19). A normalization of x to x_4 is done here as well, to simplify the hardware implementation, as described in (20).
For each sub-sub-function, the corresponding coefficient, c_{4,0}, c_{4,1}, c_{4,2}, and c_{4,3}, is determined according to (22). The determined values of the coefficients are shown in (41).
s_{4,0}(x_4) = 1 + (−0.00223363·(x_4 − x_4^2)),  0 ≤ x < 0.25
s_{4,1}(x_4) = 1 + (0.00192558·(x_4 − x_4^2)),  0.25 ≤ x < 0.5
s_{4,2}(x_4) = 1 + (−0.00119216·(x_4 − x_4^2)),  0.5 ≤ x < 0.75
s_{4,3}(x_4) = 1 + (0.00126488·(x_4 − x_4^2)),  0.75 ≤ x < 1    (41)
No postprocessing is needed, since the result from the processing part has the appropriate form.
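As a software check, the sketch below assembles the sub-functions (35), (37), (39), and (41) into the complete approximation (11) of sin((π/2)·x), using the unoptimized coefficient values quoted above.

```python
import math

# A minimal sketch of the four-sub-function sine approximation.
C1 = math.pi / 2 - 1
C2 = 0.400858
C3 = [-0.0122452, 0.0105947]                              # (39)
C4 = [-0.00223363, 0.00192558, -0.00119216, 0.00126488]   # (41)

def frac(t):
    return t - int(t)

def sin_approx(x):
    p = x - x * x                        # the shared (x - x^2) term
    y = (x + C1 * p) * (1 + C2 * p)      # s1(x) * s2(x)
    for n, coeffs in ((3, C3), (4, C4)):
        xn = frac(2 ** (n - 2) * x)      # interval normalization, cf. (20)
        m = int(2 ** (n - 2) * x)        # interval index selects c_{n,m}
        y *= 1 + coeffs[m] * (xn - xn * xn)   # cf. (21)
    return y

x = 0.3
print(sin_approx(x), math.sin(math.pi / 2 * x))   # agree to roughly 14 bits
```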
4.1.3 Optimization
If no more sub-functions are to be developed, the precision of the approximation can be further improved by optimization of the coefficients c_{4,0}, c_{4,1}, c_{4,2}, and c_{4,3}. The precision is improved by adjusting the coefficient of the sub-sub-function with the lowest precision, so that the sub-sub-function reaches the minimum error of the function. The optimization of the error can favorably be performed in software. As later shown in Figure 15, the sub-sub-function s_{4,3}(x) in the interval 0.75 ≤ x < 1.0 has the largest relative error, here in floating point but also in fixed point. When performing an optimization of the sub-sub-function s_{4,3}(x) in the interval 0.75 ≤ x < 1.0, it was found that the word length in the computations could be reduced from 17 to 16 bits.
4.1.4 Architecture
In Figure 14, the architecture of the approximation of the sine function using the proposed methodology is shown. The x^2 block in Figure 14 is the specially designed multiplier described in Figure 11 and Figure 12; it produces the partial results u, i.e. u_3 and u_4, which are used in the following blocks. In the x−u block, the partial result u from the x^2 block is subtracted from x. The result w from the x−u block is then used in the two following blocks, as shown in Figure 14. The block x+(c_1·w) computes s_1(x), the block 1+(c_2·w) computes s_2(x), the block 1+(c_3·(x_3−u_3)) computes s_3(x), and the block 1+(c_4·(x_4−u_4)) computes s_4(x). Note that in the blocks for the sub-functions s_3(x) and s_4(x), the index m addresses the MUX that selects the coefficients in the block, i.e. the most significant bits of x not used in x_3 and x_4, represented by i_0 and i_1 in Figure 14, control the MUXes.
Figure 14. The architecture of the implementation of the sine function.
4.1.5 Optimization of Word Length
As shown in (39) and (41), the absolute values of the coefficients decrease in size with increasing index number. In similarity to the word length of the coefficients, the word length of the (x_n − u_n) part, shown in Figure 14, decreases in size with increasing index number. The decreased word length will cause the size of the multiplier used in a sub-function to decrease as well, according to the highest valued bit in the coefficients and in the (x_n − u_n) part. Correspondingly, the sizes of the multipliers computing the products of the sub-functions can be analyzed; this analysis shows that some of the following multipliers can also be decreased in size.
4.1.6 Precision
Figure 15. Estimation of the relative error between the original function and different numbers of sub-functions.
In Figure 15, the resulting precision when using one to four sub-functions is shown. A decibel scale is used to visualize the precision, since binary numbers and dB work very well together: in the dB scale, a factor 2 equals 20·log10(2) ≈ 6 dB, and since 6 dB corresponds to 1 bit, this makes the results simpler to interpret. As shown in Figure 15, the relative error decreases with the number of sub-functions used. With 4 sub-functions, the accuracy is better than 14 bits, which would correspond to at least 14 adder stages in the CORDIC algorithm; a critical path of 14 adders is thus unavoidable there for the same result. In the general case, one single adder cell delay is added to the critical path for each new adder; in the CORDIC case, however, the sign bit from the previous stage must be ready before the next stage can start.
However, the delay also increases with the number of sub-functions, as shown in Table II.
TABLE II. NUMBER OF OPERATIONS IN RELATION TO THE NUMBER OF SUB-FUNCTIONS.

Number of sub-functions    Delay
1                          2 mult + 2 add
2                          3 mult + 2 add
3 to 4                     4 mult + 2 add
5 to 8                     5 mult + 2 add
If we assume that the delay of one multiplier is about two adders, we get a
delay of 10 adders compared to 14 in the CORDIC case.
5 Using the Methodology
It has been shown that the methodology of parabolic synthesis can directly compute the sine function, but the methodology is also able to compute other trigonometric functions and logarithms, as well as square root and division. In the following parts, algorithms for elementary functions are shown.
When describing the implementation of each function, the different parts are shown in a table. The first row of the table shows the function to be implemented and the interval in which the function is feasible to implement. The second row describes how to perform the normalization of the function. The third row shows the original function to be used when developing the approximation. The last row describes how to transform the approximation into the desired interval.
5.1.1 The Sine Function
When developing the algorithm that performs the approximation of the sine function, the normalization in the preprocessing part is performed as a substitution according to Table III. Since the outcome of the approximation has the desired form, no postprocessing is needed.

TABLE III. ALGORITHM FOR THE SINE FUNCTION.

                  Algorithm            Range
Function          f(v) = sin(v)        0 ≤ v < π/2
Preprocessing     x = (2/π)·v          0 ≤ x < 1
Processing        y = sin((π/2)·x)     0 ≤ y < 1
Postprocessing    –                    0 ≤ z < 1
5.1.2 The Cosine Function
The algorithm that performs the approximation of the cosine function is founded on the algorithm for the sine function. To perform the approximation of the cosine function, x is substituted with 1 − x in the preprocessing part of the sine approximation.

TABLE IV. ALGORITHM FOR THE COSINE FUNCTION.

                  Algorithm            Range
Function          f(v) = cos(v)        0 < v ≤ π/2
Preprocessing     x = 1 − (2/π)·v      0 ≤ x < 1
Processing        y = sin((π/2)·x)     0 ≤ y < 1
Postprocessing    –                    0 ≤ z < 1
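The sketch below illustrates the pre- and postprocessing wrappers of Tables III and IV in software. The processing step would be the parabolic synthesis approximation of sin((π/2)·x); math.sin stands in for it here so the sketch runs on its own.

```python
import math

# A minimal sketch of the wrappers in Tables III and IV.
def processing(x):                   # stand-in for the hardware core
    return math.sin(math.pi / 2 * x)

def sine(v):                         # Table III: 0 <= v < pi/2
    x = 2 / math.pi * v              # preprocessing
    return processing(x)             # no postprocessing needed

def cosine(v):                       # Table IV: 0 < v <= pi/2
    x = 1 - 2 / math.pi * v          # preprocessing reuses the sine core
    return processing(x)

print(cosine(0.5), math.cos(0.5))
```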
5.1.3 The Arcsine Function
When developing the algorithm that performs the approximation of the arcsine function, the methodology, as well as other methodologies, has a problem performing an approximation for angles larger than π/4. Therefore, the range of the approximation has been limited according to the range of the function in Table V. To satisfy the requirements of the methodology, a substitution according to Table V has to be performed in the preprocessing part. To get the desired outcome, the approximation is multiplied with a factor according to Table V.

TABLE V. ALGORITHM FOR THE ARCSINE FUNCTION.

                  Algorithm                   Range
Function          f(v) = arcsin(v)            0 ≤ v < 1/√2
Preprocessing     x = √2·v                    0 ≤ x < 1
Processing        y = (4/π)·arcsin(x/√2)      0 ≤ y < 1
Postprocessing    z = (π/4)·y                 0 ≤ z < π/4
5.1.4 The Arccosine Function
The algorithm that performs the approximation of the arccosine function is founded on the algorithm performing the approximation of the arcsine function. The difference between the two approximations lies in the transformation in the postprocessing part, as shown in Table VI.

TABLE VI. ALGORITHM FOR THE ARCCOSINE FUNCTION.

                  Algorithm                   Range
Function          f(v) = arccos(v)            0 ≤ v < 1/√2
Preprocessing     x = √2·v                    0 ≤ x < 1
Processing        y = (4/π)·arcsin(x/√2)      0 ≤ y < 1
Postprocessing    z = π/4 + (π/4)·(1 − y)     π/4 < z ≤ π/2
5.1.5 The Tangent Function
When developing the algorithm that performs the approximation of the tangent function, the angle range is from 0 to π/4, since the tangent function is not strictly concave or convex for larger angles. The normalization in the preprocessing part is performed as a substitution according to Table VII. Since the outcome of the approximation has the desired form, no postprocessing is needed.

TABLE VII. ALGORITHM FOR THE TANGENT FUNCTION.

                  Algorithm            Range
Function          f(v) = tan(v)        0 ≤ v < π/4
Preprocessing     x = (4/π)·v          0 ≤ x < 1
Processing        y = tan((π/4)·x)     0 ≤ y < 1
Postprocessing    –                    0 ≤ z < 1
5.1.6 The Arctangent Function
When developing the algorithm that performs the approximation of the
arctangent function, the approximation can only be performed in the range
from 0 to 1, where the function is strictly concave or convex. To get the
desired outcome, the approximation is multiplied by a factor in the
postprocessing part according to Table VIII.
TABLE VIII. ALGORITHM FOR THE ARCTANGENT FUNCTION.

                   Algorithm                 Range
  Function         f(v) = arctan(v)          0 ≤ v < 1
  Preprocessing    x = v                     0 ≤ x < 1
  Processing       y = arctan(x)·(4/π)       0 ≤ y < 1
  Postprocessing   z = (π/4)·y               0 ≤ z < π/4
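An illustrative model of Table VIII; the 4/π normalization sits inside the
processing stage, and the postprocessing undoes it.

    import math

    def arctangent_pipeline(v):
        # Preprocessing: the argument is already in [0, 1).
        x = v
        # Processing: arctangent normalized to [0, 1) by the factor 4/pi.
        y = math.atan(x) * (4.0 / math.pi)
        # Postprocessing: scale the outcome back into [0, pi/4).
        z = (math.pi / 4.0) * y
        return z

    print(arctangent_pipeline(0.5))  # ~0.4636 = arctan(0.5)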
5.1.7 The Logarithmic Function
When developing the algorithm that performs the approximation of the
base-two logarithm function, the approximation is only performed on the
mantissa of the floating-point number, since the exponent only acts as a
scaling of the mantissa. In the preprocessing part, a substitution
according to Table IX has to be performed to satisfy the normalization
criteria of the methodology. Since the outcome of the approximation already
has the desired form, no postprocessing is needed.
TABLE IX. ALGORITHM FOR THE LOGARITHM FUNCTION.

                   Algorithm                 Range
  Function         f(v) = log2(v)            1 ≤ v < 2
  Preprocessing    x = v − 1                 0 ≤ x < 1
  Processing       y = log2(1 + x)           0 ≤ y < 1
  Postprocessing   (none)                    0 ≤ z < 1
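The following sketch also shows how the exponent re-enters after the
mantissa has been processed; math.frexp and the helper name are
illustrative choices, not part of the thesis.

    import math

    def log2_pipeline(m):
        # Preprocessing: the mantissa m in [1, 2) becomes x in [0, 1).
        x = m - 1.0
        # Processing: approximated by parabolic synthesis in hardware.
        y = math.log2(1.0 + x)
        # No postprocessing needed.
        return y

    # The exponent only adds an offset: log2(m * 2**e) = e + log2(m).
    m, e = math.frexp(20.0)      # frexp returns m in [0.5, 1)
    m, e = 2.0 * m, e - 1        # rescale so that m is in [1, 2)
    print(e + log2_pipeline(m))  # ~4.3219 = log2(20)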
5.1.8 The Exponential Function
When developing the algorithm that performs the approximation of the
base-two exponential function, the approximation is only performed on the
fractional part of the exponent, since the integer part only results in a
scaling by a power of two. As shown in Table X, only a one needs to be
added in the postprocessing part to get the desired outcome.
TABLE X. ALGORITHM FOR THE EXPONENTIAL FUNCTION.

                   Algorithm                 Range
  Function         f(v) = 2^v                0 ≤ v < 1
  Preprocessing    x = v                     0 ≤ x < 1
  Processing       y = 2^x − 1               0 ≤ y < 1
  Postprocessing   z = 1 + y                 1 ≤ z < 2
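An illustrative model of Table X; the normalized original function
y = 2^x − 1 is the part that parabolic synthesis would approximate, and the
final line shows how the integer part of the exponent scales the result.

    def exp2_pipeline(f):
        # Preprocessing: the fractional part f is already in [0, 1).
        x = f
        # Processing: normalized original function y = 2**x - 1,
        # approximated by parabolic synthesis in hardware.
        y = 2.0 ** x - 1.0
        # Postprocessing: add a one to obtain 2**f in [1, 2).
        z = 1.0 + y
        return z

    # The integer part i only scales the result: 2**v = 2**i * 2**f.
    i, f = 3, 0.5
    print(2 ** i * exp2_pipeline(f))  # ~11.3137 = 2**3.5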
5.1.9 The Division Function
When developing the algorithm that performs the approximation of the
division, the range is limited according to Table XI, since the division
function is not strictly concave or convex outside this range. Both the
pre- and postprocessing parts need computation when performing the
approximation of the division.
TABLE XI. ALGORITHM FOR THE DIVISION FUNCTION.

                   Algorithm                 Range
  Function         f(v) = 1/v                0.5 < v ≤ 1
  Preprocessing    x = 2·(1 − v)             0 ≤ x < 1
  Processing       y = x/(2 − x)             0 ≤ y < 1
  Postprocessing   z = 1 + y                 1 ≤ z < 2
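The processing and postprocessing expressions above and in the sketch below
are inferred from the stated ranges rather than quoted verbatim: with
v = 1 − x/2, the reciprocal is 1/v = 1 + x/(2 − x). The sketch should
therefore be read as a plausible reconstruction, with the exact arithmetic
standing in for the hardware approximation.

    def reciprocal_pipeline(v):
        # Preprocessing (Table XI): x = 2*(1 - v) maps (0.5, 1] to [0, 1).
        x = 2.0 * (1.0 - v)
        # Processing (inferred): y = x/(2 - x) maps [0, 1) onto [0, 1)
        # and is the part approximated by parabolic synthesis.
        y = x / (2.0 - x)
        # Postprocessing (inferred): shift back into [1, 2).
        z = 1.0 + y
        return z

    print(reciprocal_pipeline(0.8))  # 1.25 = 1/0.8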
5.1.10 The Square Root Function
When developing the algorithm that performs the approximation of the square
root function, the range is limited according to Table XII. Both the pre-
and postprocessing parts need computation when performing the approximation
of the square root function.
TABLE XII. ALGORITHM FOR THE SQUARE ROOT FUNCTION.

                   Algorithm                       Range
  Function         f(v) = √(1 + v)                 1 ≤ v < 2
  Preprocessing    x = v − 1                       0 ≤ x < 1
  Processing       y = (√(2 + x) − √2)/(√3 − √2)   0 ≤ y < 1
  Postprocessing   z = √2 + y·(√3 − √2)            √2 ≤ z < √3
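An illustrative model of Table XII, with the exact square root standing in
for the hardware approximation and the helper name chosen for this sketch:

    import math

    SQRT2, SQRT3 = math.sqrt(2.0), math.sqrt(3.0)

    def sqrt_pipeline(v):
        # Preprocessing: v in [1, 2) becomes x in [0, 1).
        x = v - 1.0
        # Processing: normalized original function, approximated by
        # parabolic synthesis in hardware.
        y = (math.sqrt(2.0 + x) - SQRT2) / (SQRT3 - SQRT2)
        # Postprocessing: transform back into [sqrt(2), sqrt(3)).
        z = SQRT2 + y * (SQRT3 - SQRT2)
        return z

    print(sqrt_pipeline(1.5))  # ~1.5811 = sqrt(2.5)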
CHAPTER 6

Results
The most common methods for implementing an approximation of unary
functions in hardware are look-up tables, polynomials, table-based methods
combined with polynomials, and CORDIC. Computation by table look-up is
attractive since memory is much denser than random logic in VLSI
realizations. However, since the size of the look-up table grows
exponentially with the word length, both the table size and the execution
time become intolerable for longer word lengths; as an example, a table
with 16-bit input and 16-bit output already requires 2^16 · 16 = 1,048,576
bits of storage. Computation by polynomials is attractive since it is
ROM-less; the disadvantage is that it can impose large computational
complexity and long delays. Computation by table-based methods combined
with polynomials is attractive since it reduces the computational
complexity and decreases the delays, but since the size of the look-up
tables grows with the required accuracy, the execution time also increases
with the needed accuracy. Computation using the CORDIC algorithm is
attractive since it uses an angular rotation algorithm that can be
implemented with small look-up tables and hardware limited to simple shifts
and additions. The CORDIC algorithm is, however, an iterative method with
high latency and long delays, which makes it unsuitable for applications
where a short execution time is essential.

In all methods, including the proposed one, there is a trade-off between
computational complexity and memory storage. By using parallelism in the
computation and parabolic synthesis in the recombination process, the
proposed methodology gets a short critical path, which assures fast
computation.
CHAPTER 7

Conclusions
A novel methodology for implementing hardware approximations of unary
functions, such as trigonometric functions, logarithmic functions, square
root, and division, is introduced. The architecture of the processing part
automatically gives a high degree of parallelism. The methodology for
developing the approximation algorithm is founded on parabolic synthesis.
Since the methodology is, in addition, founded on operations that are
simple to realize in hardware, such as additions, shifts, and
multiplications, the hardware implementation is simple to perform. One of
the most important characteristics of the resulting hardware is its
parallelism, which gives a short critical path and fast computation. The
structure of the methodology also assures an area-efficient hardware
implementation. The methodology is, moreover, suitable for automatic
synthesis.
CHAPTER 8

Future Work
The feasibility of the parabolic synthesis methodology has not yet been
fully investigated, and several issues remain open. One computational issue
is word-length optimization of the data flow in the architecture. A related
hardware implementation issue is that the weight of the coefficients
decreases with increasing order of the sub-functions. Addressing both
issues can result in better area efficiency and computation rate for the
hardware implementation.

To assess the feasibility of the methodology, it is also important to
compare it with other existing methodologies for implementing
approximations of unary functions.

When analyzing the help functions, it is found that they are piecewise
continuous functions. Since a help function represents the error of an
approximation, and since this error decreases with increasing sub-function
order, it can be interesting to use the parabolic synthesis methodology as
a preprocessing part to an interpolation methodology. This approach can be
expected to increase both the area efficiency and the computation rate of
the approximation.
List of Figures
Figure 1.  Linear interpolation for computing f(x). .... 19
Figure 2.  Example of a linear spline approximation. .... 21
Figure 3.  Vector rotation. .... 23
Figure 4.  Example of normalized function, in this case sin((π/2)·x). .... 28
Figure 5.  Two possible results after dividing an original function with x. .... 30
Figure 6.  Example of the first function f1(x) compared with sub-function s2(x). .... 31
Figure 7.  Example of the second function f2(x). .... 32
Figure 8.  The hardware architecture of the methodology. .... 34
Figure 9.  The principle of iterative hardware architecture. .... 35
Figure 10. The architecture principle for four sub-functions. .... 36
Figure 11. Squaring algorithm for the partial product x_n^2. .... 36
Figure 12. Simplified squaring algorithm for the partial product x_n^2. .... 37
Figure 13. The function f(v) before normalization and the original function f_org(x). .... 41
Figure 14. The architecture of the implementation of the sine function. .... 45
Figure 15. Estimation of the relative error between the original function and different numbers of sub-functions. .... 46
List of Tables
TABLE I.    APPROXIMATION SCHEMES. .... 20
TABLE II.   NUMBER OF OPERATIONS IN RELATION TO THE NUMBER OF SUB-FUNCTIONS. .... 47
TABLE III.  ALGORITHM FOR THE SINE FUNCTION. .... 49
TABLE IV.   ALGORITHM FOR THE COSINE FUNCTION. .... 50
TABLE V.    ALGORITHM FOR THE ARCSINE FUNCTION. .... 50
TABLE VI.   ALGORITHM FOR THE ARCCOSINE FUNCTION. .... 51
TABLE VII.  ALGORITHM FOR THE TANGENT FUNCTION. .... 51
TABLE VIII. ALGORITHM FOR THE ARCTANGENT FUNCTION. .... 52
TABLE IX.   ALGORITHM FOR THE LOGARITHM FUNCTION. .... 52
TABLE X.    ALGORITHM FOR THE EXPONENTIAL FUNCTION. .... 53
TABLE XI.   ALGORITHM FOR THE DIVISION FUNCTION. .... 53
TABLE XII.  ALGORITHM FOR THE SQUARE ROOT FUNCTION. .... 53
List of Acronyms
ASIC     Application-Specific Integrated Circuit
CORDIC   COordinate Rotation DIgital Computer
DSP      Digital Signal Processor or Digital Signal Processing
MUX      MUltipleXer
ROM      Read-Only Memory
VLSI     Very-Large-Scale Integration