Matrix Inversion Using QR Decomposition by Parabolic Synthesis

MASTER’S THESIS
Matrix Inversion Using QR Decomposition
by Parabolic Synthesis
By
Nafiz Ahmed Chisty
Supervisors:
Professor Peter Nilsson and MScEE Erik Hertz
Department of Electrical and Information Technology
Faculty of Engineering, LTH, Lund University
SE-221 00 Lund, Sweden
Abstract
Parabolic synthesis is a recent methodology, proposed by Professor Peter Nilsson and Erik Hertz of the
EIT department at Lund University (LTH), for the implementation of unary functions in hardware.
Preceding research conducted at Lund University has shown that parabolic synthesis is an effective
solution, which is both faster and consumes less area than the existing methods.
The goal of this Master's thesis is to develop hardware for the generation of three trigonometric functions
(+sine, -sine and +cosine) using this novel approximation methodology, for use in the Givens rotations
of a matrix inversion based on QR decomposition.
Two hardware designs have been developed, one for the parabolic synthesis and another for the matrix
inversion, but due to time limitations the two designs could not be integrated. The thesis mainly focuses
on the implementation of the three trigonometric functions on both FPGA and ASIC and compares the
results in terms of speed and area. Finally, a hardware solution for the overall system is proposed.
Acknowledgements
The greatest gratitude after GOD goes to Professor Peter Nilsson, my Master's Thesis supervisor and
examiner, who not only helped me with kind, insightful advice and encouragement but also made it
possible for me to finish the thesis successfully by going beyond my expectations in assisting me with
various academic and administrative issues.
I would also like to thank my other supervisor, MScEE Erik Hertz, for his kind support and guidance.
Last but not least, I would like to thank my beloved family and all my friends for their constant moral
support and encouragement.
Nafiz Ahmed Chisty
Master’s in System-on-Chip
Lund University (LTH)
Lund, Sweden
January 2012
Contents
Abstract
Acknowledgements
CHAPTER 1
1 Introduction
CHAPTER 2
2 Matrix Inversion by QR decomposition using Givens rotation
2.1 Matrix inversion
2.1.1 Properties of inverse matrix
2.2 QR decomposition
2.3 Givens Rotations
2.4 QRD Using Givens rotations
2.4.1 Triangularization for QRD
2.4.2 The inverse matrix for QRD
2.5 Hardware for inverse matrix for QRD
CHAPTER 3
3 Parabolic Synthesis Methodology
3.1 Introduction
3.2 Other Hardware Approximation Methods
3.2.1 Advantage of Hardware over Software
3.2.2 Disadvantage of Look up table
3.2.3 Disadvantage of using polynomials
3.2.4 Disadvantage of the CORDIC
3.3 Parabolic Synthesis methodology
3.3.1 Normalizing
3.3.2 Developing the Hardware Architecture
3.3.3 Methodology for developing sub-functions
3.3.4 Hardware Implementation
3.3.5 Postprocessing
CHAPTER 4
4 Parabolic Architecture development for Trigonometric functions
4.1 Design Methodology
4.2 The sub-functions
4.3 Angle transformation
4.4 Simplifications
4.4.1 The MCM unit
4.4.2 Eliminating an adder
4.4.3 Two's complement conversion
4.4.4 Squarer
4.5 Final architecture
4.6 Wordlengths
4.7 Hardware for matrix inversion using QRD by Parabolic synthesis
CHAPTER 5
5 Synthesis
5.1 Synthesis
5.2 Types of Synthesis
5.3 Synthesis for FPGA
5.3.1 Synthesis Report from FPGA
5.4 Synthesis for ASIC
5.4.1 Minimum Area Synthesis
5.4.2 High Speed Synthesis
5.4.3 Results from ASIC Synthesis
CHAPTER 6
6 Results and Conclusion
CHAPTER 7
7 Future Work
Reference
Appendix 1: For Synthesis
CHAPTER 1
1 Introduction
The demand for fast and small hardware architectures is increasing day by day. Most hardware uses
unary functions such as trigonometric functions, logarithms, square roots and divisions.
These functions are extensively used in applications like robotics, signal processing, communication
systems, navigation, fluid physics, etc. The overall performance of the system depends on the
methods used to compute such functions. In many cases, software solutions are not sufficient and a
hardware implementation is required [1].
For low-precision computations, the simplest and fastest way to implement such functions is a single
look-up table. However, for high-precision computations this method becomes inappropriate
due to the large table size and long execution time. Implementation using polynomial approximation also
has large computational complexity and delay due to the extensive use of multiplications and divisions [1].
The Coordinate Rotation Digital Computer (CORDIC) algorithm is a popular method for computing
unary functions using only simple shift-add operations. Although this method is faster
than a software solution, its iterative nature makes it slow and thus unsuitable for high-speed
applications [1].
Parabolic Synthesis, on the other hand, a methodology proposed by Professor Peter Nilsson and Erik
Hertz, is based on developing functions that approximate the original unary functions
in hardware. This method uses parallelism to reduce execution time and employs low-complexity
operations, thus making the hardware implementation faster and simpler than the other existing
methodologies [1].
This thesis mainly develops hardware for the generation of three trigonometric functions (+sine, -sine and
+cosine) using parabolic synthesis. Later, a hardware architecture is proposed for implementing matrix
inversion using QR decomposition.
This thesis consists of seven chapters:
Chapter 1 deals with the motivation behind this thesis work.
Chapter 2 explains the basics of matrix inversion, with details of QR decomposition using Givens
rotations.
Chapter 3 introduces the novel Parabolic synthesis methodology.
Chapter 4 explains the development of the architecture for the generation of trigonometric functions using
parabolic synthesis. A hardware solution is also shown for implementing matrix inversion with QR
decomposition using the implemented parabolic architecture.
Chapter 5 deals with the synthesis procedure and discusses the results obtained from FPGA and ASIC
synthesis.
Chapter 6 discusses the results obtained from the thesis work and concludes it.
Chapter 7 suggests some future prospects of the implemented design with improvements.
CHAPTER 2
2 Matrix Inversion by QR decomposition using Givens rotation
2.1 Matrix inversion
In linear algebra, for an n-by-n square matrix A, matrix inversion is the process of finding the matrix B
that satisfies (2.1):
$$AB = BA = I_n \tag{2.1}$$
where $I_n$ denotes the n-by-n identity matrix. The inverse of A is denoted by $A^{-1}$.
Non-square matrices (m-by-n matrices for which m ≠ n) do not have an inverse but may have a left
inverse or right inverse. A square matrix that is not invertible is called singular [3][4].
2.1.1 Properties of inverse matrix
Some of the most important properties of an invertible matrix A are:
$$(A^{-1})^{-1} = A \tag{2.2}$$
$$(kA)^{-1} = k^{-1} A^{-1} \quad \text{for nonzero scalar } k \tag{2.3}$$
$$(A^T)^{-1} = (A^{-1})^T \tag{2.4}$$
$$\det(A^{-1}) = \det(A)^{-1} \tag{2.5}$$
2.2 QR decomposition
QR decomposition is an efficient and frequently used methodology when matrix inversion is needed. A
typical application area is mobile communication systems using multiple antennas, i.e. Multi-Input
Multi-Output (MIMO) systems. The QR decomposition factorizes a matrix into an orthogonal and an upper
triangular matrix.
$$A = Q \cdot R \tag{2.6}$$
where R is an upper triangular matrix and Q is orthogonal, that is, for real-valued matrices
$$I = Q \cdot Q^T \tag{2.7}$$
where I is the identity matrix and $Q^T$ is the transpose of Q.
MIMO systems often use 4 transmit and 4 receive antennas, so a 4-by-4 matrix often has to be
inverted at the receiver side. We thus get a system like:
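$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} \tag{2.8}$$
$$Q = \begin{bmatrix} q_{11} & q_{12} & q_{13} & q_{14} \\ q_{21} & q_{22} & q_{23} & q_{24} \\ q_{31} & q_{32} & q_{33} & q_{34} \\ q_{41} & q_{42} & q_{43} & q_{44} \end{bmatrix} \tag{2.9}$$
$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ 0 & r_{22} & r_{23} & r_{24} \\ 0 & 0 & r_{33} & r_{34} \\ 0 & 0 & 0 & r_{44} \end{bmatrix} \tag{2.10}$$
where R is an upper triangular matrix.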
QR decomposition can be computed using several methods, such as the Gram-Schmidt process, Householder
transformations, or Givens rotations. Each has a number of advantages and disadvantages. In this thesis,
we use the Givens rotation method for computing the QR decomposition since it can be parallelized and
has a lower operation count [5].
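2.3 Givens Rotations
A Givens rotation is a rotation in the plane spanned by two coordinate axes, i and j, which is represented
by a matrix of the form
$$G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & -s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \tag{2.11}$$
where c = cos(θ) and s = sin(θ) appear in rows and columns i and j, and all other entries are those of
the identity matrix [6].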
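The structure of (2.11) can be illustrated with a minimal Python sketch. The helper names (`givens`, `matmul`, `transpose`) are hypothetical and use 0-based indices; the sign convention follows the matrices in (2.13)-(2.15). The sketch only checks that the rotation is orthogonal, it is not part of the thesis hardware.

```python
import math

def givens(n, i, j, theta):
    """n-by-n Givens rotation in the (i, j) plane, 0-based indices with i < j."""
    c, s = math.cos(theta), math.sin(theta)
    g = [[1.0 if r == k else 0.0 for k in range(n)] for r in range(n)]
    g[i][i] = c
    g[j][j] = c
    g[i][j] = s      # +s above the diagonal, as in (2.13)-(2.15)
    g[j][i] = -s     # -s below the diagonal
    return g

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

# A rotation by 30 degrees in the plane of axes 1 and 3 of a 3-by-3 system.
g = givens(3, 0, 2, math.pi / 6)
gtg = matmul(transpose(g), g)   # should be the identity, since G is orthogonal
```

Because the matrix is orthogonal, its inverse is simply its transpose, which is what makes the Q matrix of the decomposition cheap to invert.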
2.4 QRD Using Givens rotations
Givens rotations can be used to perform QR decomposition. The process applies a sequence of
rotations, each of which nulls one element below the diagonal of the matrix, forming the R matrix as
shown in (2.10). The orthogonal Q matrix, as shown in (2.9), is obtained by concatenating all
the Givens rotations [6].
2.4.1 Triangularization for QRD
A 3 by 3 input matrix, A1, is given in (2.12)
$$A_1 = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \tag{2.12}$$
From A1, we will determine the matrices A2 and A3 as well. In order to find a triangular matrix, R, three
rotations are needed, where one element is set to zero after each rotation. It can for instance be done in the
order (3, 1), (2, 1), and (3, 2). For that, three Givens rotation matrices are needed, G1, G2, and G3, as
defined below.
$$G_1 = \begin{bmatrix} c & 0 & s \\ 0 & 1 & 0 \\ -s & 0 & c \end{bmatrix} \tag{2.13}$$
$$G_2 = \begin{bmatrix} c & s & 0 \\ -s & c & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2.14}$$
$$G_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & c & s \\ 0 & -s & c \end{bmatrix} \tag{2.15}$$
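where c = cos(θ) and s = sin(θ). Each pair of values is calculated from the matrix that the rotation is
applied to, i.e. A1 for G1, A2 for G2 and A3 for G3:
$$c_1 = \frac{A_1(1,1)}{\sqrt{A_1(1,1)^2 + A_1(3,1)^2}}, \qquad s_1 = \frac{A_1(3,1)}{\sqrt{A_1(1,1)^2 + A_1(3,1)^2}} \tag{2.16}$$
$$c_2 = \frac{A_2(1,1)}{\sqrt{A_2(1,1)^2 + A_2(2,1)^2}}, \qquad s_2 = \frac{A_2(2,1)}{\sqrt{A_2(1,1)^2 + A_2(2,1)^2}} \tag{2.17}$$
$$c_3 = \frac{A_3(2,2)}{\sqrt{A_3(2,2)^2 + A_3(3,2)^2}}, \qquad s_3 = \frac{A_3(3,2)}{\sqrt{A_3(2,2)^2 + A_3(3,2)^2}} \tag{2.18}$$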
However, these operations include square root, squaring, and division, which are costly and poorly suited
to a direct hardware implementation.
In (2.19), the matrices A2 and A3 are determined.
$$A_2 = G_1 \cdot A_1, \qquad A_3 = G_2 \cdot A_2 \tag{2.19}$$
Finally, the Q and R matrices can be determined, as shown in (2.20).
$$Q = G_1^T \cdot G_2^T \cdot G_3^T, \qquad R = G_3 \cdot A_3 \tag{2.20}$$
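The triangularization steps above can be sketched in floating-point Python. This is only a numerical illustration under stated assumptions: the function name `givens_qr` and the test matrix are hypothetical, indices are 0-based, and each pair (c, s) is computed from the current matrix with a square root and a division, which is exactly what the hardware design aims to avoid.

```python
import math

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def givens_qr(a):
    """QR decomposition by Givens rotations; returns (Q, R) with A = Q * R."""
    n = len(a)
    r = [[float(v) for v in row] for row in a]
    q = identity(n)
    # For a 3-by-3 matrix this visits (3,1), (2,1), (3,2) in 1-based notation.
    for col in range(n - 1):
        for row in range(n - 1, col, -1):
            h = math.hypot(r[col][col], r[row][col])
            if h == 0.0:
                continue                      # element is already zero
            c, s = r[col][col] / h, r[row][col] / h
            g = identity(n)
            g[col][col] = c
            g[row][row] = c
            g[col][row] = s
            g[row][col] = -s
            r = matmul(g, r)                  # A_{k+1} = G_k * A_k
            q = matmul(q, transpose(g))       # Q = G_1^T * G_2^T * ...
    return q, r

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
Q, R = givens_qr(A)
```

Multiplying `Q` and `R` back together reproduces `A` up to rounding, and the entries of `R` below the diagonal are zero up to rounding.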
2.4.2 The inverse matrix for QRD
In (2.21) the inverse of A is derived. For that, the inverse of R is needed, which is a straightforward
operation since R is upper triangular. Since Q is orthogonal, its inverse is its transpose, which is
basically done with memory operations.
$$A = Q \cdot R$$
$$A^{-1} = (Q \cdot R)^{-1}$$
$$A^{-1} = R^{-1} \cdot Q^{-1} \tag{2.21}$$
$$A^{-1} = R^{-1} \cdot Q^T$$
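A minimal sketch of (2.21) in Python, assuming Q and R are already available. The function names and the 2-by-2 example values are hypothetical; the upper triangular inverse is computed column by column with back substitution.

```python
def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def invert_upper_triangular(r):
    """Inverse of an upper triangular matrix by back substitution."""
    n = len(r)
    x = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x[j][j] = 1.0 / r[j][j]
        for i in range(j - 1, -1, -1):
            acc = sum(r[i][k] * x[k][j] for k in range(i + 1, j + 1))
            x[i][j] = -acc / r[i][i]
    return x

def inverse_from_qr(q, r):
    """A^-1 = R^-1 * Q^T, as in (2.21)."""
    return matmul(invert_upper_triangular(r), transpose(q))

# Example: Q is a plane rotation with c = 0.6, s = 0.8, R is upper triangular.
Q = [[0.6, -0.8], [0.8, 0.6]]
R = [[5.0, 1.0], [0.0, 2.0]]
A = matmul(Q, R)
A_inv = inverse_from_qr(Q, R)
```

Only the diagonal of R is divided by, so the number of divisions grows linearly with the matrix size while the remaining work is multiply-accumulate.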
2.5 Hardware for inverse matrix for QRD
Using equations (2.12)-(2.21), the basic hardware for obtaining the inverse matrix is shown in Fig.
2.1.
Figure 2.1. Basic hardware for obtaining matrix inversion using QRD.
CHAPTER 3
3 Parabolic Synthesis Methodology
3.1 Introduction
Parabolic Synthesis is a method based on developing functions that approximate the original
unary functions in hardware. This method uses parallelism to reduce execution time and employs
low-complexity operations like shifts, additions, and multiplications that are simple to implement in
hardware, thus making the hardware implementation faster and simpler than the other existing
methodologies [1].
3.2 Other Hardware Approximation Methods
The demand for hardware approximations of elementary functions is increasing
with the passage of time. The goal is to make the implemented hardware fast while at the same time
keeping the area consumption to a minimum.
3.2.1 Advantage of Hardware over Software
Most hardware uses unary functions such as trigonometric functions, logarithms, square roots and
divisions. These functions are extensively used in applications like robotics, signal processing,
communication systems, navigation, fluid physics, etc. The overall performance of the system
depends on the methods used to compute such functions. In many cases, software solutions are not
sufficient and a hardware implementation is required [1].
Some popular hardware approximation methods are the single look-up table, approximation using
polynomials, and CORDIC.
3.2.2 Disadvantage of Look up table
For low-precision computations, the simplest and fastest way to implement such functions is
a single look-up table. However, for high-precision computations this method becomes inappropriate
due to the large table size and long execution time [1].
3.2.3 Disadvantage of using polynomials
This method is also known as a ROM-less system. Implementation using polynomial approximation
has large computational complexity and delay due to the extensive use of multiplications and divisions [1].
3.2.4 Disadvantage of the CORDIC
The Coordinate Rotation Digital Computer (CORDIC) algorithm is a popular method for computing
unary functions using only simple shift-add operations. Although this method is faster
than a software solution, its iterative nature makes it slow and thus unsuitable for high-speed
applications [1].
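The iterative property can be seen in a small floating-point sketch of rotation-mode CORDIC. This is a generic textbook formulation, not taken from the thesis; the function name and the default of 16 iterations are assumptions. Each pass through the loop depends on the previous one and refines the result by roughly one bit, which is why the latency grows with the required precision.

```python
import math

def cordic_sin_cos(theta, iterations=16):
    """Rotation-mode CORDIC sketch for |theta| <= pi/2; returns (sin, cos)."""
    # Accumulated gain of all micro-rotations, compensated up front.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0          # rotate towards z = 0
        # In hardware the factors 2**-i are plain right shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)          # table of arctan constants
    return y, x

s, c = cordic_sin_cos(math.pi / 5)
```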
3.3 Parabolic Synthesis methodology
This is a method for hardware implementation of approximations of unary functions using parallelism and
low-complexity operations. The method consists of three steps: normalization, processing and
postprocessing. Of these, the processing step is the most important, but the other two
steps are also necessary in some cases [1] [2].
3.3.1 Normalizing
This is the first step of the Parabolic synthesis methodology. The purpose of this step is to limit the
numerical range to the interval 0 ≤ x < 1 on the x-axis and 0 ≤ y < 1 on the y-axis to facilitate the
hardware implementation. The unary function is normalized to either a concave or a convex function, known
as the original function $f_{org}(x)$, with starting coordinate (0,0) and ending coordinate smaller than
(1,1) [1] [2].
Figure 3.1. Example of normalized function [1].
3.3.2 Developing the Hardware Architecture
For efficient hardware architecture development, this methodology is founded on second order parabolic
functions called sub-functions, $s_n(x)$, which use low-complexity operations like shifts, additions and
multiplications. Multipliers can be used freely because the ongoing scaling of semiconductor
technologies and the fast development of efficient multiplier architectures have made hardware
multiplication efficient in both size and execution time. As shown in (3.1),
the original function $f_{org}(x)$ can be obtained by multiplying the sub-functions, and its accuracy
depends on the number of sub-functions used [1] [2].
$$f_{org}(x) = s_1(x) \cdot s_2(x) \cdot s_3(x) \cdot s_4(x) \tag{3.1}$$
A parabolic-looking function called the first help-function, $f_1(x)$, is obtained by dividing the original
function $f_{org}(x)$ by the first sub-function $s_1(x)$.
$$f_1(x) = \frac{f_{org}(x)}{s_1(x)} \tag{3.2}$$
The remaining help-functions are generated as shown in (3.3).
$$f_{n+1}(x) = \frac{f_n(x)}{s_{n+1}(x)} \tag{3.3}$$
3.3.3 Methodology for developing sub-functions
Sub-functions are developed by decomposing the original function $f_{org}(x)$ into second order
parabolic functions within the interval 0 ≤ x < 1.0 and within sub-intervals of that interval [1] [2].
3.3.3.1 The first sub-function
The first sub-function $s_1(x)$ is obtained by dividing the original function $f_{org}(x)$ by x as a first
order approximation. The division produces two possible results, one where f(x) > 1 and one where f(x) < 1,
as shown in Fig. 3.2 [1] [2].
Figure 3.2. Two possible results after dividing an original function with x [1].
The first sub-function $s_1(x)$, as shown in equation (3.4), is achieved by multiplying x with the expression
1+(c1·(1-x)), where the coefficient c1 is the limit of the original function divided by x,
minus 1, according to (3.5) [1] [2].
$$s_1(x) = x \cdot (1 + [c_1 \cdot (1 - x)]) = x + c_1 \cdot (x - x^2) \tag{3.4}$$
$$c_1 = \lim_{x \to 0} \frac{f_{org}(x)}{x} - 1 \tag{3.5}$$
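As a worked example of (3.4) and (3.5), the sketch below estimates c1 numerically for the normalized sine, assuming $f_{org}(x) = \sin(\pi x / 2)$ as used for the trigonometric functions in Chapter 4. The function names and the step `eps` are hypothetical. Analytically the limit gives $c_1 = \pi/2 - 1 \approx 0.5708$.

```python
import math

def f_org(x):
    """Normalized sine on 0 <= x < 1 (assumed original function)."""
    return math.sin(math.pi * x / 2.0)

def estimate_c1(f, eps=1e-6):
    """Numerical estimate of c1 = lim_{x->0} f(x)/x - 1, as in (3.5)."""
    return f(eps) / eps - 1.0

def s1(x, c1):
    """First sub-function, as in (3.4)."""
    return x + c1 * (x - x * x)

c1 = estimate_c1(f_org)
c1_7bit = round(c1 * 128) / 128   # rounded to 7 fractional bits
```

Rounded to 7 fractional bits this gives 73/128 = 0.5703125, which agrees with the coefficient c1 listed in (4.3).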
3.3.3.2 The second sub-function
The second sub-function $s_2(x)$ is chosen as a second order parabolic function, as shown in (3.6) [1] [2].
$$s_2(x) = 1 + [c_2 \cdot (x - x^2)] \tag{3.6}$$
The coefficient c2 is chosen such that the quotient between the first help-function $f_1(x)$
and the second sub-function $s_2(x)$ is equal to 1 when x is set to 0.5, as shown below.
$$c_2 = 4 \cdot [f_1(0.5) - 1] \tag{3.7}$$
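A short numerical sketch of (3.2), (3.6) and (3.7), again assuming the normalized sine $f_{org}(x) = \sin(\pi x / 2)$ and the analytic value $c_1 = \pi/2 - 1$; all function names are hypothetical. It shows that the resulting second help-function equals 1 at x = 0.5, which is the condition used to choose c2.

```python
import math

C1 = math.pi / 2.0 - 1.0          # limit from (3.5) for the assumed sine

def f_org(x):
    return math.sin(math.pi * x / 2.0)

def s1(x):
    return x + C1 * (x - x * x)   # (3.4)

def f1(x):
    return f_org(x) / s1(x)       # first help-function, (3.2)

c2 = 4.0 * (f1(0.5) - 1.0)        # (3.7)

def s2(x):
    return 1.0 + c2 * (x - x * x) # (3.6)

def f2(x):
    return f1(x) / s2(x)          # second help-function, (3.3) with n = 1

f2_mid = f2(0.5)                  # equals 1 by construction
```

The value obtained this way is about 0.40, close to but not identical to the 7-bit coefficient 0.40625 in (4.3), which the thesis describes as optimized.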
In this manner the second help-function $f_2(x)$ takes the shape of a lying S, as shown in Figure 3.3.
This help-function can be divided into a pair of parabolic-looking shapes, where the first interval is
0 ≤ x < 0.5 and the second interval is 0.5 ≤ x < 1.0 [2].
Figure 3.3 Example of the second help function [2].
For easy hardware implementation, the number of sub-intervals is chosen as a power of 2, since
the normalization of the interval can then be performed as a left shift of x, where the fractional part is
the normalized value within the new interval and the integer part is the address of the
coefficients for that interval [1] [2].
3.3.3.3 Sub-functions when n > 2
It is beyond the scope of this thesis to evaluate sub-functions for n > 2.
3.3.4 Hardware Implementation
Two's complement representation is used for the hardware implementation. The implementation is
divided into three hardware parts: preprocessing, processing, and postprocessing, as shown in Figure 3.4.
[Figure 3.4: Operand v -> Preprocessing -> Operand x -> Processing -> Operand y -> Postprocessing -> Result z]
Figure 3.4 The hardware architecture of the methodology [1].
3.3.4.1 Preprocessing
In this part the input operand v is normalized for the processing part. For a large system the
preprocessing part can be reduced or eliminated if the approximation is implemented together
with other logic in the preceding block [1].
3.3.4.2 Processing
In this part, the approximation of the original function is implemented in either an iterative or a
parallel hardware architecture. The iterative architecture, shown in Figure 3.6, has the
advantage of a small chip area but at the expense of a longer computation time [1].
Figure 3.6 The principle of iterative hardware architecture [1].
The parallel hardware architecture, on the other hand, shown for four sub-functions in Figure 3.7,
gives a short critical path and fast computation at the price of a larger chip area. The throughput can be
increased by pipelining.
Figure 3.7 The architecture principle for four sub-functions [1].
3.3.4.3 The square Unit
Squares like $x^2$ and $x_n^2$ are recurring operations in the sub-functions. The square
$x_n^2$ in the parallel hardware architecture is a partial result of $x^2$. That is why a dedicated squarer
has been developed [1].
Figure 3.8 Squaring algorithm for the partial product $x_n^2$ [1].
Figure 3.9 Simplified squaring algorithm for the partial product $x_n^2$ [1].
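In each of the chains below, two equal cross terms are merged into one term with doubled weight; that term
belongs to the next higher bit position and is therefore dropped from the component in the last step.
Vector p:
$$p_0 = x_0 x_0 = x_0 \tag{3.8}$$
The component $p_1$ is equal to 0, as $p_0$ does not contribute anything to $p_1$.
Vector q:
$$q_1 = p_1 \cdot 2^1 + x_1 x_0 \cdot 2^1 + x_0 x_1 \cdot 2^1 = p_1 \cdot 2^1 + x_1 x_0 \cdot 2^2 = p_1 \cdot 2^1 \tag{3.9}$$
$$q_2 = x_1 x_0 \cdot 2^2 + x_1 x_1 \cdot 2^2 = x_1 \cdot 2^2 + x_1 x_0 \cdot 2^2 \tag{3.10}$$
Vector r:
$$r_2 = q_2 \cdot 2^2 + x_2 x_0 \cdot 2^2 + x_0 x_2 \cdot 2^2 = q_2 \cdot 2^2 + x_2 x_0 \cdot 2^3 = q_2 \cdot 2^2 \tag{3.11}$$
$$r_3 = q_3 \cdot 2^3 + x_2 x_1 \cdot 2^3 + x_1 x_2 \cdot 2^3 + x_2 x_0 \cdot 2^3 = q_3 \cdot 2^3 + x_2 x_1 \cdot 2^4 + x_2 x_0 \cdot 2^3 = q_3 \cdot 2^3 + x_2 x_0 \cdot 2^3 \tag{3.12}$$
$$r_4 = x_2 x_1 \cdot 2^4 + x_2 x_2 \cdot 2^4 = x_2 \cdot 2^4 + x_2 x_1 \cdot 2^4 \tag{3.13}$$
Vector s:
$$s_3 = r_3 \cdot 2^3 + x_3 x_0 \cdot 2^3 + x_0 x_3 \cdot 2^3 = r_3 \cdot 2^3 + x_3 x_0 \cdot 2^4 = r_3 \cdot 2^3 \tag{3.14}$$
$$s_4 = r_4 \cdot 2^4 + x_3 x_0 \cdot 2^4 + x_3 x_1 \cdot 2^4 + x_1 x_3 \cdot 2^4 = r_4 \cdot 2^4 + x_3 x_0 \cdot 2^4 + x_3 x_1 \cdot 2^5 = r_4 \cdot 2^4 + x_3 x_0 \cdot 2^4 \tag{3.15}$$
$$s_5 = r_5 \cdot 2^5 + x_3 x_1 \cdot 2^5 + x_3 x_2 \cdot 2^5 + x_2 x_3 \cdot 2^5 = r_5 \cdot 2^5 + x_3 x_1 \cdot 2^5 + x_3 x_2 \cdot 2^6 = r_5 \cdot 2^5 + x_3 x_1 \cdot 2^5 \tag{3.16}$$
$$s_6 = x_3 x_2 \cdot 2^6 + x_3 x_3 \cdot 2^6 = x_3 \cdot 2^6 + x_3 x_2 \cdot 2^6 \tag{3.17}$$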
The value of component $s_7$ in the s vector is a possible carry from the $s_6$ component. The square of
x, $x^2$, is found in the s vector, and the partial products of the square are found in the r vector for
$x_3^2$ and in the q vector for $x_4^2$ [1].
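The idea that the squares of the lower bits are partial results of the full square can be checked with a small integer model. This is a behavioural sketch under assumptions, not the gate-level structure of Figures 3.8 and 3.9: the function name is hypothetical, bit 0 is the least significant bit as in (3.8)-(3.17), and the returned list plays the role of the values held in the p, q, r and s vectors for a 4-bit input.

```python
def square_partials(x, width=4):
    """Return [sq_1, ..., sq_width]; sq_k is the square of the k lowest bits of x."""
    partials = []
    low = 0   # value of the bits processed so far
    sq = 0    # square of `low`
    for k in range(width):
        if (x >> k) & 1:
            # (2^k + low)^2 = 2^(2k) + low * 2^(k+1) + low^2
            # The x_k * x_k term has weight 2^(2k); each cross term x_k * x_j
            # appears twice and therefore moves up to weight 2^(k+j+1).
            sq += (1 << (2 * k)) + (low << (k + 1))
            low |= 1 << k
        partials.append(sq)
    return partials

p, q, r, s = square_partials(0b1011)   # s is the full square, 11 * 11 = 121
```

Only additions and shifts are used, and every intermediate value is itself a valid square of a truncated operand, so no separate squarer is needed for the partial operands.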
3.3.5 Postprocessing
The main purpose of this part is to transform the output to a form that is suitable for the subsequent
parts in the system.
CHAPTER 4
4 Parabolic Architecture development for Trigonometric functions
4.1 Design Methodology
From chapter 3, we have seen that the parabolic architecture uses parallelism to reduce the execution time
and employs low-complexity operations, thus making the hardware implementation faster and simpler than
the other existing methodologies.
The first step in developing the architecture is to define the specifications and the requirements. The
second step is to develop the behavioral description using a hardware description language like VHDL,
which leads to the development of a register transfer level (RTL) model. The RTL model is simulated
using a testbench to verify the defined logic and to find possible errors. After successful
simulation, the design is synthesized for either an FPGA or an ASIC. The synthesis converts the RTL
model into a design implementation in terms of logic gates. This gate-level model is simulated again
before the actual implementation of the architecture in hardware, a process known as post-synthesis
simulation. If that simulation succeeds, the layout is sent for fabrication. The top-down design
methodology is shown in Figure 4.1.
[Figure 4.1: Requirements -> Behavioral Model -> RTL Model -> Simulate (Test Bench) -> Synthesis -> Gate-Level Model -> Simulate (Test Bench) -> Layout (ASIC or FPGA)]
Figure 4.1 Top down design Methodology [7].
4.2 The sub-functions
Based on the concepts in chapter 3, the sub-functions that lead to the approximated sine and cosine
functions are shown in (4.1) and (4.2) respectively. The angle θf is the normalized fractional part of υ. It
can be noted that only s1 differs between the sine and cosine functions.
$$s_1(\theta_f) = \theta_f + c_1 \cdot (\theta_f - \theta_f^2)$$
$$s_2(\theta_f) = 1 + c_2 \cdot (\theta_f - \theta_f^2) \tag{4.1}$$
$$\sin(\theta_f) = s_1(\theta_f) \cdot s_2(\theta_f)$$
$$s_1(\theta_f) = 1 - \theta_f + c_1 \cdot (\theta_f - \theta_f^2)$$
$$s_2(\theta_f) = 1 + c_2 \cdot (\theta_f - \theta_f^2) \tag{4.2}$$
$$\cos(\theta_f) = s_1(\theta_f) \cdot s_2(\theta_f)$$
where c1 and c2 are the coefficients.
The optimal 7-bit coefficients are shown in (4.3). To obtain parallelism in the architecture, the
multiplications by c1 and c2 are performed at the same time. Figure 4.2 shows the basic Parabolic Synthesis
architecture.
$$c_1 = 0.5703125 = 0.1001001_2$$
$$c_2 = 0.4062500 = 0.0110100_2 \tag{4.3}$$
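The sub-functions in (4.1) and (4.2) with the coefficients from (4.3) can be evaluated in a few lines of floating-point Python. This is a behavioural sketch only: the function names are hypothetical, the reference is taken to be $\sin(\pi \theta_f / 2)$ and $\cos(\pi \theta_f / 2)$ for the normalized angle, and the wordlength effects of the real hardware are ignored.

```python
import math

C1 = 0.5703125   # 0.1001001 in binary, from (4.3)
C2 = 0.40625     # 0.0110100 in binary, from (4.3)

def sin_approx(theta_f):
    """Sine of the normalized first-quadrant angle theta_f in [0, 1], as in (4.1)."""
    p = theta_f - theta_f * theta_f      # shared term theta_f - theta_f^2
    return (theta_f + C1 * p) * (1.0 + C2 * p)

def cos_approx(theta_f):
    """Cosine of the normalized first-quadrant angle, as in (4.2)."""
    p = theta_f - theta_f * theta_f
    return (1.0 - theta_f + C1 * p) * (1.0 + C2 * p)

def max_abs_error(approx, exact, steps=1000):
    """Largest deviation from the reference on a uniform grid over [0, 1]."""
    return max(abs(approx(k / steps) - exact(math.pi / 2 * k / steps))
               for k in range(steps + 1))

sin_err = max_abs_error(sin_approx, math.sin)
cos_err = max_abs_error(cos_approx, math.cos)
```

The term θf − θf² is computed once and shared by all four sub-functions, which mirrors the observation that only s1 differs between sine and cosine.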
Figure 4.2 The first parabolic synthesis architecture
4.3 Angle transformation
The architecture shown in Figure 4.2 is only valid for angles in the first quadrant. The other three
quadrants must also be covered, which is done by transforming normalized angles larger than "1" to the
first quadrant. Figure 4.3 illustrates the methodology. In Figure 4.3a, the original angles are shown. These
are normalized to Figure 4.3b with a factor $2/\pi$, e.g. the angle υ = $\pi/6$ is transformed to the
normalized angle θf = 1/3. The equations for the normalization are shown below.
$$\upsilon = \frac{5\pi}{6} \tag{4.4}$$
$$\theta_{norm} = \upsilon \cdot \frac{2}{\pi} = \frac{5\pi}{6} \cdot \frac{2}{\pi} = \frac{5}{3} = 01_2 + \frac{2}{3} \tag{4.5}$$
$$\theta_{frac} = \frac{2}{3} \tag{4.6}$$
$$\theta_f = 1 - \theta_{frac} = \frac{1}{3} \tag{4.7}$$
All the normalized angles in the first quadrant are less than "1". Figure 4.3b shows the integer part of the
angle θ. The integer part, θ1θ0, thus shows which quadrant the angle is in, i.e. quadrant 1, 2, 3 or 4,
represented by 0, 1, 2, 3 respectively in binary representation. That is useful in the selection of the
angle when transforming to the first quadrant, as shown in Figure 4.3c for the sine functions. This can be
illustrated with the example in (4.4)-(4.7).
[Figure 4.3: (a) the original angle υ in the four quadrants, 0 ≤ υ < π/2, π/2 ≤ υ < π, π ≤ υ < 3π/2 and 3π/2 ≤ υ < 2π; (b) the integer part θ1θ0 after normalization, 00, 01, 10 and 11 in binary for quadrants 1 to 4; (c) the sine functions for quadrants 1 to 4, sin(π/2·θf), sin(π/2·(1−θf)), −sin(π/2·θf) and −sin(π/2·(1−θf)); (d) the cosine functions cos(υ) for quadrants 1 to 4 expressed as sines, sin(π/2·(1−θf)), −sin(π/2·θf), −sin(π/2·(1−θf)) and sin(π/2·θf)]
Figure 4.3 Transforming angles from quadrant 2-4 to quadrant 1.
An angle υ = 5π/6 (4.4) in the second quadrant corresponds to the angle υ = π/6 in the first quadrant.
After normalization the angles correspond to 5/3 and 1/3, where the first angle is larger than one
(4.5). The idea is now to transform 5/3 in the second quadrant to θf = 1/3 in the first quadrant. The integer
part, θ1θ0 = 01, in (4.5) is taken away to get (4.6). Finally, the fractional part is transformed to the
angle θf = 1/3 in the first quadrant (4.7). However, the integer part is not thrown away. It is used to
select the quadrant for the initial angle and for the two's complement conversion at the output. Figure 4.3d
shows the transformation of the cosine functions in the four quadrants. All cosine functions are transformed
to sine functions since the architecture is designed for sine functions.
Table I shows when the input transformations are needed. To select between them, the integer value θ0 is
used. This is implemented with two MUXes.
Table I. Input transformations
            Quadrant 1   Quadrant 2   Quadrant 3   Quadrant 4
Sine        θf           1 − θf       θf           1 − θf
Cosine      1 − θf       θf           1 − θf       θf
Since all calculations are done in the first quadrant, the output has to be transformed back again. As an
example, if we compute cos(υ) for the 2nd quadrant angle υ = 2π/3, we get the value -0.5. However, the
cosine value is determined in the 1st quadrant with the angle υ = π/3, which gives the value 0.5. To correct
that, a two’s complement conversion is needed. Table II shows when the two’s complement conversion is
needed. To select that, the integer values θ1 and θ0 are used.
Table II. Output transformations
              Sine   Cosine
Quadrant 1    +      +
Quadrant 2    +      -
Quadrant 3    -      -
Quadrant 4    -      +
Figure 4.4 shows the updated architecture capable of transforming the angles.
[Figure 4.4: block diagram with MUXes selecting between θf and 1 − θf, the constants c1 and c2, two two’s complement converters, and the outputs cos(θ), sin(θ) and −sin(θ).]
Figure 4.4 The second parabolic architecture
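The quadrant handling described by Tables I and II can be sketched in software as follows. This is an illustration under stated assumptions: `first_quadrant_sine` is a hypothetical stand-in that calls the library sine instead of the parabolic-synthesis core, and floating point replaces the fixed-point datapath.

```python
import math

def first_quadrant_sine(x: float) -> float:
    """Stand-in for the first-quadrant core: sin(pi/2 * x) for 0 <= x <= 1."""
    return math.sin(math.pi / 2 * x)

def sin_cos(theta: float):
    """Return (sin, cos) for a normalized angle theta = angle / (pi/2)."""
    q = int(math.floor(theta)) % 4          # quadrant code theta1 theta0
    theta1, theta0 = (q >> 1) & 1, q & 1
    frac = theta - math.floor(theta)        # fractional part

    # Table I: two MUXes controlled by theta0 select frac or 1 - frac.
    sin_in = 1 - frac if theta0 else frac
    cos_in = frac if theta0 else 1 - frac

    s = first_quadrant_sine(sin_in)
    c = first_quadrant_sine(cos_in)

    # Table II: negate the sine when theta1 = 1 and the cosine
    # when theta1 XOR theta0 = 1.
    if theta1:
        s = -s
    if theta1 ^ theta0:
        c = -c
    return s, c

s, c = sin_cos(4 / 3)   # normalized form of the 2nd quadrant angle 2*pi/3
```

Both outputs are computed with the same first-quadrant sine function, which mirrors the statement that all cosine functions are transformed to sine functions.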
4.4 Simplifications
Figure 4.4 is further improved by optimizing some of the components.
4.4.1 The MCM unit
The design can be improved by optimizing two multipliers. The multipliers can be replaced by three
adders and some shifts, as shown in Figure 4.5. The technique is called Multiple Constant Multiplication
(MCM). There is one input, θf − θf², and two outputs, c1(θf − θf²) and c2(θf − θf²), in both figures. In the
VHDL code, we thus eliminated expressions like “AxB” for these two multipliers.
Figure 4.5 Architecture for Multiple Constant Multiplication (MCM)
4.4.1.1 Example
If we apply a positive fractional input value (it cannot be negative) to the left part in figure 4.5, we get:
θf − θf² = 1/7 => 001001001101
  000010010011|01000    Shifted by 2   (θf − θf²)/4
+ 000000010010|01101    Shifted by 5   (θf − θf²)/32
  ------------------
  000010100101|10101
+ 000001001001|10100    Shifted by 3   (θf − θf²)/8
  ------------------
  000011101111|01001
The result for the left part of the architecture in figure 4.5 is then given by (4.8):
c_2(\theta_f - \theta_f^2) = \frac{\theta_f - \theta_f^2}{4} + \frac{\theta_f - \theta_f^2}{32} + \frac{\theta_f - \theta_f^2}{8} \qquad (4.8)
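The shift-and-add example can be checked with integer arithmetic. The sketch below is illustrative only; the names `mcm_c2` and `GUARD_BITS` are assumptions, and the five guard bits correspond to the bits shown to the right of the “|” in the example.

```python
GUARD_BITS = 5  # extra fractional bits so that the right shifts lose nothing

def mcm_c2(x: int) -> int:
    """Compute x/4 + x/32 + x/8 for a fixed-point word x with shifts and
    two additions only. The result carries GUARD_BITS extra fractional bits."""
    xe = x << GUARD_BITS
    partial = (xe >> 2) + (xe >> 5)   # x/4 + x/32
    return partial + (xe >> 3)        # ... + x/8

x = 0b001001001101                    # 12-bit input word from the example
result = mcm_c2(x)
print(format(result, "017b"))         # 12 result bits followed by 5 guard bits
```

Since 1/4 + 1/32 + 1/8 = 13/32, the result equals 13·x when counted in units of the last guard bit, so the constant multiplication needs no general multiplier.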
4.4.2 Eliminating an adder
Adding a “1” to c2(θf − θf²) can be simplified. Since c2 is positive and smaller than one (see (4.8)), and
0 ≤ θf − θf² ≤ 1/4 for 0 ≤ θf ≤ 1, the product is never negative and never reaches “1”, i.e.
0 ≤ c2(θf − θf²) < 1. The fractional part can thus be merged with the “1” directly in the wiring, as
shown in figure 4.6.
[Figure 4.6: a bus whose integer part is the constant 1 and whose fractional part is c2(θf − θf²).]
Figure 4.6 The fractional bus with an added integer “1”
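The wiring argument can be illustrated with a short check: when the fraction is below one, placing a constant 1 in the integer bit gives the same word as a real addition. This is a sketch; `FRAC_BITS` is an assumed wordlength chosen for illustration, not the value used in the design.

```python
FRAC_BITS = 12  # illustrative fractional wordlength (assumption)

def one_plus_fraction(frac: int) -> int:
    """Form 1 + frac by wiring: put a constant 1 in the integer bit position."""
    assert 0 <= frac < (1 << FRAC_BITS)   # frac represents a value below 1
    return (1 << FRAC_BITS) | frac        # bit concatenation, no carry chain

# The wired word equals a true addition for every valid fraction.
all_equal = all(one_plus_fraction(f) == (1 << FRAC_BITS) + f
                for f in range(1 << FRAC_BITS))
```

No carry can propagate into the integer bit, which is why the adder can be removed.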
4.4.3 Two’s complement conversion
Figure 4.7 shows an architecture for two’s complement conversion. The architecture uses half adders
(HAs) and XOR gates. A control signal, θ1 for the sine output or θ1 XOR θ0 for the cosine output (see
Table II), is used to select when the conversion is to be done.
[Figure 4.7: two’s complement converter built from a chain of half adders (HA) and XOR gates.]
Figure 4.7 Two’s complement conversion
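A bit-level model of such a converter is sketched below. It assumes the common structure in which each input bit is XORed with the control signal and the control signal is also fed as the carry-in to a chain of half adders; the exact gate arrangement in figure 4.7 may differ, and the function name and default width are illustrative.

```python
def twos_complement_if(x: int, negate: int, width: int = 8) -> int:
    """Conditionally negate a width-bit word with XOR gates and half adders."""
    carry = negate                          # the '1' added after inversion
    result = 0
    for i in range(width):
        bit = ((x >> i) & 1) ^ negate       # XOR gate: invert when negate = 1
        result |= (bit ^ carry) << i        # half-adder sum output
        carry = bit & carry                 # half-adder carry output
    return result

negated = twos_complement_if(0b00000001, 1)   # two's complement of 1
passed = twos_complement_if(0b00000001, 0)    # control low: word passes through
```

With the control signal low the word passes through unchanged, so the same block can serve every quadrant.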
4.4.4 Squarer
Instead of using a multiplier for the squaring, a simplified version can be used as shown in figure 4.8.
[Figure 4.8: partial-product array of the 6-bit squarer, built from the products xi·xj (x5x4, x5x3, ..., x1x0) and the single bits x5 to x0.]
Figure 4.8 A 6-bit squarer
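The saving in a dedicated squarer comes from the symmetry of the partial products: xi·xi reduces to xi, and each pair xi·xj with i < j appears twice, so it can be generated once and shifted one position to the left. The sketch below models this folding for a 6-bit input; it illustrates the principle and is not a gate-level copy of figure 4.8.

```python
def square_folded(x: int, width: int = 6) -> int:
    """Square x using the folded partial-product array of a squarer."""
    bits = [(x >> i) & 1 for i in range(width)]
    total = 0
    for i in range(width):
        total += bits[i] << (2 * i)                      # x_i * x_i = x_i
        for j in range(i + 1, width):
            # x_i*x_j occurs twice, so add it once with weight 2^(i+j+1).
            total += (bits[i] & bits[j]) << (i + j + 1)
    return total

matches = all(square_folded(v) == v * v for v in range(64))
```

A 6-bit multiplier needs 36 AND terms, while the folded array needs only the 15 pair products plus the 6 input bits.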
4.5 Final architecture
The final architecture with all the simplifications is shown in figure 4.9. The architecture contains:
• Two multipliers,
• One squarer,
• Seven adders,
• Two two’s complement converters, and
• Four MUXes.
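[Figure 4.9: block diagram of the final architecture, showing the MUXes selecting between θf and 1 − θf, the signals θf − θf², c1·(θf − θf²), c2·(θf − θf²) and 1 + c2·(θf − θf²), the squarer (SQR), and the two’s complement converters producing cos(θ), sin(θ) and −sin(θ).]
Figure 4.9 The final architecture.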
The critical path goes through:
• One squarer
• Four adders
• One multiplier
• One two’s complement converter
• One MUX
4.6 Wordlengths
Figure 4.10 shows the internal wordlengths needed to reach 9-bit accuracy at the output.
Figure 4.10 Internal wordlengths
4.7 Hardware for matrix inversion using QRD by Parabolic synthesis
Figure 4.11 shows the proposed hardware for matrix inversion using QRD. Due to time limitations,
the parabolic hardware could not be integrated with the QRD hardware.
Figure 4.11. Proposed hardware for matrix inversion using QRD.
CHAPTER 5
5 Synthesis
5.1 Synthesis
Synthesis is the process by which a conceptual description of the logic functions needed for the desired
circuit behavior, typically at register transfer level (RTL), is turned into a design implementation in terms of
logic gates [8].
The flow chart of the synthesis process is shown in figure 5.1.
Figure 5.1 Flow chart of synthesis process [8].
5.2 Types of Synthesis
For the thesis work, two types of synthesis have been performed. One targeted a Virtex-II Pro
FPGA using the Xilinx ISE Design Suite, and the other targeted an ASIC implementation in STM 65nm
technology using the Synopsys Design Vision tool. As mentioned in the previous chapter,
a hardware description language (HDL) has been used for the design implementation. Some of
the advantages of HDL for synthesis include:
1) Decreased design time, by permitting a high-level design specification,
2) Fewer errors than with manual translation from HDL to a schematic design,
3) Increased optimization and efficiency, since the synthesis tool applies automated techniques
(such as automatic I/O insertion and machine encoding styles) when optimizing the original
HDL code.
5.3 Synthesis for FPGA
The design has been synthesized from VHDL with the Xilinx ISE Design Suite for a Virtex-II Pro FPGA
with a speed grade of -7. The RTL schematic from the synthesis is shown in fig. 5.2.
Figure 5.2 RTL Schematic obtained from synthesis on Virtex-II Pro FPGA.
The schematic of the MCM block is shown in fig. 5.3.
Figure 5.3 MCM block Schematic obtained from synthesis on Virtex-II Pro FPGA.
The technology schematic for the overall design is shown in fig. 5.4.
Figure 5.4 Technology Schematic obtained from synthesis on Virtex-II Pro FPGA.
5.3.1 Synthesis Report from FPGA
The critical path includes the squarer, one subtractor, one multiplier, one two’s complement unit, one
multiplexer and the MCM unit. The critical path delay is 20.496 ns, so the maximum achievable clock
frequency is 1/(20.496 ns) ≈ 48.8 MHz. The individual delay of each component in the critical path is
shown in Appendix 1, Table VIII.
Detailed synthesis reports containing the macro statistics, cell usage and device utilization are shown in
Appendix 1, Tables IX, X and XI respectively.
5.4 Synthesis for ASIC
The design has been synthesized towards an ASIC implementation in a 65nm technology using the
Synopsys Design Vision tool. Two constraints, area and speed, were emphasized in this synthesis. As a
result, the design has been synthesized both for high speed and for minimum area. The scripts used for
these two types of synthesis are shown in Appendix 1, Table I.
5.4.1 Minimum Area Synthesis
When synthesizing for minimum area, we set the clock period to a very high value and set the
maximum area to zero.
The constraints that were set for minimum area are:
i) Area, and
ii) Clock uncertainty time.
5.4.2 High Speed Synthesis
When synthesizing for high speed, we set the clock constraint to a value that gives no negative
slack and did not set any area constraint.
The constraints that were set for high speed are:
i) Clock speed, and
ii) Clock uncertainty time.
5.4.3 Results from ASIC Synthesis
The results obtained from the Synopsys Design Vision synthesis report are as follows:
5.4.3.1 Area
The total area of the design is 450893 of which 2893 is the combinational area and the remaining 448000
is the sequential area consisting of the I/O pads and the input and output registers used for determining the
critical path. The individual component area is shown in Appendix 1, Table IV.
5.4.3.2 Implemented arithmetic blocks
The synthesis tool used the two libraries ‘IO65LPHVT_SF_1V8_50A_7M4X0Y2Z’ and
‘CORE65LPHVT’ for the implementation of the arithmetic blocks. The details of each implemented
block are shown in Appendix 1, Table V.
5.4.3.3 Timing
The critical path includes one squarer, one subtractor, one adder, one multiplier, one two’s complement
unit, one multiplexer and the MCM unit. The critical path delay is 11.33 ns, so the maximum achievable
clock frequency is 1/(11.33 ns) ≈ 88 MHz. The individual delay of each component in the critical
path is shown in Appendix 1, Table III.
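The relation between critical path delay and maximum clock frequency used for both targets can be written as a small helper. This is only a convenience sketch; the function name is hypothetical and the two delays are the values reported for the FPGA and the ASIC synthesis.

```python
def max_clock_mhz(critical_path_ns: float) -> float:
    """Maximum clock frequency in MHz for a critical path delay in ns."""
    return 1000.0 / critical_path_ns

fpga_mhz = max_clock_mhz(20.496)   # FPGA critical path (section 5.3.1)
asic_mhz = max_clock_mhz(11.33)    # ASIC critical path (section 5.4.3.3)
print(f"FPGA: {fpga_mhz:.1f} MHz, ASIC: {asic_mhz:.1f} MHz")
```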
5.4.3.4 Power
From synthesis we obtained a dynamic power of 0.0794 mW, of which 54.71% is net switching
power and 45.27% is cell internal power.
To obtain a more accurate power report, the design has been analyzed in the PrimeTime tool. The
script and the detailed report are shown in Appendix 1, Tables VI and VII.
Figure 5.5 The parabolic architecture with I/O pads from Design Vision.
Figure 5.6 Parabolic architecture from Design vision.
Figure 5.7 Parabolic architecture from Primetime
Figure 5.8 Schematic view of the Matrix Inversion from Design Vision
CHAPTER 6
6 Results and Conclusion
The aim of the thesis was to develop hardware for the generation of three trigonometric functions
(+sine, -sine and +cosine) using the novel approximation methodology based on parabolic
synthesis, for use in Givens rotations when implementing matrix inversion using QR
decomposition. Separate hardware for parabolic synthesis and for matrix inversion using QRD has
been implemented. However, due to time limitations, the two modules could not be integrated.
Instead, a hardware solution for the overall system has been proposed (section 4.7).
CHAPTER 7
7 Future Work
The matrix inversion unit only works for a fixed set of matrices. Furthermore, the division unit is limited to
4 bits.
In the future, a square root implementation could enable matrix inversion for any set of values. The input
to the division logic can be scaled down at the beginning and the result scaled up again at the end to allow
the use of larger wordlengths.
References
[1] http://www.eit.lth.se/fileadmin/eit/docs/Licentiate/lic_14.5_mac.pdf
[2] Erik Hertz and Peter Nilsson, "A Methodology for Parabolic Synthesis", a book chapter in VLSI, InTech, ISBN 978-3-902613-50-9, pp. 199-220, Vienna, Austria, September 2009.
[3] http://en.wikipedia.org/wiki/Invertible_matrix
[4] http://www.purplemath.com/modules/mtrxinvr.htm
[5] http://en.wikipedia.org/wiki/QR_decomposition
[6] http://en.wikipedia.org/wiki/Givens_rotation
[7] http://www.eit.lth.se/index.php?id=241&ciuid=475&coursepage=2602&L=1
[8] http://www.eit.lth.se/fileadmin/eit/courses/etin01/slides/Synthesis.pdf
[9] P.T.P. Tang, "Table-lookup algorithms for elementary functions and their error analysis," Proc. of the 10th IEEE Symposium on Computer Arithmetic, pp. 232-236, ISBN 0-8186-9151-4, Grenoble, France, June 1991.
[10] J.-M. Muller, Elementary Functions: Algorithms and Implementation, second ed., ISBN 0-8176-4372-9, Birkhauser Boston, c/o Springer Science+Business Media Inc., 233 Spring Street, New York, NY 10013, USA, 2006.
[11] Erik Hertz and Peter Nilsson, "Parabolic Synthesis Methodology Implemented on the Sine Function", in Proceedings of the 2009 International Symposium on Circuits and Systems (ISCAS’09), Taipei, Taiwan, May 24-27, 2009.
[12] Lei Wang, Hardware Implementation of Parabolic Synthesis Methodology for the Sine and Cosine Functions, Master’s thesis, Lund University, 2009.
Appendix 1: For Synthesis
Table I: Synthesis Scripts for ASIC
For High Speed:
gui_start
remove_design -all
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/para_generics_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/adder_block_pack1.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/adder_block_pack2.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/subtractor_block_pack1.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/subtractor_block_pack2.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mcm_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mult_block_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mux2by1_pack1.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mux2by1_pack2.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mux2by1_pack3.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mux2by1_pack4.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/squarrer_block_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/twos_comp_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/XOR_block_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/parabolic_senthesis1.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/top_parabolic_senthesis1.vhd}
elaborate top_parabolic_senthesis -lib WORK -arch structural
compile -map_effort high
report_constraint -all_violators
change_names -rules verilog -hierarchy
write -format verilog -hierarchy -output netlists/top_para.v
write_sdf ./netlists/top_para.sdf
write_sdc ./netlists/top_para.sdc
For Minimum Area:
gui_start
remove_design -all
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/para_generics_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/adder_block_pack1.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/adder_block_pack2.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/subtractor_block_pack1.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/subtractor_block_pack2.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mcm_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mult_block_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mux2by1_pack1.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mux2by1_pack2.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mux2by1_pack3.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/mux2by1_pack4.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/squarrer_block_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/twos_comp_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/XOR_block_pack.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/parabolic_senthesis.vhd}
analyze -library WORK -format vhdl {/home/piraten/sx08nc4/Desktop/New_Parabolic/vhdl/top_parabolic_senthesis1.vhd}
elaborate top_parabolic_senthesis -lib WORK -arch structural
set_max_area 0
compile -map_effort high
report_constraint -all_violators
change_names -rules verilog -hierarchy
write -format verilog -hierarchy -output netlists/par_min.v
write_sdf ./netlists/par_min.sdf
write_sdc ./netlists/par_min.sdc
Table II: Area Hierarchy of the design (ASIC)
Table III: Components in the critical path with individual delays (ASIC)
Table IV: Area Report (ASIC)
Table V: Implemented Arithmetic blocks (ASIC)
Table VI: Prime Time Script
start_gui
remove_design -all
set power_enable_analysis true
set search_path "$env(STM065_DIR)/IO65LPHVT_SF_1V8_50A_7M4X0Y2Z_7.0/libs \
$env(STM065_DIR)/CORE65LPHVT_5.1/libs \
$env(STM065_DIR)/CORE65LPSVT_5.1/libs \
$search_path"
set link_library "* IO65LPHVT_SF_1V8_50A_7M4X0Y2Z_nom_1.00V_1.80V_25C.db \
CORE65LPHVT_nom_1.20V_25C.db CORE65LPSVT_nom_1.20V_25C.db"
set target_library "IO65LPHVT_SF_1V8_50A_7M4X0Y2Z_nom_1.00V_1.80V_25C.db \
CORE65LPHVT_nom_1.20V_25C.db CORE65LPSVT_nom_1.20V_25C.db "
Table VII: Power analysis obtained from Prime Time
Synthesized time: 5 ns
Synthesized Area     Net switching power   Cell Internal power   Cell Leakage power   Total Power
None (High Speed)    8.95e-5               1.0e-4                4.61e-8              1.9e-4
0 (Minimum Area)     9.42e-5               1.0e-4                4.22e-8              1.95e-4
8592                 8.20e-5               9.4e-5                4.07e-8              1.72e-4
Table VIII: Timing report from FPGA
Table IX: Macro Statistics report from FPGA
# Multipliers              : 3
  11x11-bit multiplier     : 1
  12x12-bit multiplier     : 2
# Adders/Subtractors       : 11
  10-bit adder             : 3
  10-bit subtractor        : 1
  11-bit adder             : 1
  12-bit adder             : 4
  9-bit adder              : 2
# Xors                     : 1
  1-bit xor2               : 1
Table X: Cell Usage report from FPGA
# BELS                     : 305
#   GND                    : 1
#   INV                    : 22
#   LUT1                   : 11
#   LUT2                   : 31
#   LUT3                   : 29
#   LUT4                   : 46
#   MUXCY                  : 78
#   MUXF5                  : 3
#   MUXF6                  : 1
#   VCC                    : 1
#   XORCY                  : 82
# IO Buffers               : 49
#   IBUF                   : 13
#   OBUF                   : 36
# MULTs                    : 3
#   MULT18X18              : 3
Table XI: Device utilization summary from FPGA
Selected Device            : 2vp2fg256-7
Number of Slices           : 76 out of 1408    5%
Number of 4 input LUTs     : 139 out of 2816   4%
Number of IOs              : 49
Number of bonded IOBs      : 49 out of 140     35%
Number of MULT18X18s       : 3 out of 12       25%