Master’s Thesis Report Design of FFTs using CORDIC and Parabolic Synthesis as an alternative to Twiddle Factor Rotations By Muhammad Waqas Shafiq Nauman Hafeez Department of Electrical and Information Technology Faculty of Engineering, LTH, Lund University SE-221 00 Lund, Sweden Abstract Fast Fourier Transform (FFT) processor is an important signal processing block widely adopted in various disciplines, such as within the computer area, computer graphics, digital signal processing, communication systems, robotics, navigational systems, etc. Advanced and complex algorithms in these disciplines need higher computation performance and low power utilization processes. Due to advancements and down scaling of hardware technologies, it becomes possible to fulfill these higher demands from the most advanced algorithms with higher clock rates on the chips. The improvement in hardware technology performance has moved the interest more over to the hardware implementation of algorithms. The scope of this thesis is to investigate different algorithms to compute the rotations, often done by complex twiddle factor multiplications in in run time, in FFT processor architectures. The FFT processor is commonly implemented with a complex multiplier but due to demand of higher point FFTs in the applications, the size of ROM in the multiplier based implementation for the twiddle factors becomes the matter of concern with larger chip area. The aim of this thesis is to design an FFT with un-rolled CORDIC and Parabolic Synthesis algorithms in order to replace complex multipliers and ROMs. The outcome of the presented thesis is comparison results for accuracy, power consumption, performance and area of these algorithms in the STM 65nm CMOS technology with different transistor technologies and power supplies. It is noted that implementation with Parabolic synthesis algorithm gives better results as it automatically gives a high degree of parallelism that gives a shorter critical path and thereby fast computation. The structure of the methodology will also assure an area efficient hardware implementation if higher point FFTs are implemented. 2 Acknowledgments It is a pleasure to thank the many people who made this thesis possible. It is difficult to overstate our gratitude to our supervisor, Dr. Peter Nilsson. With his enthusiasm, his inspiration, and his great efforts to explain things clearly and simply, he helped to make this thesis fun for us. Throughout our thesis work, he provided encouragement, sound advice, good teaching, good company, and lots of good ideas. We would have been lost without him. We are also grateful to Erik Hertz who helped us to understand his research work. We would like to thank the many people who have taught us throughout the MSc especially Joachim Rodrigues, Viktor Öwall, Markus Törmänen, Pietro Andreani and all Teacher assistants for their kind assistance with our courses and lab works especially Yasser Sherazi, Chenxin Zhang, Reza Meraji, Johan Löfgren, Stefan Molund, and Jonas Lindstrand. We wish to thank our best friends Minna, Waqas Khan, Umair, Farhan, Muaz, Salman, Naveed, Adnan, Shoaib, Ammar and Naeem for providing a stimulating and fun environment in which I learnt and grow. We wish to thank our entire families for providing a loving environment for us. My brothers, my sister Kayynat were particularly very supportive and had faith in me. Lastly, and most importantly, we wish to thank our parents. They bore us, raised us, supported us, taught us, and loved us. To them I dedicate this thesis. 3 Table of Contents 1. 2. Introduction ................................................................................................. 7 1.1 Overview .............................................................................................. 7 1.2 Organization of Thesis .......................................................................... 8 Fast Fourier Transform ................................................................................. 9 2.1 Discrete Fourier Transform ................................................................... 9 2.2 FFT Algorithms ..................................................................................... 9 2.2.1 Decimation in Time (DIT) algorithm............................................. 10 2.2.2 Decimation in Frequency (DIF) algorithm .................................... 10 2.3 3. 4. CORDIC ...................................................................................................... 15 3.1 Rotation ............................................................................................. 15 3.2 CORDIC Formulation........................................................................... 18 3.3 Unrolled CORDIC ARCHITECTURE........................................................ 20 3.4 CORDIC Scaling Factor ........................................................................ 22 3.5 Generalize CORDIC ............................................................................. 22 Parabolic Synthesis .................................................................................... 25 4.1 The Methodology ............................................................................... 25 4.1.1 Normalization ............................................................................. 25 4.1.2 Hardware Architecture Development.......................................... 25 4.1.3 Methodology for developing sub-functions ................................. 26 4.2 5. Hardware Architecture ....................................................................... 12 Hardware Implementation ................................................................. 28 4.2.1 Pre-Processing ............................................................................ 28 4.2.2 Processing .................................................................................. 29 Results ....................................................................................................... 35 5.1 Area ................................................................................................... 36 5.2 Timing ................................................................................................ 37 5.3 Static Power ....................................................................................... 38 4 5.4 Dynamic Power .................................................................................. 40 5.5 Energy ................................................................................................ 41 6. Conclusions ................................................................................................ 45 7. Future Work............................................................................................... 46 References......................................................................................................... 47 List of Figures..................................................................................................... 48 List of Tables ...................................................................................................... 49 5 Preface This thesis is submitted in partial fulfillment of the requirements of Master’s degree in System on Chip at LTH, Lund University. This thesis has been made solely by the authors, however most of the text is based on the research of others, we tried to provide references where necessary. In this thesis work, initial literature review have been done together with discussions and note taking to proceed with development process. Nauman Hafeez has done design of 14 stage un-rolled CORDIC and Parabolic Synthesis design in VHDL, tested and verified their simulation with MATLAB design. Waqas Shafiq has done design of 16-point FFT, test and verified with MATLAB results. He has also done MATLAB reference model design. Result extractions of area, speed and power after synthesis process have been done together. 6 CHAPTER 1. Introduction 1.1 Overview In past few years the advancements in VLSI technologies have opened many windows for designing real time applications, using efficient algorithms with custom chips. The use of special arithmetic techniques, parallel and pipelined architectures have led the designs far away from basic computers. Nowadays the world of information technology has lined up many areas together such as communication systems, computer graphics, robotics, navigations, astrophysics etc. and all applications demands high performance with minimum possible power consumption. Many digital signal processor designers and manufacturers are facing one big challenge, that is, how to perform complex mathematical function calculations more efficiently. To gain efficiency, it is needed to dig inside our complete designing cycle which contains algorithms, architectures, hardware technology, power supply etc. Fast Fourier Transform (FFT) is widely used transform in digital applications especially in communication systems. An FFT is an important processing block in these systems, which takes most of the hardware complexity in a digital baseband transceiver for instance. Due to demand in higher data rates in communication systems, large-point FFTs are required for multiple carrier modulation, such as 1024/2048/8192 etc. An FFT is commonly implemented with complex multiplier; a complex multiplier is equivalent to four real multipliers and two real adders, and a ROM to store the twiddle factors. The ROM in this type of implementation takes most of the chip area, consumes more power and degrades the speed because of ROM read operation ROM size increased with large-point FFTs. Hence poor performance of the FFT in terms of power, speed, and area can be seen. The main objective of the thesis is to investigate algorithms which can be used to calculate trigonometric functions and perform complex multiplications. An unrolled CORDIC architecture is widely used instead of complex multiplier and ROM based architectures. Power consumption, speed and accuracy are issues in this architecture implementation. A newly invented “Parabolic Synthesis” algorithm is also investigated that could improve the results with respect to CORDIC based implementation of the FFT. Both algorithms are implemented in VHDL, synthesized and compared in terms of power, area, accuracy and timing. Both algorithms are tested to calculate the rotations and to perform complex multiplications for a 16-Point FFT processor. 7 Traditional methodologies such as use of complex multipliers with pre-computed twiddling factors (stored in ROM/Registers) for FFT processors are also being compared with CORDIC and Parabolic Synthesis. 1.2 Organization of Thesis Chapter 2 explains the basic theory and fundamental equations of CORDIC algorithm along with the hardware architecture. Chapter 3 contains the brief description of Parabolic Synthesis and covers the basic theory, mathematical equations and hardware architectures used to implement this algorithm for FFT processors. Chapter 4 is a review of FFT and explanation of butterfly architecture is provided, and Single-path Delay Feedback (SDF) architecture for implementing FFT is also discussed and implemented in hardware. Chapter 5 contains the synthesis results of tested algorithms and gives the guideline to the reader about the selection of algorithm, the detail comparison between power, area and speed using different transistor technologies and supply voltages, is presented. Chapter 6 summarizes the results achieved in the project. Chapter 7 discusses the future work in order to improve performance. 8 Chapter 2 2. Fast Fourier Transform The Fast Fourier Transform (FFT) architecture was invented in 1965 by Cooley and Tukey [8] and is not a different algorithm from DFT. It is architecture for efficient computation of DFT. 2.1 Discrete Fourier Transform DFT is the most widely used transform of all the available transforms in digital signal processing. The DFT maps the input sequence ( ) into frequency domain. The Discrete Fourier Transform (DFT) of ( ) input sequence is defined as in (2.1), which is an N-point sequence [9]. ( ) = ( ) = 0,1, … . , Where ( ) and ( ) are complex in general and the indices The term 1 and (2.1) are integers. is twiddle factors, which is described in (2.2) = = cos 2 2 (2.2) The DFT is complex valued but in hardware, the real (Re) and imaginary (Im) parts are separated in two real parts as in (2.3) ( )= cos ( )= sin sin + (2.3) cos 2.2 FFT Algorithms Fast Fourier Transform (FFT) [8] is an efficient algorithm for computing the DFT. Its principle is based on decomposing the computation of the discrete Fourier transform with sequence of length N into successively smaller discrete Fourier transforms. Computation complexity is reduced from O(N2) to O(Nlog(N)) after 9 introducing the FFT architecture to calculate a DFT. There are different algorithms of FFT based on different decomposition schemes. Some of the algorithms are described next. 2.2.1 Decimation in Time (DIT) algorithm The algorithm which decomposes the sequence ( ) into subsequently smaller sequences is called decimation in time algorithm [10]. In order to explain it, a radix-2 DIT algorithm is explained to demonstrate the decomposition. In (2.1), the input sequence is divided into even and odd numbered sequences. In this case, the DFT is described as ( )= ( ) ( + 1) The input sequence ( ) is split into even numbered ( numbered ( + 1) sequence. = 0,1, . . , (2.4) ) sequence and odd For larger N, the complexity is reduced from O(N2) to nearly half by dividing the input sequence into even and odd numbered sequences. The complexity can be further reduced by using the symmetry and periodicity of twiddle factors. Figure 2.1 A Radix-2 DIT Butterfly Due to the even and odd numbered sequence split, the twiddle factor is also simplified. The architecture of radix-2 DIT butterfly only needs one complex multiplier and two complex adders instead of two complex multipliers. 2.2.2 Decimation in Frequency (DIF) algorithm In this algorithm, the output of the DFT is divided into smaller subsequent sequences. DFT main equation is recalled as in (2.1). Even numbered output sequence can be described in (2.5) 10 (2 ) = ( ) ( ) + , By changing the second summation from and by using the property of (2 ) = ( )+ = 0,1, . . , 1 (2.5) to N into the summation from 0 to , (2.5) can be rewritten as in (2.6) + , = 0,1, . . , 1 In the same way, odd numbered sequence can be expressed as in (2.7) (2 + 1) = ( ) + , 2 = 0,1, . . , 2 1 (2.6) (2.7) (2.6) and (2.7) are combined to get the first order DIF derivation. First step is to compute ( ) + and ( ) + and then multiply the latter term with . Fig. 2.2 shows a simple radix-2 DIF butterfly. in1 + out1 W in2 -+ x out2 Figure 2.2 A Radix-2 DIF Butterfly A 16-point radix-2 DIF FFT architecture is chosen to be used in this project. The signal flow graph of the architecture is presented in (2.3). 11 Figure 2.3 A 16-point radix-2 DIF FFT algorithm Fig. 2.3 shows the signal flow graph for a 16-point FFT. Two structures as the one in the figure are needed, one for the real part and one for the imaginary part. The output is in bit-reversal order i.e., the bits are flipped at the output. The signal flow graph in Fig. 2.4 shows 4 stages of butterflies. In between a multiplication with the twiddle factor W is shown according to (2.7). 2.3 Hardware Architecture Figure 2.4 describes a Single-path Delay Feedback (SDF) architecture of a 16point FFT using multipliers. The pre-computed twiddle factors are stored in an array or ROM if large-point FFTs are to be designed. A controller is used to generate the addresses to access the stored twiddle factors and to control multiplexers in every clock cycle. 12 Fixed point 2’s complement data format is used. In 2’s complement, first bit of a positive number is always 0 and rest of bits are ordinary binary values. For negative numbers, invert its standard binary value and add one to it. Complex data (real and imaginary) has been used for both twiddle factors and input sequence. Both real and imaginary values are 15 bits each; 1 sign bit, 4 integer bits and 10 fractional bits to get 3 decimal places of accuracy in the fractional part. Data scaling and truncation between different stages of the architecture due to multiplication is done according to precision requirements. Fig. 2.4 shows an architecture for a 16-point FFT architecture with multipliers. In the first phase, 8 input samples are, coming serially, are stored in the 8 registers of 15 bits each during the first stage. Figure 2.4 A 16-point FFT architecture using multipliers In the second phase, the first 8 samples, stored in registers, are processed, in the butterfly, with the remaining 8 samples. . The reason behind is that the 1 st sample, x(0) has to be processed with the 9th sample, x(8). The second sample, x(1) should be processed with 10th sample, x(9). The butterfly will generate the sum and difference of two input samples. The result of difference of input samples will eventually be stored in the register and multiplied with twiddle factor. In contrast, the addition of two input samples is sent to the next stage without any multiplication with a twiddle factor. The MUXes in this architecture are controlled by a controller, which keep track of the values that are sent to the next stage in the specific clock cycle. In the second stage, the same operations are done but in two parts. The first 4 results from the first stage are stored in the 4 registers in the second stage. These 4 samples are processed with the upcoming 4 results in the butterfly. The same procedure is done with the next results from the first stage. This architecture generates the first end result in the 17th clock cycle. It produces all the 16 results in 32 clock cycles. 13 Twiddling factors approximation and complex multiplication can be done by a coefficient ROM and a multiplier, by using CORDIC algorithm, and by Parabolic Synthesis as described in Fig. 2.5. Figure 2.5 A 16-point FFT architecture Figure 2.5 shows a complete architecture of a 16-point FFT with both real and imaginary parts. Complex multiplier (comprising of 4 real multipliers and 2 real adders) and ROM can be replaced with CORDIC and Parabolic Synthesis algorithms to calculate real and imaginary values for next stage in the FFT. An angle “ ” and previous stage real and imaginary values are provided to the CORDIC algorithm, it will generate the final Xre and Xim values at the output for the next stage of the FFT. In the same manner, an angle “ ” and previous stage real and imaginary values are provided to the Parabolic Synthesis algorithm, which will generate the rotations, with the final Xre and Xim values at the output according to (2.3). 14 CHAPTER 3. CORDIC The CORDIC (COordinate Rotation DIgital Computer) is a simple and efficient “shift-add” algorithm to calculate the wide range of functions including trigonometric, logarithmic, hyperbolic and linear. It is commonly used when there is no hardware multiplier is available. The CORDIC algorithm was first described by Jack E.Volder in 1959. It was developed to provide the digital solution for the real-time navigation problems [1]. Two basic CORDIC modes are well known for the computation of different functions, the rotation mode and vector mode. The CORDIC algorithm can be taken in as iterative structure of additions, subtractions and binary shift operations, which can perform fixed angle rotation (also known as micro-rotations). The CORDIC algorithm is well suited for VLSI implementations due to the simplicity of the involved operations. The CORDIC does not provide the perfect rotation when it is involved with the complex multiplication because the rotated vector gets scaled. To achieve perfect angle rotation, the scaling factor can be easily corrected. The idea behind the CORDIC for rotating the complex number by successive constant angles is highly useful and needs to be compared with other algorithms, which are claiming the similar functionality. 3.1 Rotation As the nature of CORDIC algorithm is based on simplicity, and the basic observation of unit-length vector can give the foundation of CORDIC algorithm. The point z in Fig. 3.1 at coordinates (x, y) = (1, 0) is rotated by the angle and the new point z' will be at (x, y) = (cos ,sin ). Thus, the trigonometric functions cos and sin can be computed by finding the co-ordinates of point z', which is a rotated vector of angle [2]. The rotation of the vector in a rectangular coordinate system from one point to another point can be performed by simply multiplying the vector by a complex number. Let the initial vector is z = x + jy 15 (3.1) Y r= 1 z’=(cos ,sin ) z=(1,0) X Figure 3.1 Unit-length vector in Cartesian coordinate system And the desire vector is z' which is rotated by angle z = x + jy (3.2) z =ze (3.3) z' can easily be computed by multiplying z with e = cos + sin z = (x + jy)(cos + sin ) z = x cos + jx sin + jy cos z = (x cos (3.4) (3.5) y sin y sin ) + j(x sin + y cos ) (3.6) (3.7) Equation (3.7) gives us the new rotated points of the vector z , as shown in Fig. 3.2. The complex number rotations in 2-dimensions are commutative [3], unlike in higher level of dimensions. By observing the equation (3.5), whenever two complex numbers gets multiplied with each other, their phase is added up and the magnitude gets multiplied. Thus, during the multiplication with the complex number with unity magnitude we are not affecting the magnitude but only rotating the vector by the desired angle. Such a rotation is known as a Real rotation as shown in Fig. 3.2. 16 Figure 3.2 Real Rotation of the vector x = x cos y sin y = x sin + y cos (3.8) (3.9) Let us take (3.8) and (2.9) one step further and make our way towards the CORDIC x = cos (x y sin / cos ) y = cos (y + x sin / cos ) x = cos (x y tan ) y = cos (y + x tan ) (3.10) (3.11) In (3.10) and (3.11) the factor cos provides the scaling in rotated angle and causes the reduction in magnitude due to that fact that cos 1, but the term tan brings ease in computation because it is simply a shift operation. To further investigate the (3.10) and (3.11) and neglecting the scaling factor cos from the above equation 17 x =x y tan y = y + x tan (3.12) (3.13) Equation (3.12) and (3.13) are the coordinates of the rotated vector without any scaling factor, the absence of scaling factor can affect the results during the rotations but at the end of total computation this factor can be multiplied to balance the effect. The new rotated values x and y are shown in Fig. 3.3, such rotation is known as Pseudo Rotation. Fig. 3.3 shows the different between the real rotation and pseudo rotation. Figure 3.3 Real and Pseudo Rotation of the vector 3.2 CORDIC Formulation Jack E.Volder has proposed the idea to break the rotation angle into the series of smaller angles i [4] such that the resultant series could utilize the property of tangent function tan = The resultant series of angles is shown below in Table 3.1 18 (3.14) Table 3.3.1 Pre-Computed angle set tan i deg 0 45 1 26.5 0.5 2 13.25 0.25 3 7.125 0.125 4 3.576 0.0625 5 1.789 0.03125 6 0.895 0.015625 7 0.447 0.00781 8 0.223 0.00390 9 0.115 0.001953125 10 0.055 0.0009765625 1 Using (3.14) in (3.12) and (3.13), the foundation of the CORDIC equation is verified as, which is = = + 2 (3.15) 2 (3.16) The rotations in each stage around the desired angle could be clock-wise or anticlock-wise, and with the sequential values of i the angle and tan goes on decreasing as shown in following equations. = tan 2 =± tan …….. 2 tan 2 19 tan 2 tan 2 ……. During the angle rotation, the resultant angles either perform addition or subtraction depending on the operation deciding factor . Generally, the CORDIC algorithm for other mathematical functions depends on as-well. = = + 2 2 (3.17) (3.18) Example: To compute the angle = 30º, using the rotation algorithm, is done by breaking the angles into smaller parts 30º 45º – 26.6º + 14º – 7.1º + 3.6º +1.8º – 0.9º + 0.4º All the angles in series against the angle 30º can be taken from the Table 3.1. These angle values can easily be computed using shift, add and subtract operations. 3.3 Unrolled CORDIC ARCHITECTURE An eleven stage CORDIC is shown in Fig. 3.4. In the eleven adders at the top, the remaining angle is computed for each stage. The input variable is the angle for the cosine and sine that is searched for. At the top, the fixed coefficient angle values , for each rotation are provided. These are added or subtracted from in each stage. They are fixed coefficients (hardware wired) but they can be stored in a ROM as well. In the middle and the lower adder rows the approximation of the real and imaginary signals, Xre(n) and Xim(n), are computed and provided to the right. The initial vector values, x re(1) and xim(1), are provided to the left. The input signals are taken from previous real and imaginary butterflies (in the case of FFT). Furthermore, the output signals are provided to next real and imaginary butterflies. In each stage new vectors are determined, in order to converge towards the vector that approximates the vector that represents the angle . In each vector rotation stage there is a crosswise addition or subtraction of the vector coordinates. The vector rotation depend on the sign bit, sgn, in each stage of the upper row of adders. That is, the sign bits determine if it should be an addition or subtraction. There are also divisions of the vector coordinates by a factor corresponding to 2 i where i’s are the integers {1, 2, 3, 4, 5, 6 … n}. That is, division by {1, 2, 4, 8, 16, 32 … 2n}. This corresponds to the right shifts as discussed in the CORDIC algorithm, which are done by shifting the busses between the stages. There is thus no extra cost in hardware for the divisions. 20 Figure 3.4 An un-rolled 14-stage CORDIC architecture 21 3.4 CORDIC Scaling Factor The above CORDIC algorithm introduces the constant scaling factor (K) with the resulted rotated vector, and this factor needs to be multiplied. The property cos cos( brings more ease to pre-compute the scaling factor because, the direction of rotation does not matter in case of cosine function K= (1) …. K = 0.707 x 0.894 x 0.970 x 0.992 x 0.998 …. K 0.60725 Now, constant scaling factors which can be further designed with shifts and add operations to eliminate the need of multiplier, as shown in Fig. 3.5. Figure 3.5 Hardware Multiplication of scaling factor. 3.5 Generalize CORDIC Generalized CORDIC covers the equations for the linear, circular and hyperbolic systems and the (3.17) and (3.18) can be modified to accommodate these functions as well = 22 2 (3.19) = = + + 2 (3.20) (3.21) The parameters for generalized CORDIC algorithm are listed in Table 3.2 with the rotation type Table 3.3.2 Parameters for Generalized CORDIC Linear Rotation Circular Rotation Hyperbolic Rotation = 2 = tan = tanh 23 µ=0 2 2 µ=1 µ=-1 24 CHAPTER 4. Parabolic Synthesis Parabolic synthesis methodology [6] is devised to implement approximation of different unary functions e.g. trigonometric, logarithm, square root and division functions that are extremely important in the field of astronomy, digital signal processing, image processing, robotics, navigation systems etc. High speed applications in these fields need hardware implementation where software solutions in most of the cases are not sufficient. Parabolic synthesis methodology performs an approximation of original unary function by developing other sub-function and help functions. The architecture of the proposed methodology itself parallel in nature and take great advantage of parallelism to reduce the execution time. Only low complexity operations e.g. multiplications, shifts, additions are used that is easily implemented in the hardware. Parabolic synthesis methodology has been devised in order to generate better results in terms of speed, area, power and precision as compared to algorithms employed previously for the approximations of unary functions. Pre-processing and post-processing are two steps besides the most important approximation part that are used to normalize and transformation respectively. Therefore, the implementation is divided into three parts, normalizing, approximation and transforming. 4.1 The Methodology Pre-processing is the first step; normalization, approximation is the next one in which sub-function and help functions are developed and post-processing is the last step to transform the result according to the requirement. 4.1.1 Normalization Normalization is done to facilitate the hardware implementation by restricting the numerical range. The purpose of this step is to satisfy that the values should be in the interval of 0 < 1on the x-axis and 0 < 1on the y-axis keeping the coordinates of starting point at (0,0) and ending point having coordinates smaller than (1,1). Furthermore the function must be strictly concave or convex during the interval. 4.1.2 Hardware Architecture Development Low complexity operations are used when developing hardware architecture that approximates an original function. These operations include additions, multiplications and shifts, which are efficient for hardware implementation. The emergence of efficient multiplier architecture and downscaling of semiconductor 25 technologies has made the multiplication operation efficient in hardware implementation. The proposed methodology is based on decomposition of basic functions. It is based on second order parabolic functions since these functions can easily be implemented with low complexity operations. The recombination process is done with multiplication. The methodology is devised in terms of second order parabolic functions called sub-functions ( ), these sub-functions when recombined give the approximation ( ) as shown in (4.1). The accuracy of the results of the original function depends on the number of sub-functions used. ( )= ( ) ( ) … ( ) (4.1) ( ) The first help function ( ) is generated by dividing the original function ( ) with the first sub-function . This help function will be a parabolic looking function as shown in (4.2). ( )= ( ) (4.2) ( ) The subsequent help functions will be generated according to (4.3). The ( ) should be chosen to be feasible for hardware sub-function implementation. ( ) ( )= ( ) (4.3) 4.1.3 Methodology for developing sub-functions ( ) The sub-functions are developed by decomposing the original function according to second order parabolic functions in the interval 0 <1 and the subinterval within the interval. The second order parabolic function is the decomposition function and it is easily implemented in hardware with low complexity functions such as multiplication and addition. The first sub-function 26 The methodology to develop the first sub-function ( ) is by dividing ( ) with as a first order approximation. There are original function possibilities as a result of this division, one when ( ) > 1 and one when ( ( 1. The first sub-function ( ) as given by (4.4). The expression 1 + is utilized to get approximation of these functions. The first sub-function results in a second order parabolic function as in (4.4). ( )= (1 1+ In equation (4.4), the coefficient The second sub-function ) = + ( ) the two )< ) ( ) (4.4) is determined according to (4.5). ( ) = lim 1 (4.5) The first help-function ( ), is important in developing the second sub-function. The first help function ( ), is calculated according to (4.2) and dividing two continuously concave or convex functions having the same starting and ending point which will give rise to another function similar to a parabolic function. The second sub-function ( ), is developed according to second order parabolic function as shown in (4.6). ( ) =1+ ( ) (4.6) The coefficient in (4.6), is to satisfy the condition that the quotient between the first help-function and second sub-function is equal to 1 when is set to 0.5 in (4.7). =4 1 (4.7) The second help-function ( ) is developed in such a way that it can be divided into a pair of functions that look like parabolic functions, where the interval of first function is from 0 < 0.5 and for second, interval is from 0.5 < 1. Sub-function when n>2 27 For help functions ( ) when n>2, there are one or more pairs of parabolic looking functions. Each pair of parabolic looking function is divided into two parabolic help-functions in order to develop higher order sub-functions. A parabolic sub-function is developed as an approximation of the help function ( ) in the sub-interval. Sub-sub-functions are associated with a specific subinterval, and for this the subscript index is increased with the index m in sub-help( ). The numbers of sub-help-functions are doubled in every order function in of n>1. The corresponding sub-sub-functions are developed from these help functions. 4.2 Hardware Implementation An implementation for calculating real and imaginary values during the twiddle factor multiplication in the FFT using the proposed methodology of parabolic synthesis is described in this section. The hardware implementation is divided into three parts; pre-processing, processing and post-processing. Fixed point 2’s complement representation is used. 4.2.1 Pre-Processing Pre-processing is done to satisfy that the value of incoming angle is in the interval 0 <1 and transfer angles from quadrant 2,3 and 4 into quadrant 1, to achieve this operand in radians is multiplied with as shown in Fig. 4.2. Original angle= Normalized angle = f = = (integer part)(fractional part) (4.8) The integer part 1 0 represents the quadrant of the original angle and it will be used for 2’s complement conversion at the output and as MUX select bits. Quadrant 1, 2, 3 and 4 is represented by 0, 1, 2, and 3 respectively in binary representation as shown in Fig. 4.1. 28 Figure 4.1 Angle transformation from quadrant 2-4 to quadrant 1 The word length of the input angle is fifteen bits; one sign bit, four integer bits and ten fractional bits to get three fractional digit precision. It is multiplied with the factor to get the normalized angle. Figure 4.2 Hardware architecture for pre-processing 4.2.2 Processing The Discrete Fourier Transform (DFT) is defined as in (4.9) and (4.10) ( ) = = ( ) (4.9) (4.10) The DFT is complex valued but in hardware, the real (Re) and imaginary (Im) parts are separated in two real parts as shown in (4.11) and (4.12) 29 ( )= cos ( )= (4.11) sin sin + (4.12) cos (4.11) and (4.12) can also be presented for simplicity as in (4.13) and (4.14). = = × cos( ) sin( ) × sin( ) + cos( ) (4.13) (4.14) The sine and cosine values must be determined by approximation. The approximated sine and cosine functions according to first and second subfunctions (4.4) and (4.6) are in (4.15) and (4.16) respectively. = + × =1+ × =1 + =1+ × (4.15) × (4.16) These equations are developed according to [7] with coefficients = 0.571 and = 0.401. Only two sub-functions are used as it will be equivalent to precision CORDIC algorithm can deliver with 11 stages. The angle is the normalized fractional part of . It is only that differs from the sin and cosine functions. Since there is a connection between the sine and cosine terms in all 4 quadrants, the term is used instead of in first sub-function in the cosine equation. 30 cos sin sin 2 1 2 f f sin sin 2 f 2 2 cos 2 sin 1 cos f sin 2 sin f cos 2 2 1 f sin 1 2 f 2 2 f Figure 4.3 Input Transformation from quadrant 2-4 to quadrant 1 In figure (4.4), both sine and cosine functions are approximated and multiplied with previous real and values in the FFT to get new real and imaginary values. The normalized angle from pre-processing block is multiplied with is subtracted from to get - . The and in the next step itself to get and with simple shifts and result from - is multiplied with coefficients add operation in Multiple Constant Multiplication (MCM) block as shown in Fig. 4.5. - ) - ) and There is one input - and it will generate two results with shifts and adders. It will considerably reduce the hardware, compared to two multipliers. 31 xim 1- f f f x 2 f 0 + - 1 0 + + x x 2's compl. cos Xre + - 1- f +c1x( f - 2f) 2 f + 1 0 1 x xre + 1+c2x( f - 2 f) f +c1x( f - 2 f) 1 MCM unit f - + 1- f + 1 x 0 x + Xim 0 1 1- sin 2's compl. x f f xim Figure 4.4 Architecture for the multiplication of complex numbers with approximated trigonometric functions (sine and cosine) Figure 4.5 Architecture for Multiple Constant Multiplication 32 is In case of adding “1” to - ) term, there is no need of an adder. Since positive, and - ) always will be less than “1”. The fractional part can thus be merged with “1” directly with the wiring as shown in Fig. 4.6. Figure 4.6 The fractional bus with an added integer “1” This wiring will give the second sub-functions for both sine and cosine functions as shown in Fig. 4.4. Table 4.1 shows when the input transformations are needed according to quadrant of original angle . The integer value is used to select the MUXes as shown in Fig. 4.4. Table 4.4.1 Input Transformations 1st Quadrant Sine Cosine 1- f 1- 2nd Quadrant f f 3rd Quadrant 1- f 1- f 4th Quadrant f f f The upper adder with a MUX will generate first sub-function × for cosine function and lower adder with MUX will give × for sine function. The following multipliers will produce the approximated sine and cosine values. Since all the computation is done in the first quadrant, the output is to be transformed back again. Table 4.2 shows the output transformations necessary at the output. 33 Table 4.4.2 Output Transformations 1st Quadrant 2nd Quadrant 3rd Quadrant 4th Quadrant Sine + + - - Cosine + - - + For 2’s complement conversion, the architecture in Fig. 4.7 is used with half adders and XOR gates. A control signal or XOR is used to select when the conversion is to be done according to table 4.2. Figure 4.7 Architecture for 2’s complement conversion After post-processing, sine and cosine, the approximated values are multiplied with real and imaginary parts from previous stage of the FFT. The new real and imaginary values are calculated according to (4.13) and (4.14). 34 CHAPTER 5. Results The methodology for extracting results of area, timing and power constraints in both the designs is shown in Fig. 5.1. The HDL description of the designs is done in VHDL. The target design is synthesized into a gate-level netlist using an ASIC design synthesis tool, Synopsys Design Compiler with a standard cell library such as in our case LPLVT (Low Power Low Threshold Voltage) and LPHVT (Low Power High Threshold Voltage). The results are taken on different voltages (1V, 1.1V, 1.2V) on each library. Area and timing information for both the designs are gathered after the synthesis process. The synthesis process generates a Gate-level netlist of the design and a SDC file, which contains area and timing design constraints in the Synopsys Design Constraint format. MentorGraphics ModelSim is used to generate toggle information of the target design in VCD (Value Change Dump) file by using the Gate-level netlist and the SDC file from Design Compiler. The power analysis tool, Synopsys PrimeTime estimates the static and dynamic power dissipation in the target design using the design netlist and the VCD file. Figure 5.1 ASIC flow for results extraction 35 5.1 Area Area (um2) The CORDIC and Parabolic Synthesis algorithm designs are synthesized in Synopsys Design Compiler using STMicroelectronics 65-nm LPLVT and LPHVT CMOS libraries. The synthesis is done using different supply voltages i.e., 1V, 1.1V and 1.2V. Fig. 5.2 and Fig. 5.3 show total area of the CORDIC and Parabolic Synthesis algorithm designs respectively, in LPLVT and LPHVT CMOS libraries with different supply voltages. It is obvious that parabolic synthesis design is more area efficient than 14 stage unrolled CORDIC design. 19650 19600 19550 19500 19450 19400 19350 19300 19250 19200 19150 LPHVT LPLVT 1 1,1 1,2 Supply Voltage (V) Area( um2) Figure 5.2 Total Area analysis of CORDIC design in LPLVt and LPHVt 16500 16450 16400 16350 16300 16250 16200 16150 16100 16050 16000 LPHVT LPLVT 1 1,1 Supply Voltage (V) 1,2 Figure 5.3 Total Area analysis of Parabolic Synthesis design in LPLVt and LPHVt 36 Table 5.1 Total area of CORDIC and Parabilic Synthesis in LPLVt and LPHVt Architecture CORDIC 1.1V Parabolic Synthesis Supply Voltage (V) 1V 1.2V 1V 1.1V 1.2V Area (um2) LPLVT 19303 19363 19370 16322 16331 16324 Area (um2) LPHVT 19502 19487 19491 16296 16304 16279 5.2 Timing Propogation delay (ns) Fig. 5.4 and Fig. 5.5 show the critical path propagation delay timing for the CORDIC and Parabolic Synthesis designs respectively. It can be seen that the propagation delay is less in parabolic synthesis compared to the 14-stage un-rolled CORDIC design. The reason for that is the shorter critical path in the Parabolic Synthesis design due to the high degree of parallelism. There is also a difference in delay in two different libraries LPLVT and LPHVT. The delay is smaller in LPLVT, that is, it is almost half of that in LPHVT. 50 45 40 35 30 25 20 15 10 5 0 LPHVT LPLVT 1 1,1 Supply Voltage (V) 1,2 Figure 5.4 Propagation delay of CORDIC design in LPLVt and LPHVt 37 Propogation delay (ns) 35 30 25 20 15 LPHVT 10 LPLVT 5 0 1 1,1 Supply Voltage (V) 1,2 Figure 5.5 Propagation delay of Parabolic Synthesis design in LPLVt and LPHVt Table 5.2 Total Propagation Delay of CORDIC and Parabilic Synthesis in LPLVt and LPHVt Architecture CORDIC 1.1V Voltage Supply (V) 1V Propagation delay (ns) LPLVT 23.4 19.13 16.25 13.95 11.5 9.85 Propagation delay (ns) LPHVT 47 18.66 37 1.2V Parabolic Synthesis 26 1V 32.7 1.1V 1.2V 24.1 5.3 Static Power Fig. 5.6 and Fig. 5.7 show the static power dissipation in the CORDIC and Parabolic Synthesis designs respectively, using LPLVT and LPHVT CMOS libraries. The static power dissipation is almost zero with respect to LPLVT transistor technology. The static power dissipation in the Parabolic Synthesis is less than in the CORDIC algorithm. 38 Leakage Power (uW) 14 12 10 8 6 LPLVT 4 LPHVT 2 0 1 1,1 1,2 Supply Voltage (V) Figure 5.6 Static Power (Leakage) of CORDIC design in LPLVtt and LPHVt Leakage Power (uW) 8 7 6 5 4 LPLVT 3 LPHVT 2 1 0 1 1,1 Supply Voltage (V) 1,2 Figure 5.7 Static Power (Leakage) of Parabolic Synthesis design in LPLVt and LPHVt 39 Table 5.3 Static Power (Leakage) of CORDIC and Parabolic Synthesis design in LPLVt and LPHVt Architecture CORDIC Supply Voltage (V) Parabolic Synthesis 1 1.1 1.2 1 1.1 1.2 Static Power (uW) LPLVT 5.67 8 11.8 3 5 7.21 Static Power (uW) LPLVT 0.03 0.043 0.001 0.021 0.031 0.045 5.4 Dynamic Power Fig. 5.8 and Fig. 5.9 describes the total dynamic power (both cell internal power and Net Switching power) in the unrolled-CORDIC and Parabolic Synthesis designs respectively, using LPLVT and LPHVT design libraries. The architectures are tested on maximum frequencies. Dynamic Power (uW) 600 500 400 300 LPHVT 200 LPLVT 100 0 1 1,1 1,2 Supply Voltage (V) Figure 5.8 Total Dynamic Power of CORDIC design in LPLVt and LPHVt 40 Dynamic Power (uW) 450 400 350 300 250 200 150 100 50 0 LPHVT LPLVT 1 1,1 Supply Voltage (V) 1,2 Figure 5.9 Total Dynamic Power of Parabolic Synthesis design in LPLVt and LPHVt Table 5.4 Dynamic Power of CORDIC and Parabolic Synthesis design in LPLVt and LPHVt Architecture CORDIC Parabolic Synthesis Voltage Supply (V) 1V 1.1V 1.2V 1V 1.1V 1.2V Dynamic Power (uW) LPLVT 267 330 404 203 250 306 Dynamic Power (uW) LPHVT 355 427 498 299 339 409 5.5 Energy Fig. 5.8 and Fig 5.9 for dynamic power comparison gives us total dynamic power consumed by Parabolic Synthesis and Cordic, but due to different design approaches, both architectures have different frequencies, as shown in Fig. 5.4 and fig. 5.5. Total energy consumed by both architectures at their maximum frequencies is shown below in Fig. 5.10 and Fig. 5.11. It is clearly seen that Parabolic synthesis is about 50% more energy efficient than Cordic. 41 Energy consumption (J) 1,8E-8 1,6E-8 1,4E-8 1,2E-8 1,0E-8 8,0E-9 6,0E-9 4,0E-9 2,0E-9 0,0E+0 LPHVT LPLVT 1 1,1 1,2 Supply Voltage (V) Figure 5.10 Energy Consumption in CORDIC design in LPLVt and LPHVt Energy consumption (J) 1,2E-8 1,0E-8 8,0E-9 6,0E-9 LPHVT 4,0E-9 LPLVT 2,0E-9 0,0E+0 1 1,1 Supply Voltage (V) 1,2 Figure 5.11 Energy Consumption in Parabolic Synthesis design in LPLVt and LPHVt 42 Table 5.5 Energy Consumption in CORDIC and Parabolic Synthesis design Architecture CORDIC Parabolic Synthesis Voltage Supply (V) 1 1.1 1.2 1 1.1 1.2 Energy consumption (nJ) LPLVT 6.24 6.37 6.67 2.83 2.87 3.01 Energy consumption (nJ) LPHVT 16.68 15.79 12.94 9.77 8.16 7.63 43 Table 5.6.1 required memories, registers and multipliers for FFT points No of FFT points No of register/Rom cells required to store the Twiddling factors (Real + Imaginary) Real Multipliers 16 16 12 32 32 16 64 64 20 128 128 24 256 256 28 512 512 32 1024 1024 36 2028 2048 40 44 CHAPTER 6. Conclusions A comparison of two different algorithms, un-rolled CORDIC and Parabolic Synthesis, has been made for area, speed and energy. These algorithms are utilized to approximate real and imaginary values in an FFT. Comparisons have been done on different transistor technologies at different voltages. The results show that the parabolic synthesis algorithm is better in energy consumption, speed and area. The area utilization in the Parabolic Synthesis is 16% less than in the un-rolled CORDIC for the specific precision and energy consumption, on the other hand, is 50% less than the CORDIC. In the same manner, the timing of the parabolic synthesis is about 1.4 times better than the un-rolled CORDIC algorithm. It is also concluded that, when using a LPLVT transistor technology, the static power is higher but less dynamic power with comparison of LPHVT transistor technology. Since the cell size in both the technologies is almost same, the area utilization comes to be very close to each other. Speed is better when using LPLVT transistor technology as expected due to less parasitic effects. An investigation of the rotations, using twiddle factor multiplications, is not included here. However, a rough estimate on how the number of register/ROM cells, to store the twiddle factors, increase with the number of FFT points, can be done as shown in Table 5.1 It can be noted that the number of twiddle factor cells are increasing exponentially. For small FFTs, registers can be beneficial but for large FFTs, A ROM will be preferred. But the size of memory increases exponentially with increase in FFT points, whereas on other hand the Parabolic synthesis and the CORDIC algorithms computes rotations in the real time, hence, there is no need to store twiddle factors. 45 CHAPTER 7. Future Work The comparison of Un-rolled CORDIC and Parabolic Synthesis algorithms for twiddle factors can also be carried out with multiplier based FFT architecture which needs ROM to store larger number of twiddle factors. The multiplier based FFT architecture is area efficient for the low number of FFT points. However, when the number of points is increasing, it utilizes an exponentially larger chip area. Parabolic Synthesis algorithm can also be improved with respect to wordlength optimization between the architecture. 46 References [1] Ray Andraka, “A survey of CORDIC algorithms in FPGA based computers”, Andraka Consulting Group, Inc. North Kingstown, USA. [2] Y. H Hu and S. Naganathan, “An angle recording method for CORDIC algorithm implementation”, IEEE Transaction on Computers, vol. 42, issue 1, pp. 99-102, 1993. [3] Axler, Sheldon “Linear Algebra Done Right, 2e”, Springer Press, ISBN 978-0387-98259-5. [4] Jack E. Volder, “The CORDIC Trigonometric Computing Technique”, IRE transactions on Electronic Computers, vol. EC-8, issue 3, pp. 330-334, 1959. [5] E. Hertz, P. Nilsson, “A Methodology for Parabolic Synthesis of Unary Functions for Hardware Implementation,” Proc. of the 2nd International Conference on Signals, Circuits and Systems, ISBN-13: 978-1-4244-2628-7, pp 16, Hammamet, Tunisia, Nov. 2008. [6] Erik Hertz and Peter Nilsson, “Parabolic Synthesis Methodology”, In the proceedings of the 2010 GigaHertz Symposium, pp 35, March 9-10, 2010, Lund, Sweden. [7] E. Hertz and P. Nilsson, “Parabolic Synthesis Methodology Implemented on the Sine Function”, in Proceedings of the 2009 IEEE International Symposium on Circuits and Systems, pp 253-256,Taipei, May 24-27, 2009. 47 [8] J. W. Cooley and J. W. Tukey, An algorithm for machine calculation of complex Fourier series, Math. Comp., 19 (1965), 297-301. [9] Kamisetty Rao, Do Nyeon Kim, Jae Jeong Hwang, Fast Fourier Transform: Algorithms and Application, Springer Press, ISBN 976-1-40206628-3. [10] Alan V. Oppenheim, Ronald W. Schafer, Discrete-time signal processing, Prentice Hall, ISBN 978-0-13198842-2, the University of Michigan, Michigan, USA. List of Figures Figure 2.1 A Radix-2 DIT Butterfly...................................................................... 10 Figure 2.2 A Radix-2 DIF Butterfly....................................................................... 11 Figure 2.3 A 16-point radix-2 DIF FFT algorithm.................................................. 12 Figure 3.1 Unit-length vector in Cartesian coordinate system............................. 16 Figure 3.2 Real Rotation of the vector ................................................................ 17 Figure 3.3 Real and Pseudo Rotation of the vector ............................................. 18 Figure 3.4 An un-rolled 14-stage CORDIC architecture ....................................... 21 Figure 3.5 Hardware Multiplication of scaling factor. ......................................... 22 Figure 4.1 Angle transformation from quadrant 2-4 to quadrant 1 ..................... 29 Figure 4.2 Hardware architecture for pre-processing ......................................... 29 Figure 4.3 Input Transformation from quadrant 2-4 to quadrant 1 ..................... 31 Figure 4.4 Architecture for the multiplication of complex numbers with approximated trigonometric functions (sine and cosine).................................... 32 Figure 4.5 Architecture for Multiple Constant Multiplication.............................. 32 Figure 4.6 The fractional bus with an added integer “1” ..................................... 33 Figure 4.7 Architecture for 2’s complement conversion ..................................... 34 Figure 5.1 ASIC flow for results extraction .......................................................... 35 Figure 5.2 Total Area analysis of CORDIC design in LPLVt and LPHVt................... 36 Figure 5.3 Total Area analysis of Parabolic Synthesis design in LPLVt and LPHVt . 36 Figure 5.4 Propagation delay of CORDIC design in LPLVt and LPHVt ................... 37 Figure 5.5 Propagation delay of Parabolic Synthesis design in LPLVt and LPHVt .. 38 48 Figure 5.6 Static Power (Leakage) of CORDIC design in LPLVt and LPHVt............ 39 Figure 5.7 Static Power (Leakage) of Parabolic Synthesis design in LPLVt and LPHVt ................................................................................................................ 39 Figure 5.8 Total Dynamic Power of CORDIC design in LPLVt and LPHVt............... 40 Figure 5.9 Total Dynamic Power of Parabolic Synthesis design in LPLVt and LPHVt .......................................................................................................................... 41 Figure 5.10 Energy Consumption in CORDIC design in LPLVt and LPHVt .............. 42 Figure 5.11 Energy Consumption in Parabolic Synthesis design in LPLVt and LPHVt .......................................................................................................................... 42 List of Tables Table 3.3.1 Pre-Computed angle set .................................................................. 19 Table 3.3.2 Parameters for Generalized CORDIC ................................................ 23 Table 4.4.1 Input Transformations ..................................................................... 33 Table 4.4.2 Output Transformations .................................................................. 34 Table 5.1 Total area of CORDIC and Parabilic Synthesis in LPLVt and LPHVt ......... 37 Table 5.2 Total Propagation Delay of CORDIC and Parabilic Synthesis in LPLVt and LPHVt ................................................................................................................. 38 Table 5.3 Static Power (Leakage) of CORDIC and Parabolic Synthesis design in LPLVt and LPHVt ................................................................................................. 40 Table 5.4 Dynamic Power of CORDIC and Parabolic Synthesis design in LPLV t and LPHVt ................................................................................................................. 41 Table 5.5 Energy Consumption in CORDIC and Parabolic Synthesis design.......... 43 Table 6.1.1 required memories, registers and multipliers for FFT points............. 44 49

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement