Master’s Thesis

Complexity Reduction in the CORDIC Algorithm by using MUXes

Yuhang Sun

Department of Electrical and Information Technology, Faculty of Engineering, LTH, Lund University, SE-221 00 Lund, Sweden

August 2015

Abstract

The CORDIC algorithm plays an important role in computing non-linear functions in hardware. This thesis describes a novel methodology to reduce the complexity of an unrolled CORDIC architecture, which gives higher speed, smaller area, and lower power consumption: MUXes are used to replace adder stages. Five different unrolled CORDIC architectures have been implemented in ASIC using a 65nm CMOS technology with Low Power High $V_T$ transistors. The area, computational speed, accuracy, error behavior, and power consumption have been analyzed. The design aim is to reduce the power consumption, which is becoming increasingly important relative to the area. As a result, the area and the power consumption are 7.9% and 27.2% lower, respectively, and the speed is 22.9% higher compared to the original unrolled CORDIC architecture.

Keywords: CORDIC, power consumption.

Acknowledgments

This Master’s thesis would not exist without the support and guidance of many people: my friends, my parents, and especially my supervisors. I would like to express my deepest gratitude to my supervisors, Professor Peter Nilsson, Erik Hertz, and Rakesh Gangarajaiah, who guided and helped me in this thesis work. Finally, thanks to my relatives, who kept encouraging me, and to my parents, who made it possible for me to study in Sweden.

Yuhang Sun

Contents

Abstract
Acknowledgments
List of figures
List of tables
List of acronyms
1. Introduction
1.1. Thesis outlines
2. Theory
2.1. The CORDIC algorithm
2.2. Two starting angles
2.3. Four starting angles
3. Software simulation
3.1. Outputs simulation
3.2. Error behavior
3.3. Accuracy
4. Hardware implementation
4.1. Design flow
4.2. Hardware architectures and simulation results
4.2.1. The original unrolled CORDIC architecture
4.2.2. First stage removed architecture
4.2.3. Two stages eliminated architecture
4.2.4. Three stages eliminated architecture
4.2.5. Final architecture
5. Results
5.1. Test setup
5.2. The area
5.3. Timing information
5.4. Power consumption
5.4.1. Power analysis
6. Conclusions
7. Future work
References
Appendix A

List of figures

Fig. 1. Vector rotations diagram (five stages)
Fig. 2. The last vector ends up on the unit circle (five stages)
Fig. 3. Two starting angles (three stages)
Fig. 4. Four starting angles (two stages)
Fig. 5. MATLAB simulation of CORDIC design
Fig. 6. The cosine error after truncation
Fig. 7. The cosine error distribution
Fig. 8. The absolute cosine error in dB
Fig. 9. Degrees and the number of error bits
Fig. 10. The design flow
Fig. 11. An unrolled 5-stage CORDIC architecture
Fig. 12. First stage removed architecture
Fig. 13. Two stages are removed in the architecture
Fig. 14. Three stages are removed architecture
Fig. 15. Final architecture
Fig. 16. The Sgn detector
Fig. 17. The design test setup
Fig. 18. Area at different frequencies
Fig. 19. Power at various frequencies
Fig. 20. Magnified diagram for Fig. 19

List of tables

TABLE I. Power at 10MHz
TABLE II. Power at maximum frequency
TABLE III. Timing at maximum speed constraints
TABLE IV. Timing at minimum area constraints
TABLE V. Area at minimum area constraints
TABLE VI. Area at maximum speed constraints
TABLE VII. Power at 10MHz
TABLE VIII. Power at maximum frequency
TABLE IX. Timing at maximum speed constraints
TABLE X. Timing at minimum area constraints
TABLE XI. Area at minimum area constraints
TABLE XII. Area at maximum speed constraints
TABLE XIII. Power at 10MHz
TABLE XIV. Power at maximum frequency
TABLE XV. Timing at maximum speed constraints
TABLE XVI. Timing at minimum area constraints
TABLE XVII. Area at minimum area constraints
TABLE XVIII. Area at maximum speed constraints
TABLE XIX. Power at 10MHz
TABLE XX. Power at maximum frequency
TABLE XXI. Timing at maximum speed constraints
TABLE XXII. Timing at minimum area constraints
TABLE XXIII. Area at minimum area constraints
TABLE XXIV. Area at maximum speed constraints
TABLE XXV. Power at 10MHz
TABLE XXVI. Power at maximum frequency
TABLE XXVII. Timing at maximum speed constraints
TABLE XXVIII. Timing at minimum area constraints
TABLE XXIX. Area at minimum area constraints
TABLE XXX. Area at maximum speed constraints
TABLE XXXI. Area at minimum area constraints
TABLE XXXII. Area at maximum speed constraints
TABLE XXXIII. Timing at maximum speed constraints
TABLE XXXIV. Timing at minimum area constraints
TABLE XXXV. Power at maximum speed constraints
TABLE XXXVI. Power at minimum area constraints
TABLE XXXVII. Coefficient angles

List of acronyms

CORDIC: COordinate Rotation DIgital Computer
LUT: Look Up Table
ASIC: Application Specific Integrated Circuit
VHDL: Very High speed integrated circuit Hardware Description Language
CMOS: Complementary Metal Oxide Semiconductor
VCD: Value Change Dump
VLSI: Very Large Scale Integrated circuit

1. Introduction

Non-linear functions play an important role in many hardware designs.
The area, computational speed, accuracy, error behavior, and power consumption are the important factors to consider in hardware design. Several algorithms can be chosen for the implementation of non-linear functions.

A Look-Up Table (LUT) is a simple and direct method to compute non-linear functions. However, a look-up table is suitable only if the precision is low or the area is not a primary concern. The size of the table grows exponentially, which makes this method unsuitable for hardware design when high precision is required.

Another method, Parabolic Synthesis [1], computes unary functions with parabolic functions. It is a novel methodology aimed at high-speed computation, and its advantage is a short delay.

Polynomial approximations [2] are implemented with multipliers and adders, using an iterative algorithm. The choice of polynomial determines how closely the polynomial curve can follow the target function, and this is where the error is generated. Very high order polynomials are used if high accuracy is required. To meet the requirement, the polynomial approximations can be implemented with least squares approximations, which minimize the average error, or least maximum approximations, which minimize the worst-case error.

This thesis work focuses on the CORDIC algorithm. The COordinate Rotation DIgital Computer (CORDIC) algorithm [3], also known as the digit-by-digit method or Volder’s algorithm, was first described by Jack E. Volder in 1959. It is an efficient algorithm to compute non-linear functions, especially trigonometric functions, in hardware. Compared to the other methods described above, CORDIC has no multipliers: it is based only on additions, subtractions, and bit shifts.
Traditionally, iterative CORDICs [4] have been widely used since they cost less area. Nowadays, power consumption is more important than area in hardware design [5], which makes unrolled CORDICs feasible to use. Meanwhile, it is necessary to reduce the complexity, since this also reduces the power consumption, which is the aim of this thesis.

The implementation includes five different unrolled CORDIC architectures in an Application Specific Integrated Circuit (ASIC). One is the original CORDIC architecture and the other four are CORDIC architectures with complexity reduced to various levels. The designs are written in Very High Speed Integrated Circuit Hardware Description Language (VHDL) and use a 65nm CMOS technology with Low Power High $V_T$ transistors. The supply voltage is VDD = 1.2 volts and the temperature is 25 degrees Celsius. As a result, area, timing, and power consumption, operating under different frequencies, are reported for the five different unrolled CORDIC architectures.

1.1. Thesis outlines

The remaining sections are outlined below:

Section 2 introduces the CORDIC algorithm and methods to reduce the complexity.
Section 3 describes the software simulation, accuracy, and error behavior of the CORDIC algorithm.
Section 4 describes the hardware implementations of the five different unrolled CORDIC architectures.
Section 5 lists and analyzes the results of section 4, consisting of area, timing, and power consumption.
Section 6 concludes this thesis work.
Section 7 analyzes possible future work after this thesis work.

2. Theory

In this section, the CORDIC algorithm and the methods to reduce the complexity are introduced. Using two starting angles and four starting angles together with MUXes removes one adder stage and two adder stages, respectively, in the unrolled CORDIC architectures.

2.1. The CORDIC algorithm

The CORDIC algorithm, which is based on vector rotations, is used to get an approximation of a non-linear function.
In this thesis, the sine and cosine functions are implemented with the CORDIC algorithm. An iterative CORDIC architecture consists of several adder stages. When the number of stages increases, the error between the approximation and the original function gets smaller, which improves the accuracy.

In Fig. 1, a 30 degree angle is taken as the input angle. There are five vector rotations, which means there are five stages. The initial vector, at 0 degrees, is rotated by 45 degrees, which is detected to be larger than the input angle. In the next stage, the vector is rotated by -27 degrees, and so on, to approximate the input angle. A positive angle means that the rotation is counter-clockwise and a negative angle means that it is clockwise. After five vector rotations, the angle of the last vector approximates the input angle. To improve the accuracy, more stages can be used, which brings the approximation arbitrarily close to the input angle. The angles 45 and 27 degrees are coefficient angles, which are shown in TABLE XXXVII in Appendix A.

Fig. 1. Vector rotations diagram (five stages)

The function of the coefficient angles is shown in (1), where $angle(i)$ is the coefficient angle and $i$ is the stage number. The first five coefficient angles are given in (2).

$$
angle(i) = \arctan\left(1/2^{i-1}\right) \tag{1}
$$

$$
\begin{cases}
angle(1) = \arctan(1/2^{1-1}) = 45 \\
angle(2) = \arctan(1/2^{2-1}) = 26.57 \\
angle(3) = \arctan(1/2^{3-1}) = 14.04 \\
angle(4) = \arctan(1/2^{4-1}) = 7.13 \\
angle(5) = \arctan(1/2^{5-1}) = 3.58
\end{cases} \tag{2}
$$

The output is the coordinates of the last vector, $(x, y)$. In this thesis, the CORDIC algorithm is used to compute the sine and cosine functions. The approximated sine and cosine values of the input angle are shown in (3) and (4),

$$
\sin a = \frac{y}{r} \tag{3}
$$

$$
\cos a = \frac{x}{r} \tag{4}
$$

where $a$ is the input angle and $r$ is the length of the last vector.
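The rotation scheme and the normalisation in (3) and (4) can be illustrated with a short floating-point sketch in Python. This is only an illustration under stated assumptions, not the thesis's MATLAB model: the function names, the unit-length initial vector (1, 0), and the default of 18 stages are choices made here for the example.

```python
import math


def coefficient_angles(n_stages):
    """Coefficient angles in degrees: angle(i) = arctan(1 / 2**(i - 1)), i = 1..n."""
    return [math.degrees(math.atan(2.0 ** -(i - 1))) for i in range(1, n_stages + 1)]


def cordic_sin_cos(angle_deg, n_stages=18):
    """Rotate the vector (1, 0) towards angle_deg, then normalise by its length r."""
    x, y = 1.0, 0.0   # initial vector at 0 degrees (unit length is an assumption)
    z = angle_deg     # angle that still remains to be rotated
    for i, coeff in enumerate(coefficient_angles(n_stages)):
        d = 1.0 if z >= 0 else -1.0   # +1: counter-clockwise, -1: clockwise
        shift = 2.0 ** -i             # 1, 1/2, 1/4, ... (a wired shift in hardware)
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * coeff
    r = math.hypot(x, y)              # length of the last vector
    return y / r, x / r               # (sin a, cos a), as in (3) and (4)


if __name__ == "__main__":
    print([round(a, 2) for a in coefficient_angles(5)])
    print(cordic_sin_cos(30.0))
```

For a 30 degree input the rotation directions are +, -, +, -, +, which matches the sequence described for Fig. 1. Dividing by $r$ at the end is convenient in software; a hardware design would rather choose the start length so that no division is needed.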
It can be noted that if $r = 1$, the sine and cosine values are exactly the coordinates, $x = \cos a$ and $y = \sin a$, which means that the last vector should end up on the unit circle, as shown in Fig. 2.

Fig. 2. The last vector ends up on the unit circle (five stages)

To make the last vector end up on the unit circle, the length of the first vector, $r(1)$, must be known. There are many ways to calculate $r(1)$. In this thesis, (5) is used. With the length of the last vector set to 1 and $v(5) = 3.58^\circ$, given in (2), the result is $r(1) = 0.6088$, so the first vector coordinates are (0.6088, 0). In this case, the output corresponds to the cosine and sine values of the input angle. The hardware architecture is shown in Fig. 11 in section 4.2.1.

$$
r(i-1) = r(i) \times \cos(v(i-1)) \tag{5}
$$

2.2. Two starting angles

The idea of this project is to use more than one starting angle. This reduces the number of stages, while MUXes are introduced into the design, as discussed in section 4. In Fig. 3, 30 degrees and 60 degrees are used as the two input angles.

Fig. 3. Two starting angles (three stages)

In Fig. 3, the input angle is compared with the 45 degree angle, which is implemented with two MUXes in hardware. The architecture is shown in Fig. 13 in section 4.2.3. The MUXes are used to select the starting vector coordinates.

When the input is a 60 degree angle, which is larger than 45 degrees, the vector rotates in the red region. The starting vector coordinates are $(x_1, y_1)$. After three rotations, the output is $(x_3, y_3)$.

When the input is a 30 degree angle, which is smaller than 45 degrees, the vector rotates in the blue region. The starting vector coordinates are $(x_1', y_1')$. After three rotations, the output is $(x_3', y_3')$.

When the input angle is exactly 45 degrees, both the red and the blue rotations are suitable.
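As a minimal sketch of the two-starting-angle idea (not the thesis's VHDL design), the Python code below replaces the first two rotation stages with a selection between two precomputed start vectors, which plays the role of the MUXes. The function names, the floating-point arithmetic, and the default start length `x1 = 1.0` are assumptions made for illustration. For inputs between 0 and 90 degrees the first stage always rotates by +45 degrees, giving $(x_1, x_1)$, and the second stage with its shift of 1/2 gives $(x_1/2, 3x_1/2)$ for inputs above 45 degrees or $(3x_1/2, x_1/2)$ for inputs below, so a single comparison is enough to select the start vector.

```python
import math


def rotate_stages(x, y, z, first_stage, last_stage):
    """Apply CORDIC stages first_stage..last_stage (1-based); z is the residual angle in degrees."""
    for i in range(first_stage, last_stage + 1):
        shift = 2.0 ** -(i - 1)
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.degrees(math.atan(shift))
    return x, y


def full_cordic(angle_deg, n_stages, x1=1.0):
    """Reference: all stages, starting from the vector (x1, 0)."""
    return rotate_stages(x1, 0.0, angle_deg, 1, n_stages)


def two_start_cordic(angle_deg, n_stages, x1=1.0):
    """Stages 1 and 2 replaced by a MUX-like selection; valid for 0..90 degrees."""
    z = angle_deg - math.degrees(math.atan(1.0))   # residual after the +45 degree stage
    a2 = math.degrees(math.atan(0.5))
    if angle_deg > 45.0:                 # select = 1: start at 45 + 26.57 degrees
        x, y, z = 0.5 * x1, 1.5 * x1, z - a2
    else:                                # select = 0: start at 45 - 26.57 degrees
        x, y, z = 1.5 * x1, 0.5 * x1, z + a2
    return rotate_stages(x, y, z, 3, n_stages)


if __name__ == "__main__":
    for deg in (30.0, 60.0):
        print(deg, full_cordic(deg, 18), two_start_cordic(deg, 18))
```

Away from the 45 degree boundary, both functions produce the same vector, which is the sense in which the selection removes stages without changing the result.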
$$
v(1) =
\begin{cases}
\arctan(1) - \arctan(1/2) = 18.43 \\
\arctan(1) + \arctan(1/2) = 71.56
\end{cases} \tag{6}
$$

Equation (6) indicates that the starting vector angle is 18.4349 degrees for input angles below 45 degrees and 71.5651 degrees for input angles above 45 degrees. Fig. 11 in section 4.2.1 shows the original unrolled CORDIC architecture, from which the starting vector coordinates for Fig. 3 can be obtained. Starting from the left of the architecture in Fig. 11, the first stage is shown in (7), where $x_1$ is the $r(1)$ of section 2.1.

$$
\begin{cases}
y_2 = x_1 \\
x_2 = x_1
\end{cases} \tag{7}
$$

$$
\begin{cases}
y_3 = y_2 + \dfrac{x_2}{2} = \dfrac{3x_1}{2} = 0.9132 \\
x_3 = x_2 - \dfrac{y_2}{2} = \dfrac{x_1}{2} = 0.3044
\end{cases} \tag{8}
$$

$$
\begin{cases}
y_3 = y_2 - \dfrac{x_2}{2} = \dfrac{x_1}{2} = 0.3044 \\
x_3 = x_2 + \dfrac{y_2}{2} = \dfrac{3x_1}{2} = 0.9132
\end{cases} \tag{9}
$$

In the second stage, the starting vector coordinates can be obtained. When the input is larger than 45 degrees, the coordinates shown in (8) are used, and when it is smaller than 45 degrees, the coordinates shown in (9) are used. As a result, the architecture in Fig. 13 has two stages fewer than the one in Fig. 11.

2.3. Four starting angles

Fig. 4 shows the four starting angles algorithm. 4 degrees, 30 degrees, 60 degrees, and 86 degrees are used as the four input angles. The rotations depend on the comparisons between the input angle and three different angles, 18.43 degrees, 71.56 degrees, and 45 degrees from (6), which are implemented with six MUXes in hardware. The architecture is shown in Fig. 14 in section 4.2.4.

Fig. 4. Four starting angles (two stages)

$$
v(1) =
\begin{cases}
\arctan(1) - \arctan(1/2) - \arctan(1/4) = 4.40 \\
\arctan(1) - \arctan(1/2) + \arctan(1/4) = 32.47 \\
\arctan(1) + \arctan(1/2) - \arctan(1/4) = 57.53 \\
\arctan(1) + \arctan(1/2) + \arctan(1/4) = 85.60
\end{cases} \tag{10}
$$

Equation (10) indicates the starting vector angles for input angles in four different regions. When the input angle is from 0 degrees to 18.43 degrees, the starting vector angle is 4.40 degrees. When the input angle is from 18.43 degrees to 45 degrees, the starting vector angle is 32.47 degrees.
When the input angle is from 45 degrees to 71.56 degrees, the starting vector angle is 57.53 degrees. When the input angle is from 71.56 degrees to 90 degrees, the starting vector angle is 85.60 degrees.

With the same methodology as described in section 2.2, the third stage in Fig. 11 is shown in (11), (12), (13), and (14).

$$
\begin{cases}
y_4 = y_3 - \dfrac{x_3}{4} = \dfrac{x_1}{2} - \dfrac{3x_1}{8} = \dfrac{x_1}{8} = 0.0761 \\
x_4 = x_3 + \dfrac{y_3}{4} = \dfrac{3x_1}{2} + \dfrac{x_1}{8} = \dfrac{13x_1}{8} = 0.9893
\end{cases} \tag{11}
$$

$$
\begin{cases}
y_4 = y_3 + \dfrac{x_3}{4} = \dfrac{x_1}{2} + \dfrac{3x_1}{8} = \dfrac{7x_1}{8} = 0.5327 \\
x_4 = x_3 - \dfrac{y_3}{4} = \dfrac{3x_1}{2} - \dfrac{x_1}{8} = \dfrac{11x_1}{8} = 0.8371
\end{cases} \tag{12}
$$

$$
\begin{cases}
y_4 = y_3 - \dfrac{x_3}{4} = \dfrac{3x_1}{2} - \dfrac{x_1}{8} = \dfrac{11x_1}{8} = 0.8371 \\
x_4 = x_3 + \dfrac{y_3}{4} = \dfrac{x_1}{2} + \dfrac{3x_1}{8} = \dfrac{7x_1}{8} = 0.5327
\end{cases} \tag{13}
$$

$$
\begin{cases}
y_4 = y_3 + \dfrac{x_3}{4} = \dfrac{3x_1}{2} + \dfrac{x_1}{8} = \dfrac{13x_1}{8} = 0.9893 \\
x_4 = x_3 - \dfrac{y_3}{4} = \dfrac{x_1}{2} - \dfrac{3x_1}{8} = \dfrac{x_1}{8} = 0.0761
\end{cases} \tag{14}
$$

When the input angle is from 0 degrees to 18.43 degrees, the coordinates shown in (11) are used. When the input angle is from 18.43 degrees to 45 degrees, the coordinates shown in (12) are used. When the input angle is from 45 degrees to 71.56 degrees, the coordinates shown in (13) are used. When the input angle is from 71.56 degrees to 90 degrees, the coordinates shown in (14) are used. As a result, the architecture in Fig. 14 has three stages fewer than the one in Fig. 11.

3. Software simulation

In this section, the outputs of the designs are simulated in MATLAB. The error behavior and accuracy tests are done in this section as well.

3.1. Outputs simulation

In this section, the original CORDIC architecture, shown in Fig. 11, is simulated in MATLAB. The outputs of the sine and cosine functions are shown in Fig. 5. To improve the accuracy, 18 stages are used. Note that all architectures give the same simulation result.

Fig. 5. MATLAB simulation of CORDIC design (outputs versus input angle in degrees; curves: approximated cosine outputs, approximated sine outputs, theoretical sine outputs, theoretical cosine outputs)

Fig. 5 shows the approximated cosine outputs, theoretical cosine outputs, approximated sine outputs, and theoretical sine outputs for various input angles. The green line matches the black line, and the blue line matches the red line. That is, the simulated CORDIC functions approximate the theoretical (floating point) functions, which is acceptable.

3.2. Error behavior

The errors are tested with all 32768 possible input angles, where $32768 = 2^{15}$, which means 15-bit inputs in hardware. Fig. 6 shows the error of the CORDIC cosine function after truncation to 15 bits.

Fig. 6. The cosine error after truncation (error, in units of $10^{-5}$, versus input angle in degrees)

Fig. 6 indicates that as the input angle increases, the error increases from $-2.4 \times 10^{-5}$ to $0.4 \times 10^{-5}$. A histogram of the error function gives the distribution of the cosine error, shown in Fig. 7.

Fig. 7. The cosine error distribution (number of errors versus error, in units of $10^{-5}$)

Fig. 7 indicates that the peak of the cosine error distribution is located at $-2.4 \times 10^{-5}$. Most of the errors are located away from zero, which is a drawback. The absolute cosine error in dB is shown in Fig. 8.

Fig. 8. The absolute cosine error in dB (versus input angle in degrees)

Since binary numbers and decibels (dB) combine very well, displaying the errors in dB makes them easier to understand. The errors in dB are given by (15), where $x$ is the error.

$$
x_{dB} = 20\log_{10}|x| \tag{15}
$$

3.3. Accuracy

In binary, one bit can represent 2 values, 0 and 1. That is, $20\log_{10}(2) \approx 6$ dB corresponds to 1 binary bit of resolution. The logarithmic error divided by $-20\log_{10}(2)$ gives the number of bits, as shown in (16).

$$
n = \frac{-x_{dB}}{20\log_{10}(2)} \tag{16}
$$

Fig. 9 shows the number of error bits for all possible input angles, where 15.33 is the peak, which gives an accuracy of 15.33 bits.
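The conversion in (15) and (16) can be sketched in a few lines of Python. The function names are hypothetical and the code only restates the two formulas; a zero error is not handled, since the logarithm is undefined there.

```python
import math


def error_db(err):
    """Error in dB as in (15): 20 * log10(|err|). Undefined for err == 0."""
    return 20.0 * math.log10(abs(err))


def error_bits(err):
    """Number of bits as in (16): -x_dB / (20 * log10(2))."""
    return -error_db(err) / (20.0 * math.log10(2.0))


if __name__ == "__main__":
    # An error of exactly 2**-15 corresponds to 15 bits.
    print(error_db(2.0 ** -15), error_bits(2.0 ** -15))
    # An error magnitude near the one read from Fig. 6.
    print(error_bits(-2.4e-5))
```

Because (16) divides by the dB value of one bit, the result is simply $-\log_2|x|$, so an error magnitude of about $2.4 \times 10^{-5}$ lands slightly above 15 bits, in line with the accuracy reported for Fig. 9.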
Fig. 9. Degrees and the number of error bits (number of bits versus input angle in degrees, for sine and cosine)

4. Hardware implementation

This section presents the design flow for the thesis and the hardware implementation of the five different unrolled CORDIC architectures. IO65LPHVT_SF_1V8_50A_7M4X0Y2Z_nom_1.00V_1.80V_25C.db and CORE65LPHVT_nom_1.20V_25C.db are the libraries used in Design Compiler and PrimeTime. In other words, a low power high VT (LPHVT) technology is used at a supply voltage of 1.2V.

4.1. Design flow

Fig. 10. The design flow (design requirements; MATLAB simulation in MATLAB; RTL simulation in ModelSim; verification in MATLAB; synthesis in Design Compiler; post-synthesis verification in ModelSim; power analysis with the PrimeTime tool)

Fig. 10 shows the design flow for the thesis. The hardware implementation of the CORDIC design starts with the MATLAB simulation. Once the MATLAB model satisfies the requirements of the design, the design is coded in VHDL. The VHDL code is simulated in ModelSim and the result is compared to the MATLAB model. A netlist and a Value Change Dump (VCD) file are generated by Design Compiler and ModelSim, respectively, and are used for power analysis with the PrimeTime tool.

4.2. Hardware architectures and simulation results

In this section, five different CORDIC architectures are implemented in hardware. One is the original CORDIC architecture. The other four architectures reduce the complexity to different levels.

4.2.1. The original unrolled CORDIC architecture

The original CORDIC architecture, which corresponds to the rotations in Fig. 2, is shown in Fig. 11. Only adders and inverters are used in the architecture. Multiplications such as $1/2$, $1/4$, and $1/8$ are achieved by hardware wired shifts on the ASIC. Note that the figure only shows the first 5 stages of the 18-stage CORDIC that is used for the simulations.
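To make the "adders and wired shifts only" structure concrete, the following Python sketch runs the unrolled stages on integers, using additions, subtractions, and right shifts. It is a software illustration under assumptions, not the VHDL design: the 20-bit fraction width `FRAC_BITS`, the function name, and the start length computed as the product of the cosines of all coefficient angles used are choices made here and need not coincide with the word lengths or constants of the thesis design.

```python
import math

FRAC_BITS = 20  # assumed fixed-point fraction width (hypothetical)


def cordic_fixed(angle_deg, n_stages=18):
    """Shift-and-add CORDIC on integers; returns (sin, cos) as floats."""
    scale = 1 << FRAC_BITS
    # Coefficient angles in degrees, quantised to the fixed-point grid.
    angles = [round(math.degrees(math.atan(2.0 ** -i)) * scale) for i in range(n_stages)]
    # Start length chosen so that the last vector ends up close to the unit circle.
    gain = 1.0
    for i in range(n_stages):
        gain *= math.cos(math.atan(2.0 ** -i))
    x, y = round(gain * scale), 0
    z = round(angle_deg * scale)        # upper row: residual angle
    for i in range(n_stages):
        if z >= 0:                      # sgn = 0: rotate counter-clockwise
            x, y, z = x - (y >> i), y + (x >> i), z - angles[i]
        else:                           # sgn = 1: rotate clockwise
            x, y, z = x + (y >> i), y - (x >> i), z + angles[i]
    return y / scale, x / scale


if __name__ == "__main__":
    print(cordic_fixed(30.0))
```

The three updates per stage mirror the three adder rows: the residual angle plays the role of the upper row that produces the sign bits, while the `y` and `x` updates correspond to the middle and lower rows.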
Fig. 11. An unrolled 5-stage CORDIC architecture

In Fig. 11, $x(1) = 0.6088$ and $y(1) = 0$ are the first vector coordinates and $a$ is the input angle. At the top, 45 degrees, 26.5651 degrees, 14.0362 degrees, and 7.1250 degrees are the coefficient angles obtained from (1). There are 19 coefficient angles in the design. A total of 19 sign bits (sgn) result from the comparisons between the input angle and the coefficient angles, which are made in the upper adder row. The middle and the lower adder rows compute the approximations from left to right, where the output is generated. The sign bits determine whether the middle and the lower rows add or subtract.

The power consumption at 10MHz and at the maximum frequency, which is 76.7MHz, is shown in TABLE I and TABLE II, where the supply voltage is 1.2V.

TABLE I. POWER AT 10MHZ

| Power consumption | Frequency | Unit | LPHVT |
| --- | --- | --- | --- |
| Net switching power | 10MHz | nW | 129100 |
| Cell internal power | 10MHz | nW | 110600 |
| Cell leakage power | 10MHz | nW | 131.5 |
| Total power | 10MHz | nW | 239900 |

TABLE II. POWER AT MAXIMUM FREQUENCY

| Power consumption | Frequency | Unit | LPHVT |
| --- | --- | --- | --- |
| Net switching power | 76.7MHz | nW | 1714000 |
| Cell internal power | 76.7MHz | nW | 1137000 |
| Cell leakage power | 76.7MHz | nW | 291.8 |
| Total power | 76.7MHz | nW | 2852000 |

The results of synthesis are shown in TABLE III, TABLE IV, TABLE V, and TABLE VI, where the highest speed, minimum area, and maximum area are obtained. The timing at minimum area constraints is tested at a frequency of 10MHz and is shown in TABLE IV.

TABLE III. TIMING AT MAXIMUM SPEED CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Speed | MHz | 76.7 |
| Time | ns | 13.03 |

TABLE IV. TIMING AT MINIMUM AREA CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Speed | MHz | 31 |
| Time | ns | 32.22 |

TABLE V. AREA AT MINIMUM AREA CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Area | µm² | 35482 |

TABLE VI. AREA AT MAXIMUM SPEED CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Area | µm² | 100989 |

4.2.2. First stage removed architecture

A CORDIC architecture where the first stage is removed is shown in Fig. 12. This architecture has 2 adders fewer than the original one, i.e. 2 times 19 adder cells fewer, out of a total of 19 times 19 adder cells in the original design. The first vector coordinates are $x(2) = 0.6088$ and $y(2) = 0.6088$.

Fig. 12. First stage removed architecture

The power consumption at 10MHz and at the maximum frequency, which is 76.7MHz, is shown in TABLE VII and TABLE VIII, where the supply voltage is 1.2V.

TABLE VII. POWER AT 10MHZ

| Power consumption | Frequency | Unit | LPHVT |
| --- | --- | --- | --- |
| Net switching power | 10MHz | nW | 121700 |
| Cell internal power | 10MHz | nW | 104500 |
| Cell leakage power | 10MHz | nW | 131.5 |
| Total power | 10MHz | nW | 226300 |

TABLE VIII. POWER AT MAXIMUM FREQUENCY

| Power consumption | Frequency | Unit | LPHVT |
| --- | --- | --- | --- |
| Net switching power | 76.7MHz | nW | 1692000 |
| Cell internal power | 76.7MHz | nW | 1109000 |
| Cell leakage power | 76.7MHz | nW | 296.2 |
| Total power | 76.7MHz | nW | 2802000 |

The results of synthesis are shown in TABLE IX, TABLE X, TABLE XI, and TABLE XII, where the highest speed, minimum area, and maximum area are obtained.

TABLE IX. TIMING AT MAXIMUM SPEED CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Speed | MHz | 76.7 |
| Time | ns | 13.03 |

TABLE X. TIMING AT MINIMUM AREA CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Speed | MHz | 31 |
| Time | ns | 32.21 |

TABLE XI. AREA AT MINIMUM AREA CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Area | µm² | 35394 |

TABLE XII. AREA AT MAXIMUM SPEED CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Area | µm² | 100484 |

4.2.3. Two stages eliminated architecture

A CORDIC architecture where two stages are removed is shown in Fig. 13, which corresponds to the rotations in Fig. 3. This architecture has four adders fewer than the original one. Two of these adders are replaced by two MUXes, which are controlled by the first sgn bit in the upper adder row. The two hardware wired shifts (1/2) are also removed compared to the original architecture. As described in section 2.2, the first vector coordinates $(x(3), y(3))$ are shown in (8) and (9).

Fig. 13. Two stages are removed in the architecture

The power consumption at 10MHz and at the maximum frequency, which is 83.1MHz, is shown in TABLE XIII and TABLE XIV, where the supply voltage is 1.2V.

TABLE XIII. POWER AT 10MHZ

| Power consumption | Frequency | Unit | LPHVT |
| --- | --- | --- | --- |
| Net switching power | 10MHz | nW | 115700 |
| Cell internal power | 10MHz | nW | 96490 |
| Cell leakage power | 10MHz | nW | 126.2 |
| Total power | 10MHz | nW | 212400 |

TABLE XIV. POWER AT MAXIMUM FREQUENCY

| Power consumption | Frequency | Unit | LPHVT |
| --- | --- | --- | --- |
| Net switching power | 83.1MHz | nW | 1615000 |
| Cell internal power | 83.1MHz | nW | 1071000 |
| Cell leakage power | 83.1MHz | nW | 319.2 |
| Total power | 83.1MHz | nW | 2687000 |

The results of the synthesis are shown in TABLE XV, TABLE XVI, TABLE XVII, and TABLE XVIII, where the highest speed and minimum area are obtained.

TABLE XV. TIMING AT MAXIMUM SPEED CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Speed | MHz | 83.1 |
| Time | ns | 12.03 |

TABLE XVI. TIMING AT MINIMUM AREA CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Speed | MHz | 31.5 |
| Time | ns | 31.78 |

TABLE XVII. AREA AT MINIMUM AREA CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Area | µm² | 33660 |

TABLE XVIII. AREA AT MAXIMUM SPEED CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Area | µm² | 109840 |

4.2.4. Three stages eliminated architecture

A CORDIC architecture where three stages are removed is shown in Fig. 14, which corresponds to the rotations in Fig. 4.
This architecture has 6 adders fewer than the original one. Four of these adders are replaced by six MUXes, which are controlled by the first two sgn bits in the upper adder row. As described in section 2.3, the first vector coordinates $(x(4), y(4))$ are shown in (11), (12), (13), and (14).

Fig. 14. Three stages are removed architecture

The power consumption at 10MHz and at the maximum frequency, which is 90.7MHz, is shown in TABLE XIX and TABLE XX, where the supply voltage is 1.2V.

TABLE XIX. POWER AT 10MHZ

| Power consumption | Frequency | Unit | LPHVT |
| --- | --- | --- | --- |
| Net switching power | 10MHz | nW | 97410 |
| Cell internal power | 10MHz | nW | 88820 |
| Cell leakage power | 10MHz | nW | 123.5 |
| Total power | 10MHz | nW | 186400 |

TABLE XX. POWER AT MAXIMUM FREQUENCY

| Power consumption | Frequency | Unit | LPHVT |
| --- | --- | --- | --- |
| Net switching power | 90.7MHz | nW | 1396000 |
| Cell internal power | 90.7MHz | nW | 922900 |
| Cell leakage power | 90.7MHz | nW | 308.8 |
| Total power | 90.7MHz | nW | 2319000 |

The results of synthesis are shown in TABLE XXI, TABLE XXII, TABLE XXIII, and TABLE XXIV, where the highest speed, minimum area, and maximum area are obtained.

TABLE XXI. TIMING AT MAXIMUM SPEED CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Speed | MHz | 90.7 |
| Time | ns | 11.03 |

TABLE XXII. TIMING AT MINIMUM AREA CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Speed | MHz | 32.3 |
| Time | ns | 30.93 |

TABLE XXIII. AREA AT MINIMUM AREA CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Area | µm² | 33407 |

TABLE XXIV. AREA AT MAXIMUM SPEED CONSTRAINTS

| Parameter | Unit | LPHVT |
| --- | --- | --- |
| Voltage | V | 1.2 |
| Area | µm² | 119512 |

4.2.5. Final architecture

A CORDIC architecture where three stages and one upper adder are removed is shown in Fig. 15. Compared to the architecture in Fig. 14, the first adder for the coefficient angles is replaced by a MUX and a Sgn detector. This means that seven MUXes and a Sgn detector are introduced into the architecture.
The first vector coordinates $(x(4), y(4))$ are again given by (11), (12), (13), and (14).

Fig. 15. Final architecture

The Sgn detector detects whether the input angle is larger than 45 degrees. If the input is larger than 45 degrees, the Sgn detector's output is 1; otherwise its output is 0. The Sgn detector uses the upper 6 bits of the input angle and consists of inverters and NAND gates in hardware. The architecture of the Sgn detector is shown in Fig. 16.

Fig. 16. The Sgn detector (inputs αbit21, αbit20, αbit19, αbit18, αbit17, αbit16)

The logic function of the Sgn detector is shown in (17).

$Sgn = a_{bit21} + a_{bit20} a_{bit19} + a_{bit20} a_{bit18} a_{bit17} a_{bit16}$ (17)

The power consumption at both 10 MHz and the maximum frequency, which is 99.6 MHz, is shown in TABLE XXV and TABLE XXVI, where the supply voltage is 1.2 V.

TABLE XXV. POWER AT 10MHZ

Power consumption      Frequency   Unit   LPHVT
Net switching power    10MHz       nW     91220
Cell internal power    10MHz       nW     83460
Cell leakage power     10MHz       nW     123.5
Total power            10MHz       nW     174800

TABLE XXVI. POWER AT MAXIMUM FREQUENCY

Power consumption      Frequency   Unit   LPHVT
Net switching power    99.6MHz     nW     1211000
Cell internal power    99.6MHz     nW     932500
Cell leakage power     99.6MHz     nW     316.1
Total power            99.6MHz     nW     2044000

The results of the synthesis are shown in TABLE XXVII, TABLE XXVIII, TABLE XXIX, and TABLE XXX, where the highest speed, minimum area, and maximum area are obtained.

TABLE XXVII. TIMING AT MAXIMUM SPEED CONSTRAINTS

           Unit   LPHVT
Voltage    V      1.2
Speed      MHz    99.6
Time       ns     10.04

TABLE XXVIII. TIMING AT MINIMUM AREA CONSTRAINTS

           Unit   LPHVT
Voltage    V      1.2
Speed      MHz    32.3
Time       ns     30.95

TABLE XXIX. AREA AT MINIMUM AREA CONSTRAINTS

           Unit   LPHVT
Voltage    V      1.2
Area       um2    33394

TABLE XXX. AREA AT MAXIMUM SPEED CONSTRAINTS

           Unit   LPHVT
Voltage    V      1.2
Area       um2    135193

5. Results

In this section, a test setup for the designs is introduced. The area, timing information, and power consumption are also analyzed.

5.1. Test setup

Fig. 17 shows the test setup of the designs. The input to the testbench is written to a text file by the software simulation. The output from the top design is also verified against the software.

Fig. 17. The design test setup

5.2. The area

The minimum area is estimated with the set_max_area 0 command in Design Compiler. The areas, using the minimum area constraint, for the five CORDIC architectures are shown in Fig. 18.

Fig. 18. Area at different frequencies (area in um2 versus frequency in MHz for the five architectures)

The minimum areas are shown in TABLE XXXI. The final architecture has the lowest area, 33200 um2.

TABLE XXXI. AREA AT MINIMUM AREA CONSTRAINTS

Architectures                           Minimum area (um2)
The original architecture               36067
First stage eliminated architecture     35966
Two stages eliminated architecture      33999
Three stages eliminated architecture    33389
Final architecture                      33200

TABLE XXXI indicates that with reduced complexity, the minimum areas of the designs are also reduced. The area reduction from the original to the final architecture is 7.9%. The areas at maximum speed constraints for the five architectures are shown in TABLE XXXII.

TABLE XXXII. AREA AT MAXIMUM SPEED CONSTRAINTS

Architectures                           Area (um2)
The original architecture               100989
First stage eliminated architecture     100484
Two stages eliminated architecture      109840
Three stages eliminated architecture    119512
Final architecture                      135193

5.3. Timing information

The maximum speed is estimated with no area constraint set. The highest clock frequency that can be specified while the slack is still zero corresponds to the maximum speed. The timing report gives the critical path at this frequency, as shown in TABLE XXXIII.

TABLE XXXIII. TIMING AT MAXIMUM SPEED CONSTRAINTS

Architectures                           Critical path (ns)   Frequency (MHz)
The original architecture               13.03                76.7
First stage eliminated architecture     13.03                76.7
Two stages eliminated architecture      12.03                83.1
Three stages eliminated architecture    11.03                90.7
Final architecture                      10.04                99.6

TABLE XXXIII indicates that as the complexity is reduced, the maximum frequencies of the designs increase. The final architecture, which is the most optimized of the five, has the highest speed, with a critical path of 10.04 ns and a frequency of $9.96 \times 10^{7}$ Hz. The speed is 22.9% higher, measured as the reduction of the critical path from 13.03 ns to 10.04 ns. The timing at minimum area constraints is shown in TABLE XXXIV.

TABLE XXXIV. TIMING AT MINIMUM AREA CONSTRAINTS

Architectures                           Time (ns)   Speed (MHz)
The original architecture               32.22       31
First stage eliminated architecture     32.21       31
Two stages eliminated architecture      31.78       31.5
Three stages eliminated architecture    30.93       32.3
Final architecture                      30.95       32.3

5.4. Power consumption

5.4.1. Power analysis

The power of the CMOS transistors consists of dynamic power and static power, as shown in (18) [6].

$P_{tot} = P_{dyn} + P_{stat}$ (18)

The dynamic power consists of the switching power and the internal power, as shown in (19).

$P_{dyn} = P_{switch} + P_{int}$ (19)

The dynamic power can be written as in (20).

$P_{dyn} = a C V^{2} f$ (20)

Here the factor $a$ is the switching activity, $C$ is the node capacitance, $f$ is the clock frequency, and $V$ is the supply voltage.
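As a rough illustration of (20), the sketch below evaluates the dynamic power for given parameters and, in the other direction, backs out an effective switched capacitance $aC$ from a reported dynamic power. The function names are hypothetical, and lumping the whole design into a single $aC$ value is a simplifying assumption rather than part of the thesis flow. The input numbers are the switching and internal power of the final architecture at 10 MHz and 1.2 V, as reported in TABLE XXXVI.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Dynamic power in watts from equation (20): P_dyn = a * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


def effective_switched_capacitance(p_dyn, voltage, frequency):
    """Lumped a*C in farads, obtained by inverting equation (20)."""
    return p_dyn / (voltage ** 2 * frequency)


# Final architecture at 10 MHz and 1.2 V (TABLE XXXVI):
# dynamic power = switching power + internal power, converted from mW to W.
p_dyn_10mhz = 0.09122e-3 + 0.08346e-3
a_times_c = effective_switched_capacitance(p_dyn_10mhz, 1.2, 10e6)

# Assumption: a*C is unchanged when the clock changes, so the dynamic
# power scales linearly with frequency. The activity is folded into a_times_c.
p_dyn_20mhz = dynamic_power(1.0, a_times_c, 1.2, 20e6)

print(f"effective a*C : {a_times_c * 1e12:.2f} pF")
print(f"P_dyn @ 20 MHz: {p_dyn_20mhz * 1e3:.4f} mW")
```

Under these assumptions the lumped $aC$ comes out at roughly 12 pF. The linear scaling with $f$ only holds for a fixed netlist. The maximum-speed results are synthesized under different constraints, so their power does not follow from the 10 MHz figures by scaling alone.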
The static power comes mainly from the subthreshold leakage current.

TABLE XXXV and TABLE XXXVI show the power consumption at the maximum speed constraints and at the minimum area constraints, respectively. The power at minimum area constraints is measured at a frequency of 10 MHz.

TABLE XXXV. POWER AT MAXIMUM SPEED CONSTRAINTS

Architectures                           Frequency (MHz)   Switching power (mW)   Internal power (mW)   Leakage (nW)   Total power (mW)
The original architecture               76.7              1.714                  1.137                 291.8          2.852
First stage eliminated architecture     76.7              1.692                  1.109                 296.2          2.802
Two stages eliminated architecture      83.1              1.615                  1.071                 319.2          2.687
Three stages eliminated architecture    90.7              1.396                  0.9229                308.8          2.319
Final architecture                      99.6              1.211                  0.8325                316.1          2.044

TABLE XXXVI. POWER AT MINIMUM AREA CONSTRAINTS

Architectures                           Frequency (MHz)   Switching power (mW)   Internal power (mW)   Leakage (nW)   Total power (mW)
The original architecture               10                0.1291                 0.1106                131.5          0.2399
First stage eliminated architecture     10                0.1217                 0.1045                131.5          0.2263
Two stages eliminated architecture      10                0.1157                 0.0964                126.2          0.2124
Three stages eliminated architecture    10                0.09741                0.0882                123.5          0.1864
Final architecture                      10                0.09122                0.08346               123.5          0.1748

The power consumption for the five different CORDIC architectures at various frequencies is shown in Fig. 19.

Fig. 19. Power at various frequencies (power in nW versus frequency in Hz, logarithmic axes, for the five architectures)

Fig. 20. Magnified diagram for Fig. 19

Fig. 19 shows the power consumption at different frequencies for the five architectures described in section 4.
The five curves lie close together, but Fig. 20, the magnified diagram for Fig. 19, indicates that the final architecture consumes the least dynamic power. It also shows that the power consumption decreases as adder stages are eliminated. There is a knee in the figure at low frequencies. This is because at low frequencies the static power is much larger than the dynamic power, and it changes only a little with the frequency.

As a result, the area is 7.9% lower in the final architecture. A substantial improvement can be seen for the power consumption, which is 27.2% lower in the final architecture compared to the original one, and for the speed, which is 22.9% higher. The area is much higher at maximum speed, because many more logic gates are used to achieve the higher speed.

6. Conclusions

In this thesis, MUXes are used in hardware to reduce the complexity. Five different CORDIC architectures are implemented by eliminating stages. The area, computational speed, accuracy, error behavior, and power consumption have been analyzed through software simulation and hardware implementation. The speed, minimum area, and power consumption have been improved to different degrees. As a result, the area and the power consumption are 7.9% and 27.2% lower, respectively, and the speed is 22.9% higher compared to the original unrolled CORDIC architecture. The results also show that in unrolled CORDIC architectures the power consumption can be reduced by reducing the complexity, which meets the aim of this thesis.

7. Future work

In this thesis, at most three stages are eliminated. More stages can be eliminated to reduce the power consumption further. More than one Sgn detector can be introduced to remove more adders in the coefficient-angle row. The number of iterations could also be decreased to some extent, which would reduce the power consumption as well.
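To give a feel for what eliminating further stages would cost, the sketch below enumerates the start vectors that would have to be precomputed and selected by MUXes when the first k stages are removed. It is only an illustration under stated assumptions: the function name `start_vectors` is hypothetical, the first rotation is assumed to be always +45 degrees, the vector starts at (1, 0), and the CORDIC gain is left uncompensated. None of this is taken from the thesis RTL.

```python
import math
from itertools import product


def start_vectors(eliminated_stages):
    """Enumerate candidate start vectors when the first stages are precomputed.

    Returns a dict mapping the sign choices of stages 2..k to (x, y, angle_deg).
    Assumptions: first rotation fixed at +45 degrees, unscaled vectors.
    """
    vectors = {}
    # One sign per eliminated stage after the first; each sign acts as a MUX select.
    for signs in product((+1, -1), repeat=eliminated_stages - 1):
        x, y = 1.0, 1.0  # (1, 0) after the fixed +45 degree rotation, unscaled
        for i, d in enumerate(signs, start=1):
            # Standard CORDIC micro-rotation by d * arctan(2^-i)
            x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        vectors[signs] = (x, y, math.degrees(math.atan2(y, x)))
    return vectors


for k in range(1, 6):
    print(k, "stages eliminated ->", len(start_vectors(k)), "start vectors")
```

In this model the number of start vectors doubles with every additional eliminated stage. That matches the two and four starting angles of sections 2.2 and 2.3, and suggests that the MUX cost would eventually outweigh the saved adders.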
References

[1] Erik Hertz and Peter Nilsson, "Parabolic Synthesis Methodology Implemented on the Sine Function", in Proceedings of the 2009 International Symposium on Circuits and Systems (ISCAS'09), Taipei, Taiwan, May 24-27, 2009.
[2] Gordon K. Smyth, "Polynomial Approximation", in Encyclopedia of Biostatistics, Peter Armitage and Theodore Colton (eds.), John Wiley & Sons, Ltd, Chichester, ISBN 0471 975761, 1998.
[3] J. E. Volder, "The CORDIC Trigonometric Computing Technique", IRE Transactions on Electronic Computers, vol. EC-8, no. 3, pp. 330-334, 1959.
[4] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing", IEEE Signal Processing Magazine, pp. 16-35, ISSN: 1053-5888, July 1992.
[5] Peter Nilsson, "Complexity Reductions in Unrolled CORDIC Architectures", in Proceedings of the IEEE 14th International Conference on Electronics, Circuits and Systems (ICECS 2009), pp. 868-871, Hammamet, Tunisia, December 13-16, 2009.
[6] http://en.wikipedia.org/wiki/CMOS#Power:_switching_and_leakage

Appendix A

TABLE XXXVII. COEFFICIENT ANGLES

Angle   Decimal angle        Binary angle
α1      45                   00101101000000000000000
α2      26.565032958984375   00011010100100001010011
α3      14.036224365234375   00001110000010010100011
α4      7.125000000000000    00000111001000000000000
α5      3.576324462890625    00000011100100111000101
α6      1.789886474609375    00000001110010100011011
α7      0.895172119140625    00000000111001010010101
α8      0.447601318359375    00000000011100101001011
α9      0.223785400390625    00000000001110010100101
α10     0.111877441406250    00000000000111001010010
α11     0.055938720703125    00000000000011100101001
α12     0.027954101562500    00000000000001110010100
α13     0.013977050781250    00000000000000111001010
α14     0.006988525390625    00000000000000011100101
α15     0.003479003906250    00000000000000001110010
α16     0.001739501953125    00000000000000000111001
α17     0.000854492187500    00000000000000000011100
α18     0.000427246093750    00000000000000000001110
α19     0.000213623046875    00000000000000000000111

Series of Master's theses
Department of Electrical and Information Technology
LU/LTH-EIT 2015-460
http://www.eit.lth.se
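The coefficient angles in TABLE XXXVII of Appendix A appear to be the values arctan(2^-(i-1)) in degrees, truncated to a 23-bit word with 15 fractional bits. This is inferred from the table entries rather than stated in the thesis, so the sketch below is a hedged reconstruction. The constants `FRACTION_BITS` and `WORD_BITS` and the function `coefficient_angle` are hypothetical names.

```python
import math

FRACTION_BITS = 15  # assumed: 23-bit word = 8 integer bits + 15 fraction bits
WORD_BITS = 23


def coefficient_angle(index):
    """Return (decimal_degrees, binary_string) for alpha_index, with index >= 1."""
    exact = math.degrees(math.atan(2.0 ** -(index - 1)))
    # Truncate rather than round. The tiny epsilon only protects alpha_1 = 45
    # from being pushed below the integer boundary by floating-point error.
    raw = math.floor(exact * 2 ** FRACTION_BITS + 1e-9)
    return raw / 2 ** FRACTION_BITS, format(raw, f"0{WORD_BITS}b")


for i in range(1, 20):
    angle, bits = coefficient_angle(i)
    print(f"alpha{i:<2d} {angle:.15f} {bits}")
```

Truncation instead of rounding is what makes α2 end in ...010011. With rounding, its last bits would differ by one LSB.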
