Linköping Studies in Science and Technology. Dissertations, No. 1385

Low Complexity Techniques for Low Density Parity Check Code Decoders and Parallel Sigma-Delta ADC Structures

Anton Blad

Department of Electrical Engineering
Linköping University
SE–581 83 Linköping, Sweden

Linköping 2011

Copyright © 2011 Anton Blad
ISBN 978-91-7393-104-5
ISSN 0345-7524
e-mail: [email protected]
thesis url: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-69432
Printed by LiU-Tryck, Linköping, Sweden 2011

Abstract

In this thesis, contributions are made in the area of receivers for wireless communication standards. The thesis consists of two parts, focusing on implementations of forward error correction using low-density parity-check (LDPC) codes, and high-bandwidth analog-to-digital converters (ADCs) using sigma-delta modulators.

LDPC codes have received widespread attention since 1995 as practical capacity-approaching code candidates. It has been shown that the class of codes can perform arbitrarily close to the channel capacity, and LDPC codes are also used or suggested for a number of current and future communication standards, including the 802.16e WiMAX standard, the 802.11n WLAN standard, and the second generation of digital TV standards, DVB-x2. The first part of the thesis contains two main contributions to the problem of decoding LDPC codes, denoted the early-decision decoding algorithm and the check-merging decoding algorithm. The early-decision decoding algorithm is a method of terminating parts of the decoding process early for bits that have high reliabilities, thereby reducing the computational complexity of the decoder.
The check-merging decoding algorithm is a method of reducing the code complexity of rate-compatible LDPC codes and increasing the efficiency of the decoding algorithm, thereby offering a significant throughput increase. For the two algorithms, architectures are proposed and synthesized for FPGAs, and the resulting performance and logic utilization are compared with those of the original algorithms.

Sigma-delta ADCs are the natural choice for low-to-medium-bandwidth applications that require high resolution. However, suggestions have also been made to use them for high-bandwidth communication standards, which require either high sampling rates or several ADCs operating in parallel. In this thesis, two contributions are made in the area of high-bandwidth ADCs using sigma-delta modulators. The first is a general formulation of parallel ADCs using modulation of the input data. The formulation allows a system's sensitivity to analog mismatch errors in the channels to be analyzed, and it is shown that some systems can be made insensitive to certain matching errors, whereas others may require matching of limited subsets of the channels, or full matching of all channels. Limited sensitivity to mismatch errors reduces the complexity of the analog parts. Simulation results are provided for a time-interleaved ADC, a Hadamard-modulated ADC, and a frequency-band decomposed ADC, as well as for a new modulation scheme that is insensitive to channel gain mismatches.

The second contribution relates to the implementation of high-speed digital filters, where a typical application is decimation filters for a high-bandwidth sigma-delta ADC. A bit-level optimization algorithm is proposed that minimizes a cost function defined as a weighted sum of the numbers of full adders, half adders and registers. Simulation results show comparisons between bit-level optimized filters and structures obtained using common heuristics for carry-save adder trees.
Popular science summary (Populärvetenskaplig sammanfattning)

Most wireless communication systems of today are based on digital transmission of data. This thesis consists of two parts, which contribute to two different parts of modern receivers for digital communication systems.

The first part treats energy-efficient decoding of error-correcting codes. Error-correcting coding is one of the main advantages of digital communication over analog communication, and is based on adding redundancy to the information before it is sent. Errors inevitably occur during the transmission, but with the help of the redundancy it is possible to determine where the errors occurred, and the reliability of the transmission can thereby be increased. Error-correcting codes with a high capability of detecting and correcting errors are easy to construct, but building practical implementations of a decoder is harder, and the decoder often accounts for a large part of the power consumption of receiver ICs. This thesis treats a specific type of error-correcting code (LDPC codes) that comes very close to the theoretical limit of error-correcting capability, and two improvements of the decoding algorithms are proposed. In both cases the proposed algorithms are more complex, but reduce the power consumption of an implementation.

The second part of the thesis treats a specific type of analog-to-digital converter (ADC), which converts the received signal into digital information. Sigma-delta is a type of ADC that is particularly well suited for integration with digital systems on a common IC. Its drawback today, however, is that the conversion rate is relatively low. One way to increase the rate is to use several converters in parallel, each handling its own part of the input signal. The drawback is that such systems often become sensitive to variations between the individual converters, and this thesis proposes a method of modeling parallel sigma-delta ADCs in order to analyze the sensitivity requirements. It turns out that some systems are insensitive to such variations, whereas others may require matching of limited subsets of the converters. Another problem is that the output of a sigma-delta converter is a data stream with a very high data rate and a large amount of quantization noise. Before the data stream can be used in an application, it must first be decimated. The thesis also contains a method of formulating the design of such decimation filters as an optimization problem, in order to obtain filters with low complexity.

Acknowledgments

It is with mixed feelings that I look back at the 5+ years as a PhD student at Electronics Systems. This thesis is the result of many hours of work in a competent and motivating environment, an environment promoting independence and individualism while still offering abundant support and opportunities for discussion. Being a PhD student has been hard at times, but looking back I cannot imagine conditions better suited for a researcher at the start of his career. There are many people who have given me inspiration and support through these years, and I want to take the opportunity here to thank the following people:

• My supervisor Dr. Oscar Gustafsson for being a huge source of motivation by always taking his time to discuss my work and endlessly offering new insights and ideas.
• My very good friends Dr. Fredrik Kuivinen and M.Sc. Jakob Rosén for all the fun with electronics projects and retro gaming sessions in the evenings at campus.
• Prof. Christer Svensson for getting me in contact with the STMicroelectronics research lab in Geneva.
• Dr. Andras Pozsgay at the Advanced Radio Architectures group at STMicroelectronics in Geneva, for offering me the possibility of working with "practical" research for six months during spring 2007. It has been a very valuable experience.
• All the other people in my research group at STMicroelectronics in Geneva, for making my stay there a pleasant experience.
• Prof.
Fei Zesong at the Department of Electrical Engineering at Beijing Institute of Technology in China, for giving me the possibility of two three-month PhD student exchanges in spring 2009 and winter 2010/2011.
• M.Sc. Wang Hao and M.Sc. Shen Zhuzhe for helping me with all the practicalities for my visits in Beijing.
• M.Sc. Zhao Hongjie for the cooperation during his year as an exchange PhD student at Linköping University.
• All the others at the Modern Communication Lab at Beijing Institute of Technology for their kindness and support in an environment that was very different from what I am used to.
• Dr. Kent Palmkvist for help with FPGA- and VHDL-related issues.
• M.Sc. Sune Söderkvist for the big contribution to the generally happy and positive atmosphere at Electronics Systems.
• All the other present and former colleagues at Electronics Systems.
• All the colleagues at the Electronic Components research group during my time there from 2006 to 2008.
• All the colleagues at Communications Systems during my time there as a research engineer in spring 2010.
• All my friends who have made my life pleasant during my time as a PhD student.
• Last but not least, I thank my parents Maj and Bengt Blad for all the encouragement and time that they gave me as a child, which is definitely part of the reason that I have come this far in life. I also thank my sisters Lisa and Tove Blad, whom I don't see very often but still feel very close to when I do.
Anton Blad
Linköping, July 2011

Abbreviations

ADC: Analog-to-Digital Converter
AWGN: Additive White Gaussian Noise
BEC: Binary Erasure Channel
BER: Bit Error Rate
BLER: Block Error Rate
BPSK: Binary Phase Shift Keying
BSC: Binary Symmetric Channel
CFU: Check Function Unit
CIC: Cascaded Integrator Comb
CMOS: Complementary Metal Oxide Semiconductor
CNU: Check Node processing Unit
CSA: Carry-Save Adder
CSD: Canonic Signed-Digit
DAC: Digital-to-Analog Converter
DECT: Digital Enhanced Cordless Telecommunications
DFT: Discrete Fourier Transform
DTTB: Digital Terrestrial Television Broadcasting
DVB-S2: Digital Video Broadcasting - Satellite 2nd generation
Eb/N0: Bit energy to noise spectral density (normalized SNR)
ECC: Error Correction Coding
ED: Early Decision
FIR: Finite Impulse Response
FPGA: Field Programmable Gate Array
GPS: Global Positioning Satellite
ILP: Integer Linear Programming
k-SE: k-step enabled
k-SR: k-step recoverable
LAN: Local Area Network
LDPC: Low-Density Parity Check
LUT: Look-Up Table
MPR: McClellan-Parks-Rabiner
MSD: Minimum Signed-Digit
MUX: Multiplexer
OSR: Oversampling ratio
PSD: Power Spectral Density
QAM: Quadrature Amplitude Modulation
QC-LDPC: Quasi-Cyclic Low-Density Parity-Check
QPSK: Quadrature Phase Shift Keying
RAM: Random Access Memory
ROM: Read-Only Memory
SD: Signed-Digit
SNR: Signal-to-Noise Ratio
USB: Universal Serial Bus
VHDL: VHSIC (Very High Speed Integrated Circuit) Hardware Description Language
VLSI: Very Large Scale Integration
VMA: Vector Merge Adder
VNU: Variable Node processing Unit
WLAN: Wireless Local Area Network
WPAN: Wireless Personal Area Network

Preface

Thesis outline

In this thesis, contributions are made in two different areas related to the design of receivers for radio communications, and the contents are therefore separated into two parts.
Part I consists of Chapters 1–6 and offers contributions in the area of low-density parity-check (LDPC) code decoding, whereas Part II consists of Chapters 7–13 and offers contributions related to high-speed analog-to-digital conversion using Σ∆-ADCs.

The outline of Part I is as follows. In Chapter 1, a short background, possible applications and the scientific contributions are discussed. In Chapter 2, the basics of digital communications are described and LDPC codes are introduced. Also, two decoder architectures are described, which are used as reference implementations for the contributed work. In Chapter 3, early decision decoding is proposed as a method of reducing the computational complexity of the decoding algorithm. Performance issues related to the algorithm are analyzed, and solutions are suggested. Also, an implementation of the algorithm for FPGA is described, and the resulting estimations of area and power dissipation are included. In Chapter 4, an improved algorithm for decoding of rate-compatible LDPC codes is proposed. The algorithm offers a significant reduction of the average number of iterations required for decoding of punctured codes, thereby offering a significant increase in throughput. An architecture implementing the algorithm is proposed, and simulation and synthesis results are included. In Chapter 5, a minor contribution concerning the data representation of a sum-product LDPC decoder is explained. It is shown how redundancy in the data representation can be used to reduce the memory required for storage of messages between iterations. Finally, in Chapter 6, conclusions are given and future work is discussed.

The outline of Part II is as follows. In Chapter 7, an introduction to high-speed data conversion is given, and the scientific contributions of the second part of the thesis are described.
In Chapter 8, a short introduction to finite impulse response (FIR) filters, multirate theory and FIR filter architectures is given. In Chapter 9, the basics of ADCs using Σ∆-modulators are discussed, and some high-speed structures using parallel Σ∆-ADCs are shown. In Chapter 10, a general model for the analysis of matching requirements in parallel Σ∆-ADCs is proposed. It is shown that some parallel systems may become alias-free with limited matching between subsets of the channels, whereas others may require matching between all channels. In Chapter 11, a short analysis of the relations between oversampling factors, Σ∆-modulator orders, required signal-to-noise ratio (SNR) and decimation filter complexity is contributed. In Chapter 12, an integer linear programming approach to the design of high-speed decimation filters for Σ∆-ADCs is proposed. Several architectures are discussed and their complexities compared. Finally, in Chapter 13, conclusions are given and future work is discussed.

Publications

This thesis contains research done at Electronics Systems, Department of Electrical Engineering, Linköping University, Sweden. The work has been done between March 2005 and June 2011, and has resulted in the following publications [7–17]:

1. A. Blad, O. Gustafsson, and L. Wanhammar, "An LDPC decoding algorithm utilizing early decisions," in Proc. National Conf. Radio Science, Jun. 2005.
2. A. Blad, O. Gustafsson, and L. Wanhammar, "An early decision decoding algorithm for LDPC codes using dynamic thresholds," in Proc. European Conf. Circuit Theory Design, Aug. 2005, pp. 285–288.
3. A. Blad, O. Gustafsson, and L. Wanhammar, "A hybrid early decision-probability propagation decoding algorithm for low-density parity-check codes," in Proc. Asilomar Conf. Signals, Syst., Comp., Oct. 2005.
4. A. Blad, O. Gustafsson, and L. Wanhammar, "Implementation aspects of an early decision decoder for LDPC codes," in Proc. Nordic Event ASIC Design, Nov. 2005.
5. A. Blad and O.
Gustafsson, "Energy-efficient data representation in LDPC decoders," IET Electron. Lett., vol. 42, no. 18, pp. 1051–1052, Aug. 2006.
6. A. Blad, P. Löwenborg, and H. Johansson, "Design trade-offs for linear-phase FIR decimation filters and sigma-delta modulators," in Proc. XIV European Signal Process. Conf., Sep. 2006.
7. A. Blad, H. Johansson, and P. Löwenborg, "Multirate formulation for mismatch sensitivity analysis of analog-to-digital converters that utilize parallel sigma-delta modulators," Eurasip J. Advances Signal Process., vol. 2008, 2008, article ID 289184, 11 pages.
8. A. Blad and O. Gustafsson, "Integer linear programming-based bit-level optimization for high-speed FIR decimation filter architectures," Springer Circuits, Syst. Signal Process. - Special Issue on Low Power Digital Filter Design Techniques and Their Applications, vol. 29, no. 1, pp. 81–101, Feb. 2010.
9. A. Blad and O. Gustafsson, "Redundancy reduction for high-speed FIR filter architectures based on carry-save adder trees," in Proc. Int. Symp. Circuits, Syst., May 2010.
10. A. Blad, O. Gustafsson, M. Zheng, and Z. Fei, "Integer linear programming based optimization of puncturing sequences for quasi-cyclic low-density parity-check codes," in Proc. Int. Symp. Turbo-Codes, Related Topics, Sep. 2010.
11. A. Blad and O. Gustafsson, "FPGA implementation of rate-compatible QC-LDPC code decoder," in Proc. European Conf. Circuit Theory Design, Aug. 2011.

During the period, the following papers were also published, but they are either outside the scope of this thesis or overlap with the publications above:

1. A. Blad, C. Svensson, H. Johansson, and S. Andersson, "An RF sampling radio frontend based on sigma-delta conversion," in Proc. Nordic Event ASIC Design, Nov. 2006.
2. A. Blad, H. Johansson, and P. Löwenborg, "A general formulation of analog-to-digital converters using parallel sigma-delta modulators and modulation sequences," in Proc. Asia-Pacific Conf. Circuits Syst., Dec.
2006, pp. 438–441.
3. A. Blad and O. Gustafsson, "Bit-level optimized high-speed architectures for decimation filter applications," in Proc. Int. Symp. Circuits, Syst., May 2008.
4. M. Zheng, Z. Fei, X. Chen, J. Kuang, and A. Blad, "Power efficient partial repeated cooperation scheme with regular LDPC code," in Proc. Vehicular Tech. Conf., May 2010.
5. O. Gustafsson, K. Amiri, D. Andersson, A. Blad, C. Bonnet, J. R. Cavallaro, J. Declerckz, A. Dejonghe, P. Eliardsson, M. Glasse, A. Hayar, L. Hollevoet, C. Hunter, M. Joshi, F. Kaltenberger, R. Knopp, K. Le, Z. Miljanic, P. Murphy, F. Naessens, N. Nikaein, D. Nussbaum, R. Pacalet, P. Raghavan, A. Sabharwal, O. Sarode, P. Spasojevic, Y. Sun, H. M. Tullberg, T. Vander Aa, L. Van der Perre, M. Wetterwald and M. Wu, "Architecture for cognitive radio testbeds and demonstrators - An overview," in Proc. Int. Conf. Cognitive Radio Oriented Wireless Networks Comm., Jun. 2010.
6. A. Blad, O. Gustafsson, M. Zheng, and Z. Fei, "Rate-compatible LDPC code decoder using check-node merging," in Proc. Asilomar Conf. Signals, Syst., Comp., Nov. 2010.
7. M. Abbas, O. Gustafsson, and A. Blad, "Low-complexity parallel evaluation of powers exploiting bit-level redundancy," in Proc. Asilomar Conf. Signals, Syst., Comp., Nov. 2010.

Contents

Part I: Decoding of low-density parity-check codes

1 Introduction
   1.1 Background
   1.2 Applications
   1.3 Scientific contributions

2 Error correction coding
   2.1 Digital communications
      2.1.1 Channel models
      2.1.2 Modulation methods
      2.1.3 Uncoded communication
   2.2 Coding theory
      2.2.1 Shannon bound
      2.2.2 Block codes
   2.3 LDPC codes
      2.3.1 Tanner graphs
      2.3.2 Quasi-cyclic LDPC codes
      2.3.3 Randomized quasi-cyclic codes
   2.4 LDPC decoding algorithms
      2.4.1 Sum-product algorithm
      2.4.2 Min-sum approximation
   2.5 Rate-compatible LDPC codes
      2.5.1 SR-nodes
      2.5.2 Decoding of rate-compatible codes
   2.6 LDPC decoder architectures
      2.6.1 Parallel architecture
      2.6.2 Serial architecture
      2.6.3 Partly parallel architecture
      2.6.4 Finite wordlength considerations
      2.6.5 Scaling of Φ(x)
   2.7 Sum-product reference decoder architecture
      2.7.1 Architecture overview
      2.7.2 Memory block
      2.7.3 Variable node processing unit
      2.7.4 Check node processing unit
      2.7.5 Interconnection networks
      2.7.6 Memory address generation
      2.7.7 Φ function
   2.8 Check-serial min-sum decoder architecture
      2.8.1 Decoder schedule
      2.8.2 Architecture overview
      2.8.3 Check node function unit

3 Early-decision decoding
   3.1 Early-decision algorithm
      3.1.1 Choice of threshold
      3.1.2 Handling of decided bits
      3.1.3 Bound on error correction capability
      3.1.4 Enforcing check constraints
      3.1.5 Enforcing check approximations
   3.2 Hybrid decoding
   3.3 Early-decision decoder architecture
      3.3.1 Memory block
      3.3.2 Node processing units
      3.3.3 Early decision logic
      3.3.4 Enforcing check constraints
   3.4 Hybrid decoder
   3.5 Simulation results
      3.5.1 Choice of threshold
      3.5.2 Enforcing check constraints
      3.5.3 Hybrid decoding
      3.5.4 Fixed-point simulations
   3.6 Synthesis results

4 Rate-compatible LDPC codes
   4.1 Design of puncturing patterns
      4.1.1 Preliminaries
      4.1.2 Optimization problem
      4.1.3 Puncturing pattern design
   4.2 Check-merging decoding algorithm
      4.2.1 Defining HP
      4.2.2 Algorithmic properties of decoding with HP
      4.2.3 Choosing the puncturing sequence p
   4.3 Rate-compatible QC-LDPC code decoder
      4.3.1 Decoder schedule
      4.3.2 Architecture overview
      4.3.3 Cyclic shifters
      4.3.4 Check function unit
      4.3.5 Bit-sum update unit
      4.3.6 Memories
   4.4 Simulation results
      4.4.1 Design of puncturing sequences
      4.4.2 Check-merging decoding algorithm
   4.5 Synthesis results of check-merging decoder
      4.5.1 Maximum check node degrees
      4.5.2 Decoding throughput
      4.5.3 FPGA synthesis

5 Data representations
   5.1 Fixed wordlength
   5.2 Data compression
   5.3 Results

6 Conclusions and future work
   6.1 Conclusions
   6.2 Future work

Part II: High-speed analog-to-digital conversion

7 Introduction
   7.1 Background
   7.2 Applications
   7.3 Scientific contributions

8 FIR filters
   8.1 FIR filter basics
      8.1.1 FIR filter definition
      8.1.2 z-transform
      8.1.3 Linear phase filters
   8.2 FIR filter design
   8.3 Multirate signal processing
      8.3.1 Sampling rate conversion
      8.3.2 Polyphase decomposition
      8.3.3 Multirate sampling rate conversion
   8.4 FIR filter architectures
      8.4.1 Conventional FIR filter architectures
      8.4.2 High-speed FIR filter architecture

9 Sigma-delta data converters
   9.1 Sigma-delta data conversion
      9.1.1 Sigma-delta ADC overview
      9.1.2 Sigma-delta modulators
      9.1.3 Quantization noise power
      9.1.4 SNR estimation
   9.2 Modulator structures
   9.3 Modulated parallel sigma-delta ADCs
   9.4 Data rate decimation

10 Parallel sigma-delta ADCs
   10.1 Linear system model
      10.1.1 Signal transfer function
      10.1.2 Alias-free system
      10.1.3 L-decimated alias-free system
   10.2 Sensitivity to channel mismatches
      10.2.1 Modulator nonidealities
      10.2.2 Modulation sequence errors
      10.2.3 Modulation sequence offset errors
      10.2.4 Channel offset errors
   10.3 Simulation results
      10.3.1 Time-interleaved ADC
      10.3.2 Hadamard-modulated ADC
      10.3.3 Frequency-band decomposed ADC
      10.3.4 Generation of new scheme
   10.4 Noise model of system

11 Sigma-delta ADC decimation filters
   11.1 Design considerations
      11.1.1 FIR decimation filters
      11.1.2 Decimation filter specification
      11.1.3 Signal-to-noise-ratio
   11.2 Simulation results

12 High-speed digital filtering
   12.1 FIR filter realizations
      12.1.1 Architectures
      12.1.2 Partial product generation
   12.2 Implementation complexity
      12.2.1 Adder complexity
      12.2.2 Register complexity
   12.3 Partial product redundancy reduction
      12.3.1 Proposed algorithm
   12.4 ILP optimization
      12.4.1 ILP problem formulation
      12.4.2 DF1 architecture
      12.4.3 DF2 architecture
      12.4.4 DF3 architecture
      12.4.5 TF architecture
      12.4.6 Constant term placement
   12.5 Results
      12.5.1 Architecture comparison
      12.5.2 Coefficient representation
      12.5.3 Subexpression sharing

13 Conclusions and future work
   13.1 Conclusions
   13.2 Future work

Part I: Decoding of low-density parity-check codes

1 Introduction

1.1 Background

Digital communication is used ubiquitously for transferring data between electronic equipment. Examples include cable and satellite TV, mobile phone voice and data transmissions, wired and wireless LAN, GPS, computer peripheral connections through USB and IEEE 1394, and many more. The basic principles of a digital communications system are known, and one of the main advantages of digital communications systems over analog ones is the ability to use error correction coding (ECC) for the data transmission.
ECC is used in almost all digital communications systems to improve link performance and reduce transmitter power requirements [3]. By adding redundant data to the transmitted data stream, the system allows a limited number of transmission errors to be corrected, resulting in a reduction of the number of errors in the transmitted information. However, for the digital data symbols that are received correctly, the received information is identical to that which was sent. This can be contrasted with analog communications systems, where transmission noise irrevocably degrades the signal quality, and the only way to ensure a predefined signal quality at the receiver is to use enough transmitter power. Thus, the metrics used to measure the transmission quality are intrinsically different for digital and analog communications, with bit error rate (BER) or block error rate (BLER) for digital systems, and signal-to-noise ratio (SNR) for analog systems. While analog error correction is not impossible in principle, analog communications systems are different enough on a system level to make practically feasible implementations hard to envisage.

As the quality metrics of digital and analog communications systems are different, the performance of an analog and a digital system cannot easily be objectively compared with each other. However, it is often the case that a digital system with a quality subjectively comparable to that of an analog system requires significantly less power and/or bandwidth. One example is the switch from analog to digital TV, where image coding and ECC allow four standard-definition channels of comparable quality in the same bandwidth as one channel of analog TV. A simple model of a digital communications system is shown in Fig. 1.1.

[Figure 1.1: Simple communications system model. Transmitter chain: data source, source coding, channel coding, modulation; receiver chain: demodulation, channel decoding, source decoding, data sink.]
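The redundancy principle described above can be illustrated with a minimal sketch. The code below (illustrative only; the thesis itself concerns the far more powerful LDPC codes, and all function names here are hypothetical) uses a rate-1/3 repetition code, which corrects any single bit error per group of three code bits by majority vote:

```python
def encode(bits):
    """Toy rate-1/3 repetition code: repeat each information bit three times."""
    return [b for b in bits for _ in range(3)]

def decode(received):
    """Majority vote over each group of three received bits."""
    return [1 if sum(received[i:i + 3]) >= 2 else 0
            for i in range(0, len(received), 3)]

codeword = encode([1, 0, 1, 1])           # 4 information bits -> 12 code bits
corrupted = list(codeword)
corrupted[1] ^= 1                         # one error in the first group
corrupted[5] ^= 1                         # one error in the second group
assert decode(corrupted) == [1, 0, 1, 1]  # both errors are corrected
```

Two errors within the same group of three would still defeat this code, which is the sense in which only a "limited number" of transmission errors can be corrected.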
The modeled system encompasses wireless and wired communications, as well as data storage, for example on optical disks and hard drives. However, the properties of the blocks are dependent on data rate, acceptable error probability, channel conditions, the nature of the data, and so on. In the communications system, data is usually first source coded (or compressed) to reduce the amount of data that needs to be transmitted, and then channel coded to add redundancy protecting against transmission errors. The modulator then converts the digital data stream into an analog waveform suitable for transmission. During transmission, the analog waveform is affected by channel noise, and thus the received signal differs from the one sent. The result is that when the signal is demodulated, the digital data will contain errors. It is the purpose of the channel decoder to correct these errors using the redundancy introduced by the channel coder. Finally, the data stream is unpacked by the source decoder, recreating data suitable to be used by the application. The work in this part of the thesis considers the hardware implementation of the channel decoder for low-density parity-check (LDPC) codes. The decoding of LDPC codes is complex, and is often a major part of the baseband processing of a receiver. For example, the flexible decoder in [79] supports the LDPC codes in IEEE 802.11n and IEEE 802.16e, as well as the Turbo codes in 3GPP-LTE, but at a maximum power dissipation of 675 mW. The need for low-power components is obviously high in battery-driven applications like handheld devices and mobile phones, but becomes increasingly important also in stationary equipment like computers, computer peripherals, and TV receivers, due to the need to remove the waste heat produced. Thus the focus of this work is on reducing the power dissipation of LDPC decoders, without sacrificing the error-correction performance.
LDPC codes were originally discovered in 1962 by Robert Gallager [39]. He showed that the class of codes has excellent theoretical properties, and he also provided a decoding algorithm. However, as the hardware of the time was not powerful enough to run the decoding algorithm efficiently, LDPC codes were not practically usable and were forgotten. They were rediscovered in 1995 [74, 101], and have been shown to perform very close to the theoretical Shannon limit [73, 75]. Since the rediscovery, LDPC codes have been successfully used in a number of applications, and are suggested for use in a number of important future communications standards.

1.2 Applications

Today, LDPC codes are used, or are proposed to be used, in a number of applications with widely different characteristics and requirements. In 2003, a type of LDPC code was accepted for use in the DVB-S2 standard for satellite TV [113]. The same type of code was then adopted for both the DVB-T2 [114] and DVB-C2 [115] standards for terrestrial and cable-based TV, respectively. A similar type has also been accepted for the DTTB standard for digital TV in China [122]. The system-level requirements of these systems are relatively low, with relaxed latency requirements as the communication is unidirectional, and relatively weak constraints on power dissipation, as the user equipment is typically not battery-driven. Thus, the adopted code is complex, with a resulting complex decoder implementation. Opposite requirements apply to the WLAN IEEE 802.11n [118] and WiMAX IEEE 802.16e [120] standards, for which LDPC codes have been chosen as optional ECC schemes. In these applications, communication is typically bi-directional, necessitating low latency. Also, the user equipment is typically battery-driven, making low power dissipation critical. For these applications, the code length is restricted directly by the latency requirements.
However, it is preferable to reduce the decoder complexity as much as possible to reduce the power dissipation. Whereas these types of applications are seen as the primary motivation for the work in this part of the thesis, LDPC codes are also used or suggested in several other standards and applications. Among them are the IEEE 802.3an [121] standard for 10 Gbit/s Ethernet, the IEEE 802.15.3c [119] mm-wave WPAN standard, and the gsfc-std-9100 [116] standard for deep-space communications.

1.3 Scientific contributions

There are two main scientific contributions in the first part of the thesis. The first is a modification to the sum-product decoding algorithm for LDPC codes, called the early-decision algorithm, described in Chapter 3. The aim of the early-decision modification is to dynamically reduce the number of possible states of the decoder during decoding, and thereby reduce the amount of internal communication in the hardware. However, this algorithm modification impacts the error correction performance of the code, and it is therefore also investigated how the modified decoding algorithm can be efficiently combined with the original algorithm to yield a hybrid decoder which retains the performance of the original algorithm while still offering a reduction of internal communication. The second main contribution is an improved algorithm for decoding of rate-compatible LDPC codes, described in Chapter 4. Using rate-compatible LDPC codes obtained through puncturing, the higher-rate codes can trivially be decoded using the decoder of the low-rate mother code. However, by defining a specific code by merging relevant check nodes for each of the punctured rates, the code complexity can be reduced at the same time as the propagation speed of the extrinsic information is increased. The result is a significant reduction in the convergence time of the decoding algorithm for the higher-rate codes.
A minor contribution is the observation of redundancy in the internal data format in a fixed-width implementation of the decoding algorithm. It is shown that a simple data encoding can further reduce the amount of internal communication. The performance of the proposed algorithms has been evaluated in software. For the early-decision algorithm, it is verified that the modifications have an insignificant impact on the error correction performance, and the change in the internal communication is estimated. For the check-merging decoding algorithm, the modifications can even be shown to improve the error correction performance. However, these improvements are mostly due to the reduced convergence time, allowing the algorithm to converge for codewords for which the original algorithm does not have sufficient time. The early-decision and check-merging algorithms have been implemented in a Xilinx Virtex 5 FPGA and an Altera Cyclone II FPGA, respectively. As similar implementations have not been published before, they have mainly been compared with implementations of the original reference decoders. For the early-decision decoder, the required overhead has been determined, and the power dissipations of both the original and the proposed architecture have been simulated and compared using regular quasi-cyclic LDPC codes with an additional scrambling layer. For the check-merging decoder, the required overhead has been determined, and the increased throughput obtainable with the modification has been quantified for two different implementations geared towards the IEEE 802.16e and IEEE 802.11n standards, respectively.

2 Error correction coding

In this chapter, the basics of digital communications systems and error correction coding are explained. In Sec. 2.1, a model of a system using digital communications is shown, and the channel model and different modulation methods are explained. In Sec.
2.2, coding theory is introduced as a way of reducing the required transmission power with a retained bit error probability, and block codes are defined. In Sec. 2.3, LDPC codes are defined as a special case of general block codes, and Tanner graphs are introduced as a way of visualizing the structure of an LDPC code. The sum-product decoding algorithm and the min-sum approximation are discussed in Sec. 2.4, and in Sec. 2.5, rate-compatible LDPC codes are introduced as a way of obtaining practically usable codes with a wide range of rates. Also, the implications of rate-compatibility for the decoding algorithm are discussed. In Sec. 2.6, several general decoder architectures with different degrees of parallelism are discussed, including a serial architecture, a parallel architecture, and a partly parallel architecture. In Sec. 2.7, a partly parallel architecture for a specific class of regular LDPC codes is described, which is also used as a reference for the early-decision algorithm proposed in Chapter 3. In Sec. 2.8, a partly parallel architecture using the min-sum decoding algorithm for general quasi-cyclic LDPC codes is described, which is used as a reference for the check-merging decoding algorithm proposed in Chapter 4.

[Figure 2.1: Digital communications system model. Endpoint A maps the symbol sequence xn ∈ A onto the signal s(t) by modulation; the channel adds noise n(t); endpoint B demodulates the received signal r(t) into the symbols x̃n ∈ B.]

2.1 Digital communications

Consider a two-user digital communications system, such as the one shown in Fig. 2.1, where an endpoint A transmits information to an endpoint B. Whereas multi-user communications systems with multiple transmitting and receiving endpoints can be defined, only systems with one transmitter and one receiver will be considered in this thesis. The system is digital, meaning that the information is represented by a sequence of symbols xn from a finite discrete alphabet A.
The sequence is mapped onto an analog signal s(t) which is transmitted to the receiver through the air, through a cable, or using any other medium. During transmission, the signal is distorted by noise n(t), and thus the received signal r(t) is not equal to the transmitted signal. The demodulator maps the received signal r(t) to symbols x̃n from an alphabet B, which may or may not be the same as alphabet A, and which may be either discrete or continuous. Typically, if the output data stream is used directly by the receiving application, B = A. However, commonly some form of error coding is employed, which can benefit from including symbol reliability information in the reception alphabet B.

2.1.1 Channel models

In analyzing the performance of a digital communications system, the chain in Fig. 2.1 is modeled as a probabilistic mapping P(X̃ = b | X = a), ∀a ∈ A, b ∈ B, from the transmission alphabet A to the reception alphabet B. The system modeled by the probabilistic mapping is formally called a channel, and X and X̃ are stochastic variables denoting the input and output of the channel, respectively. For the channel, the following requirement must be satisfied for discrete reception alphabets:

    ∑_{b∈B} P(X̃ = b | X = a) = 1,  ∀a ∈ A,   (2.1)

or, analogously, for continuous reception alphabets:

    ∫_{B} P(X̃ = b | X = a) db = 1,  ∀a ∈ A.   (2.2)

Depending on the characteristics of the modulator, demodulator, and transmission medium, and on the accuracy requirement of the model, different channel models are suitable. Some common channel models include

• the binary symmetric channel (BSC), a discrete channel defined by the alphabets A = B = {0, 1} and the mapping

    P(X̃ = 0 | X = 0) = P(X̃ = 1 | X = 1) = 1 − p
    P(X̃ = 1 | X = 0) = P(X̃ = 0 | X = 1) = p,

where p is the cross-over probability that the sent binary symbol will be received in error.
The BSC is an adequate channel model in many cases when a hard-decision demodulator is used, as well as in early stages of a system design to compute the approximate performance of a digital communications system.

• the binary erasure channel (BEC), a discrete channel defined by the alphabets A = {0, 1}, B = {0, 1, e}, and the mapping

    P(X̃ = 0 | X = 0) = P(X̃ = 1 | X = 1) = 1 − p
    P(X̃ = e | X = 0) = P(X̃ = e | X = 1) = p
    P(X̃ = 1 | X = 0) = P(X̃ = 0 | X = 1) = 0,

where p is the erasure probability, i.e., the received symbols are either known correctly by the receiver, or known to be unknown. The binary erasure channel is commonly used in theoretical estimations of the performance of a digital communications system due to its simplicity, but can also adequately be used in low-noise system modeling.

• the additive white Gaussian noise (AWGN) channel with noise spectral density N0, a continuous channel defined by a discrete alphabet A, a continuous alphabet B, and the mapping

    P(X̃ = b | X = a) = f(a,σ)(b),   (2.3)

where f(a,σ)(b) is the probability density function of a normally distributed stochastic variable with mean a and standard deviation σ = √(N0/2). The size of the input alphabet is usually determined by the modulation method used, and is further explained in Sec. 2.1.2. The AWGN channel models real-world noise sources well, especially for cable-based communications systems.

• the Rayleigh and Rician fading channels. The Rayleigh channel is appropriate for modeling a wireless communications system when no line-of-sight path is present between the transmitter and receiver, such as in cellular phone networks and metropolitan area networks. The Rician channel is more appropriate when a dominating line-of-sight communications path is available, such as for wireless LANs and personal area networks.

The work in this thesis considers the AWGN channel with a binary input alphabet only.
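As a concrete illustration, the two discrete channel models above can be simulated directly as probabilistic mappings. The following sketch is illustrative only (the function and variable names are not from the thesis, and the erasure symbol e is represented by −1):

```python
import numpy as np

rng = np.random.default_rng(0)

def bsc(bits, p):
    """Binary symmetric channel: each bit is flipped with cross-over probability p."""
    return bits ^ (rng.random(bits.shape) < p)

def bec(bits, p):
    """Binary erasure channel: each bit is erased with probability p.
    The erasure symbol e is represented here by -1."""
    out = bits.copy()
    out[rng.random(bits.shape) < p] = -1
    return out

bits = rng.integers(0, 2, size=10_000)
n_flipped = int(np.count_nonzero(bsc(bits, 0.1) ^ bits))   # approx. 1000 for p = 0.1
n_erased = int(np.count_nonzero(bec(bits, 0.1) == -1))     # approx. 1000 for p = 0.1
```

With p = 0.1, roughly 10% of the bits are flipped by the BSC and roughly 10% are erased by the BEC, as expected from the channel mappings.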
[Figure 2.2: Model of an uncoded digital communications system. The source bits mk ∈ I are symbol mapped to xn ∈ A and modulated to s(t); after the noise n(t) is added, the received signal r(t) is demodulated to x̃n ∈ B and symbol demapped to m̂k ∈ I.]

2.1.2 Modulation methods

The size of the transmission alphabet A for the AWGN channel is commonly determined by the modulation method used. Common modulation methods include

• binary phase-shift keying (BPSK), using the transmission alphabet A = {−√E, +√E} and the reception alphabet B = R, where E denotes the symbol energy.

• quadrature phase-shift keying (QPSK), using the transmission alphabet A = √(E/2)·{(−1 − i), (−1 + i), (+1 − i), (+1 + i)} with complex symbols, and the reception alphabet B = C. The binary source information is mapped in blocks of two bits onto the symbols of the transmission alphabet. As the alphabets are complex, the probability density function in (2.3) is that of the two-dimensional Gaussian distribution.

• quadrature amplitude modulation (QAM), which is a generalization of QPSK to higher orders, using equi-spaced symbols from the complex plane.

In this thesis, BPSK modulation has been assumed exclusively. However, the methods are not limited to BPSK modulation, but may be straightforwardly applied to systems using other modulation methods as well.

2.1.3 Uncoded communication

In order to use the channel for communication of data, some way of mapping the binary source information to the transmitted symbols is needed. In the system using uncoded communications depicted in Fig. 2.2, this is done by the symbol mapper, which maps the source bits mk to the transmitted symbols xn. The transmitted symbols may be produced at a different rate than the source bits are consumed. On the receiver side, the end application is interested in the most likely symbols that were sent, and not in the received symbols.
However, the transmitted and received data are symbols from different alphabets, and thus a symbol demapper is used to infer the most likely transmitted symbols from the received ones, before mapping them back to the binary information stream m̂k. In the uncoded case, this is done on a per-symbol basis. For the BSC, the source bits are mapped directly to the transmitted symbols such that xn = mk, where n = k, whereas the BEC is not used with uncoded communications and is thus not discussed. For the AWGN channel with BPSK modulation, the source bits are conventionally mapped so that the bit 0 is mapped to the symbol +√E, whereas the bit 1 is mapped to the symbol −√E. For higher-order modulation, several source bits are mapped to each symbol, typically using Gray mapping so that symbols that are close in the complex plane differ in only one bit. The optimal decision rules for the symbol demapper can be formulated as follows for different channels. For the BSC,

    m̂k = x̃n        if p < 0.5
    m̂k = 1 − x̃n    if p > 0.5,   (2.4)

where the case p > 0.5 is rather unlikely. For the AWGN channel using BPSK modulation,

    m̂k = 0    if x̃n > 0
    m̂k = 1    if x̃n < 0.   (2.5)

Finally, if QPSK modulation with Gray mapping of source bits to transmitted symbols is used,

    {m̂k, m̂k+1} = 00    if Re x̃n > 0, Im x̃n > 0
                  01    if Re x̃n < 0, Im x̃n > 0
                  11    if Re x̃n < 0, Im x̃n < 0
                  10    if Re x̃n > 0, Im x̃n < 0.   (2.6)

In analyzing the performance of a communications system, the probability of erroneous transmissions is of interest. For BPSK communications with equal symbol probabilities, the bit error probability can be written as

    P_B,BPSK = P(m̂k ≠ mk)
             = P(x̃n > 0 | xn = 1)P(xn = 1) + P(x̃n < 0 | xn = 0)P(xn = 0)
             = Q(√E / σ) = Q(√(2E/N0)),   (2.7)

where Q(x) is the tail probability of the standard normal distribution (i.e., its complementary cumulative distribution function).
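The decision rule (2.5) and the error probability (2.7) can be cross-checked with a small Monte Carlo simulation. The sketch below uses illustrative names and sets E = Eb = 1 for the uncoded case:

```python
import numpy as np
from math import erfc, sqrt

def qfunc(x):
    """Q(x): tail probability of the standard normal distribution."""
    return 0.5 * erfc(x / sqrt(2.0))

rng = np.random.default_rng(1)
ebn0_db = 6.0                           # illustrative operating point
ebn0 = 10.0 ** (ebn0_db / 10.0)
sigma = np.sqrt(1.0 / (2.0 * ebn0))     # sigma = sqrt(N0/2) with E = Eb = 1

bits = rng.integers(0, 2, size=500_000)
symbols = 1.0 - 2.0 * bits              # bit 0 -> +sqrt(E), bit 1 -> -sqrt(E), E = 1
received = symbols + sigma * rng.normal(size=bits.size)
decided = (received < 0).astype(int)    # hard decision, cf. (2.5)
ber_sim = float(np.mean(decided != bits))
ber_theory = qfunc(sqrt(2.0 * ebn0))    # Q(sqrt(2E/N0)), cf. (2.7)
```

At Eb/N0 = 6 dB the theoretical value is about 2.4·10^−3, and the simulated BER agrees with it to within the Monte Carlo accuracy.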
However, it turns out that significantly lower error probabilities can be achieved by adding redundancy to the transmitted information, while keeping the total transmitter power unchanged. Thus, the individual symbol energies are reduced, and the saved energy is used to transmit redundant symbols computed from the information symbols according to some well-defined code.

[Figure 2.3: Error correction system overview. The message (m0, ..., mK−1) is channel coded to the codeword (x0, ..., xN−1) and modulated to s(t); after the noise n(t) is added, the received signal r(t) is demodulated to (x̃0, ..., x̃N−1) and channel decoded to (m̂0, ..., m̂K−1).]

2.2 Coding theory

Consider the error correction system in Fig. 2.3. As the codes in this thesis are block codes, the properties of the system are formulated assuming that a block code is used. Also, it is assumed that the symbols used for the messages are binary. A message m with K bits is to be communicated over a noisy channel. The message is encoded to the codeword x with N bits, where N > K. The codeword is then modulated to the analog signal s(t) using BPSK modulation with an energy of E per bit. During transmission over the AWGN channel, the noise signal n(t) with a one-sided spectral density of N0 is added to the signal to produce the received signal r(t). The received signal is demodulated to produce the received vector x̃, which may contain either bits or scalars. The channel decoder is then used to find the most likely sent codeword x̂, given the received vector x̃. From x̂, the message bits m̂ are then extracted. For the system, a number of properties can be defined:

• The information transmitted is K bits.

• The block size of the code is N bits. Generally, in order to achieve better error correction performance, N must be increased. However, a larger block size requires a more complex encoder/decoder and increases the latency of the system, and there is therefore a trade-off between these factors in the design of the coding system.
• The code rate is R = K/N. Obviously, increasing the code rate increases the amount of information transmitted for a fixed block size N. However, it is also the case that a reduced code rate allows more information to be transmitted at a constant transmitter power level (see Sec. 2.2.1), and the code rate is therefore also a trade-off between error correction performance and encoder/decoder complexity.

• The normalized SNR at the receiver is Eb/N0 = E/(RN0) and is used instead of the actual SNR E/N0 in order to allow a fair comparison between codes of different rates. The normalized SNR is denoted SNR in the rest of this thesis.

• The bit error rate (BER) is the fraction of differing bits in m and m̂, averaged over several blocks.

• The block error rate (BLER) is the fraction of blocks where m and m̂ differ.

Coding systems are analyzed in depth in any introductory book on coding theory, e.g. [3, 102].

2.2.1 Shannon bound

In 1948, Claude E. Shannon proved the noisy channel coding theorem [89], which can be phrased in the following way. With each channel, as defined in Sec. 2.1.1, there is associated a quantity called the channel capacity. The channel capacity is the maximum amount of information, measured in the shannon unit, that can be transferred per channel use while guaranteeing error-free transmission. Thus, transmitting information at a rate below the channel capacity allows an arbitrarily low error rate, i.e., there are arbitrarily good error-correcting codes, whereas at information rates above the channel capacity, data transmission cannot be made error-free, regardless of the code used.

[Figure 2.4: Capacity of the binary-input AWGN channel as a function of the normalized SNR Eb/N0.]
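The capacity of the binary-input AWGN channel (the integral is given as (2.8) below) can be evaluated by direct numerical integration. The following sketch assumes unit symbol energy, a per-symbol SNR of E/N0 = R·Eb/N0, and uses illustrative names:

```python
import numpy as np

def biawgn_capacity(ebn0_db, rate=1.0):
    """Capacity (bits per channel use) of the binary-input AWGN channel,
    evaluated by numerically integrating the mutual information on a grid.
    Symbol energy E = 1, so the noise std is sigma = sqrt(N0/2)."""
    esn0 = rate * 10.0 ** (ebn0_db / 10.0)
    sigma = np.sqrt(1.0 / (2.0 * esn0))
    y = np.linspace(-1.0 - 10.0 * sigma, 1.0 + 10.0 * sigma, 20001)

    def pdf(mean):
        return np.exp(-(y - mean) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

    fp, fm = pdf(+1.0), pdf(-1.0)
    # By symmetry, C = integral of f_plus * log2(2 f_plus / (f_plus + f_minus))
    integrand = fp * np.log2(2.0 * fp / (fp + fm))
    return float(np.sum(integrand) * (y[1] - y[0]))

c_half = biawgn_capacity(0.19, rate=0.5)   # approx. 0.5 bits per channel use
```

For example, biawgn_capacity(0.19, rate=0.5) evaluates to approximately 0.5, consistent with the well-known Shannon limit of about 0.19 dB for rate-1/2 transmission with binary inputs.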
The capacity of the AWGN channel using BPSK modulation and assuming equiprobable inputs is given here without derivation; the calculations can be found e.g. in [3]. It is

    C_BIAWGN = ∫_{−∞}^{∞} f_{√(2E/N0)}(y) · log2( 2·f_{√(2E/N0)}(y) / ( f_{√(2E/N0)}(y) + f_{−√(2E/N0)}(y) ) ) dy,   (2.8)

where f_{±√(2E/N0)}(y) are the probability density functions of Gaussian stochastic variables with means ±√E and standard deviation √(N0/2). In Fig. 2.4, the capacity of the binary-input AWGN channel is plotted as a function of the normalized SNR Eb/N0 = E/(RN0), and it can be seen that reducing the code rate allows error-free communication using less energy, even if more bits are sent for each information bit. Shannon's theorem can be rephrased in the following way: for each information rate (or code rate) there is a limit on the channel conditions, above which communication can achieve an arbitrarily low error rate, and below which communication must introduce errors. This limit is commonly referred to as the Shannon limit, and it is commonly plotted in code performance plots to show how far the code is from the theoretical limit. The Shannon limit can be found numerically for the binary-input AWGN channel by iteratively solving (2.8) for the value of E/N0 that yields the desired information rate.

2.2.2 Block codes

There are two standard ways of defining block codes: through a generator matrix G or through a parity-check matrix H. For a message length of K bits and a block length of N bits, G has dimensions K × N, and H has dimensions M × N, where M = N − K. Denoting the set of codewords by C, C can be defined in the following two ways:

    C = { x = mG | m ∈ {0, 1}^K }   (2.9)
    C = { x ∈ {0, 1}^N | Hx^T = 0 }   (2.10)

The most important property of a code regarding performance is the minimum Hamming distance d, which is the minimum number of bits in which two codewords may differ.
Moreover, as the set of codewords C is linear, d is also the weight of the lowest-weight codeword which is not the all-zero codeword. The minimum distance is important because all transmission errors of weight strictly less than d/2 can be corrected. However, for practical codes, d is often not known exactly, as it is often difficult to calculate theoretically, and exhaustive searches are not realistic with block sizes of thousands of bits. Also, depending on the type of decoder used, the actual error-correcting ability may lie both above and below d/2. Thus the performance of modern codes is usually determined experimentally, by simulations over a noisy channel and by measuring the actual bit or block error rate at the output of the decoder. A simple example of a block code is the (N, K, d) = (7, 4, 3) Hamming code defined by the parity-check matrix

    H = [ 1 1 1 1 0 0 0
          1 1 0 0 1 1 0
          1 0 1 0 1 0 1 ].   (2.11)

The code has a block length of N = 7 bits and a message length of K = 4 bits. Thus the code rate is R = K/N = 4/7. It can easily be shown that the minimum-weight codeword has a weight of d = 3, which is therefore the minimum distance of the code. The error-correcting performance of this code over the AWGN channel is shown in Fig. 2.5. As can be seen, the code performance is only somewhat better than uncoded transmission. There exists a Hamming code with parameters (N, K, d) = (2^m − 1, 2^m − m − 1, 3) for every integer m ≥ 2, and their parity-check matrices are constructed by using every nonzero m-bit vector as a column. The advantage of these codes is that decoding is very simple, and they are used e.g. in memory chips. To decode a received block using Hamming coding, consider for example the (7, 4, 3) Hamming code and a received vector x̃. The syndrome of the received vector is Hx̃^T, which is a three-bit vector. If the syndrome is zero, the received vector is a valid codeword, and decoding is finished.
If the syndrome is non-zero, the received vector becomes a codeword if the bit corresponding to the column in H that matches the syndrome is flipped. Note that the columns of H contain every non-zero three-bit vector, and thus every received vector x̃ is at a distance of at most one from a valid codeword. Decoding therefore consists of changing at most one bit, determined by the syndrome if it is non-zero. To increase the error-correcting performance, the code needs to be able to correct more than single-bit errors, and then the above decoding technique does not work. While the method could be generalized to determine the bits to flip by finding the minimum set of columns whose sum is the syndrome, this is usually not efficient. Thus the syndrome is usually computed only to determine whether a given vector is a codeword or not.

[Figure 2.5: Error-correcting performance (bit error rate vs. Eb/N0) of short codes: uncoded transmission, Hamming(7,4,3), Hamming(255,247,3), Reed-Solomon(31,25,7), LDPC(144,72,10), and LDPC(288,144,14). The Hamming and Reed-Solomon curves are estimations for hard-decision decoding, whereas the LDPC curves are obtained using simulations with soft-decision decoding.]

The performance of some other short codes is also shown in Fig. 2.5. The Hamming and Reed-Solomon curves are estimations for hard-decision decoding obtained using the MATLAB function bercoding. The LDPC codes are randomly constructed (3, 6)-regular codes (as defined in Sec. 2.3). Ensembles of 100 codes were generated, and their minimum distances computed using integer linear programming optimization. Among the codes with the largest minimum distances, the codes with the best performance under the sum-product algorithm were selected. The performance of some long codes is shown in Fig. 2.6.
The performance of the N = 10^7 LDPC code is from [26], whereas the performance of the N = 10^6 codes is from [86]. It is seen that at a block length of 10^6 bits, the LDPC code performs better than the Turbo code. The N = 10^7 code is a highly optimized irregular LDPC code with variable node degrees up to 200, and performs within 0.04 dB of the Shannon limit at a bit error rate of 10^−6. At shorter block lengths of 1000-10000 bits, the performance of Turbo codes and LDPC codes is generally comparable. The (9216, 3, 6) code is a randomly constructed regular code, also used in the simulations in Sec. 3.5.

[Figure 2.6: Error-correcting performance (bit error rate vs. Eb/N0) of long codes: the Shannon limit, LDPC with N = 10^7, LDPC with N = 10^6, Turbo with N = 10^6, and LDPC(9216, 3, 6).]

For block codes, there are three general ways in which a decoding attempt may terminate:

• Decoder successful: The decoder has found a valid codeword, and the corresponding message m̂ equals m.

• Decoder error: The decoder has found a valid codeword, and the corresponding message m̂ differs from m.

• Decoder failure: The decoder was unable to find a valid codeword using the resources specified.

For both the error and the failure result, the decoder has been unable to find the correct sent message m. However, the key difference is that decoder failures are detectable, whereas decoder errors are not. Thus, if, for example, several decoder algorithms are available, the decoding could be retried with another algorithm when a decoder failure occurs.

[Figure 2.7: Example of a Tanner graph for the (7, 4, 3) Hamming code, with variable nodes v0-v6 and check nodes c0-c2.]

[Figure 2.8: Parity-check matrix H for the (7, 4, 3) Hamming code, with rows corresponding to check nodes and columns to variable nodes:

         v0 v1 v2 v3 v4 v5 v6
    c0    1  1  1  1  0  0  0
    c1    1  1  0  0  1  1  0
    c2    1  0  1  0  1  0  1 ]
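The syndrome decoding of the (7, 4, 3) Hamming code described in Sec. 2.2.2 can be sketched as follows (illustrative names; GF(2) arithmetic is done with integer vectors modulo 2):

```python
import numpy as np

# Parity-check matrix of the (7, 4, 3) Hamming code, cf. (2.11) and Fig. 2.8
H = np.array([[1, 1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 1, 1, 0],
              [1, 0, 1, 0, 1, 0, 1]])

def syndrome_decode(r):
    """Hamming syndrome decoding: if the syndrome is non-zero, flip the bit
    whose column of H equals the syndrome (corrects any single-bit error)."""
    r = r.copy()
    s = H @ r % 2
    if s.any():
        position = int(np.argmax((H.T == s).all(axis=1)))
        r[position] ^= 1
    return r

codeword = np.array([1, 1, 1, 1, 0, 0, 0])   # a valid codeword: H @ c mod 2 = 0
received = codeword.copy()
received[4] ^= 1                             # introduce a single-bit error
decoded = syndrome_decode(received)          # recovers the original codeword
```

Since the columns of H are exactly the seven non-zero three-bit vectors, any single-bit error produces a syndrome that directly identifies the flipped position.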
2.3 LDPC codes

A low-density parity-check (LDPC) code is a code defined by a parity-check matrix with low density, i.e., the parity-check matrix H contains a low number of ones. It has been shown [39] that there exist classes of such codes that asymptotically reach the Shannon bound, with a density tending to zero as the block length tends to infinity. Moreover, such codes are obtained with a probability approaching one if the parity-check matrix H is simply constructed at random. However, the design of practical decoders is greatly simplified if some structure can be imposed on the parity-check matrix. This often seems to negatively impact the error-correcting performance of the codes, leading to a trade-off between the performance of the code and the complexity of the encoder and decoder.

2.3.1 Tanner graphs

LDPC codes are commonly visualized using Tanner graphs [92]. Moreover, the iterative decoding algorithms are defined directly on the graph (see Sec. 2.4.1). The Tanner graph consists of nodes representing the columns and rows of the parity-check matrix, with an edge between two nodes if the element in the intersection of the corresponding row and column in the parity-check matrix is 1. Nodes corresponding to columns are called variable nodes, and nodes corresponding to rows are called check nodes. As there are no intersections between columns or between rows, the resulting graph is bipartite, with all edges between variable nodes and check nodes. An example of a Tanner graph is shown in Fig. 2.7, and its corresponding parity-check matrix is shown in Fig. 2.8. Comparing with (2.11), it is seen that the matrix is that of the (7, 4, 3) Hamming code.

Having defined the Tanner graph, there are some properties which are interesting for the decoding algorithms for LDPC codes.

• A check node regular code is a code for which all check nodes have the same degree.
• A variable node regular code is a code for which all variable nodes have the same degree.

• A (j, k)-regular code is a code which is variable node regular with variable node degree j and check node regular with check node degree k.

• The girth of a code is the length of the shortest cycle in its Tanner graph.

• The diameter of a code is the largest distance between two nodes in its Tanner graph.

Using a regular code can simplify the decoder architecture. However, it has also been conjectured [39] that regular codes cannot be capacity-approaching under message-passing decoding. The conjecture will be proved if it can be shown that cycles in the code cannot enhance the performance of the decoder on average. Furthermore, it has also been shown [25, 27, 72, 87] that codes need to have a wide range of node degrees in order to be capacity-approaching. Therefore, assuming that the conjecture is true, there is a trade-off between code performance and decoder complexity regarding the regularity of the code.

The sum-product decoding algorithm for LDPC codes computes exact marginal bit probabilities when the code's Tanner graph is free of cycles [65]. However, it can also be shown that the graph must contain cycles for the code to have more than minimal error-correcting performance [36]. Specifically, it is shown that for a cycle-free code C with parameters (N, K, d) and rate R = K/N, the following conditions apply: if R ≥ 0.5, then d ≤ 2, and if R < 0.5, then C is obtained from a code with R ≥ 0.5 and d ≤ 2 by repetition of certain symbols. Thus, as cycles are needed for the code to have good theoretical properties, but also inhibit the performance of the practical decoder, the concept of girth is important. Using a code with large girth and small diameter is generally expected to improve the performance, and codes are therefore usually designed so that the girth is at least six.
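Since the girth plays this central role, it is useful to be able to compute it from a given parity-check matrix. A sketch (illustrative names) that builds the Tanner graph and finds the shortest cycle by breadth-first search from every node:

```python
import math
import numpy as np
from collections import deque

def tanner_girth(H):
    """Girth of the Tanner graph of H (math.inf if the graph is cycle-free),
    computed by a breadth-first search started from every node."""
    M, N = H.shape
    adj = [[] for _ in range(N + M)]
    for m, n in zip(*np.nonzero(H)):      # rows are check nodes, columns variable nodes
        adj[n].append(N + m)              # edge: variable node n -- check node m
        adj[N + m].append(n)
    best = math.inf
    for start in range(N + M):
        dist = {start: 0}
        parent = {start: -1}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    parent[w] = u
                    queue.append(w)
                elif w != parent[u]:
                    # non-tree edge: closes a cycle of length <= dist[u] + dist[w] + 1
                    best = min(best, dist[u] + dist[w] + 1)
    return best

H_hamming = np.array([[1, 1, 1, 1, 0, 0, 0],
                      [1, 1, 0, 0, 1, 1, 0],
                      [1, 0, 1, 0, 1, 0, 1]])
g = tanner_girth(H_hamming)   # -> 4
```

For the (7, 4, 3) Hamming code of (2.11) this returns 4: variable nodes v0 and v1 share the two check nodes c0 and c1, forming a length-4 cycle. Codes designed with a girth of at least six avoid exactly such short cycles.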
2.3.2 Quasi-cyclic LDPC codes

One common way of imposing structure on an LDPC code is to construct the parity-check matrix from equally sized sub-matrices which are either all zeros or cyclically shifted identity matrices. These types of LDPC codes are denoted quasi-cyclic (QC-LDPC) codes. Typically, QC-LDPC codes are defined from a base matrix Hb of size Mb × Nb with integer elements:

H_b = \begin{pmatrix}
H_b(0, 0) & H_b(0, 1) & \cdots & H_b(0, N_b - 1) \\
H_b(1, 0) & H_b(1, 1) & \cdots & H_b(1, N_b - 1) \\
\vdots & \vdots & \ddots & \vdots \\
H_b(M_b - 1, 0) & H_b(M_b - 1, 1) & \cdots & H_b(M_b - 1, N_b - 1)
\end{pmatrix}  (2.12)

Figure 2.9 Parity-check matrix structure of randomized quasi-cyclic codes obtained through joint code and decoder architecture design.

For an expansion factor of z, a parity-check matrix H of size Mb·z × Nb·z is constructed from Hb by replacing each element with a square sub-matrix of size z × z. The sub-matrix is the all-zero matrix if Hb(m, n) = −1; otherwise it is an identity matrix circularly right-shifted by φ(Hb(m, n), z). φ(k, z) is commonly a scaling function, a modulo function, or the identity function. Methods of constructing QC-LDPC codes include algebraic methods [55, 77, 95], geometric methods [63, 71, 95], and random or optimization approaches [37, 98]. QC-LDPC codes tend to have decent performance while also allowing the implementation to be efficiently parallelized. The block size may easily be adapted by changing the expansion factor z. Also, certain construction methods can ensure that the girth of the code is at least 8 [96]. In all of the standards using LDPC codes that are referenced in this thesis, the codes have QC-LDPC structure.
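As an illustration (our own sketch, not from any standard; φ is taken here as the modulo function, which is only one of the choices mentioned above), the expansion of a base matrix can be written as:

```python
def expand_qc(Hb, z, phi=lambda k, z: k % z):
    """Expand base matrix Hb by factor z: an entry of -1 becomes a z x z
    all-zero block, an entry k >= 0 becomes an identity matrix circularly
    right-shifted by phi(k, z)."""
    Mb, Nb = len(Hb), len(Hb[0])
    H = [[0] * (Nb * z) for _ in range(Mb * z)]
    for m in range(Mb):
        for n in range(Nb):
            k = Hb[m][n]
            if k < 0:
                continue
            s = phi(k, z)
            for i in range(z):
                # Row i of a right-shifted identity has its 1 in
                # column (i + s) mod z.
                H[m * z + i][n * z + (i + s) % z] = 1
    return H
```

The same routine adapts the block length by changing z only, which is exactly the flexibility the text attributes to QC-LDPC codes.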
2.3.3 Randomized quasi-cyclic codes

The performance of regular quasi-cyclic codes can be increased relatively easily by the addition of a randomizing layer in the hardware architecture. This type of code resulted from an effort of joint code and decoder architecture design [107, 110]. The codes are (3, k)-regular, with the general structure shown in Fig. 2.9, and have a girth of at least six. In the figure, I represents L × L identity matrices, where L is a scaling constant, and P represents cyclically shifted L × L identity matrices. The column weight is 3, and the row weight is k. Thus there are k² each of the I- and P-type matrices. The bottom part is a partly randomized matrix, also with row weight k. The submatrix is obtained from a quasi-cyclic matrix by moving some of the ones within their columns according to certain constraints. The constraints are best described directly by the decoder implementation, described in Sec. 2.7.

2.4 LDPC decoding algorithms

Normally, LDPC codes are decoded using a belief propagation algorithm. In this section, the sum-product algorithm and the common min-sum approximation are explained.

2.4.1 Sum-product algorithm

The sum-product decoding algorithm is defined directly on the Tanner graph of the code [39, 65, 74, 101]. It is an iterative algorithm, consecutively propagating bit probabilities and parity-check constraint satisfiability likelihoods until the algorithm converges to a valid codeword, or a predefined maximum number of iterations is reached. A number of variables are defined:

• The prior probabilities p_n^0 and p_n^1 denote the probabilities that bit n is zero and one, respectively, considering only the received channel information and not the code structure.

• The variable-to-check messages q_nm^0 and q_nm^1 are defined for each edge between a variable node n and a check node m.
They denote the probabilities that bit n is zero and one, respectively, considering the prior variable probabilities and the likelihood that parity-check relations other than m involving bit n are satisfied.

• The check-to-variable messages r_mn^0 and r_mn^1 are defined for each edge between a check node m and a variable node n. They denote the likelihoods that parity-check relation m is satisfied considering variable probabilities for the other involved bits given by their variable-to-check messages, and given that bit n is zero and one, respectively.

• The pseudo-posterior probabilities q_n^0 and q_n^1 are updated in each iteration and denote the probabilities that bit n is zero and one, respectively, considering the information propagated so far during the decoding.

• The hard-decision vector x̂n denotes the most likely bit values, considering bit n and its surrounding. The number of surrounding bits considered increases with each iteration.

Figure 2.10 Sum-product decoding: Initialization phase

Figure 2.11 Sum-product decoding: Variable node update phase

Decoding a received vector consists of three phases: initialization phase, variable node update phase, and check node update phase. In the initialization phase, shown in Fig. 2.10, the messages are cleared and the prior probabilities are initialized to the individual bit probabilities based on received channel information. In the variable node update phase, shown in Fig. 2.11, the variable-to-check messages are computed for each variable node from the prior probabilities and the check-to-variable messages along the adjoining edges. Also, the pseudo-posterior
probabilities are calculated, and the hard-decision bits are set to the most likely bit values based on the pseudo-posterior probabilities. In the check node update phase, shown in Fig. 2.12, the check-to-variable messages are computed based on the variable-to-check messages, and all check node relations are evaluated based on the hard-decision vector. If all check node constraints are satisfied, decoding stops, and the current hard-decision vector is output.

Figure 2.12 Sum-product decoding: Check node update phase

Decoding continues until either a valid codeword is found, or a preset maximum number of iterations is reached. In the latter case, a decoding failure occurs, whereas the former case results in either a decoder success or a decoder error. However, for well-defined codes with block lengths of at least 1000 bits, decoder errors are extremely rare. Therefore, when a decoding attempt is unsuccessful, it will almost always be known.

Decoding is usually performed in the log-likelihood ratio domain using the variables γn = log(p_n^0/p_n^1), αnm = log(q_nm^0/q_nm^1), βmn = log(r_mn^0/r_mn^1) and λn = log(q_n^0/q_n^1). In this domain, the update equations can be written [65]

\alpha_{nm} = \gamma_n + \sum_{m' \in M(n) \setminus m} \beta_{m'n}  (2.13)

\beta_{mn} = \prod_{n' \in N(m) \setminus n} \operatorname{sign}(\alpha_{n'm}) \cdot \Phi\left( \sum_{n' \in N(m) \setminus n} \Phi(|\alpha_{n'm}|) \right)  (2.14)

\lambda_n = \gamma_n + \sum_{m' \in M(n)} \beta_{m'n}  (2.15)

where M(n) denotes the neighbors of variable node n, N(m) denotes the neighbors of check node m, and Φ(x) = −log tanh(x/2). The sum-product algorithm is used in the implementation of the early-decision algorithm in Chapter 3.

2.4.2 Min-sum approximation

Whereas (2.13) and (2.15) consist of sums and are simple to implement in hardware, (2.14) is a bit more complex. One way of simplifying the hardware implementation is the use of the min-sum approximation [38], which replaces the check node operation by the minimum of the arguments.
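As a software sketch (our naming; the hardware operates on packed signed-magnitude data instead), the min-sum check node operation, including the offset δ introduced below, can be expressed as:

```python
def check_update_minsum(alphas, delta=0.0):
    """Offset min-sum check node update: for each incoming variable-to-check
    message, return the outgoing check-to-variable message computed from all
    the *other* inputs. Requires at least two inputs."""
    signs = [1 if a >= 0 else -1 for a in alphas]   # zero treated as positive
    mags = [abs(a) for a in alphas]
    total_sign = 1
    for s in signs:
        total_sign *= s
    # Only the two smallest magnitudes (and the index of the smallest)
    # are needed to describe every outgoing message.
    order = sorted(range(len(mags)), key=mags.__getitem__)
    m1, m2 = mags[order[0]], mags[order[1]]
    out = []
    for i, s in enumerate(signs):
        mag = m2 if i == order[0] else m1           # exclude own input
        out.append(total_sign * s * max(mag - delta, 0.0))
    return out
```

Note that the outputs take at most two distinct magnitudes; this is the property exploited for compact check-to-variable message storage later in this chapter.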
The min-sum approximation results in an overestimation of the reliabilities of the messages, as only the probability for one message is used in the operation. This can be partly compensated for by applying an offset to the message magnitudes [23], resulting in the following equations:

\alpha_{nm} = \gamma_n + \sum_{m' \in M(n) \setminus m} \beta_{m'n}  (2.16)

\beta_{mn} = \prod_{n' \in N(m) \setminus n} \operatorname{sign}(\alpha_{n'm}) \cdot \max\left( \min_{n' \in N(m) \setminus n} |\alpha_{n'm}| - \delta, 0 \right)  (2.17)

\lambda_n = \gamma_n + \sum_{m' \in M(n)} \beta_{m'n}  (2.18)

where δ is a constant determined by simulations. An additional result of the approximation is that the number of different messages from a specific check node is reduced to at most two. This enables a significant reduction in the memory requirements for storage of the check-to-variable messages in some architectures, especially when a layered decoding schedule is used. The offset min-sum algorithm is used in the implementation of the rate-compatible decoder in Chapter 4.

2.5 Rate-compatible LDPC codes

A class of rate-compatible codes is defined as a set of codes with the same number of codewords but different rates, where codewords of higher rates can be obtained from codewords of lower rates by removing bits at fixed positions [48]. Thus the information content is the same in the codes, but the amount of parity information differs. The benefits of rate-compatibility include better adaptation to channel environments and more efficient implementations of encoders and decoders. Better adaptation to channel environments is achieved through the large number of possible rates to choose from, whereas more efficient implementations are achieved through the reuse of hardware between the encoders and decoders of the different rates. An additional advantage is the possibility to use smart ARQ schemes, where a retransmission consists of a small number of extra parity bits rather than a completely recoded packet. There are two main methods of defining such classes of codes: puncturing and extension.
Using puncturing, a low-rate mother code is designed and the higher-rate codes are then defined by removing bits at fixed positions in the blocks. Using extension, lower-rate codes are defined from a high-rate mother code by adding additional parity bits.

Figure 2.13 Recovery tree of a 2-SR node. The circles are variable nodes and the squares are check nodes. The filled circles are punctured nodes.

Disadvantages of rate-compatible codes include reduced performance of the code and decoder. Since it is difficult to optimize a class of rate-compatible codes for a range of different rates, there will generally be a performance difference between a rate-compatible code and a dedicated code of a specific rate. However, better adaptation to channel conditions may still allow a decrease in the average number of transmitted bits.

2.5.1 SR-nodes

For LDPC codes, a straightforward way of decoding rate-compatible codes obtained through puncturing is to use the parity-check matrix of the low-rate mother code and initialize the prior LLRs of the punctured nodes to zero. However, such a node will delay the probability propagation of its check node neighbors until it receives a non-zero message from one of them. The concept of k-step recoverable (k-SR) nodes was introduced in [47], based on the assumption that the performance of an LDPC code using a particular puncturing pattern is mainly determined by the recovery time of the punctured nodes. The recovery time of a punctured variable node is defined as the minimum number of iterations required before the node can start to produce non-zero messages. A non-punctured node can thus be denoted a 0-SR node. A punctured node having at least one check node neighbor for which it is the only punctured node may receive a non-zero message from that node in the first iteration and is thus a 1-SR node. Generally, a k-SR node is reinitialized by its neighbors after k iterations.
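The k-SR classification above can be computed mechanically for a given puncturing pattern. The following is our own hedged sketch of that computation, simulating which punctured nodes can first produce non-zero messages in each iteration:

```python
def sr_levels(checks, punctured, max_iter=10):
    """checks: list of variable-index lists, one per check node;
    punctured: set of punctured variable indices.
    Returns {variable: k}, where k = 0 for unpunctured nodes and a
    punctured node is k-SR when it is first recovered in iteration k."""
    level, known = {}, set()
    for v_list in checks:
        for v in v_list:
            if v not in punctured:
                level[v] = 0
                known.add(v)
    for k in range(1, max_iter + 1):
        newly = set()
        for v_list in checks:
            unknown = [v for v in v_list if v not in known]
            if len(unknown) == 1:
                # A check whose only unrecovered neighbor is this node
                # sends it a non-zero message in iteration k.
                newly.add(unknown[0])
        if not newly:
            break
        for v in newly:
            level[v] = k
        known |= newly
    return level
```

For example, with checks [[0, 1], [1, 2]] and nodes 1 and 2 punctured, node 1 is 1-SR (check 0 has no other punctured neighbor) and node 2 is 2-SR, matching the recovery-tree picture of Fig. 2.13.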
Figure 2.13 shows the recovery tree of a 2-SR punctured node. The 2-SR node has no other check node neighbors for which it is the only punctured node.

In [47], a greedy algorithm was proposed that successively chooses variable nodes that receive likelihood data from their neighbors as early as possible. Since then, several other greedy algorithms have been proposed. In [24], an algorithm that takes into account the reliability of the information used for reinitializing punctured nodes was suggested. However, it can only be used for parity-check matrices with dual-diagonal structures. In [46] and [83], algorithms maximizing the reinitialization reliability for general parity-check matrix structures were proposed. The algorithm in [46] is essentially a refinement of the one proposed in [47], sequentially puncturing k-SR nodes for increasing values of k while trying to choose nodes with high recovery reliability. In [83], the aims are similar, but instead of minimizing the number of iterations needed for recovery, the algorithm tries to choose nodes based on the number of unpunctured nodes used in the computation of the recovery information.

2.5.2 Decoding of rate-compatible codes

Rate-compatible LDPC codes can be straightforwardly decoded by any LDPC decoder simply by initializing the a-priori LLR values of punctured nodes to zero. In that respect, any LDPC decoder is also a rate-compatible decoder. However, by reducing the number of different codes that must be supported by the decoder, the usage of rate-compatible codes may allow simplifications to the architecture. In [106], a rate-compatible decoder is presented for the WiMAX rate-1/2 code, where hard-wiring the information exchange between the nodes allows a low-complexity implementation. In contrast, in Chapter 4 an alternative decoding algorithm is proposed, which removes the punctured nodes altogether from the code.
The result is a significant increase in the propagation speed of the messages in the decoder, which together with a reduced code complexity allows a significant throughput increase.

2.6 LDPC decoder architectures

To achieve good theoretical properties, the code is typically required to have a certain degree of randomness or irregularity. However, this makes efficient hardware implementations difficult [99, 104]. For example, a direct instantiation of the Tanner graph of a 1024-bit code in a 0.16 µm CMOS process resulted in a chip with more than 26000 wires with an average length of 3 mm, and a routing overhead of 50% [18, 52]. It is also the case that the required numerical accuracy of the computations is low. Thus, the sum-product algorithm can be said to be communication-bound rather than computation-bound.

The architectures for the sum-product decoding algorithm can be divided into three main types [44]: the parallel, the serial, and the partly parallel architecture. These are briefly described here.

Figure 2.14 Parallel architecture for sum-product decoding algorithm

Figure 2.15 Serial architecture for sum-product decoding algorithm

2.6.1 Parallel architecture

Directly instantiating the Tanner graph of the code yields the parallel architecture, as shown in Fig. 2.14. As the check node computations, as well as the variable node computations, are intra-independent (i.e., the check node computations depend only on the results of variable node computations, and vice versa), the algorithm is inherently parallelizable. All check node computations can be done in parallel, followed by computations of all the variable nodes. An example implementation is the above-mentioned 1024-bit code decoder, achieving a throughput of 1 Gb/s while performing 64 iterations.
The chip has an active area of 52.5 mm² and a power dissipation of 690 mW, and is manufactured in a 0.16 µm CMOS process [18]. However, due to the graph irregularity required for good codes, the parallel architecture is hardly scalable to larger codes. Also, the irregularity of purely random codes makes it difficult to time-multiplex the computations efficiently.

Figure 2.16 Partly parallel architecture for sum-product decoding algorithm

2.6.2 Serial architecture

Another obvious architecture is the serial architecture, shown in Fig. 2.15. In the serial architecture, the messages are stored in a memory between generation and consumption. Control logic is used to schedule the variable node and check node computations, and the code structure is realized through the memory addressing. However, in a code with good theoretical properties, the sets of check-to-variable messages that a set of variable nodes depend on are largely disjoint (e.g., in a code with girth six, at most one check-to-variable message is shared between the dependencies of any two variable nodes), which makes an efficient code schedule difficult and requires that the memory contain most of the messages. Moreover, in a general code, increasing the throughput by partitioning the memory is made difficult by the irregular dependencies of node computations, although certain code construction methods (e.g., QC-LDPC codes) can ensure that such a partitioning can be done. Still, the performance of the serial architecture is likely to be severely limited by memory accesses. In [105], iteration-level loop unrolling was used to achieve 1 Gb/s throughput for a serial-like decoder for an (N, j, k) = (4608, 4, 36) code, but with memory requirements of 73728 words per iteration.
2.6.3 Partly parallel architecture

A third possible architecture is the serial/parallel, or partly parallel, architecture, shown in Fig. 2.16, which can be seen as either a time-multiplexed parallel decoder or an interleaved serial decoder. The partly parallel architecture retains the speed achievable with parallel processing, while also allowing longer codes without resulting in excessive routing. However, neither the parallel nor the serial architecture can usually be efficiently transformed using a general random code. Thus, the use of a partly parallel architecture usually requires a joint code and decoder design flow. Generally, the QC-LDPC codes (see Sec. 2.3.2) obtained through various construction techniques are suitable for use with a partly parallel architecture. In the partly parallel architecture, parallelism can be achieved in a number of different ways, with different advantages and drawbacks. Considering a QC-LDPC code, the main parallelism choices are either inside or across the sub-matrices. Examples of the partly parallel architecture include a 3.33 Gb/s (1200, 720) code decoder with a power dissipation of 644 mW, manufactured in a 0.18 µm technology [70], and a 250 Mb/s (1944, 972) code decoder dissipating 76 mW, manufactured in a 0.13 µm technology [90].

The implementations in this thesis are of the partly parallel architecture type. The implementation in Chapter 3 is based on an FPGA implementation achieving a throughput of 54 Mbps for a (9216, 4608) code using a clock frequency of 56 MHz and performing 18 iterations [108]. For the implementation in Chapter 4, two degrees of parallelism are investigated: 24 and 81, respectively. The examined codes are those of the IEEE 802.16e WiMAX standard and the IEEE 802.11n WLAN standard.
The achieved throughputs are rate-dependent, but throughputs starting from 33 Mbps and 100 Mbps, respectively, are achieved in an FPGA implementation with a clock frequency of 100 MHz and performing 10 iterations. The implementation is based on [66].

2.6.4 Finite wordlength considerations

Earlier investigations have shown that the sum-product algorithm for LDPC decoding has relatively low requirements on data wordlength [111]. Even using as few as 4–6 bits yields only a fraction of a dB of SNR penalty for the decoder performance. Considering the equations (2.13)–(2.15), the variable node computations, (2.13) and (2.15), are naturally done using two's complement arithmetic, whereas the check node computations (2.14) are more efficiently carried out using signed-magnitude arithmetic. This is the solution chosen for the implementation in Sec. 3.3, and data representation converters are therefore used between the variable node and check node processing elements. It should also be noted that due to the inverting characteristic of the domain transfer function Φ(x), shown in Fig. 2.17, there is a big difference between positive and negative zero in the signed-magnitude representation, as the sum in (2.14) is given directly as an argument to Φ(x). This makes the signed-magnitude representation particularly suited for the check node computations.

Considering Φ(x), the fixed-point implementation is not trivial.

Figure 2.17 Φ(x) and discrete functions obtained through rounding. For the discrete functions, the notation is (wi, wf), where wi is the number of integer bits and wf is the number of fractional bits.

The functions obtained through rounding of the function values are shown for some different data representations in Fig. 2.17.
Whereas other quantization rules, such as truncation, rounding towards infinity, or arbitrary rules obtained through simulations, can be considered, rounding to nearest has been used in this thesis, and the problem is not considered further. However, it can be noted that because of the highly non-linear nature of Φ(x), many numbers do not occur as function values in the fixed-point implementations. This fact can be exploited, and in Sec. 5.2 a compression scheme is introduced.

2.6.5 Scaling of Φ(x)

Considering the Φ(x) function in (2.14), using the natural base for the logarithm is not necessary. However, when the natural base is used, the inverse function Φ⁻¹(x) = 2 arctanh(e⁻ˣ) is identical to Φ(x), and thus in (2.14) the inverse transformation can be done using Φ(x). However, the arithmetic transformation of the check node update rule to a sum of magnitudes works equally well with any other logarithm base.

Figure 2.18 Partly parallel (3, k)-regular sum-product decoder overview.

The resulting difference between the forward transformation function Φ(x) and the reverse transformation function Φ⁻¹(x) may or may not be a problem in the implementation. In a fixed-point implementation, changing the logarithm base can be seen as a scaling of the inputs and outputs of the CNU, which can often be done without any overhead if separate implementations are already used for the forward and reverse transformation functions. In [31], it is shown that such a scaling can improve the performance of the sum-product decoder using fixed-point data. In Sec. 5.2, it is shown how the benefits of internal data coding depend on the choice of logarithm base.
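The self-inverse property, and the fact that a change of logarithm base amounts to a scaling of the CNU inputs and outputs, can be checked numerically. This is a small verification sketch of our own, not part of the decoder:

```python
import math

def phi(x):
    # Phi(x) = -log(tanh(x/2)); its own inverse: Phi(Phi(x)) = x
    return -math.log(math.tanh(x / 2))

def phi_scaled(x, s):
    # Changing the logarithm base to b corresponds to s = ln(b):
    # scale the input by s and the output by 1/s. The scaled
    # transform is still its own inverse.
    return phi(x * s) / s

assert abs(phi(phi(1.7)) - 1.7) < 1e-9
assert abs(phi_scaled(phi_scaled(0.8, math.log(2)), math.log(2)) - 0.8) < 1e-9
```

In hardware, the scaling by s and 1/s can be folded into the two lookup tables, which is why the base change can often be made free of overhead.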
2.7 Sum-product reference decoder architecture

In this section, the sum-product reference decoder for the work in Chapter 3 is presented. The section includes an architecture overview, a description of the memory blocks, and implementations of the variable node processing unit (VNU) and check node processing unit (CNU), as well as the interconnection networks.

Figure 2.19 Memory block of the (3, k)-regular sum-product decoder, containing the memories used during decoding, and the VNUs.

2.7.1 Architecture overview

An overview of the architecture is shown in Fig. 2.18. The architecture contains k² memory blocks with the memory banks used during decoding and the VNUs. The memory blocks are connected to 3k CNUs through two regular and fixed interconnection networks (π1 and π2), and one randomized and configurable interconnection network (π3). The purpose of the VNUs is to perform the computations of the variable-to-check messages αnm, as in (2.13), and the pseudo-posterior probabilities λn, as in (2.15). Similarly, the purpose of the CNUs is to perform the computation of the check-to-variable messages βmn, as in (2.14), and to compute the check parities using the hard-decision variables x̂n obtained from the pseudo-posterior probabilities. However, in the implementation, the computation of Φ(αnm) is moved from the CNUs to the VNUs, and thus the implementation deviates slightly from the formal formulation of the algorithm.

2.7.2 Memory block

The structure of the memory blocks is shown in Fig. 2.19, where w denotes the wordlength of the prior probabilities γn. One memory block contains one INT RAM of width w for storing γn, one DEC RAM of width 1 to store the hard-decision values x̂n, and three EXT RAMi of width w + 1 to store the extrinsic information αnm and βmn during decoding. Thus, there are a total of 5k² memories.
All memories are of size L. The EXT RAMi are dual-port memories, read and written every clock cycle in both the variable node update phase and the check node update phase, whereas the INT RAM and DEC RAM are single-port memories used only in the variable node update phase. The messages are stored in two's complement representation in INT RAM and in signed-magnitude representation in EXT RAMi. At the output of the EXT RAMi, there are two registers. Thus, switching of signals in the CNU can be disabled in the VNU phase and vice versa.

Figure 2.20 Implementation of VNU for the (3, k)-regular sum-product decoder.

2.7.3 Variable node processing unit

The implementation of the VNU is relatively straightforward, as shown in Fig. 2.20. As the extrinsic information is stored in signed-magnitude format, the data are first converted to two's complement. An adder network performs the computations of the variable-to-check messages and the pseudo-posterior probabilities, and the variable-to-check messages are then converted back to signed-magnitude format and truncated to w bits. Finally, the Φ(x) function (see Fig. 2.17) of the magnitude is computed, and the results are joined with the hard-decision bit to form the (w + 1)-bit hybrid data.

Figure 2.21 Implementation of CNU for the (3, k)-regular sum-product decoder with k = 6.

2.7.4 Check node processing unit

The implementation of the CNU for k = 6 is shown in Fig. 2.21.
First, the hard-decision bits are extracted and XORed to compute the parity of the check constraint. Then the check-to-variable messages are computed using the function S(x, y), defined as

S(x, y) = sign(x) · sign(y) · (|x| + |y|).  (2.19)

Finally, the results are truncated to w bits, and the Φ(x) function of the magnitude is computed.

Figure 2.22 Function of π1 interconnection network, with am,n denoting the message from/to memory block (m, n).

Figure 2.23 Function of π2 interconnection network, with am,n denoting the message from/to memory block (m, n).

2.7.5 Interconnection networks

The functions of the fixed interconnection networks are shown in Figs. 2.22 and 2.23. π1 connects the messages from Blockx,y with the same x coordinate to the same CNU, whereas π2 connects the messages with the same y coordinate to the same CNU. The function of the configurable interconnection network π3 is shown in Fig. 2.24 for the forward path. am,n denotes the message from Blockm,n. am,n are permuted to bm,n by the permutation functions Ψl,n, where l is the value of AG1, defined in Sec. 2.7.6. Ψl,n is either the identity permutation or the fixed permutation Rn, depending on l and n. Thus, formally

b_{m,n} = \begin{cases} a_{m,n} & \text{if } \psi_{l,n} = 0 \\ a_{R_n(m),n} & \text{if } \psi_{l,n} = 1 \end{cases}  (2.20)

where ψl,n are values stored in a ROM. Similarly, bm,n are permuted to cm,n through the permutation functions Ωl,m, which are either the identity permutations or the fixed permutations Cm, depending on l and m. The permutation can be formalized as

c_{m,n} = \begin{cases} b_{m,n} & \text{if } \omega_{l,m} = 0 \\ b_{m,C_m(n)} & \text{if } \omega_{l,m} = 1 \end{cases}  (2.21)
Figure 2.24 Forward path of π3 interconnection network, with am,n denoting the message from memory block (m, n). Ψl,n and Ωl,m denote permutation functions that are either the identity permutation or fixed permutations Rn and Cm, respectively.

where ωl,m are values stored in a ROM. Finally, the values cm,n are connected to CNU3,n.

2.7.6 Memory address generation

The addresses to the EXT RAMi memories are given by mod-L binary counters AGix,y, i = 1, 2, 3, where AG1 is also used for the INT RAM and DEC RAM memories. The counters are initialized to 0 at the beginning of variable node processing, and to the values Cix,y at the beginning of check node processing, which are chosen according to the constraints

C_{x,y}^1 = 0  (2.22)

C_{x,y}^2 = ((x - 1) \cdot y) \bmod L  (2.23)

C_{x,y_1}^3 \neq C_{x,y_2}^3, \quad \forall y_1, y_2 \in \{1, \ldots, k\},\ y_1 \neq y_2  (2.24)

C_{x_1,y}^3 - C_{x_2,y}^3 \neq ((x_1 - x_2) \cdot y) \bmod L, \quad \forall x_1, x_2 \in \{1, \ldots, k\},\ x_1 \neq x_2.  (2.25)

The purpose of the constraints is to ensure that the resulting code has a girth of at least six; they are further motivated in [108].

x     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Φ[x] 15   8   6   4   3   2   2   1   1   1   1   1   0   0   0   0

Table 2.1 Definition of the Φ[x] function for a (wi, wf) = (3, 2) fixed-point data representation.

The architecture as described results in the code shown in Fig. 2.9, where Ix,y denotes an L × L identity matrix, Px,y denotes an L × L identity matrix circularly right-shifted by C²x,y, and Rx,y denotes an Lk × L matrix with column weight one and row weight at most one, subject to further constraints imposed by C³x,y and π3.
The extrinsic data corresponding to Ix,y is stored in EXT RAM1 in Blockx,y, the data corresponding to Px,y in EXT RAM2 in Blockx,y, and the data corresponding to Rx,y in EXT RAM3 in Blockx,y.

2.7.7 Φ function

The definition of the Φ[x] function used in the sum-product reference decoder architecture is shown in Table 2.1. The same function is used for the early-decision decoder without enforcing check constraints in Sec. 3.3.

2.8 Check-serial min-sum decoder architecture

In this section, the reference decoder [66] for the work in Chapter 4 is presented. The decoder handles all the LDPC codes of the WiMAX standard. It uses the min-sum algorithm ((2.16)–(2.18)) with a layered schedule and serial check node operations. Using a layered schedule, bit-sums are updated directly after the completion of each layer, effectively replacing the variable node update equation (2.16) with updates directly at the end of each check node computation. The result is a significant increase in the propagation rate of the messages, which reduces the convergence time of the algorithm [50]. This section includes a description of the schedule, an architecture overview, and details on the check node function unit (CFU).

Figure 2.25 Schedule of check-serial min-sum decoder.

2.8.1 Decoder schedule

The schedule of the decoder and the relevant definitions are shown in Fig. 2.25:

• A layer is a group of check nodes where all the involved variable nodes are independent. It may be bigger than the expansion factor of the base parity-check matrix.

• A check node group is a number of check nodes that are processed in parallel. The variable nodes in the check node groups are processed serially, with the bit-sums for each variable node updated after each operation.
• A bit-sum block is the bit-sums of the variable nodes involved in the check node groups. The bit-sums in a bit-sum block are accessed in parallel and are stored in different memories. One bit-sum block is read per clock cycle, and one block is re-written per clock cycle after the processing delay of the routing networks and CFUs.

2.8.2 Architecture overview

An overview of the architecture is shown in Fig. 2.26. The data-path consists of a number of bit-sum memories, a cyclic shifter network, CFUs, and a decision memory. The bit-sum memories store the pseudo-posterior bit probabilities as computed by (2.18). In each clock cycle, a bit-sum block is read from the bit-sum memories and routed to the correct CFU by the cyclic shifter, and the CFU computes updated bit-sums that are rewritten to the bit-sum memories. Also, hard decisions are made on the updated bit-sums and written to the decision memory. The CFUs also compute the parities of the signs of the input bit-sums, and decoding stops when all parities are satisfied in an iteration, or when a pre-defined maximum number of iterations is reached. The following parameters are defined for the architecture:

• Q is the parallelization factor and the size of the bit-sum blocks
• wq is the bit-sum wordlength
• wr is the wordlength of the min-sum magnitudes
• we is the edge index wordlength

Figure 2.26 Overview of the check-serial min-sum decoder architecture.

Figure 2.27 CFU of the check-serial min-sum decoder.

2.8.3 Check node function unit

The implementation of the CFU is shown in Fig. 2.27.
It contains a min-sum memory and a sign memory, storing the check-to-variable message magnitudes and signs, respectively. The entries in the min-sum memory store the check node parity, the magnitudes of the two smallest variable-to-check messages, and the index of the variable node with the smallest variable-to-check message. The wordlength of the min-sum memory is thus 1 + 2wr + we. At the input of the CFU, the bit message generator subtracts the old check-to-variable message from the input bit-sum, forming the variable-to-check message according to (2.16). This is followed by the min-sum update, which does a sequential search for the two smallest magnitudes of the variable-to-check messages, and the index of the smallest one. The result is a packed description of all the check-to-variable messages according to (2.17) and is stored in the min-sum memory. It is also used by the bit-sum update unit to compute the new bit-sum (2.18) from the old variable-to-check message and the new check-to-variable message. For details of the CFU implementation, the reader is referred to [66].

3 Early-decision decoding

In this chapter, a modification to the sum-product algorithm, referred to as the early-decision decoding algorithm, is introduced. Basically, the idea is to reduce the internal communication of the decoder by early decision of bits that already have a high enough probability of being either zero or one. This is done in several steps in Sec. 3.1: by defining a measure of bit reliabilities, by defining how bits are decided based on their reliabilities, and by defining the processing of decided bits in the decoding process. In Sec. 3.2, a hybrid algorithm is proposed, removing the performance penalty caused by the early-decision decoding algorithm. To estimate the power dissipation savings obtainable with the methods in this thesis, the algorithms have been implemented in VHDL and synthesized to a Xilinx Virtex 5 FPGA.
As a basis for the architecture, the FPGA implementation in Sec. 2.7 (also in [108]) has been used. The codes that can be decoded by the architecture are similar to QC-LDPC codes, but a configurable interconnection network allows some degree of randomness, resulting in the type of codes described in Sec. 2.3.3. The modifications made to implement the early decision algorithm are described in Sec. 3.3, and in Sec. 3.4, it is shown how the early decision decoder can implement the hybrid decoding algorithm. In Sec. 3.5, floating-point and fixed-point simulation results are provided. The floating-point simulations show the performance of the early decision and hybrid algorithms for three different codes, whereas the performance of the hybrid algorithm under fixed wordlength conditions is shown by the fixed-point simulations.

Figure 3.1 Typical reliabilities c_n during decoding of different codes. The gray lines show the individual bit reliabilities, and the thick black lines are the magnitude averages. (a) (N, j, k) = (144, 3, 6), successful decoding; (b) (N, j, k) = (144, 3, 6), unsuccessful decoding; (c) (N, j, k) = (1152, 3, 6), successful decoding; (d) (N, j, k) = (1152, 3, 6), unsuccessful decoding.

In Sec. 3.6, slice utilization and power dissipation estimations are provided from synthesis results for a Xilinx Virtex 5 FPGA. Secs. 3.1.1, 3.1.2, 3.1.3 and 3.2 have been published in [14], [11] and [12], whereas Secs. 3.1.4 and 3.1.5 are previously unpublished. The implementations of the algorithms, Sec. 3.3 and Sec. 3.4, have been published in [13].
3.1 Early-decision algorithm

During the first iterations of the decoding of a block, when the messages are still independent, the probability that the hard-decision variable is wrong is min(q_n^0, q_n^1). Assuming that this value is small, which is the case in the circumstances considered, it can be approximated as min(q_n^0/q_n^1, q_n^1/q_n^0), which can be rewritten as exp(−|λ_n|). Thus, a measure of the reliability of a bit during decoding can be defined as

c_n = |λ_n|,    (3.1)

with the interpretation that the hard-decision bit is correct with a probability of 1 − exp(−c_n). Typical values of c_n during the decoding of a block are shown for two different codes in Fig. 3.1. The slowly increasing average reliabilities in Figs. 3.1(a) and 3.1(c) are typical for successfully decoded blocks. The slope is generally steeper for longer codes, as the reliabilities escalate quickly in local parts of the code graph where the decoder has converged. Similarly, the oscillating behavior in Figs. 3.1(b) and 3.1(d) is common for unsuccessfully decoded blocks. The key point to recognize in these figures is that high reliabilities are unlikely to change sign, i.e., their corresponding bits are unlikely to change value. Early decision decoding is introduced through the definition of a threshold t denoting the minimum required reliability of a bit to consider it sufficiently well determined. The threshold is in the simplest case just a constant, but may also be a function of the iteration, the node degree, or other values. Different choices of the threshold are discussed in Sec. 3.1.1. When a bit is decided, incoming messages to the corresponding node are ignored, and a fixed value is chosen for the outgoing messages. The value will be an approximation of the actual bit probability and will affect the probability computations of bits in subsequent iterations. Different choices of values for decided bits are discussed in Sec. 3.1.2.
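A minimal sketch of the reliability measure (3.1) and the resulting decision rule is given below. The function names and the sign convention (positive LLR mapped to bit 0) are assumptions for illustration, not taken from the architecture.

```python
import math

def decide_bits(llrs, t):
    """Early decision: a bit with reliability c_n = |lambda_n| >= t is frozen
    to its hard-decision value; the rest stay active.

    Returns (decisions, active): decisions[n] is 0/1 for decided bits and
    None for undecided ones; active[n] tells whether node n keeps computing."""
    decisions, active = [], []
    for lam in llrs:
        c = abs(lam)  # reliability, Eq. (3.1)
        if c >= t:
            decisions.append(0 if lam >= 0 else 1)  # assumed sign convention
            active.append(False)
        else:
            decisions.append(None)
            active.append(True)
    return decisions, active

def decision_confidence(c):
    """Probability that a bit with reliability c is correct, ~ 1 - exp(-c)."""
    return 1.0 - math.exp(-c)
```

For example, with threshold t = 4 a reliability of exactly 4 corresponds to a decision that is correct with probability 1 − exp(−4) ≈ 0.98, which illustrates why too low a threshold quickly leads to erroneous decisions.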
Early decision adds an alternative condition for finishing the decoding of a block: when a decision has been made for all bits. However, when the decoder makes an erroneous decision, this tends to lock the algorithm by rendering its adjacent bits undecidable as the bit values in the graph become inconsistent. In Sec. 3.1.4, detecting the introduced inconsistencies in an implementation-friendly way is discussed.

3.1.1 Choice of threshold

The simplest choice of a threshold is a constant t = t0. In a cycle-free graph this is a logical choice, as the pseudo-posterior probabilities q_n are valid. However, in a graph with cycles, the messages will be correlated after g/4 iterations, where g is the girth of the code. As the girth is often at most 6 for codes used in practice, this correlation appears already after the first iteration. A detailed analysis of the impact of the correlation of messages on the probabilities is difficult. However, as the pseudo-posterior probabilities of a cycle-free graph increase with the size of the graph, it can be assumed that the presence of cycles causes an escalation of the pseudo-posterior probabilities. In [11], thresholds defined as t = t0 + t_d·i, where i is the current iteration, are investigated. However, in fixed-point implementations with coarse quantizations and low saturation limits, significant gains are hard to achieve, and therefore only constant thresholds are considered in this thesis.

3.1.2 Handling of decided bits

When the computations of probabilities stop for a bit, the outgoing messages from the node will not represent the correct probabilities, and subsequent computations for nearby nodes will therefore be done using incorrect data. The most straightforward way is to stop computing the variable-to-check messages, and in subsequent iterations use the values from the iteration in which the bit was decided. However, messages must still be communicated along the edges, so the potential gain is small in this case.
There are two other natural choices of variable-to-check messages for early decided bits. One is to set the message α_nm to t or −t, depending on the hard-decision value. The other is to set α_nm to ∞ or −∞, thereby essentially regarding the bit as known during the subsequent iterations. The first choice can be expected to introduce the least amount of noise in the decoding, and thereby yield more accurate decoding results, whereas the second choice can be expected to propagate increased reliabilities and thus decide bits faster. Similar ideas are presented in [112]. In this thesis, results are only provided for the second case, as simulations have shown the difference in performance to be insignificant. Results are provided in Sec. 3.5.

3.1.3 Bound on error correction capability

Considering early decision applied only to the pseudo-posterior probabilities before the first iteration, a lower bound on the error correction capability over a white Gaussian channel with standard deviation σ can be calculated. Before any messages have been sent, the pseudo-posterior probabilities are equal to the prior probabilities, q_n = p_n. An incorrect decision will be made if the reliability of a bit is above the threshold, but the hard-decision bit is not equal to the sent bit. The probability of this occurring can be written

B_bit = P(c_n > t, x̂_n ≠ x_n) = P(γ_n > t, x = −1) + P(γ_n < −t, x = 1).    (3.2)

Assuming that the bits 0 and 1 are equally probable, the error probability can be rewritten

B_bit = P(γ_n > t | x = −1) P(x = −1) + P(γ_n < −t | x = 1) P(x = 1)
      = P(γ_n < −t | x = 1)
      = P(r_n < −tσ²/2 | x = 1)
      = Q((2 + tσ²)/(2σ)).    (3.3)

If no corrections of erroneous decisions are done, an incorrect bit decision will result in the whole block being incorrectly decoded.
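The step from (3.2) to (3.3) can be checked numerically under the standard BPSK/AWGN model with unit symbols and channel LLR γ_n = 2 r_n/σ²; the closed form below is a reconstruction consistent with that model, and the function names are illustrative.

```python
import math

def Q(x):
    # Gaussian tail function: Q(x) = 0.5 * erfc(x / sqrt(2))
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bit_decision_error(t, sigma):
    """Closed form of (3.3): P(gamma_n < -t | x = +1) with gamma_n = 2 r_n / sigma^2,
    r_n ~ N(+1, sigma^2). Since gamma_n < -t  <=>  r_n < -t sigma^2 / 2, this is
    Q((1 + t sigma^2 / 2) / sigma) = Q((2 + t sigma^2) / (2 sigma))."""
    return Q((2.0 + t * sigma ** 2) / (2.0 * sigma))

def bit_decision_error_r(t, sigma):
    """The same probability evaluated directly in the r_n domain."""
    thr = -t * sigma ** 2 / 2.0       # decision boundary in r_n
    return Q((1.0 - thr) / sigma)     # P(r_n < thr) for r_n ~ N(1, sigma^2)
```

The two routes agree to machine precision, confirming that the final Q-expression follows from the conditional probability in (3.2).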
The probability of this happening for a block with N bits is

B_block = 1 − (1 − B_bit)^N.    (3.4)

It can thus be seen that the early decision algorithm imposes a lower bound on the block error probability, which grows with increasing SNR (as B_bit grows with increasing SNR). This is important, as LDPC decoding is sometimes performed outside the SNR region which is practical to characterize with simulations, and the presence of an error floor will then reduce the expected decoder performance.

3.1.4 Enforcing check constraints

When an erroneous decision is made, usually the decoder will neither converge on a codeword nor manage to decide every bit. Thus the decoder will continue to iterate the maximum number of iterations, and if erroneous decisions are frequent, the average number of iterations can increase significantly and severely reduce the throughput of the decoder. However, if the graph inconsistency can be discovered, the decoder can be aborted early and the block can be retried using a less approximative algorithm. The principle of how graph inconsistencies can be discovered is shown in Fig. 3.2.

Figure 3.2 Principle of enforcing check constraints. (a) Variable node update phase: white variable nodes have been decided to 0 and black variable nodes have been decided to 1; v5 is still undecided. (b) Check node update phase: thick arrows indicate enforcing check-to-variable messages.

In the example, bits v0, …, v4 have been decided. Thus, the incoming messages α_{0,0}, …, α_{4,0} to check node c0 will not change, and the message β_{0,5} from check node c0 to variable node v5 will also be constant. As the decided conditions are passed along with the messages α_{0,0}, …
, α_{4,0}, this is known to the check node, and an enforcing condition can therefore be flagged for β_{0,5}, meaning that the receiving variable node must take the value of the message for the graph to remain consistent. Similarly, variable nodes v6, …, v10 are also decided, with the result that message β_{1,5} is also enforcing. However, to satisfy check node c0, v5 must take the value 0, and to satisfy c1, v5 must take the value 1, and as the variable nodes v0, …, v4, v6, …, v10 can not change, an incorrect decision has been discovered. However, it is not known exactly which bit has been incorrectly decided, and thus the error can not simply be corrected. In this thesis, the event is handled by aborting the decoding process, but other options are also possible, e.g., to remove all decisions and continue with the current extrinsic information, or to use the current graph state as a starting point for the general sum-product algorithm. As an enforcing check constraint requires that the receiving variable node take the value of the associated check-to-variable message for the graph to remain consistent, an early decision can be made for the receiving variable node. In floating-point simulations with large dynamic range this is not necessary, as the large magnitude of the check-to-variable message ensures that a decision will be made based on the pseudo-posterior probability. However, in fixed-point implementations, a decision may have to be explicitly made, as the magnitude of the check-to-variable message is limited by the data representation.

3.1.5 Enforcing check approximations

In a practical implementation, using an extra bit to denote enforcing checks is relatively costly, and thus approximations using the absolute value of the check-to-variable messages are suggested. Consider the check node c0 with k inputs v0, …, v_{k−1}, where v0, …, v_{k−2} have been decided.
Assume that the values ±z are used for the messages from the decided variables. The absolute value of the check-to-variable message β_{0,k−1} from c0 to v_{k−1} will then be

|β_{0,k−1}| = | ∏_{n=0}^{k−2} sign(α_{n,0}) · Φ( ∑_{n=0}^{k−2} Φ(|α_{n,0}|) ) |
            = 2 arctanh( exp( ∑_{n=0}^{k−2} log tanh(z/2) ) )
            = 2 arctanh( tanh(z/2)^{k−1} ).    (3.5)

Using this approximation to determine enforcing checks, the check-to-variable message β_{mn} from a k-input check node is enforcing if

|β_{mn}| ≥ 2 arctanh( tanh(z/2)^{k−1} ).    (3.6)

In a fixed-point implementation, the condition can be written

|β_{mn}| ≥ Φ[(k − 1) Φ[z]],    (3.7)

where Φ[x] denotes the discrete function obtained through rounding of the argument and function values of Φ(x), as in Fig. 2.17. Typically, z is the maximum representable value in the data representation, and thus Φ[z] = 0, and Φ[(k − 1)Φ[z]] is also the maximum representable value. However, due to the limited dynamic range of the data, z will be a common value also for variables that are not decided. Thus check-to-variable messages will often be enforcing when the involved variables are not decided, which inhibits performance. An alternative implementation is described in Sec. 3.3.4.

3.2 Hybrid decoding

For the sum-product algorithm, undetected errors are extremely rare, and this property is retained with the early-decision modification. Thus, it is almost always detectable that a wrong decision has been made, but not for which bit. However, the unsuccessfully decoded block can be retried using the regular sum-product algorithm. As the block is then decoded twice, the resources required for that block will be significantly increased, and there is therefore a trade-off adjusted by the threshold level. Lowering the threshold will increase the number of bit decisions, but also increase the probability that a block is unsuccessfully decoded and will require an additional decoding pass.
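The resulting two-pass control flow can be sketched as follows; `ed_decode` and `sp_decode` are hypothetical callables standing in for the early-decision and sum-product decoding passes, and the return convention is an assumption for illustration.

```python
def hybrid_decode(llrs, t, max_iter, ed_decode, sp_decode):
    """Hybrid decoding sketch: try the early-decision pass first, and on
    failure (typically an inconsistent graph after an erroneous decision)
    retry the block with the plain sum-product algorithm.

    ed_decode(llrs, t, max_iter) and sp_decode(llrs, max_iter) are assumed
    to return (success, bits, iterations)."""
    ok, bits, it_ed = ed_decode(llrs, t, max_iter)
    if ok:
        return bits, it_ed
    # Second pass: threshold effectively infinite, i.e. no early decisions.
    ok, bits, it_sp = sp_decode(llrs, max_iter)
    return bits, it_ed + it_sp
```

The total iteration count of a re-decoded block is the sum of both passes, which is why the maximum latency doubles while the average latency grows only with the re-decoding probability.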
Determining the optimal threshold level analytically is difficult, and in this thesis simulations are used for a number of codes to determine its impact. Using an adequate threshold, redecoding a block is relatively rare, and thus the average decoding latency is only slightly increased. However, the maximum latency is doubled, which might make the hybrid decoding algorithm unsuitable for applications sensitive to the communication latency.

3.3 Early-decision decoder architecture

In this section, the changes made to the sum-product reference decoder to realize the early decision decoder are explained. In the early decision decoder architecture, additional registers are introduced which are not shown in the reference architecture. However, as the implementation is pipelined in several stages, the registers are present also in the reference architecture, and therefore do not constitute additional hardware resources for the early decision architecture.

3.3.1 Memory block

In the early-decision decoder architecture, additional memories DIS RAM of size L/8 × 8 bits are added to the memory blocks to store the early decisions, as seen in Fig. 3.3. The reason that the memories are implemented with a wordlength of eight is that decisions must be read for all CNUs in the check node update phase, which would require three-port memories if a wordlength of one were used. However, as data are always read sequentially due to the quasi-cyclic structure of the codes, parallelizing the memory accesses by increasing the wordlength is easy. In addition, the DEC RAMs have been changed to the same size to be able to utilize the same addressing mechanism. The parallelized accesses to the hard decision and early decision memories are handled by the ED LOGIC block, which uses an additional output signal from the VNU to determine early decisions.
In the VNU update phase, the EXT RAMs are addressed with the same addresses, and the three-bit disabled signal from the ED LOGIC block is identical for all memories. Thus, when a bit is disabled, the registers on the memory outputs are disabled, and the VNU will be idle. In the CNU update phase, the wordlength of the signals from the EXT RAMs to the CNUs has been increased by one to include the disabled bits.

Figure 3.3 Memory block of the (3, k)-regular early-decision decoder, containing the memories used during decoding and the variable node processing unit.

As data are read from different memory positions in the three memories, the disabled bits are not necessarily identical. When a bit is disabled, the registers on the memory outputs are disabled, and only the hard decision bit is used in the CNU computations.

3.3.2 Node processing units

In the variable node processing unit, shown in Fig. 3.4, the magnitude of the pseudo-posterior likelihood is compared to the threshold value to determine if the bit will be disabled in further iterations. In the check node processing unit, shown in Fig. 3.5, MUXes at the input determine whether the variable-to-check messages or the hard decision bits are used for the computations. At the CNU outputs, registers are used to disable driving of the check-to-variable messages for bits that are disabled. If enforcing check constraints, as defined in Sec. 3.1.4, are implemented, additional changes are made to the VNU and CNU. These are described in Sec. 3.3.4.

3.3.3 Early decision logic

The implementation of the ED LOGIC block is shown in Fig. 3.6, and consists mostly of a set of shift registers. During the VNU phase, hard decision and early decision bits are read from DEC RAM and DIS RAM and stored in one shift register each.
Newly made hard decisions and early decisions arrive from the VNU. However, for the bits that are disabled, the VNU computations are turned off, and thus the stored values from previous iterations are recycled. During the CNU phase, the memories are only read. However, the three address generators produce different addresses, and thus three shift registers with different contents are used.

Figure 3.4 Implementation of the variable node processing unit for the (3, k)-regular early-decision decoder.

Because of this implementation, an additional constraint is imposed on the code structure. As DEC RAM and DIS RAM are only single-port memories, simultaneous reading from several memory locations is not allowed. The implication of this is the following three constraints:

C^1_{x,y} ≠ C^2_{x,y} (mod 8)
C^1_{x,y} ≠ C^3_{x,y} (mod 8)    (3.8)
C^2_{x,y} ≠ C^3_{x,y} (mod 8)

However, if needed, the architecture can be modified to remove the constraints either by using dual-port memories, or by introducing an additional pipeline stage.

Figure 3.5 Implementation of the check node processing unit for the (3, k)-regular early-decision decoder.
Figure 3.6 Implementation of the early decision logic for the (3, k)-regular early-decision decoder.

3.3.4 Enforcing check constraints

The principle of enforcing check constraints is described in Sec. 3.1.4, and an approximation in Sec. 3.1.5. The enforcing check constraint approximation has been simulated on a bit-true level, with some additional changes to improve the performance. The modifications, and their associated hardware requirements, are explained in detail in this section. First, with a requirement of 3k − 4 AND gates, the enforcing conditions of the check-to-variable messages are computed explicitly in the CNU from the disabled bits of the incoming variable-to-check messages. Then, using k w-bit 2-to-1 MUXes, the magnitude of the check-to-variable message is set to the highest possible value in the representation for enforcing messages. Also, in order to make this value unique, the function value of Φ[x] for x = 0 is reduced by one. As the slope of the continuous Φ(x) function is steep for x close to zero, this change can be expected to introduce a small amount of errors, as the "true" value of the check-to-variable message is anywhere between Φ(2^{−(w_f+1)}) and infinity. Using this change, enforcing check constraints can be detected by the receiving VNU, as the magnitude of the incoming check-to-variable message is maximized only for check-to-variable messages that are enforcing. Second, the VNU will use the knowledge of enforcing check constraints both to detect contradicting enforcing check constraints, and to make additional early decisions for variables that receive enforcing check-to-variable messages.
The contradiction detection can be straightforwardly implemented for a three-input VNU using 3(w − 1) + 6 AND gates and 3 XOR gates, where w is the wordlength of the check-to-variable messages. The logic for the additional early decisions can be implemented using, e.g., three one-bit two-to-one MUXes. Third, the CNU detects the case where all incoming variable-to-check messages are decided, but the parity-check constraint is not satisfied. This can be done using one additional AND gate to determine if all incoming messages are decided, and one AND gate to determine the inconsistency.

3.4 Hybrid decoder

With the early decision decoder as defined in this chapter, the sum-product algorithm can be implemented using the same hardware simply by adjusting the threshold. Thus, the hybrid decoder can be implemented with some controlling logic to clear the extrinsic message memories and change the threshold level when a decoder failure occurs for the early decision decoder. As seen in Sec. 3.5.3, the hybrid decoder offers gains only in the high SNR region, as most blocks will fail to be decoded by both decoders in the low SNR region. Thus, if channel information is available, the SNR estimation can be used to determine if the receiver will use the early decision algorithm or not, and could also be used to set a suitable threshold. However, this idea has not been further investigated in this thesis.

3.5 Simulation results

3.5.1 Choice of threshold

Figure 3.7 shows early decision decoding results for a (3, 6)-regular rate-1/2 code with a block length of 1152 bits. In all decoding instances, the maximum number of iterations was set to 100, and the early decision algorithm with thresholds of 4, 5 and 6 was simulated. While 100 iterations is generally not practically feasible, the number was chosen to highlight the behaviour of the sum-product algorithm, and the complications resulting from the early decision modification.
It is obvious that early decision impacts the block error rate, shown in Fig. 3.7(a), significantly more than the bit error rate, shown in Fig. 3.7(b). This is to be expected, as the decoding algorithm may correct most of the bits even if some bit is erroneously decided. The behaviour is consistent through all performed simulations, and thus only the block error rate is shown in most cases. If the LDPC code is used as an inner code in a concatenated error correction system, the small number of bit errors introduced by the early decision decoder can be corrected by the decoder for the outer code. In other cases, hybrid decoding, as explained in Sec. 3.2, may be efficiently utilized.

Figure 3.7 Early decision decoding results for the rate-1/2 regular code with (N, j, k) = (1152, 3, 6). (a) Block error rate; (b) bit error rate; (c) average number of internal messages; (d) average number of iterations.

The average number of internal messages sent per block is shown in Fig. 3.7(c), and the average number of iterations is shown in Fig. 3.7(d). It is evident that, whereas the internal communication is reduced with decreasing thresholds, the decoding time is increased, as the decoder may get stuck when an incorrect decision is made. In the following figures, usually a measure of the relative communication reduction is shown instead of the absolute number of internal messages. The measure is defined as comm. red.
= 1 − C_mod/C_ref, where C_mod is the number of internal messages using the modified algorithm, and C_ref is the number of internal messages using the reference algorithm. Results of reducing the number of iterations using the enforcing check constraints of Sec. 3.1.4 are shown in Sec. 3.5.2. The performance of the early decision algorithm for a rate-3/4 (N, j, k) = (1152, 3, 12) code is shown in Fig. 3.8(a), and for a rate-1/2 (N, j, k) = (9216, 3, 6) code in Fig. 3.8(b). The performance for the higher-rate code is significantly better than for the rate-1/2 code of the same block size. Similarly, the performance for the longer code is worse than for the shorter. It is intuitive that codes with higher error correction capabilities (such as lower rate and larger block size) allow noisier signals, which increases the chance of making wrong decisions and thereby reduces performance.

Figure 3.8 Performance of the early decision decoding algorithm on codes with high rates and codes with long block lengths. (a) Rate-3/4 code, (N, j, k) = (1152, 3, 12); (b) rate-1/2 code, (N, j, k) = (9216, 3, 6).

Figure 3.9 Error floor resulting from early decision decoding and making decisions only in the initial iterations. (a) t = 3; (b) t = 4.

As described in Sec. 3.1.3, early decision decoding results in an error floor as defined by (3.4).
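The floor defined by (3.4) is easy to evaluate numerically; the B_bit value used below is an arbitrary placeholder for illustration, not a simulated figure.

```python
import math

def block_floor(b_bit, n):
    """Lower bound on the block error rate from Eq. (3.4): a single
    uncorrected bit decision error dooms the whole block of n bits."""
    return 1.0 - (1.0 - b_bit) ** n

# For small B_bit the floor is well approximated by 1 - exp(-N * B_bit),
# i.e. it initially scales roughly linearly with the block length N.
# E.g. a placeholder B_bit = 1e-5 gives a floor of about 1.1% for N = 1152.
```

This also makes explicit why the longer (9216, 3, 6) code is more sensitive to early decisions than the (1152, 3, 6) code: at the same per-bit decision error probability, the floor grows with N.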
The error floor has been computed for two different thresholds for the (N, j, k) = (1152, 3, 6) code, and is shown for t = 3 in Fig. 3.9(a) and for t = 4 in Fig. 3.9(b). As seen, the theoretical limit is tight when decisions are performed only on the initial data (iters = 0). However, if decisions are performed in later iterations, the block error rate increases rapidly. There are two reasons for this. First, the messages no longer have a Gaussian distribution after the first iteration, and the decision error probability is therefore no longer valid. Second, the dependence of the messages caused by cycles in the graph causes the reliabilities of the bits to escalate.

Figure 3.10 Early decision decoding of the (N, j, k) = (1152, 3, 6) code using enforcing checks. (a) Block error rate; (b) average number of iterations.

Figure 3.11 Early decision decoding of the (N, j, k) = (1152, 3, 12) code using enforcing checks. (a) Block error rate; (b) average number of iterations.

3.5.2 Enforcing check constraints

Enforcing check constraints were introduced in Sec. 3.1.4 as a way of reducing the number of iterations required by the early decision algorithm when the decoding of a block fails. The results are shown for the (N, j, k) = (1152, 3, 6) code in Fig. 3.10 and for the (N, j, k) = (1152, 3, 12) code in Fig. 3.11. By comparing with Fig. 3.7(a) and Fig.
3.8(a) it can be seen that the error correction capability of the decoder is not additionally worsened by the enforcing check constraint modification. However, the average number of iterations is significantly reduced, and for low SNRs the early decision algorithm requires even fewer iterations than the standard algorithm, as graph inconsistencies are discovered for blocks that the standard algorithm will not be able to decode successfully.

Figure 3.12 Early decision decoding of the (N, j, k) = (1152, 3, 6) and (N, j, k) = (1152, 3, 12) codes using the enforcing check approximation, t = 4. (a) (N, j, k) = (1152, 3, 6). (b) (N, j, k) = (1152, 3, 12). Each panel shows block error rate and average number of iterations against Eb/N0 [dB] for the EC, ECA, and PP algorithms.

The enforcing check approximation is investigated in Figs. 3.12(a) and 3.12(b) for the (N, j, k) = (1152, 3, 6) and (N, j, k) = (1152, 3, 12) codes, respectively. It is apparent that the approximation does not reduce the decoder performance, and thus that additional state information to attribute check-to-variable messages as enforcing is not needed.

3.5.3 Hybrid decoding

In this section, results using the hybrid algorithm explained in Sec. 3.2 are presented. The maximum number of iterations was set to 100 for both the early decision pass and the standard algorithm pass. For the early decision algorithm, enforcing check constraints have been used to reduce the number of iterations. Simulations have been done on the three example codes with (N, j, k) = (1152, 3, 6), (1152, 3, 12), and (9216, 3, 6). For each code, first the SNR has been kept constant while the threshold has been varied.
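The hybrid scheme can be summarized as a two-pass control flow: the early-decision pass is tried first, and only if it fails to produce a valid codeword is the block re-decoded with the standard sum-product algorithm. A hedged sketch (the two decoder functions are placeholders passed in as arguments, not the thesis implementation):

```python
def hybrid_decode(rx_llr, ed_decode, sp_decode, max_iters=100):
    """Two-pass hybrid decoding: early decision first, standard
    sum-product as fallback. Each pass returns (codeword_or_None,
    iterations_used); the iteration counts of the two passes are
    added, matching how the average number of iterations is
    reported in the simulations."""
    word, it_ed = ed_decode(rx_llr, max_iters)
    if word is not None:          # parity checks satisfied
        return word, it_ed
    word, it_sp = sp_decode(rx_llr, max_iters)
    return word, it_ed + it_sp
```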
It can be seen that setting the threshold too low increases the communication (reduces the communication reduction), as the early decision algorithm will fail on most blocks. Similarly, setting the threshold too high also increases the communication, as bits with high reliabilities are not decided. Thus, for a fixed SNR there is an optimal choice of the threshold with respect to the internal communication. For each code, SNRs corresponding to block error rates of around 10^−2 and 10^−4 with the standard algorithm have been used. Using the optimal threshold for a block error rate of 10^−4, simulations with varying SNR have been done.

Figure 3.13 Decoding of the (N, j, k) = (1152, 3, 6) code using the hybrid algorithm, showing performance as a function of the threshold for SNRs of 1.75 and 2.25 dB. (a) Block error rate and communication reduction. (b) Average number of iterations, for the hybrid and standard (PP) algorithms.

Figure 3.14 Decoding of the (N, j, k) = (1152, 3, 6) code using the hybrid algorithm, showing performance as a function of SNR for thresholds 4 (optimal) and 5. (a) Block error rate and communication reduction. (b) Average number of iterations.

It can be seen that the block error rate performance of the hybrid algorithm is indistinguishable from that of the standard algorithm (in the SNR region simulated), whereas the internal communication of the decoder is much lower.
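The communication reduction plotted in these figures follows the definition given at the start of this section, R = 1 − Cmod/Cref. A minimal helper:

```python
def comm_reduction(c_mod, c_ref):
    """Relative reduction of internal decoder messages:
    R = 1 - C_mod / C_ref.  Negative values mean the modified
    algorithm communicates more than the reference algorithm,
    as happens for badly chosen thresholds."""
    return 1.0 - c_mod / c_ref

# Example: 40% fewer internal messages than the reference decoder.
r = comm_reduction(6.0e6, 1.0e7)
```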
In fact, for the rate-1/2 codes, no decoder errors occurred, and therefore all errors introduced by wrong decisions of the early decision algorithm were detected. However, for the rate-3/4 code, undetected errors did occur, and thus it is possible that the performance of the hybrid algorithm is inferior to that of the standard algorithm. Simulations for the (N, j, k) = (1152, 3, 6) code are presented in Figs. 3.13 and 3.14, for the (N, j, k) = (1152, 3, 12) code in Figs. 3.15 and 3.16, and for the (N, j, k) = (9216, 3, 6) code in Figs. 3.17 and 3.18. In the plots, the average number of iterations denotes the sum of the number of iterations required for the early decision pass and the standard algorithm pass.

Figure 3.15 Decoding of the (N, j, k) = (1152, 3, 12) code using the hybrid algorithm, showing performance as a function of the threshold for SNRs of 2.8 and 3.4 dB. (a) Block error rate and communication reduction. (b) Average number of iterations.

Figure 3.16 Decoding of the (N, j, k) = (1152, 3, 12) code using the hybrid algorithm, showing performance as a function of SNR for thresholds 3.75 (optimal) and 4.5. (a) Block error rate and communication reduction. (b) Average number of iterations.
3.5.4 Fixed-point simulations

The performance of a randomized QC-LDPC code with parameters (N, j, k) = (1152, 3, 6) has been simulated using a fixed-point implementation. Among an ensemble of 300 codes constructed using random parameters, the code with the best block error correcting performance at an SNR of 2.6 dB was selected. The random parameters were defined in Sec. 2.7, and are the initial values C^2_{x,y} and C^3_{x,y} of the address generators AG^2_{x,y} and AG^3_{x,y}, the permutation functions Ψ_{l,n} and Ω_{l,m}, and the contents of the row and column permutation ROMs for π_3, ψ_{l,n} and ω_{l,m}. Furthermore, the initial values of the address generators were chosen to satisfy (2.25) to ensure high girth, as well as (3.8) to ensure that an early decision decoder implementation is possible.

Figure 3.17 Decoding of the (N, j, k) = (9216, 3, 6) code using the hybrid algorithm, showing performance as a function of the threshold for SNRs of 1.3 and 1.5 dB. (a) Block error rate and communication reduction. (b) Average number of iterations.

Figure 3.18 Decoding of the (N, j, k) = (9216, 3, 6) code using the hybrid algorithm, showing performance as a function of SNR for thresholds 5 (optimal) and 7. (a) Block error rate and communication reduction. (b) Average number of iterations.

In Fig.
3.19, the performance of the randomized QC-LDPC code is compared with that of the random LDPC code used in the earlier sections. It is seen that the performances are comparable down to a block error rate of 10^−6. In the same figure, the performance using a fixed-point implementation with wordlengths of 5 and 6 bits is shown. For both fixed-point simulations, two integer bits were used, as increasing the number of integer bits did not increase the performance.

Figure 3.19 Decoding of the (N, j, k) = (1152, 3, 6) randomized QC-LDPC code, using the floating-point sum-product algorithm and fixed-point precision. (a) Block error rate. (b) Average number of iterations. (w, wf) denotes quantization to w bits, of which wf bits are fraction bits.

Down to a block error rate of 10^−4, both fixed-point representations show a performance hit of less than 0.1 dB. The performance difference between wordlengths of 5 and 6 bits is rather small, which is consistent with previous results [111], and thus w = 5 has been used for the following simulations. It is apparent that the fixed-point implementations show an increased error floor, which is likely due to clipping of the message magnitudes. In Fig. 3.20, the performance of the hybrid early decision sum-product decoder is simulated for SNRs of 2.0 and 2.5 dB, while the threshold is varied. Similarly to the floating-point case, for a fixed SNR there is an optimal choice of the threshold that decreases the internal communication of the decoder. The limited increase of the number of iterations for low thresholds is due to the use of enforcing check constraints.
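The (w, wf) fixed-point format can be emulated in software by scaling, rounding to the quantization step, and saturating. An illustrative sketch (my own model, not the thesis implementation; the saturation step corresponds to the clipping of message magnitudes mentioned above):

```python
def quantize(x, w=5, wf=2):
    """Quantize x to a w-bit two's-complement value with wf fraction
    bits, saturating at the representable range."""
    step = 2.0 ** -wf
    lo = -(2 ** (w - 1)) * step          # most negative value
    hi = (2 ** (w - 1) - 1) * step       # most positive value
    q = round(x / step) * step           # round to nearest step
    return min(max(q, lo), hi)           # saturate (clip)

# (w, wf) = (5, 2): range [-4.0, 3.75] in steps of 0.25.
```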
The fixed-point implementation uses a different scaling of the input due to the limited range of the data representation, and thus the thresholds used in the fixed-point simulations are not directly comparable to those used in the floating-point simulations. Keeping the threshold constant and varying the SNR yields the results in Fig. 3.21, where it can be seen that the use of early decision gives a somewhat lower block error probability than using only the sum-product algorithm. This can be explained by the fact that the increased reliabilities of the early-decided bits increase the likelihood that the surrounding bits take consistent values. Thereby, in some cases, the early decision decoder converges to the correct codeword where the sum-product decoder would fail. Regarding the reduction of internal communication, results close to those obtained using the floating-point data representation are achieved.

Figure 3.20 Decoding of the (N, j, k) = (1152, 3, 6) randomized QC-LDPC code using (w, wf) = (5, 2) precision and the hybrid algorithm, as a function of the threshold for SNRs of 2.0 and 2.5 dB. (a) Block error rate and communication reduction. (b) Average number of iterations.
Figure 3.21 Decoding of the (N, j, k) = (1152, 3, 6) randomized QC-LDPC code using (w, wf) = (5, 2) precision and the hybrid algorithm, as a function of SNR for thresholds 4 and 5. (a) Block error rate and communication reduction. (b) Average number of iterations.

3.6 Synthesis results

The reference sum-product decoder and the early decision decoder, as described in Sec. 2.7 and Sec. 3.3, respectively, have been implemented in a Xilinx Virtex 5 FPGA for the randomized QC-LDPC code in the previous section. A threshold of t = 8 was used at an SNR of 3 dB. The utilization of the FPGA is shown in Table 3.1, along with energy estimations obtained with Xilinx's XPower power analyzer. As seen, the early-decision algorithm reduces the logic energy dissipation drastically. Unfortunately, the control overhead is relatively large, resulting in a slice utilization overhead of 45% and a net energy reduction of only 16% when the increased clock distribution energy is considered.

                                  Reference decoder    Early-decision decoder
Slice utilization [%]                        14.5                      21.0
Number of block RAMs                           54                        54
Maximum frequency [MHz]                       302                       220
Logic energy [pJ/iteration/bit]               499                       291
Clock energy [pJ/iteration/bit]               141                       190

Table 3.1 Synthesis results for the reference sum-product decoder and the early-decision decoder in a Xilinx Virtex 5 FPGA.

However, some points are worth mentioning:

• In order not to restrain the architecture to short codes, the hard decision and early decision memories and accompanying logic in each memory block were designed in a general and scalable way. However, with the code size used, this resulted in two four-byte memories with eight eight-bit shift registers used to schedule the memory accesses (cf. Fig. 3.6). As the controlling logic is independent of the code size, it can be expected that the early decision overhead would be reduced with larger codes. Alternatively, for short codes, implementing the hard decision and early decision memories using individual registers might also significantly reduce the overhead.
• The addresses to the hard decision and early decision memories were generated individually by the address generators, along with control signals to the shift registers. However, a more efficient implementation utilizing sharing between different memory blocks is expected to be possible.

4 Rate-compatible LDPC codes

In this chapter, new results related to rate-compatible LDPC codes are presented. In Sec. 4.1, an ILP optimization approach to the design of puncturing patterns for QC-LDPC codes is proposed. The algorithm generates puncturing sequences that are optimal in terms of recovery time and eliminates the element of randomness that is inherent in some of the greedy algorithms [46, 47]. In Sec. 4.2, an improved algorithm for the decoding of rate-compatible LDPC codes is presented. The algorithm systematically creates a new parity-check matrix used by the decoding algorithm. In the new matrix, the punctured nodes are not present, and thereby the convergence speed is significantly improved. In Sec. 4.3, a check-serial architecture for the decoding of rate-compatible LDPC codes using the proposed check-merging algorithm is presented. The proposed architecture is an extension of the work done in [66]. Finally, Sec. 4.4 and Sec. 4.5 contain simulation results of the check-merging decoding algorithm and synthesis results of the decoder architecture, respectively.

Previous work in the area is presented in [43] and [60]. In [43], a method of designing rate-less codes through check node splitting is proposed. Using the method, high-degree check nodes are split in two with equal weight to add additional parity bits to the code. In [60], a decoding algorithm using check node merging for dual-diagonal LDPC codes is suggested and the algorithm is analyzed using density evolution. Section 4.1 has been published in [15], and Sec. 4.3 has been published in [10].

Figure 4.1 Recovery tree of a 2-SR node, with 1-SR and 2-SR variable nodes and 1-SE and 2-SE check nodes. The filled nodes denote punctured bits in the codeword. Bits that are recovered after k iterations are denoted k-SR. Recovering check nodes are denoted k-SE.

4.1 Design of puncturing patterns

The concept of k-SR nodes was explained in Sec. 2.5.1. Here, an ILP-based optimization algorithm for the design of puncturing sequences is proposed. The algorithm uses the number of k-SR nodes as a cost measure. The primary goal of the algorithm is to optimize the puncturing pattern for fast recovery of the punctured nodes. Given a parity-check matrix, it maximizes the number of 1-SR nodes. Following that, it maximizes the number of punctured nodes for increasing values of K, given the puncturing patterns obtained in earlier iterations and the constraint that only k-SR nodes with k ≤ K are allowed. In Sec. 4.1.1, the needed notation is introduced. In Sec. 4.1.2, the optimization problem is described, with variable definitions, goal function and constraints. Also, design cuts that decrease the optimization time are discussed. Then, the proposed algorithm for designing the puncturing patterns using the ILP optimization problem is presented in Sec. 4.1.3.

4.1.1 Preliminaries

When a message passing algorithm is used to decode a punctured block, the check nodes adjacent to a punctured bit node will propagate null information until the punctured variable node has received a non-zero message from another check node. Thus, the concept of recoverable nodes (k-SR nodes) was introduced in Sec. 2.5.1. Here, the concept is extended to check nodes in order to enable the optimization problem formulation. As shown in Fig. 4.1, a check node is denoted 1-step enabled (1-SE) if it involves a single punctured node and is used to recover that node in the first iteration. Furthermore, a punctured variable node is denoted 1-SR if it is connected to a
1-SE check node. Generally, a check node is denoted k-SE if it begins to communicate non-null information in the kth iteration and has been chosen to recover a punctured node. A punctured variable node is denoted k-SR if it is connected to a k-SE check node but not to any l-SE check node with l < k. It is possible to create puncturing patterns that contain nodes that will never be recovered. Such nodes never change value and severely degrade the decoding performance. In [46], it was shown that decreasing the recovery time increases the reliability of the recovery information, and it was then conjectured that increasing the number of nodes with fast recovery enhances the performance of the puncturing pattern.

4.1.2 Optimization problem

The optimization problem is formulated to maximize the number of k-SR variable nodes, where k < K for a given value of K. As secondary optimization criteria, the algorithm chooses k-SR variable nodes and k-SE check nodes such that first the sum of the degrees of the k-SR variable nodes is minimized, and then the sum of the degrees of the k-SE check nodes is minimized, as low-degree nodes were reported to give good results in [83]. The following constants are defined:

Hb: the base matrix of the code, as defined in Sec.
2.3.2
M: the number of rows in Hb
N: the number of columns in Hb
Ha: the z = 1 expansion of Hb (each −1 element replaced by 0 and each non-negative element replaced by 1)
Cp: the cost for each non-punctured node
Cr: the recoverability cost
Cv: the variable node degree cost
Cc: the check node degree cost
K: the maximum allowed recovery delay (maximum k-SR node)
F: the set of nodes for which puncturing is forced

Furthermore, the set Qm of variable nodes involved in check node m and the set Rn of check nodes involved in variable node n are defined as

Q_m = \{n : H_a(m, n) = 1\}   (4.1)
R_n = \{m : H_a(m, n) = 1\}   (4.2)

The following binary variables are defined:

p_n: 1 if variable node n is punctured, 0 otherwise
s_{k,n}: 1 if variable node n is a k-SR node, 0 otherwise, k = 1, ..., K
c_{k,m}: 1 if check node m is a k-SE node, 0 otherwise, k = 1, ..., K
r_{k,n}: 1 if variable node n is still unrecovered at level k (after k iterations), 0 otherwise, k = 0, ..., K

The goal function G is to be minimized and is written

G = C_p G_p + C_r G_r + C_v G_v + C_c G_c,   (4.3)

where

G_p = \sum_{n=0}^{N-1} (1 - p_n)   (4.4)

G_r = \sum_{k=1}^{K} \sum_{n=0}^{N-1} k\, s_{k,n}   (4.5)

G_v = \sum_{n=0}^{N-1} |R_n|\, p_n   (4.6)

G_c = \sum_{k=1}^{K} \sum_{m=0}^{M-1} |Q_m|\, c_{k,m}   (4.7)

Gp maximizes the number of punctured nodes, whereas Gr ensures as fast recovery as possible. The secondary objectives are achieved by Gv and Gc, which minimize the degrees of the punctured variable nodes and of the check nodes used for recovery, respectively. By setting Cp ≫ Cr ≫ Cv ≫ Cc it can be ensured that the partial objectives are achieved with falling priorities. The goal function is to be minimized subject to the following constraints:

Initialize unrecovered nodes: For n = 0, ..., N − 1, let

r_{0,n} = p_n.   (4.8)

At level 0, only the non-punctured nodes (0-SR nodes) are recovered.

Recover nodes: For k = 0, ..., K − 1 and n = 0, ..., N − 1,

r_{k+1,n} \geq r_{k,n} - s_{k,n}.   (4.9)

This constraint makes sure that a punctured node remains unrecovered until the level where it is marked as a k-SR node.

Require recovered neighbours: For m = 0, ..., M − 1, k = 1, ..., K and n = 0, ..., N − 1,

r_{k,n} \geq -|Q_m| (2 - s_{k,n} - c_{k,m}) + \sum_{n' \in Q_m \setminus n} r_{k-1,n'}.   (4.10)

Thus, if check node m is used to recover variable node n at level k, the first term of the right-hand side is zero and the variable node is allowed to be recovered (r_{k,n} = 0) only if the check node's other neighbours are already recovered (the sum is zero). If check node m is not k-SE or variable node n is not k-SR, the condition imposes no additional constraints.

Require k-SE neighbour: For k = 1, ..., K and n = 0, ..., N − 1,

s_{k,n} \leq \sum_{m \in R_n} c_{k,m}.   (4.11)

The constraint states that, in order for a variable node n to be k-SR, one of its check node neighbours must be k-SE.

Require recovery of all nodes: For n = 0, ..., N − 1,

r_{K,n} = 0,   (4.12)

which states that all variable nodes must be recovered at level K.

Forced punctures: For n ∈ F, p_n = 1. Thus, for the nodes in the set F, puncturing is forced. However, the recovery level is not constrained.

In addition to the optimization constraints, two cuts can be used to decrease the optimization time without affecting the optimality of the results.

Cut 1: For m = 0, ..., M − 1, k = 0, ..., K − 1 and n ∈ Qm,

c_{k+1,m} \geq r_{k,n} - \sum_{n' \in Q_m \setminus n} r_{k,n'}.   (4.13)

The cut implies that, whenever a variable node n is the only unrecovered node in a check node m at level k, check node m must be (k + 1)-SE.

Cut 2: For m = 0, ..., M − 1, k = 0, ..., K − 1 and n ∈ Qm,

s_{k+1,n} \geq r_{k,n} - \sum_{n' \in Q_m \setminus n} r_{k,n'}.   (4.14)

The cut implies that, whenever a variable node n is the only unrecovered node in a check node at level k, variable node n must be (k + 1)-SR.
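Solving the formulation above requires an ILP solver. As a much simpler, hedged illustration of the underlying idea, the following greedy sketch selects a set of punctured nodes that are all 1-SR: each accepted node gets a dedicated 1-SE check whose other neighbours are unpunctured, and no punctured node is allowed to touch a check already used for recovery. It is not the proposed algorithm and makes no optimality claim:

```python
def greedy_1sr_puncturing(H):
    """Greedily pick punctured variable nodes that are 1-SR.
    H is a 0/1 matrix given as a list of rows.
    Returns (punctured, used), where used[i] is the 1-SE check
    node chosen to recover punctured[i]."""
    M, N = len(H), len(H[0])
    punctured, used = [], []
    for n in range(N):
        # Puncturing n must not disturb checks already used for recovery.
        if any(H[m][n] for m in used):
            continue
        for m in range(M):
            if H[m][n] and m not in used and all(
                    n2 == n or n2 not in punctured
                    for n2 in range(N) if H[m][n2]):
                punctured.append(n)
                used.append(m)   # check m is 1-SE and recovers n
                break
    return punctured, used
```

On a small example the greedy pass punctures nodes 0 and 2, recovered by checks 0 and 1, and rejects node 5 because its only check neighbours already-punctured nodes.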
4.1.3 Puncturing pattern design

The following algorithm is proposed for the design of puncturing sequences for a class of codes based on a base matrix Hb with different expansion factors z.

1. Initialize the puncturing sequence p = () and set the maximum recovery delay K = 1. Set the costs Cp ≫ Cr ≫ Cv ≫ Cc.
2. Set the forced puncturing pattern F = p and solve the optimization problem on Hb.
3. Let q denote the puncturing pattern solution. Order q \ p first by increasing recovery delay and then by increasing node degree, and append the sequence to p.
4. If the pre-defined maximum recovery delay K has been reached, quit. Otherwise, increase K and go to Step 2.

The result of the above procedure is an ordered sequence p defining a puncturing order for the blocks of the base matrix Hb. For a parity-check matrix H with expansion factor z, the puncturing sequence pz is defined by replacing each element a in p with the sequence za, ..., z(a + 1) − 1. Simulation results of the proposed puncturing pattern design algorithm are presented in Sec. 4.4.1.

4.2 Check-merging decoding algorithm

Assume that a code defined by a parity-check matrix H of size M × N is used in a communication system. Also assume that the signal conditions allow a puncturing pattern p = (p0, ..., pL−1) to be used. The sum-product decoding algorithm is defined on H and initializes the prior log-likelihood ratios γp0, ..., γpL−1 of the punctured nodes to zero. However, by merging the check nodes involving each punctured bit and purging the rows and columns associated with them, a modified parity-check matrix HP of size (M − L) × (N − L) can be defined. Decoding using HP instead of H can in many circumstances reduce the convergence time of the decoding algorithm significantly, while also reducing the computational complexity of each iteration. The main motivation for the proposed algorithm is systems with varying signal conditions, i.e., most wireless systems.
In such systems, rate-compatible codes can be used to better adapt the protection to the current channel conditions. The result is that most of the transmitted blocks use some amount of puncturing, which makes it worthwhile to increase the throughput of the decoder for punctured blocks.

4.2.1 Defining HP

In the following, let all matrix operations be defined using modulo-2 arithmetic. HP can be defined if and only if the columns of H corresponding to the punctured bits p are linearly independent. The implications of this requirement, and how to construct puncturing patterns that satisfy it, are discussed in Sec. 4.2.3. Consider the sub-matrix B of size M × L formed by taking the columns of H corresponding to the punctured bits p. By assumption, the columns of B are linearly independent. Thus, B has a set of L rows that are linearly independent. Let c = (c0, ..., cL−1) denote such a set. c is then the set of check nodes to purge along with the punctured variable nodes. Given a set of punctured variable nodes p = (p0, ..., pL−1) and purged check nodes c = (c0, ..., cL−1), perform column and row reordering of H such that the punctured variable nodes p and the purged check nodes c end up at the right and at the bottom, respectively. Denote the resulting matrix H̃. H̃ thus has the following structure:

\tilde{H} = \begin{pmatrix} H_A & B_A \\ H_B & B_B \end{pmatrix},   (4.15)

where BA and BB have sizes (M − L) × L and L × L, respectively. Further, BB is non-singular as its rows are linearly independent. Define the matrix A of size M × M as

A = \begin{pmatrix} I & B_A \\ 0 & B_B \end{pmatrix},   (4.16)

where I is the identity matrix of size (M − L) × (M − L). Since BB is a non-singular matrix, A is also non-singular and its inverse is

A^{-1} = \begin{pmatrix} I & B_A B_B^{-1} \\ 0 & B_B^{-1} \end{pmatrix}.   (4.17)

In order to define HP, first define the matrix H̃R of size M × N as

\tilde{H}_R = A^{-1} \tilde{H} = \begin{pmatrix} H_A + B_A B_B^{-1} H_B & 0 \\ B_B^{-1} H_B & I \end{pmatrix}.   (4.18)

As H̃R is obtained from H by reversible row operations, the matrices have the same null space and thus define the same code.
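The top-left block of (4.18), H_A + B_A B_B^{-1} H_B, can be computed directly over GF(2). A pure-Python sketch (illustrative only; matrices are lists of 0/1 rows and the helper names are my own):

```python
def gf2_inv(B):
    """Invert a square 0/1 matrix over GF(2) by Gauss-Jordan
    elimination on the augmented matrix [B | I].
    Raises ValueError if B is singular."""
    L = len(B)
    A = [row[:] + [int(i == j) for j in range(L)]
         for i, row in enumerate(B)]
    for col in range(L):
        piv = next((r for r in range(col, L) if A[r][col]), None)
        if piv is None:
            raise ValueError("singular over GF(2)")
        A[col], A[piv] = A[piv], A[col]
        for r in range(L):
            if r != col and A[r][col]:
                A[r] = [x ^ y for x, y in zip(A[r], A[col])]
    return [row[L:] for row in A]

def gf2_mul(X, Y):
    """Matrix product over GF(2)."""
    return [[sum(X[i][t] & Y[t][j] for t in range(len(Y))) & 1
             for j in range(len(Y[0]))] for i in range(len(X))]

def gf2_add(X, Y):
    """Matrix sum over GF(2) (elementwise XOR)."""
    return [[x ^ y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def merged_matrix(HA, HB, BA, BB):
    """HP = HA + BA * BB^{-1} * HB over GF(2), i.e. the top-left
    block of the reordered parity-check matrix."""
    return gf2_add(HA, gf2_mul(gf2_mul(BA, gf2_inv(BB)), HB))
```

For a single punctured degree-2 node, BB is 1 × 1 and the construction reduces to adding one row of HB to one row of HA, i.e. merging the two check nodes of the punctured variable.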
However, considering the behavior of sum-product decoding with punctured bits over H̃R, two observations can be made. First, consider the check-to-variable message β_{c_l n_0} from a purged check node c_l to one of its non-punctured neighbors n_0 ∈ N(c_l) \setminus p_l, as depicted in Fig. 4.2(a):

\beta_{c_l n_0} = \prod_{n' \in N(c_l) \setminus n_0} \mathrm{sign}\,\alpha_{n' c_l} \cdot \Phi\Big( \sum_{n' \in N(c_l) \setminus n_0} \Phi(|\alpha_{n' c_l}|) \Big)
 = \prod_{n' \in N(c_l) \setminus n_0} \mathrm{sign}\,\alpha_{n' c_l} \cdot \Phi\Big( \Phi(|\alpha_{p_l c_l}|) + \sum_{n' \in N(c_l) \setminus \{n_0, p_l\}} \Phi(|\alpha_{n' c_l}|) \Big) = 0,   (4.19)

since Φ(|α_{p_l c_l}|) = Φ(γ_{p_l}) = Φ(0) = ∞. Thus, the purged check nodes c affect neither the computation of the variable-to-check messages nor the pseudo-posterior likelihoods at their neighboring variable nodes. Second, consider the variable nodes involved in check node c_l, as depicted in Fig. 4.2(b). The hard decision of p_l can be written

\mathrm{sign}\,\lambda_{p_l} = \mathrm{sign}\,\beta_{c_l p_l} = \prod_{n \in N(c_l) \setminus p_l} \mathrm{sign}\,\alpha_{n c_l} = \prod_{n \in N(c_l) \setminus p_l} \mathrm{sign}\,\lambda_n.   (4.20)

Thus, in a converged state, where all non-purged check nodes are fulfilled and the hard-decision values of the variable nodes do not change, c_l will eventually also be fulfilled. It follows from these two observations that the purged check nodes do not affect the behavior of the decoding algorithm, apart from possibly delaying convergence. HP is thus defined as the matrix H̃R with the purged check nodes and punctured variable nodes removed:

H_P = H_A + B_A B_B^{-1} H_B.   (4.21)

Figure 4.2 Check-to-variable messages from purged check nodes. (a) Message from purged check node c_l to non-punctured node n_0. (b) Message from purged check node c_l to punctured node p_l.

Figure 4.3 Merging of check nodes for punctured variable nodes. (a) Degree-2 variable node. (b) Degree-3 variable node. Each panel shows the punctured variable node and purged check node in H, the reordered matrix H̃R, and the resulting HP.
4.2.2 Algorithmic properties of decoding with HP

When all punctured variable nodes are of degree 2, decoding over HP has some interesting properties. Consider first the case when the punctured variable nodes do not share any check node and H is free of length-4 cycles. Under these assumptions, BB is a permuted identity matrix and BA is a permuted identity matrix with possible additional zero-filled rows. In this case, HP = HA + BA BB^T HB, and each purged check node c_l is merged with the other neighbor of its corresponding punctured variable node p_l, as shown in Fig. 4.3(a). Denoting the variable and check node operations of node k by V_k and C_k, respectively, consider the message β^i_{m_0 n_0} in iteration i in the top and bottom graphs in Fig. 4.3(a). In the top graph, using H,

\beta^i_{m_0 n_0} = C_0(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, \alpha^{i-1}_{n_3 m_0})
 = C_0(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, V_3(\beta^{i-1}_{m_1 n_3}))
 = C_0(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, \beta^{i-1}_{m_1 n_3})
 = C_0(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, C_1(\alpha^{i-2}_{n_4 m_1}, \alpha^{i-2}_{n_5 m_1}, \alpha^{i-2}_{n_6 m_1}))
 = C_0(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, \alpha^{i-2}_{n_4 m_1}, \alpha^{i-2}_{n_5 m_1}, \alpha^{i-2}_{n_6 m_1}),   (4.22)

whereas in the bottom graph, using HP,

\beta^i_{m_0 n_0} = C_0(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, \alpha^{i-1}_{n_4 m_0}, \alpha^{i-1}_{n_5 m_0}, \alpha^{i-1}_{n_6 m_0}).   (4.23)

The operations performed when decoding using HP are thus the same as when decoding using H, and differ only in the speed of the propagation of the messages. It can thus be expected that decoding using HP achieves convergence in fewer iterations than using H. As seen in Sec. 4.4, this is also the case. When the punctured variable nodes share check nodes, (4.23) can be shown to still hold by repeated application of (4.21) for independent sets of nodes. However, as (4.21) may introduce length-4 cycles, (4.22) may contain duplicate messages that are not present in (4.23). Although the presence of duplicate messages results in a computational change in the algorithm, the defined code is the same.
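The collapse from (4.22) to (4.23) relies on Φ being self-inverse, Φ(Φ(x)) = x, assuming the usual sum-product kernel Φ(x) = −ln tanh(x/2): a message passed unchanged through a degree-2 variable node and into another check node contributes exactly its constituent Φ-terms. A numerical sketch (floating point, so the equality is approximate; illustrative only, not the thesis code):

```python
import math

def phi(x):
    # Phi(x) = -ln tanh(x/2); note Phi(Phi(x)) = x for x > 0.
    return -math.log(math.tanh(x / 2.0))

def check_node(msgs):
    """Sum-product check node: product of the signs times
    Phi of the sum of Phi of the magnitudes."""
    s = 1.0
    for m in msgs:
        s *= 1.0 if m >= 0 else -1.0
    return s * phi(sum(phi(abs(m)) for m in msgs))

# Two-step propagation through a degree-2 punctured node:
# beta = C1(c, d, e) passes unchanged through the variable node,
# then C0(a, b, beta) equals the merged check C0(a, b, c, d, e).
a, b, c, d, e = 1.2, -0.7, 2.1, 0.9, -1.5
two_step = check_node([a, b, check_node([c, d, e])])
merged = check_node([a, b, c, d, e])
```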
The conditions above, requiring punctured variable nodes of degree 2, are relatively strict, but they are often met in practice due to the simplified encoder designs for these types of codes [69]. Puncturing of higher-degree nodes is possible, and an example with a degree-3 node is shown in Fig. 4.3(b). However, for higher-degree nodes, (4.18) normally introduces length-4 cycles, which reduce the performance of the decoding algorithm.

4.2.3 Choosing the puncturing sequence p

The requirement that the columns of H corresponding to the punctured nodes must be linearly independent is related to the concepts of k-SR variable nodes (defined in Sec. 2.5.1) and k-SE check nodes (defined in Sec. 4.1.1). However, the requirement of linear independence is less strict. First, it is shown that the columns of H corresponding to a set of k-SR nodes are linearly independent. Then, the implications of linear independence in terms of recoverability are discussed. Assume that the parity-check matrix H is given, and let p denote a set of recoverable punctured nodes. Further, let c denote the set of check nodes used for the recovery of the punctured nodes. Denote the k-SR nodes by pk and the corresponding k-SE check nodes by ck. Now, define H̃ as in Sec. 4.2.1 and consider the sub-matrix BB. Obviously, the rows c1 and columns p1 corresponding to the 1-SR nodes form an identity matrix. Generally, the checks ck can only contain l-SR nodes with l < k, in addition to the k-SR node being recovered. It follows that BB is upper triangular with ones along the diagonal, and it is therefore non-singular.

Figure 4.4 Example of non-recoverable punctured nodes p0, p1, p2.

Thus, as a set of recoverable nodes implies that HP can be defined, any of the available algorithms for determining k-SR puncturing sequences can also be used to determine sets of punctured nodes for check-merged decoding. The converse, however, is not true.
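Whether a candidate set of punctured columns is linearly independent over GF(2), and thus whether HP can be defined, can be tested directly with Gaussian elimination. An illustrative pure-Python sketch (function names are my own):

```python
def gf2_rank(cols):
    """Rank over GF(2) of the matrix whose columns are the given
    bit-lists; the columns are linearly independent iff the rank
    equals their number."""
    rows = [list(r) for r in zip(*cols)]        # transpose to rows
    rank = 0
    for col in range(len(cols)):
        piv = next((i for i in range(rank, len(rows)) if rows[i][col]),
                   None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def independent(H, punctured):
    """True iff the columns of H indexed by `punctured` are linearly
    independent over GF(2), i.e. iff HP can be defined."""
    cols = [[row[n] for row in H] for n in punctured]
    return gf2_rank(cols) == len(punctured)
```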
Consider for example the punctured variable nodes shown in Fig. 4.4. None of the nodes are recoverable, as all their check node neighbors require at least one other of the punctured variable nodes to be recovered in order to produce non-zero messages. However, the parity-check matrix corresponding to the graph is

$$H = \begin{pmatrix}
\cdots & 1 & 1 & 0 \\
\cdots & 1 & 1 & 1 \\
\cdots & 0 & 1 & 1 \\
\cdots & 0 & 0 & 0 \\
 & \vdots & \vdots & \vdots \\
\cdots & 0 & 0 & 0
\end{pmatrix}, \qquad (4.24)$$

where the three rightmost columns corresponding to the punctured variable nodes are linearly independent.

4.3 Rate-compatible QC-LDPC code decoder

In this section, a check-merging decoder architecture for QC-LDPC codes is presented. The architecture is based on the reference decoder in Sec. 2.8 (also in [66]), and includes modifications to enable check-merging. The modifications require some logic overhead, but the benefit is a significantly increased throughput of the decoder on punctured blocks. The increased throughput comes from three sources:

• faster propagation of messages due to merged check nodes, leading to a decrease in the average number of iterations,
• a reduction in the code complexity due to removed nodes, reducing the number of clock cycles for each iteration, and
• the elimination of idle waiting cycles between layers.

4.3.1 Decoder schedule

The schedule of the decoder and relevant definitions are described in Sec. 2.8.1.

Figure 4.5 Schedule of decoder on check node merged parity-check matrix.

Figure 4.5 shows the resulting parity-check matrix when the bits in the rightmost sub-matrix in Fig. 2.25 have been punctured and the last two layers merged. As can be seen, the merged layer consists of the sums of the corresponding rows in the original parity-check matrix. The result is a parity-check matrix that may contain sub-matrices that have column weights larger than one.
The bit-sum blocks in a check-node group may then overlap, which the architecture has to be aware of. Whereas the original architecture computes updated bit-sums directly, the proposed architecture computes bit-sum differences which are then added to the old bit-sums to produce the updated ones. The proposed architecture thus efficiently handles the case where a variable node is involved simultaneously in several check nodes. As the check nodes in a check node group are processed in parallel, check node merging can only be used if all the check nodes in the check node group are purged. This is not a major limitation, however, as puncturing sequences are normally defined on the base matrix of the code and then expanded to the used block length. Thus, at most one check node group needs to have both purged and non-purged check nodes, and this check node group is then processed by initializing the punctured variable nodes to zero.

4.3.2 Architecture overview

An overview of the check node merging decoder architecture is shown in Fig. 4.6. At the start of each block, the contents of the bit-sum memories are initialized with the log-likelihood ratios received from the channel. Then the contents of the bit-sum memories are updated during a number of iterations. In each iteration, the hard-decision values are also written to the decision memory, from where they can be read when decoding of the block is finished. In each iteration, bit-sums are read from the bit-sum memories, routed by a cyclic shifter to the correct check function unit (CFU), and then routed back. The output of the CFUs are bit-sum differences which are accumulated for each read bit-sum value in the bit-sum update block. The updated bit-sums are then rewritten to the bit-sum memories.

Figure 4.6 Architecture of decoder utilizing check merging.

The two counters count the current input and output indices of the bit-sum blocks of the currently processed check node group. The following parameters are defined for the architecture:

• Z is the maximum expansion factor
• Q is the parallelization factor
• C is the maximum check node degree
• wq is the bit-sum wordlength
• wr is the wordlength of the min-sum magnitudes
• we is the edge index wordlength

4.3.3 Cyclic shifters

As in the original architecture [66], the shifters are implemented as barrel shifters with programmable wrap-around. However, unlike the original architecture the return path needs a cyclic shift as a bit-sum may need updates produced in several different CFUs.

4.3.4 Check function unit

In the CFU, shown in Fig. 4.7, the old check-to-variable messages are stored for each of the check nodes' neighbors. As the min-sum algorithm is used, only two values are needed. These are stored as wr-bit magnitudes along with a we-bit index of the node with the minimum message and one bit containing the check node parity. These bits are grouped together and stored in the min-sum memory.
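The difference-based bit-sum update described above can be sketched as a toy numerical example (hypothetical message values): recomputing the bit-sum directly and accumulating per-CFU differences onto the old bit-sum give the same result, even when several check nodes touch the same variable node.

```python
# One variable node with channel LLR 1.5 that participates in two check
# nodes of a merged layer (hypothetical message values).
llr = 1.5
c2v_old = {"c0": -0.4, "c1": 0.9}   # messages stored from last iteration
c2v_new = {"c0": 0.2, "c1": 0.5}    # messages just computed by the CFUs

bitsum_old = llr + sum(c2v_old.values())

# Reference architecture: recompute the bit-sum directly.
bitsum_direct = llr + sum(c2v_new.values())

# Proposed architecture: each CFU emits only its message difference,
# accumulated onto the old bit-sum; overlapping updates compose safely.
bitsum_diff = bitsum_old + sum(c2v_new[c] - c2v_old[c] for c in c2v_new)

print(abs(bitsum_direct - bitsum_diff) < 1e-12)  # True
```

Because the differences are additive, the order in which overlapping CFU updates arrive does not matter.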
Figure 4.7 Check function unit.

Along with the signs of each check-to-variable message stored in the S memory, the check message generator generates the old check-to-variable message in two's complement (wr + 1)-bit representation for each variable node neighbor. The old check-to-variable messages are subtracted from the bit-sum input to generate the variable-to-check messages in the bit message generator. These are used by the min-sum update unit to compute the new min-sum values, which are subsequently stored in the min-sum memory. They are also sent to the second check message generator, which generates the new check-to-variable messages. The difference between the old and new check-to-variable message is computed, and forms the output of the CFU. The check message generator consists of a mux to choose the correct check-to-variable message and a converter from signed magnitude to two's complement.

Figure 4.8 Bit-sum update unit.
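The compressed min-sum state (two minimum magnitudes, the index of the minimum, the parity, and the per-edge signs) is sufficient to regenerate every outgoing check-to-variable message. A behavioral sketch with hypothetical helper names, including the offset correction used by the min-sum update unit:

```python
def minsum_state(v2c):
    """Compress a check node's variable-to-check inputs to the stored
    state: two minimum magnitudes, index of the minimum, parity, signs."""
    mags = [abs(m) for m in v2c]
    idx = mags.index(min(mags))
    min1 = mags[idx]
    min2 = min(mags[:idx] + mags[idx + 1:])
    signs = [m < 0 for m in v2c]
    parity = sum(signs) % 2
    return min1, min2, idx, parity, signs

def c2v_from_state(state, offset=0.0):
    """Regenerate all outgoing check-to-variable messages from the state
    (offset min-sum correction included)."""
    min1, min2, idx, parity, signs = state
    out = []
    for j, s in enumerate(signs):
        mag = max((min2 if j == idx else min1) - offset, 0.0)
        out.append(-mag if parity ^ s else mag)  # sign of the other inputs
    return out

print(c2v_from_state(minsum_state([1.25, -0.5, 0.75, -2.0])))
# -> [0.5, -0.75, 0.5, -0.5]
```

Each edge receives the smallest magnitude among the other edges, which is why only two magnitudes and one index need to be stored per check node.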
The bit message generator computes the variable-to-check message by adding the inverted old check-to-variable message to the bit-sum and then converting the result back to signed magnitude representation. The min-sum update unit is unchanged from the original in [66], and performs a basic search for the two minimum magnitudes in its input. A programmable offset value is used to correct for the overly optimistic estimates of the min-sum algorithm. Also, the parity of the check node is computed.

4.3.5 Bit-sum update unit

In Fig. 4.8, the bit-sum update unit is shown. When processing of a check node group starts, the bit-sums are loaded in the shift registers with a delay equal to the current node degree, and thus equal to the delay of the CFUs. Then, when the bit-sum differences from the CFUs start to arrive, they are added to the old bit-sums. The sum is truncated and sent to the output to be written back to the bit-sum memories. When several CFUs have updates to the same bit-sum, the updated bit-sum is reloaded in the shift register to replace the obsolete value.

4.3.6 Memories

The bit-sums are organized into Q memory banks with data wordlengths of wq bits. Each memory l contains the bit-sums of the variable nodes vk where k = l mod Q. The min-sums are organized into Q memory banks with data wordlengths of we + 1 + 2wr bits, which hold a we-bit edge index, a 1-bit parity, and two wr-bit magnitudes. Each memory l contains the min-sums of the check nodes ck where k = l mod Q.¹

Figure 4.9 Comparison of puncturing sequences for the WiMax rate-1/2 code punctured to rate 2/3 and 3/4. The straight lines are the optimized sequences and the dotted lines are obtained using the heuristic in [46].

4.4 Simulation results

4.4.1 Design of puncturing sequences

The optimization algorithm in Sec.
4.1 has been applied to the WiMax rate-1/2 base matrix ((N, M) = (24, 12)) to create a base puncturing pattern of length 11, with which code rates up to 0.92 can be obtained. The base puncturing pattern was expanded with factors of 24, 36 and 48 to create puncturing patterns for block lengths of 576, 864 and 1152. To solve the optimization problem, the SCIP solver [2] has been used. Solving the problem for the WiMax rate-1/2 base matrix was done in seconds on a desktop computer, and resulted in the puncturing sequence

p = (13, 16, 19, 22, 14, 17, 20, 23, 6, 15, 8).   (4.25)

The performance of the codes was evaluated using the sum-product decoding algorithm with a maximum of 100 iterations. For all of the measurements a minimum of 100 failed decoding attempts were observed. In Fig. 4.9, puncturing sequences for the (N, K) = (576, 288) code punctured to rate 2/3 and 3/4 have been compared. The figure shows the ILP-optimized sequence together with an ensemble of ten random sequences obtained using the greedy algorithm in [46]. In order to provide a fair comparison with scalable block sizes, the greedy algorithm was applied on the base matrix and the obtained base puncturing patterns were expanded with the matrix expansion factor.

1-SR     4    5    6    7
Number   130  372  385  113

Table 4.1 Number of 1-SR nodes for 1000 runs of the algorithm [46] applied to the WiMax rate-1/2 base matrix. The proposed algorithm achieves 8 1-SR nodes.

Rate      0.75  0.8  0.86  0.92
[46]      2     2–3  2–4   6
Proposed  1     2    2     4

Table 4.2 Recovery delay (maximum k-SR node) for different punctured rates of the WiMax rate-1/2 base matrix.

¹ For standard codes and parallelization degrees, these memory arrangements may result in sparingly used memories. To circumvent this, the same techniques as in [66] may be used.
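The expansion of a base puncturing pattern to a given block length can be sketched as below. The convention that puncturing base column j punctures the z consecutive variable nodes of its expanded block is an assumption made for illustration; the thesis' exact mapping may differ.

```python
def expand_pattern(base_pattern, z):
    """Expand a base-matrix puncturing pattern to block level; assumed
    convention: base column j punctures expanded columns j*z .. j*z+z-1."""
    return [j * z + t for j in base_pattern for t in range(z)]

p_base = (13, 16, 19, 22, 14, 17, 20, 23, 6, 15, 8)  # sequence (4.25)

z = 24                                     # expansion factor for N = 576
punctured = expand_pattern(p_base[:6], z)  # first 6 base columns
rate = 288 / (576 - len(punctured))        # K over transmitted bits

print(len(punctured), rate)  # 144 punctured bits -> rate 2/3
```

Because the same base pattern is expanded for every block length, the puncturing sequences scale with the code, which is what enables the scalable-block-size comparison above.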
For rate 2/3, the performance of the optimized sequence is comparable to those obtained with the heuristics, whereas for rate 3/4, the performance of the optimized sequence is superior to those obtained with the heuristics. There is a small trade-off between performance at different rates, which partly explains why the optimized sequence is not superior for all rates. However, it is also the case that the algorithm optimizes recovery time, which may not necessarily result in optimal performance. In Table 4.1, the number of 1-SR nodes obtained with the algorithm in [46] is shown to vary between 4 and 7. In contrast, the optimization achieves 8 1-SR nodes. In addition, as seen in Table 4.2 the recovery delay is commonly higher for the greedy algorithm, leading to shorter decoding times for the optimized puncturing sequences. In Fig. 4.10(a), the rate-1/2 length-576 WiMax code punctured to rates 2/3, 3/4, and 5/6 has been compared to the dedicated codes of the standard. It is seen that the performance difference is small for the lower rates. In Fig. 4.10(b), the performance loss of puncturing the rate-1/2 WiMax code to rate 3/4 is shown for block sizes of 576, 864 and 1152. The puncturing patterns are obtained from the same base puncturing pattern, showing the possibility of using the proposed scheme with scalable block sizes.

4.4.2 Check-merging decoding algorithm

The proposed check-merging decoding algorithm in Sec. 4.2 has been applied to the WiMAX rate-1/2 code punctured to rates 3/5 and 3/4. The resulting codes have been simulated over an AWGN channel and decoded with the sum-product algorithm on both the rate-1/2 graph and the check-merged graph. The block error
rate and average number of iterations have been computed. Simulations using a maximum of 20 iterations are shown in Figs. 4.11(a) and 4.11(b), respectively. The main benefit is the significant reduction of the average number of iterations, allowing an increase in the throughput.

Figure 4.10 Performance of ILP-based puncturing scheme. (a) Comparison of the punctured rate-1/2 length-576 WiMax code with dedicated rates of the standard. (b) Performance of the rate-1/2 WiMax code punctured to rate 3/4 for block sizes of 576, 864 and 1152.

Figure 4.11 Simulation results of check-merging decoding algorithm. (a) Block error rate of WiMAX rate-1/2 code punctured to rates 3/5 and 3/4. (b) Average number of iterations for WiMAX rate-1/2 code punctured to rates 3/5 and 3/4.

4.5 Synthesis results of check-merging decoder

The proposed check-merging architecture in Sec. 4.3 has been synthesized for the LDPC codes used in IEEE 802.16e and IEEE 802.11n in order to determine the logic overhead required to support check merged decoding. For IEEE 802.16e, a parallelization factor of Q = 24 has been used. For IEEE 802.11n, the maximum length codes of N = 1944 have been assumed, and in order to more accurately conform to the data rate requirements a larger parallelization factor of Q = 81 has been used. For all implementations, wq = 8, wr = 5 and we = 5.

Figure 4.12 Maximum check node degrees of check node merged matrices for IEEE 802.16e codes (rates 1/2, 2/3a, 3/4a) and IEEE 802.11n codes (rates 1/2, 2/3, 3/4), with the implementation limit indicated.

4.5.1 Maximum check node degrees

The maximum check node degree supported by the decoder is determined by the length of the register chain in the bit-sum update unit. Therefore a limit must be set based on the degree distributions of the check node merged matrices. Figure 4.12 shows the maximum check node degrees as a function of the punctured rate for the considered codes of the IEEE 802.16e and IEEE 802.11n standards. In both cases, C = 28 has been considered a suitable trade-off between decoder complexity and achievable rates.

4.5.2 Decoding throughput

As the complexity of the code is reduced when checks are merged, the attainable throughput increases as the punctured rate increases. This gain is in addition to the faster convergence rate of the check merging decoder. Assuming a fixed number of 10 iterations, the minimum achievable throughputs for IEEE 802.16e and IEEE 802.11n are shown in Fig. 4.13.

Figure 4.13 Information bit throughput (Mb/s) of decoder on IEEE 802.16e codes and IEEE 802.11n codes, as a function of the punctured rate.

4.5.3 FPGA synthesis

The reference decoder (Sec. 2.8) and the proposed check merging decoder have been synthesized for an Altera Cyclone II EP2C70 FPGA with speedgrade -6. Mentor Precision was used for the synthesis, and the results are shown in Tables 4.3 and 4.4 for IEEE 802.16e and IEEE 802.11n, respectively. The C values for the reference decoders were set to the maximum check node degrees of the fixed rate codes in the respective standards.

                            Q    C    LUTs   Regs   Mem. bits
Fixed rate (Sec. 2.8)       24   20   4750   6303   20640
Rate-compatible (Sec. 4.3)  24   28   5988   8015   20480

Table 4.3 Synthesis results of decoders for IEEE 802.16e codes.

                            Q    C    LUTs    Regs    Mem. bits
Fixed rate (Sec. 2.8)       81   23   16396   21865   66560
Rate-compatible (Sec. 4.3)  81   28   22884   28635   66336

Table 4.4 Synthesis results of decoders for IEEE 802.11n codes.

For 802.16e, the overhead of the check merging decoder is 26% and 27% for the LUTs and registers, respectively. For 802.11n, the overhead is larger at 40% and 31% for the LUTs and registers, respectively. A significant number of the LUTs is consumed in the barrel shifters.

5 Data representations

In this chapter, the representation of the data in a fixed-point decoder implementation is discussed. It is shown that the usual data representation is redundant, and that in many cases coding of the data can be applied to reduce the width of the data buses without sacrificing the error-correcting performance of the decoder. This chapter has been published in [7].

5.1 Fixed wordlength

The sum-product algorithm, as formulated in the log-likelihood domain in (2.13)–(2.15), lends itself well towards fixed wordlength implementations. Usually, decent performance is achieved using wordlengths as short as 4–6 bits [111]. Thus, the additions can be efficiently implemented using ripple-carry adders, and the domain transfer function Φ(x) can be implemented using direct gate-level synthesis.

5.2 Data compression

The function Φ(x) is shown in Fig. 5.1.
In a hardware implementation, usually the output of this function is stored in memories between the phases, and it is therefore of interest to represent the information efficiently. However, as the function is non-linear, all possible words will not be represented at the output, and there is therefore sometimes an opportunity to code the data.

Figure 5.1 The domain transfer function Φ(x).

A general model of the data flow in a sum-product decoder is shown in Fig. 5.2(a). The VNU block performs the computation of the αnm messages in (2.13), whereas the CNU block performs the additions for the βmn messages in (2.14). Φ(x) is implemented by the LUT blocks, and can be part of either the VNU block or the CNU block. The dashed lines 1 and 2 show the common partitions used in decoder implementations, e.g., in [109] and [68], respectively. Instead of these partitions, the implementation in Fig. 5.2(b) is suggested, where an encoder is used to convert the messages to a more compact representation which is used to communicate the messages between the processing units. At the destination, a decoder is used to convert the coded data back to the original message.

Figure 5.2 Data-paths in a sum-product decoder. (a) Conventional data-path. (b) Data-path utilizing compression.

Figure 5.3 Redundant bits of discretized domain transfer function Φ(x). (a) Redundant bits B(wi, wf) of LUT value entries for different data representations. (b) Redundant bits of LUT value entries for different logarithm bases using data representations with wi = 2, wf = 2, and wi = 2, wf = 3.
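The Enc/Dec blocks of Fig. 5.2(b) amount to a table look-up between the unique LUT output words and a denser index. A minimal sketch with hypothetical LUT contents (the values below are illustrative, not the thesis' actual table):

```python
import math

# Hypothetical contents of the discretized Phi(x) LUT: 4-bit magnitude
# outputs, of which only a subset of the 16 possible words occur.
lut_out = [8, 6, 4, 3, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0]

codebook = sorted(set(lut_out))               # the No unique output words
enc = {w: c for c, w in enumerate(codebook)}  # Enc block: word -> code
dec = {c: w for w, c in enc.items()}          # Dec block: code -> word

bits = math.ceil(math.log2(len(codebook)))
print(len(codebook), "unique words ->", bits, "code bits")
assert all(dec[enc[w]] == w for w in lut_out)  # lossless round-trip
```

Because the mapping is a bijection on the words that actually occur, no precision is lost; only the bus width shrinks.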
Denoting the number of integer bits in the uncoded data representation by wi, and the number of fractional bits by wf, the number of redundant bits in the output of the LUT blocks can be written

$$B(w_i, w_f) = \log_2\!\left(\frac{2^{w_i + w_f}}{N_o}\right), \qquad (5.1)$$

where No is the number of unique function values of the discretized version of Φ(x).

5.3 Results

B(wi, wf) is plotted for different parameters in Fig. 5.3(a). Obviously B(wi, wf) also depends on the base of the logarithm in Φ(x) described in Sec. 2.6.5. However, as the domain transfer functions differ when the logarithm is not the natural logarithm, B(wi, wf) will be different for the two LUTs. The representation with wi = 2 and wf = 2 was shown in [111] to be a good trade-off between the performance and complexity of the decoder, and the number of redundant bits as a function of the logarithm base is shown in Fig. 5.3(b). As shown, compression can be combined with logarithm scaling for bases from 2.4 to 3.2 for wf = 2, for the considered representation, without precision loss. Outside this interval, approximation of the domain transfer function is needed in one direction. Assuming that the natural logarithm base and a data representation with wi = 2 and wf = 2 are used, messages consist of 6 bits (including sign bit and hard decision/parity-check bit), and compression thus results in a wordlength reduction of 16.7%. The area overhead associated with the separation of the look-up table has been estimated by synthesis to standard cells in a 0.35 µm CMOS process. The synthesis was performed using Synopsys Design Compiler.

Figure 5.4 Distributions of arguments to discretized domain transfer function Φ(x). (a) Distribution of input values to CNU. (b) Distribution of input values to VNU.
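Equation (5.1) can be evaluated directly by counting the unique outputs of the discretized Φ(x). The sketch below assumes Φ(x) = −log_b(tanh(x/2)), round-to-nearest quantization and saturation at the largest representable word; the thesis' exact discretization may differ in detail, so the numbers are illustrative.

```python
import math

def redundant_bits(wi, wf, base=math.e):
    """B(wi, wf) per (5.1): wi + wf minus log2 of the number of unique
    discretized outputs of Phi(x) = -log_base(tanh(x/2))."""
    step = 2.0 ** -wf
    levels = 2 ** (wi + wf)
    out = set()
    for k in range(1, levels):                          # nonzero magnitudes
        v = -math.log(math.tanh(k * step / 2.0), base)
        out.add(min(int(round(v / step)), levels - 1))  # quantize, saturate
    return (wi + wf) - math.log2(len(out))

for wf in (2, 3, 4):
    print("wf =", wf, " B =", round(redundant_bits(2, wf), 2))
```

A positive B means the LUT output can be recoded into fewer bits without loss, which is the basis of the 16.7% wordlength reduction quoted above.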
The synthesis of the original look-up table required 12 cells utilizing a total area of 726 µm². Realized as separate blocks using a straightforward mapping of data to the intermediate compressed format, the encoder and decoder parts utilized 8 cells occupying 528 µm² and 6 cells occupying 309 µm², respectively. As a comparison, a 6-input CNU and a 3-input VNU occupy roughly 39000 µm² and 31000 µm², respectively, when synthesized using the same methods. Thus the area overhead of 111 µm² per look-up table amounts to a 1–1.5% area increase for the processing elements. In contrast, problems with large routing overheads are commonly reported in implementations, with overheads as large as 100% in the fully parallel (N, K) = (1024, 512) code decoder in [18]. In addition to reducing the wordlength of messages, the separation of the look-up table also makes it possible to choose a representation suitable for energy-efficient communication, and, as shown above, the separation is essentially free. Consider the distribution of look-up table values for CNU input (Fig. 5.4(a)) and VNU input (Fig. 5.4(b)) obtained using an (N, K) = (1152, 576) code. It is obvious that the energy dissipation for the communication of messages will depend on the encoding chosen. For example, in a parallel architecture the most common values can be assigned representations with a small mutual distance, whereas in a partly parallel or serial architecture an asymmetric memory [22] might be used efficiently if low-weight representations are chosen for the most common values.

6 Conclusions and future work

In this chapter, the work presented in this part of the thesis is concluded, and suggestions for future work are given.

6.1 Conclusions

In part I of this thesis, two improved algorithms for the decoding of LDPC codes were proposed. Chapter 3 concerns the early decision algorithm, which reduces the computational complexity by deciding certain parts of the codeword early during decoding.
Furthermore, the early decision modification was combined with the sum-product algorithm to form a hybrid decoding algorithm which does not visibly reduce the performance of the decoder. For a regular (N, j, k) = (1152, 3, 6) code, an internal communication reduction of 40% was achieved at a block error rate of 10⁻⁶. In general, larger reductions are obtainable for codes with higher rates and smaller block sizes. In Chapter 4, the check merging decoding algorithm for rate-compatible LDPC codes was suggested. The algorithm proposes a technique of merging check nodes of punctured LDPC codes, thereby reducing the decoding complexity. For dual-diagonal LDPC codes, the algorithm offers a reduced decoding complexity due to two factors: the complexity of each iteration is reduced due to the removal of nodes in the code's Tanner graph, and the number of iterations required for convergence is reduced due to faster propagation of messages in the graph. For the rate-1/2 length-576 IEEE 802.16e code punctured to rate 3/4, the check merging algorithm offers a more than 60% reduction in the average number of iterations at a block error rate of 10⁻³ over an AWGN channel. The early decision and check merging algorithms have both been implemented on FPGAs. In both cases, the proposed algorithms come with a logic overhead compared with their respective reference implementations. The early decision algorithm was implemented for a (3, 6)-regular code class, whereas the check merging algorithm was implemented for the QC-LDPC codes with dual-diagonal structure used in the IEEE 802.16e and IEEE 802.11n standards. The logic overhead of the early decision algorithm was relatively large at 45%, whereas the logic overhead of the check merging algorithm was 26% and 40% for the IEEE 802.16e and IEEE 802.11n implementations, respectively.
6.2 Future work

The following ideas are identified as possibilities for future work:

• Removing early decisions when graph inconsistencies are encountered. This is likely necessary for the early decision algorithm to work well with codes longer than 1000–2000 bits.
• The behavior of the early decision algorithm using irregular codes and efficient choices of thresholds should be analyzed.
• The check merging algorithm works well when the punctured nodes are of degree two, but investigating the possibilities of adapting the algorithm to work with punctured nodes of higher degrees is of interest.
• For both of the proposed algorithms, their performance over fading channels is of interest.

Part II High-speed analog-to-digital conversion

7 Introduction

7.1 Background

The analog-to-digital converter (ADC) is the interface between the analog signal domain and the digital processing domain, and is used in all digital communications systems. Normally, an ADC consists of a sample-and-hold stage and a quantization stage, where the sample-and-hold performs discretization in time and the quantization stage performs discretization of the signal level. There are many different types of ADCs, and some of the more common ones are:

• the flash ADC, which consists of a number of comparators with different reference signal levels, followed by a thermometer-to-binary decoder. The flash ADC is fast, but achieves low resolution as the number of comparators grows exponentially with the required number of bits.
• the successive approximation ADC, which uses an internal digital-to-analog converter (DAC) fed from a register containing the current approximation of the analog input. In each clock cycle, a comparator compares the input to the current approximation, deciding one bit of the output. The successive approximation ADC is slower than the flash ADC, but can achieve medium resolutions.
• the pipeline ADC, which consists of several cascaded flash ADCs. In each
stage, the signal is discretized by a coarse flash ADC, and the approximation is then subtracted from the signal and the difference is passed on to the following stages. The pipeline ADC thus combines the speed of the flash ADC with the resolution of the successive approximation ADC.

• the sigma-delta ADC (Σ∆-ADC), which is essentially a low-resolution flash ADC with a negative feedback loop for the quantization error. The feedback allows the quantization noise to be frequency shaped, which is combined with oversampling to place the majority of the quantization noise in a frequency band unoccupied by the signal. This requires the ADC to be followed by a filter to remove the quantization noise and a downsampling stage to reduce the sampling frequency to a useful rate. Due to the oversampling, the achievable speed of a Σ∆-ADC is limited, but the architecture is normally resistant to analog errors, allowing high resolutions to be obtained.

The resistance to analog errors of the Σ∆-ADC is attractive from a system-on-chip point of view, as digital circuits are almost exclusively made in cost-efficient CMOS processes with small feature sizes, in which linear analog devices are difficult to design. Σ∆-ADCs allow the requirements on the analog devices to be relaxed, instead compensating for errors with digital signal processing, for which the CMOS processes are well suited. Therefore, attempts have recently been made to build high-speed Σ∆-ADCs [19, 82, 103] for general communications standards. The work in this part of the thesis considers the design of high-speed ADCs through parallel Σ∆-modulators, and the design of high-speed FIR filters with short wordlengths which are usable as decimation filters for high-speed ADCs. The proposed parallel Σ∆-modulator designs use modulation of the input signal to reduce matching requirements of the analog devices, and a general method to analyze the matching requirements of such systems is presented.
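The noise-shaping feedback loop of the Σ∆-ADC can be illustrated with a behavioral first-order modulator. This is a minimal sketch under idealized assumptions (ideal integrator, 1-bit quantizer, no analog errors), not a model of any particular converter in the thesis:

```python
def sigma_delta(x):
    """Behavioral first-order sigma-delta modulator: an ideal integrator
    followed by a 1-bit quantizer, with the output fed back."""
    acc = 0.0
    bits = []
    for v in x:
        acc += v - (bits[-1] if bits else 0.0)  # integrate the error
        bits.append(1.0 if acc >= 0.0 else -1.0)
    return bits

# A DC input of 0.25: the oversampled 1-bit stream averages to the input,
# which the digital decimation filter would recover.
n = 4096
stream = sigma_delta([0.25] * n)
print(abs(sum(stream) / n - 0.25) < 0.01)  # True
```

The bounded integrator state is what pushes the quantization error toward high frequencies, where the decimation filter removes it.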
The targeted data converters have bandwidths of tens of MHz and sampling frequencies in the GHz range, making both the linearity of analog devices and the implementation of digital decimation filters difficult. The proposed design of high-speed FIR filters uses bit-level optimization on the placement of full and half adders in the implementation.

7.2 Applications

There are many applications that could benefit from system-on-chip integration of transceiver frontends, but the main target is area- and power-constrained devices such as handhelds and laptops. Such devices support an increasing number of communication standards such as GSM and UMTS for voice and data communications, wireless LAN for high-speed internet access, Bluetooth for peripheral connections and GPS for positioning services. These standards have wildly differing requirements on bandwidth and resolution, making the Σ∆-ADC attractive as these measures can be traded by adjusting the width of the digital decimation filter.

7.3 Scientific contributions

There are two main contributions in this part of the thesis. In Chapter 10, a method of analyzing general parallel ADCs for matching requirements is presented. This is done by a multi-rate formulation of the parallel system, and the stability of subsets of the channels can then be determined by observing the transfer function matrix for the sub-system. Time-interleaved, Hadamard-modulated and frequency-band decomposed parallel ADCs are then described as special cases of the general system, and their resistances to analog imperfections are analyzed. The multi-rate formulation is then used as a base to find insensitive systems. The second main contribution is in Chapter 12 and considers the design of high-speed FIR filters that may be used as decimation filters for high-speed Σ∆-ADCs.
The problem of designing an efficient filter is decomposed into two parts, where the first part considers generation of a set of weighted bit-products and the second part considers the construction of a pipelined summation tree. In the first part, the filter is first formulated as a multi-rate system. Then, the branch filters are implemented using one of several structures, including direct form FIR and transposed direct form FIR, and bit-products are generated for each of the coefficients. Finally, bit-products with the same bit weight are merged. In the second part, the summation of the generated bit-products is considered. Traditional methods including Wallace trees and Dadda trees are used as references, and a bit-level optimization algorithm is proposed that minimizes a cost function based on the number of full adders, half adders and pipeline registers. Finally, the suitability of the different structures is evaluated for decimation and interpolation filters of varying wordlengths, rate change factors and pipeline depths.

8 FIR filters

In this chapter, the basics of finite impulse response (FIR) filters are discussed. The definition of a filter from an impulse response is given in Sec. 8.1, and the z-transform is introduced as a tool to analyze the behavior of a filter. In Sec. 8.2, some design methods for FIR filters are briefly touched upon. In Sec. 8.3, the basics of sampling rate conversion and multirate signal processing are discussed, and in Sec. 8.4, the main architectures for the realization of FIR filters are introduced. Also, a bit-level realization of an FIR filter for decimation is shown, where the realization uses multirate theory to reduce the arithmetic complexity. More in-depth introductions to digital filters are available in [78], and multirate theory is extensively discussed in [94].
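The use of multirate theory to reduce arithmetic complexity can be illustrated by polyphase decimation: filtering followed by M-fold downsampling, computed entirely at the low rate. A sketch under the assumption of a plain direct-form reference (illustrative coefficients and data):

```python
def fir(h, x):
    """y(n) = sum_k h(k) x(n-k), truncated to len(x) output samples."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))
            for n in range(len(x))]

def decimate_direct(h, x, m):
    """Reference: filter at the high rate, keep every m-th output."""
    return fir(h, x)[::m]

def decimate_polyphase(h, x, m):
    """Polyphase: y(n) = sum_p sum_q h(qm + p) x(nm - qm - p), evaluated
    branch by branch entirely at the low rate."""
    y = []
    for n in range((len(x) + m - 1) // m):
        acc = 0.0
        for p in range(m):
            for q, c in enumerate(h[p::m]):  # branch filter h_p(q) = h(qm+p)
                i = n * m - q * m - p
                if 0 <= i < len(x):
                    acc += c * x[i]
        y.append(acc)
    return y

h = [0.5, 1.0, -0.25, 0.75, 0.1, -0.5]
x = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25, 2.0, -0.5, 1.5, 0.0, -1.25, 0.75]
print(all(abs(a - b) < 1e-12
          for a, b in zip(decimate_direct(h, x, 3), decimate_polyphase(h, x, 3))))
```

Only the retained output samples are ever computed, which is the source of the complexity reduction exploited by the decimation filter realizations in this thesis.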
8.1 FIR filter basics A digital filter is a discrete-time linear time-invariant system that is typically used to amplify or suppress certain frequencies of a signal. Digital filters can be partitioned into two main classes: finite impulse response (FIR) filters and infinite impulse response (IIR) filters. FIR filters often result in more complex implementations than IIR filters for a given filter specification. However, FIR filters have other advantages, such as better phase characteristics, guaranteed stability, fewer finite-wordlength considerations and better pipelining abilities. The work in this thesis considers FIR filters only. 8.1.1 FIR filter definition An FIR filter [78] can be characterized by its impulse response h(k), a scalar discrete function of k. For a causal FIR filter of order N, h(k) = 0 when k < 0 or k > N, and the impulse response thus has at most N + 1 nonzero values. The impulse response defines the factors with which different delays of the input should be multiplied, and the output is the sum of these. Thus, denoting the input and output by x(n) and y(n), respectively, the behavior of an Nth order filter with impulse response h(k) is defined by

y(n) = \sum_{k=0}^{N} x(n - k) h(k).   (8.1)

It can be seen that when the input is the Kronecker delta function (x(n) = δ(n)), the output is the impulse response (y(n) = h(n)). 8.1.2 z-transform The impulse response describes the filter's behavior in the time domain, but does not say much about its behavior in the frequency domain. Thus, in order to analyze a filter's frequency characteristics, the z-transform H(z) of the impulse response h(k) is defined as

H(z) = \sum_{k=-\infty}^{\infty} h(k) z^{-k},   (8.2)

and the input and output are related by the equation

Y(z) = H(z) X(z).   (8.3)

The frequency response of the filter is then H(e^{jω}), which it is common to visualize as the magnitude response |H(e^{jω})| and the phase response arg H(e^{jω}).
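The difference equation (8.1) can be illustrated with a short sketch (the function name is illustrative; a causal input with x(n) = 0 for n < 0 is assumed):

```python
def fir_filter(h, x):
    """Apply an FIR filter with impulse response h to the sequence x.

    Implements y(n) = sum_{k=0}^{N} x(n-k) h(k), with x(n) = 0 for n < 0.
    """
    N = len(h) - 1  # filter order
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(N + 1):
            if n - k >= 0:
                acc += x[n - k] * h[k]
        y.append(acc)
    return y

# Feeding in a unit impulse returns the impulse response itself,
# as noted in the text for x(n) = δ(n).
h = [0.25, 0.5, 0.25]
impulse = [1.0, 0.0, 0.0, 0.0]
print(fir_filter(h, impulse))  # → [0.25, 0.5, 0.25, 0.0]
```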
It is also possible to visualize the filter's frequency characteristics by a pole-zero diagram. This is done by rewriting H(z) as

H(z) = \sum_{k=0}^{N} h(k) z^{-k} = \frac{\sum_{k=0}^{N} h(k) z^{N-k}}{z^N},   (8.4)

where the zeros and poles are the zeros of the numerator and the denominator, respectively. Since the denominator is a single power of z, all the poles are at the origin, making the filter unconditionally stable.

Figure 8.1 28th order linear phase FIR filter: (a) impulse response h(k), (b) magnitude response |H(e^{jω})|, (c) pole-zero diagram, (d) phase response arg H(e^{jω}).

8.1.3 Linear phase filters One of the advantages of FIR filters is the ability to design filters with linear phase. Linear phase filters ensure that the delay is the same for all frequencies, which is often desirable in data communication applications. A linear phase filter results when the impulse response coefficients are either symmetric or anti-symmetric around the middle, and such filters are commonly designed using the MPR algorithm (see Sec. 8.2). Consider the example design of a 28th order linear phase lowpass FIR filter shown in Fig. 8.1. The figure shows the impulse response h(k), the magnitude response |H(e^{jω})|, the pole-zero diagram of H(z) and the phase response arg H(e^{jω}). The linear phase property can be seen both in the coefficient symmetry of the impulse response in Fig. 8.1(a) and in the phase response in Fig. 8.1(d).

Figure 8.2 Rate change elements: (a) L-fold upsampler, (b) L-fold downsampler.

8.2 FIR filter design There are many ways to design an FIR filter.
In this section, some of the methods more commonly used when optimizing for a desired magnitude response are given. A desired real, non-negative magnitude response function H_d(e^{jω}) is given along with a weighting function W(e^{jω}). The optimization process then tries to design the filter H(e^{jω}), as defined by (8.2), to match H_d(e^{jω}) as closely as possible, with the allowed relative errors for different frequencies defined by the weighting function W(e^{jω}). Using a minimax goal function, the maximum weighted error over the frequency band is minimized for a given filter order, i.e.,

minimize \max_{ω ∈ [0,2π]} W(e^{jω}) |H(e^{jω}) − H_d(e^{jω})|.   (8.5)

In certain cases, it is more beneficial to minimize the error energy rather than the maximum error. This leads to a least-squares optimization goal, which can be defined as

minimize \int_0^{2π} W(e^{jω}) |H(e^{jω}) − H_d(e^{jω})|^2 dω.   (8.6)

In this thesis, the McClellan-Parks-Rabiner (MPR) algorithm [84, 85] has been used for the design of optimal minimax FIR filters. 8.3 Multirate signal processing 8.3.1 Sampling rate conversion In a system utilizing multiple sampling rates, rate changes are modelled by the upsampler and downsampler elements shown in Fig. 8.2. The L-fold upsampler inserts zeros into the input stream, such that

y(n) = x(n/L) for n = 0, ±L, ±2L, . . ., and y(n) = 0 otherwise.   (8.7)

In the z-domain, (8.7) becomes

Y(z) = X(z^L),   (8.8)

and the spectrum of y(n) thus consists of L copies of the spectrum of x(n).

Figure 8.3 Conventional interpolation and decimation of a signal: (a) L-fold interpolation, (b) L-fold decimation.

The L-fold downsampler retains every Lth sample of its input and discards the others, such that

y(m) = x(mL).   (8.9)

In the z-domain, (8.9) becomes

Y(z) = \frac{1}{L} \sum_{l=0}^{L−1} X(e^{−j2πl/L} z^{1/L}),   (8.10)

and the output spectrum thus consists of L stretched and shifted versions of the input spectrum.
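The rate change elements of (8.7) and (8.9) are straightforward to sketch; a minimal illustration with illustrative function names:

```python
def upsample(x, L):
    """L-fold upsampler, (8.7): insert L-1 zeros after every input sample."""
    y = []
    for s in x:
        y.append(s)
        y.extend([0] * (L - 1))
    return y

def downsample(x, L):
    """L-fold downsampler, (8.9): keep every Lth sample, y(m) = x(mL)."""
    return x[::L]

assert upsample([1, 2, 3], 2) == [1, 0, 2, 0, 3, 0]
assert downsample([1, 0, 2, 0, 3, 0], 2) == [1, 2, 3]
```

Downsampling an upsampled sequence by the same factor recovers the original signal, which is why the two elements are always paired with a filter when used for interpolation or decimation.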
It can be seen that both the upsampler and the downsampler affect the spectrum of the input: the output of the upsampler contains replicas of the input spectrum, whereas the output of the downsampler contains aliasing. Thus, when the upsampler and downsampler are used to interpolate or decimate a signal, the signal must also be filtered, to remove the replicas (for the upsampler) or the aliasing (for the downsampler). Typically, the filter is a lowpass filter, but it can also be a highpass filter or any bandpass filter with a bandwidth of at most 2π/L. For the interpolator, the filtering is performed on the output as shown in Fig. 8.3(a), whereas it is performed on the input for the decimator as shown in Fig. 8.3(b). For both the interpolator and the decimator, it can be seen that the filtering is performed at the higher sampling rate. It can also be noted that the filtering is unnecessarily complex in both cases. For the interpolation filter, it is known that L − 1 of L input samples are zero and do not contribute to the output. For the decimation filter, L − 1 of L output samples are computed, but then discarded by the downsampler. In Sec. 8.3.3, it is shown how this knowledge can be used to reduce the complexities of the filters. 8.3.2 Polyphase decomposition For a filter h(k) with z-transform H(z), (8.2) can be rewritten as

H(z) = \sum_{k=−∞}^{∞} h(k) z^{−k} = \sum_{l=0}^{L−1} z^{−l} \sum_{k=−∞}^{∞} h(Lk + l) z^{−Lk} = \sum_{l=0}^{L−1} z^{−l} H_l(z^L),   (8.11)

where H_l(z) is the z-transform of the lth polyphase component

h_l(k) = h(Lk + l), l = 0, 1, . . . , L − 1.   (8.12)

Figure 8.4 Noble identities.

This decomposition is usually called the Type-1 polyphase decomposition [94], and is often used to parallelize the implementations of FIR filters [78].
It is also normally used in the implementations of FIR rate change filters, as it allows the implementation to retain the arithmetic complexity while running at a lower sampling rate. 8.3.3 Multirate sampling rate conversion The upsampler and downsampler are time-varying operations [94], and therefore operations can in general not be moved through them. However, the Noble identities [94] allow filters and rate change operations to exchange places under certain conditions. The Noble identities are shown in Fig. 8.4. Using polyphase decomposition and the Noble identities, the complexities of the interpolator and decimator in Fig. 8.3 can be reduced significantly. The transformation of the interpolator is shown in Fig. 8.5, where the interpolation filter H(z) is first polyphase decomposed with a factor of L. The resulting subfilters H_l(z) have the same total complexity as the original filter H(z). Then the upsampler is moved to after each subfilter, reducing the sampling frequencies of the filters by a factor of L. Finally, analyzing the structure at the output, it can be seen that only one branch is non-zero at any given time, and the output structure can therefore be replaced by a commutator between the branches. The transformation of the decimator is shown in Fig. 8.6 and is analogous. 8.4 FIR filter architectures 8.4.1 Conventional FIR filter architectures The most straightforward implementation of an FIR filter is the direct form architecture, which is a direct realization of the filter definition (8.1). The architecture is shown in Fig. 8.7, and has a delay line for the input signal x(n), forming a number of delayed input signals x(n − k). The delay line is tapped at a number of positions, where the delayed input is scaled by the appropriate impulse response coefficient, and all products are then summed together to form the output y(n).
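The polyphase decimator transformation (Fig. 8.6) can be checked numerically: filtering at the high rate and then downsampling gives the same output as running the polyphase components h_l(k) = h(Lk + l) at the low rate on a commutated input. A sketch with illustrative names:

```python
def fir(h, x):
    """Causal FIR filtering, y(n) = sum_k h(k) x(n-k), x(n) = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def decimate_direct(h, x, L):
    """Filter at the high rate, then discard L-1 of every L outputs (Fig. 8.3(b))."""
    return fir(h, x)[::L]

def decimate_polyphase(h, x, L):
    """Polyphase decimator (Fig. 8.6): each subfilter h_l runs at the low rate
    on a commutated input stream, and the branch outputs are summed."""
    y = None
    for l in range(L):
        hl = h[l::L]                      # Type-1 polyphase component (8.12)
        # Branch l sees x(mL - l); the commutator distributes the input samples.
        xl = [x[m * L - l] if m * L - l >= 0 else 0 for m in range(len(x) // L)]
        yl = fir(hl, xl)
        y = yl if y is None else [a + b for a, b in zip(y, yl)]
    return y

h = [1, 2, 3, 4, 5, 6]
x = list(range(1, 13))
assert decimate_polyphase(h, x, 3) == decimate_direct(h, x, 3)
```

The two realizations are arithmetically identical, but the polyphase form performs all multiplications at the low output rate.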
The transposed direct form structure can be obtained from the direct form structure using the transpose operation [78], which for systems with a single input and output results in an arithmetically equivalent system. The transposed direct form architecture is shown in Fig. 8.8. The transposed direct form may lend itself better to high-speed implementations, as the critical path through the adders at the output of the direct form structure may become significant and require extra pipelining registers.

Figure 8.5 Interpolator realized using polyphase decomposition: (a) conventional interpolation, (b) polyphase decomposition of the filter, (c) moving the upsamplers to after the subfilters, (d) replacing the output structure by a commutator.

8.4.2 High-speed FIR filter architecture In the case of implementations with high speed and short wordlengths, it may be more beneficial to consider realizations on the bit level rather than on the word level. One such architecture is described here, and it is also used as a base for the work in Chapter 12. First, the detailed derivation of a decimation filter on direct form is described, and then the resulting architecture for a transposed direct form realization is shown. Finally, the realizations of single-rate and interpolation filters are discussed. Consider the structure in Fig. 8.9(a), corresponding to the arithmetic part of the multirate decimation filter in Fig. 8.6. The system has L inputs x_0(n), . . . , x_{L−1}(n), L independent subfilters H_0(z), . . . , H_{L−1}(z), and one output y(n).

Figure 8.6 Decimator realized using polyphase decomposition: (a) conventional decimation, (b) polyphase decomposition of the filter, (c) moving the downsamplers to before the subfilters, (d) replacing the input structure by a commutator.

Figure 8.7 Direct form FIR filter architecture.

Figure 8.8 Transposed direct form FIR filter architecture.

Figure 8.9 High-speed realization of an FIR decimation filter on direct form using partial product generation and carry-save adders: (a) multirate filter structure, (b) direct form realization of the subfilters, (c) realization using partial product generation and a carry-save adder tree.

Figure 8.10 High-speed architecture of an FIR decimation filter on transposed direct form.

In Fig. 8.9(b), the subfilters have been realized on direct form. Now, instead of directly computing all products of the inputs and coefficients, partial products corresponding to the non-zero bits of the coefficients are generated. The detailed generation of the partial products depends on the representation of the data and the coefficients, and is described in Sec. 12.1.2. However, the result is a number of partial products with different bit weights, represented by the partial product vectors in Fig. 8.9(c), where w_d is the input data wordlength, N_m is the number of partial products of bit weight m and w is the wordlength of y(n). Typically, N_m = 0 for some m.
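A hedged sketch of the bit-level principle in Fig. 8.9(c): each nonzero coefficient bit contributes one shifted copy of its input word (a partial product), and the partial products are summed in carry-save form before a final merge. The 3:2 compressor tree below is a generic reduction, not the optimized tree of Chapter 12; unsigned integers are assumed for simplicity (signed data is handled in Sec. 12.1.2), and all names are illustrative:

```python
def partial_products(x_words, coeffs):
    """One (value, weight) pair per set coefficient bit: contribution x * 2^m."""
    pps = []
    for x, c in zip(x_words, coeffs):
        m = 0
        while c:
            if c & 1:
                pps.append((x, m))   # bit weight m → contribution x << m
            c >>= 1
            m += 1
    return pps

def csa(a, b, c):
    """3:2 carry-save compressor: a + b + c == sum_vector + carry_vector."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def csa_tree(operands):
    """Reduce the operand list with 3:2 compressors until two vectors remain."""
    ops = list(operands)
    while len(ops) > 2:
        s, carry = csa(ops.pop(), ops.pop(), ops.pop())
        ops.extend([s, carry])
    return ops

x_words = [13, 7, 200]
coeffs = [5, 3, 9]                       # binary 101, 011, 1001
pps = [x << m for x, m in partial_products(x_words, coeffs)]
s, c = csa_tree(pps)                     # still-redundant two-vector form
assert s + c == sum(x * w for x, w in zip(x_words, coeffs))  # VMA result
```

The final addition s + c plays the role of the vector merge adder: the redundant two-vector output of the tree resolves to the exact sum of products.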
The partial products are merged in a carry-save adder tree, which results in a still-redundant output of two vectors of length w_i ≤ w. Finally, the vectors are merged by a vector merge adder into the non-redundant output y(n). A similar architecture on transposed direct form is shown in Fig. 8.10. In the transposed direct form architecture, the delay elements are at the output, and the partial product generation thus consists of several partitions of partial products corresponding to different delays. Each partition is reduced by a carry-save adder tree to a redundant form of two vectors in order to reduce the number of registers required. The described architectures can also be used for single-rate filters, simply by letting L = 1. It is also possible to realize an interpolation filter (Fig. 8.5) as L independent branches. However, for a direct form architecture, the delay elements of the input can typically be shared between the branches. 9 Sigma-delta data converters In this chapter, a brief introduction to Σ∆-ADCs is given. The basics of Σ∆-modulators are given in Sec. 9.1: signal and noise transfer functions of the modulator are defined, and expressions for the resulting quantization noise power and modulator SNR are obtained. In Sec. 9.2, some common structures of Σ∆-modulators are shown. In Sec. 9.3, ADC systems using several Σ∆-modulators in parallel are introduced. Finally, Sec. 9.4 gives some notes on implementations of decimation filters for Σ∆-ADCs. More in-depth analyses of Σ∆-ADCs are given in, e.g., [21, 81, 88]. 9.1 Sigma-delta data conversion 9.1.1 Sigma-delta ADC overview The Σ∆-ADC is today often the preferred architecture for realizing low- to medium-speed analog-to-digital converters (ADCs) with effective resolution above 12 bits.
Higher resolution than this is difficult to achieve for non-oversampled ADCs without laser trimming or digital error correction, since the device matching errors of semiconductor processes limit the accuracy of critical analog components [88]. The Σ∆-ADC can overcome this problem by combining the speed advantage of analog circuits with the robustness and accuracy of digital circuits.

Figure 9.1 Model of a first order Σ∆-ADC.

Figure 9.2 Linear model of the first order Σ∆-modulator in Fig. 9.1.

Through oversampling and noise shaping, the Σ∆-modulator converts precise signal waveforms to an oversampled digital sequence where the information is localized in a narrow frequency band in which the quantization noise is heavily suppressed. One of the prices to pay for these advantages is the required digital decimator operating on high sampling rate data. For Nyquist-rate CMOS ADCs, the power consumption increases by approximately a factor of four when the resolution is increased by one bit [93]. Hence, the power consumption of accurate Nyquist-rate ADCs tends to become very high. On the other hand, the analog circuitry of oversampled Σ∆-modulators does not need to be accurate, and power savings can therefore be made, in particular for continuous-time Σ∆-modulators [53]. A model of a first order Σ∆-ADC is shown in Fig. 9.1. The differentiator at the input subtracts the value of the previous conversion from the input, and the difference is integrated. Following the integration, the signal is sampled and quantized by a local ADC. The result is that the error of the previous conversion is subtracted from the next conversion, thus causing a suppression of the quantization error at low frequencies. Instead, the error appears at higher frequencies.
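The operation of the first order modulator can be illustrated with a small behavioral simulation, assuming a single-bit (±1) quantizer; the function name and the constant test input are illustrative:

```python
def sigma_delta_1st(x):
    """Behavioral first order Σ∆-modulator: integrator plus 1-bit quantizer,
    with the quantized output fed back and subtracted from the input."""
    integ, y = 0.0, []
    for s in x:
        integ += s - (y[-1] if y else 0.0)     # integrate the feedback error
        y.append(1.0 if integ >= 0 else -1.0)  # single-bit quantizer
    return y

# For a constant input of 0.25, the low-frequency content (here, the running
# average) of the ±1 bitstream tracks the input value.
bits = sigma_delta_1st([0.25] * 4000)
avg = sum(bits[1000:]) / len(bits[1000:])
assert abs(avg - 0.25) < 0.01
```

The bitstream itself is full-scale noise, but its average converges to the input: the quantization error is pushed to high frequencies, exactly the behavior the decimation filter then exploits.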
In the digital domain, a decimation filter can efficiently remove the noise, resulting in a signal that can achieve high resolution. However, in order to achieve high resolution, the sampling frequency has to be significantly higher than the signal bandwidth. This often causes the power dissipation of the digital decimation filter to constitute a major part of the power dissipation of the ADC. 9.1.2 Sigma-delta modulators When analyzing the performance of a Σ∆-modulator, the input signal and the quantization noise are assumed to be uncorrelated. With this assumption, the Σ∆-modulator in Fig. 9.1 can be modelled by the structure in Fig. 9.2. Using this model, a signal transfer function F(z) and a noise transfer function G(z) are defined such that the output y(n) can be written

Y(z) = F(z)X(z) + G(z)E(z),   (9.1)

where X(z), E(z) and Y(z) are the z-transforms of x(n), e(n) and y(n), respectively. In the linear model of the first order Σ∆-modulator, the signal and noise transfer functions are given by

F(z) = z^{−1}   (9.2)
G(z) = 1 − z^{−1}.   (9.3)

Thus, the signal is passed unaffected by the modulator, whereas the quantization noise is high-pass shaped. It is also possible to construct higher order modulators, where the noise transfer function has a stronger high-pass shape. Examples of the structures used in this thesis are included in Sec. 9.2. A straightforward design of a Kth order modulator yields the signal and noise transfer functions

F(z) = z^{−K}   (9.4)
G(z) = (1 − z^{−1})^K.   (9.5)

The noise transfer function G(z) is shown in Fig. 9.3 for Σ∆-modulators of different orders. It is seen that, whereas the higher order modulators have a higher suppression of the noise at lower frequencies, this advantage comes at the cost of increased noise amplification at higher frequencies. This increases the requirements on the digital decimation filter.
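From (9.5), the noise transfer function magnitude of a Kth order modulator on the unit circle is |G(e^{jω})| = (2 sin(ω/2))^K, which directly exhibits the trade-off visible in Fig. 9.3; a quick numerical check (function name illustrative):

```python
import math

def ntf_mag(w, k):
    """|G(e^{jω})| = |1 - e^{-jω}|^k = (2 sin(ω/2))^k for a kth order modulator."""
    return (2.0 * math.sin(w / 2.0)) ** k

# Low in-band frequency: higher order gives stronger noise suppression.
assert ntf_mag(0.01, 2) < ntf_mag(0.01, 1)
# Near ω = π: higher order amplifies the noise, up to |G| = 2^k.
assert ntf_mag(math.pi, 2) > ntf_mag(math.pi, 1)
assert abs(ntf_mag(math.pi, 3) - 8.0) < 1e-12
```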
Higher order modulators also have other considerations, including stability problems that reduce the allowed swing of the input signal, and analog matching problems. However, these are not further discussed in this thesis. 9.1.3 Quantization noise power It is common to regard the quantization noise e(n) as a white stochastic process with a uniform distribution in the interval [−Q/2, Q/2], where Q is the quantization step. Assuming uniform quantization in the range [−1, 1],

Q = \frac{2}{2^B − 1},   (9.6)

where B is the number of bits used in the quantization. Let R_{ee}(e^{jωT}) denote the power spectral density (PSD) of the quantization noise. Then,

R_{ee}(e^{jωT}) = \frac{Q^2}{12} = \frac{1}{3(2^B − 1)^2},   (9.7)

and the noise power at the output of the modulator can be written

R_{y_e y_e}(e^{jωT}) = R_{ee}(e^{jωT}) |G(e^{jωT})|^2 = \frac{4^K}{3(2^B − 1)^2} \sin^{2K}\left(\frac{ωT}{2}\right).   (9.8)

Figure 9.3 Frequency response of the noise transfer function G(z) for Σ∆-modulators of order zero to four.

The assumption that e(n) is independent of x(n) is in reality not well motivated. On the contrary, e(n) is highly dependent on x(n), which causes the appearance of idle tones in the output spectrum. In addition, the assumption that the quantization noise is white with a uniform distribution is also not well motivated. In reality, the input is slowly changing due to being band-limited, which causes the quantization noise to be highly correlated in time. Both of these effects are well pronounced for single-bit Σ∆-modulators, which are among the most commonly used in practice. In Chapter 11, an alternative analysis method is presented, in which the actual quantization noise for a specific input signal is used to evaluate the performance of different Σ∆-modulators.
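Combining (9.7) and (9.8), the in-band noise power can be integrated numerically, which also reproduces the oversampling gain of approximately 3 + 6K dB/octave quoted for Fig. 9.4. A sketch with illustrative names, assuming a unit-amplitude sine input (σ_x² = 1/2) and a single-bit quantizer (B = 1):

```python
import math

def snr_db(K, osr, B=1, steps=20000):
    """SNR of a Kth order modulator: signal power 1/2 over the in-band
    integral of the shaped noise PSD (9.8), evaluated by midpoint rule."""
    ree = 1.0 / (3.0 * (2 ** B - 1) ** 2)      # white-noise PSD, (9.7)
    dw = (math.pi / osr) / steps
    noise = 0.0
    for i in range(steps):
        w = (i + 0.5) * dw
        noise += ree * (2.0 * math.sin(w / 2.0)) ** (2 * K) * dw
    noise /= math.pi        # (1/2π) ∫ over [-π/L, π/L], integrand is even
    return 10.0 * math.log10(0.5 / noise)

# Doubling the OSR improves the SNR by roughly 3 + 6K dB.
K = 2
gain = snr_db(K, 64) - snr_db(K, 32)
assert abs(gain - (3 + 6 * K)) < 0.5   # ≈ 15 dB per octave for K = 2
```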
9.1.4 SNR estimation Assume that the Σ∆-modulator uses an oversampling ratio (OSR) of L, and thus that the input signal is bandlimited to the interval [−π/L, π/L]. The allowed input signal power depends on several things, including the modulator order, the number of quantization bits, and the overloading behavior of the modulator, but assume for simplicity that a sine wave of unit amplitude is used. Then the input signal power is σ_x² = 1/2. Since the signal transfer function is a pure delay, the output signal power is σ_{yx}² = σ_x². Now, the signal-to-noise ratio (SNR) at the output due to quantization noise can be written

SNR = \frac{σ_{yx}^2}{\frac{1}{2π} \int_{−π/L}^{π/L} R_{y_e y_e}(e^{jωT}) dω}.   (9.9)

Figure 9.4 shows (9.9) evaluated for single-bit Σ∆-modulators of different orders. The oversampling gain is 3 + 6K dB/octave for a Σ∆-modulator of order K.

Figure 9.4 SNR as a function of OSR for single-bit Σ∆-modulators of order zero to four.

9.2 Modulator structures In this thesis, three different Σ∆-modulator structures are considered. These are presented in this section, and are the first order, single-feedback Σ∆-modulator shown in Fig. 9.5, the second order, double-feedback Σ∆-modulator shown in Fig. 9.6, and the fourth order multi-stage Σ∆-modulator (MASH) shown in Fig. 9.7.

Figure 9.5 A first-order, single-feedback Σ∆-modulator.

Figure 9.6 A second-order, double-feedback Σ∆-modulator.

Figure 9.7 A multistage noise-shaping (MASH) Σ∆-modulator.

The signal and noise transfer functions of the first and second order Σ∆-modulators are given by (9.4) and (9.5), respectively. For the MASH Σ∆-modulator, the signal and noise transfer functions are given by

F_{MASH}(z) = z^{−2}   (9.10)
G_{MASH}(z) = 2(1 − z^{−1})^4.   (9.11)

9.3 Modulated parallel sigma-delta ADCs Traditionally, ADCs and DACs based on Σ∆-modulation have been used primarily for low-bandwidth, high-resolution applications such as audio, whose requirements make the architecture perfectly suited for the purpose. However, in later years, advancements in VLSI technology have allowed greatly increased clock frequencies, and Σ∆-ADCs with bandwidths of tens of MHz have been reported [19, 103]. This makes it possible to use Σ∆-ADCs in a wider context, for example in wireless communications. In order to reach such bandwidths with feasible resolutions, sampling frequencies in the GHz range are used. However, higher operating frequencies than this are hard to obtain with current technology. Even with reduced feature sizes, the increased performance of digital circuits comes from increased parallelism rather than increased operating frequencies. One way to further increase the bandwidth of a Σ∆-ADC is to use several modulators in parallel, where a part of the input signal is converted in each channel.

Figure 9.8 Architecture of a parallel Σ∆-ADC using time interleaving of the input signal.

Figure 9.9 Architecture of a parallel Σ∆-ADC using Hadamard modulation of the input signal.

Figure 9.10 Architecture of a parallel Σ∆-ADC using frequency-band decomposition of the input signal.

Several flavors of such Σ∆-ADCs have been proposed, and these can essentially be divided into four categories: time-interleaved modulators [34, 35], Hadamard modulators [35, 40, 41, 61, 62], frequency-band decomposed modulators [29, 32, 35] and multirate modulators based on block-digital filtering [28, 56, 58, 64].
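For the Hadamard-modulated category, the ±1 modulation sequences are taken from the rows of a Hadamard matrix. A sketch using the Sylvester construction, which is one standard way of building such matrices for sizes that are powers of two; the row orthogonality is what allows the demodulation at the output to separate the channels:

```python
def hadamard(n):
    """Sylvester Hadamard matrix of size n (n assumed to be a power of two)."""
    H = [[1]]
    while len(H) < n:
        # [[H, H], [H, -H]] doubling step of the Sylvester construction.
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

H = hadamard(4)
# Distinct rows are orthogonal; identical rows have inner product n.
for i in range(4):
    for j in range(4):
        dot = sum(a * b for a, b in zip(H[i], H[j]))
        assert dot == (4 if i == j else 0)
```

Note that this construction only yields sizes 2^k, in line with the restriction mentioned below that the number of channels must be one for which a Hadamard matrix is known to exist.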
In the time-interleaved modulator, shown in Fig. 9.8, samples are interleaved in time between the channels. Each modulator runs at the input sampling rate, with its input grounded between its consecutive samples. This is a simple scheme, but as the interleaving causes aliasing of the spectrum, the channels have to be carefully matched in order for the aliasing to cancel in the deinterleaving at the output. In a Hadamard modulator, shown in Fig. 9.9, the signal is modulated by sequences constructed from the rows of a Hadamard matrix. One advantage over the time-interleaved modulator is an inherent coding gain, which increases the dynamic range of the ADC [35], whereas a disadvantage is that the number of channels is restricted to sizes for which there exists a known Hadamard matrix. Another advantage, as is shown in Chapter 10, is the reduced sensitivity to mismatches in the analog circuitry. The third category of parallel modulators is the frequency-band decomposed modulators, shown in Fig. 9.10, in which the signal is decomposed in frequency rather than in time. This scheme is insensitive to analog mismatches, but has increased hardware complexity because it requires the use of bandpass modulators. The idea of the multirate modulators is different, being based on a polyphase decomposition of the integrator in one channel, and is not considered further. 9.4 Data rate decimation A significant part of the complexity, and power dissipation, of a Σ∆-ADC is the digital decimation filter. This is especially true for the high-speed ADCs considered in this thesis, and for operation in the GHz range, designing the decimation filter is a significant challenge. Normally, the decimation filter is designed in several stages to relax the requirements on each stage and reduce the overall complexity [94, 100]. For the first stage, a common choice is the cascaded integrator-comb (CIC) filter (also known as the moving average filter).
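That the CIC filter is a moving average can be checked directly: the standard recursive accumulator-differentiator realization produces the same output as the direct N-tap average. A sketch with illustrative names:

```python
def cic_recursive(x, N):
    """Accumulator followed by differentiator: H(z) = (1/N)(1 - z^-N)/(1 - z^-1)."""
    acc, accs = 0, []
    for s in x:
        acc += s                      # running accumulator
        accs.append(acc)
    # Differentiate with lag N and scale by 1/N.
    return [(accs[n] - (accs[n - N] if n >= N else 0)) / N for n in range(len(x))]

def moving_average(x, N):
    """Direct N-tap form: H(z) = (1/N) sum_{i=0}^{N-1} z^-i."""
    return [sum(x[max(0, n - N + 1):n + 1]) / N for n in range(len(x))]

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
assert cic_recursive(x, 4) == moving_average(x, 4)
```

The recursive form needs only two adders regardless of N, which is the source of the low arithmetic complexity discussed below; the price is the long accumulator wordlength.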
The transfer function of an N-tap (order N − 1) CIC filter is

H(z) = \frac{1}{N} \sum_{i=0}^{N−1} z^{−i} = \frac{1}{N} \cdot \frac{1 − z^{−N}}{1 − z^{−1}}.   (9.12)

If the number of taps, N, is selected as N = M, where M is the decimation rate, the CIC filter has zeros on the unit circle at 2iπ/M rad for i = 1, 2, . . . , M − 1, i.e., at the angles that are folded to 0 during the decimation. To increase the attenuation, several CIC filters are cascaded. The main reason for the popularity of the CIC filter is the direct implementation of (9.12) as a cascade of an accumulator and a differentiator [20, 51], depicted in Fig. 9.11. This implementation has low arithmetic complexity, as a cascade of L CIC filters requires only 2L adders. However, the wordlength of the accumulators in a CIC filter is significantly longer than the input wordlength, which may lead to problems obtaining high throughput, as the accumulators operate at the input sampling rate. This can be alleviated by the use of redundant arithmetic in the accumulators [76] or by parallelizing the accumulators to operate at a lower sampling rate. However, this increases the complexity of the implementation, which makes it less attractive for operation at high speed.

Figure 9.11 Direct implementation of (9.12) as a cascaded accumulator and differentiator.

Figure 9.12 General architecture of a decimation filter implemented using polyphase decomposition and partial product reduction with a carry-save adder tree.

For decimation filters with short wordlengths and high throughput requirements, it may be significantly more efficient to compute the impulse response of the cascaded filters and realize the resulting linear-phase FIR filter using polyphase decomposition [1, 42, 45, 67]. The realization using a direct form structure is described in Sec. 8.3, and a general architecture of a decimation filter implemented using polyphase decomposition is shown in Fig. 9.12. Another advantage of this architecture is the ability to use general FIR filters, allowing an arbitrary set of filter coefficients optimized for a suitable cost function. This is especially useful when using higher order Σ∆-modulators with arbitrary zero positioning for the noise transfer function.

10 Parallel sigma-delta ADCs In this chapter, a multi-rate formulation of a system of parallel Σ∆-ADCs is presented. The channels use modulation of the input sequence to achieve parallelism, and the formulation also allows the Σ∆-modulators to differ between the channels. The formulation is used to analyze a system's sensitivity to mismatches between the channels, including channel gain, general Σ∆-modulator mismatches and modulation sequence errors. It is shown that certain systems can become noise-free with limited calibration of a subset of the channels, whereas others may require full calibration of all channels. The multi-rate formulation is also used as a tool to devise a three-channel parallel ADC that is insensitive to gain mismatches in the channels. In Sec. 10.1, the proposed multi-rate formulation is presented, and the resulting signal transfer function of the parallel system is derived. In Sec. 10.2, the model of one channel of the parallel system is described. The model includes analog imperfections such as modulator non-idealities, channel gain and offset errors, and modulation sequence mismatches. In Sec. 10.3, the formulation is used to analyze the sensitivities of a time-interleaved ADC, a Hadamard-modulated ADC and a frequency-band decomposed ADC, and the new insensitive modulation scheme is presented. Finally, the quantization noise characteristics of a parallel modulated system are discussed briefly in Sec. 10.4. This chapter has been published in [16].
Figure 10.1 ADC system using parallel Σ∆-modulators and modulation sequences.

10.1 Linear system model The system in Fig. 10.1 is considered. In this scheme, the input signal x(n) is first divided into N channels. In each channel k, k = 0, 1, . . . , N − 1, the signal is first modulated by the M-periodic sequence a_k(n) = a_k(n + M). The resulting sequence is then fed into a Σ∆-modulator Σ∆_k, followed by a digital filter G_k(z). The output of the filter is modulated by the M-periodic sequence b_k(n) = b_k(n + M), which produces the channel output sequence y_k(n). Finally, the overall output sequence y(n) is obtained by summing all channel output sequences. The Σ∆-modulator in each channel works in the same way as an ordinary Σ∆-modulator. By increasing the channel oversampling, and reducing the passband width of the channel filter accordingly, more of the shaped noise is removed and the resolution is increased. By using several channels in parallel, wider signal bands can be handled without increasing the input sampling rate to unreasonable values. In other words, instead of using one single Σ∆-ADC with a very high input sampling rate, a number of Σ∆-ADCs in parallel provide essentially the same resolution but with a reasonable input sampling rate. Traditional parallel systems that are covered by the proposed model include the time-interleaved, Hadamard-modulated and frequency-band decomposed ADCs discussed in Sec. 9.3. The overall output y(n) is determined by the input x(n), the signal transfer function of the system, and the quantization noise generated in the Σ∆-modulators. Using a linear model for analysis, the signal input-to-output relation and the noise input-to-output relation can be analyzed separately.
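A small numeric check of the key modelling step used in the analysis of Sec. 10.1.1: multiplying a signal by an M-periodic sequence such as a_k(n) is equivalent to splitting the signal into M interleaved branches and scaling each branch by a constant. Names and values are illustrative:

```python
M = 3
a = [2, -1, 5]                      # one period of a(n) = a(n + M)
x = [7, 1, 4, 3, 8, 2, 6, 9, 0]

# Direct modulation: y(n) = a(n) x(n) with a(n) periodically extended.
direct = [a[n % M] * x[n] for n in range(len(x))]

# Branch model: branch r holds the samples with n ≡ r (mod M), each scaled
# by the constant a(r); summing (re-interleaving) the branches restores y(n).
branch = [0] * len(x)
for r in range(M):
    for m in range(len(x) // M):
        branch[m * M + r] = a[r] * x[m * M + r]

assert branch == direct
```

This per-branch constant-gain view is what allows the time-varying modulation to be absorbed into a time-invariant multirate structure with downsamplers, constant multipliers and upsamplers.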
The signal transfer function from x(n) to y(n) should equal (or at least approximate) a delay in the frequency band of interest. The main problem in practice is that the overall scheme is subject to channel gain, offset, and modulation sequence level mismatches [4, 35, 57]. This is where the new general formulation becomes useful, as it gives a relation between the input and output from which one can easily deduce a particular scheme's sensitivity to mismatch errors. The noise contribution, on the other hand, is essentially unaffected by channel mismatches. Therefore, the noise analysis can be handled in the traditional way, as in Sec. 10.4.

Figure 10.2 Equivalent signal transfer models of a channel of the parallel system in Fig. 10.1: (a) model of channel, (b) polyphase decomposition of multipliers, (c) multirate formulation of a channel.

Figure 10.3 Path from x_q(m) to y_{k,r}(m) in channel k as depicted in Fig. 10.2(b).

10.1.1 Signal transfer function

From the signal input-to-output point of view, the system depicted in Fig. 10.2(a) is obtained for channel k. Here, each H_k(z) represents a cascade of the corresponding signal transfer function of the Σ∆-modulator and the digital filter G_k(z). To derive a useful input-output relation in the z-domain, multirate filter bank theory [94] is used. As a_k(n) and b_k(n) are M-periodic sequences, each multiplication can be modelled as M branches with constant multiplications and with the samples interleaved between the branches. This is shown in the structure in Fig.
10.2(b), where

a_{k,n} = a_k(0) for n = 0, and a_{k,n} = a_k(M − n) for n = 1, 2, . . . , M − 1,   (10.1)
b_{k,n} = b_k(M − 1 − n) for n = 0, 1, . . . , M − 1.   (10.2)

Now, consider the system shown in Fig. 10.3, representing the path from x_q(m) to y_{k,r}(m) in Fig. 10.2(b). Denoting

H̃_k(z) = z^{M−1} H_k(z),   (10.3)

the transfer function from x_q(m) to y_{k,r}(m) is given by the first polyphase component in the polyphase decomposition of z^q H̃_k(z) z^{−r}, scaled by a_{k,q} b_{k,r}. For p = q − r = 0, 1, . . . , M − 1, the polyphase decomposition of z^p H̃_k(z) can be written

z^p H̃_k(z) = Σ_{i=0}^{M−1} z^{p−i} H̃_{k,i}(z^M),   (10.4)

and the first polyphase component is H̃_{k,p}(z), i.e., the pth polyphase component of H̃_k(z) as specified by the Type 1 polyphase representation in Sec. 8.3. For p = −M + 1, . . . , −1,

z^p H̃_k(z) = Σ_{i=0}^{M−1} z^{p−i+M} z^{−M} H̃_{k,i}(z^M)   (10.5)

and the first polyphase component is z^{−1} H̃_{k,p+M}(z). Returning to the system in Fig. 10.2(b), the transfer functions P_k^{r,q}(z) from x_q(m) to y_{k,r}(m) can now be written

P_k^{r,q}(z) = b_{k,r} H̃_{k,q−r}(z) a_{k,q} for q ≥ r,
P_k^{r,q}(z) = b_{k,r} z^{−1} H̃_{k,q−r+M}(z) a_{k,q} for q < r.   (10.6)

The relations can be written in matrix form as

P_k(z) =
[ a_{k,0} b_{k,0} H̃_{k,0}(z)              a_{k,1} b_{k,0} H̃_{k,1}(z)              · · ·   a_{k,M−1} b_{k,0} H̃_{k,M−1}(z)
  a_{k,0} b_{k,1} z^{−1} H̃_{k,M−1}(z)     a_{k,1} b_{k,1} H̃_{k,0}(z)              · · ·   a_{k,M−1} b_{k,1} H̃_{k,M−2}(z)
  a_{k,0} b_{k,2} z^{−1} H̃_{k,M−2}(z)     a_{k,1} b_{k,2} z^{−1} H̃_{k,M−1}(z)     · · ·   a_{k,M−1} b_{k,2} H̃_{k,M−3}(z)
  ...
  a_{k,0} b_{k,M−1} z^{−1} H̃_{k,1}(z)     a_{k,1} b_{k,M−1} z^{−1} H̃_{k,2}(z)     · · ·   a_{k,M−1} b_{k,M−1} H̃_{k,0}(z) ],   (10.7)

and it is thus obvious that one channel of the system can be represented by the structure in Fig. 10.2(c).

Figure 10.4 Equivalent representation of the system in Fig. 10.1 based on the equivalences in Fig. 10.2. P(z) is given by (10.8).

In the whole system (Fig. 10.1) a number of such channels are summed at the output, and the parallel system of N channels can be represented
by the structure in Fig. 10.4, where the matrix P(z) is given by

P(z) = Σ_{k=0}^{N−1} P_k(z).   (10.8)

For convenience, we write (10.7) as

P_k(z) = S_k · H̃_k(z),   (10.9)

where the multiplication is element-wise, and where H̃_k(z) and S_k are given by

H̃_k(z) =
[ H̃_{k,0}(z)              H̃_{k,1}(z)              · · ·   H̃_{k,M−1}(z)
  z^{−1} H̃_{k,M−1}(z)     H̃_{k,0}(z)              · · ·   H̃_{k,M−2}(z)
  z^{−1} H̃_{k,M−2}(z)     z^{−1} H̃_{k,M−1}(z)     · · ·   H̃_{k,M−3}(z)
  ...
  z^{−1} H̃_{k,1}(z)       z^{−1} H̃_{k,2}(z)       · · ·   H̃_{k,0}(z) ]   (10.10)

and

S_k =
[ a_{k,0} b_{k,0}      a_{k,1} b_{k,0}      · · ·   a_{k,M−1} b_{k,0}
  a_{k,0} b_{k,1}      a_{k,1} b_{k,1}      · · ·   a_{k,M−1} b_{k,1}
  ...
  a_{k,0} b_{k,M−1}    a_{k,1} b_{k,M−1}    · · ·   a_{k,M−1} b_{k,M−1} ],   (10.11)

respectively. (10.11) can equivalently be written as

S_k = b_k^T a_k,   (10.12)

where

a_k = [a_{k,0} a_{k,1} · · · a_{k,M−1}],   b_k = [b_{k,0} b_{k,1} · · · b_{k,M−1}],   (10.13)

and T stands for transpose. Examples of the S_k-matrices and the a_k- and b_k-vectors are provided for a time-interleaved system, a Hadamard-modulated system and a frequency-band decomposed system in Sec. 10.3.1, Sec. 10.3.2, and Sec. 10.3.3, respectively.

10.1.2 Alias-free system

With the system represented as above, it is known that it is alias-free, and thus time-invariant, if and only if the matrix P(z) is pseudocirculant [94]. Under this condition, the output z-transform becomes

Y(z) = H_A(z) X(z),   (10.14)

where

H_A(z) = z^{−M+1} Σ_{k=0}^{N−1} Σ_{i=0}^{M−1} s_k^{0,i} z^{−i} H̃_{k,i}(z^M) = Σ_{k=0}^{N−1} Σ_{i=0}^{M−1} s_k^{0,i} z^{−i} H_{k,i}(z^M),   (10.15)

with s_k^{0,i} denoting the elements on the first row of S_k. This case corresponds to a Nyquist sampled ADC, of which two special cases are the time-interleaved ADC [34, 58] and the Hadamard-modulated ADC [40]. These systems are also described in the context of the multirate formulation in Sec. 10.3.1 and Sec. 10.3.2, respectively. Regarding H̃_k(z), it is seen in (10.10) that it is pseudocirculant for an arbitrary H̃_k(z).
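The mapping from the M-periodic time sequences to the vectors a_k, b_k in (10.1)–(10.2) and the matrix S_k = b_k^T a_k in (10.12) can be sketched numerically (numpy assumed; the helper names are mine, not from the thesis). The assertions reproduce the time-interleaved vectors listed in (10.24) of Sec. 10.3.1.

```python
import numpy as np

def ab_vectors(a_seq, b_seq):
    """Map one period of a_k(n), b_k(n) to the vectors a_k, b_k of (10.13)."""
    M = len(a_seq)
    a = np.array([a_seq[0]] + [a_seq[M - n] for n in range(1, M)])  # (10.1)
    b = np.array([b_seq[M - 1 - n] for n in range(M)])              # (10.2)
    return a, b

def S_matrix(a_seq, b_seq):
    a, b = ab_vectors(a_seq, b_seq)
    return np.outer(b, a)                                           # (10.12)

# Time-interleaved sequences of period M = 4, cf. (10.23)-(10.24):
a0, b0 = ab_vectors([1, 0, 0, 0], [1, 0, 0, 0])
a1, b1 = ab_vectors([0, 1, 0, 0], [0, 1, 0, 0])
assert list(a0) == [1, 0, 0, 0] and list(b0) == [0, 0, 0, 1]
assert list(a1) == [0, 0, 0, 1] and list(b1) == [0, 0, 1, 0]
# S_0 = b_0^T a_0 is rank one with a single nonzero element:
assert S_matrix([1, 0, 0, 0], [1, 0, 0, 0])[3, 0] == 1
```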
It would thus be sufficient to make S_k circulant for each channel k in order to make each P_k(z) pseudocirculant and end up with a pseudocirculant P(z). Unfortunately, the set of circulant real-valued S_k achievable by the construction in (10.12) is limited, because the rank of S_k is one. However, for purposes of error cancellation between channels it is beneficial to group the channels into sets where the matrices within each set sum to a circulant matrix. The channel set {0, 1, . . . , N − 1} is thus partitioned into the sets C_0, . . . , C_{I−1}, where each sum Σ_{k∈C_i} S_k is a circulant matrix. It is assumed that the modulators and filters are identical for channels belonging to the same partition, H_k(z) = H_l(z) whenever k, l ∈ C_i, and thus H̃_k(z) = H̃_l(z). The matrix for partition i is denoted H̃_{0,i}(z). Sensitivity to channel mismatches is discussed further in Sec. 10.2.

10.1.3 L-decimated alias-free system

A system is denoted an L-decimated alias-free system if it is alias-free before decimation by a factor of L. A channel of such a system is shown in Fig. 10.5(a). Obviously, the decimation can be performed before the modulation, as shown in Fig. 10.5(b), if the index of the modulation sequence is scaled by a factor of L. Considering the equivalent system in Fig. 10.5(c), it is apparent that the downsampling by L can be moved to after the scalings by b_{k,l} if the delay elements z^{−1} are replaced by L-fold delay elements z^{−L}. The system may then be described as in Fig. 10.5(d), where P_k(z) is defined by (10.6). However, the outputs are taken from every Lth row of P_k(z), such that the first output y_{k,L−1 mod M}(m) is taken from row L, the second output y_{k,2L−1 mod M}(m) is taken from row (2L − 1 mod M) + 1, and so on. It is thus apparent that only rows gcd(L, M) · i, i = 0, 1, 2, . . . are used. The L-decimated system corresponds to an oversampled ADC.
The main observation that should be made is that the L-decimated system may be described in the same way as the critically sampled (non-oversampled) system, but that relaxations may be allowed on the requirements of the modulation sequences. As only a subset of the rows of P(z) are used, the matrix needs to be pseudocirculant only on these rows. As in the critically sampled case, the channel set {0, 1, . . . , N − 1} is partitioned into sets C_0, . . . , C_{I−1}, where the matrix Σ_{k∈C_i} S_k is circulant on the rows gcd(L, M) · i, i = 0, 1, 2, . . . , and H̃_k(z) = H̃_l(z) = H̃_{0,i}(z) when k, l ∈ C_i. The oversampled Hadamard-modulated system in [41] belongs to this category of the formulation, and another example of a decimated system is given in Sec. 10.3.4.

Figure 10.5 Channel model of L-decimated system: (a) decimation at output, (b) internal decimation, (c) polyphase decomposition of input and output, (d) multirate formulation of a channel, where y_{k,L}(m) denotes the output pertaining to the Lth row of P_k(z).

Figure 10.6 Channel model with nonideal analog circuits.

10.2 Sensitivity to channel mismatches

In this section, the channel model used for the sensitivity analysis is explained. Figure 10.6 shows one of the channels of a parallel Σ∆-ADC, where several nonidealities resulting from imperfect analog circuits have been included. Difficulties in realizing the exact values of the analog modulation sequence are modelled by an additive error term ε_k(n).
The error is assumed to be static, i.e., it depends only on the value of a_k(n), and is therefore a periodic sequence with the same periodicity as a_k(n). The time-varying error ε_k(n) may be a major concern when the modulation sequences contain non-trivial elements, i.e., elements that are not −1, 0, or 1. The trivial elements may be realized without a multiplier by exchanging, grounding, or passing through the inputs to the modulator, and are for this reason particularly attractive on the analog side. A channel-specific gain γ_k is included in the sensitivity analysis, and analog imperfections in the modulator are modelled as the transfer function ∆H_k(z). The modulator nonidealities, including channel gain, and the modulation sequence errors are analyzed separately in the context of the multirate formulation. In practice, there is also a channel offset δ_k which is not suitable for analysis in this context, as it is signal independent. Channel offsets are commented on in Sec. 10.2.4.

10.2.1 Modulator nonidealities

Assume that the ideal system is alias-free, i.e., the matrix P(z) = Σ_k P_k(z) is pseudocirculant. Due to analog circuit errors the transfer function of channel k deviates from the ideal H_k(z) to γ_k(H_k(z) + ∆H_k(z)), and H̃_k(z) is replaced by Ĥ_k(z) = γ_k(H_k(z) + ∆H_k(z))z^{M−1}. The transfer matrix for channel k thus becomes P̂_k(z) with elements

P̂_k^{j,i}(z) = b_{k,j} Ĥ_{k,i−j}(z) a_{k,i} for i ≥ j,
P̂_k^{j,i}(z) = b_{k,j} z^{−1} Ĥ_{k,i−j+M}(z) a_{k,i} for i < j,   (10.16)

where Ĥ_{k,p}(z) are the polyphase components of Ĥ_k(z). It is apparent that P̂_k(z) is pseudocirculant whenever P_k(z) is. Thus a system where all the S_k matrices are circulant is completely insensitive to modulator mismatches. In the general case, unfortunately, all S_k are not circulant, and Σ_k P̂_k(z) = Σ_k S_k · Ĥ_k(z) does not sum up to a pseudocirculant matrix, as the matrices Ĥ_k(z) are different between the channels. Partitioning the channel set into the sets C_i, as described in Sec.
10.1.1, and matching the modulators of channels belonging to the same partition C_i, i.e., defining γ_k = γ_l and ∆H_k(z) = ∆H_l(z) when k, l ∈ C_i, allows P̂(z) to be written

P̂(z) = Σ_{k=0}^{N−1} S_k · Ĥ_k(z) = Σ_{i=0}^{I−1} Ĥ_{0,i}(z) · Σ_{k∈C_i} S_k,   (10.17)

and it is apparent that each term in the outer sum is pseudocirculant, and thus that P̂(z) is as well. Thus the system is alias-free and non-linear distortion is eliminated.

10.2.2 Modulation sequence errors

It is assumed that the ideal system is alias-free, i.e., that P(z) = Σ_k P_k(z) is pseudocirculant. Due to difficulties in realizing the analog modulation sequence, the signal is modulated in channel k by the sequence â_k = a_k + ε_k rather than the ideal sequence a_k. We consider here different choices of the modulation sequences.

Bi-level sequence for an insensitive channel

Assume that an analog modulation sequence with two levels is used for an insensitive channel, i.e., S_k = b_k^T a_k is a circulant matrix. Examples of this type of channel include the first two channels of a Hadamard-modulated system. Assuming that the sequence errors ε_k depend only on a_k, i.e., ε_k(n_1) = ε_k(n_2) when a_k(n_1) = a_k(n_2), the modulation vector can be written â_k = α_k a_k + [β_k β_k · · · β_k] for some values of the scaling factor α_k and offset term β_k. The channel matrix P̂_k(z) for the channel modulated with the sequence â_k then becomes

P̂_k(z) = b_k^T (α_k a_k + [β_k β_k · · · β_k]) · H̃_k(z) = α_k S_k · H̃_k(z) + β_k B_k H̃_k(z),   (10.18)

where B_k is a diagonal matrix consisting of the elements of b_k. The first term is pseudocirculant, and thus the system is insensitive to modulation sequence scaling factors in channel k. The impact of the offset term β_k, i.e., the second term, is explained in Sec. 10.2.3.

Bi-level sequences for sensitive channels

Consider one of the subsets C_i in the partition of the channel set.
The sum of the S_k-matrices corresponding to the channels in the set, Σ_{k∈C_i} S_k, is a circulant matrix, whereas the constituent matrices are not. Examples of this type of channels are the time-interleaved systems and the Hadamard-modulated systems with more than two channels. As in the insensitive case, the modulation vectors are written â_k = α_k a_k + [β_k β_k · · · β_k], and the sum of the channel matrices for the channel subset becomes

Σ_{k∈C_i} P̂_k(z) = Σ_{k∈C_i} b_k^T (α_k a_k + [β_k β_k · · · β_k]) · H̃_k(z)
                 = H̃_{0,i}(z) · (Σ_{k∈C_i} α_k S_k) + Σ_{k∈C_i} β_k B_k H̃_k(z),   (10.19)

where B_k is a diagonal matrix consisting of the elements of b_k. The first sum is generally not a pseudocirculant matrix, and the channels are thus sensitive to sequence gain errors. If the gains are matched, denoting α_{0,i} = α_k = α_l when k, l ∈ C_i, the channel matrix sum may be written

Σ_{k∈C_i} P̂_k(z) = α_{0,i} H̃_{0,i}(z) · (Σ_{k∈C_i} S_k) + Σ_{k∈C_i} β_k B_k H̃_k(z),   (10.20)

and it is seen that the first term is a pseudocirculant matrix, and the channel set is alias-free. Again, the impact of the offset term β_k is explained in Sec. 10.2.3.

Multi-level sequences

If an insensitive channel is modulated with a multi-level sequence â_k = a_k + ε_k, the channel matrix becomes

P̂_k(z) = b_k^T (a_k + ε_k) · H̃_k(z) = S_k · H̃_k(z) + b_k^T ε_k · H̃_k(z),   (10.21)

which is pseudocirculant only if b_k^T ε_k is a circulant matrix. Systems with multi-level analog modulation sequences are thus sensitive to level errors.

10.2.3 Modulation sequence offset errors

Consider the modulation sequence offset errors introduced in Sec. 10.2.2. The channel matrix for a channel with a modulation sequence containing an offset error can be written as (10.18). Thus the error pertaining to the sequence offset is additive, and can be modelled as in Fig. 10.7.
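A small numerical check of the bi-level error analysis above (numpy assumed; a toy M = 4 insensitive channel with b_k = −a_k, as for the second Hadamard channel): a pure sequence scaling leaves S_k circulant, while the offset contribution is not circulant.

```python
import numpy as np

def is_circulant(S):
    # S is circulant if every row is its predecessor rotated one step right.
    return all(np.array_equal(np.roll(S[i - 1], 1), S[i]) for i in range(1, len(S)))

# Insensitive bi-level channel (toy, M = 4): vectors with b1 = -a1,
# so S1 = b1^T a1 is circulant.
a1 = np.array([1, -1, 1, -1])
b1 = -a1
assert is_circulant(np.outer(b1, a1))

# A pure scaling error a_hat = alpha*a1 gives S_hat = alpha*S1, still
# circulant, so the channel stays alias-free (Sec. 10.2.2) ...
alpha, beta = 0.97, 0.01
assert is_circulant(np.outer(b1, alpha * a1))

# ... whereas the offset term contributes beta * b1^T [1 1 ... 1], which is
# not circulant and causes the (filtered) aliasing discussed in Sec. 10.2.3.
offset_term = beta * np.outer(b1, np.ones(4))
assert not is_circulant(offset_term)
```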
The signal is thus first filtered through H_k(z) and then aliased by the system B_k, as B_k is not pseudocirculant unless the elements in the digital modulation sequence b_k are identical. However, as the signal is first filtered, only signal components in the passband of H_k(z) will cause aliasing. If the signal contains no information in this band, aliasing will be completely suppressed. Typically the signal has a guard band either at the low-frequency or high-frequency region to allow transition bands of the filters, and the modulator can then be suitably chosen as either a low-pass type or high-pass type [80], respectively. Errors pertaining to sequence offsets are demonstrated in Sec. 10.3.1.

Figure 10.7 Model of errors in a parallel system pertaining to sequence offsets.

10.2.4 Channel offset errors

Channel offsets must be removed for each channel in order not to overload the Σ∆-modulator. Offsets affect the system in a nonlinear way and may not be analyzed using the multirate formulation. However, the problem has been well investigated and numerous solutions exist [4, 58, 81].

10.3 Simulation results

In this section, the proposed multi-rate formulation is used to analyze the sensitivity to channel mismatch errors for some different systems. Results are provided for a time-interleaved ADC, a Hadamard-modulated ADC and a frequency-band decomposed ADC, as described in Sec. 9.3. Also, an example is provided of how the formulation can be used to derive a new architecture that is insensitive to channel matching errors.

10.3.1 Time-interleaved ADC

Consider a time-interleaved ADC with four channels. The samples are interleaved between the channels, each encompassing identical second-order low-pass modulators and decimation filters. Ideally, their z-domain transforms may be written

H_k(z) = H(z) = z^{−1} for −π/4 ≤ ωT ≤ π/4, and 0 otherwise.   (10.22)
All modulators are running at the input sampling rate, with their inputs grounded between consecutive samples. Thus the modulation sequences are

a_0(n) = b_0(n) = 1, 0, 0, 0, . . . ,
a_1(n) = b_1(n) = 0, 1, 0, 0, . . . ,
a_2(n) = b_2(n) = 0, 0, 1, 0, . . . ,
a_3(n) = b_3(n) = 0, 0, 0, 1, . . . ,   (10.23)

all periodic with period M = 4. The vectors a_k and b_k, as defined by (10.13), are

a_0 = b_3 = [1 0 0 0],
a_1 = b_0 = [0 0 0 1],
a_2 = b_1 = [0 0 1 0],
a_3 = b_2 = [0 1 0 0].   (10.24)

The matrices S_k, defined by (10.12), then each contain a single nonzero element: S_0 = b_0^T a_0 has its 1 in row 4, column 1; S_1 = b_1^T a_1 in row 3, column 4; S_2 = b_2^T a_2 in row 2, column 3; and S_3 = b_3^T a_3 in row 1, column 2. Because the sum of all S_k-matrices is a circulant matrix, the system is alias-free, and the transfer function of the system is given by (10.15) as

H_A(z) = s_3^{0,1} z^{−1} H_{3,1}(z^4) = z^{−1},   (10.25)

where H_{3,1}(z) = 1 is the second polyphase component in the polyphase decomposition of H(z). The transfer function is thus a simple delay, and the system will digitize the complete spectrum. As none of the S_k-matrices is circulant, and a circulant matrix can be formed only by summing all the matrices, the time-interleaved ADC requires matching of all channels in order to eliminate aliasing. Thus we define C_0 = {0, 1, 2, 3}, according to Sec. 10.1.2. The system has been simulated with modulator nonidealities and errors of bi-level sequences for sensitive channels, as described in Sec. 10.2. Figure 10.8(a) shows the output spectrum for the ideal case with no mismatches between channels (γ_k = 1 for all k). Applying 2% gain mismatch to one of the channels (γ_0 = 0.98, γ_1 = γ_2 = γ_3 = 1), the spectrum in Fig. 10.8(b) results, where the aliasing components can be clearly seen. In Fig. 10.8(c), the channel gains are set to one, and a 1% offset error has been added to the first modulation sequence (β_0 = 0.01, β_1 = β_2 = β_3 = 0), which results in aliasing. In Fig.
10.8(d), high-pass modulators have been used instead, and the distortions disappear, as predicted by the analysis in Sec. 10.2.3.

Figure 10.8 Example 1: Sensitivity of time-interleaved ADC. Output spectra for (a) the ideal system, (b) 2% gain mismatch in one channel, (c) 1% offset error in one modulation sequence, and (d) 1% offset error in one modulation sequence using high-pass modulators instead of low-pass modulators.

10.3.2 Hadamard-modulated ADC

Consider a non-oversampling Hadamard-modulated ADC with eight channels. In this case, every channel filter is an 8th-band filter (H_k(z) = H(z), k = 0, . . . , 7) and the modulation sequences a_k(n) and b_k(n) are

a_0(n) = b_0(n) = 1, 1, 1, 1, 1, 1, 1, 1, . . . ,
a_1(n) = b_1(n) = 1, −1, 1, −1, 1, −1, 1, −1, . . . ,
a_2(n) = b_2(n) = 1, 1, −1, −1, 1, 1, −1, −1, . . . ,
a_3(n) = b_3(n) = 1, −1, −1, 1, 1, −1, −1, 1, . . . ,
a_4(n) = b_4(n) = 1, 1, 1, 1, −1, −1, −1, −1, . . . ,
a_5(n) = b_5(n) = 1, −1, 1, −1, −1, 1, −1, 1, . . . ,
a_6(n) = b_6(n) = 1, 1, −1, −1, −1, −1, 1, 1, . . . ,
a_7(n) = b_7(n) = 1, −1, −1, 1, −1, 1, 1, −1, . . . .   (10.26)

The vectors a_k and b_k follow from (10.26) via (10.1) and (10.2); in particular a_0 = b_0, a_1 = −b_1, a_2 = b_3 and a_3 = −b_2. With S_k = b_k^T a_k as in (10.12), the matrices S_k can be computed (10.27). It is seen that S_0 and S_1 are circulant matrices. Also, S_2 + S_3 is circulant. Further, the remaining matrices sum to a circulant matrix S_4 + S_5 + S_6 + S_7, whereas no smaller subset does. Thus, in order to eliminate aliasing, the channels are partitioned into the sets C_0 = {0}, C_1 = {1}, C_2 = {2, 3} and C_3 = {4, 5, 6, 7}. The Hadamard-modulated ADC thus contains both insensitive channels, 0 and 1, and sensitive channels, 2, . . . , 7. Using the model of the ideal system, the spectrum of the output signal is as shown in Fig. 10.9(a). Figure 10.9(b) shows the output spectrum for the system with 1% random gain mismatch (γ_k ∈ [0.99, 1.01]), where the aliasing distortions are readily seen. Matching the gains of the C_2-channels to each other (setting γ_2 = γ_3) and the gains of the C_3-channels to each other (setting γ_4 = γ_5 = γ_6 = γ_7), the spectrum in Fig. 10.9(c) results, and the distortions disappear. Although the Hadamard-modulated ADC is less sensitive than the time-interleaved ADC, the matching requirements for eight-channel systems and above are still severe.

10.3.3 Frequency-band decomposed ADC

For the frequency-band decomposed ADC, the input signal is applied unmodulated to N modulators converting different frequency bands.
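The partitionings claimed for the two modulated schemes can be verified numerically. The sketch below (numpy assumed; the helper names are mine, not from the thesis) rebuilds the S_k-matrices from the time sequences via (10.1), (10.2) and (10.12) and checks circulance: for the time-interleaved ADC only the sum of all four matrices is circulant, while for the Hadamard-modulated ADC the sets {0}, {1}, {2, 3} and {4, 5, 6, 7} are the circulant groupings, with no smaller subset of the last set working.

```python
import numpy as np
from itertools import combinations

def vec_a(seq):                       # (10.1)
    M = len(seq)
    return np.array([seq[0]] + [seq[M - n] for n in range(1, M)])

def vec_b(seq):                       # (10.2)
    M = len(seq)
    return np.array([seq[M - 1 - n] for n in range(M)])

def S(seq):                           # (10.12); a_k(n) = b_k(n) in both examples
    return np.outer(vec_b(seq), vec_a(seq))

def is_circulant(Sm):
    return all(np.array_equal(np.roll(Sm[i - 1], 1), Sm[i]) for i in range(1, len(Sm)))

# Time-interleaved ADC, M = N = 4: only the sum of all four S_k is circulant,
# so full channel matching is required (C0 = {0, 1, 2, 3}).
ti = [np.roll([1, 0, 0, 0], k) for k in range(4)]
assert is_circulant(sum(S(s) for s in ti))
assert not any(is_circulant(S(s)) for s in ti)

# Hadamard-modulated ADC, M = N = 8: the sequences (10.26) are the rows of the
# Sylvester-Hadamard matrix, H[k][n] = (-1)^popcount(k & n).
had = [[(-1) ** bin(k & n).count('1') for n in range(8)] for k in range(8)]
Sh = [S(s) for s in had]
assert is_circulant(Sh[0]) and is_circulant(Sh[1])          # C1 = {0}, C2 = {1}
assert is_circulant(Sh[2] + Sh[3])                          # C2 = {2, 3}
assert is_circulant(Sh[4] + Sh[5] + Sh[6] + Sh[7])          # C3 = {4, 5, 6, 7}
for r in (1, 2, 3):                                         # ... and no smaller
    for c in combinations((4, 5, 6, 7), r):                 # subset of C3 works
        assert not is_circulant(sum(Sh[i] for i in c))
```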
Consider as an example a four-channel system consisting of a lowpass channel, a highpass channel, and two bandpass channels centered at 3π/8 and 5π/8. As the signal is not modulated,

a_k = b_k = [1 1 1 1]   (10.28)

for all k, and S_k is the 4 × 4 all-ones matrix for all k. As each S_k-matrix is circulant, the system is insensitive to channel mismatches. Further, modulation sequence errors are irrelevant in this case, as the signal is not modulated. The frequency-band decomposed ADC is thus highly resistant to mismatches. Its obvious drawback, however, is the need to use bandpass modulators, which are more expensive in hardware.

Figure 10.9 Example 2: Sensitivity of Hadamard-modulated ADC. Output spectra for (a) the ideal model, (b) 1% channel gain mismatch, and (c) gain matching of the sensitive channels.

10.3.4 Generation of new scheme

This example demonstrates that the formulation can also be used to devise new schemes, although a general method is not presented. A three-channel parallel system using lowpass modulators is designed. The signal is assumed to be in the frequency band −π/4 < ωT < π/4, and the ADC is thus an oversampled system and is described according to Sec. 10.1.3 with L = 4 and M = 8. Using complex modulation sequences, three bands of width π/4 centered at −π/4, 0, and π/4 can be translated to baseband and converted with a lowpass ADC. These modulation sequences are a_0(n) = 1, a_1(n) = exp(jπn/4), a_2(n) = exp(−jπn/4) and b_k(n) = a_k*(n).

Figure 10.10 Example 4: Sensitivity of new scheme. Simulation using 10% channel gain mismatch.
Summing the resultant S_k-matrices yields

Σ S_k = 1 +
[  √2    2   √2    0  −√2   −2  −√2    0
    0   √2    2   √2    0  −√2   −2  −√2
  −√2    0   √2    2   √2    0  −√2   −2
   −2  −√2    0   √2    2   √2    0  −√2
  −√2   −2  −√2    0   √2    2   √2    0
    0  −√2   −2  −√2    0   √2    2   √2
   √2    0  −√2   −2  −√2    0   √2    2
    2   √2    0  −√2   −2  −√2    0   √2 ],   (10.29)

where 1 denotes the all-ones matrix. Unfortunately, using complex modulation sequences is not practical. However, as the modulators and filters are identical for all channels (H_k(z) = H(z) for all k), any other choice of modulation sequences resulting in the same matrix will perform the same function. Moreover, for a decimated system, relaxations may be allowed on the new modulation sequences. In this case, with decimation by four, it is sufficient to find replacing modulation sequences a'_k and b'_k such that the sum of the resulting S'_k-matrices equals Σ S_k on rows 4 and 8, as gcd(L, M) = 4. One such choice of modulation sequences is

a'_0 = [1 1 1 1 1 1 1 1],
a'_1 = [1 1 0 −1 −1 −1 0 1],
a'_2 = [1 0 0 0 −1 0 0 0],
b'_0 = [0 0 0 1 0 0 0 1],
b'_1 = [0 0 0 −√2 0 0 0 √2],
b'_2 = [0 0 0 (√2 − 2) 0 0 0 (2 − √2)].   (10.30)

The analog modulation sequences a'_k can easily be implemented by switching or grounding the inputs to the modulators, whereas the nontrivial multiplications in b'_k can be implemented with high precision digitally.

Figure 10.11 Noise model of parallel system.

Note that Σ b'^T_k a'_k is zero everywhere except on rows 4 and 8, where

row 4: [−1  (1 − √2)  1  (1 + √2)  3  (1 + √2)  1  (1 − √2)],
row 8: [3  (1 + √2)  1  (1 − √2)  −1  (1 − √2)  1  (1 + √2)],   (10.31)

which is equal to Σ S_k in (10.29) on rows 4 and 8.
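The claimed equality on rows 4 and 8 can be checked numerically (numpy assumed; the helper names are mine, not from the thesis): the sum of S_k-matrices of the complex-modulated scheme and the sum of the S'_k-matrices built from the real sequences (10.30) agree on rows 4 and 8 (zero-based indices 3 and 7).

```python
import numpy as np

def vec_a(seq):                       # (10.1)
    M = len(seq)
    return np.array([seq[0]] + [seq[M - n] for n in range(1, M)])

def vec_b(seq):                       # (10.2)
    M = len(seq)
    return np.array([seq[M - 1 - n] for n in range(M)])

n = np.arange(8)
# Ideal complex modulation sequences: a0(n) = 1, a1(n) = exp(j*pi*n/4),
# a2(n) = exp(-j*pi*n/4), and b_k(n) = conj(a_k(n)).
a_seqs = [np.ones(8), np.exp(1j * np.pi * n / 4), np.exp(-1j * np.pi * n / 4)]
S_ideal = sum(np.outer(vec_b(np.conj(a)), vec_a(a)) for a in a_seqs)

# Real replacement sequences (10.30), given directly as vectors a'_k, b'_k:
r2 = np.sqrt(2)
ap = [np.array([1, 1, 1, 1, 1, 1, 1, 1]),
      np.array([1, 1, 0, -1, -1, -1, 0, 1]),
      np.array([1, 0, 0, 0, -1, 0, 0, 0])]
bp = [np.array([0, 0, 0, 1, 0, 0, 0, 1]),
      np.array([0, 0, 0, -r2, 0, 0, 0, r2]),
      np.array([0, 0, 0, r2 - 2, 0, 0, 0, 2 - r2])]
S_prime = sum(np.outer(b, a) for a, b in zip(ap, bp))

# With L = 4 and M = 8, only rows gcd(L, M)*i, i.e. rows 4 and 8
# (indices 3 and 7), must agree -- and they do:
assert np.allclose(S_prime[3], S_ideal[3].real)
assert np.allclose(S_prime[7], S_ideal[7].real)
assert np.allclose(S_ideal.imag, 0)   # the complex-scheme sum is in fact real
```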
Note also that the S'_k-matrices, given on rows 4 and 8 by

rows 4, 8 of S'_0: [1 1 1 1 1 1 1 1] and [1 1 1 1 1 1 1 1],
rows 4, 8 of S'_1: [−√2 −√2 0 √2 √2 √2 0 −√2] and [√2 √2 0 −√2 −√2 −√2 0 √2],
rows 4, 8 of S'_2: [(√2 − 2) 0 0 0 (2 − √2) 0 0 0] and [(2 − √2) 0 0 0 (√2 − 2) 0 0 0],

are circulant on these rows, and thus the system is insensitive to channel mismatches. This is demonstrated in Fig. 10.10, where the channel gain mismatch is 10% and no aliasing results. However, as three levels are used in the analog modulation sequences a'_1 and a'_2, the system is sensitive to mismatches in the modulation sequences of these channels, as described in Sec. 10.2.

10.4 Noise model of system

The primary purpose of this work is to investigate the signal transfer characteristics of a parallel Σ∆-system. However, a system's noise properties are also affected by the choice of modulation sequences, and therefore a simple noise analysis is included. A noise model of the parallel Σ∆-system can be depicted as in Fig. 10.11. The quantization noise q_k(n) of channel k is filtered through the noise transfer function NTF_k(z) and filter G_k(z). The filtered noise is then modulated by the sequence b_k(n). The channels are summed to form the output y(n). In order to determine the statistical properties of the output y(n), channel k is modeled as in Fig. 10.12.

Figure 10.12 Noise model of channel k.

Denoting the spectral density of the quantization noise of channel k by R_{Qk}(e^{jω}), the spectral densities of the polyphase components y_{k,m} of the channel output can be written

R_{yk,m}(e^{jω}) = b_{k,m}^2 Σ_{l=0}^{M−1} |G_{k,l}(e^{jω})|^2 R_{Qk}(e^{jω}),   (10.32)

where G_{k,l}(z) are the polyphase components of the cascaded system NTF_k(z)G_k(z).
It is seen that the noise power is scaled by the factor b_{k,m}^2, and it is thus of interest to keep the amplitudes of the modulation sequences low on the digital side. For example, in the generation of the new modulation scheme in Sec. 10.3.4, alternative choices of a_1 and b_2 would have been

a_1 = [0 1 0 −1 0 −1 0 1],
b_2 = [0 0 0 −2 0 0 0 2].

However, in this case the noise power is larger. This shows that the smaller magnitudes of the digital modulation sequences, as in (10.30), are preferable from a noise perspective.

11 Sigma-delta ADC decimation filters

In this chapter, the design complexity requirements of decimation filters for Σ∆-ADCs are investigated. For fast Σ∆-ADCs with large oversampling factors, the power dissipation of the decimation filter may become a significant part of the whole ADC, and is often overlooked. The complexity of linear-phase FIR filter-based decimators is considered for a wide range of oversampling ratios (OSR) and three different modulator structures. Simulation results show the SNR degradation of the ADC as a function of the transition bandwidth and the filter order. In Sec. 11.1, needed notations are introduced and the filter design methodology is discussed. In Sec. 11.2, filter design results are provided. This chapter has been published in [17].

11.1 Design considerations

Consider the system shown in Fig. 11.1.

Figure 11.1 Σ∆-ADC with digital decimation filter.

An analog signal x(t) is sampled and quantized by a Σ∆-modulator to a low-resolution signal u(n) with high sampling
rate. The signal is then filtered and downsampled, resulting in a signal y(m) with high resolution and a low sampling rate.

Figure 11.2 Output quantization noise spectral density function for a 1-bit first-order sigma-delta modulator: (a) ideal and practical decimation filters, and the shaped quantization noise before filtering; (b) filtered quantization noise using the ideal and practical decimation filters.

11.1.1 FIR decimation filters

The filters in this chapter are FIR filters designed in the minimax sense (see Sec. 8.2). It may be argued that minimizing the energy through a least-squares design is more appropriate in order to maximize the SNR. However, in communications systems it is common that the suppression of blockers imposes minimax constraints on the filter, thus making minimax design interesting. The filters are also designed as single-stage filters, whereas in practice they are normally designed in several stages to relax the overall requirements. In particular, recursive comb decimators provide efficient filtering with few arithmetic operations. The results in this chapter do not contradict the fact that such structures are efficient, but rather concentrate on the relation between the stringency of the filter requirements and the ADC parameters. As a measure of the stringency of the filter requirements, the filter order is used. Ideally, a decimation filter would be used that passes all frequencies in the signal bandwidth, while completely suppressing all frequencies outside the signal bandwidth. Such a filter is shown by the solid line in Fig. 11.2(a). In this case, the signal is degraded only by the quantization noise in the same band as the signal. However, in practice the filter has a transition band, as exemplified by the dashed line in Fig. 11.2(a). Then, the signal quality is reduced also by the noise within the transition band. Moreover, as seen in the output power spectral density of the filtered quantization noise in Fig. 11.2(b), the noise in the transition band can constitute a considerable fraction of the total noise power.
Figure 11.3 Decimation filter specification (passband ripple δc, stopband ripple δs, band edges π/OSR and π(1 + ∆)/OSR).

For FIR filter-based decimators, the filter complexity is to the largest extent determined by the transition bandwidth [49]. Further, a significant part of the quantization noise energy of an oversampled Σ∆-modulator is located in this transition band, in particular when higher-order noise shaping is applied. Therefore, there exists a pronounced trade-off between decimation filter complexity and Σ∆-modulator signal-to-noise ratio (SNR). The work in this chapter investigates how the SNR is degraded as a function of the filter transition bandwidth and filter order for various commonly used Σ∆-modulator architectures and oversampling ratios (OSR). It is also demonstrated that, for a given filter order, there exists an optimum choice of the stopband ripple and stopband edge for equiripple filter solutions which minimizes the SNR degradation.

11.1.2 Decimation filter specification

The considered filters are Nth-order symmetric linear-phase FIR filters that satisfy the specification shown in Fig. 11.3, where δc, δs, ωcT = π/OSR, and ωsT = π(1 + ∆)/OSR denote the passband ripple, stopband ripple, passband edge, and stopband edge, respectively. The filters are designed in the minimax sense using the McClellan-Parks-Rabiner (MPR) algorithm. The passband ripple is specified to be 0.01, but after each design it may be slightly smaller, as there generally is a design margin. This is because the MPR algorithm only handles the ratio between the passband and stopband ripples. The stopband edge is related to the passband edge through the relative transition bandwidth ∆, shown in Fig. 11.3 and defined as

∆ = (ωsT − ωcT) / ωcT.    (11.1)
Figure 11.4 Principle of the extraction of quantization noise from a Σ∆-modulator and the filtering thereof through a time-domain realization of the Σ∆-modulator noise transfer function.

11.1.3 Signal-to-noise ratio

Consider a Σ∆-modulator with an input signal x(n), output signal y(n), and quantization error e(n), as defined in Sec. 9.2. The output signal and noise power of the Σ∆-modulator after filtering are denoted Pyx and Pye, respectively. Assuming that the input signal x(n) is bandlimited to ωcT, and that the passband ripple δc of the filter H(z) is small enough for its effect on the signal power to be neglected, Pyx equals the input signal power, i.e., Pyx = Px. The contribution to the output noise power emanates from the passband, transition band, and stopband regions. The corresponding noise powers are denoted Pye(pb), Pye(tb), and Pye(sb), respectively. The total noise power is thus Pye = Pye(pb) + Pye(tb) + Pye(sb), and the SNR at the output is given by

SNR = 10 log10 (Pyx / Pye).    (11.2)

Further, the SNR degradation ∆SNR is defined as the degradation in SNR caused by using a practical filter instead of an ideal lowpass filter. Thus,

∆SNR = 10 log10 (Pyx / Pye(pb)) − 10 log10 (Pyx / Pye) = 10 log10 (Pye / Pye(pb)).    (11.3)

Using the linear model discussed in Sec. 9.1.3, the noise power is computed as

Pe = (1/2π) ∫−π..π (Q²/12) |G(e^jωT)|² |H(e^jωT)|² d(ωT),    (11.4)

where Q is the quantization step, G(z) is the noise transfer function of the Σ∆-modulator, and H(z) is the transfer function of the decimation filter. The noise power in the different bands can be computed analogously by using the appropriate integration limits. However, the linear model tends to be less appropriate for coarse quantization steps. In particular, problems arise for one-bit quantization. Therefore, in this work, the noise power has been evaluated using the model in Fig.
11.4, where the extracted quantization error e(n) has been filtered through a time-domain realization of the Σ∆-modulator noise transfer function G(z). The obtained sequence u(n) is then filtered through the decimation filter in order to obtain the final output noise. The noise in the different bands is then obtained from the DFT of the output noise signal. In Fig. 11.5, the noise powers of the two different cases are compared.

11.2 Simulation results

For the three Σ∆-modulator topologies shown in Figs. 9.5–9.7, two different investigations have been done. In the first, the transition bandwidth of the decimation filter has been varied for different OSRs and filter orders, and the corresponding SNR degradation has been calculated. The results of the investigation are plotted in Fig. 11.6. For a given transition bandwidth and filter order, the stopband ripple is fixed. From the figures, it can be seen that the SNR degradation has a minimum for a certain choice of ∆.

In the second investigation, the optimal choice of the relative transition bandwidth ∆ has been found for filter orders between 100 and 1000 for different OSRs. The optimal ∆'s and the corresponding SNR degradations for the three Σ∆-modulator topologies are shown in Figs. 11.7–11.9. Decreasing the transition bandwidth below the optimum causes the SNR to worsen considerably, because of the rapidly decreasing stopband attenuation. Also, for large enough transition bands, increasing the filter order yields no significant SNR improvement, because essentially all of the noise power then comes from the transition band region. It should be noted that the optimal SNR may occur for a transition bandwidth several times wider than the passband width.

(a) Σ∆-modulators of order one and two.
(b) Power in different bands of the decimation filter, using a second-order Σ∆-modulator.

Figure 11.5 Quantization noise power of Σ∆-modulators after the decimation filter, according to the linear noise model and the computed quantization noise. The OSR is 32.

Figure 11.6 SNR degradation as a function of the relative transition bandwidth ∆ for the different modulator structures: (a) first-order Σ∆-modulator, (b) second-order Σ∆-modulator, (c) MASH Σ∆-modulator.

Figure 11.7 SNR-optimal choice of ∆ (a) and minimal SNR degradation (b) as functions of the filter order for the first-order Σ∆-modulator.

(a) SNR-optimal choice of ∆.
(b) Minimal SNR degradation.

Figure 11.8 SNR-optimal choice of ∆ and minimal SNR degradation as functions of the filter order for the second-order Σ∆-modulator.

Figure 11.9 SNR-optimal choice of ∆ (a) and minimal SNR degradation (b) as functions of the filter order for the MASH Σ∆-modulator.

12 High-speed digital filtering

In this chapter, the bit-level design of high-speed FIR filters is considered. For a given filter impulse response, the design process is divided into four steps: the choice of a filter architecture, the generation of partial products from the input, the identification of shared subexpressions, and the construction of an efficient pipelined reduction tree for the partial products. The focus is on the last two parts. For the identification of common subexpressions, a heuristic algorithm is proposed that identifies possible shared subexpressions before their allocation to adders in the reduction tree. In the construction of the reduction tree, a proposed bit-level optimization algorithm minimizes a weighted sum of the number of full adders, half adders, and registers.

In Sec. 12.1, the investigated architectures are presented, including the direct-form structure, the transposed direct-form structure, and a new direct-form structure utilizing partial symmetry adders. In Sec. 12.2, the complexities of the architectures are estimated for different wordlengths and pipeline levels. In Sec.
12.3, the proposed algorithm for partial product redundancy reduction is explained, and in Sec. 12.4.1, the proposed optimization algorithm is discussed. Results comparing the different architectures are finally given in Sec. 12.5. Secs. 12.1, 12.2 and 12.4.1 have been published in [8], whereas Sec. 12.3 has been published in [9].

12.1 FIR filter realizations

The two main architectures for FIR filters are the direct-form (DF) architecture and the transposed direct-form (TF) architecture. These were presented in Sec. 8.4, where multi-rate structures using partial product generation and reduction trees based on carry-save adders were shown. The suitability of different architectures for high-speed FIR filters is investigated. For each of the considered architectures, the implementation can be partitioned into three parts: a partial product generation stage that realizes the filter coefficient multiplications, a carry-save adder (CSA) reduction tree to efficiently reduce the number of partial products, and a vector merge adder (VMA) to merge the CSA output to a non-redundant binary form.

In [1,42,45,67], different methods of generating partial products for given filters were investigated. However, here the focus is rather on the generation of an efficient pipelined reduction tree. This is done through a formulation as an integer linear programming (ILP) problem, with which a bit-level optimized reduction tree can be obtained. As the cost function for the optimization algorithm, a weighted sum of the number of full adders, half adders, and registers is used. The model is similar to that presented in [59], but was not formulated as an ILP problem there. Compared with the traditional heuristic methods, Wallace trees [97], Dadda trees [30], and Reduced Area trees [6], the bit-level optimization yields better results for a number of reasons.
The aggressive use of half adders in Wallace trees leads to fast reductions, but generally a more efficient use of half adders is possible. The Dadda structure uses half adders more restrictively, trying only to maximize the opportunities to use full adders. However, placing full adders only as late as possible makes the structure unsuitable for pipelining. It is also the case that the heuristics work well for reduction trees for general multipliers, but less so for other reduction trees. For example, the Reduced Area heuristic is claimed to be optimal in terms of hardware resources for general multipliers, but simulations provided in this thesis show that this is not necessarily the case for general partial product trees. Moreover, the heuristics do not consider that partial products might be added at different levels in the reduction tree, which is the case for several of the architectures considered in this thesis.

It should be noted that the proposed realizations can be applied to reduction trees in other applications as well. One possible application is merged arithmetic [91], where the results of several multiplications are summed in a carry-save adder tree. Other examples are high-speed complex multipliers implemented using distributed arithmetic [5], and the implementation of general functions as sums of weighted bit-products [54].

12.1.1 Architectures

The direct-form (DF1) architecture is depicted in Fig. 12.1. In this architecture, a delay chain at the input provides the algorithmic delays, and an adder tree sums the partial products generated from the taps. An interesting characteristic of the direct-form architecture is its ability to utilize the symmetry of linear-phase FIR

Figure 12.1 Direct-form architecture (DF1).
Figure 12.2 Direct-form architecture utilizing coefficient symmetry (DF2).

filter coefficients. This architecture is depicted in Fig. 12.2, and is denoted the DF2 architecture. The benefit of utilizing the symmetry is that the number of multiplications is halved. However, in the application of high-speed decimation filters, this feature is not necessarily efficient because of the need for short critical paths. The inability to implement the symmetry adders using carry-save arithmetic leads to an excessive use of registers in the pipelined ripple-carry adders. This can be readily observed in the experimental results in Sec. 12.5. A solution to this problem was suggested in [33], but was demonstrated in [45] to have limited efficiency. As a possible alternative solution, dividing the symmetry adders into smaller blocks of adders, as depicted in Fig. 12.3, is considered. This solution is denoted partial symmetry, and it reduces the register complexity as the carry propagation chain is broken. However, partial symmetry results in slightly more partial products than

Figure 12.3 Partial symmetry adder.

Figure 12.4 Transposed direct-form architecture (TF).

full symmetry does. The partial symmetry architecture is referred to as the DF3 architecture.

The TF architecture is depicted in Fig. 12.4. For high-speed decimation filters, the TF architecture may provide a more register-efficient realization, as the algorithmic delay elements are also used as pipelining registers in the summation tree.
Because of the architecture's limited ability to utilize filter coefficient symmetry, however, the TF architecture may require more adders than the direct-form architecture. There are two reasons why symmetry cannot be utilized. First, the symmetric multiplier may be connected to a different input because of the polyphase decomposition. Second, instead of computing each multiplication explicitly, all generated partial products are merged in one carry-save tree. As the number of cascaded adders in each stage is restricted, there may be additional CSA stages to reduce the number of partial products before the VMA. Hence, the output of each CSA stage is not necessarily represented using at most two bits for each bit weight.

For each of the architectures depicted in Fig. 12.1, Fig. 12.2 and Fig. 12.4, the structure performs internal decimation by M.

Figure 12.5 Carry-save adder tree pipelined in J stages.

M may be 1 in order to describe a single-rate filter. For an interpolating filter, M = 1 and the filter consists of several independent reduction trees for the outputs. Normally, the delays at the inputs are shared between the branches for the DF architectures. A structure may also contain both several inputs and several outputs in order to describe general multi-rate filters. However, the results in this thesis are constrained to simple decimation and interpolation filters.

For both the DF architectures and the TF architecture, the result after partial product generation is a number of partial products with different bit weights and different delays. The general structure for the summation tree is shown in Fig. 12.5, where the carry-save adder is divided into J stages. The stages are separated by pipeline registers, and inputs are accepted in all stages. Each stage has the structure shown in Fig. 12.6, allowing a maximum adder depth of K levels.
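The stage structure lends itself to a simple column-count model. The following Python sketch (illustrative, not from the thesis; the function name `reduce_stage` and the aggressive, Wallace-like half-adder policy are assumptions, in contrast to the cost-optimized placement proposed later) greedily places full and half adders in one stage with at most K levels and reports the per-weight partial product counts and adder usage:

```python
# Illustrative sketch: greedy, Wallace-like reduction of one carry-save
# adder stage with a maximum depth of K levels. Each column holds the
# number of partial products of that bit weight.

def reduce_stage(cols, K):
    """Reduce per-weight partial product counts through K adder levels.

    cols: list of partial product counts, index = bit weight.
    Returns (cols_after, full_adders, half_adders)."""
    fa_total = ha_total = 0
    for _ in range(K):
        nxt = [0] * (len(cols) + 1)
        for b, n in enumerate(cols):
            fa = n // 3                     # each full adder consumes 3 bits of weight b
            rest = n - 3 * fa
            ha = 1 if rest == 2 else 0      # greedily use a half adder on a leftover pair
            # sums (and passed-through bits) stay at weight b
            nxt[b] += fa + ha + (rest if ha == 0 else 0)
            nxt[b + 1] += fa + ha           # carries move to weight b + 1
            fa_total += fa
            ha_total += ha
        while nxt and nxt[-1] == 0:
            nxt.pop()
        cols = nxt
    return cols, fa_total, ha_total

cols, fa, ha = reduce_stage([6, 6, 6, 6], K=2)
```

With six partial products in each of four columns and K = 2, the sketch uses 11 full adders and 2 half adders; since each full adder removes one partial product net, the total bit count drops from 24 to 13.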
Again, partial products may be added in every level. Considering the investigated architectures: for the DF1 architecture, all partial products are added in the first level of the first stage. For the TF architecture, partial products are added in the first level of several stages, as the pipeline registers are also used as algorithmic delays. Finally, for the DF2 and DF3 architectures, partial products are added in several levels of the first stages, as the inputs are delayed by the symmetry adders.

Figure 12.6 One stage of a carry-save adder tree, with a maximum of K adders in the critical path.

12.1.2 Partial product generation

As the multiplier coefficients are known, it is not required to use general multipliers. Instead, only the partial products corresponding to non-zero coefficient bits are generated. Here, it is shown how partial products are generated using different representations of the coefficients, and using both signed and unsigned input data. An input data word X with a wordlength of wd bits can be written as

X = Σi=0..wd−1 xi 2^i    (12.1)

for unsigned data with an input range of 0 ≤ X ≤ 2^wd − 1, and

X = −xwd−1 2^(wd−1) + Σi=0..wd−2 xi 2^i    (12.2)

for signed (two's complement) data with an input range of −2^(wd−1) ≤ X ≤ 2^(wd−1) − 1, where for both (12.1) and (12.2) xi ∈ {0, 1}. Note that the input data is considered to be integer instead of fractional. However, this is only to be consistent with the numbering used later on, where the bits corresponding to the smallest weight (the LSBs) have index 0. Similarly, the wc-bit coefficients h(n) can be written as

h(n) = Σj=0..wc−1 hn,j 2^j    (12.3)

and

h(n) = −hn,wc−1 2^(wc−1) + Σj=0..wc−2 hn,j 2^j,    (12.4)

(a) Binary representation. (b) Signed-digit representation.
Figure 12.7 Resulting partial products when multiplying three-bit signed data with 29. White dots correspond to negated partial products.

where hn,j ∈ {0, 1}. Considering the case of unsigned data and unsigned binary coefficients first, the multiplication of the input and a filter coefficient can be written

X h(n) = (Σi=0..wd−1 xi 2^i)(Σj=0..wc−1 hn,j 2^j) = Σj=0..wc−1 Σi=0..wd−1 hn,j xi 2^(i+j).    (12.5)

Now, as some of the hn,j bits are known to be zero, only bits corresponding to non-zero hn,j have to be added. If instead a signed-digit (SD) representation of the coefficient is used, i.e., hn,j ∈ {−1, 0, 1}, the number of non-zero hn,j can be decreased. A signed-digit representation with a minimum number of non-zero positions is called a minimum signed-digit (MSD) representation. In general there is no unique MSD representation, but introducing the constraint that no two adjacent positions may both be non-zero, i.e., hn,j hn,j+1 = 0, results in the canonic signed-digit (CSD) representation, which is both unique and minimum.

Using an SD representation requires that partial products can be both added and subtracted. By enabling subtraction, signed input data and coefficients are also allowed. Equation (12.5) will then contain partial products that may be both positive and negative. Note that −a = ā − 1 for a ∈ {0, 1}. Hence, negative partial products can be handled by inverting the partial product and adding a constant number. The constant numbers corresponding to all partial products can be merged into one constant binary number in the filter computation. In Fig. 12.7, the partial products resulting from multiplying a three-bit signed input with the coefficient 29 are illustrated for both the binary and the CSD representation of the coefficient. The corresponding constants to add are −4−16−32−64 = −116 and −4−4−8−128 = −144 for the binary and CSD representations, respectively.
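As a concrete illustration of the CSD recoding and the resulting partial products, the following Python sketch (illustrative, not thesis code; `csd` and `partial_products` are hypothetical helpers) recodes a coefficient into CSD digits and lists the signed partial product weights for an unsigned input. The sign handling via inversion and a merged constant, described above, is omitted for brevity:

```python
# Illustrative sketch: canonic signed-digit (CSD) recoding and partial
# product generation for a small constant-coefficient multiplier.

def csd(value):
    """Return CSD digits of a non-negative integer, LSB first, digits in {-1, 0, 1}."""
    digits = []
    while value != 0:
        if value % 2:
            # choose +1 or -1 so that no two adjacent digits are non-zero
            d = 2 - (value % 4)     # value % 4 == 1 -> +1, value % 4 == 3 -> -1
            digits.append(d)
            value -= d
        else:
            digits.append(0)
        value //= 2
    return digits

def partial_products(coeff_digits, wd):
    """Signed weights d * 2^(i+j) of all partial products for a wd-bit
    unsigned input; one entry per non-zero coefficient digit and data bit."""
    return [d * (1 << (i + j))
            for j, d in enumerate(coeff_digits) if d
            for i in range(wd)]

digits = csd(29)                    # 29 = 32 - 4 + 1, three non-zero digits
pp = partial_products(digits, 3)    # partial products for a 3-bit input
```

For the coefficient 29, binary representation needs four partial product rows (non-zero bits at weights 1, 4, 8, 16), while the CSD form needs only three.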
It should be noted that for the TF architecture, the output will be correct only after the first ⌈(N − 1)/M⌉ samples. If correct output from the first sample is required, one solution is custom initialization of each stage register.

12.2 Implementation complexity

12.2.1 Adder complexity

As only the full adders reduce the number of partial products, the required number of full adders for the carry-save summation tree can easily be calculated as the difference between the number of generated partial products and the output wordlength. For the DF1 architecture and the TF architecture, the number of generated partial products can be written wd(N + 1)Na, where N is the filter order and Na is the average number of non-zero digits in the filter coefficients. For the DF2 architecture, the number of coefficients is halved, whereas the input wordlength is increased by one due to the symmetry adders. Thus the number of generated partial products can be written (wd + 1)⌈(N + 1)/2⌉Na. Finally, for the DF3 architecture, each symmetry adder increases its input wordlength by one, and hence the total number of partial products can be written (wd + ⌈wd/S⌉)⌈(N + 1)/2⌉Na, assuming that partial symmetry adder groups of S bits are used. Depending on the representation used for the coefficients and input data, there may also be a number of constant ones to add.

In all architectures, the required number of output bits, assuming the general case with signed full-scale data and signed coefficients, can be written wout = wd + log2 Σ|h(k)|, where h(k) is the impulse response. Thus the required number of full adders can be written

NFA,DF1 = NFA,TF = wd((N + 1)Na − 1) − log2 Σ|h(k)|    (12.6)

for the DF1 architecture and the TF architecture,

NFA,DF2 = (wd + 1)⌈(N + 1)/2⌉Na − wd − log2 Σ|h(k)|    (12.7)

for the DF2 architecture, and

NFA,DF3 = (wd + ⌈wd/S⌉)⌈(N + 1)/2⌉Na − wd − log2 Σ|h(k)|    (12.8)

for the DF3 architecture.
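The adder-count expressions (12.6)–(12.8) are straightforward to evaluate programmatically. The sketch below (illustrative; the ceiling applied to the log2 term is an assumption made here to obtain an integer wordlength growth, and the function name `fa_counts` is hypothetical) implements them for a given impulse response:

```python
# Sketch of the full adder count estimates (12.6)-(12.8).
import math

def fa_counts(h, wd, Na, S):
    """Estimated full adder counts for the DF1/TF, DF2 and DF3 architectures.

    h: impulse response, wd: data wordlength, Na: average number of
    non-zero coefficient digits, S: partial symmetry adder group size."""
    N = len(h) - 1                                        # filter order
    wgrowth = math.ceil(math.log2(sum(abs(c) for c in h)))  # log2 sum |h(k)|
    half = math.ceil((N + 1) / 2)                         # ceil((N+1)/2)
    nfa_df1 = wd * ((N + 1) * Na - 1) - wgrowth           # (12.6), also TF
    nfa_df2 = (wd + 1) * half * Na - wd - wgrowth         # (12.7)
    nfa_df3 = (wd + math.ceil(wd / S)) * half * Na - wd - wgrowth  # (12.8)
    return nfa_df1, nfa_df2, nfa_df3

# Example: a symmetric 5-tap filter, 8-bit data, Na = 1.5, S = 4.
d1, d2, d3 = fa_counts([1, 2, 3, 2, 1], wd=8, Na=1.5, S=4)
```

Since Na is an average, the DF2 and DF3 estimates may be fractional; they are estimates of tree size, not exact cell counts.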
Also, the complexity of the symmetry adders for the DF2 architecture is ⌈(N + 1)/2⌉ wd-bit adders, resulting in a number of adder cells equal to

NFA,sym,DF2 = (wd − 1)⌈(N + 1)/2⌉    (12.9)

and

NHA,sym,DF2 = ⌈(N + 1)/2⌉.    (12.10)

For the DF3 architecture, the number of partial symmetry adders is ⌈wd/S⌉⌈(N + 1)/2⌉ S-bit adders, resulting in a number of adder cells equal to

NFA,sym,DF3 = (wd − ⌈wd/S⌉)⌈(N + 1)/2⌉    (12.11)

and

NHA,sym,DF3 = ⌈wd/S⌉⌈(N + 1)/2⌉.    (12.12)

Using these equations, the total required number of adders for the DF architectures can be calculated. It should be noted, however, that equations (12.6)–(12.8) do not take into account the half adders that are usually needed to rearrange the partial products. Nevertheless, an approximate condition can be determined for when the DF2 architecture results in a structure with smaller adder complexity than DF1. This condition is

NFA,DF1 > NFA,DF2 + NFA,sym,DF2 + NHA,sym,DF2,    (12.13)

which can be approximated to

wd > Na / (Na − 1),    (12.14)

if the costs of half and full adders are considered equal. However, as the half adders of the CSA trees have been ignored, the condition should be considered a guideline rather than a strict rule. In the investigated application, where typically both wd and Na are low, utilizing the coefficient symmetry often does not lead to reduced adder complexity.

12.2.2 Register complexity

Regarding the register complexity, it is possible to find expressions that are asymptotically valid. However, for the considered applications these expressions convey little information, and expressions that are valid for low filter orders and short wordlengths are difficult to find. Thus, the register complexities due to the algorithmic delays are calculated here, whereas those due to the pipelining of the adder trees are determined experimentally.
For the DF architectures, the algorithmic delays are applied at the input, and the register complexity due to these can be written

NR,DF1 = wd N.    (12.15)

If the symmetry of the coefficients is utilized, the implementation carries an additional complexity due to the pipelining of the symmetry adders. The way the pipelining is done is shown in Fig. 12.8. In addition to the registers needed to restrict the length of the critical path, registers are also placed at the outputs of the full adders just before the cuts. The reason for this is the definition of the reduction trees in Sec. 12.4.1, which does not accept inputs just before pipeline cuts. If an n-bit ripple-carry adder with a maximum of m adders in the critical path is considered, the required number of pipeline registers can be written

NR,RC(n, m) = Σk=1..⌊n/m⌋ 2(n − mk + 1).    (12.16)

Figure 12.8 Pipelined ripple-carry adder.

The register complexity of the DF2 architecture can then be written

NR,DF2 = wd N + ⌈(N + 1)/2⌉ NR,RC(wd, K),    (12.17)

where K is the maximum allowed number of adders in cascade. For the partial symmetry case, the number of registers is smaller (for S < wd). The number of registers, assuming S = nK for an integer n, is

NR,DF3 = wd N + ⌈(N + 1)/2⌉ (⌊wd/S⌋ NR,RC(S, K) + NR,RC(wd mod S, K)).    (12.18)

For (12.17) and (12.18) it is assumed that no sharing of registers between the algorithmic delays and the symmetry adder pipeline registers has been performed. If sharing is considered, the register complexity of the DF2 architecture can be written

NR,DF2 = wd N + ⌊wd/K⌋(N + 1) + Σm=0..M−1 Σi=0..⌈(N+1−m)/M⌉−1 Σk=1+i..⌊wd/K⌋ (wd − Kk).    (12.19)

If, for simplicity, wd = jS for an integer j is assumed, the register complexity of the DF3 architecture can be written

NR,DF3 = wd N + jn(N + 1) + j Σm=0..M−1 Σi=0..⌈(N+1−m)/M⌉−1 Σk=1+i..n (S − Kk).    (12.20)

For the TF architecture, the algorithmic delays are merged with the pipeline registers, and all registers are in the adder tree.

12.3 Partial product redundancy reduction

Consider the general structure of the original reduction tree as shown in Figs. 12.5 and 12.6. For clarity, the structure is expanded in Fig. 12.9.

Figure 12.9 Structure of the original reduction tree.

Let a subexpression denote three partial products with the same weight, on the same stage and level in the same reduction tree. The sharing is then based on identifying common subexpressions occurring in several places in the reduction trees, i.e., several occurrences of the same set of three partial products. Thus the sharing identification is performed before the assignment of partial products to adders. Subexpressions can be shared in the same tree or between trees, and in the same stage or between stages. However, in order not to affect the delay of the intermediate partial products, subexpressions cannot be shared between different levels. For each subexpression to be shared, the occurrences of the partial products are removed from the reduction trees, an adder for the subexpression is introduced, and partial products corresponding to the sum and carry outputs are added on the level below each subexpression occurrence. The resulting reduction tree structure is shown in Fig. 12.10.

Several different reduction trees can share the same full adder. However, there cannot be any full adder that has inputs from more than one adder tree. Hence, the search for possible assignments needs to be done inside each adder tree only, but to determine which assignment is most beneficial one should consider all adder trees. For the first level of the first stage all partial products are from the partial
Figure 12.10 Structure of the reduction tree with redundancy reduction.

Figure 12.11 Equivalence transforms for sign normalization of subexpressions.

product generation stage. For later levels there may still be redundancy, as the same partial product result may end up in several columns. Considering the DF and TF architectures, adder sharing cannot be expected to consistently yield better results with one of the architectures than with the other. In the DF architecture, all partial products are added in the first level, increasing the number of possible subexpressions that can be considered for sharing. On the other hand, the delay chain at the input also increases the number of different partial products, decreasing the probability that the same partial products occur in several columns.

If signed data and/or coefficients are used, some of the partial products may be inverted. However, it is possible to apply transformations to increase the possible sharing by propagating inversions from the inputs to the outputs. This is illustrated in Fig. 12.11.

Figure 12.12 Relations between the BITS variables in the ILP problem.

12.3.1 Proposed algorithm

The goal of the proposed algorithm is to identify subexpressions that occur more than once. It can be described as follows:

1. Represent the filter coefficients in a suitable representation and generate one or more adder trees to be computed.

2. For each column in each level in each reduction tree, find the possible subexpressions, normalized in terms of possible inversion.

3. Determine the maximum frequency subexpression.
In case of a tie, pick one at random.

4. If the chosen subexpression has only one occurrence, finish.

5. Assign the subexpression to a full adder, remove the corresponding entries from the adder trees, and add the sum and carry outputs as partial products in the corresponding columns at the level below each subexpression occurrence.

6. Go to 2.

12.4 ILP optimization

12.4.1 ILP problem formulation

Denote the stage height, i.e., the maximum number of cascaded adders, by K, as in Fig. 12.6. Denote also the number of stages by J, as in Fig. 12.5. Furthermore, denote the output wordlength by wout, and the number of input partial products in each stage j, level k and bit position b by INBITSj(k, b), k ∈ {0, 1, 2, . . . , K − 1}, b ∈ {0, 1, 2, . . . , wout − 1}.

Figure 12.13 Relations between the stages in the ILP problem.

As variables for the ILP problem, the numbers of full adders FAj(k, b) and half adders HAj(k, b) are used, with the same parameter bounds as for INBITS. The resulting number of partial products in each level is denoted BITSj(k, b), and is defined by BITS0(0, b) = INBITS0(0, b) for the first level of the first stage. As each full adder reduces three partial products to one of the same weight and one of the next higher weight, and each half adder converts two partial products to one of the same weight and one of the next higher weight, the resulting number of partial products in each following level can be written

BITSj(k + 1, b) = BITSj(k, b) + INBITSj(k, b) − 2FAj(k, b) − HAj(k, b) + FAj(k, b − 1) + HAj(k, b − 1),    (12.21)

for k ∈ {0, 1, 2, . . . , K − 1}, b ∈ {0, 1, 2, . . . , wout − 1}, and with variables equal to zero for out-of-bounds arguments. The relations between the BITS variables are depicted in Fig. 12.12.
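The bookkeeping in (12.21) can be checked with a few lines of Python (an illustrative sketch, not the ILP model itself; the helper name `next_level` is hypothetical):

```python
# Minimal sketch of the bit-count propagation (12.21): given adder
# placements, compute the per-weight partial product counts of the
# next level within a stage.

def next_level(bits, inbits, fa, ha):
    """Apply BITS_j(k+1, b) = BITS_j(k, b) + INBITS_j(k, b)
    - 2*FA_j(k, b) - HA_j(k, b) + FA_j(k, b-1) + HA_j(k, b-1).

    All arguments are lists indexed by bit position b (same length);
    out-of-range positions count as zero."""
    out = []
    for b in range(len(bits)):
        carry_in = (fa[b - 1] + ha[b - 1]) if b > 0 else 0
        out.append(bits[b] + inbits[b] - 2 * fa[b] - ha[b] + carry_in)
    return out

# One column with 5 partial products and one full adder: 5 - 2*1 = 3
# remain at weight b, and the carry adds one product at weight b + 1.
level1 = next_level([5, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0])
```

Note that the chosen placement also respects the input constraint (12.24), since 3 · 1 + 2 · 0 ≤ 5.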
Connections between the stages are defined by

BITS_{j+1}(0, b) = BITS_j(K, b).   (12.22)

Variables REGS_j(b), denoting the number of pipeline registers for each stage, are defined by

REGS_j(b) = BITS_j(K, b).   (12.23)

Thus, registers are added for all signals between the stages, as shown in Fig. 12.13. The number of adders in each level is limited by the constraint

3 FA_j(k, b) + 2 HA_j(k, b) ≤ BITS_j(k, b) + INBITS_j(k, b),   (12.24)

as the number of adder inputs cannot exceed the number of partial products, for each level k and bit position b. Also, in order to utilize a VMA to sum the output, the number of output bits from the last stage is limited to 2 by the condition

BITS_{J−1}(K, b) + INBITS_{J−1}(K, b) ≤ 2,   (12.25)

for b ∈ {0, 1, 2, ..., wout−1}. Costs are defined for full adders, half adders and registers as C_FA, C_HA, and C_REG, respectively, and the filter structure is optimized by minimizing the sum

C = C_{FA} \sum_{j=0}^{J-1} \sum_{k,b} FA_j(k, b) + C_{HA} \sum_{j=0}^{J-1} \sum_{k,b} HA_j(k, b) + C_{REG} \sum_{j=0}^{J-1} \sum_{b} REGS_j(b).   (12.26)

The optimization problem as specified by (12.21)–(12.26) does not consider the length of the VMA. However, it may be possible to significantly reduce the length by introducing half adders at the least significant bits. The optimization problem can be modified to achieve a shorter VMA by adding a constraint that limits the number of output partial products to one for a number m of the least significant bits. This can be done by the constraint

BITS_{J−1}(K, b) + INBITS_{J−1}(K, b) ≤ 1,   (12.27)

for b ∈ {0, 1, 2, ..., m−1}. The problem complexity can be significantly reduced by adding further constraints. In particular, there is never a reason, in terms of efficient reduction of the number of partial products, not to insert a full adder where at least three partial products are available.
Hence, the number of full adders in a given position can be defined based on the number of partial products available as

FA_j(k, b) = \lfloor (BITS_j(k, b) + INBITS_j(k, b)) / 3 \rfloor,   (12.28)

which can be formulated as two linear constraints for each variable. It should be noted that the optimization problem, as formulated, is independent of the coefficient representation, i.e., binary as well as any signed-digit representation may be used. However, if signed digits are used, either because the data is signed or because the coefficients contain negative digits, a constant term must be added to the sum, as discussed in Sec. 12.1.2. As the placement of the term in the tree is arbitrary, the problem can be modified to insert the bits where they fit well. How to formulate the optimization problem to accommodate the constant term is explained in Sec. 12.4.6. In Sec. 12.4.2 to 12.4.5, the presented architectures are formulated as initial conditions for the optimization problem. In these formulations, the coefficient digits h_{n,j} are defined as in (12.3) or (12.4), with h_{n,j} ∈ {0, 1} for binary coefficients and h_{n,j} ∈ {−1, 0, 1} for signed-digit coefficients.

12.4.2 DF1 architecture

For the DF1 architecture, all partial products are inserted in the first adder stage. The sum of the partial products is \sum_{n=0}^{N} \sum_{i=0}^{w_c-1} \sum_{j=0}^{w_d-1} |h_{n,i}| 2^{i+j}. Substituting b = i + j and rearranging the sums allows the number of bit-products of weight b to be written

INBITS_0(0, b) = \sum_{n=0}^{N} \sum_{j=0}^{w_d-1} |h_{n,b-j}|.   (12.29)

12.4.3 DF2 architecture

If the direct-form architecture is modified to utilize coefficient symmetry, the symmetry adders will add additional delay. Thus, the partial products involving bit 0 (the LSB) of the data are added in level 1, the partial products involving bit 1 of the data are added in level 2, and so on.
Generally, the number of initial partial products in stage j and level k can be written

INBITS_j(k, b) = \sum_{n=0}^{(N+1)/2} |h_{n, b-Kj-k+1}|   (12.30)

for 1 ≤ Kj + k ≤ wd − 1, and

INBITS_j(k, b) = \sum_{n=0}^{(N+1)/2} (|h_{n, b-Kj-k+1}| + |h_{n, b-Kj-k}|)   (12.31)

for Kj + k = wd.

12.4.4 DF3 architecture

For the partial symmetry case, the contributions of the different adders are separated. Assuming that a symmetry width of S adders is used, the partial products can be split into ⌈wd/S⌉ bins, where the first contains partial products from the S least significant data bits, the next bin contains partial products from the next S least significant data bits, and so on. Denoting the contribution from bin m by B_j^m(k, b), the contributions for m = 0, 1, 2, ..., ⌈wd/S⌉ − 1 can be written

B_j^m(k, b) = \sum_{n=0}^{(N+1)/2} |h_{n, b-Kj-k-mS+1}|   (12.32)

for 1 ≤ Kj + k ≤ wd − 1, and

B_j^m(k, b) = \sum_{n=0}^{(N+1)/2} (|h_{n, b-Kj-k-mS+1}| + |h_{n, b-Kj-k-mS}|)   (12.33)

for Kj + k = wd. For m = ⌈wd/S⌉, the contribution can be written

B_j^m(k, b) = \sum_{n=0}^{(N+1)/2} |h_{n, b-Kj-k-mS+1}|   (12.34)

for 1 ≤ Kj + k ≤ (wd mod S) − 1, and

B_j^m(k, b) = \sum_{n=0}^{(N+1)/2} (|h_{n, b-Kj-k-mS+1}| + |h_{n, b-Kj-k-mS}|)   (12.35)

for Kj + k = wd mod S. Finally, the combined contribution is the sum of the partial symmetry adder contributions

INBITS_j(k, b) = \sum_{m=0}^{⌈wd/S⌉} B_j^m(k, b).   (12.36)

12.4.5 TF architecture

Denoting the polyphase factor by M, for the TF architecture the first M filter coefficients are inserted in the last adder stage, the next M coefficients in the stage before, and so on. Thus, the number of initial partial products can be written

INBITS_{J-j-1}(0, b) = \sum_{n=Mj}^{M(j+1)-1} \sum_{t=0}^{w_d-1} |h_{n, b-t}|.   (12.37)

12.4.6 Constant term placement

If either the coefficients or the data contain digits with a negative sign, a constant compensation term must be added to the carry-save tree.
However, these bits may be placed in an arbitrary stage, and in this section it is explained how the problem may be modified to place the bits optimally in terms of hardware resources. Define the constant, in two's complement representation, as

C = -c_{wout-1} 2^{wout-1} + \sum_{b=0}^{wout-2} c_b 2^b,   (12.38)

where c_b ∈ {0, 1}. Then define the ILP variables CBITS_j(b) ∈ {0, 1} for j ∈ {0, 1, 2, ..., J−1}, b ∈ {0, 1, 2, ..., wout−1}, and add the constraint

\sum_{j=0}^{J-1} CBITS_j(b) = c_b   (12.39)

for b ∈ {0, 1, 2, ..., wout−1}. Redefine (12.22) as

BITS_{j+1}(0, b) = BITS_j(K, b) + CBITS_{j+1}(b)   (12.40)

in order to add the constant bits to the carry-save tree.

12.5 Results

In this section, the different architectures are compared, and the choice of coefficient representation is investigated. For the energy and area estimations, a VHDL generator has been used to generate synthesizable VHDL code. The complete software package with ILP problem and VHDL code generator is available at [117].

12.5.1 Architecture comparison

Two filters have been used to evaluate the optimization algorithm and the relative performance of the architectures. The filters are based on 4-tap and 16-tap moving-average filters, M = 4 and M = 16, respectively. Both filters consist of three cascaded filters (L = 3). In all simulations, the numbers of full adders correspond to those given by (12.6) and (12.7), and the numbers of registers given by (12.15) and (12.17) were added to the optimized result for the DF1 and DF2 architectures, respectively. The filters were optimized using the ILP problem solver SCIP [2] with the costs C_FA = 3, C_HA = 2, and C_REG = 3. Even though the area of a full adder is roughly twice that of a half adder, the half adder cost was chosen slightly higher than half the full adder cost, as the routing associated with one full adder is likely less than that of two half adders.
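As an illustration of the quantities being traded off (a greedy stand-in for intuition only, not the thesis ILP or its software), the sketch below reduces a set of column heights level by level, inserting as many full adders as fit in each column, cf. (12.28), plus one half adder where exactly two bits would otherwise remain, and tallies the adder part of the cost (12.26) with the weights C_FA = 3 and C_HA = 2 used above. The register term is omitted, since it depends on where the pipeline cuts are placed.

```python
def greedy_reduce(cols, c_fa=3, c_ha=2):
    """Greedily reduce column heights to <= 2, ready for a VMA (12.25).

    cols: partial-product counts per bit position.
    Returns (levels, fa, ha, cost) with cost = c_fa*fa + c_ha*ha,
    i.e., (12.26) without the register term.
    """
    cols = list(cols)
    levels = fa_total = ha_total = 0
    while max(cols) > 2:
        nxt = [0] * (len(cols) + 1)
        for b, h in enumerate(cols):
            fa = h // 3                      # as many full adders as fit
            ha = 1 if h > 2 and h - 3 * fa == 2 else 0
            fa_total += fa
            ha_total += ha
            nxt[b] += h - 2 * fa - ha        # sums stay in column b
            nxt[b + 1] += fa + ha            # carries move to column b+1
        cols = nxt
        levels += 1
    return levels, fa_total, ha_total, c_fa * fa_total + c_ha * ha_total

print(greedy_reduce([7]))  # 7 bits of weight 0 -> (3, 4, 0, 12)
```

Note that this greedy rule fixes the adder placement level by level, whereas the ILP trades adders against registers globally; the sketch only shows what a feasible reduction looks like.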
The optimized filters have been compared with filters obtained using the Reduced Area [6] heuristic. The Reduced Area heuristic is claimed to minimize the number of registers when used in a pipelined multiplier reduction tree. However, it is interesting to note that this is in general not true for the bit-product matrices resulting from filters implemented with carry-save adder trees. Especially for the TF architecture, the bit-level optimized adder trees may result in significantly reduced register usage, while also using fewer half adders. Figure 12.14(a) shows the impact of the pipeline factor on the first filter with short wordlengths. For the 4-tap filter, Na = 1.8, and according to (12.14) utilizing the coefficient symmetry does not lead to reduced arithmetic complexity for wd < 2.25; the DF2 architecture has thus not been included. Also, all filters used six half adders. It can be seen that the bit-level optimized filters use significantly fewer registers, especially for heavily pipelined implementations. It can also be seen that the TF architecture has a lower register complexity, except for implementations with large stage height and one-bit input. In Fig. 12.14(b) and Fig. 12.14(c), respectively, the energy dissipation and active cell area (excluding routing) are shown. The area and energy measures are based on a 90 nm standard cell library using Synopsys Design Compiler. The energy estimations are obtained using a zero-delay model and assuming uncorrelated input data. In the considered application, using a zero-delay model is expected to yield relevant results, as the amount of glitches is small due to the short critical paths. Also, for a decimation filter for a sigma-delta ADC, the assumption of uncorrelated input data is considered to be relevant. In Fig. 12.14(d), the total cost of the optimized filters is shown.
These include additional logic and arithmetic, such as the algorithmic delays for the DF1 architecture and the adders used in the ripple-carry VMA, which are not considered in the optimization. By comparing Fig. 12.14(d) with Figs. 12.14(b) and 12.14(c), it can be concluded that the used cost function is a relevant measure for optimizing both energy dissipation and cell area. Whereas energy dissipation and cell area in general do not have a strong correlation, they can be expected to correlate well when the amount of glitches is small and uncorrelated input data is used. Thus, in the rest of this paper, only complexity results and energy dissipation results will be presented, as cell area and cost are similar to energy dissipation.

Figure 12.14 Optimization and synthesis results of FIR CIC filters using the TF and DF1 architectures with M = 4 and L = 3. (a) Register complexity comparison. (b) Energy dissipation estimation. (c) Synthesized cell area. (d) Optimized cost.

In Fig. 12.15, the implementation complexity of the 16-tap filter is shown, using one-bit input data. It is apparent that the bit-level optimized filter for the TF architecture offers a lower register complexity, while also reducing the number of half adders. Figures 12.16(a), 12.16(b) and 12.16(c) show the full adder, half adder and register complexity, respectively, for increasing input wordlength wd for the 4-tap filter.
The stage height is K = 2. For the 4-tap filter, Na = 1.8, and according to (12.14) the DF2 architecture has a lower arithmetic complexity for wd > 2. However, even for wd = 6, the reduction in arithmetic complexity is small compared to the increase in the number of registers, as seen in Fig. 12.16(c). For the DF3 architecture, a symmetry adder group size of S = 2 was used, thus reducing the number of generated partial products by 25%. However, the register complexity is still close to that of the DF2 architecture with full symmetry, as shown in Fig. 12.16(c). Energy dissipation estimations for the optimized filters are shown in Fig. 12.16(d).

Figure 12.15 Register/HA complexity of bit-level optimized FIR CIC filters with M = 16, L = 3, and wd = 1. The number of half adders is 12 or 13 for all simulations.

12.5.2 Coefficient representation

Different coefficient representations have been investigated for two filters shown in Fig. 12.17. Using signed-digit coefficients yields a small decrease in energy dissipation. The gain is not larger because the additional constant vector is relatively large compared with the number of bit-products for such small filters. Also, the simulation has not taken into account the fact that adders with a constant bit input may be simplified. It can be expected that the difference between binary and MSD coefficients would increase as the data and/or coefficient wordlength increases.

12.5.3 Subexpression sharing

In this section, adder sharing is evaluated for CIC decimation and interpolation filters with rate change factors of four and eight. Wordlengths between one and six bits were used for the decimation filters, and wordlengths between three and eight bits were used for the interpolation filters.
The DF1 and TF structures were evaluated, and for all filters the resulting adder trees were bit-level optimized after all shared subexpressions had been identified and adders introduced. Binary arithmetic with a critical path of three adders was used in all cases. The savings in register complexity were limited, and thus only the savings in adder complexity are presented. Three filters were used in cascade, and thus the single-rate impulse responses for the example filters are

h4(k) = (1, 3, 6, 10, 12, 12, 10, 6, 3, 1),   (12.41)

and

h8(k) = (1, 3, 6, 10, 15, 21, 28, 36, 42, 46, 48, 48, 46, 42, 36, 28, 21, 15, 10, 6, 3, 1),   (12.42)

for the filters of factors 4 and 8, respectively.

Figure 12.16 Complexity of FIR CIC filters with M = 4, L = 3, and K = 2. (a) Full adder complexity. (b) Half adder complexity. (c) Register complexity. (d) Energy dissipation of optimized filters.

Results of adder sharing for the decimation filters are shown in Fig. 12.18(a). The bigger filter offers significantly more sharing potential, which is expected. Generally, increasing the data wordlength increases the number of partial products and therefore the sharing potential. Results of adder sharing for the interpolation filters are shown in Fig. 12.18(b).
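The impulse responses in (12.41) and (12.42) are simply three cascaded moving-average (boxcar) responses, and can be reproduced by repeated convolution; a short check, not part of the thesis software:

```python
def conv(a, b):
    """Linear convolution of two integer sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def cic_impulse(M, L):
    """Impulse response of L cascaded M-tap moving-average filters."""
    h = [1]
    for _ in range(L):
        h = conv(h, [1] * M)
    return h

print(cic_impulse(4, 3))  # [1, 3, 6, 10, 12, 12, 10, 6, 3, 1], as in (12.41)
```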
Since the multiple output branches of interpolation filters cannot be merged into a common adder tree as for the decimation filters, the number of candidate subexpressions is significantly reduced and the sharing potential is smaller. This is especially true for the transposed direct form, where the structural delays in the reduction tree also inhibit merging of partial products from all taps in each output branch. For the TF architecture with interpolation by four there was no sharing potential.

Figure 12.17 Energy dissipation estimations using binary and MSD coefficients for FIR CIC filters with M = 4, L = 4 and M = 8, L = 3.

Figure 12.18 Subexpression sharing results for multirate FIR CIC filters. (a) Multirate FIR CIC decimation filters with factors 4 and 8. (b) Multirate FIR CIC interpolation filters with factors 4 and 8.

The factor-8 decimation filters in Figs. 12.18(a) and 12.18(b) have also been synthesized to standard cells of a 90 nm CMOS process using Synopsys Design Compiler. Flip-flops, full adders and half adders were used as leaf components, and the designs were synthesized for a clock frequency of 1.4 GHz. Energy and area estimations are shown in Figs. 12.19(a) and 12.19(b), respectively.

Figure 12.19 Energy dissipation and area estimations of FIR CIC decimation filters with factor 8. (a) Energy dissipation estimation. (b) Area estimation.

It is seen that
the redundancy reduction reduces both the energy dissipation and the area.

13 Conclusions and future work

In this chapter, the work presented in this part of the thesis is concluded, and suggestions for future work are given.

13.1 Conclusions

In part II of this thesis, contributions were made in the area of high-speed analog-to-digital conversion utilizing Σ∆-ADCs. The contributions can be divided into two parts: a sensitivity analysis of ADC systems using parallel Σ∆-modulators, and the efficient bit-level realization of FIR filters with high speed and low wordlength, which can typically be used as the first-stage decimation filter for a high-bandwidth Σ∆-ADC. In Chapter 10, a general model of an ADC using parallel Σ∆-modulators is introduced. The model uses periodic modulation sequences for the different channels, and multirate filter bank theory is used to define a signal transfer function for the system. The proposed model comprises several of the traditional parallel systems, including time-interleaved systems, Hadamard-modulated systems, and frequency-decomposed systems. A channel model including non-linearities of the modulators, gain errors and offset errors was used, and the different systems were analyzed for matching requirements using the channel model. It was found that the sensitivities differ significantly between the different systems, with the time-interleaved system requiring full matching between all channels, the Hadamard-modulated system requiring limited matching between subsets of the channels, and the frequency-decomposed system being inherently insensitive to mismatch problems. In addition, a new three-channel scheme is introduced that is insensitive to gain errors and modulator non-idealities.
In Chapter 12, the realization of efficient high-speed FIR filters was considered, with one of the main motivations being their use as decimation filters for wide-bandwidth Σ∆-modulators. Due to the high speed and short wordlength, the popular recursive CIC structures are inefficient, and instead FIR filters realized using polyphase decomposition, partial product generation and reduction through carry-save adder trees were considered. The focus is on the efficient realization of partial product reduction trees. An ILP-based optimization algorithm was proposed that minimizes a cost function defined as a weighted sum of the number of full adders, half adders and registers. The optimization algorithm was compared to the Reduced Area heuristic. Several different architectures, including direct form, transposed direct form, and direct form utilizing coefficient symmetry for linear-phase filters, were compared for several example filters and data wordlengths.

13.2 Future work

The following ideas are identified as possibilities for future work:

• The ILP-optimized reduction algorithm could be used also for other high-speed arithmetic. One such example is a multiply-and-accumulate unit, where the accumulation is recursive and will be an extra consideration in the optimization.

• The result of the optimization is a fixed structure with a constant filter impulse response. However, in the area of software-defined radio, the ability to trade bandwidth for resolution is needed, which can naturally be done using a fixed Σ∆-ADC and varying the bandwidth of the decimation filter. Thus, a method of optimizing an adaptive filter with varying bandwidth is interesting.

References

[1] H. Aboushady, Y. Dumonteix, M.-M. Louerat, and H. Mehrez, “Efficient polyphase decomposition of comb decimation filters in Σ∆ analog-to-digital converters,” IEEE Trans. Circuits Syst. II, vol. 48, no. 10, pp. 898–903, Oct. 2001.

[2] T. Achterberg, “Constraint Integer Programming,” Ph.D.
dissertation, Technische Universität Berlin, 2007. Available: http://opus.kobv.de/tuberlin/volltexte/2007/1611/.

[3] J. B. Anderson, Digital Transmission Engineering. IEEE Press, 1999.

[4] R. Batten, A. Eshragi, and T. S. Fiez, “Calibration of parallel ∆Σ ADCs,” IEEE Trans. Circuits Syst. II, vol. 49, no. 6, pp. 390–399, Jun. 2002.

[5] A. Berkeman, V. Öwall, and M. Torkelson, “A low logic depth complex multiplier using distributed arithmetic,” IEEE J. Solid-State Circuits, vol. 35, no. 4, pp. 656–659, Apr. 2000.

[6] K. Bickerstaff, M. Schulte, and E. E. Swartzlander, Jr., “Reduced area multipliers,” in Proc. Int. Conf. Application-Specific Array Processors, Nov. 1993, pp. 478–489.

[7] A. Blad and O. Gustafsson, “Energy-efficient data representation in LDPC decoders,” IET Electron. Lett., vol. 42, no. 18, pp. 1051–1052, Aug. 2006.

[8] ——, “Integer linear programming-based bit-level optimization for high-speed FIR decimation filter architectures,” Springer Circuits, Syst. Signal Process. – Special Issue on Low Power Digital Filter Design Techniques and Their Applications, vol. 29, no. 1, pp. 81–101, Feb. 2010.

[9] ——, “Redundancy reduction for high-speed FIR filter architectures based on carry-save adder trees,” in Proc. Int. Symp. Circuits, Syst., May 2010.

[10] ——, “FPGA implementation of rate-compatible QC-LDPC code decoder,” in Proc. European Conf. Circuit Theory Design, Aug. 2011.

[11] A. Blad, O. Gustafsson, and L. Wanhammar, “An early decision decoding algorithm for LDPC codes using dynamic thresholds,” in Proc. European Conf. Circuit Theory Design, Aug. 2005, pp. 285–288.

[12] ——, “A hybrid early decision-probability propagation decoding algorithm for low-density parity-check codes,” in Proc. Asilomar Conf. Signals, Syst., Comp., Oct. 2005.

[13] ——, “Implementation aspects of an early decision decoder for LDPC codes,” in Proc. Nordic Event ASIC Design, Nov. 2005.

[14] ——, “An LDPC decoding algorithm utilizing early decisions,” in Proc.
National Conf. Radio Science, Jun. 2005.

[15] A. Blad, O. Gustafsson, M. Zheng, and Z. Fei, “Integer linear programming based optimization of puncturing sequences for quasi-cyclic low-density parity-check codes,” in Proc. Int. Symp. Turbo-Codes, Related Topics, Sep. 2010.

[16] A. Blad, H. Johansson, and P. Löwenborg, “Multirate formulation for mismatch sensitivity analysis of analog-to-digital converters that utilize parallel sigma-delta modulators,” EURASIP J. Advances Signal Process., vol. 2008, 2008, article ID 289184, 11 pages.

[17] A. Blad, P. Löwenborg, and H. Johansson, “Design trade-offs for linear-phase FIR decimation filters and sigma-delta modulators,” in Proc. XIV European Signal Process. Conf., Sep. 2006.

[18] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE J. Solid-State Circuits, vol. 37, no. 3, pp. 404–412, Mar. 2002.

[19] L. J. Breems, R. Rutten, and G. Wetzker, “A cascaded continuous-time Σ∆ modulator with 67-dB dynamic range in 10-MHz bandwidth,” IEEE J. Solid-State Circuits, vol. 39, no. 12, pp. 2152–2160, Dec. 2004.

[20] J. Candy, “Decimation for sigma delta modulation,” IEEE Trans. Commun., vol. 29, pp. 72–76, Jan. 1986.

[21] J. C. Candy and G. C. Temes, Oversampling Methods for A/D and D/A Conversion. IEEE Press, 1992.

[22] Y.-J. Chang, F. Lai, and C.-L. Yang, “Zero-aware asymmetric SRAM cell for reducing cache power in writing zero,” IEEE Trans. VLSI Syst., vol. 12, no. 8, pp. 827–836, Aug. 2004.

[23] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X.-Y. Hu, “Reduced-complexity decoding of LDPC codes,” IEEE Trans. Commun., vol. 53, no. 8, pp. 1288–1299, Aug. 2005.

[24] E. Choi, S. Suh, and J. Kim, “Rate-compatible puncturing for low-density parity-check codes with dual-diagonal parity structure,” in Proc. Symp. Personal Indoor Mobile Radio Commun., Sep. 2005, pp. 2642–2646.

[25] S.-Y.
Chung, “On the construction of some capacity-approaching coding schemes,” Ph.D. dissertation, Massachusetts Institute of Technology, Sep. 2000.

[26] S.-Y. Chung, G. D. Forney, T. J. Richardson, and R. Urbanke, “On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit,” IEEE Commun. Lett., vol. 5, no. 2, pp. 58–60, Feb. 2001.

[27] S.-Y. Chung, T. J. Richardson, and R. Urbanke, “Analysis of sum-product decoding of low-density parity-check codes using a Gaussian approximation,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 657–670, Feb. 2001.

[28] F. Colodro, A. Torralba, and M. Laguna, “Time-interleaved multirate sigma-delta modulators,” in Proc. Int. Symp. Circuits, Syst., vol. 6, May 2005, pp. 5581–5584.

[29] R. F. Cormier, T. L. Sculley, and R. H. Bamberger, “Combining subband decomposition and sigma-delta modulation for wideband A/D conversion,” in Proc. Int. Symp. Circuits, Syst., vol. 5, May 1994, pp. 357–360.

[30] L. Dadda, “Some schemes for parallel multipliers,” Alta Frequenza, vol. 34, pp. 349–356, May 1965.

[31] D. Declercq and F. Verdier, “Optimization of LDPC finite precision belief propagation decoding with discrete density evolution,” in Proc. Int. Symp. Turbo-Codes, Related Topics, Sep. 2003, pp. 479–482.

[32] G. Ding, C. Dehollain, M. Declercq, and K. Azadet, “Frequency-interleaving technique for high-speed A/D conversion,” in Proc. Int. Symp. Circuits, Syst., vol. 1, May 2003, pp. 857–860.

[33] Y. Dumonteix and H. Mehrez, “A family of redundant multipliers dedicated to fast computation for signal processing,” in Proc. Int. Symp. Circuits, Syst., vol. 5, May 2000, pp. 325–328.

[34] A. Eshraghi and T. S. Fiez, “A time-interleaved parallel ∆Σ A/D converter,” IEEE Trans. Circuits Syst. II, vol. 50, no. 3, pp. 118–129, Mar. 2003.

[35] ——, “A comparative analysis of parallel delta-sigma ADC architectures,” IEEE Trans. Circuits Syst. I, vol. 51, no. 3, pp. 450–458, Mar. 2004.

[36] T. Etzion, A.
Trachtenberg, and A. Vardy, “Which codes have cycle-free Tanner graphs?” IEEE Trans. Inf. Theory, vol. 45, no. 6, pp. 2173–2181, Sep. 1999.

[37] M. Fossorier, “Quasi-cyclic low-density parity-check codes from circulant permutation matrices,” IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1788–1793, Aug. 2004.

[38] M. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexity iterative decoding of low density parity check codes based on belief propagation,” IEEE Trans. Commun., vol. 47, pp. 673–680, May 1999.

[39] R. Gallager, “Low-density parity-check codes,” Ph.D. dissertation, Massachusetts Institute of Technology, 1963.

[40] I. Galton and H. T. Jensen, “Delta-sigma modulator based A/D conversion without oversampling,” IEEE Trans. Circuits Syst. II, vol. 42, no. 12, pp. 773–784, Dec. 1995.

[41] ——, “Oversampling parallel delta-sigma modulator A/D conversion,” IEEE Trans. Circuits Syst. II, vol. 43, no. 12, pp. 801–810, Dec. 1996.

[42] Y. Gao, L. Jia, J. Isoaho, and H. Tenhunen, “A comparison design of comb decimators for sigma-delta analog-to-digital converters,” Springer Analog Integrated Circuits, Signal Process., vol. 22, no. 1, pp. 51–60, Jan. 2000.

[43] M. Good and F. R. Kschischang, “Incremental redundancy via check splitting,” in Proc. Biennial Symp. Commun., 2006, pp. 55–58.

[44] F. Guilloud, E. Boutillon, J. Tousch, and J.-L. Danger, “Generic description and synthesis of LDPC decoders,” IEEE Trans. Commun., vol. 55, no. 11, pp. 2084–2091, Nov. 2007.

[45] O. Gustafsson and H. Ohlsson, “A low power decimation filter architecture for high-speed single-bit sigma-delta modulation,” in Proc. Int. Symp. Circuits, Syst., vol. 2, May 2005, pp. 1453–1456.

[46] J. Ha, J. Kim, D. Klinc, and S. W. McLaughlin, “Rate-compatible punctured low-density parity-check codes with short block lengths,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 728–738, 2006.

[47] J. Ha, J. Kim, and S. W. McLaughlin, “Rate-compatible puncturing of low-density parity-check codes,” IEEE Trans.
Inf. Theory, vol. IT-50, no. 11, pp. 2824–2836, Nov. 2004.

[48] J. Hagenauer, “Rate-compatible punctured convolutional codes (RCPC codes) and their applications,” IEEE Trans. Commun., vol. 36, no. 4, pp. 389–400, Apr. 1988.

[49] O. Herrmann, R. L. Rabiner, and D. S. K. Chan, “Practical design rules for optimum finite impulse response digital filters,” Bell System Technical J., vol. 52, no. 6, pp. 769–799, 1973.

[50] D. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,” in Proc. Workshop Signal Process. Syst., Oct. 2004, pp. 107–112.

[51] E. B. Hogenauer, “An economical class of digital filters for decimation and interpolation,” IEEE Trans. Acoust., Speech, Signal Process., vol. 29, pp. 155–162, Apr. 1981.

[52] C. Howland and A. Blanksby, “A 220mW 1Gb/s 1024-bit rate-1/2 low density parity check code decoder,” in Proc. Custom Integrated Circuits Conf., May 2001, pp. 293–296.

[53] P. G. A. Jespers, Integrated Converters: D to A and A to D Architectures, Analysis and Simulation. New York: Oxford University Press, 2001.

[54] K. Johansson, O. Gustafsson, and L. Wanhammar, “Implementation of elementary functions for logarithmic number systems,” IET J. Computers, Digital Techniques, vol. 2, no. 4, pp. 295–304, Jul. 2008.

[55] S. J. Johnson and S. R. Weller, “Construction of low-density parity-check codes from Kirkman triple systems,” in Proc. Global Commun. Conf., Nov. 2003, pp. 970–974.

[56] R. Khoini-Poorfard and D. A. Johns, “Time-interleaved oversampling converters,” IET Electron. Lett., vol. 29, no. 19, pp. 1673–1674, Sep. 1993.

[57] ——, “Mismatch effects in time-interleaved oversampling converters,” in Proc. Int. Symp. Circuits, Syst., vol. 5, May 1994, pp. 429–432.

[58] R. Khoini-Poorfard, L. B. Lim, and D. A. Johns, “Time-interleaved oversampling A/D converters: Theory and practice,” IEEE Trans. Circuits Syst. II, vol. 44, no. 8, pp. 634–645, Aug. 1997.

[59] K.-Y. Khoo, Z. Yu, and A. N.
Willson, Jr., “Bit-level arithmetic optimization for carry-save additions,” in Proc. Computer-Aided Design, Digest of Technical Papers, Nov. 1999, pp. 14–18.

[60] J.-A. Kim, S.-R. Kim, D.-J. Shin, and S.-N. Hong, “Analysis of check-node merging decoding for punctured LDPC codes with dual-diagonal parity structure,” in Proc. Wireless Commun. Netw. Conf., 2007, pp. 572–576.

[61] E. King, F. Aram, T. Fiez, and I. Galton, “Parallel delta-sigma A/D conversion,” in Proc. Custom Integrated Circuits Conf., May 1994, pp. 503–506.

[62] S. K. Kong and W. H. Ku, “Frequency domain analysis of Π∆Σ ADC and its application to combining subband decomposition and Π∆Σ ADC,” in Proc. Midwest Symp. Circuits, Syst., vol. 1, Aug. 1997, pp. 226–229.

[63] Y. Kou, S. Lin, and M. P. C. Fossorier, “Low-density parity-check codes based on finite geometries: A rediscovery and new results,” IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2711–2736, Nov. 2001.

[64] M. Kozak, M. Karaman, and I. Kale, “Efficient architectures for time-interleaved oversampling delta-sigma converters,” IEEE Trans. Circuits Syst. II, vol. 47, no. 8, pp. 802–810, Aug. 2000.

[65] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.

[66] T.-C. Kuo and A. N. Willson, Jr., “A flexible decoder IC for WiMAX QC-LDPC codes,” in Proc. Custom Integrated Circuits Conf., 2008, pp. 527–530.

[67] C. Kuskie, B. Zhang, and R. Schreier, “A decimation filter architecture for GHz delta-sigma modulators,” in Proc. Int. Symp. Circuits, Syst., vol. 2, May 1995, pp. 953–956.

[68] Y. Li, M. Elassal, and M. Bayoumi, “Power efficient architecture for (3,6)-regular low-density parity-check code decoder,” in Proc. Int. Symp. Circuits, Syst., vol. IV, May 2004, pp. 81–84.

[69] C.-Y. Lin, C.-C. Wei, and M.-K. Ku, “Efficient encoding for dual-diagonal structured LDPC codes based on parity bit prediction and correction,” in Proc.
Asia-Pacific Conf. Circuits, Syst., 2008, pp. 1648–1651. [70] C.-C. Lin, K.-L. Lin, H.-C. Chang, and C.-Y. Lee, “A 3.33Gb/s (1200, 720) low-density parity-check code decoder,” in Proc. European Solid-State Circuits Conf., Sep. 2005, pp. 211–214. [71] S. Lin, L. Chen, J. Xu, and I. Djurdjevic, “Near Shannon limit quasi-cyclic low-density parity-check codes,” in Proc. Global Commun. Conf., May 2003, pp. 2030–2035. [72] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman, “Analysis of low density codes and improved designs using irregular graphs,” in Proc. Annual Symp. Theory Computing, 1998, pp. 249–258. [73] D. MacKay, “Good error-correcting codes based on very sparse matrices,” in Proc. Int. Symp. Inf. Theory, Jun. 1997, p. 113. [74] ——, “Good error-correcting codes based on very sparse matrices,” IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 399–431, Mar. 1999. [75] D. MacKay and R. Neal, “Near Shannon limit performance of low density parity check codes,” IET Electron. Lett., vol. 33, no. 6, pp. 457–458, Mar. 1997. [76] U. Meyer-Baese, S. Rao, J. Ramírez, and A. García, “Cost-effective Hogenauer cascaded integrator comb decimator filter design for custom ICs,” IET Electron. Lett., vol. 41, no. 3, pp. 158–160, Feb. 2005. [77] O. Milenkovic, I. B. Djordjevic, and B. Vasic, “Block-circulant low-density parity-check codes for optical communication systems,” IEEE J. Sel. Topics Quantum Electron., vol. 10, no. 2, pp. 294–299, Mar. 2004. [78] S. K. Mitra, Digital Signal Processing: A Computer-Based Approach. McGraw-Hill, 2006. [79] F. Naessens, V. Derudder, H. Cappelle, L. Hollevoet, P. Raghavan, M. Desmet, A. AbdelHamid, I. Vos, L. Folens, S. O’Loughlin, S. Singirikonda, S. Dupont, J.-W. Weijers, A. Dejonghe, and L. Van der Perre, “A 10.37 mm² 675 mW reconfigurable LDPC and Turbo encoder and decoder for 802.11n, 802.16e and 3GPP-LTE,” in Proc. Symp. VLSI Circuits, 2010, pp. 213–214. [80] V. T. Nguyen, P. Loumeau, and J.-F.
Naviner, “Analysis of time-interleaved delta-sigma analog to digital converter,” in Proc. Veh. Tech. Conf., vol. 4, Sep. 2002. [81] S. R. Norsworthy, R. Schreier, and G. C. Temes, Eds., Delta-Sigma Data Converters: Theory, Design, and Simulation. Wiley-IEEE Press, 1996, ISBN 0780310454. [82] H. Ohlsson, B. Mesgarzadeh, K. Johansson, O. Gustafsson, P. Löwenborg, H. Johansson, and A. Alvandpour, “A 16 GSPS 0.18 µm CMOS decimator for single-bit Σ∆-modulation,” in Proc. Nordic Event ASIC Design, Nov. 2004, pp. 175–178. [83] H. Y. Park, J. W. Kang, K. S. Kim, and K. C. Whang, “Efficient puncturing method for rate-compatible low-density parity-check codes,” IEEE Trans. Wireless Commun., vol. 6, no. 11, pp. 3914–3919, Nov. 2007. [84] T. Parks and J. McClellan, “Chebyshev approximation for nonrecursive digital filters with linear phase,” IEEE Trans. Circuit Theory, vol. 19, no. 2, pp. 189–194, 1972. [85] L. Rabiner and O. Herrmann, “The predictability of certain optimum finite impulse response digital filters,” IEEE Trans. Circuit Theory, vol. CT-20, no. 4, pp. 401–408, Jul. 1973. [86] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 619–637, Feb. 2001. [87] T. J. Richardson and R. L. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 599–618, Feb. 2001. [88] R. Schreier and G. C. Temes, Understanding Delta-Sigma Data Converters. John Wiley & Sons, NJ, 2005. [89] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical J., vol. 27, pp. 379–423, 623–656, 1948. [90] X.-Y. Shih, C.-Z. Zhan, and A.-Y. Wu, “A 7.39 mm² 76mW (1944, 972) LDPC decoder chip for IEEE 802.11n applications,” in Proc. Asian Solid-State Circuits Conf., Nov. 2008, pp. 301–304. [91] E. E. Swartzlander, Jr., “Merged arithmetic,” IEEE Trans. Comput., vol.
C-29, no. 10, pp. 946–950, Oct. 1980. [92] R. M. Tanner, “A recursive approach to low complexity codes,” IEEE Trans. Inf. Theory, vol. IT-27, no. 5, pp. 533–547, Sep. 1981. [93] K. Uyttenhove and M. Steyaert, “Speed-power-accuracy trade-off in high-speed ADCs,” IEEE Trans. Circuits Syst. II, vol. 49, no. 4, pp. 247–257, Apr. 2002. [94] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood Cliffs, NJ: Prentice-Hall, 1993. [95] B. Vasic and O. Milenkovic, “Combinatorial constructions of low-density parity-check codes for iterative decoding,” IEEE Trans. Inf. Theory, vol. 50, no. 6, pp. 1156–1176, Jun. 2004. [96] B. Vasic, K. Pedagani, and M. Ivkovic, “High-rate girth-eight low-density parity-check codes on rectangular integer lattices,” IEEE Trans. Inf. Theory, vol. 52, no. 8, pp. 1248–1252, Aug. 2004. [97] C. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964. [98] Y. Wang, J. Yedidia, and S. Draper, “Construction of high-girth QC-LDPC codes,” in Proc. Int. Symp. Turbo-Codes, Related Topics, Sep. 2008, pp. 180–185. [99] Z. Wang, Z. Cui, and J. Sha, “VLSI design for low-density parity-check code decoding,” IEEE Circuits Syst. Mag., vol. 11, no. 1, pp. 52–69, 2011. [100] L. Wanhammar and H. Johansson, Digital Filters. Linköping University, 2002. [101] N. Wiberg, “Codes and decoding on general graphs,” Ph.D. dissertation, Linköping University, Linköping, Sweden, Jun. 1996. [102] S. B. Wicker, Error Control Systems. Prentice Hall, 1995. [103] N. Yaghini and D. Johns, “A 43mW CT complex ∆Σ ADC with 23MHz of signal bandwidth and 68.8dB SNDR,” in Proc. Int. Solid-State Circuits Conf., Digest of Technical Papers, Feb. 2005, pp. 502–503. [104] E. Yeo, B. Nikolic, and V. Anantharam, “Architectures and implementations of low-density parity check decoding algorithms,” in Proc. Midwest Symp. Circuits, Syst., Aug. 2002, pp. III-437–III-440. [105] E. Yeo, P. Pakzad, B. Nikolic, and V.
Anantharam, “VLSI architectures for iterative decoders in magnetic recording channels,” IEEE Trans. Magn., vol. 37, no. 2, pp. 748–755, Mar. 2001. [106] K. Zhang, X. Huang, and Z. Wang, “A high-throughput LDPC decoder architecture with rate compatibility,” IEEE Trans. Circuits Syst. I, vol. 58, no. 4, pp. 839–847, Apr. 2011. [107] T. Zhang and K. K. Parhi, “Joint code and decoder design for implementation-oriented (3, k)-regular LDPC codes,” in Proc. Asilomar Conf. Signals, Syst., Comp., Nov. 2001, pp. 1232–1236. [108] ——, “A 54 Mbps (3,6)-regular FPGA LDPC decoder,” in Proc. Workshop, Signal Proc. Syst., Oct. 2002, pp. 127–132. [109] ——, “An FPGA implementation of (3,6) regular low-density parity-check code decoder,” EURASIP J. Applied Signal Process., vol. 6, pp. 530–542, May 2003. [110] ——, “Joint (3, k)-regular LDPC code and decoder/encoder design,” IEEE Trans. Signal Process., vol. 52, no. 4, pp. 1065–1079, Apr. 2004. [111] T. Zhang, Z. Wang, and K. K. Parhi, “On finite precision implementation of low density parity check codes decoder,” in Proc. Int. Symp. Circuits, Syst., May 2001, pp. 202–205. [112] E. Zimmermann, G. Fettweis, P. Pattisapu, and P. K. Bora, “Reduced complexity LDPC decoding using forced convergence,” in Proc. Int. Symp. Wireless Personal Multimedia Commun., Sep. 2004. [113] “ETSI EN 302 307, digital video broadcasting (DVB): Second generation framing structure, channel coding and modulation systems for broadcasting, interactive services, news gathering and other broadband satellite applications (DVB-S2),” available: http://www.dvb.org/technology/standards/. [114] “ETSI EN 302 755, digital video broadcasting (DVB): Frame structure channel coding and modulation for a second generation digital terrestrial television broadcasting system (DVB-T2),” available: http://www.dvb.org/technology/standards/.
[115] “ETSI EN 302 769, digital video broadcasting (DVB): Frame structure channel coding and modulation for a second generation digital transmission system for cable systems (DVB-C2),” available: http://www.dvb.org/technology/standards/. [116] “GSFC-STD-9100, low density parity check code for rate 7/8,” available: http://standards.gsfc.nasa.gov/stds.html. [117] “High-speed FIR filter generator,” available: http://www.es.isy.liu.se/software/hsfir/. [118] “IEEE 802.11n-2009: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications amendment 5: Enhancements for higher throughput,” available: http://standards.ieee.org/getieee802/download/802.11n-2009.pdf. [119] “IEEE 802.15.3c-2009: Wireless medium access control (MAC) and physical layer (PHY) specifications for high rate wireless personal area networks (WPANs) amendment 2: Millimeter-wave-based alternative physical layer extension,” available: http://standards.ieee.org/getieee802/download/802.15.3c-2009.pdf. [120] “IEEE 802.16e-2005: Air interface for fixed and mobile broadband wireless access systems,” available: http://standards.ieee.org/getieee802/download/802.16e-2005.pdf. [121] “IEEE 802.3an-2006: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications,” available: http://standards.ieee.org/getieee802/download/802.3an-2006.pdf. [122] “National Standardization Management Committee, GB 20600-2006: Digital television terrestrial broadcasting transmission system frame structure, channel coding and modulation. Beijing: Standards Press of China, Aug. 2006.”
