Linköping Studies in Science and Technology. Dissertations, No. 1385

Low Complexity Techniques for Low Density Parity Check Code Decoders and Parallel Sigma-Delta ADC Structures

Anton Blad

Department of Electrical Engineering
Linköping University
SE–581 83 Linköping
Sweden

Linköping 2011
Low Complexity Techniques for Low Density Parity Check Code Decoders and Parallel Sigma-Delta ADC Structures
Anton Blad
Linköping Studies in Science and Technology. Dissertations, No. 1385
Copyright © 2011 Anton Blad
ISBN 978-91-7393-104-5
ISSN 0345-7524
e-mail: [email protected]
thesis url: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-69432
Department of Electrical Engineering
Linköping University
SE–581 83 Linköping
Sweden
Printed by LiU-Tryck, Linköping, Sweden 2011
Abstract
In this thesis, contributions are made in the area of receivers for wireless communication standards. The thesis consists of two parts, focusing on implementations of forward error correction using low-density parity-check (LDPC) codes, and high-bandwidth analog-to-digital converters (ADCs) using sigma-delta modulators.
LDPC codes have received widespread attention since 1995 as practical
capacity-approaching code candidates. It has been shown that the class of codes
can perform arbitrarily close to the channel capacity, and LDPC codes are also
used or suggested for a number of current and future communication standards,
including the 802.16e WiMAX standard, the 802.11n WLAN standard, and the
second generation of digital TV standards DVB-x2. The first part of the thesis
contains two main contributions to the problem of decoding LDPC codes, denoted
the early-decision decoding algorithm and the check-merging decoding algorithm.
The early-decision decoding algorithm is a method of terminating parts of the
decoding process early for bits that have high reliabilities, thereby reducing the
computational complexity of the decoder. The check-merging decoding algorithm
is a method of reducing the code complexity of rate-compatible LDPC codes and
increasing the efficiency of the decoding algorithm, thereby offering a significant
throughput increase. For the two algorithms, architectures are proposed and synthesized for FPGAs and the resulting performance and logic utilization are compared with the original algorithms.
Sigma-delta ADCs are the natural choice for low-to-medium bandwidth applications that require high resolution. However, suggestions have also been made to use them for high-bandwidth communication standards, which require either high sampling rates or several ADCs operating in parallel. In this thesis, two contributions are made in the area of high-bandwidth ADCs using sigma-delta modulators.
The first is a general formulation of parallel ADCs using modulation of the input
data. The formulation allows a system’s sensitivity to analog mismatch errors in
the channels to be analyzed, and it is shown that some systems can be made insensitive to certain matching errors, whereas others may require matching of limited
subsets of the channels, or full matching of all channels. Limited sensitivity to
mismatch errors reduces the complexity of the analog parts. Simulation results are
provided for a time-interleaved ADC, a Hadamard-modulated ADC, a frequency-band decomposed ADC, as well as for a new modulation scheme that is insensitive
to channel gain mismatches. The second contribution relates to the implementation of high-speed digital filters, where a typical application is decimation filters
for a high-bandwidth sigma-delta ADC. A bit-level optimization algorithm is proposed that minimizes a cost function defined as a weighted sum of the number
of full adders, half adders and registers. Simulation results show comparisons between bit-level optimized filters and structures obtained using common heuristics
for carry-save adder trees.
Popular Science Summary
Most of today's wireless communication systems are based on digital transmission of data. This thesis consists of two parts, which contribute to two different parts of modern receivers for digital communication systems.
The first part treats energy-efficient decoding of error-correcting codes. Error-correcting coding is one of the main advantages of digital communication over analog communication, and is based on adding redundancy to the transmitted information. Errors inevitably occur during transmission, but with the help of the redundancy it is possible to determine where the errors occurred, and the reliability of the transmission can thereby be increased. Error-correcting codes with a high capability of detecting and correcting errors are easy to construct, but building practical implementations of a decoder is harder, and the decoder often accounts for a large part of the power consumption in receiver ICs. This thesis treats a specific type of error-correcting codes (LDPC codes) that come very close to the theoretical limit of error-correcting capability, and two improvements of the decoding algorithms are proposed. In both cases, the proposed algorithms are more complex, but reduce the power consumption of an implementation.
The second part of the thesis treats a specific type of analog-to-digital converter (ADC), which converts the received signal into digital information. Sigma-delta is a type of ADC that is particularly well suited for integration with digital systems on a common IC. The drawback today, however, is that the conversion rate is relatively low. One way to increase the rate is to use several converters in parallel, each handling a part of the input signal separately. The drawback is that such systems often become sensitive to variations between the individual converters, and this thesis proposes a method of modeling several parallel sigma-delta ADCs in order to analyze the sensitivity requirements. It turns out that some systems are insensitive to variations, whereas others may require matching of limited subsets of the converters. Another problem is that the output of a sigma-delta converter consists of a data stream with a very high data rate and a large amount of quantization noise. Before the data stream can be used in an application, it must first be decimated. The thesis also contains a method of formulating the design of such decimation filters as an optimization problem, in order to obtain filters with low complexity.
Acknowledgments
It is with mixed feelings I look back at the 5+ years as a PhD student at Electronics Systems. This thesis is the result of many hours of work in a competent and
motivating environment, an environment promoting independence and individualism while still offering abundant support and opportunities for discussion. Being a
PhD student has been hard at times, but looking back I cannot imagine conditions
better suited for a researcher at the start of his career.
There are many people who have given me inspiration and support through
these years, and I want to take the opportunity here to say thanks to the following
people:
• My supervisor Dr. Oscar Gustafsson for being a huge source of motivation by always taking the time to discuss my work and endlessly offering new insights and ideas.
• My very good friends Dr. Fredrik Kuivinen and M.Sc. Jakob Rosén for all
the fun with electronics projects and retro gaming sessions in the evenings at
campus.
• Prof. Christer Svensson for getting me in contact with the STMicroelectronics
research lab in Geneva.
• Dr. Andras Pozsgay at the Advanced Radio Architectures group at STMicroelectronics in Geneva, for offering me the possibility of working with “practical” research for six months during spring 2007. It has been a very valuable
experience.
• All the other people in my research group at STMicroelectronics in Geneva,
for making my stay there a pleasant experience.
• Prof. Fei Zesong at the Department of Electrical Engineering at Beijing Institute of Technology in China, for giving me the possibility of two three-month
PhD student exchanges in spring 2009 and winter 2010/2011.
• M.Sc. Wang Hao and M.Sc. Shen Zhuzhe for helping me with all the practicalities for my visits in Beijing.
• M.Sc. Zhao Hongjie for the cooperation during his year as a PhD student
exchange at Linköping University.
• All the others at the Modern Communication Lab at Beijing Institute of
Technology for their kindness and support in an environment that was very
different to what I am used to.
• Dr. Kent Palmkvist for help with FPGA- and VHDL-related issues.
• M.Sc. Sune Söderkvist for the big contribution to the generally happy and
positive atmosphere at Electronics Systems.
• All the other present and former colleagues at Electronics Systems.
• All the colleagues at the Electronic Components research group during my
time there from 2006 to 2008.
• All the colleagues at Communications Systems during my time there as a
research engineer in spring 2010.
• All my friends who have made my life pleasant during my time as a PhD
student.
• Last but not least, I thank my parents Maj and Bengt Blad for all their
encouragement and time that they gave me as a child, which is definitely
part of the reason that I have come this far in life. I also thank my sisters
Lisa and Tove Blad, who I don’t see very often but still feel I am very close
to when I do.
Anton Blad
Linköping, July 2011
Abbreviations

ADC      Analog-to-Digital Converter
AWGN     Additive White Gaussian Noise
BEC      Binary Erasure Channel
BER      Bit Error Rate
BLER     Block Error Rate
BPSK     Binary Phase Shift Keying
BSC      Binary Symmetric Channel
CFU      Check Function Unit
CIC      Cascaded Integrator Comb
CMOS     Complementary Metal Oxide Semiconductor
CNU      Check Node processing Unit
CSA      Carry-Save Adder
CSD      Canonic Signed-Digit
DAC      Digital-to-Analog Converter
DECT     Digital Enhanced Cordless Telecommunications
DFT      Discrete Fourier Transform
DTTB     Digital Terrestrial Television Broadcasting
DVB-S2   Digital Video Broadcasting - Satellite 2nd generation
Eb/N0    Bit energy to noise spectral density (normalized SNR)
ECC      Error Correction Coding
ED       Early Decision
FIR      Finite Impulse Response
FPGA     Field Programmable Gate Array
GPS      Global Positioning Satellite
ILP      Integer Linear Programming
k-SE     k-step enabled
k-SR     k-step recoverable
LAN      Local Area Network
LDPC     Low-Density Parity Check
LUT      Look-Up Table
MPR      McClellan-Parks-Rabiner
MSD      Minimum Signed-Digit
MUX      Multiplexer
OSR      Oversampling ratio
PSD      Power Spectral Density
QAM      Quadrature Amplitude Modulation
QC-LDPC  Quasi-Cyclic Low-Density Parity-Check
QPSK     Quadrature Phase Shift Keying
RAM      Random Access Memory
ROM      Read-Only Memory
SD       Signed-Digit
SNR      Signal-to-Noise Ratio
USB      Universal Serial Bus
VHDL     VHSIC (Very High Speed Integrated Circuit) Hardware Description Language
VLSI     Very Large Scale Integration
VMA      Vector Merge Adder
VNU      Variable Node processing Unit
WLAN     Wireless Local Area Network
WPAN     Wireless Personal Area Network
Preface
Thesis outline
In this thesis, contributions are made in two different areas related to the design
of receivers for radio communications, and the contents are therefore separated
into two parts. Part I consists of Chapters 1–6 and offers contributions in the
area of low density parity check (LDPC) code decoding, whereas Part II consists
of Chapters 7–13 and offers contributions related to high-speed analog-to-digital
conversion using Σ∆-ADCs.
The outline of Part I is as follows. In Chapter 1, a short background, possible
applications and the scientific contributions are discussed. In Chapter 2, the basics
of digital communications are described and LDPC codes are introduced. Also,
two decoder architectures are described, which are used as reference implementations for the contributed work. In Chapter 3, early decision decoding is proposed
as a method of reducing the computational complexity of the decoding algorithm.
Performance issues related to the algorithm are analyzed, and solutions are suggested. Also, an implementation of the algorithm for FPGA is described, and the
resulting estimations of area and power dissipation are included. In Chapter 4,
an improved algorithm for decoding of rate-compatible LDPC codes is proposed.
The algorithm offers a significant reduction of the average number of iterations
required for decoding of punctured codes, thereby offering a significant increase in
throughput. An architecture implementing the algorithm is proposed, and simulation and synthesis results are included. In Chapter 5, a minor contribution in the
data representation of a sum-product LDPC decoder is explained. It is shown how
redundancy in the data representation can be used to reduce the required memory
used for storage of messages between iterations. Finally, in Chapter 6, conclusions
are given and future work is discussed.
The outline of Part II is as follows. In Chapter 7, an introduction to high-speed data conversion is given, and the scientific contributions of the second part
of the thesis are described. In Chapter 8, a short introduction to finite impulse
response (FIR) filters, multirate theory and FIR filter architectures is given. In
Chapter 9, the basics of ADCs using Σ∆-modulators are discussed, and some high-speed structures using parallel Σ∆-ADCs are shown. In Chapter 10, a general
model for the analysis of matching requirements in parallel Σ∆-ADCs is proposed.
It is shown that some parallel systems may become alias-free with limited matching
between subsets of the channels, whereas others may require matching between all
channels. In Chapter 11, a short analysis of the relations between oversampling
factors, Σ∆-modulator orders, required signal-to-noise ratio (SNR) and decimation
filter complexity is contributed. In Chapter 12, an integer linear programming
approach to the design of high-speed decimation filters for Σ∆-ADCs is proposed.
Several architectures are discussed and their complexities compared. Finally, in
Chapter 13, conclusions are given and future work is discussed.
Publications
This thesis contains research done at Electronics Systems, Department of Electrical
Engineering at Linköping University, Sweden. The work has been done between
March 2005 and June 2011, and has resulted in the following publications [7–17]:
1. A. Blad, O. Gustafsson, and L. Wanhammar, “An LDPC decoding algorithm
utilizing early decisions,” in Proc. National Conf. Radio Science, Jun. 2005.
2. A. Blad, O. Gustafsson, and L. Wanhammar, “An early decision decoding
algorithm for LDPC codes using dynamic thresholds,” in Proc. European
Conf. Circuit Theory Design, Aug. 2005, pp. 285–288.
3. A. Blad, O. Gustafsson, and L. Wanhammar, “A hybrid early decision-probability propagation decoding algorithm for low-density parity-check
codes,” in Proc. Asilomar Conf. Signals, Syst., Comp., Oct. 2005.
4. A. Blad, O. Gustafsson, and L. Wanhammar, “Implementation aspects of an
early decision decoder for LDPC codes,” in Proc. Nordic Event ASIC Design,
Nov. 2005.
5. A. Blad and O. Gustafsson, “Energy-efficient data representation in LDPC
decoders,” IET Electron. Lett., vol. 42, no. 18, pp. 1051–1052, Aug. 2006.
6. A. Blad, P. Löwenborg, and H. Johansson, “Design trade-offs for linear-phase
FIR decimation filters and sigma-delta modulators,” in Proc. XIV European
Signal Process. Conf., Sep. 2006.
7. A. Blad, H. Johansson, and P. Löwenborg, “Multirate formulation for mismatch sensitivity analysis of analog-to-digital converters that utilize parallel sigma-delta modulators,” EURASIP J. Advances Signal Process., vol. 2008,
2008, article ID 289184, 11 pages.
8. A. Blad and O. Gustafsson, “Integer linear programming-based bit-level optimization for high-speed FIR decimation filter architectures,” Springer Circuits, Syst. Signal Process. - Special Issue on Low Power Digital Filter Design
Techniques and Their Applications, vol. 29, no. 1, pp. 81–101, Feb. 2010.
9. A. Blad and O. Gustafsson, “Redundancy reduction for high-speed FIR filter
architectures based on carry-save adder trees,” in Proc. Int. Symp. Circuits,
Syst., May 2010.
10. A. Blad, O. Gustafsson, M. Zheng, and Z. Fei, “Integer linear programming based optimization of puncturing sequences for quasi-cyclic low-density
parity-check codes,” in Proc. Int. Symp. Turbo-Codes, Related Topics,
Sep. 2010.
11. A. Blad and O. Gustafsson, “FPGA implementation of rate-compatible
QC-LDPC code decoder,” in Proc. European Conf. Circuit Theory Design,
Aug. 2011.
During the period, the following papers were also published, but are either
outside the scope of this thesis or overlapping with the publications above:
1. A. Blad, C. Svensson, H. Johansson, and S. Andersson, “An RF sampling
radio frontend based on sigma-delta conversion,” in Proc. Nordic Event ASIC
Design, Nov. 2006.
2. A. Blad, H. Johansson, and P. Löwenborg, “A general formulation of analog-to-digital converters using parallel sigma-delta modulators and modulation
sequences,” in Proc. Asia-Pacific Conf. Circuits Syst., Dec. 2006, pp. 438–
441.
3. A. Blad and O. Gustafsson, “Bit-level optimized high-speed architectures for
decimation filter applications,” in Proc. Int. Symp. Circuits, Syst., May 2008.
4. M. Zheng, Z. Fei, X. Chen, J. Kuang, and A. Blad, “Power efficient partial
repeated cooperation scheme with regular LDPC code,” in Proc. Vehicular
Tech. Conf., May 2010.
5. O. Gustafsson, K. Amiri, D. Andersson, A. Blad, C. Bonnet, J. R. Cavallaro,
J. Declerckz, A. Dejonghe, P. Eliardsson, M. Glasse, A. Hayar, L. Hollevoet,
C. Hunter, M. Joshi, F. Kaltenberger, R. Knopp, K. Le, Z. Miljanic, P. Murphy, F. Naessens, N. Nikaein, D. Nussbaum, R. Pacalet, P. Raghavan, A. Sabharwal, O. Sarode, P. Spasojevic, Y. Sun, H. M. Tullberg, T. Vander Aa,
L. Van der Perre, M. Wetterwald and M. Wu, “Architecture for cognitive radio testbeds and demonstrators - An overview,” in Proc. Int. Conf. Cognitive
Radio Oriented Wireless Networks Comm., Jun. 2010.
6. A. Blad, O. Gustafsson, M. Zheng, and Z. Fei, “Rate-compatible LDPC code
decoder using check-node merging,” in Proc. Asilomar Conf. Signals, Syst.,
Comp., Nov. 2010.
7. M. Abbas, O. Gustafsson, and A. Blad, “Low-complexity parallel evaluation
of powers exploiting bit-level redundancy,” in Proc. Asilomar Conf. Signals,
Syst., Comp., Nov. 2010.
Contents

I Decoding of low-density parity-check codes

1 Introduction
   1.1 Background
   1.2 Applications
   1.3 Scientific contributions

2 Error correction coding
   2.1 Digital communications
      2.1.1 Channel models
      2.1.2 Modulation methods
      2.1.3 Uncoded communication
   2.2 Coding theory
      2.2.1 Shannon bound
      2.2.2 Block codes
   2.3 LDPC codes
      2.3.1 Tanner graphs
      2.3.2 Quasi-cyclic LDPC codes
      2.3.3 Randomized quasi-cyclic codes
   2.4 LDPC decoding algorithms
      2.4.1 Sum-product algorithm
      2.4.2 Min-sum approximation
   2.5 Rate-compatible LDPC codes
      2.5.1 SR-nodes
      2.5.2 Decoding of rate-compatible codes
   2.6 LDPC decoder architectures
      2.6.1 Parallel architecture
      2.6.2 Serial architecture
      2.6.3 Partly parallel architecture
      2.6.4 Finite wordlength considerations
      2.6.5 Scaling of Φ(x)
   2.7 Sum-product reference decoder architecture
      2.7.1 Architecture overview
      2.7.2 Memory block
      2.7.3 Variable node processing unit
      2.7.4 Check node processing unit
      2.7.5 Interconnection networks
      2.7.6 Memory address generation
      2.7.7 Φ function
   2.8 Check-serial min-sum decoder architecture
      2.8.1 Decoder schedule
      2.8.2 Architecture overview
      2.8.3 Check node function unit

3 Early-decision decoding
   3.1 Early-decision algorithm
      3.1.1 Choice of threshold
      3.1.2 Handling of decided bits
      3.1.3 Bound on error correction capability
      3.1.4 Enforcing check constraints
      3.1.5 Enforcing check approximations
   3.2 Hybrid decoding
   3.3 Early-decision decoder architecture
      3.3.1 Memory block
      3.3.2 Node processing units
      3.3.3 Early decision logic
      3.3.4 Enforcing check constraints
   3.4 Hybrid decoder
   3.5 Simulation results
      3.5.1 Choice of threshold
      3.5.2 Enforcing check constraints
      3.5.3 Hybrid decoding
      3.5.4 Fixed-point simulations
   3.6 Synthesis results

4 Rate-compatible LDPC codes
   4.1 Design of puncturing patterns
      4.1.1 Preliminaries
      4.1.2 Optimization problem
      4.1.3 Puncturing pattern design
   4.2 Check-merging decoding algorithm
      4.2.1 Defining HP
      4.2.2 Algorithmic properties of decoding with HP
      4.2.3 Choosing the puncturing sequence p
   4.3 Rate-compatible QC-LDPC code decoder
      4.3.1 Decoder schedule
      4.3.2 Architecture overview
      4.3.3 Cyclic shifters
      4.3.4 Check function unit
      4.3.5 Bit-sum update unit
      4.3.6 Memories
   4.4 Simulation results
      4.4.1 Design of puncturing sequences
      4.4.2 Check-merging decoding algorithm
   4.5 Synthesis results of check-merging decoder
      4.5.1 Maximum check node degrees
      4.5.2 Decoding throughput
      4.5.3 FPGA synthesis

5 Data representations
   5.1 Fixed wordlength
   5.2 Data compression
   5.3 Results

6 Conclusions and future work
   6.1 Conclusions
   6.2 Future work

II High-speed analog-to-digital conversion

7 Introduction
   7.1 Background
   7.2 Applications
   7.3 Scientific contributions

8 FIR filters
   8.1 FIR filter basics
      8.1.1 FIR filter definition
      8.1.2 z-transform
      8.1.3 Linear phase filters
   8.2 FIR filter design
   8.3 Multirate signal processing
      8.3.1 Sampling rate conversion
      8.3.2 Polyphase decomposition
      8.3.3 Multirate sampling rate conversion
   8.4 FIR filter architectures
      8.4.1 Conventional FIR filter architectures
      8.4.2 High-speed FIR filter architecture

9 Sigma-delta data converters
   9.1 Sigma-delta data conversion
      9.1.1 Sigma-delta ADC overview
      9.1.2 Sigma-delta modulators
      9.1.3 Quantization noise power
      9.1.4 SNR estimation
   9.2 Modulator structures
   9.3 Modulated parallel sigma-delta ADCs
   9.4 Data rate decimation

10 Parallel sigma-delta ADCs
   10.1 Linear system model
      10.1.1 Signal transfer function
      10.1.2 Alias-free system
      10.1.3 L-decimated alias-free system
   10.2 Sensitivity to channel mismatches
      10.2.1 Modulator nonidealities
      10.2.2 Modulation sequence errors
      10.2.3 Modulation sequence offset errors
      10.2.4 Channel offset errors
   10.3 Simulation results
      10.3.1 Time-interleaved ADC
      10.3.2 Hadamard-modulated ADC
      10.3.3 Frequency-band decomposed ADC
      10.3.4 Generation of new scheme
   10.4 Noise model of system

11 Sigma-delta ADC decimation filters
   11.1 Design considerations
      11.1.1 FIR decimation filters
      11.1.2 Decimation filter specification
      11.1.3 Signal-to-noise-ratio
   11.2 Simulation results

12 High-speed digital filtering
   12.1 FIR filter realizations
      12.1.1 Architectures
      12.1.2 Partial product generation
   12.2 Implementation complexity
      12.2.1 Adder complexity
      12.2.2 Register complexity
   12.3 Partial product redundancy reduction
      12.3.1 Proposed algorithm
   12.4 ILP optimization
      12.4.1 ILP problem formulation
      12.4.2 DF1 architecture
      12.4.3 DF2 architecture
      12.4.4 DF3 architecture
      12.4.5 TF architecture
      12.4.6 Constant term placement
   12.5 Results
      12.5.1 Architecture comparison
      12.5.2 Coefficient representation
      12.5.3 Subexpression sharing

13 Conclusions and future work
   13.1 Conclusions
   13.2 Future work
Part I
Decoding of low-density parity-check codes

1 Introduction

1.1 Background
Digital communication is used ubiquitously for transferring data between electronic
equipment. Examples include cable and satellite TV, mobile phone voice and
data transmissions, wired and wireless LAN, GPS, computer peripheral connections
through USB and IEEE 1394, and many more. The basic principles of a digital
communications system are known, and one of the main advantages of digital
communications systems over analog is the ability to use error correction coding
(ECC) for the data transmission.
ECC is used in almost all digital communications systems to improve link performance and reduce transmitter power requirements [3]. By adding redundant data to the transmitted data stream, the system allows a limited number of transmission errors to be corrected, resulting in a reduction of the number of errors
in the transmitted information. However, for the digital data symbols that are
received correctly, the received information is identical to that which is sent. This
can be contrasted to analog communications systems, where transmission noise
will irrevocably degrade the signal quality, and the only way to ensure a predefined
signal quality at the receiver is to use enough transmitter power. Thus, the metrics
used to measure the transmission quality are intrinsically different for digital and
analog communications, with bit error rate (BER) or block error rate (BLER) for
digital systems, and signal-to-noise ratio (SNR) for analog systems. Whereas analog error correction is not principally impossible, analog communications systems are different enough on a system level to make practically feasible implementations hard to envisage.

[Figure 1.1: Simple communications system model: a data source feeds source coding, channel coding, and modulation; the receive side mirrors this with demodulation, channel decoding, and source decoding to the data sink.]
As the quality metrics of digital and analog communications systems are different, the performance of an analog and a digital system cannot easily be objectively
compared with each other. However, it is often the case that a digital system
with a quality subjectively comparable to that of an analog system requires significantly less power and/or bandwidth. One example is the switch from analog to
digital TV, where image coding and ECC allow four standard definition channels
of comparable quality in the same bandwidth as one channel in the analog TV.
A simple model of a digital communications system is shown in Fig. 1.1. The
modeled system encompasses wireless and wired communications, as well as data
storage, for example on optical disks and hard drives. However, the properties
of the blocks are dependent on data rate, acceptable error probability, channel
conditions, the nature of the data, and so on. In the communications system, data
is usually first source coded (or compressed) to reduce the amount of data that
needs to be transmitted, and then channel coded to add redundancy to protect
against transmission errors. The modulator then converts the digital data stream
into an analog waveform suitable for transmission. During transmission, the analog
waveform is affected by channel noise, and thus the received signal differs from the sent one. The result is that when the signal is demodulated, the digital data will contain
errors. It is the purpose of the channel decoder to correct the errors using the
redundancy introduced by the channel coder. Finally, the data stream is unpacked
by the source decoder, recreating data suitable to be used by the application.
The work in this part of the thesis considers the hardware implementation of
the channel decoder for low-density parity-check (LDPC) codes. The decoding of
LDPC codes is complex, and is often a major part of the baseband processing of
a receiver. For example, the flexible decoder in [79] supports the LDPC codes in
IEEE 802.11n and IEEE 802.16e, as well as the Turbo codes in 3GPP-LTE, but at
a maximum power dissipation of 675 mW. The need for low-power components is
obviously high in battery-driven applications like handhelds and mobile phones,
but becomes increasingly important also in stationary equipment like computers,
computer peripherals and TV receivers, due to the need of removing the waste
heat produced. Thus the focus of this work is on reducing the power dissipation of
LDPC decoders, without sacrificing the error-correction performance.
LDPC codes were discovered originally in 1962 by Robert Gallager [39]. He
showed that the class of codes has excellent theoretical properties and he also
provided a decoding algorithm. However, as the hardware of the time was not
powerful enough to run the decoding algorithm efficiently, LDPC codes were not
practically usable and were forgotten. They were rediscovered in 1995 [74,101], and
have been shown to have a very good performance close to the theoretical Shannon
limit [73, 75]. Since the rediscovery, LDPC codes have been successfully used in a
number of applications, and are suggested for use in a number of important future
communications standards.
1.2 Applications
Today, LDPC codes are used or are proposed to be used in a number of applications
with widely different characteristics and requirements. In 2003, a type of LDPC
codes was accepted to be used for the DVB-S2 standard for satellite TV [113]. The
same type of code was then adopted for both the DVB-T2 [114] and DVB-C2 [115]
standards for terrestrial and cable-based TV, respectively. A similar type has also been accepted for the DTTB standard for digital TV in China [122]. The system-level requirements of these systems are relatively low, with relaxed latency requirements as the communication is unidirectional, and relatively small constraints on
power dissipation, as the user equipment is typically not battery-driven. Thus, the
adopted code is complex with a resulting complex decoder implementation.
Opposite requirements apply for the WLAN IEEE 802.11n [118] and WiMAX
IEEE 802.16e [120] standards, for which LDPC codes have been chosen as optional
ECC schemes. In these applications, communication is typically bi-directional,
necessitating low latency. Also, the user equipment is typically battery-driven,
making low power dissipation critical. For these applications, the code length is
restricted directly by the latency requirements. However, it is preferable to reduce
the decoder complexity as much as possible to save power dissipation.
Whereas these types of applications are seen as the primary motivation for
the work in this part of the thesis, LDPC codes are also used or suggested in
several other standards and applications. Among them are the IEEE 802.3an
[121] standard for 10 Gbit/s Ethernet, the IEEE 802.15.3c [119] mm-wave WPAN standard, and the GSFC-STD-9100 [116] standard for deep-space communications.
1.3 Scientific contributions
There are two main scientific contributions in the first part of the thesis. The
first is a modification to the sum-product decoding algorithm for LDPC codes,
called the early-decision algorithm, and is described in Chapter 3. The aim of
the early-decision modification is to dynamically reduce the number of possible
states of the decoder during decoding, and thereby reduce the amount of internal
communication of the hardware. However, this algorithm modification impacts
the error correction performance of the code, and it is therefore also investigated
how the modified decoding algorithm can be efficiently combined with the original
algorithm to yield a resulting hybrid decoder which retains the performance of the
original algorithm while still offering a reduction of internal communication.
The second main contribution is an improved algorithm for decoding of rate-compatible LDPC codes, and is described in Chapter 4. Using rate-compatible
LDPC codes obtained through puncturing, the higher-rate codes can trivially be
decoded by the low-rate mother code. However, by defining a specific code by
merging relevant check nodes for each of the punctured rates, the code complexity
can be reduced at the same time as the propagation speed of the extrinsic information is increased. The result is a significant reduction in the convergence time
of the decoding algorithm for the higher-rate codes.
A minor contribution is the observation of redundancy in the internal data
format in a fixed-width implementation of the decoding algorithm. It is shown that
a simple data encoding can further reduce the amount of internal communication.
The performance of the proposed algorithms has been evaluated in software.
For the early-decision algorithm, it is verified that the modifications have an insignificant impact on the error correction performance, and the change in the internal communication is estimated. For the check-merging decoding algorithm,
the modifications can be shown to even improve the error correction performance.
However, these improvements are mostly due to the reduced convergence time, allowing the algorithm to converge for codewords which the original algorithm does
not have sufficient time for.
The early-decision and check-merging algorithms have been implemented in a
Xilinx Virtex 5 FPGA and an Altera Cyclone II FPGA, respectively. As similar
implementations have not been published before, they have mainly been compared
with implementations of the original reference decoders. For the early-decision decoder, the required overhead has been determined, and the power dissipation of
both the original and proposed architecture have been simulated and compared
using regular quasi-cyclic LDPC codes with an additional scrambling layer. For
the check-merging decoder, the required overhead has been determined, and the
increased throughput obtainable with the modification has been quantified for two
different implementations geared for the IEEE 802.16e and IEEE 802.11n standards, respectively.
2 Error correction coding
In this chapter, the basics of digital communications systems and error correction
coding are explained. In Sec. 2.1, a model of a system using digital communications
is shown, and the channel model and different modulation methods are explained.
In Sec. 2.2, coding theory is introduced as a way of reducing the required transmission power with a retained bit error probability, and block codes are defined.
In Sec. 2.3, LDPC codes are defined as a special case of general block codes, and
Tanner graphs are introduced as a way of visualizing the structure of an LDPC
code. The sum-product decoding algorithm and the min-sum approximation are
discussed in Sec. 2.4, and in Sec. 2.5, rate-compatible LDPC codes are introduced
as a way of obtaining practically usable codes with a wide range of rates. Also,
the implications of rate-compatibility on the decoding algorithm are discussed. In
Sec. 2.6, several general decoder architectures with different parallelism degrees
are discussed, including a serial architecture, a parallel architecture, and a partly
parallel architecture. In Sec. 2.7, a partly parallel architecture for a specific class
of regular LDPC codes is described, and is also used as a reference for the early
decision algorithm proposed in Chapter 3. In Sec. 2.8, a partly parallel architecture using the min-sum decoding algorithm for general quasi-cyclic LDPC codes
is described, and is used as a reference for the check merging decoding algorithm
proposed in Chapter 4.
2.1 Digital communications

[Figure 2.1: Digital communications system model: endpoint A modulates symbols x_n ∈ A onto the signal s(t); noise n(t) is added during transmission; endpoint B demodulates the received signal r(t) to symbols x̃_n ∈ B.]
Consider a two-user digital communications system, such as the one shown in
Fig. 2.1, where an endpoint A transmits information to an endpoint B. Whereas
multi-user communications systems with multiple transmitting and receiving endpoints can be defined, only systems with one transmitter and receiver will be considered in this thesis. The system is digital, meaning that the information is represented by a sequence of symbols xn from a finite discrete alphabet A. The sequence
is mapped onto an analog signal s(t) which is transmitted to the receiver through
the air, through a cable, or using any other medium. During transmission, the
signal is distorted by noise n(t), and thus the received signal r(t) is not equal to
the transmitted signal. By the demodulator, the received signal r(t) is mapped to
symbols x̃n from an alphabet B, which may or may not be the same as alphabet A,
and may be either discrete or continuous. Typically, if the output data stream is
used directly by the receiving application, B = A. However, commonly some form
of error coding is employed, which can benefit from including symbol reliability
information in the reception alphabet B.
2.1.1 Channel models
In analyzing the performance of a digital communications system, the chain in Fig. 2.1 is modeled as a probabilistic mapping P(X̃ = b | X = a), ∀a ∈ A, b ∈ B, from the transmission alphabet A to the reception alphabet B. The system modeled by the probabilistic mapping is formally called a channel, and X and X̃ are stochastic variables denoting the input and output of the channel, respectively. For the channel, the following requirement must be satisfied for discrete reception alphabets
$$\sum_{b \in B} P(\tilde{X} = b \mid X = a) = 1, \quad \forall a \in A, \qquad (2.1)$$
or analogously for continuous reception alphabets
$$\int_{b \in B} P(\tilde{X} = b \mid X = a)\,\mathrm{d}b = 1, \quad \forall a \in A. \qquad (2.2)$$
Depending on the characteristics of the modulator, demodulator, transmission
medium, and the accuracy requirement of the model, different channel models are
suitable. Some common channel models include
• the binary symmetric channel (BSC), a discrete channel defined by the alphabets A = B = {0, 1}, and the mapping
P (X̃ = 0 | X = 0) = P (X̃ = 1 | X = 1) = 1 − p
P (X̃ = 1 | X = 0) = P (X̃ = 0 | X = 1) = p,
where p is the cross-over probability that the sent binary symbol will be
received in error. The BSC is an adequate channel model in many cases when
a hard-decision demodulator is used, as well as in early stages of a system
design to compute the approximate performance of a digital communications
system.
• the binary erasure channel (BEC), a discrete channel defined by the alphabets
A = {0, 1}, B = {0, 1, e}, and the mapping
P (X̃ = 0 | X = 0) = P (X̃ = 1 | X = 1) = 1 − p
P (X̃ = e | X = 0) = P (X̃ = e | X = 1) = p
P (X̃ = 1 | X = 0) = P (X̃ = 0 | X = 1) = 0,
where p is the erasure probability, i.e., the received symbols are either known
by the receiver, or known that they are unknown. The binary erasure channel
is commonly used in theoretical estimations of the performance of a digital
communications system due to its simplicity, but can also be adequately used
in low-noise system modeling.
• the additive white Gaussian noise (AWGN) channel with noise spectral density N0 , a continuous channel defined by a discrete alphabet A and a continuous alphabet B, and the mapping
$$P(\tilde{X} = b \mid X = a) = f_{(a,\sigma)}(b), \qquad (2.3)$$
where $f_{(a,\sigma)}(b)$ is the probability density function for a normally distributed stochastic variable with mean $a$ and standard deviation $\sigma = \sqrt{N_0/2}$. The size
of the input alphabet is usually determined by the modulation method used,
and is further explained in Sec. 2.1.2. The AWGN channel models real-world
noise sources well, especially for cable-based communications systems.
• the Rayleigh and Rician fading channels. The Rayleigh channel is appropriate for modeling a wireless communications system when no line-of-sight is
present between the transmitter and receiver, such as cellular phone networks
and metropolitan area networks. The Rician channel is more appropriate
when a dominating line-of-sight communications path is available, such as
for wireless LANs and personal area networks.
The work in this thesis considers the AWGN channel with a binary input alphabet only.
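As a concrete illustration of these probabilistic mappings, the following is a minimal Python sketch (not part of the thesis; the function names and NumPy usage are illustrative assumptions) of the BSC and the binary-input AWGN channel:

```python
import numpy as np

rng = np.random.default_rng(1)

def bsc(x, p):
    """Binary symmetric channel: each bit is flipped with cross-over probability p."""
    flips = rng.random(x.shape) < p
    return np.where(flips, 1 - x, x)

def bi_awgn(x, E, N0):
    """Binary-input AWGN channel: bit 0 -> +sqrt(E), bit 1 -> -sqrt(E),
    plus Gaussian noise with standard deviation sigma = sqrt(N0/2)."""
    s = np.sqrt(E) * (1 - 2 * x)
    return s + np.sqrt(N0 / 2) * rng.normal(size=x.shape)

bits = rng.integers(0, 2, size=8)
print(bsc(bits, p=0.1))              # output symbols from B = {0, 1}
print(bi_awgn(bits, E=1.0, N0=0.5))  # output symbols from B = R
```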
[Figure 2.2: Model of an uncoded digital communications system: source bits m_k ∈ I are mapped to symbols x_n ∈ A and modulated to s(t); after the noisy channel, r(t) is demodulated to x̃_n ∈ B and demapped to bits m̂_k ∈ I.]

2.1.2 Modulation methods
The size of the transmission alphabet A for the AWGN channel is commonly determined by the modulation method used. Common modulation methods include
• the binary phase-shift keying (BPSK) modulation, using the transmission alphabet $A = \{-\sqrt{E}, +\sqrt{E}\}$ and reception alphabet $B = \mathbb{R}$. E denotes the symbol energy.
• the quadrature phase-shift keying (QPSK) modulation, using the transmission alphabet $A = \sqrt{E/2}\,\{(-1-i), (-1+i), (+1-i), (+1+i)\}$ with complex symbols, and reception alphabet $B = \mathbb{C}$. The binary source information is mapped in blocks of two bits onto the symbols of the transmission alphabet. As the alphabets are complex, the probability density function in (2.3) is the probability density function for the two-dimensional Gaussian distribution.
• the quadrature amplitude (QAM) modulation, which is a generalization of
the QPSK modulation to higher orders, using equi-spaced symbols from the
complex plane.
In this thesis, BPSK modulation has been assumed exclusively. However, the
methods are not limited to BPSK modulation, but may straightforwardly be applied to systems using other modulation methods as well.
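To illustrate the QPSK alphabet above, the sketch below (illustrative, not from the thesis) maps pairs of source bits onto QPSK symbols using one possible Gray mapping, chosen to be consistent with the demapping rule (2.6) given later in Sec. 2.1.3:

```python
import numpy as np

def qpsk_map(bits, E=1.0):
    """Gray-mapped QPSK: pairs of bits onto sqrt(E/2) * {(+-1) + (+-1)j}.
    The mapping is 00 -> (+,+), 01 -> (-,+), 11 -> (-,-), 10 -> (+,-),
    so symbols adjacent in the complex plane differ in a single bit."""
    b = np.asarray(bits).reshape(-1, 2)
    re = 1 - 2 * b[:, 1]   # second bit of each pair sets the sign of the real part
    im = 1 - 2 * b[:, 0]   # first bit sets the sign of the imaginary part
    return np.sqrt(E / 2) * (re + 1j * im)

print(qpsk_map([0, 0, 0, 1, 1, 1, 1, 0]))   # -> (+,+), (-,+), (-,-), (+,-)
```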
2.1.3 Uncoded communication
In order to use the channel for communication of data, some way of mapping the
binary source information to the transmitted symbols is needed. In the system
using uncoded communications depicted in Fig. 2.2, this is done by the symbol
mapper, which maps the source bits mk to the transmitted symbols xn . The
transmitted symbols may be produced at a different rate than the source bits are
consumed.
On the receiver side, the end application is interested in the most likely symbols that were sent, and not the received symbols. However, the transmitted and
received data are symbols from different alphabets, and thus a symbol demapper
is used to infer the most likely transmitted symbols from the received ones, before
mapping them back to the binary information stream m̂k . In the uncoded case,
this is done on a per-symbol basis.
For the BSC, the source bits are mapped directly to the transmitted symbols
such that xn = mk , where n = k, whereas the BEC is not used with uncoded
communications and is thus not discussed. For the AWGN channel with BPSK modulation, the source bits are conventionally mapped so that the bit 0 is mapped to the symbol $+\sqrt{E}$, whereas the bit 1 is mapped to the symbol $-\sqrt{E}$. For higher-order modulation, several source bits are mapped to each symbol, and the source bits are typically mapped using Gray mapping so that symbols that are close in the complex
plane differ by one bit. The optimal decision rules for the symbol demapper can
be formulated as follows for different channels.
For the BSC,
$$\hat{m}_k = \begin{cases} \tilde{x}_n & \text{if } p < 0.5 \\ 1 - \tilde{x}_n & \text{if } p > 0.5, \end{cases} \qquad (2.4)$$
where the case p > 0.5 is rather unlikely. For the AWGN channel using BPSK modulation,
$$\hat{m}_k = \begin{cases} 0 & \text{if } \tilde{x}_n > 0 \\ 1 & \text{if } \tilde{x}_n < 0. \end{cases} \qquad (2.5)$$
Finally, if QPSK modulation with Gray mapping of source bits to transmitted symbols is used,
$$\{\hat{m}_k, \hat{m}_{k+1}\} = \begin{cases} 00 & \text{if } \operatorname{Re}\tilde{x}_n > 0,\ \operatorname{Im}\tilde{x}_n > 0 \\ 01 & \text{if } \operatorname{Re}\tilde{x}_n < 0,\ \operatorname{Im}\tilde{x}_n > 0 \\ 11 & \text{if } \operatorname{Re}\tilde{x}_n < 0,\ \operatorname{Im}\tilde{x}_n < 0 \\ 10 & \text{if } \operatorname{Re}\tilde{x}_n > 0,\ \operatorname{Im}\tilde{x}_n < 0. \end{cases} \qquad (2.6)$$
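A minimal sketch of these hard-decision rules (continuing the illustrative Python examples above; the function names are assumptions) is:

```python
import numpy as np

def demap_bpsk(y):
    """Hard decision for BPSK over AWGN, following (2.5)."""
    return (np.asarray(y) < 0).astype(int)

def demap_qpsk(y):
    """Hard decision for Gray-mapped QPSK, following (2.6)."""
    y = np.asarray(y)
    first = (y.imag < 0).astype(int)    # Im > 0 -> first bit 0, Im < 0 -> 1
    second = (y.real < 0).astype(int)   # Re > 0 -> second bit 0, Re < 0 -> 1
    return np.column_stack((first, second)).reshape(-1)

print(demap_bpsk([0.7, -1.2, 0.1]))                    # -> 0 1 0
print(demap_qpsk([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]))  # -> 00 01 11 10
```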
In analyzing the performance of a communications system, the probability of erroneous transmissions is interesting. For BPSK communications with equal symbol probabilities, the bit error probability can be defined as
$$P_{B,\mathrm{BPSK}} = P(\hat{m}_k \neq m_k) = P(\tilde{x}_n > 0 \mid x_n = 1)\,P(x_n = 1) + P(\tilde{x}_n < 0 \mid x_n = 0)\,P(x_n = 0) = Q\!\left(\frac{\sqrt{E}}{\sigma}\right) = Q\!\left(\sqrt{\frac{2E}{N_0}}\right), \qquad (2.7)$$
where Q(x) is the tail probability (complementary cumulative distribution function) of the standard normal distribution. However, it turns out that significantly lower error probabilities can be achieved by adding redundancy to the transmitted information, while keeping
the total transmitter power unchanged. Thus, the individual symbol energies are reduced, and the saved energy is used to transmit redundant symbols computed from the information symbols according to some well-defined code.

[Figure 2.3: Error correction system overview: the message (m_0, ..., m_{K-1}) is channel coded to the codeword (x_0, ..., x_{N-1}) and modulated; after the noisy channel, demodulation yields (x̃_0, ..., x̃_{N-1}), from which channel decoding recovers (m̂_0, ..., m̂_{K-1}).]
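As a sanity check of (2.7), the following illustrative sketch evaluates Q(√(2E/N0)) and compares it against a Monte Carlo simulation of uncoded BPSK over the AWGN channel (the parameter values are assumptions chosen for the example):

```python
import numpy as np
from math import erfc, sqrt

def Q(x):
    """Tail probability of the standard normal distribution."""
    return 0.5 * erfc(x / sqrt(2))

E, N0 = 1.0, 1.0
print("theory   :", Q(sqrt(2 * E / N0)))   # bit error probability from (2.7)

# Monte Carlo check: uncoded BPSK over the AWGN channel.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=200_000)
y = sqrt(E) * (1 - 2 * bits) + sqrt(N0 / 2) * rng.normal(size=bits.size)
print("simulated:", np.mean((y < 0).astype(int) != bits))
```

Both values come out near 0.079 for these parameters, confirming the closed-form expression.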
2.2 Coding theory
Consider the error correction system in Fig. 2.3. As the codes in this thesis are
block codes, the properties of the system are formulated assuming that a block
code is used. Also, it is assumed that the symbols used for the messages are binary
symbols. A message m with K bits is to be communicated over a noisy channel.
The message is encoded to the codeword x with N bits, where N > K. The
codeword is then modulated to the analog signal s(t) using BPSK modulation with
an energy of E per bit. During transmission over the AWGN channel, the noise
signal n(t) with a one-sided spectral density of N0 is added to the signal to produce
the received signal r(t). The received signal is demodulated to produce the received
vector x̃, which may contain either bits or scalars. The channel decoder is then
used to find the most likely sent codeword x̂, given the received vector x̃. From x̂,
the message bits m̂ are then extracted.
For the system, a number of properties can be defined:
• The information transmitted is K bits.
• The block size of the code is N bits. Generally, in order to achieve better
error correction performance, N must be increased. However, a larger block
size requires a more complex encoder/decoder and increases the latency of
the system, and there is therefore a trade-off between these factors in the
design of the coding system.
• The code rate is R = K/N . Obviously, increasing the code rate increases
the amount of information transmitted for a fixed block size N . However,
it is also the case that a reduced code rate allows more information to be
transmitted for a constant transmitter power level (see Sec. 2.2.1), and the code rate is therefore also a trade-off between error correction performance and encoder/decoder complexity.
• The normalized SNR at the receiver is Eb/N0 = ER/N0 and is used instead of the actual SNR E/N0 in order to allow a fair comparison between codes of different rates. The normalized SNR is denoted SNR in the rest of this thesis.
• The bit error rate (BER) is the fraction of differing bits in m and m̂, averaged over several blocks.
• The block error rate (BLER) is the fraction of blocks in which m and m̂ differ.

[Figure 2.4: Capacity of the binary-input AWGN channel as a function of the normalized SNR Eb/N0.]

Coding systems are analyzed in depth in any introductory book on coding theory, e.g. [3, 102].
2.2.1 Shannon bound
In 1948, Claude E. Shannon proved the noisy channel coding theorem [89], that
can be phrased in the following way.
For each channel, as defined in Sec. 2.1.1, there is associated a quantity called
the channel capacity. The channel capacity is the maximum amount of information,
as measured by the shannon unit, that can be transferred per channel use, guaranteeing error-free transmission. Moreover, error-free transmission at information
rates above the channel capacity is not possible.
Thus, transmitting information at a rate below the channel capacity allows an
arbitrarily low error rate, i.e., there are arbitrarily good error-correcting codes.
Additionally, the noisy channel coding theorem states that above the channel capacity, data transmission cannot be done without errors, regardless of the code
used.
The capacity for the AWGN channel using BPSK modulation and assuming
equi-probable inputs is given here without derivation, but calculations are found
e.g. in [3]. It is
$$C_{\mathrm{BIAWGN}} = \int_{-\infty}^{\infty} f_{\sqrt{2E/N_0}}(y)\, \log_2\!\left( \frac{2 f_{\sqrt{2E/N_0}}(y)}{f_{\sqrt{2E/N_0}}(y) + f_{-\sqrt{2E/N_0}}(y)} \right) \mathrm{d}y, \qquad (2.8)$$
where $f_{\pm\sqrt{2E/N_0}}(y)$ are the probability density functions for Gaussian stochastic variables with means $\pm\sqrt{E}$ and standard deviation $\sqrt{N_0/2}$. In Fig. 2.4 the capacity
of a binary-input AWGN channel is plotted as a function of the normalized SNR
Eb /N0 = ER/N0 , and it can be seen that reducing the code rate allows error-free
communications using less energy even if more bits are sent for each information
bit.
Shannon’s theorem can be rephrased in the following way: for each information
rate (or code rate) there is a limit on the channel conditions, above which communication can achieve an arbitrarily low error rate, and below which communication
must introduce errors. This limit is commonly referred to as the Shannon limit,
and is commonly plotted in code performance plots to show how far the code is
from the theoretical limit. The Shannon limit can be found numerically for the binary-input AWGN channel by iteratively solving (2.8) for the argument $\sqrt{E/N_0}$ that yields the desired information rate.
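The numerical procedure just described can be sketched as follows (illustrative Python, not from the thesis; the integration grid, tail clipping, and bisection bounds are assumptions): the capacity integral (2.8) is evaluated by numerical integration, and the Shannon limit for a given rate R is found by bisection on E/N0.

```python
import numpy as np

def bi_awgn_capacity(E_N0):
    """Evaluate the capacity integral (2.8) numerically, taking E = 1 so the
    BPSK means are +/-1 and sigma = sqrt(N0/2) = sqrt(1 / (2 * E/N0))."""
    sigma = np.sqrt(1.0 / (2.0 * E_N0))
    y = np.linspace(-1 - 10 * sigma, 1 + 10 * sigma, 20001)
    norm = 1.0 / (sigma * np.sqrt(2 * np.pi))
    fp = np.maximum(norm * np.exp(-(y - 1) ** 2 / (2 * sigma ** 2)), 1e-300)
    fm = np.maximum(norm * np.exp(-(y + 1) ** 2 / (2 * sigma ** 2)), 1e-300)
    integrand = fp * np.log2(2 * fp / (fp + fm))
    return np.sum(integrand) * (y[1] - y[0])   # Riemann sum on a uniform grid

def shannon_limit_db(R, lo=0.01, hi=10.0):
    """Bisect for the E/N0 at which the capacity equals R, and return the
    corresponding normalized SNR Eb/N0 = (E/N0)/R in dB."""
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if bi_awgn_capacity(mid) < R:
            lo = mid
        else:
            hi = mid
    return 10 * np.log10(0.5 * (lo + hi) / R)

print(shannon_limit_db(0.5))   # roughly 0.19 dB for a rate-1/2 code
```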
2.2.2 Block codes
There are two standard ways of defining block codes: through a generator matrix G or through a parity-check matrix H. For a message length of K bits and a block length of N bits, G has dimensions K × N, and H has dimensions M × N, where M = N − K. Denoting the set of codewords by C, C can be defined in the following two ways:
$$C = \left\{ x = mG \mid m \in \{0,1\}^K \right\} \qquad (2.9)$$
$$C = \left\{ x \in \{0,1\}^N \mid Hx^T = 0 \right\} \qquad (2.10)$$
The most important property of a code regarding performance is the minimum
Hamming distance d, which is the minimum number of bits that two codewords
may differ in. Moreover, as the set of codewords C is linear, it is also the weight
of the lowest-weight codeword which is not the all-zero codeword. The minimum
distance is important because all transmission errors with a weight strictly less
than d/2 can be corrected. However, for practical codes d is often not known
exactly, as it is often difficult to calculate theoretically, and exhaustive searches
are not realistic with block sizes of thousands of bits. Also, depending on the type
of decoder used, the actual error-correcting ability may be both above and below
d/2. Thus the performance of modern codes is usually determined experimentally
by simulations over a noisy channel and by measuring the actual bit- or block-error
rate at the output of the decoder.
A simple example of a block code is the (N, K, d) = (7, 4, 3) Hamming code defined by the parity-check matrix

$$H = \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}. \qquad (2.11)$$
The code has a block length of N = 7 bits and a message length of K = 4 bits. Thus the code rate is R = K/N = 4/7. It can easily be shown that the minimum-weight codeword has a weight of d = 3, which is therefore the minimum distance of the code. The error correcting performance of this code over the AWGN channel is shown in Fig. 2.5. As can be seen, the code performance is just somewhat better than uncoded transmission. There exists a Hamming code with parameters (N, K, d) = (2^m − 1, 2^m − m − 1, 3) for every integer m ≥ 2, and their parity-check matrices are constructed by concatenating every nonzero m-bit vector as columns. The advantage of these codes is that decoding is very simple, and they are used e.g. in memory chips.
To decode a received block using Hamming coding, consider for example the (7, 4, 3) Hamming code and a received vector x̃. The syndrome of the received vector is then Hx̃^T, which is a three-bit vector. If the syndrome is zero, the received vector is a valid codeword, and decoding is finished. If the syndrome is non-zero, the received vector can become a codeword by flipping the bit corresponding to the column in H that matches the syndrome. It should thus be noted that the columns of H contain every non-zero three-bit vector, and thus every received vector x̃ will be at a distance of at most one from a valid codeword. Thus decoding consists of changing at most one bit, determined by the syndrome if it is non-zero.
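A small Python sketch of this syndrome decoding procedure for the (7, 4, 3) Hamming code of (2.11) might look as follows; the received vector and the error position are example values:

```python
import numpy as np

# Parity-check matrix of the (7, 4, 3) Hamming code from (2.11).
H = np.array([[1, 1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 1, 1, 0],
              [1, 0, 1, 0, 1, 0, 1]])

def decode(r):
    """Syndrome decoding: flip the bit whose column of H equals the syndrome."""
    s = H @ r % 2
    if s.any():                                  # non-zero syndrome
        for n in range(H.shape[1]):
            if np.array_equal(H[:, n], s):       # column matching the syndrome
                r = r.copy()
                r[n] ^= 1                        # flip the erroneous bit
                break
    return r

codeword = np.zeros(7, dtype=int)
received = codeword.copy(); received[4] ^= 1     # single-bit error in position 4
assert np.array_equal(decode(received), codeword)
```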
To increase the error correcting performance, the code needs to be able to correct more than single-bit errors, and then the above decoding technique does not work. While the method could be generalized to determine the bits to flip by finding the minimum set of columns whose sum is the syndrome, this is usually not efficient. Thus the syndrome is usually computed only to determine if a given vector is a codeword or not.

Figure 2.5 Error correcting performance of short codes (bit error rate versus Eb/N0 for uncoded transmission, Hamming(7,4,3), Hamming(255,247,3), Reed-Solomon(31,25,7), LDPC(144,72,10) and LDPC(288,144,14)). The Hamming and Reed-Solomon curves are estimations for hard-decision decoding, whereas the LDPC curves are obtained using simulations with soft-decision decoding.
The performance of other short codes is also shown in Fig. 2.5. The Hamming and Reed-Solomon curves are estimations for hard-decision decoding obtained using the MATLAB function bercoding. The LDPC codes are randomly constructed (3, 6)-regular codes (as defined in Sec. 2.3). Ensembles of 100 codes were generated, and their minimum distances computed using integer linear programming optimization. Among the codes with the largest minimum distances, the codes with the best performance under the sum-product algorithm were selected.
The performance of some long codes is shown in Fig. 2.6. The performance of the N = 10^7 LDPC code is from [26], whereas the performance of the N = 10^6 codes is from [86]. It is seen that at a block length of 10^6 bits, the LDPC code performs better than the Turbo code. The N = 10^7 code is a highly optimized irregular LDPC code with variable node degrees up to 200, and performs within 0.04 dB of the Shannon limit at a bit error rate of 10^-6. At shorter block lengths of 1000–10000 bits, the performance of Turbo codes and LDPC codes is generally comparable. The (9216, 3, 6) code is a randomly constructed regular code, also used in the simulations in Sec. 3.5.

Figure 2.6 Error correcting performance of long codes (bit error rate versus Eb/N0, showing the Shannon limit, LDPC codes with N = 10^7 and N = 10^6, a Turbo code with N = 10^6, and the LDPC(9216,3,6) code).
For block codes, there are three general ways in which a decoding attempt may
terminate:
• Decoder successful: The decoder has found a valid codeword, and the
corresponding message m̂ equals m.
• Decoder error: The decoder has found a valid codeword, and the corresponding message m̂ differs from m.
• Decoder failure: The decoder was unable to find a valid codeword using
the resources specified.
For both the error and the failure result, the decoder has been unable to find the
correct sent message m. However, the key difference is that decoder failures are
detectable, whereas decoder errors are not. Thus, if, for example, several decoder
algorithms are available, the decoding could be retried with another algorithm
when a decoder failure occurs.
Figure 2.7 Example of Tanner graph for the (7, 4, 3) Hamming code, with variable nodes v0–v6 and check nodes c0–c2.
        v0  v1  v2  v3  v4  v5  v6
  c0     1   1   1   1   0   0   0
  c1     1   1   0   0   1   1   0
  c2     1   0   1   0   1   0   1

Figure 2.8 Parity-check matrix H for (7, 4, 3) Hamming code, with check nodes as rows and variable nodes as columns.
2.3 LDPC codes
A low-density parity-check (LDPC) code is a code defined by a parity-check matrix with low density, i.e., the parity-check matrix H has a low number of 1s. It has been shown [39] that there exist classes of such codes that asymptotically reach the Shannon bound with a density tending to zero as the block length tends to infinity. Moreover, the theorem also states that such codes are generated with a probability approaching one if the parity-check matrix H is simply constructed randomly. However, the design of practical decoders is greatly simplified if some structure can be imposed upon the parity-check matrix. This often seems to negatively impact the error-correcting performance of the codes, leading to a trade-off between the performance of the code and the complexity of the encoder and decoder.
2.3.1 Tanner graphs
LDPC codes are commonly visualized using Tanner graphs [92]. Moreover, the iterative decoding algorithms are defined directly on the graph (see Sec. 2.4.1). The Tanner graph consists of nodes representing the columns and rows of the parity-check matrix, with an edge between two nodes if the element in the intersection of the corresponding row and column in the parity-check matrix is 1. Nodes corresponding to columns are called variable nodes, and nodes corresponding to rows are called check nodes. As there are no intersections between columns and between rows, the resulting graph is bipartite with all the edges between variable nodes and check nodes. An example of a Tanner graph is shown in Fig. 2.7, and its corresponding parity-check matrix is shown in Fig. 2.8. Comparing to (2.11), it is seen that the matrix is that of the (7, 4, 3) Hamming code.
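A sketch of how the Tanner graph neighbor lists can be derived from a parity-check matrix (here the Hamming matrix of Fig. 2.8) is given below; the function name tanner_graph is illustrative only:

```python
import numpy as np

def tanner_graph(H):
    """Neighbor lists N(m) for check nodes and M(n) for variable nodes."""
    M, N = H.shape
    check_nbrs = [list(np.flatnonzero(H[m, :])) for m in range(M)]   # N(m)
    var_nbrs   = [list(np.flatnonzero(H[:, n])) for n in range(N)]   # M(n)
    return check_nbrs, var_nbrs

H = np.array([[1, 1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 1, 1, 0],
              [1, 0, 1, 0, 1, 0, 1]])
check_nbrs, var_nbrs = tanner_graph(H)
print(check_nbrs[0])   # variable nodes joined to c0: [0, 1, 2, 3]
print(var_nbrs[0])     # check nodes joined to v0: [0, 1, 2]
```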
Having defined the Tanner graph, there are some properties which are interesting for the decoding algorithms for LDPC codes.
• A check node regular code is a code for which all check nodes have the
same degree.
• A variable node regular code is a code for which all variable nodes have
the same degree.
• A (j, k)-regular code is a code which is variable node regular with variable
node degree j and check node regular with check node degree k.
• The girth of a code is the length of the shortest cycle in its Tanner graph.
• The diameter of a code is the largest distance between two nodes in its
Tanner graph.
Using a regular code can simplify the decoder architecture. However, it has also been conjectured [39] that regular codes cannot be capacity-approaching under message-passing decoding. The conjecture will be proved if it can be shown that cycles in the code cannot enhance the performance of the decoder on average. Furthermore, it has also been shown [25, 27, 72, 87] that codes need to have a wide range of node degree distributions in order to be capacity-approaching. Therefore, assuming that the conjecture is true, there is a trade-off between code performance and decoder complexity regarding the regularity of the code.
The sum-product decoding algorithm for LDPC codes computes exact marginal
bit probabilities when the code’s Tanner graph is free of cycles [65]. However, it
can also be shown that the graph must contain cycles for the code to have more
than minimal error correcting performance [36]. Specifically, it is shown that for
a cycle-free code C with parameters (N, K, d) and rate R = K/N , the following
conditions apply. If R ≥ 0.5, then d ≤ 2, and if R < 0.5, then C is obtained
from a code with R ≥ 0.5 and d ≤ 2 by repetition of certain symbols. Thus, as
cycles are needed for the code to have good theoretical properties, but also inhibit
the performance of the practical decoder, the concept of girth is important. Using
a code with large girth and small diameter is generally expected to improve the
performance, and codes are therefore usually designed so that the girth is at least
six.
2.3.2 Quasi-cyclic LDPC codes
One common way of imposing structure on an LDPC code is to construct the
parity-check matrix from equally sized sub-matrices which are either all zeros or
cyclically shifted identity matrices. These types of LDPC codes are denoted quasi-cyclic (QC-LDPC) codes. Typically, QC-LDPC codes are defined from a base matrix Hb of size Mb × Nb with integer elements:

$$H_b = \begin{bmatrix} H_b(0,0) & H_b(0,1) & \cdots & H_b(0,N_b-1) \\ H_b(1,0) & H_b(1,1) & \cdots & H_b(1,N_b-1) \\ \vdots & \vdots & \ddots & \vdots \\ H_b(M_b-1,0) & H_b(M_b-1,1) & \cdots & H_b(M_b-1,N_b-1) \end{bmatrix}. \qquad (2.12)$$

Figure 2.9 Parity-check matrix structure of randomized quasi-cyclic codes through joint code and decoder architecture design (sub-matrices I_{x,y}, P_{x,y} and R_{x,y}; see Sec. 2.3.3).
For an expansion factor of z, a parity-check matrix H of size Mb z × Nb z is constructed from Hb by replacing each element with a square sub-matrix of size z × z.
The sub-matrix is the all zero matrix if Hb (m, n) = −1, otherwise it is an identity
matrix circularly right-shifted by φ(Hb (m, n), z). φ(k, z) is commonly a scaling
function, modulo function, or the identity function.
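A minimal sketch of this expansion, with a hypothetical base matrix and the modulo function as an example choice of φ(k, z), could look as follows:

```python
import numpy as np

def expand(Hb, z, phi=lambda k, z: k % z):
    """Expand a QC-LDPC base matrix Hb by factor z.

    -1 entries become z x z all-zero blocks; any other entry k becomes an
    identity matrix circularly right-shifted by phi(k, z). The shift map phi
    defaults to a modulo function, one of the common choices in the text.
    """
    Mb, Nb = Hb.shape
    H = np.zeros((Mb * z, Nb * z), dtype=np.uint8)
    for m in range(Mb):
        for n in range(Nb):
            if Hb[m, n] != -1:
                block = np.roll(np.eye(z, dtype=np.uint8), phi(Hb[m, n], z), axis=1)
                H[m*z:(m+1)*z, n*z:(n+1)*z] = block
    return H

Hb = np.array([[0, 1, -1],
               [2, -1, 0]])    # hypothetical 2 x 3 base matrix
print(expand(Hb, 3))           # 6 x 9 expanded parity-check matrix
```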
Methods of constructing QC-LDPC codes include algebraic methods [55, 77, 95], geometric methods [63, 71, 95], and random or optimization approaches [37, 98]. QC-LDPC codes tend to have decent performance while also allowing the implementation to be efficiently parallelized. The block size may easily be adapted by changing the expansion factor z. Also, certain construction methods can ensure that the girth of the code is at least 8 [96]. In all of the standards using LDPC codes that are referenced in this thesis, the codes are of QC-LDPC structure.
2.3.3 Randomized quasi-cyclic codes
The performance of regular quasi-cyclic codes can be increased relatively easily by the addition of a randomizing layer in the hardware architecture. This type of code resulted from an effort of joint code and decoding architecture design [107, 110]. The codes are (3, k)-regular, with the general structure shown in Fig. 2.9, and have a girth of at least six. In the figure, I represents L × L identity matrices, where L is a scaling constant, and P represents cyclically shifted L × L identity matrices. The column weight is 3, and the row weight is k. Thus there are k² each of the I- and P-type matrices. The bottom part is a partly randomized matrix, also with row weight k. The submatrix is obtained from a quasi-cyclic matrix by moving some of the ones within their columns according to certain constraints. The constraints are best described directly by the decoder implementation, described in Sec. 2.7.
2.4 LDPC decoding algorithms
Normally, LDPC codes are decoded using a belief propagation algorithm. In this
section, the sum-product algorithm and the common min-sum approximation are
explained.
2.4.1 Sum-product algorithm
The sum-product decoding algorithm is defined directly on the Tanner graph of
the code [39, 65, 74, 101]. It is an iterative algorithm, consecutively propagating
bit probabilities and parity-check constraint satisfiability likelihoods until the algorithm converges to a valid codeword, or a predefined maximum number of iterations
is reached. A number of variables are defined:
• The prior probabilities $p_n^0$ and $p_n^1$ denote the probabilities that bit n is zero and one, respectively, considering only the received channel information and not the code structure.

• The variable-to-check messages $q_{nm}^0$ and $q_{nm}^1$ are defined for each edge between a variable node n and a check node m. They denote the probabilities that bit n is zero and one, respectively, considering the prior variable probabilities and the likelihood that parity-check relations other than m involving bit n are satisfied.

• The check-to-variable messages $r_{mn}^0$ and $r_{mn}^1$ are defined for each edge between a check node m and a variable node n. They denote the likelihoods that parity-check relation m is satisfied considering variable probabilities for the other involved bits given by their variable-to-check messages, and given that bit n is zero and one, respectively.

• The pseudo-posterior probabilities $q_n^0$ and $q_n^1$ are updated in each iteration and denote the probabilities that bit n is zero and one, respectively, considering the information propagated so far during the decoding.
Figure 2.10 Sum-product decoding: Initialization phase.
Figure 2.11 Sum-product decoding: Variable node update phase.
• The hard-decision vector x̂_n denotes the most likely bit values, considering bit n and its surroundings. The number of surrounding bits considered increases with each iteration.
Decoding a received vector consists of three phases: the initialization phase, the variable node update phase, and the check node update phase. In the initialization phase, shown in Fig. 2.10, the messages are cleared and the prior probabilities are initialized to the individual bit probabilities based on received channel information. In the variable node update phase, shown in Fig. 2.11, the variable-to-check messages are computed for each variable node from the prior probabilities and the check-to-variable messages along the adjoining edges. Also, the pseudo-posterior probabilities are calculated, and the hard-decision bits are set to the most likely bit values based on the pseudo-posterior probabilities. In the check node update phase, shown in Fig. 2.12, the check-to-variable messages are computed based on the variable-to-check messages, and all check node relations are evaluated based on the hard-decision vector. If all check node constraints are satisfied, decoding stops, and the current hard-decision vector is output.

Figure 2.12 Sum-product decoding: Check node update phase.
Decoding continues until either a valid codeword is found, or a preset maximum
number of iterations is reached. In the latter case, a decoding failure occurs,
whereas the former case results in either a decoder success or a decoder error.
However, for well-defined codes with block lengths of at least 1000 bits, decoder
errors are extremely rare. Therefore, when a decoding attempt is unsuccessful, it
will almost always be known.
Decoding is usually performed in the log-likelihood ratio domain using the variables $\gamma_n = \log(p_n^0/p_n^1)$, $\alpha_{nm} = \log(q_{nm}^0/q_{nm}^1)$, $\beta_{mn} = \log(r_{mn}^0/r_{mn}^1)$ and $\lambda_n = \log(q_n^0/q_n^1)$. In this domain, the update equations can be written [65]
$$\alpha_{nm} = \gamma_n + \sum_{m' \in M(n)\setminus m} \beta_{m'n} \qquad (2.13)$$

$$\beta_{mn} = \left( \prod_{n' \in N(m)\setminus n} \operatorname{sign} \alpha_{n'm} \right) \cdot \Phi\left( \sum_{n' \in N(m)\setminus n} \Phi\left( |\alpha_{n'm}| \right) \right) \qquad (2.14)$$

$$\lambda_n = \gamma_n + \sum_{m' \in M(n)} \beta_{m'n} \qquad (2.15)$$

where M(n) denotes the neighbors of variable node n, N(m) denotes the neighbors of check node m, and Φ(x) = −log tanh(x/2).
The sum-product algorithm is used in the implementation of the early-decision
algorithm in Chapter 3.
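For illustration, a direct, unoptimized floating-point rendering of (2.13)–(2.15) is sketched below; it is a behavioral model rather than the hardware formulation, and the example priors are arbitrary values consistent with a noisy all-zero codeword:

```python
import numpy as np

def phi(x):
    """Phi(x) = -log tanh(x/2); note that Phi is its own inverse."""
    x = np.clip(x, 1e-12, 50.0)          # avoid log(0) at the extremes
    return -np.log(np.tanh(x / 2.0))

def sum_product(H, gamma, max_iter=50):
    """Log-domain sum-product decoding per (2.13)-(2.15).

    H: binary parity-check matrix, gamma: prior LLRs. Returns the
    hard-decision vector and a success flag.
    """
    M, N = H.shape
    beta = np.zeros((M, N))                             # check-to-variable messages
    for _ in range(max_iter):
        lam = gamma + (H * beta).sum(axis=0)            # (2.15)
        alpha = H * (lam - beta)                        # (2.13): exclude own edge
        x_hat = (lam < 0).astype(np.uint8)
        if not (H @ x_hat % 2).any():                   # all checks satisfied
            return x_hat, True
        for m in range(M):
            nbrs = np.flatnonzero(H[m, :])
            s = np.prod(np.sign(alpha[m, nbrs]))        # total sign of the row
            t = phi(np.abs(alpha[m, nbrs])).sum()       # total Phi sum of the row
            for n in nbrs:                              # (2.14), excluding edge (m, n)
                beta[m, n] = s * np.sign(alpha[m, n]) * phi(t - phi(abs(alpha[m, n])))
    x_hat = (gamma + (H * beta).sum(axis=0) < 0).astype(np.uint8)
    return x_hat, False

H = np.array([[1,1,1,1,0,0,0],[1,1,0,0,1,1,0],[1,0,1,0,1,0,1]])
gamma = np.array([2.1, 1.4, 0.9, 1.7, -0.3, 1.2, 0.8])  # noisy all-zero codeword
print(sum_product(H, gamma))                            # recovers the all-zero word
```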
2.4.2 Min-sum approximation
Whereas (2.13) and (2.15) consist of sums and are simple to implement in hardware, (2.14) is a bit more complex. One way of simplifying the hardware implementation is the use of the min-sum approximation [38], which replaces the check node operation by the minimum of the arguments. The min-sum approximation results in an overestimation of the reliabilities of messages, as only the probability of one message is used in the operation. This can be partly compensated for by adding an offset to variable-to-check messages [23], which results in the following equations:
$$\alpha_{nm} = \gamma_n + \sum_{m' \in M(n)\setminus m} \beta_{m'n} \qquad (2.16)$$

$$\beta_{mn} = \left( \prod_{n' \in N(m)\setminus n} \operatorname{sign} \alpha_{n'm} \right) \cdot \max\left( \min_{n' \in N(m)\setminus n} |\alpha_{n'm}| - \delta,\; 0 \right) \qquad (2.17)$$

$$\lambda_n = \gamma_n + \sum_{m' \in M(n)} \beta_{m'n} \qquad (2.18)$$
where δ is a constant determined by simulations. An additional result of the approximation is that the number of different message magnitudes from a specific check node is reduced to at most two. This enables a significant reduction in the memory requirements for storage of the check-to-variable messages in some architectures, especially when a layered decoding schedule is used.

The offset min-sum algorithm is used in the implementation of the rate-compatible decoder in Chapter 4.
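A sketch of the offset min-sum check node operation (2.17) for a single check node follows; δ = 0.5 is an arbitrary example value. Note that the output magnitudes take at most two distinct values, which is the property exploited for compact message storage:

```python
import numpy as np

def offset_min_sum_check(alpha, delta=0.5):
    """Offset min-sum check node update per (2.17).

    alpha: variable-to-check messages entering one check node. Returns the
    check-to-variable messages; delta is the simulation-tuned offset.
    """
    mags = np.abs(alpha)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]        # two smallest magnitudes
    total_sign = np.prod(np.sign(alpha))
    beta = np.empty_like(alpha)
    for n in range(len(alpha)):
        m = min2 if n == order[0] else min1            # exclude own magnitude
        beta[n] = total_sign * np.sign(alpha[n]) * max(m - delta, 0.0)
    return beta

print(offset_min_sum_check(np.array([2.5, -0.8, 1.6, 3.0])))
# -> [-0.3, 1.1, -0.3, -0.3]: only two distinct magnitudes appear
```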
2.5 Rate-compatible LDPC codes
A class of rate-compatible codes is defined as a set of codes with the same number of codewords but different rates, and where codewords of higher rates can be
obtained from codewords of lower rates by removing bits at fixed positions [48].
Thus the information content is the same in the codes, but the amount of parity
information differs. The benefits of rate-compatibility include better adaptation
to channel environments and more efficient implementations of encoders and decoders. Better adaptation to channel environments is achieved through the large
number of possible rates to choose from, whereas more efficient implementations
are achieved through the reuse of hardware between the encoders and decoders of
the different rates. An additional advantage is the possibility to use smart ARQ
schemes where the retransmission consists of a small amount of extra parity bits
rather than a completely recoded packet.
There are two main methods of defining such classes of codes: puncturing and extension. Using puncturing, a low-rate mother code is designed, and the higher-rate codes are then defined by removing bits at fixed positions in the blocks. Using extension, lower-rate codes are defined from a high-rate mother code by adding additional parity bits.
Figure 2.13 Recovery tree of a 2-SR node. The circles are variable nodes and the squares are check nodes. The filled circles are punctured nodes.
Disadvantages of rate-compatible codes include reduced performance of the code and decoder. Since it is difficult to optimize a class of rate-compatible codes for a range of different rates, there will generally be a performance difference between a rate-compatible code and a dedicated code of a specific rate. However, better adaptation to channel conditions may still allow a decrease in the average number of transmitted bits.
2.5.1 SR-nodes
For LDPC codes, a straight-forward way of decoding rate-compatible codes obtained through puncturing is to use the parity-check matrix of the low-rate mother code and initialize the prior LLRs of the punctured nodes to zero. However, such a node will delay the probability propagation of its check node neighbors until it receives a non-zero message from one of its neighbors. The concept of k-step recoverable (k-SR) nodes was introduced in [47] based on the assumption that the performance of an LDPC code using a particular puncturing pattern is mainly determined by the recovery time of the punctured nodes. The recovery time of a punctured variable node is defined as the minimum number of iterations required before the node can start to produce non-zero messages. A non-punctured node can thus be denoted a 0-SR node. A punctured node having at least one check node neighbor for which it is the only punctured node may receive a non-zero message from that check node in the first iteration and is thus a 1-SR node. Generally, a k-SR node is reinitialized by its neighbors after k iterations. Figure 2.13 shows the recovery tree of a 2-SR punctured node. The 2-SR node has no other check node neighbors for which it is the only punctured node.
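The recovery-time classification can be illustrated with a small sketch that iteratively marks punctured nodes as recovered once some neighboring check has all its other neighbors recovered; the check node lists and the puncturing pattern below are example inputs:

```python
def sr_levels(check_nbrs, punctured, max_steps=20):
    """Recovery time (k-SR level) of each variable node, 0 for unpunctured.

    check_nbrs: list of variable-node index lists, one per check node.
    punctured: set of punctured variable indices. A punctured node becomes
    k-SR when some neighboring check has all its other neighbors recovered.
    """
    level = {v: 0 for c in check_nbrs for v in c if v not in punctured}
    for step in range(1, max_steps + 1):
        newly = set()
        for nbrs in check_nbrs:
            unknown = [v for v in nbrs if v not in level]
            if len(unknown) == 1:              # exactly one still-punctured node
                newly.add(unknown[0])
        if not newly:
            break
        for v in newly:
            level[v] = step
    return level

# Hamming (7, 4, 3) checks as an example; puncture v0 and v4.
checks = [[0, 1, 2, 3], [0, 1, 4, 5], [0, 2, 4, 6]]
print(sr_levels(checks, {0, 4}))   # v0 is 1-SR; v4 recovers a step later (2-SR)
```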
In [47], a greedy algorithm was proposed that successively chooses variable
nodes that receive likelihood data from their neighbours as early as possible. Since
then, several other greedy algorithms have been proposed. In [24], an algorithm
that takes into account the reliability of the information used for reinitializing
punctured nodes was suggested. However, it can only be used for parity-check matrices with dual-diagonal structures. In [46] and [83], algorithms maximizing the
reinitialization reliability for general parity-check matrix structures were proposed.
The algorithm in [46] is essentially a refinement of the one proposed in [47], sequentially puncturing k-SR nodes for increasing values of k while trying to choose
nodes with high recovering reliability. In [83], the aims are similar, but instead of
minimizing the number of iterations for recovering, the algorithm tries to choose
nodes based on the number of unpunctured nodes used in the computation of the
recovery information.
2.5.2 Decoding of rate-compatible codes
Rate-compatible LDPC codes can be straight-forwardly decoded by any LDPC
decoder simply by initializing the a-priori LLR values of punctured nodes to zero.
In that respect, any LDPC decoder is also a rate-compatible decoder. However, by
reducing the number of different codes that must be supported by the decoder, the
usage of rate-compatible codes may allow simplifications to the architecture. In
[106], a rate-compatible decoder is presented for the WiMAX rate-1/2 code, where
hard-wiring the information exchange between the nodes allows a low-complexity
implementation.
In contrast, in Chapter 4 an alternative decoding algorithm is proposed, which
removes the punctured nodes altogether from the code. The result is a significant
increase of the propagation speed of the messages in the decoder, which together
with a reduced code complexity allows a significant throughput increase.
2.6 LDPC decoder architectures
To achieve good theoretical properties, the code is typically required to have a
certain degree of randomness or irregularity. However, this makes efficient hardware
implementations difficult [99,104]. For example, a direct instantiation of the Tanner
graph of a 1024-bit code in a 0.16 µm CMOS process resulted in a chip with more
than 26000 wires with an average length of 3 mm, and a routing overhead of 50%
[18,52]. It is also the case that the required numerical accuracy of the computations
is low. Thus, the sum-product algorithm can be said to be communication-bound
rather than computation-bound.
The architectures for the sum-product decoding algorithm can be divided into
three main types [44]: the parallel, the serial, and the partly parallel architecture.
These are briefly described here.
Figure 2.14 Parallel architecture for sum-product decoding algorithm.
Figure 2.15 Serial architecture for sum-product decoding algorithm.
2.6.1 Parallel architecture
Directly instantiating the Tanner graph of the code yields the parallel architecture,
as shown in Fig. 2.14. As the check node computations, as well as the variable
node computations, are intraindependent (i.e. the check node computations depend
only on the result of variable node computations, and vice versa), the algorithm
is inherently parallelizable. All check node computations can be done in parallel,
followed by computations of all the variable nodes.
An example implementation is the above mentioned 1024-bit code decoder,
achieving a throughput of 1 Gb/s while performing 64 iterations. The chip has an
active area of 52.5 mm² and a power dissipation of 690 mW, and is manufactured in a 0.16 µm CMOS process [18]. However, due to the graph irregularity required for good codes, the parallel architecture is hardly scalable to larger codes. Also, the irregularity of purely random codes makes it difficult to time-multiplex the computations efficiently.

Figure 2.16 Partly parallel architecture for sum-product decoding algorithm.
2.6.2 Serial architecture
Another obvious architecture is the serial architecture, shown in Fig. 2.15. In the serial architecture, the messages are stored in a memory between generation and consumption. Control logic is used to schedule the variable node and check node computations, and the code structure is realized through the memory addressing. However, in a code with good theoretical properties, the sets of check-to-variable node messages that a set of variable nodes depend on are largely disjoint (e.g., in a code with girth six, at most one check-to-variable message is shared between the dependencies of any two variable nodes), which makes an efficient code schedule difficult and requires that the memory contain most of the messages. Moreover, in a general code, increasing the throughput by partitioning the memory is made difficult by the irregular dependencies of node computations, although certain code construction methods (e.g. QC-LDPC codes) can ensure that such a partitioning can be done. Still, the performance of the serial architecture is likely to be severely limited by memory accesses.
In [105], iteration-level loop unrolling was used to achieve 1 Gb/s throughput
of a serial-like decoder for an (N, j, k) = (4608, 4, 36) code, but with memory
requirements of 73728 words per iteration.
2.6.3 Partly parallel architecture
A third possible architecture is the serial/parallel, or partly parallel architecture,
shown in Fig. 2.16, which can be seen as either a time-multiplexed parallel decoder,
or an interleaved serial decoder. The partly parallel architecture retains the speed
achievable with parallel processing, while also allowing longer codes without resulting in excessive routing. However, neither the parallel nor the serial architecture
can usually be efficiently transformed using a general random code. Thus, the use
of a partly parallel architecture usually requires using a joint code and decoder design flow. Generally, the QC-LDPC codes (see Sec. 2.3.2) obtained through various
techniques are suitable to be used with a partly parallel architecture.
In the partly parallel architecture, parallelism can be achieved in a number of
different ways, with different advantages and drawbacks. Considering a QC-LDPC
code, the main parallelism choices are either inside or across the sub-matrices.
Examples of the partly parallel architecture include a 3.33 Gb/s (1200, 720) code decoder with a power dissipation of 644 mW, manufactured in a 0.18 µm technology [70], and a 250 Mb/s (1944, 972) code decoder dissipating 76 mW, manufactured in a 0.13 µm technology [90]. The implementations in this thesis are of the partly parallel type of architecture. The implementation in Chapter 3 is based on an FPGA implementation achieving a throughput of 54 Mbps for a (9216, 4608) code using a clock frequency of 56 MHz and performing 18 iterations [108]. For the implementation in Chapter 4, two degrees of parallelism are investigated: 24 and 81. The examined codes are those of the IEEE 802.16e WiMAX standard and the IEEE 802.11n WLAN standard. The achieved throughputs are rate-dependent, but throughputs starting from 33 Mbps and 100 Mbps, respectively, are achieved in an FPGA implementation with a clock frequency of 100 MHz performing 10 iterations. It is based on [66].
2.6.4 Finite wordlength considerations
Earlier investigations have shown that the sum-product algorithm for LDPC decoding has relatively low requirements on data wordlength [111]. Even using as few as 4–6 bits yields only a fraction of a dB of SNR penalty for the decoder performance. Considering the equations (2.13)–(2.15), the variable node computations, (2.13) and (2.15), are naturally done using two's complement arithmetic, whereas the check node computations (2.14) are more efficiently carried out using signed-magnitude arithmetic. This is the solution chosen for the implementation in Sec. 3.3, and data representation converters are therefore used between the variable node and check node processing elements. It should also be noted that due to the inverting characteristic of the domain transfer function Φ(x), shown in Fig. 2.17, there is a big difference between positive and negative zero in the signed-magnitude representation, as the sum in (2.14) is given directly as an argument to Φ(x). This makes the signed-magnitude representation particularly suited for the check node computations.
Considering Φ(x), the fixed-point implementation is not trivial.

Figure 2.17 Φ(x) and discrete functions obtained through rounding. For the discrete functions, the notation is (wi, wf), where wi is the number of integer bits and wf is the number of fractional bits.

The functions
obtained through rounding of the function values are shown for some different data
representations in Fig. 2.17. Whereas other quantization rules such as truncation,
rounding towards infinity, or arbitrary rules obtained through simulations can be
considered, rounding to nearest has been used in this thesis, and the problem is not
further considered. However, it can be noted that because of the highly non-linear
nature of Φ(x), many numbers do not occur as function values for the fixed-point
implementations. This fact can be exploited, and in Sec. 5.2 a compression scheme
is introduced.
2.6.5 Scaling of Φ(x)
Considering the Φ(x) function in (2.14), using the natural base for the logarithm is not necessary. However, when the natural base is used, the inverse function Φ⁻¹(x) = 2 arctanh(exp(−x)) is identical to Φ(x), and thus in (2.14) the inverse transformation can be done using Φ(x). However, the arithmetic transformation of the check node update rule to a sum of magnitudes works equally well with any other logarithm base. The resulting difference between the forward transformation function Φ(x) and the reverse transformation function Φ⁻¹(x) may or may not be a problem in the implementation. In a fixed-point implementation, changing the logarithm base can be seen as a scaling of the inputs and outputs to the CNU, which can often be done without any overhead if separate implementations are already used for the forward and reverse transformation functions. In [31], it is shown that such a scaling can improve the performance of the sum-product decoder using fixed-point data. In Sec. 5.2, it is shown how the benefits of internal data coding depend on the choice of logarithm base.

Figure 2.18 Partially parallel (3, k)-regular sum-product decoder overview.
2.7 Sum-product reference decoder architecture
In this section, the sum-product reference decoder for the work in Chapter 3 is
presented. The section includes an architecture overview, description of memory
blocks and implementations of the variable node processing unit (VNU) and check
node processing unit (CNU), as well as the interconnection networks.
Figure 2.19 Memory block of (3, k)-regular sum-product decoder, containing memories used during decoding, and the VNUs.
2.7.1 Architecture overview
An overview of the architecture is shown in Fig. 2.18. The architecture contains k² memory blocks with the memory banks used during decoding and the VNUs. The memory blocks are connected to 3k CNUs through two regular and fixed interconnection networks (π1 and π2), and one randomized and configurable interconnection network (π3). The purpose of the VNUs is to perform the computations of the variable-to-check node messages α_nm, as in (2.13), and the pseudo-posterior probabilities λ_n, as in (2.15). Similarly, the purpose of the CNUs is to perform the computation of the check-to-variable messages β_mn, as in (2.14), and to compute the check parities using the hard-decision variables x̂_n obtained from the pseudo-posterior probabilities. However, in the implementation, the computation of Φ(α_nm) is moved from the CNUs to the VNUs, and thus the implementation deviates slightly from the formal formulation of the algorithm.
2.7.2 Memory block
The structure of the memory blocks is shown in Fig. 2.19, where w denotes the wordlength of the prior probabilities γ_n. One memory block contains one INT RAM of width w for storing γ_n, one DEC RAM of width 1 to store the hard-decision values x̂_n, and three EXT RAMi of width w + 1 to store the extrinsic information α_nm and β_mn during decoding. Thus, there are a total of 5k² memories. All memories are of size L. The EXT RAMi are dual-port memories, read and written every clock cycle in both the variable node update phase and the check node update phase, whereas the INT RAM and DEC RAM are single-port memories used only in the variable node update phase. The messages are stored in two's complement representation in INT RAM and in signed-magnitude representation in EXT RAMi. At the output of the EXT RAMi there are two registers; thus, switching of signals in the CNU can be disabled in the VNU phase and vice versa.

Figure 2.20 Implementation of VNU for the (3, k)-regular sum-product decoder.
2.7.3 Variable node processing unit
The implementation of the VNU is relatively straight-forward, as shown in Fig. 2.20. As the extrinsic information is stored in signed-magnitude format, the data are first converted to two's complement. An adder network performs the computations of the variable-to-check node messages and the pseudo-posterior probabilities, and the variable-to-check node messages are then converted back to signed-magnitude format and truncated to w bits. Finally, the Φ(x) function (see Fig. 2.17) of the magnitude is computed, and the results are joined with the hard-decision bit to form the (w + 1)-bit hybrid data.
Figure 2.21 Implementation of CNU for the (3, k)-regular sum-product decoder with k = 6.
2.7.4 Check node processing unit
The implementation of the CNU for k = 6 is shown in Fig. 2.21. First, the hard-decision bits are extracted and XORed to compute the parity of the check constraint. Then the check-to-variable messages are computed using the function S(x, y), defined as

$$S(x, y) = \operatorname{sign}(x) \cdot \operatorname{sign}(y) \cdot (|x| + |y|). \qquad (2.19)$$

Finally, the results are truncated to w bits, and the Φ(x) function of the magnitude is computed.
Figure 2.22 Function of π1 interconnection network, with a_{m,n} denoting the message from/to memory block (m, n).
Figure 2.23 Function of π2 interconnection network, with a_{m,n} denoting the message from/to memory block (m, n).
2.7.5 Interconnection networks
The functions of the fixed interconnection networks are shown in Figs. 2.22 and
2.23. π1 connects the messages from Blockx,y with the same x coordinate to the
same CNU, whereas π2 connects the messages with the same y coordinate to the
same CNU.
The function of the configurable interconnection network π3 is shown in Fig. 2.24 for the forward path. $a_{m,n}$ denotes the message from Block_{m,n}. The $a_{m,n}$ are permuted to $b_{m,n}$ by the permutation functions Ψ_{l,n}, where l is the value of AG¹, defined in Sec. 2.7.6. Ψ_{l,n} is either the identity permutation or the fixed permutation R_n depending on l and n. Thus, formally,

$$b_{m,n} = \begin{cases} a_{m,n} & \text{if } \psi_{l,n} = 0 \\ a_{R_n(m),n} & \text{if } \psi_{l,n} = 1 \end{cases} \qquad (2.20)$$

where $\psi_{l,n}$ are values stored in a ROM. Similarly, the $b_{m,n}$ are permuted to $c_{m,n}$ through the permutation functions Ω_{l,m}, which are either the identity permutations or the fixed permutations C_m depending on l and m. The permutation can be formalized as

$$c_{m,n} = \begin{cases} b_{m,n} & \text{if } \omega_{l,m} = 0 \\ b_{m,C_m(n)} & \text{if } \omega_{l,m} = 1 \end{cases} \qquad (2.21)$$
where $\omega_{l,m}$ are values stored in a ROM. Finally, the values $c_{m,n}$ are connected to CNU_{3,n}.

Figure 2.24 Forward path of π3 interconnection network, with a_{m,n} denoting the message from memory block (m, n). Ψ_{l,n} and Ω_{l,m} denote permutation functions that are either the identity permutation or fixed permutations R_n and C_m, respectively.
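A behavioral Python sketch of the forward path (2.20)–(2.21) is given below; the ROM contents psi and omega and the fixed permutations R and C are placeholders for the implementation-specific values:

```python
import numpy as np

def pi3_forward(a, psi, omega, R, C, l):
    """Forward path of the pi3 network per (2.20) and (2.21).

    a: k x k message array from the memory blocks; psi, omega: ROM bits
    indexed by (l, n) and (l, m); R[n], C[m]: fixed index permutations.
    """
    k = a.shape[0]
    b = np.empty_like(a)
    for n in range(k):                       # column permutations, (2.20)
        for m in range(k):
            b[m, n] = a[R[n][m], n] if psi[l][n] else a[m, n]
    c = np.empty_like(a)
    for m in range(k):                       # row permutations, (2.21)
        for n in range(k):
            c[m, n] = b[m, C[m][n]] if omega[l][m] else b[m, n]
    return c                                 # column n of c feeds CNU3,n
```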
2.7.6 Memory address generation
The addresses to the EXT RAMi memories are given by mod-L binary counters AG^i_{x,y}, i = 1, 2, 3, where AG¹ is also used for the INT RAM and DEC RAM memories. The counters are initialized to 0 at the beginning of variable node processing, and to the values $C^i_{x,y}$ at the beginning of check node processing, which are chosen according to the constraints

$$C^1_{x,y} = 0 \qquad (2.22)$$
$$C^2_{x,y} = ((x-1) \cdot y) \bmod L \qquad (2.23)$$
$$C^3_{x,y_1} \neq C^3_{x,y_2}, \quad \forall y_1, y_2 \in \{1, \ldots, k\},\; y_1 \neq y_2 \qquad (2.24)$$
$$C^3_{x_1,y} - C^3_{x_2,y} \neq ((x_1 - x_2) \cdot y) \bmod L, \quad \forall x_1, x_2 \in \{1, \ldots, k\},\; x_1 \neq x_2. \qquad (2.25)$$

The purpose of the constraints is to ensure that the resulting code has a girth of at least six; they are further motivated in [108].
  x      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  Φ[x]  15   8   6   4   3   2   2   1   1   1   1   1   0   0   0   0

Table 2.1 Definition of Φ[x] function for (wi, wf) = (3, 2) fixed point data representation.
The architecture as described results in the code shown in Fig. 2.9, where I_{x,y} denotes an L × L identity matrix, P_{x,y} denotes an L × L identity matrix circularly right-shifted by $C^2_{x,y}$, and R_{x,y} denotes an Lk × L matrix with column weight one and row weight at most one, subject to further constraints imposed by $C^3_{x,y}$ and π3. The extrinsic data corresponding to I_{x,y} is stored in EXT RAM1 in Block_{x,y}, the data corresponding to P_{x,y} in EXT RAM2 in Block_{x,y}, and the data corresponding to R_{x,y} in EXT RAM3 in Block_{x,y}.
2.7.7 Φ function
The definition of the Φ[x] function used in the sum-product reference decoder
architecture is shown in Table 2.1. The same function is used for the early decision
decoder without enforcing check constraints in Sec. 3.3.
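Table 2.1 can be reproduced by quantizing Φ(x) with rounding to nearest, as the following sketch shows; the scaling by 2^wf and the saturation of Φ(0) to the maximum code are inferred from the table values:

```python
import math

def phi(x):
    """Phi(x) = -log tanh(x/2), with Phi(0) taken as infinity."""
    return -math.log(math.tanh(x / 2.0)) if x > 0 else float("inf")

def phi_table(num_entries=16, wf=2, max_code=15):
    """Quantized Phi[x]: round(2^wf * Phi(x / 2^wf)), saturated to max_code."""
    scale = 1 << wf
    table = []
    for x in range(num_entries):
        value = phi(x / scale) * scale
        table.append(max_code if value > max_code else int(round(value)))
    return table

print(phi_table())  # [15, 8, 6, 4, 3, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```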
2.8 Check-serial min-sum decoder architecture
In this section, the reference decoder [66] for the work in Chapter 4 is presented.
The decoder handles all the LDPC codes of the WiMAX standard. It uses the
min-sum algorithm ((2.16)–(2.18)) in a layered schedule, and uses serial check node
operations. Using a layered schedule, bit-sums are updated directly after the completion of each layer, effectively replacing the variable-node update equation (2.16)
with updates directly at the end of each check node computation. The result is
a significant increase in the propagation rate of the messages which reduces the
convergence time of the algorithm [50].
This section includes a description of the schedule, an architecture overview,
and details on the check node function unit (CFU).
2.8.1 Decoder schedule
The schedule of the decoder and relevant definitions are shown in Fig. 2.25:

• A layer is a group of check nodes where all the involved variable nodes are independent. It may be bigger than the expansion factor of the base parity-check matrix.
• A check node group is a number of check nodes that are processed in parallel. The variable nodes in the check node groups are processed serially, with the bit-sums for each variable node updated after each operation.

• A bit-sum block is the bit-sums of the variable nodes involved in the check-node groups. The bit-sums in a bit-sum block are accessed in parallel and are stored in different memories. One bit-sum block is read per clock cycle, and then re-written one per clock cycle after the processing delay of the routing networks and CFUs.

Figure 2.25 Schedule of check-serial min-sum decoder.
2.8.2 Architecture overview
An overview of the architecture is shown in Fig. 2.26. The data-path consists of a
number of bit-sum memories, a cyclic shifter network, CFUs and a decision memory.
The bit-sum memories store the pseudo-posterior bit probabilities as computed by
(2.18). In each clock cycle, a bit-sum block is read from the bit-sum memories,
routed to the correct CFU by the cyclic shifter, and the CFU computes updated
bit-sums that are rewritten to the bit-sum memories. Also, hard decisions are
done on the updated bit-sums and written to the decision memory. The CFUs also
compute the parities of the signs of the input bit-sums, and decoding stops when
all parities are satisfied in an iteration, or when a pre-defined maximum number of
iterations is reached.
The following parameters are defined for the architecture:
• Q is the parallelization factor and the size of the bit-sum blocks
• wq is the bit-sum wordlength
• wr is the wordlength of the min-sum magnitudes
• we is the edge index wordlength

Figure 2.26 Overview of check-serial min-sum decoder architecture.

Figure 2.27 CFU of check-serial min-sum decoder.
2.8.3 Check node function unit
The implementation of the CFU is shown in Fig. 2.27. It contains a min-sum memory and a sign memory storing the check-to-variable message magnitudes and signs, respectively. Each entry in the min-sum memory stores the check node parity, the magnitudes of the two smallest variable-to-check messages, and the index of the variable node with the smallest variable-to-check message. The wordlength of the min-sum memory is thus 1 + 2wr + we.
At the input of the CFU, the bit message generator subtracts the old check-to-variable message from the input bit-sum, forming the variable-to-check message according to (2.16). This is followed by the min-sum update, which does a sequential search for the two smallest magnitudes of the variable-to-check messages and the index of the smallest one. The result is a packed description of all the check-to-variable messages according to (2.17) and is stored in the min-sum memory. It is also used by the bit-sum update unit to compute the new bit-sum (2.18) from the old variable-to-check message and the new check-to-variable message.
For details of the CFU implementation, the reader is referred to [66].
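The sequential min-sum search performed by the CFU can be sketched as follows; in hardware one magnitude is consumed per clock cycle, which the loop models:

```python
def cfu_min_sum_update(v2c_magnitudes):
    """Sequential search for min1, min2 and the index of min1, as in the CFU.

    The packed (min1, min2, index) result describes all check-to-variable
    magnitudes of the row: a node reads min2 if it holds the minimum itself,
    and min1 otherwise.
    """
    min1 = min2 = float("inf")
    idx = -1
    for i, mag in enumerate(v2c_magnitudes):   # serial over the check row
        if mag < min1:
            min1, min2, idx = mag, min1, i
        elif mag < min2:
            min2 = mag
    return min1, min2, idx

assert cfu_min_sum_update([2.5, 0.8, 1.6, 3.0]) == (0.8, 1.6, 1)
```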
3 Early-decision decoding
In this chapter, a modification to the sum-product algorithm that the authors call
the early-decision decoding algorithm is introduced. Basically, the idea is to reduce
internal communication of the decoder by early decision of bits that already have
a high enough probability of being either zero or one. This is done in several steps
in Sec. 3.1: by defining a measure of bit reliabilities, by defining how bits are
decided based on their reliabilities, and by defining the processing of decided bits
in the decoding process. In Sec. 3.2, a hybrid algorithm is proposed, removing the
performance penalty caused by the early-decision decoding algorithm.
To estimate the power dissipation savings obtainable with the methods in this thesis, the algorithms have been implemented in VHDL and synthesized for a Xilinx Virtex 5 FPGA. As a basis for the architecture, the FPGA implementation in Sec. 2.7 (also in [108]) has been used. The codes that can be decoded by the architecture are similar to QC-LDPC codes, but a configurable interconnection network allows some degree of randomness, resulting in the type of codes described in Sec. 2.3.3. The modifications done to implement the early decision algorithm are described in Sec. 3.3, and in Sec. 3.4, it is shown how the early decision decoder can implement the hybrid decoding algorithm.
In Sec. 3.5, floating-point and fixed-point simulation results are provided. The
floating-point simulations show the performance of the early decision and hybrid
algorithms for three different codes, whereas the performance of the hybrid algorithm under fixed wordlength conditions is shown by the fixed-point simulations.
Figure 3.1 Typical reliabilities c_n during decoding of different codes: (a) (N, j, k) = (144, 3, 6), successful decoding; (b) (N, j, k) = (144, 3, 6), unsuccessful decoding; (c) (N, j, k) = (1152, 3, 6), successful decoding; (d) (N, j, k) = (1152, 3, 6), unsuccessful decoding. The gray lines show the individual bit reliabilities, and the thick black lines are the magnitude averages.
In Sec. 3.6, slice utilization and power dissipation estimations are provided from
synthesis results for a Xilinx Virtex 5 FPGA.
Secs. 3.1.1, 3.1.2, 3.1.3 and 3.2 have been published in [14], [11] and [12], whereas
Secs. 3.1.4 and 3.1.5 are previously unpublished. The implementations of algorithms, Sec. 3.3 and Sec. 3.4, have been published in [13].
3.1 Early-decision algorithm
During the first iterations of the decoding of a block, when the messages are still independent, the probability that the hard-decision variable is wrong is $\min(q_n^0, q_n^1)$. Assuming that this value is small, which is the case in the circumstances considered, it can be approximated as $\min(q_n^0/q_n^1, q_n^1/q_n^0)$, which can be rewritten as $\exp(-|\lambda_n|)$. Thus, a measure of the reliability of a bit during decoding can be defined as

$$c_n = |\lambda_n|, \qquad (3.1)$$
with the interpretation that the hard-decision bit is correct with a probability of
1 − exp(−cn ). Typical values of cn during decoding of a block are shown for two
different codes in Fig. 3.1. The slowly increasing average reliabilities in Figs. 3.1(a)
and 3.1(c) are typical for successfully decoded blocks. The slope is generally steeper
for longer codes, as the reliabilities escalate quickly in local parts of the code graph
where the decoder has converged. Similarly, the oscillating behavior in Figs. 3.1(b)
and 3.1(d) is common for unsuccessfully decoded blocks. The key point to recognize
in these figures is that bits with high reliabilities are unlikely to change the sign of their log-likelihood ratios λ_n, i.e., the corresponding bits are unlikely to change value.
Early decision decoding is introduced through the definition of a threshold t
denoting the minimum required reliability of a bit to consider it to be sufficiently
well determined. The threshold is in the simplest case just a constant, but may also
be a function of the iteration, the node degree, or other values. Different choices
of the threshold are discussed in Sec. 3.1.1.
When a bit is decided, incoming messages to the corresponding node are ignored, and a fixed value is chosen for the outgoing messages. The value will be an
approximation of the actual bit probability and will affect the probability computations of bits in subsequent iterations. Different choices of values for decided bits
are discussed in Sec. 3.1.2.
Early decision adds an alternative condition for finishing the decoding of a
block, which is when a decision has been made for all bits. However, when the
decoder makes an erroneous decision, this tends to lock the algorithm by rendering
its adjacent bits undecidable as the bit values in the graph become inconsistent. In
Sec. 3.1.4, detecting the introduced inconsistencies in an implementation-friendly
way is discussed.
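As a sketch, one early-decision pass over the pseudo-posterior LLRs might be modeled as follows; the saturation constant big stands in for the ±∞ messages discussed in Sec. 3.1.2, and the function is a behavioral model, not the hardware implementation:

```python
import numpy as np

def early_decision_step(H, lam, decided, alpha, t, big=1e6):
    """One early-decision pass, assuming decided bits send saturated messages.

    Bits with reliability |lambda_n| >= t are frozen; their outgoing
    variable-to-check messages are pinned to a saturated value with the sign
    of the hard decision, and their incoming messages can be ignored.
    """
    newly = (np.abs(lam) >= t) & ~decided
    decided = decided | newly
    pinned = np.where(lam >= 0, big, -big)         # sign of the hard decision
    alpha[:, decided] = H[:, decided] * pinned[decided]   # only on actual edges
    return decided, alpha
```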
3.1.1 Choice of threshold
The simplest choice of a threshold is a constant t = t0. In a cycle-free graph this is a logical choice, as the pseudo-posterior probabilities qn are valid. However, in a graph with cycles, the messages will be correlated after g/4 iterations, where g is the girth of the code. As the girth is often at most 6 for codes used in practice, this will be the case already after the first iteration. Detailed analysis of the impact of the correlation of messages on the probabilities is difficult to do. However, as the pseudo-posterior probabilities of a cycle-less graph increase with the size of the graph, it can be assumed that the presence of cycles causes an escalation of the pseudo-posterior probabilities. In [11], thresholds defined as t = t0 + td·i, where i is the current iteration, are investigated. However, in fixed-point implementations with coarse quantizations and low saturation limits, significant gains are hard to achieve, and therefore only constant thresholds are considered in this thesis.
3.1.2 Handling of decided bits
When computations of probabilities stop for a bit, the outgoing messages from
the node will not represent the correct probabilities, and subsequent computations
for nearby nodes will therefore be done using incorrect data. The most straightforward way is to stop computing the variable-to-check messages, and use the values
from the iteration in which the bit was decided in subsequent iterations. However,
messages must still be communicated along the edges, so the potential gain is small
in this case.
There are two other natural choices of variable-to-check messages for early decided bits. One is to set the message αnm to t or −t, depending on the hard-decision
44
CHAPTER 3. EARLY-DECISION DECODING
value. The other is to set αnm to ∞ or −∞, thereby essentially regarding the bit
as known during the subsequent iterations. The first choice can be expected to introduce the least amount of noise in the decoding, and thereby yield more accurate
decoding results, whereas the second choice can be expected to propagate increased
reliabilities and thus decide bits faster. Similar ideas are presented in [112]. In this
thesis, results are only provided for the second case, as simulations have shown the
difference in performance to be insignificant. Results are provided in Sec. 3.5.
3.1.3 Bound on error correction capability
Considering early decision applied only to the pseudo-posterior probabilities before the first iteration, a lower bound on the error correction capability over a white Gaussian channel with standard deviation σ can be calculated. Before any messages have been sent, the pseudo-posterior probabilities are equal to the prior probabilities, qn = pn. An incorrect decision will be made if the reliability of a bit is above the threshold, but the hard-decision bit is not equal to the sent bit. The probability of this occurring can be written

$$B_{\mathrm{bit}} = P(c_n > t, \hat{x}_n \neq x_n) = P(\gamma_n > t, x = -1) + P(\gamma_n < -t, x = 1). \qquad (3.2)$$
Assuming that the bits 0 and 1 are equally probable, the error probability can be rewritten as

$$B_{\mathrm{bit}} = P(\gamma_n > t \mid x = -1)P(x = -1) + P(\gamma_n < -t \mid x = 1)P(x = 1) = P(\gamma_n < -t \mid x = 1) = P\left(r_n < -\frac{t\sigma^2}{2} \,\Big|\, x = 1\right) = Q\left(\frac{1 + t\sigma^2/2}{\sigma}\right), \qquad (3.3)$$

where $r_n$ is the received value, distributed as $N(1, \sigma^2)$ given $x = 1$, so that the prior LLR is $\gamma_n = 2r_n/\sigma^2$.
If no corrections of erroneous decisions are done, an incorrect bit decision will result in the whole block being incorrectly decoded. The probability of this happening for a block with N bits is

$$B_{\mathrm{block}} = 1 - (1 - B_{\mathrm{bit}})^N. \qquad (3.4)$$
It can thus be seen that the early decision algorithm imposes a lower bound on
the block error probability, which grows with increasing SNR (as Bbit grows with
increasing SNR). This is important, as LDPC decoding is sometimes performed
outside the SNR region which is practical to characterize with simulations, and the
presence of an error floor will then reduce the expected decoder performance.
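The bound is easy to evaluate numerically; the sketch below uses (3.3) and (3.4) as written above, assuming BPSK of unit amplitude so that σ² = 1/(2R·Eb/N0), with example values of t, R and N:

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def block_error_floor(t, ebn0_db, rate, n_bits):
    """Lower bound (3.4) on block error rate from first-iteration decisions."""
    sigma2 = 1.0 / (2.0 * rate * 10.0 ** (ebn0_db / 10.0))
    b_bit = q_func((1.0 + t * sigma2 / 2.0) / math.sqrt(sigma2))   # (3.3)
    return 1.0 - (1.0 - b_bit) ** n_bits                           # (3.4)

# The floor grows with SNR in the typical operating region, as noted above.
for snr_db in (1.0, 2.0, 3.0, 4.0):
    print(snr_db, block_error_floor(t=12.0, ebn0_db=snr_db, rate=0.5, n_bits=1152))
```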
3.1.4 Enforcing check constraints
When an erroneous decision is made, usually the decoder will neither converge on
a codeword nor manage to decide every bit. Thus the decoder will continue to iterate the maximum number of iterations, and if erroneous decisions are frequent, the average number of iterations can increase significantly and severely reduce the throughput of the decoder. However, if the graph inconsistency can be discovered, the decoder can be aborted early and the block can be retried using a less approximative algorithm.

Figure 3.2 Principle of enforcing check constraints. In the variable node update phase (a), white variable nodes have been decided to 0 and black variable nodes have been decided to 1, while v5 is still undecided. In the check node update phase (b), thick arrows indicate enforcing check-to-variable messages.
The principle of how graph inconsistencies can be discovered is shown in Fig. 3.2. In the example, bits v0, ..., v4 have been decided. Thus, the incoming messages α0,0, ..., α4,0 to check node c0 will not change, and the message β0,5 from check node c0 to variable node v5 will also be constant. As the decided conditions are passed along with the messages α0,0, ..., α4,0, this is known to the check node, and an enforcing condition can therefore be flagged for β0,5, meaning that the receiving variable node must take the value of the message for the graph to remain consistent. Similarly, variable nodes v6, ..., v10 are also decided, so that message β1,5 is also enforcing. However, to satisfy check node c0, v5 must take the value 0, and to satisfy c1, v5 must take the value 1; as the variable nodes v0, ..., v4, v6, ..., v10 cannot change, an incorrect decision has been discovered.
However, it is not known exactly which bit has been incorrectly decided, and
thus the error cannot be simply corrected. In this thesis, the event is handled by
aborting the decoding process, but other options are also possible, e.g., to remove
all decisions and continue with the current extrinsic information, or to use the
current graph state as a starting point for the general sum-product algorithm.
As an enforcing check constraint requires that the receiving variable node must
take the value of the associated check-to-variable message for the graph to remain consistent, an early decision can be made for the receiving variable node. In
floating-point simulations with large dynamic range this is not necessary as the
large magnitude of the check-to-variable message ensures that a decision will be
made based on the pseudo-posterior probability. However, in fixed-point implementations, a decision may have to be explicitly made as the magnitude of the
check-to-variable message is limited by the data representation.
3.1.5 Enforcing check approximations
In a practical implementation, using an extra bit to denote enforcing checks is
relatively costly, and thus approximations using the absolute value of the checkto-variable messages are suggested. Consider the check node c0 with k inputs
v0 , . . . , vk−1 , where v0 , . . . , vk−2 have been decided. Assume that the values ±z
are used for the messages from the decided variables. The absolute value of the
check-to-variable message β0,k−1 from c0 to vk−1 will then be
k−2
!
!
k−2
Y
X
|β0,k−1 | = sign αn,0 · Φ
Φ (|αn,0 |) n=0
n=0
!
k−2
z X
= 2arctanh exp
log tanh
2
n=0
z k−1 .
(3.5)
= 2arctanh tanh
2
Using this approximation to determine enforcing checks, the check-to-variable message βmn from a k-input check node is enforcing if
\[
|\beta_{mn}| \ge 2\operatorname{arctanh}\Bigl(\tanh\frac{z}{2}\Bigr)^{k-1}. \tag{3.6}
\]
In a fixed-point implementation, the condition can be written
\[
|\beta_{mn}| \ge \Phi\bigl[(k-1)\,\Phi[z]\bigr], \tag{3.7}
\]
where Φ[x] denotes the discrete function obtained through rounding of the argument and function values of Φ(x), as in Fig. 2.17. Typically, z is the maximum representable value in the data representation, and thus Φ [z] = 0, and Φ [(k − 1) Φ [z]]
is also the maximum representable value. However, due to the limited dynamic
range of the data, z will be a common value also for variables that are not decided.
Thus, check-to-variable messages will often be flagged as enforcing even when the
involved variables are not decided, which degrades performance. An alternative
implementation
is described in Sec. 3.3.4.
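For illustration, the following Python sketch evaluates the fixed-point condition (3.7) under the assumptions stated above: round-to-nearest quantization of Φ(x) = −log tanh(x/2), one sign bit in the representation, and z equal to the largest representable magnitude. It is a behavioural model for experimentation, not a description of the decoder hardware.

import math

def phi(x):
    # Phi(x) = -log tanh(x/2); Phi is its own inverse and diverges at x = 0.
    return -math.log(math.tanh(x / 2))

def quantize_mag(x, w=5, wf=2):
    # Round a magnitude to wf fraction bits and saturate to the largest
    # representable value (one bit is assumed to be used for the sign).
    step = 2.0 ** -wf
    z = (2 ** (w - 1) - 1) * step
    return min(round(x / step) * step, z)

def enforcing_threshold(k, w=5, wf=2):
    """Discrete threshold Phi[(k - 1) Phi[z]] of (3.7)."""
    z = (2 ** (w - 1) - 1) * 2.0 ** -wf
    phi_z = quantize_mag(phi(z), w, wf)
    if phi_z == 0.0:
        # Phi[z] rounds to zero, so Phi[(k - 1) Phi[z]] = Phi[0] saturates
        # to the maximum representable value, as noted in the text.
        return z
    return quantize_mag(phi((k - 1) * phi_z), w, wf)

def is_enforcing(beta, k, w=5, wf=2):
    return abs(beta) >= enforcing_threshold(k, w, wf)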
3.2 Hybrid decoding
For the sum-product algorithm, undetected errors are extremely rare, and this
property is retained with the early-decision modification. Thus, it is almost always
detectable that a wrong decision has been made, but not for which bit. However,
the unsuccessfully decoded block can be retried using the regular sum-product
algorithm. As the block is then decoded twice, the resources required for that
block will be significantly increased, and there is therefore a trade-off adjusted
by the threshold level. Lowering the threshold will increase the number of bit
decisions, but also increase the probability that a block is unsuccessfully decoded
and will require an additional decoding pass. Determining the optimal threshold
level analytically is difficult, and in this thesis simulations are used for a number
of codes to determine its impact.
Using an adequate threshold, redecoding a block is relatively rare, and thus
the average decoding latency is only slightly increased. However, the maximum
latency is doubled, which might make the hybrid decoding algorithm unsuited for
applications sensitive to the communication latency.
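The resulting control flow can be summarized in a few lines. The Python sketch below assumes the two decoder passes are supplied as callables returning a (success, bits) pair; these interfaces are hypothetical and serve only to make the trade-off explicit.

def hybrid_decode(llr, threshold, decode_early, decode_pp, max_iter=100):
    # First pass: early-decision decoding with the chosen threshold.
    ok, bits = decode_early(llr, threshold, max_iter)
    if ok:
        return bits
    # The first pass failed, typically with a detected graph inconsistency:
    # retry the same block with the regular sum-product algorithm. The
    # average latency grows only slightly, but the maximum latency doubles.
    ok, bits = decode_pp(llr, max_iter)
    return bits if ok else None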
3.3 Early-decision decoder architecture
In this section, the changes made to the sum-product reference decoder to realize
the early decision decoder are explained. In the early decision decoder architecture,
additional registers are introduced which are not shown in the reference architecture. However, as the implementation is pipelined in several stages, the registers
are also present in the reference architecture, and therefore do not constitute
additional hardware resources for the early decision architecture.
3.3.1 Memory block
In the early-decision decoder architecture, additional memories DIS RAM of size
L/8 × 8 bits are added to the memory blocks to store the early decisions, as seen in
Fig. 3.3. The reason that the memories are implemented with a wordlength of eight
is that decisions must be read for all CNUs in the check node update phase, which
would thus require three-port memories if a wordlength of one is used. However,
as data are always read sequentially due to the quasi-cyclic structure of the codes,
parallelizing the memory accesses by increasing the wordlength is easy. In addition,
the DEC RAMs have been changed to the same size to be able to utilize the same
addressing mechanism. The parallelized accesses to the hard decision and early
decision memories are handled by the ED LOGIC block which uses an additional
output signal from the VNU to determine early decisions.
In the VNU update phase, the EXT RAMs are addressed with the same addresses and the three-bit disabled signal from the ED LOGIC block is identical for
all memories. Thus, when a bit is disabled, the registers on the memory outputs
are disabled, and the VNU will be idle. In the CNU update phase, the wordlength
of the signals from the EXT RAMs to the CNUs has been increased by one to
include the disabled bits. As data are read from different memory positions in the
three memories, the disabled bits are not necessarily identical. When a bit is
disabled, the registers on the memory outputs are disabled, and only the hard
decision bit is used in the CNU computations.

Figure 3.3 Memory block of the (3, k)-regular early-decision decoder, containing
the memories used during decoding, and the variable node processing unit.
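The packing of decisions into eight-bit words can be illustrated with the following behavioural Python sketch; it models only the storage scheme, not the RTL.

def pack_decisions(bits):
    # Pack one-bit decisions into 8-bit words, as in the DEC/DIS RAMs.
    words = []
    for i in range(0, len(bits), 8):
        word = 0
        for j, b in enumerate(bits[i:i + 8]):
            word |= (b & 1) << j
        words.append(word)
    return words

def read_decision(words, k):
    # A single sequential word access delivers eight consecutive decisions,
    # which is what avoids the need for a three-port memory.
    return (words[k // 8] >> (k % 8)) & 1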
3.3.2 Node processing units
In the variable node processing unit, shown in Fig. 3.4, the magnitude of the
pseudo-posterior likelihood is compared to the threshold value to determine if the
bit will be disabled in further iterations.
In the check node processing unit, shown in Fig. 3.5, MUXes at the input
determine if the variable-to-check messages or the hard decision bits are used for
the computations. At the CNU outputs, registers are used to disable driving of the
check-to-variable messages for bits that are disabled.
If enforcing check constraints, as defined in Sec. 3.1.4, are implemented, additional changes are made to the VNU and CNU. These are described in Sec. 3.3.4.
3.3.3 Early decision logic
The implementation of the ED LOGIC block is shown in Fig. 3.6, and consists
mostly of a set of shift registers. During the VNU phase, hard decision and
early decision bits are read from DEC RAM and DIS RAM and stored in one
shift register each. Newly made hard decisions and early decisions arrive from the
VNU. However, for the bits that are disabled, the VNU computations are turned
off, and thus the stored values from previous iterations are recycled.

Figure 3.4 Implementation of the variable node processing unit for the
(3, k)-regular early-decision decoder.

During the
CNU phase, the memories are only read. However, the three address generators
produce different addresses, and thus three shift registers with different contents
are used. Because of this implementation, an additional constraint is imposed on
the code structure. As DEC RAM and DIS RAM are only single-port memories,
simultaneous reading from several memory locations is not allowed. This implies
the following three constraints:
\[
\begin{aligned}
C^1_{x,y} &\not\equiv C^2_{x,y} \pmod{8} \\
C^1_{x,y} &\not\equiv C^3_{x,y} \pmod{8} \\
C^2_{x,y} &\not\equiv C^3_{x,y} \pmod{8}
\end{aligned} \tag{3.8}
\]
However, if needed, the architecture can be modified to remove the constraints
either by using dual-port memories, or by introducing an additional pipeline stage.
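Constraint (3.8) is easy to verify when candidate initial values are generated; a small Python sketch (an illustrative helper, not part of the decoder itself):

from itertools import combinations

def satisfies_constraint_38(c1, c2, c3):
    # (3.8): the three initial values must be pairwise distinct modulo 8,
    # so the three read streams never address the same 8-bit decision
    # word simultaneously.
    return all(a % 8 != b % 8 for a, b in combinations((c1, c2, c3), 2))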
Figure 3.5 Implementation of the check node processing unit for the
(3, k)-regular early-decision decoder.
Figure 3.6 Implementation of the early decision logic for the (3, k)-regular
early-decision decoder.
3.3.4 Enforcing check constraints
The principle of enforcing check constraints is described in Sec. 3.1.4, and an
approximation in Sec. 3.1.5. The enforcing check constraint approximation has
been simulated on a bit-true level, with some additional changes to improve the
performance. The modifications and their associated hardware requirements are
explained in detail in this section.
First, with a requirement of 3k − 4 AND gates, the enforcing conditions of the
check-to-variable messages are computed explicitly in the CNU from the disabled
bits of the incoming variable-to-check messages. Then, using k w-bit 2-to-1
MUXes, the magnitude of the check-to-variable message is set to the highest
possible value in the representation for enforcing messages. Also, in order to
make this value unique, the function value of Φ[x] for x = 0 is reduced by one.
As the slope of the continuous Φ(x) function is steep for x close to zero, this
change can be expected to introduce only a small amount of errors, as the "true"
value of the check-to-variable message is anywhere between Φ(2^{−(wf+1)}) and
infinity. Using this change, enforcing check constraints can be detected by the
receiving VNU, as the magnitude of the incoming check-to-variable message is
maximized only for check-to-variable messages that are enforcing.
Second, the VNU uses the knowledge of enforcing check constraints both
to detect contradicting enforcing check constraints, and to make additional early
decisions for variables that receive enforcing check-to-variable messages. The
contradiction detection can be straightforwardly implemented for a three-input
VNU using 3(w − 1) + 6 AND gates and 3 XOR gates, where w is the wordlength
of the check-to-variable messages. The logic for the additional early decisions
can be implemented using, e.g., three one-bit two-to-one MUXes.
Third, the CNU detects the case where all incoming variable-to-check messages
are decided but the parity-check constraint is not satisfied. This can be done
using one additional AND gate to determine whether all incoming messages are
decided, and one AND gate to determine the inconsistency.
3.4 Hybrid decoder
With the early decision decoder as defined in this chapter, the sum-product algorithm can be implemented using the same hardware simply by adjusting the
threshold. Thus, the hybrid decoder can be implemented with some controlling
logic to clear the extrinsic message memories and change the threshold level when
a decoder failure occurs for the early decision decoder.
As seen in Sec. 3.5.3, the hybrid decoder offers gains only in the high SNR
region, as most blocks will fail to be decoded by both decoders in the low SNR
region. Thus, if channel information is available, the SNR estimation can be used
to determine whether the receiver should use the early decision algorithm, and
could also be used to set a suitable threshold. However, this idea has not been further
investigated in this thesis.
3.5 Simulation results
3.5.1 Choice of threshold
Figure 3.7 shows early decision decoding results for a (3, 6)-regular rate-1/2 code
with a block length of 1152 bits. In all decoding instances, the maximum number
of iterations was set to 100, and the early decision algorithm with thresholds of
4, 5, and 6 was simulated. While 100 iterations is generally not practically feasible,
the number was chosen to highlight the behaviour of the sum-product algorithm
and the complications resulting from the early decision modification.
It is obvious that early decision impacts the block error rate, shown in Fig. 3.7(a),
significantly more than the bit error rate, shown in Fig. 3.7(b). This is to be
expected, as the decoding algorithm may correct most of the bits even if some bit
is erroneously decided. The behaviour is consistent throughout all performed
simulations, and thus only the block error rate is shown in most cases.
If the LDPC code is used as an outer code in an error correction system, the
small amount of bit errors introduced by the early decision decoder can be corrected
by the decoder for the inner code. In other cases, hybrid decoding, as explained in
Sec. 3.2, may be efficiently utilized.
Figure 3.7 Early decision decoding results of a rate-1/2 regular code with
(N, j, k) = (1152, 3, 6). (a) Block error rate. (b) Bit error rate. (c) Average
number of internal messages. (d) Average number of iterations.
The average number of internal messages sent per block is shown in Fig. 3.7(c),
and the average number of iterations is shown in Fig. 3.7(d). It is evident that,
whereas the internal communication is reduced with decreasing thresholds, the
decoding time increases as the decoder may get stuck when an incorrect decision
is made. In the following figures, a measure of the relative communication
reduction is usually shown instead of the absolute number of internal messages.
The measure is defined as comm. red. = 1 − Cmod/Cref, where Cmod is the number
of internal messages using the modified algorithm, and Cref is the number of
internal messages using the reference algorithm. Results of reducing the number
of iterations using enforcing check constraints are shown in Sec. 3.5.2.
The performance of the early decision algorithm for a rate-3/4 (N, j, k) =
(1152, 3, 12) code is shown in Fig. 3.8(a), and for a rate-1/2 (N, j, k) = (9216, 3, 6)
code in Fig. 3.8(b). The performance for the higher rate code is significantly better
than for the rate-1/2 code of the same block size. Similarly, the performance of
the longer code is worse than for the shorter. It is intuitive that codes with higher
error correction capabilities (such as lower rate and larger block size) allow noisier
signals, which increases the chance of making wrong decisions, thereby reducing
performance.

Figure 3.8 Performance of the early decision decoding algorithm on codes with
high rates and codes with long block lengths. (a) Rate-3/4 code, (N, j, k) =
(1152, 3, 12). (b) Rate-1/2 code, (N, j, k) = (9216, 3, 6).

Figure 3.9 Error floor resulting from early decision decoding and making
decisions only in the initial iterations. (a) t = 3. (b) t = 4.
As described in Sec. 3.1.3, early decision decoding results in an error floor as
defined by (3.4). The error floor has been computed for two different thresholds for
the (N, j, k) = (1152, 3, 6) code, and is shown for t = 3 in Fig. 3.9(a) and for t = 4
in Fig. 3.9(b). As seen, the theoretical limit is tight when decisions are performed
only on the initial data (iters = 0). However, if decisions are performed in later
iterations, the block error rate increases rapidly. There are two reasons for this. First,
the messages no longer have a Gaussian distribution after the first iteration, and
the decision error probability is therefore no longer valid. Second, the dependence
of the messages caused by cycles in the graph causes the reliabilities of the bits
to escalate.

Figure 3.10 Early decision decoding of the (N, j, k) = (1152, 3, 6) code using
enforcing checks. (a) Block error rate. (b) Average number of iterations.

Figure 3.11 Early decision decoding of the (N, j, k) = (1152, 3, 12) code using
enforcing checks. (a) Block error rate. (b) Average number of iterations.
3.5.2 Enforcing check constraints
Enforcing check constraints were introduced in Sec. 3.1.4 as a way of reducing
the number of iterations required by the early decision algorithm when decoding
of a block fails. The results are shown for the (N, j, k) = (1152, 3, 6) code in
Fig. 3.10 and for the (N, j, k) = (1152, 3, 12) code in Fig. 3.11. By comparing
with Fig. 3.7(a) and Fig. 3.8(a) it can be seen that the error correction capability
of the decoder is not additionally worsened by the enforcing check constraints
modification. However, the average number of iterations is significantly reduced,
and for low SNRs the early decision algorithm requires even fewer iterations than
the standard algorithm, as graph inconsistencies are discovered for blocks that
the standard algorithm will not be able to decode successfully.

Figure 3.12 Early decision decoding of the (N, j, k) = (1152, 3, 6) and
(N, j, k) = (1152, 3, 12) codes using the enforcing check approximation, t = 4.
(a) (N, j, k) = (1152, 3, 6). (b) (N, j, k) = (1152, 3, 12).
The enforcing check approximation is investigated in Figs. 3.12(a) and 3.12(b)
for the (N, j, k) = (1152, 3, 6) and (N, j, k) = (1152, 3, 12) codes, respectively. It
is apparent that the approximation does not reduce the decoder performance, and
thus that additional state information to mark check-to-variable messages as
enforcing is not needed.
3.5.3 Hybrid decoding
In this section, results using the hybrid algorithm explained in Sec. 3.2 are presented.
The maximum number of iterations was set to 100 for both the early decision pass
and the standard algorithm. For the early decision algorithm, enforcing check
constraints have been used to reduce the number of iterations. Simulations have
been done on the three example codes with (N, j, k) = (1152, 3, 6), (1152, 3, 12),
and (9216, 3, 6). For each code, first the SNR has been kept constant while the
threshold has been varied. It can be seen that setting the threshold too low increases
the communication (reduces the communication reduction) as the early decision
algorithm will fail on most blocks. Similarly, setting the threshold too high also
increases the communication as bits with high reliabilities are not decided. Thus,
for a fixed SNR there will be an optimal choice of the threshold with respect to the
internal communication. For each code, SNRs corresponding to block error rates
of around 10−2 and 10−4 with the standard algorithm have been used.
Using the optimal threshold for a block error rate of 10−4, simulations with
varying SNR have been done.

Figure 3.13 Decoding of the (N, j, k) = (1152, 3, 6) code using the hybrid
algorithm, showing performance as a function of the threshold. (a) Block error
rate, communication reduction. (b) Average number of iterations.

Figure 3.14 Decoding of the (N, j, k) = (1152, 3, 6) code using the hybrid
algorithm, showing performance as a function of SNR for thresholds 4 (optimal)
and 5. (a) Block error rate, communication reduction. (b) Average number of
iterations.

It can be seen that the block error rate performance
of the hybrid algorithm is indistinguishable from that of the standard algorithm
(in the SNR region simulated), whereas the internal communication of the decoder
is much lower. In fact, for the rate-1/2 codes, no decoder errors have occurred, and
therefore all errors introduced by wrong decisions of the early decision algorithm
were detected. However, for the rate-3/4 code, undetected errors did occur, and
thus it is possible that the performance of the hybrid algorithm is inferior to that
of the standard algorithm.
Simulations for the (N, j, k) = (1152, 3, 6) code are presented in Figs. 3.13
and 3.14, for the (N, j, k) = (1152, 3, 12) code in Figs. 3.15 and 3.16, and for
the (N, j, k) = (9216, 3, 6) code in Figs. 3.17 and 3.18. In the plots, the average
number of iterations denotes the sum of the number of iterations required for the
early decision pass and the standard algorithm pass.

Figure 3.15 Decoding of the (N, j, k) = (1152, 3, 12) code using the hybrid
algorithm, showing performance as a function of the threshold. (a) Block error
rate, communication reduction. (b) Average number of iterations.

Figure 3.16 Decoding of the (N, j, k) = (1152, 3, 12) code using the hybrid
algorithm, showing performance as a function of SNR for thresholds 3.75
(optimal) and 4.5. (a) Block error rate, communication reduction. (b) Average
number of iterations.
3.5.4 Fixed-point simulations
The performance of a randomized QC-LDPC code with parameters (N, j, k) =
(1152, 3, 6) has been simulated using a fixed-point implementation. Among an
ensemble of 300 codes constructed using random parameters, the code with the
best block error correcting performance at an SNR of 2.6 was selected. The random
parameters were defined in Sec. 2.7, and are the initial values C^2_{x,y} and
C^3_{x,y} of the address generators AG2x,y and AG3x,y, the permutation functions
Ψl,n and Ωl,m, and the contents of the row and column permutation ROMs for
π3, ψl,n and ωl,m. Furthermore, the initial values of the address generators were
chosen to satisfy (2.25) to ensure high girth, as well as (3.8) to ensure that an
early decision decoder implementation is possible.

Figure 3.17 Decoding of the (N, j, k) = (9216, 3, 6) code using the hybrid
algorithm, showing performance as a function of the threshold. (a) Block error
rate, communication reduction. (b) Average number of iterations.

Figure 3.18 Decoding of the (N, j, k) = (9216, 3, 6) code using the hybrid
algorithm, showing performance as a function of SNR for thresholds 5 (optimal)
and 7. (a) Block error rate, communication reduction. (b) Average number of
iterations.
In Fig. 3.19, the performance of the randomized QC-LDPC code is compared
with the random LDPC code used in the earlier sections. It is seen that the
performances are comparable down to a block error rate of 10−6 . In the same
figure, the performances using a fixed-point implementation with wordlengths of 5
and 6 are shown. For both fixed-point simulations, two integer bits were used,
as increasing the number of integer bits did not increase the performance.

Figure 3.19 Decoding of the (N, j, k) = (1152, 3, 6) randomized QC-LDPC
code using the floating-point sum-product algorithm and fixed-point precision.
(w, wf) denotes quantization to w bits where wf bits are fraction bits.
(a) Block error rate. (b) Average number of iterations.

Down to a
block error rate of 10−4 , both fixed-point representations show a performance hit
of less than 0.1 dB. The performance difference between wordlengths of 5 and 6
bits is rather small, which is consistent with previous results [111], and thus w = 5
has been used for the following simulations. It is apparent that the fixed-point
implementations show an increased error floor, which is likely due to clipping of
the message magnitudes.
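The quantization used in these simulations can be sketched as follows; round-to-nearest rounding is an assumption for this Python illustration, as the exact rounding mode is not restated here.

def quantize_llr(x, w=5, wf=2):
    # Saturating two's-complement quantization to w bits with wf fraction
    # bits; for (w, wf) = (5, 2) the representable range is [-4.0, 3.75].
    # The saturation is the clipping blamed for the raised error floor.
    step = 2.0 ** -wf
    lo = -(2 ** (w - 1)) * step
    hi = (2 ** (w - 1) - 1) * step
    return max(lo, min(hi, round(x / step) * step))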
In Fig. 3.20, the performance of the hybrid early-decision sum-product decoder
is simulated for SNRs of 2.0 and 2.5, while the threshold is varied. Similarly to
the floating-point case, for a fixed SNR there is an optimal choice of the threshold
that minimizes the internal communication of the decoder. The limited increase
of the number of iterations for low thresholds is due to the use of enforcing check
constraints. The fixed-point implementation uses a different scaling of the input
due to the limited range of the data representation, and thus the thresholds used
in the fixed-point simulations are not directly comparable to the ones used in the
floating-point simulations.
Keeping the threshold constant and varying the SNR yields the results in
Fig. 3.21, where it can be seen that the use of early decision gives a somewhat
lower block error probability than using only the sum-product algorithm. This can
be explained by the fact that the increased probabilities of the early decided bits
increase the likelihood that the surrounding bits take consistent values. Thereby,
in some cases, the early decision decoder converges to the correct codeword where
the sum-product decoder would fail. Regarding the reduction of internal communication, results close to those obtained using floating-point data representation are
achieved.
Figure 3.20 Decoding of the (N, j, k) = (1152, 3, 6) randomized QC-LDPC
code using (w, wf) = (5, 2) precision and the hybrid algorithm, as a function of
the threshold. (a) Block error rate, communication reduction. (b) Average
number of iterations.

Figure 3.21 Decoding of the (N, j, k) = (1152, 3, 6) randomized QC-LDPC
code using (w, wf) = (5, 2) precision and the hybrid algorithm, as a function of
SNR. (a) Block error rate, communication reduction. (b) Average number of
iterations.
3.6 Synthesis results
The reference sum-product decoder and the early decision decoder as described
in Sec. 2.7 and Sec. 3.3, respectively, have been implemented in a Xilinx Virtex 5
FPGA for the randomized QC-LDPC code in the previous section. A threshold
of t = 8 was used at an SNR of 3 dB. The utilization of the FPGA is shown
in Table 3.1, along with energy estimations obtained with Xilinx’ power analyzer
XPower.

Table 3.1 Synthesis results for the reference sum-product decoder and the
early-decision decoder in a Xilinx Virtex 5 FPGA.

                                   Reference decoder   Early-decision decoder
Slice utilization [%]              14.5                21.0
Number of block RAMs               54                  54
Maximum frequency [MHz]            302                 220
Logic energy [pJ/iteration/bit]    499                 291
Clock energy [pJ/iteration/bit]    141                 190

As seen, the early-decision algorithm reduces the logic energy dissipation
drastically. Unfortunately, the control overhead is relatively large, resulting in a
slice utilization overhead of 45% and a net energy reduction of only 16% when the
increased clock distribution energy is considered. However, some points are worth
mentioning:
• In order not to restrain the architecture to short codes, the hard decision and
early decision memories and accompanying logic in each memory block were
designed in a general and scalable way. However, with the code size used,
this resulted in two four-byte memories with eight eight-bit shift registers
used to schedule the memory accesses (cf. Fig. 3.6). As the controlling logic
is independent of the code size, it can be expected that the early decision
overhead would be reduced with larger codes. Alternatively, for short codes,
implementing the hard decision and early decision memories using individual
registers might also significantly reduce the overhead.
• The addresses to the hard decision and early decision memories were generated individually from the address generators, along with control signals to
the shift registers. However, a more efficient implementation utilizing sharing
between different memory blocks is expected to be possible.
4 Rate-compatible LDPC codes
In this chapter, new results related to rate-compatible LDPC codes are presented.
In Sec. 4.1, an ILP optimization approach to the design of puncturing patterns
for QC-LDPC codes is proposed. The algorithm generates puncturing sequences
that are optimal in terms of recovery time and eliminates the element of randomness that is inherent in some of the greedy algorithms [46, 47]. In Sec. 4.2, an
improved algorithm for the decoding of rate-compatible LDPC codes is presented.
The algorithm systematically creates a new parity-check matrix used by the decoding algorithm. In the new matrix, the punctured nodes are not present, and
thereby the convergence speed is significantly improved. In Sec. 4.3, a check-serial
architecture for the decoding of rate-compatible LDPC codes using the proposed
check-merging algorithm is presented. The proposed architecture is an extension
of the work done in [66]. Finally, Sec. 4.4 and Sec. 4.5 contain simulation results of the check-merging decoding algorithm and synthesis results of the decoder
architecture, respectively.
Previous work in the area is presented in [43] and [60]. In [43], a method of designing rate-less codes through check node splitting is proposed. Using the method,
high-degree check nodes are split in two with equal weight to add additional parity bits to the code. In [60], a decoding algorithm using check node merging for
dual-diagonal LDPC codes is suggested and the algorithm is analyzed using density
evolution.
Section 4.1 has been published in [15], and Sec. 4.3 has been published in [10].
Figure 4.1 Recovery tree of a 2-SR node. The filled nodes denote punctured
bits in the codeword. Bits that are recovered after k iterations are denoted
k-SR. Recovering check nodes are denoted k-SE.
4.1 Design of puncturing patterns
The concept of k-SR nodes was explained in Sec. 2.5.1. Here, an ILP-based optimization algorithm for the design of puncturing sequences is proposed. The algorithm uses the number of k-SR nodes as a cost measure. The primary goal of the
algorithm is to optimize the puncturing pattern for fast recovery of the punctured
nodes. Given a parity-check matrix, it maximizes the number of 1-SR nodes. Following that, it maximizes the number of punctured nodes for increasing values of
K, given the puncturing patterns obtained in earlier iterations and the constraint
that only k-SR nodes with k ≤ K are allowed.
In Sec. 4.1.1, the needed notation is introduced. In Sec. 4.1.2, the optimization
problem with variable definitions, goal function and constraints is described. Also,
design cuts to decrease the optimization time are discussed. Then, the proposed algorithm for designing the puncturing patterns using the ILP optimization problem
is presented in Sec. 4.1.3.
4.1.1 Preliminaries
Using a message passing algorithm to decode a punctured block, the check nodes
adjacent to a punctured bit node will propagate null information until the punctured variable node has received a non-zero message from another check node.
Thus, the concept of recoverable nodes (k-SR nodes) was introduced in Sec. 2.5.1.
Here, the concept is extended for check nodes in order to enable the optimization
problem formulation.
As shown in Fig. 4.1, a check node is denoted 1-step enabled (1-SE) if it involves
a single punctured node and is used to recover that node in the first iteration.
Furthermore, a punctured variable node is denoted 1-SR if it is connected to a
1-SE check node. Generally, a check node is denoted k-SE if it begins to communicate non-null information in the kth iteration and has been chosen to recover a
punctured node. A punctured variable node is denoted k-SR if it is connected to
a k-SE check node but not to any l-SE check node with l < k.
It is possible to create puncturing patterns that contain nodes that will never be
recovered. Such nodes thus never change value and severely impact the decoding
performance. In [46], it was shown that decreasing the recovery time increases the
reliability of the recovery information, and it was then conjectured that increasing
the number of nodes with fast recovery enhances the performance of the puncturing
pattern.
4.1.2 Optimization problem
The optimization problem is formulated to maximize the number of k-SR variable
nodes, where k < K for a given value of K. As secondary optimization criteria,
the algorithm chooses k-SR variable nodes and k-SE check nodes such that first
the sum of the degrees of the k-SR variable nodes is minimized, and then the sum
of the degrees of the k-SE check nodes is minimized, as low-degree nodes were
reported to give good results in [83].
The following constants are defined:
Hb: the base matrix of the code, as defined in Sec. 2.3.2
M:  the number of rows in Hb
N:  the number of columns in Hb
Ha: the z = 1 expansion of Hb (each −1 element replaced by 0 and each
    non-negative element replaced by 1)
Cp: the cost for each non-punctured node
Cr: the recoverability cost
Cv: the variable node degree cost
Cc: the check node degree cost
K:  the maximum allowed recovery delay (maximum k-SR node)
F:  the set of nodes for which puncturing is forced
Furthermore, the set Qm of involved variable nodes in check node m and the
set Rn of involved check nodes in variable node n are defined as
\[
Q_m = \{\, n : H_a(m, n) = 1 \,\} \tag{4.1}
\]
\[
R_n = \{\, m : H_a(m, n) = 1 \,\} \tag{4.2}
\]

The following binary variables are defined:

pn:   1 if variable node n is punctured, 0 otherwise
sk,n: 1 if variable node n is a k-SR node, 0 otherwise, k = 1, . . . , K
ck,m: 1 if check node m is a k-SE node, 0 otherwise, k = 1, . . . , K
rk,n: 1 if variable node n is still unrecovered at level k (after k iterations),
      0 otherwise, k = 0, . . . , K
The goal function G is to be minimized and is written
\[
G = C_p G_p + C_r G_r + C_v G_v + C_c G_c, \tag{4.3}
\]
where
\[
G_p = \sum_{n=0}^{N-1} (1 - p_n) \tag{4.4}
\]
\[
G_r = \sum_{k=1}^{K} \sum_{n=0}^{N-1} k\, s_{k,n} \tag{4.5}
\]
\[
G_v = \sum_{n=0}^{N-1} |R_n|\, p_n \tag{4.6}
\]
\[
G_c = \sum_{k=1}^{K} \sum_{m=0}^{M-1} |Q_m|\, c_{k,m} \tag{4.7}
\]
Gp maximizes the number of punctured nodes, whereas Gr ensures as fast recovery
as possible. Secondary objectives are achieved by Gv and Gc that minimize the
degrees of the punctured variable nodes and the check nodes used for recovery,
respectively. By setting Cp ≫ Cr ≫ Cv ≫ Cc it can be ensured that the partial
objectives are achieved with falling priorities.
The goal function is to be minimized subject to the following constraints:
Initialize unrecovered nodes: For n = 0, . . . , N − 1, let
\[
r_{0,n} = p_n. \tag{4.8}
\]
At level 0, only the non-punctured nodes (0-SR nodes) are recovered.
Recover nodes: For k = 0, . . . , K − 1 and n = 0, . . . , N − 1,
\[
r_{k+1,n} \ge r_{k,n} - s_{k,n}. \tag{4.9}
\]
This constraint makes sure that a punctured node will be unrecovered until the
level where it is marked as a k-SR node.
Require recovered neighbours: For m = 0, . . . , M − 1, k = 1, . . . , K and
n = 0, . . . , N − 1,
\[
r_{k,n} \ge -|Q_m|\,(2 - s_{k,n} - c_{k,m}) + \sum_{n' \in Q_m \setminus n} r_{k-1,n'}. \tag{4.10}
\]
Thus, if check node m is used to recover variable node n at level k, the first term
of the right-hand side will be zero, and the variable node is allowed to be recovered
(rk,n = 0) only if the check node's other neighbours are already recovered (the sum
is zero). If check node m is not k-SE or variable node n is not k-SR, the condition
imposes no additional constraints.
Require k-SE neighbour: For k = 1, . . . , K and n = 0, . . . , N − 1,
\[
s_{k,n} \le \sum_{m \in R_n} c_{k,m}. \tag{4.11}
\]
The constraint states that, in order for a variable node n to be k-SR, one of its
check node neighbours must be k-SE.
Require recovering of all nodes: For n = 0, . . . , N − 1,
\[
r_{K,n} = 0, \tag{4.12}
\]
which states that all variable nodes must be recovered at level K.
Forced punctures: For n ∈ F , pn = 1. Thus, for the nodes in the set F ,
puncturing is forced. However, the recovery level is not constrained.
In addition to the optimization constraints, two cuts can be used to decrease
the optimization time without affecting optimality of the results.
Cut 1: For m = 0, . . . , M − 1, k = 0, . . . , K − 1 and n ∈ Qm ,
\[
c_{k+1,m} \ge r_{k,n} - \sum_{n' \in Q_m \setminus n} r_{k,n'}. \tag{4.13}
\]
The cut implies that, whenever a variable node n is the only unrecovered node in
check m at level k, check node m must be (k + 1)-SE.
Cut 2: For m = 0, . . . , M − 1, k = 0, . . . , K − 1 and n ∈ Qm ,
\[
s_{k+1,n} \ge r_{k,n} - \sum_{n' \in Q_m \setminus n} r_{k,n'}. \tag{4.14}
\]
The cut implies that, whenever a variable node n is the only unrecovered node in a
check at level k, variable node n must be (k + 1)-SR.
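To make the formulation concrete, the following Python sketch builds the ILP with the PuLP modelling package. The thesis used the SCIP solver; PuLP, the function name, and the particular cost values are assumptions made purely for illustration. Constraint (4.9) is entered in the equivalent re-indexed form r_{k,n} ≥ r_{k−1,n} − s_{k,n}, so that the index range of s matches its definition; constraint (4.10) is applied to the neighbours n ∈ Qm of each check, where the sum is meaningful; and the cuts (4.13)-(4.14) are omitted for brevity.

import pulp

def build_puncturing_ilp(Ha, K, F=(), Cp=1e6, Cr=1e3, Cv=10.0, Cc=1.0):
    M, N = len(Ha), len(Ha[0])
    Q = [[n for n in range(N) if Ha[m][n]] for m in range(M)]  # (4.1)
    R = [[m for m in range(M) if Ha[m][n]] for n in range(N)]  # (4.2)

    prob = pulp.LpProblem("puncturing_pattern", pulp.LpMinimize)
    p = [pulp.LpVariable("p_%d" % n, cat="Binary") for n in range(N)]
    s = {(k, n): pulp.LpVariable("s_%d_%d" % (k, n), cat="Binary")
         for k in range(1, K + 1) for n in range(N)}
    c = {(k, m): pulp.LpVariable("c_%d_%d" % (k, m), cat="Binary")
         for k in range(1, K + 1) for m in range(M)}
    r = {(k, n): pulp.LpVariable("r_%d_%d" % (k, n), cat="Binary")
         for k in range(K + 1) for n in range(N)}

    # Goal function (4.3)-(4.7); Cp >> Cr >> Cv >> Cc realizes the
    # falling priorities of the partial objectives.
    prob += (Cp * pulp.lpSum(1 - p[n] for n in range(N))
             + Cr * pulp.lpSum(k * var for (k, n), var in s.items())
             + Cv * pulp.lpSum(len(R[n]) * p[n] for n in range(N))
             + Cc * pulp.lpSum(len(Q[m]) * var for (k, m), var in c.items()))

    for n in range(N):
        prob += r[0, n] == p[n]                            # (4.8)
        prob += r[K, n] == 0                               # (4.12)
        for k in range(1, K + 1):
            prob += r[k, n] >= r[k - 1, n] - s[k, n]       # (4.9), re-indexed
            prob += s[k, n] <= pulp.lpSum(c[k, m] for m in R[n])  # (4.11)

    for m in range(M):
        for k in range(1, K + 1):
            for n in Q[m]:                                 # (4.10)
                prob += (r[k, n] >=
                         -len(Q[m]) * (2 - s[k, n] - c[k, m])
                         + pulp.lpSum(r[k - 1, x] for x in Q[m] if x != n))

    for n in F:
        prob += p[n] == 1                                  # forced punctures
    return prob, p

Solving the model (prob.solve()) and reading the values of p then yields the puncturing pattern for the given K, corresponding to Step 2 of the design procedure in Sec. 4.1.3.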
4.1.3 Puncturing pattern design
The following algorithm is proposed for the design of puncturing sequences for a
class of codes based on a base matrix Hb with different expansion factors z.
1. Initialize the puncturing sequence p = () and set the maximum recovery delay
K = 1. Set the costs Cp ≫ Cr ≫ Cv ≫ Cc .
2. Set the forced puncturing pattern F = p and solve the optimization problem
on Hb .
3. Let q denote the puncturing pattern solution. Order q \ p first by increasing
recovery delay and then by increasing node degree, and append the sequence
to p.
4. If a pre-defined maximum recovery delay K has been reached, quit. Otherwise, increase K and go to Step 2.
68
CHAPTER 4. RATE-COMPATIBLE LDPC CODES
The result of the above procedure is an ordered sequence p defining a puncturing
order for the blocks of the base matrix Hb . For the parity-check matrix H with
expansion factor z, the puncturing sequence pz is defined by replacing each element
a in p with the sequence za, . . . , z(a + 1) − 1.
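The expansion step can be written in a couple of lines of Python (an illustrative helper, not part of the thesis framework):

def expand_puncturing_sequence(p, z):
    # Replace each base-matrix index a by the codeword indices
    # za, ..., z(a+1)-1.
    pz = []
    for a in p:
        pz.extend(range(z * a, z * (a + 1)))
    return pz

For example, expanding the length-11 base sequence of (4.25) with z = 24 yields a puncturing sequence of 264 codeword bits for the block length 576.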
Simulation results of the proposed puncturing pattern design algorithm are
presented in Sec. 4.4.1.
4.2 Check-merging decoding algorithm
Assume that a code defined by a parity-check matrix H of size M × N is used for a
communications system. Also assume that signal conditions allow for a puncturing
pattern p = (p0 , . . . , pL−1 ) to be used. The sum-product decoding algorithm is
defined on H and initializes the prior log-likelihood ratios of the punctured nodes
γp0 , . . . , γpL−1 to zero. However, by merging the check nodes involving each punctured bit and purging the rows and columns associated with them, a modified
parity-check matrix HP of size (M − L) × (N − L) can be defined. Decoding using HP instead of H can in many circumstances reduce the convergence time for
the decoding algorithm significantly, while the computational complexity of each
iteration is reduced.
The main motivation for the proposed algorithm is systems with varying signal
conditions, i.e., most wireless systems. In such systems, rate-compatible codes can
be used to adapt the protection better to the current channel. The result is that
most of the transmitted blocks use some amount of puncturing, and then increasing
the throughput of the decoder for punctured blocks is motivated.
4.2.1 Defining HP
In the following, let all matrix operations be defined using modulo-2 arithmetic.
HP can be defined if and only if the columns of H corresponding to the punctured
bits p are linearly independent. The implications of the requirement and how to
construct puncturing patterns that satisfy it are discussed in Sec. 4.2.3.
Consider a sub-matrix B of size M × L formed by taking the columns of H
corresponding to the punctured bits p. By assumption, the columns of B are
linearly independent. Thus, B has a set of L rows that are linearly independent.
Let c = (c0 , . . . , cL−1 ) denote such a set. c is thus the set of check nodes to purge
along with the punctured variable nodes.
Given a set of punctured variable nodes p = (p0 , . . . , pL−1 ) and purged check
nodes c = (c0 , . . . , cL−1 ), perform column and row reordering of H such that the
punctured variable nodes p and the purged check nodes c end up at the right and
bottom, respectively. Denote the resulting matrix H̃. H̃ thus has the following
structure:
\[
\tilde{H} = \begin{bmatrix} H_A & B_A \\ H_B & B_B \end{bmatrix}, \tag{4.15}
\]
where BA and BB have sizes of (M − L) × L and L × L, respectively. Further, BB
is non-singular as its rows are linearly independent.
Define the matrix A of size M × M as
\[
A = \begin{bmatrix} I & B_A \\ 0 & B_B \end{bmatrix}, \tag{4.16}
\]
where I is the identity matrix of size (M −L)×(M −L). Since BB is a non-singular
matrix, A is also non-singular and its inverse is
\[
A^{-1} = \begin{bmatrix} I & B_A B_B^{-1} \\ 0 & B_B^{-1} \end{bmatrix}. \tag{4.17}
\]
In order to define HP , first define the matrix H̃R of size M × N as
\[
\tilde{H}_R = A^{-1} \tilde{H} = \begin{bmatrix} H_A + B_A B_B^{-1} H_B & 0 \\ B_B^{-1} H_B & I \end{bmatrix}. \tag{4.18}
\]
As H̃R is obtained by reversible row operations on H, the matrices have the
same null space and thus define the same code. However, considering the behavior
of sum-product decoding with punctured bits over H̃R , two observations can be
made. First, consider the check-to-variable message βcl n0 from the purged check
node cl to one of its non-punctured neighbors n0 ∈ N (cl ) \ pl , as depicted in
Fig. 4.2(a):
\[
\begin{aligned}
\beta_{c_l n_0} &= \Biggl(\prod_{n' \in N(c_l) \setminus n_0} \operatorname{sign}\alpha_{n' c_l}\Biggr) \cdot \Phi\Biggl(\sum_{n' \in N(c_l) \setminus n_0} \Phi\bigl(|\alpha_{n' c_l}|\bigr)\Biggr) \\
&= \Biggl(\prod_{n' \in N(c_l) \setminus n_0} \operatorname{sign}\alpha_{n' c_l}\Biggr) \cdot \Phi\Biggl(\Phi\bigl(|\alpha_{p_l c_l}|\bigr) + \sum_{n' \in N(c_l) \setminus \{n_0, p_l\}} \Phi\bigl(|\alpha_{n' c_l}|\bigr)\Biggr) = 0,
\end{aligned} \tag{4.19}
\]
since Φ (|αpl cl |) = Φ (γpl ) = Φ(0) = ∞. Thus, the purged check nodes c affect
neither the computations of the variable-to-check messages nor the pseudo-posterior
likelihoods at their neighboring variable nodes.
Second, consider the variable nodes involved in check node cl , as depicted in
Fig. 4.2(b). The hard-decision of pl can be written
\[
\operatorname{sign}\lambda_{p_l} = \operatorname{sign}\beta_{c_l p_l}
= \prod_{n \in N(c_l) \setminus p_l} \operatorname{sign}\alpha_{n c_l}
= \prod_{n \in N(c_l) \setminus p_l} \operatorname{sign}\lambda_n. \tag{4.20}
\]
Thus, in a converged state, where all non-purged check nodes are fulfilled and the
hard-decision values of the variable nodes do not change, cl will eventually also
be fulfilled. It follows from these two observations that the purged check nodes
do not affect the behavior of the decoding algorithm, apart from possibly delaying
convergence. HP is thus defined as the matrix H̃R with the purged check nodes
and punctured variable nodes removed:
\[
H_P = H_A + B_A B_B^{-1} H_B. \tag{4.21}
\]
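As a numerical illustration, the construction of HP can be carried out with plain Gaussian elimination over GF(2). The following Python/numpy sketch selects a set c of linearly independent rows and evaluates (4.21); the helper names (gf2_solve, independent_rows, merge_checks) are illustrative and not from the thesis.

import numpy as np

def gf2_solve(A, B):
    """Solve A X = B over GF(2) for non-singular A (Gauss-Jordan)."""
    A = A.copy() % 2
    X = B.copy() % 2
    n = A.shape[0]
    for col in range(n):
        pivot = col + np.argmax(A[col:, col])   # find a row with a 1
        if A[pivot, col] == 0:
            raise ValueError("matrix is singular over GF(2)")
        A[[col, pivot]] = A[[pivot, col]]
        X[[col, pivot]] = X[[pivot, col]]
        rows = np.nonzero(A[:, col])[0]
        rows = rows[rows != col]
        A[rows] ^= A[col]                       # eliminate the column
        X[rows] ^= X[col]
    return X

def independent_rows(B):
    """Greedily pick L rows of the M x L matrix B that are linearly
    independent over GF(2); they become the purged check nodes c."""
    Bw = B.copy() % 2
    chosen = []
    for col in range(B.shape[1]):
        for row in range(B.shape[0]):
            if row not in chosen and Bw[row, col]:
                chosen.append(row)
                others = np.nonzero(Bw[:, col])[0]
                others = others[others != row]
                Bw[others] ^= Bw[row]
                break
        else:
            raise ValueError("punctured columns are linearly dependent")
    return chosen

def merge_checks(H, p):
    """Return HP = HA + BA inv(BB) HB (mod 2), as in (4.21)."""
    H = np.asarray(H) % 2
    p = list(p)
    q = [n for n in range(H.shape[1]) if n not in p]   # kept columns
    c = independent_rows(H[:, p])
    a = [m for m in range(H.shape[0]) if m not in c]   # kept rows
    HA, BA = H[np.ix_(a, q)], H[np.ix_(a, p)]
    HB, BB = H[np.ix_(c, q)], H[np.ix_(c, p)]
    return (HA + BA @ gf2_solve(BB, HB)) % 2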
Figure 4.2 Check-to-variable messages from purged check nodes. (a) Message
from purged check node cl to non-punctured node n0. (b) Message from purged
check node cl to punctured node pl.
Figure 4.3 Merging of check nodes for punctured variable nodes. (a) Degree-2
variable node. (b) Degree-3 variable node.
4.2.2 Algorithmic properties of decoding with HP
In the case when all punctured variable nodes are of degree 2, decoding over HP has
some interesting properties. Consider first the case when the punctured variable
nodes do not share any check node and H is free of length-4 cycles. Given these assumptions, BB is a permuted identity matrix and BA is a permuted identity matrix
with possibly additional zero-filled rows. In this case, HP = HA + BA B_B^T HB, and
each purged check node cl is merged with the other neighbor of its corresponding
punctured variable pl, as shown in Fig. 4.3(a).
Denoting the variable and check node operations of node k by Vk and Ck,
respectively, consider the messages β^i_{m0 n0} in iteration i in the top and bottom
graphs in Fig. 4.3(a). In the top graph, using H,
\[
\begin{aligned}
\beta^{i}_{m_0 n_0} &= C_0\bigl(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, \alpha^{i-1}_{n_3 m_0}\bigr) \\
&= C_0\bigl(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, V_3\bigl(\beta^{i-1}_{m_1 n_3}\bigr)\bigr) \\
&= C_0\bigl(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, \beta^{i-1}_{m_1 n_3}\bigr) \\
&= C_0\bigl(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, C_1\bigl(\alpha^{i-2}_{n_4 m_1}, \alpha^{i-2}_{n_5 m_1}, \alpha^{i-2}_{n_6 m_1}\bigr)\bigr) \\
&= C_0\bigl(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, \alpha^{i-2}_{n_4 m_1}, \alpha^{i-2}_{n_5 m_1}, \alpha^{i-2}_{n_6 m_1}\bigr),
\end{aligned} \tag{4.22}
\]
whereas in the bottom graph, using HP,
\[
\beta^{i}_{m_0 n_0} = C_0\bigl(\alpha^{i-1}_{n_1 m_0}, \alpha^{i-1}_{n_2 m_0}, \alpha^{i-1}_{n_4 m_0}, \alpha^{i-1}_{n_5 m_0}, \alpha^{i-1}_{n_6 m_0}\bigr). \tag{4.23}
\]
The operations performed when decoding using HP are thus the same as when
decoding using H, and differ only in the speed of the propagation of the messages. It
can thus be expected that decoding using HP achieves convergence in fewer iterations
than using H. As seen in Sec. 4.4, this is also the case.
When the punctured variable nodes share check nodes, (4.23) can be shown to
still hold by repeated application of (4.21) for independent sets of nodes. However,
as (4.21) may introduce length-4 cycles, (4.22) may possibly contain duplicate
messages which are not present in (4.23). Although the presence of duplicate
messages results in a computational change in the algorithm, the defined code is
the same.
The necessary conditions on punctured degree-2 variable nodes are relatively
strict, but are often met in practice due to the simplified encoder designs for these
types of codes [69]. Puncturing of higher-degree nodes is possible and an example
of a degree-3 node is shown in Fig. 4.3(b). However, for higher-degree nodes (4.18)
normally introduces length-4 cycles which reduce the performance of the decoding
algorithm.
4.2.3 Choosing the puncturing sequence p
The requirement that the columns in H corresponding to the punctured nodes
must be linearly independent is related to the concepts of k-SR variable nodes
(defined in Sec. 2.5.1) and k-SE check nodes (defined in Sec. 4.1.1). However,
the requirement of linear independence is less strict. First, it is shown that the
columns in H corresponding to a set of k-SR nodes are linearly independent. Then
the implications of linear independence in terms of recoverability are discussed.
Assume that the parity-check matrix H is given, and let p denote a set of
recoverable punctured nodes. Further, let c denote the set of check nodes used
for the recovery of the punctured nodes. Denote the k-SR nodes by pk and the
corresponding k-SE check nodes by ck . Now, define H̃ as in Sec. 4.2.1 and consider
the BB sub-matrix. Obviously, the rows c1 and columns p1 corresponding to the
1-SR nodes form the identity matrix. Generally, the checks ck can only contain
l-SR nodes with l < k, in addition to the k-SR node each is recovering. It follows that
BB is upper triangular with 1s along the diagonal, and is therefore non-singular.
Figure 4.4 Example of non-recoverable punctured nodes.
Thus, as a set of recoverable nodes implies that HP can be defined, any of the
available algorithms to determine k-SR puncturing sequences can be used also to
determine sets of punctured nodes used for check-merged decoding.
The converse, however, is not true. Consider for example the punctured variable
nodes shown in Fig. 4.4. None of the nodes are recoverable, as all their check
node neighbors require at least one other of the punctured variable nodes to be
recovered in order to produce non-zero messages. However, the parity-check matrix
corresponding to the graph is
\[
H = \begin{bmatrix}
\cdots & 1 & 1 & 0 \\
\cdots & 1 & 1 & 1 \\
\cdots & 0 & 1 & 1 \\
\cdots & 0 & 0 & 0 \\
 & \vdots & \vdots & \vdots \\
\cdots & 0 & 0 & 0
\end{bmatrix}, \tag{4.24}
\]
where the three rightmost columns corresponding to the punctured variable nodes
are linearly independent.
4.3 Rate-compatible QC-LDPC code decoder
In this section, a check-merging decoder architecture for QC-LDPC codes is presented. The architecture is based on the reference decoder in Sec. 2.8 (also in [66]),
and includes modifications to enable check-merging. The modifications require
some logic overhead, but the benefit is a significantly increased throughput of the
decoder on punctured blocks. The increased throughput comes from three sources:
• faster propagation of messages due to merged check nodes, leading to a decrease in the average number of iterations,
• a reduction in the code complexity due to removed nodes, reducing the number of clock cycles for each iteration, and
• the elimination of idle waiting cycles between layers.
4.3.1 Decoder schedule
The schedule of the decoder and relevant definitions are described in Sec. 2.8.1.
Figure 4.5 Schedule of the decoder on a check node merged parity-check matrix.
Figure 4.5 shows the resulting parity-check matrix when the bits in the rightmost sub-matrix in Fig. 2.25 have been punctured and the last two layers merged.
As can be seen, the merged layer consists of the sums of the corresponding rows
in the original parity-check matrix. The result is a parity-check matrix that may
contain sub-matrices that have column weights larger than one. The bit-sum blocks
in a check-node group may then overlap, which the architecture has to be aware
of. Whereas the original architecture computes updated bit-sums directly, the
proposed architecture computes bit-sum differences which are then added to the old
bit-sums to produce the updated ones. The proposed architecture thus efficiently
handles the case where a variable node is involved simultaneously in several check
nodes.
As the check nodes in a check node group are processed in parallel, check
node merging can only be used if all the check nodes in the check node group
are purged. This is not a major limitation, however, as puncturing sequences are
normally defined on the base matrix of the code and then expanded to the used
block length. Thus, at most one check node group needs to have both purged and
non-purged check nodes, and this check node group is then processed by initializing
the punctured variable nodes to zero.
4.3.2 Architecture overview
An overview of the check node merging decoder architecture is shown in Fig. 4.6.
At the start of each block, the contents of the bit-sum memories are initialized
with the log-likelihood ratios received from the channel. Then the contents of the
bit-sum memories are updated during a number of iterations. In each iteration,
the hard-decision values are also written to the decision memory, from where they
can be read when decoding of the block is finished.
In each iteration, bit-sums are read from the bit-sum memories, routed by a
cyclic shifter to the correct check function unit (CFU), and then routed back.
The output of the CFUs are bit-sum differences which are accumulated for each
read bit-sum value in the bit-sum update block. The updated bit-sums are then
rewritten to the bit-sum memories. The two counters count the current input and
output indices of the bit-sum blocks of the currently processed check node group.

Figure 4.6 Architecture of the decoder utilizing check merging.
The following parameters are defined for the architecture:
• Z is the maximum expansion factor
• Q is the parallelization factor
• C is the maximum check node degree
• wq is the bit-sum wordlength
• wr is the wordlength of the min-sum magnitudes
• we is the edge index wordlength
4.3.3 Cyclic shifters
As in the original architecture [66], the shifters are implemented as barrel shifters
with programmable wrap-around. However, unlike in the original architecture, the
return path needs a cyclic shift, as a bit-sum may need updates produced in several
different CFUs.
4.3.4 Check function unit
In the CFU, shown in Fig. 4.7, the old check-to-variable messages are stored for
each of the check nodes’ neighbors. As the min-sum algorithm is used, only two
values are needed. These are stored as wr -bit magnitudes along with a we -bit
index of the node with the minimum message and one bit containing the check
node parity. These bits are grouped together and stored in the min-sum memory.
Figure 4.7 Check function unit.
Along with the signs of each check-to-variable message stored in the S memory,
the check message generator generates the old check-to-variable message in two's
complement (wr + 1)-bit representation for each variable node neighbor. The old
check-to-variable messages are subtracted from the bit-sum input to generate the
variable-to-check messages in the bit message generator. These are used by the
min-sum update unit to compute the new min-sum values, which are subsequently
stored in the min-sum memory. They are also sent to the second check message
generator, which generates the new check-to-variable messages. The difference
between the old and new check-to-variable messages is computed, and forms the
output of the CFU.

Figure 4.8 Bit-sum update unit.
The check message generator consists of a MUX to choose the correct
check-to-variable message and a converter from signed magnitude to two's
complement. The bit message generator computes the variable-to-check message
by adding the inverted old check-to-variable message to the bit-sum and then
converting the result back to signed magnitude representation. The min-sum
update unit is unchanged from the original in [66], and performs a basic search
for the two minimum magnitudes in its input. A programmable offset value is
used to correct for the overly optimistic estimates of the min-sum algorithm.
Also, the parity of the check node is computed.
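The search performed by the min-sum update unit can be sketched as follows. This is a behavioural Python model; the offset handling is simplified compared to the hardware, where the correction is applied when messages are regenerated.

def minsum_update(v2c_msgs, offset=0):
    # Track the two smallest magnitudes, the index of the smallest, and
    # the parity (XOR of the sign bits) over all incoming messages.
    min1 = min2 = float("inf")
    idx = -1
    parity = 0
    for i, msg in enumerate(v2c_msgs):
        mag = abs(msg)
        parity ^= 1 if msg < 0 else 0
        if mag < min1:
            min1, min2, idx = mag, min1, i
        elif mag < min2:
            min2 = mag
    # Offset correction of the overly optimistic min-sum estimates,
    # floored at zero.
    return max(min1 - offset, 0), max(min2 - offset, 0), idx, parity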
4.3.5 Bit-sum update unit
In Fig. 4.8, the bit-sum update unit is shown. When processing of a check node
group starts, the bit-sums are loaded in the shift registers with a delay equal to
the current node degree, and thus equal to the delay of the CFUs. Then, when
the bit-sum differences from the CFUs start to arrive, they are added to the old
bit-sums. The sum is truncated and sent to the output to be written back to the
bit-sum memories. When several CFUs have updates to the same bit-sum, the
updated bit-sum is reloaded in the shift register to replace the obsolete value.
4.3.6 Memories
The bit-sums are organized into Q memory banks with data wordlengths of wq bits. Each memory l contains the bit-sums of the variable nodes vk where k = l mod Q. The min-sums are organized into Q memory banks with data wordlengths of we + 1 + 2wr bits, which holds a we-bit edge index, 1-bit parity, and two wr-bit
Figure 4.9 Comparison of puncturing sequences for the WiMax rate-1/2 code punctured to rates 2/3 and 3/4. The solid lines are the optimized sequences and the dotted lines are obtained using the heuristic in [46].
magnitudes. Each memory l contains the min-sums of the check nodes ck where k = l mod Q. (For standard codes and parallelization degrees, these memory arrangements may result in sparsely used memories; to circumvent this, the same techniques as in [66] may be used.)
4.4 Simulation results

4.4.1 Design of puncturing sequences
The optimization algorithm in Sec. 4.1 has been applied to the WiMax rate-1/2
base matrix ((N, M ) = (24, 12)) to create a base puncturing pattern of length 11,
with which code rates up to 0.92 can be obtained. The base puncturing pattern
was expanded with factors of 24, 36 and 48 to create puncturing patterns for block
lengths of 576, 864 and 1152. To solve the optimization problem, the SCIP solver [2]
has been used. Solving the problem for the WiMax rate-1/2 base matrix was done
in seconds on a desktop computer, and resulted in the puncturing sequence
p = (13, 16, 19, 22, 14, 17, 20, 23, 6, 15, 8).    (4.25)
The performance of the codes was evaluated using the sum-product decoding algorithm with a maximum of 100 iterations. For all of the measurements, a minimum of 100 failed decoding attempts were observed.
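The expansion of the base puncturing pattern to a specific block length can be sketched as follows. This is an illustrative Python fragment that assumes bit i of base-matrix column c maps to bit index c·z + i for expansion factor z; the exact bit ordering used in the thesis is not specified here.

def expand_puncturing_pattern(base_pattern, z):
    # All z bits of each punctured base column are punctured together,
    # in the order given by the base pattern.
    return [c * z + i for c in base_pattern for i in range(z)]

p = [13, 16, 19, 22, 14, 17, 20, 23, 6, 15, 8]   # base pattern from (4.25)
punct_576 = expand_puncturing_pattern(p, 24)     # pattern for N = 24*24 = 576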
In Fig. 4.9, puncturing sequences for the (N, K) = (576, 288) code punctured
to rate 2/3 and 3/4 have been compared. The figure shows the ILP-optimized
1-SR     4     5     6     7
Number   130   372   385   113

Table 4.1 Number of 1-SR nodes for 1000 runs of the algorithm [46] applied to the WiMax rate-1/2 base matrix. The proposed algorithm achieves 8 1-SR nodes.

Rate      0.75   0.8   0.86   0.92
[46]      2      2–3   2–4    6
Proposed  1      2     2      4

Table 4.2 Recovery delay (maximum k-SR node) for different punctured rates of the WiMax rate-1/2 base matrix.
sequence together with an ensemble of ten random sequences obtained using the
greedy algorithm in [46]. In order to provide a fair comparison with scalable block
sizes, the greedy algorithm was applied on the base matrix and the obtained base
puncturing patterns were expanded with the matrix expansion factor. For rate 2/3,
the performance of the optimized sequence is comparable to those obtained with
the heuristics, whereas for rate 3/4, the performance of the optimized sequence is
superior to those obtained with the heuristics. There is a small trade-off between
performance at different rates, which partly explains why the optimized sequence is
not superior for all rates. However, it is also the case that the algorithm optimizes
recovery time, which may not necessarily result in optimal performance.
In Table 4.1, the number of 1-SR nodes obtained with the algorithm in [46] is
shown to vary between 4 and 7. In contrast, the optimization achieves 8 1-SR nodes.
In addition, as seen in Table 4.2, the recovery delay is commonly higher for the
greedy algorithm, leading to shorter decoding times for the optimized puncturing
sequences.
In Fig. 4.10(a), the rate-1/2 length-576 WiMax code punctured to rates 2/3, 3/4, and 5/6 has been compared to the dedicated codes of the standard. It is seen
that the performance difference is small for the lower rates.
In Fig. 4.10(b), the performance loss of puncturing the rate-1/2 WiMax code to
rate 3/4 is shown for block sizes of 576, 864 and 1152. The puncturing patterns are
obtained from the same base puncturing pattern, showing the possibility of using
the proposed scheme with scalable block sizes.
4.4.2 Check-merging decoding algorithm
The proposed check-merging decoding algorithm in Sec. 4.2 has been applied to
the WiMAX rate-1/2 code punctured to rates 3/5 and 3/4. The resulting codes
have been simulated over an AWGN channel and decoded with the sum-product
algorithm on both the rate-1/2 graph and the check-merged graph. The block error
Figure 4.10 Performance of ILP-based puncturing scheme. (a) Comparison of the punctured rate-1/2 length-576 WiMax code with dedicated rates of the standard. (b) Performance of the rate-1/2 WiMax code punctured to rate 3/4 for block sizes of 576, 864 and 1152.
Figure 4.11 Simulation results of check-merging decoding algorithm. (a) Block error rate of WiMAX rate-1/2 code punctured to rates 3/5 and 3/4. (b) Average number of iterations for WiMAX rate-1/2 code punctured to rates 3/5 and 3/4.
rate and average number of iterations have been computed. Simulations using a
maximum of 20 iterations are shown in Figs. 4.11(a) and 4.11(b), respectively.
The main benefit is the significant reduction of the average number of iterations,
allowing an increase in the throughput.
4.5 Synthesis results of check-merging decoder
The proposed check-merging architecture in Sec. 4.3 has been synthesized for the
LDPC codes used in IEEE 802.16e and IEEE 802.11n in order to determine the
logic overhead required to support check-merged decoding. For IEEE 802.16e, a
parallelization factor of Q = 24 has been used. For IEEE 802.11n, the maximum
Figure 4.12 Maximum check node degrees of check node merged matrices for IEEE 802.16e codes and IEEE 802.11n codes.
length codes of N = 1944 have been assumed, and in order to more accurately conform to the data rate requirements, a larger parallelization factor of Q = 81 has been used. For all implementations, wq = 8, wr = 5 and we = 5.
4.5.1 Maximum check node degrees
The maximum check node degree supported by the decoder is determined by the
length of the register chain in the bit-sum update unit. Therefore, a limit must be set
based on the degree distributions of the check node merged matrices. Figure 4.12
shows the maximum check node degrees as a function of the punctured rate for the
considered codes of the IEEE 802.16e and IEEE 802.11n standards. In both cases,
C = 28 has been considered a suitable trade-off between decoder complexity and
achievable rates.
4.5.2 Decoding throughput
As the complexity of the code is reduced when checks are merged, the attainable
throughput increases as the punctured rate increases. This gain is in addition to
the faster convergence rate of the check merging decoder. Assuming a fixed number
of 10 iterations, the minimum achievable throughputs for IEEE 802.16e and IEEE
802.11n are shown in Fig. 4.13.
4.5.3 FPGA synthesis
The reference decoder (Sec. 2.8) and the proposed check merging decoder have been
synthesized for an Altera Cyclone II EP2C70 FPGA with speedgrade -6. Mentor
Precision was used for the synthesis, and the results are shown in Tables 4.3 and 4.4
for IEEE 802.16e and IEEE 802.11n, respectively. The C values for the reference
decoders were set to the maximum check node degrees of the fixed rate codes in the
Figure 4.13 Information bit throughput of decoder on IEEE 802.16e codes and IEEE 802.11n codes.
Table 4.3 Synthesis results of decoders for IEEE 802.16e codes.

                            Q    C    LUTs   Regs   Mem. bits
Fixed rate (Sec. 2.8)       24   20   4750   6303   20640
Rate-compatible (Sec. 4.3)  24   28   5988   8015   20480

Table 4.4 Synthesis results of decoders for IEEE 802.11n codes.

                            Q    C    LUTs    Regs    Mem. bits
Fixed rate (Sec. 2.8)       81   23   16396   21865   66560
Rate-compatible (Sec. 4.3)  81   28   22884   28635   66336
respective standards. For 802.16e, the overhead of the check merging decoder is 26% and 27% for the LUTs and registers, respectively. For 802.11n, the overhead is larger at 40% and 31% for the LUTs and registers, respectively. A significant number of the LUTs are consumed in the barrel shifters.
5 Data representations
In this chapter, the representation of the data in a fixed-point decoder implementation is discussed. It is shown that the usual data representation is redundant,
and that in many cases coding of the data can be applied to reduce the width of
the data buses without sacrificing the error-correcting performance of the decoder.
This chapter has been published in [7].
5.1 Fixed wordlength
The sum-product algorithm, as formulated in the log-likelihood domain in (2.13)–
(2.15), lends itself well to fixed wordlength implementations. Usually, decent
performance is achieved using wordlengths as short as 4–6 bits [111]. Thus, the
additions can be efficiently implemented using ripple-carry adders, and the domain
transfer function Φ(x) can be implemented using direct gate-level synthesis.
5.2 Data compression
The function Φ(x) is shown in Fig. 5.1. In a hardware implementation, usually the
output of this function is stored in memories between the phases, and it is therefore
of interest to represent the information efficiently. However, as the function is
Figure 5.1 The domain transfer function Φ(x).
Figure 5.2 Data-paths in a sum-product decoder. (a) Conventional data-path. (b) Data-path utilizing compression.
nonlinear, all possible words will not be represented at the output, and there is therefore
sometimes an opportunity to code the data.
A general model of the data flow in a sum-product decoder is shown in Fig. 5.2(a).
The VNU block performs the computation of the αnm messages in (2.13), whereas
the CNU block performs the additions for the βmn messages in (2.14). Φ(x) is
implemented by the LUT blocks, and can be part of either the VNU block or the
CNU block. The dashed lines 1 and 2 show the common partitions used in decoder
implementations, e.g., in [109] and [68], respectively. Instead of these partitions,
the implementation in Fig. 5.2(b) is suggested, where an encoder is used to convert
the messages to a more compact representation which is used to communicate the
Figure 5.3 Redundant bits of discretized domain transfer function Φ(x). (a) Redundant bits B(wi, wf) of LUT value entries for different data representations. (b) Redundant bits of LUT value entries for different logarithm bases using data representations with wi = 2, wf = 2, and wi = 2, wf = 3.
messages between the processing units. At the destination, a decoder is used to
convert the coded data back to the original message.
Denoting the number of integer bits in the uncoded data representation by wi, and the number of fractional bits by wf, the number of redundant bits in the output of the LUT blocks can be written

B(wi, wf) = log2(2^(wi+wf)/No),    (5.1)

where No is the number of unique function values of the discretized function Φ(x).
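The quantity in (5.1) is straightforward to evaluate numerically. The sketch below assumes the standard sum-product domain transfer function Φ(x) = −log(tanh(x/2)) and plain rounding to the quantization grid; the exact discretization used in the thesis may differ, so the numbers are indicative only.

import math

def redundant_bits(wi, wf, base=math.e):
    # Evaluate B(wi, wf) = log2(2^(wi+wf)/No) for a quantized
    # Phi(x) = -log_base(tanh(x/2)) over the positive input words.
    step = 2.0 ** -wf
    n_words = 2 ** (wi + wf)
    outputs = set()
    for i in range(1, n_words):                # skip x = 0, where Phi diverges
        phi = -math.log(math.tanh(i * step / 2), base)
        outputs.add(min(round(phi / step), n_words - 1))  # quantize, saturate
    return math.log2(n_words / len(outputs))

print(redundant_bits(2, 2))   # roughly 1.2 redundant bits for wi = wf = 2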
5.3 Results
B(wi, wf) is plotted for different parameters in Fig. 5.3(a). Obviously, B(wi, wf) also depends on the base of the logarithm in Φ(x) described in Sec. 2.6.5. However, as the domain transfer functions differ when the logarithm is not the natural logarithm, B(wi, wf) will be different for the two LUTs. The representation with wi = 2 and wf = 2 was shown in [111] to be a good trade-off between the performance and complexity of the decoder, and the number of redundant bits as a function of the logarithm base is shown in Fig. 5.3(b). As shown, for the considered representation with wf = 2, compression can be combined with logarithm scaling for bases from 2.4 to 3.2 without precision loss. Outside this interval, approximation of the domain transfer function is needed in one direction.
Assuming that the natural logarithm base and a data representation with wi = 2 and wf = 2 are used, messages consist of 6 bits (including the sign bit and the hard decision/parity-check bit), and compression thus results in a wordlength reduction of 16.7%.
Figure 5.4 Distributions of arguments to discretized domain transfer function Φ(x). (a) Distribution of input values to CNU. (b) Distribution of input values to VNU.
The area overhead associated with the separation of the look-up table has been estimated by synthesis to standard cells in a 0.35µm CMOS process.
The synthesis was performed using Synopsys Design Compiler. The synthesis of the
original look-up table required 12 cells utilizing a total area of 726µm2 . Realized as
separate blocks using a straightforward mapping of data to the intermediate compressed format, the encoder and decoder parts utilized 8 cells occupying 528µm2
and 6 cells occupying 309µm2, respectively. As a comparison, a 6-input CNU and a 3-input VNU occupy roughly 39000µm2 and 31000µm2, respectively, when synthesized using the same methods. Thus, the area overhead of 111µm2 per look-up table amounts to a 1–1.5% area increase for the processing elements. In contrast,
problems with large routing overheads are commonly reported in implementations,
with overheads as large as 100% in the fully parallel (N, K) = (1024, 512) code
decoder in [18].
In addition to reducing the wordlength of messages, the separation of the look-up table also allows choosing a representation suitable for energy-efficient communication, and, as shown above, the separation is essentially free.
Consider the distribution of look-up table values for CNU input (Fig. 5.4(a)) and
VNU input (Fig. 5.4(b)) obtained using an (N, K) = (1152, 576) code. It is obvious
that the energy dissipation for communication of messages will depend on the
encoding chosen. For example, in a parallel architecture the most common values
can be assigned representations with a small mutual distance, whereas in a partly
parallel or serial architecture an asymmetric memory [22] might be efficiently used
if low-weight representations are chosen for the most common values.
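As a toy illustration of the last point, a low-weight code assignment can be derived directly from a value histogram. The histogram below is made up, and the mapping ignores all hardware constraints; it only shows the principle of giving the most frequent values the lowest-weight codewords.

def assign_low_weight_codes(freqs, width):
    # Map values to codewords so that frequent values get low Hamming weight.
    codes = sorted(range(2 ** width), key=lambda c: bin(c).count("1"))
    by_freq = sorted(freqs, key=freqs.get, reverse=True)
    return {v: codes[i] for i, v in enumerate(by_freq)}

# Hypothetical message-value histogram (value -> relative frequency):
mapping = assign_low_weight_codes({0: 0.55, 1: 0.25, 2: 0.15, 3: 0.05}, width=3)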
6 Conclusions and future work
In this chapter, the work presented in this part of the thesis is concluded, and
suggestions for future work are given.
6.1 Conclusions
In part I of this thesis, two improved algorithms for the decoding of LDPC codes
were proposed. Chapter 3 concerns the early decision algorithm, which reduces the
computational complexity by deciding certain parts of the codeword early during
decoding. Furthermore, the early decision modification was combined with the
sum-product algorithm to form a hybrid decoding algorithm which does not visibly
reduce the performance of the decoder. For a regular (N, j, k) = (1152, 3, 6) code,
an internal communication reduction of 40% was achieved at a block error rate of
10−6 . In general, larger reductions are obtainable for codes with higher rates and
smaller block sizes.
In Chapter 4, the check merging decoding algorithm for rate-compatible LDPC
codes was suggested. The algorithm proposes a technique of merging check nodes
of punctured LDPC codes, thereby reducing the decoding complexity. For dual-diagonal LDPC codes, the algorithm offers a reduced decoding complexity due to
two factors: the complexity of each iteration is reduced due to the removal of nodes
in the code’s Tanner graph, and the number of iterations required for convergence
is reduced due to faster propagation of messages in the graph. For the rate-1/2 length-576 IEEE 802.16e code punctured to rate 3/4, the check merging algorithm offers a more than 60% reduction in the average number of iterations at a block error rate of 10−3 over an AWGN channel.
The early decision and check merging algorithms have both been implemented on FPGAs. In both cases, the proposed algorithms come with a logic overhead compared with their respective reference implementations. The early decision algorithm was implemented for a (3, 6)-regular code class, whereas the check merging algorithm was implemented for the QC-LDPC codes with dual-diagonal structure used in the IEEE 802.16e and IEEE 802.11n standards. The logic overhead of the early decision algorithm was relatively large at 45%, whereas the logic overhead of the check merging algorithm was 26% and 40% for the IEEE 802.16e and IEEE 802.11n implementations, respectively.
6.2 Future work
The following ideas are identified as possibilities for future work:
• Removing early decisions when graph inconsistencies are encountered. This
is likely necessary for the early decision algorithm to work well with codes
longer than 1000–2000 bits.
• The behavior of the early decision algorithm using irregular codes and efficient
choices of thresholds should be analyzed.
• The check merging algorithm works well when the punctured nodes are of degree two, but investigating the possibilities of adapting the algorithm to work with punctured nodes of higher degrees is of interest.
• For both of the proposed algorithms, their performance over fading channels
is of interest.
Part II
High-speed analog-to-digital conversion
7 Introduction

7.1 Background
The analog-to-digital converter (ADC) is the interface between the analog signal
domain and the digital processing domain, and is used in all digital communications
systems. Normally, an ADC consists of a sample-and-hold stage and a quantization
stage, where the sample-and-hold performs discretization in time and the quantization stage performs discretization of the signal level. There are many different
types of ADCs, and some of the more common ones are:
• the flash ADC, which consists of a number of comparators with different
reference signal levels, followed by a thermometer-to-binary decoder. The
flash ADC is fast, but achieves low resolution as the number of comparators
grows exponentially with the required number of bits.
• the successive approximation ADC, which uses an internal digital-to-analog converter (DAC) fed from a register containing the current approximation of
the analog input. In each clock cycle, a comparator compares the input to
the current approximation, deciding one bit of the output. The successive
approximation ADC is slower than the flash ADC, but can achieve medium
resolutions.
• the pipeline ADC, which consists of several cascaded flash ADCs. In each
stage, the signal is discretized by a coarse flash ADC, and the approximation
is then subtracted from the signal and the difference is passed on to the
following stages. The pipeline ADC thus combines the speed of the flash
ADC with the resolution of the successive approximation ADC.
• the sigma-delta ADC (Σ∆-ADC), which is essentially a low-resolution flash
ADC with a negative feed-back loop for the quantization error. The feedback
allows the quantization noise to be frequency shaped, which is combined with
oversampling to place the majority of the quantization noise in a frequency
band unoccupied by the signal. This requires the ADC to be followed by a
filter to remove the quantization noise and a downsampling stage to reduce
the sampling frequency to a useful rate. Due to the oversampling, the achievable speed of a Σ∆-ADC is limited, but the architecture is normally resistant
to analog errors, allowing high resolutions to be obtained.
The resistance to analog errors of the Σ∆-ADC is attractive from a system-on-chip point of view, as digital circuits are almost exclusively made in cost-efficient CMOS processes with small feature sizes, in which linear analog devices are difficult to design. Σ∆-ADCs allow the requirements on the analog devices to be relaxed, with errors instead compensated for by digital signal processing, for which the CMOS processes are well suited. Therefore, attempts have recently been made to build high-speed Σ∆-ADCs [19, 82, 103] for general communications standards.
The work in this part of the thesis considers the design of high-speed ADCs
through parallel Σ∆-modulators, and the design of high-speed FIR filters with
short wordlengths which are usable as decimation filters for high-speed ADCs.
The proposed parallel Σ∆-modulator designs use modulation of the input signal
to reduce matching requirements of the analog devices, and a general method to
analyze the matching requirements of such systems is presented. The targeted data
converters have bandwidths of tens of MHz and sampling frequencies in the GHz
range, making both the linearity of analog devices and the implementation of digital
decimation filters difficult. The proposed design of high-speed FIR filters uses bit-level optimization of the placement of full and half adders in the implementation.
7.2 Applications
There are many applications that could benefit from system-on-chip integration
of transceiver frontends, but the main target is area- and power-constrained devices such as handhelds and laptops. Such devices support an increasing number
of communication standards such as GSM and UMTS for voice and data communications, wireless LAN for high-speed internet access, Bluetooth for peripheral connections and GPS for positioning services. These standards have wildly differing requirements on bandwidth and resolution, making the Σ∆-ADC attractive
as these measures can be traded by adjusting the width of the digital decimation
filter.
7.3 Scientific contributions
There are two main contributions in this part of the thesis. In Chapter 10, a method
of analyzing general parallel ADCs for matching requirements is presented. This
is done by a multi-rate formulation of the parallel system, and the stability of
subsets of the channels can then be determined by observing the transfer function
matrix for the sub-system. Time-interleaved, Hadamard-modulated and frequency-band decomposed parallel ADCs are then described as special cases of the general
system, and their resistances to analog imperfections are analyzed. The multi-rate
formulation is then used as a base to find insensitive systems.
The second main contribution is in Chapter 12 and considers the design of high-speed FIR filters that may be used as decimation filters for high-speed Σ∆-ADCs. The problem of designing an efficient filter is decomposed into two parts, where the first part considers generation of a set of weighted bit-products and the second part considers the construction of a pipelined summation tree. In the first part, the filter is first formulated as a multi-rate system. Then, the branch-filters are implemented using one of several structures, including direct form FIR and transposed direct form FIR, and bit-products are generated for each of the coefficients. Finally, bit-products with the same bit weight are merged. In the second part, the summation of the generated bit-products is considered. Traditional methods including Wallace trees and Dadda trees are used as references, and a bit-level optimization algorithm is proposed that minimizes a cost function based on the number of full adders, half adders and pipeline registers. Finally, the suitability of the different structures is evaluated for decimation and interpolation filters of varying wordlengths, rate change factors and pipeline depths.
8 FIR filters
In this chapter, the basics of finite impulse response (FIR) filters are discussed. The definition of a filter from an impulse response is in Sec. 8.1, and the z-transform is introduced as a tool to analyze the behavior of a filter. In Sec. 8.2, some design methods for FIR filters are briefly touched upon. In Sec. 8.3, the basics of sampling rate conversion and multirate signal processing are discussed, and in Sec. 8.4, the main architectures for the realization of FIR filters are introduced. Also, a bit-level realization of an FIR filter for decimation is shown, where the realization uses multirate theory to reduce the arithmetic complexity. More in-depth introductions to digital filters are available in [78], and multirate theory is extensively discussed in [94].
8.1 FIR filter basics
A digital filter is a discrete linear time-invariant system that is typically used to
amplify or suppress certain frequencies of a signal. Digital filters can be partitioned
into two main classes: finite impulse response (FIR) filters and infinite impulse response (IIR) filters. FIR filters often result in more complex implementations than
IIR filters, for a given filter specification. However, FIR filters have other advantages such as better phase characteristics, better stability, fewer finite-wordlength
considerations and better pipeline abilities. The work in this thesis considers FIR
95
96
CHAPTER 8. FIR FILTERS
filters only.
8.1.1 FIR filter definition
An FIR filter [78] can be characterized by its impulse response h(k), a scalar discrete
function of k. For a causal FIR filter of order N , h(k) = 0 when k < 0 or k > N , and
the impulse response thus has at most N + 1 nonzero values. The impulse response
defines factors with which different delays of the input should be multiplied, and
the output is the sum of these. Thus, denoting the input and output by x(n) and
y(n), respectively, the behavior of an N th order filter with impulse response h(k)
is defined by

y(n) = Σ_{k=0}^{N} x(n − k) h(k).    (8.1)
It can be seen that when the input is the Kronecker delta function (x(n) = δ(n)),
the output is the impulse response (y(n) = h(n)).
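The definition (8.1) translates directly into code. The following minimal Python sketch evaluates the convolution sum for a causal filter and checks the Kronecker delta property stated above.

def fir_filter(h, x):
    # y(n) = sum_{k=0}^{N} x(n - k) h(k), with x(n) = 0 for n < 0.
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += x[n - k] * hk
        y.append(acc)
    return y

h = [0.25, 0.5, 0.25]
delta = [1.0, 0.0, 0.0, 0.0]          # Kronecker delta input
assert fir_filter(h, delta)[:3] == h  # output reproduces the impulse response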
8.1.2 z-transform
The impulse response describes the filter’s behavior in the time domain, but does
not say much about a filter’s behavior in the frequency domain. Thus, in order
to analyze a filter’s frequency characteristics, the z-transform H(z) of the impulse
response h(k) is defined as
H(z) = Σ_{k=−∞}^{∞} h(k) z^{−k},    (8.2)

and the input and output are related by the equation

Y(z) = H(z)X(z).    (8.3)
The frequency response of the filter is then H(e^{jω}), which is common to visualize as the magnitude response |H(e^{jω})| and the phase response arg H(e^{jω}). It is also possible to visualize the filter's frequency characteristics by a pole-zero diagram. This is done by rewriting H(z) as

H(z) = Σ_{k=0}^{N} h(k) z^{−k} = (Σ_{k=0}^{N} h(k) z^{N−k}) / z^N,    (8.4)

where the zeros and poles are the zeros of the numerator and the denominator, respectively. Since the denominator is a single power of z, all the poles are at the origin, making the filter unconditionally stable.
Figure 8.1 28th order linear phase FIR filter with impulse response, frequency response and pole-zero diagram. (a) Impulse response h(k). (b) Magnitude response |H(e^{jω})|. (c) Pole-zero diagram. (d) Phase response arg H(e^{jω}).
8.1.3 Linear phase filters
One of the advantages of FIR filters is the ability to design filters with linear phase.
Linear phase filters ensure that the delay is the same for all frequencies, which is
often desirable in data communications applications. A linear phase filter results
when the impulse response coefficients are either symmetric or anti-symmetric
around the middle, and are commonly designed using the MPR algorithm (see
Sec. 8.2).
Consider the example design of a 28th order linear phase lowpass FIR filter shown in Fig. 8.1. The figure shows the impulse response h(k), the magnitude response |H(e^{jω})|, the pole-zero diagram of H(z) and the phase response arg H(e^{jω}). The linear phase property can be seen in both the coefficient symmetry in the impulse response in Fig. 8.1(a) and in the phase response in Fig. 8.1(d).
Figure 8.2 Rate change elements. (a) L-fold upsampler. (b) L-fold downsampler.
8.2 FIR filter design
There are many ways to design an FIR filter. In this section, some of the methods more commonly used when optimizing for a desired magnitude response are given. A desired real non-negative magnitude response function Hd(e^{jω}) is given along with a weighting function W(e^{jω}). The optimization process then tries to design the filter H(e^{jω}), as defined by (8.2), to match Hd(e^{jω}) as closely as possible, with allowed relative errors for different frequencies defined by the weighting function W(e^{jω}). Using a mini-max goal function, the maximum weighted error over the frequency band is minimized for a given filter order, i.e.,

minimize  max_{ω∈[0,2π]} W(e^{jω}) |H(e^{jω}) − Hd(e^{jω})|.    (8.5)

In certain cases, it is more beneficial to minimize the error energy rather than the maximum error. This leads to a least-squares optimization goal, which can be defined as

minimize  ∫_{0}^{2π} W(e^{jω}) |H(e^{jω}) − Hd(e^{jω})|^2 dω.    (8.6)

In this thesis, the McClellan-Parks-Rabiner (MPR) algorithm [84, 85] has been used for the design of optimal mini-max FIR filters.
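For reference, the MPR (Parks-McClellan) exchange algorithm is available as scipy.signal.remez; a lowpass design in the same spirit as the example in Fig. 8.1 might look as follows. The band edges and weighting are arbitrary example values, not the specification used in the thesis.

from scipy.signal import remez

# 29-tap (28th order) linear phase lowpass filter: passband up to 0.2*fs,
# stopband from 0.3*fs, with the stopband error weighted 10 times harder.
h = remez(numtaps=29, bands=[0.0, 0.2, 0.3, 0.5],
          desired=[1.0, 0.0], weight=[1.0, 10.0])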
8.3 Multirate signal processing

8.3.1 Sampling rate conversion
In a system utilizing multiple sampling rates, rate changes are modelled by the
upsampler and downsampler elements, shown in Fig. 8.2. The L-fold upsampler
inserts zeros in the input stream, such that

y(n) = x(n/L) for n = 0, ±L, ±2L, . . ., and y(n) = 0 otherwise.    (8.7)

In the z-domain, (8.7) becomes

Y(z) = X(z^L),    (8.8)

and the spectrum of y(n) thus consists of L copies of the spectrum of x(n).
Figure 8.3 Conventional interpolation and decimation of a signal. (a) L-fold interpolation. (b) L-fold decimation.
The L-fold downsampler retains every Lth sample of its input and discards the
others, such that
y(m) = x(mL).    (8.9)
In the z-domain, (8.9) becomes
Y(z) = (1/L) Σ_{l=0}^{L−1} X(e^{−j2πl/L} z^{1/L}),    (8.10)
and the output spectrum thus consists of L stretched and shifted versions of the
input spectrum.
It can be seen that both the upsampler and downsampler affect the spectrum
of the input: the output of the upsampler contains replicas of the input spectrum,
whereas the output of the downsampler contains aliasing. Thus, when the upsampler and downsampler are used to interpolate or decimate a signal, the signal must
also be filtered to remove the replicas (for the upsampler) or to remove the aliasing
(for the downsampler). Typically, the filter is a lowpass filter, but can also be a
highpass filter or any bandpass filter with a bandwidth of at most 2π/L. For the
interpolator, filtering is performed on the output as shown in Fig. 8.3(a), whereas
it is performed on the input for the decimator as shown in Fig. 8.3(b).
For both the interpolator and the decimator, it can be seen that the filtering
is performed at the higher sampling rate. It can also be noted that the filtering is
unnecessarily complex in both cases. For the interpolation filter it is known that
L − 1 of L input samples are zero and do not contribute to the output. For the
decimation filter, L − 1 of L output samples are computed, but then discarded by
the downsampler. In Sec. 8.3.3, it is shown how this knowledge can be used to
reduce the complexities of the filters.
8.3.2 Polyphase decomposition
For a filter h(k) with z-transform H(z), (8.2) can be rewritten

H(z) = Σ_{k=−∞}^{∞} h(k) z^{−k} = Σ_{l=0}^{L−1} z^{−l} Σ_{k=−∞}^{∞} h(Lk + l) z^{−Lk} = Σ_{l=0}^{L−1} z^{−l} Hl(z^L),    (8.11)

where Hl(z) is the z-transform of the lth polyphase component

hl(k) = h(Lk + l),  l = 0, 1, . . . , L − 1.    (8.12)
Figure 8.4 Noble identities.
This decomposition is usually called the Type-1 polyphase decomposition [94],
and is often used to parallelize the implementations of FIR filters [78]. It is also
normally used in the implementations of FIR rate change filters, as it allows the
implementation to retain the complexity, but run at a lower sampling rate.
8.3.3 Multirate sampling rate conversion
The upsampler and downsampler are time-varying operations [94], and therefore operations can in general not be moved through them. However, the Noble identities [94] allow filters and rate change operations to exchange places under certain conditions. The Noble identities are shown in Fig. 8.4.
Using polyphase decomposition and the Noble identities, the complexities of
the interpolator and decimator in Fig. 8.3 can be reduced significantly. The transformation of the interpolator is shown in Fig. 8.5, where the interpolation filter
H(z) is first polyphase decomposed with a factor of L. The resulting subfilters
Hl (z) have the same total complexity as the original filter H(z). Then the upsampler is moved to after each subfilter, reducing the sampling frequencies of the
filters by a factor of L. Finally, analyzing the structure at the output it is realized
that only one branch is non-zero at a given time and can therefore be realized as a
commutator between the branches. The transformation of the decimator is shown
in Fig. 8.6 and is analogous.
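The equivalence underlying Fig. 8.6 is easy to verify numerically. The sketch below, assuming the Type-1 polyphase components of (8.12) and zero initial conditions, checks that filtering followed by downsampling equals the polyphase branch structure.

import numpy as np

def decimate_direct(h, x, L):
    # Filter with h, then keep every Lth output sample.
    return np.convolve(h, x)[:len(x)][::L]

def decimate_polyphase(h, x, L):
    # Branch l filters x_l(m) = x(mL - l) with h_l(k) = h(kL + l).
    n_out = (len(x) + L - 1) // L
    y = np.zeros(n_out)
    for l in range(L):
        hl = h[l::L]                       # polyphase component (8.12)
        xl = np.zeros(n_out)
        for m in range(n_out):
            if 0 <= m * L - l < len(x):
                xl[m] = x[m * L - l]
        y += np.convolve(hl, xl)[:n_out]
    return y

rng = np.random.default_rng(0)
h, x, L = rng.normal(size=12), rng.normal(size=64), 3
assert np.allclose(decimate_direct(h, x, L), decimate_polyphase(h, x, L))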
8.4 FIR filter architectures

8.4.1 Conventional FIR filter architectures
The most straightforward implementation of an FIR filter is the direct form architecture, which is a direct realization of the filter definition (8.1). The architecture is
shown in Fig. 8.7, and has a delay line for the input signal x(n), forming a number
of delayed input signals x(n−k). The delay line is tapped at a number of positions,
where the delayed input is scaled by the appropriate impulse response coefficient,
and all products are then summed together and form the output y(n).
The transposed direct form structure can be obtained from the direct form
structure using the transpose operation [78], which for systems with a single input and output results in an arithmetically equivalent system. The transposed
direct form architecture is shown in Fig. 8.8. The transposed direct form may lend
itself better to high-speed implementations, as the critical path of the adders at
Figure 8.5 Interpolator realized using polyphase decomposition. (a) Conventional interpolation. (b) Polyphase decomposition of filter. (c) Moving upsampler to after subfilters. (d) Replacing output structure by commutator.
the output of the direct form structure may become significant and require extra
pipelining registers.
8.4.2 High-speed FIR filter architecture
In the case of implementations with high speed and short wordlengths, it may
be more beneficial to consider realizations on the bit level rather than the word
level. One such architecture is described here, and is also used as a base for the
work in Chapter 12. First, the detailed derivation of a decimation filter on direct
form is described, and then the resulting architecture for a transposed direct form
realization is shown. Finally, the realizations of single rate and interpolation filters
are discussed.
Consider the structure in Fig. 8.9(a), corresponding to the arithmetic part of the
multirate decimation filter in Fig. 8.6. The system has L inputs x0 (n), . . . , xL−1 (n),
L independent subfilters H0 (z), . . . , HL−1 (z), and one output y(n). In Fig. 8.9(b),
Figure 8.6 Decimator realized using polyphase decomposition. (a) Conventional decimation. (b) Polyphase decomposition of filter. (c) Moving downsampler to before subfilters. (d) Replacing input structure by commutator.
Figure 8.7 Direct form FIR filter architecture.
Figure 8.8 Transposed direct form FIR filter architecture.
Figure 8.9 High-speed realization of FIR decimation filter on direct form using partial product generation and carry-save adders. (a) Multirate filter structure. (b) Direct form realization of subfilters. (c) Realization using partial product generation and carry-save adder tree.
Figure 8.10 High-speed architecture of FIR decimation filter on transposed direct form.
the subfilters have been realized on direct form. Now, instead of directly computing all products of the inputs and coefficients, partial products corresponding to the non-zero bits of the coefficients are generated. The detailed generation of the partial
products depends on the representation of the data and the coefficients, and is
described in Sec. 12.1.2. However, the result is a number of partial products with
different bit weights, represented by the partial product vectors in Fig. 8.9(c) where
wd is the input data wordlength, Nm is the number of partial products of bit weight
m and w is the wordlength of y(n). Typically, Nm = 0 for some m. The partial
products are merged in a carry-save adder tree, which results in a still-redundant
output of two vectors of length wi ≤ w. Finally, the vectors are merged by a vector
merge adder to a non-redundant output y(n).
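The bit-level view can be made concrete with a small sketch. The fragment below assumes unsigned data and unsigned coefficients and simply collects the bit weights of all partial products; summing 2^w over the collected weights must equal the word-level result, which is what the adder tree computes in hardware. Signed representations and the actual adder-tree structure are deliberately left out.

def partial_product_weights(x_words, h_words):
    # For y = sum_k x_k * h_k: a set data bit i and coefficient bit j
    # produce a partial product of bit weight i + j.
    weights = []
    for xk, hk in zip(x_words, h_words):
        for i in range(xk.bit_length()):
            if (xk >> i) & 1:
                for j in range(hk.bit_length()):
                    if (hk >> j) & 1:
                        weights.append(i + j)
    return weights

x_words, h_words = [5, 9, 3], [3, 1, 6]
y = sum(2 ** w for w in partial_product_weights(x_words, h_words))
assert y == sum(a * b for a, b in zip(x_words, h_words))   # 15 + 9 + 18 = 42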
A similar architecture on transposed direct form is shown in Fig. 8.10. In the
transposed direct form architecture, the delay elements are at the output, and the
partial product generation thus consists of several partitions of partial products
corresponding to different delays. Each partition is reduced by a carry-save adder
tree to a redundant form of two vectors in order to reduce the number of registers
required.
The described architectures can be used also for single rate filters, simply by
letting L = 1. It is also possible to realize an interpolation filter (Fig. 8.5) as L
independent branches. However, for a direct form architecture, the delay elements
of the input can typically be shared between the branches.
9 Sigma-delta data converters
In this chapter, a brief introduction to Σ∆-ADCs is given. The basics of Σ∆-modulators are given in Sec. 9.1. Signal and noise transfer functions of the modulator are defined, and expressions for the resulting quantization noise power and modulator SNR are obtained. In Sec. 9.2, some common structures of Σ∆-modulators
are shown. In Sec. 9.3, ADC systems using several Σ∆-modulators in parallel
are introduced. Finally, Sec. 9.4 gives some notes on implementations of decimation filters for Σ∆-ADCs. More in-depth analyses of Σ∆-ADCs are given in,
e.g., [21, 81, 88].
9.1 Sigma-delta data conversion

9.1.1 Sigma-delta ADC overview
The Σ∆-ADC is today often the preferred architecture for realizing low- to medium-speed analog-to-digital converters (ADCs) with effective resolution above 12 bits. Higher resolution than this is difficult to achieve for non-oversampled ADCs without laser trimming or digital error correction, since device matching errors of semiconductor processes limit the accuracy of critical analog components [88]. The
Σ∆-ADC can overcome this problem by combining the speed advantage of analog
circuits with the robustness and accuracy of digital circuits. Through oversampling
Figure 9.1 Model of first order Σ∆-ADC.
Figure 9.2 Linear model of the first order Σ∆-modulator in Fig. 9.1.
and noise shaping, the Σ∆-modulator converts precise signal waveforms to an oversampled digital sequence where the information is localized in a narrow frequency
band in which the quantization noise is heavily suppressed.
One of the prices to pay for these advantages is the required digital decimator
operating on high sampling rate data. For Nyquist-rate CMOS ADCs, the power
consumption increases approximately by a factor of four when increasing the resolution by one bit [93]. Hence, the power consumption of accurate Nyquist-rate
ADCs tends to become very high. On the other hand, the analog circuitry of
oversampled Σ∆-modulators does not need to be accurate and power savings can
therefore be made, in particular for continuous-time Σ∆-modulators [53].
A model of a first order Σ∆-ADC is shown in Fig. 9.1. The differentiator
at the input subtracts the value of the previous conversion from the input, and
the difference is integrated. Following the integration, the signal is sampled and
quantized by a local ADC. The result is that the error of the previous conversion is
subtracted from the next conversion, thus causing a suppression of the quantization
error at low frequencies. Instead, the error appears at the higher frequencies. In
the digital domain, a decimation filter can efficiently remove the noise, resulting
in a signal that can achieve high resolution. However, in order to achieve high
resolution, the sampling frequency has to be significantly larger than the signal
bandwidth. This often causes the power dissipation of the digital decimation filter
to constitute a major part of the power dissipation of the ADC.
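The behavior described above can be reproduced with a few lines of code. This is a purely behavioral Python sketch of a single-bit first order loop; it captures the error feedback qualitatively but models no circuit-level effects.

import numpy as np

def first_order_sigma_delta(x):
    # Integrate the difference between the input and the fed-back output,
    # then quantize to +/-1 (1-bit quantizer and ideal 1-bit DAC).
    integ, y = 0.0, np.empty(len(x))
    for n, xn in enumerate(x):
        fb = y[n - 1] if n > 0 else 0.0
        integ += xn - fb
        y[n] = 1.0 if integ >= 0 else -1.0
    return y

t = np.arange(4096)
x = 0.5 * np.sin(2 * np.pi * t / 256)   # slow (oversampled) sine input
y = first_order_sigma_delta(x)          # in-band content of y tracks x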
9.1.2 Sigma-delta modulators
When analyzing the performance of a Σ∆-modulator, the input signal and the
quantization noise are assumed to be uncorrelated. With this assumption, the
Σ∆-modulator in Fig. 9.1 can be modelled by the structure in Fig. 9.2. Using
this model, a signal transfer function F (z) and a noise transfer function G(z) are
defined such that the output y(n) can be written
Y(z) = F(z)X(z) + G(z)E(z),    (9.1)

where X(z), E(z) and Y(z) are the z-transforms of x(n), e(n) and y(n), respectively. In the linear model of the first order Σ∆-modulator, the signal and noise transfer functions are given by

F(z) = z^{−1}    (9.2)
G(z) = 1 − z^{−1}.    (9.3)
Thus, the signal is passed unaffected by the modulator, whereas the quantization
noise is high-pass shaped. It is also possible to construct higher order modulators,
where the noise transfer function has a stronger high pass shape. Examples of the
structures used in this thesis are included in Sec. 9.2. A straightforward design of
a kth order modulator yields signal and noise transfer functions given by
F(z) = z^{−k}    (9.4)
G(z) = (1 − z^{−1})^k.    (9.5)
The noise transfer function G(z) is shown in Fig. 9.3 for Σ∆-modulators of
different orders. It is seen that, whereas the higher order modulators have a higher
suppression of the noise at lower frequencies, the advantage comes at a cost of
increased noise amplification at the higher frequencies. This increases the requirements on the digital decimation filter. Higher order modulators also have other
considerations, including stability problems that reduce the allowed swing of the
input signal, and analog matching problems. However, these are not further discussed in this thesis.
9.1.3 Quantization noise power
It is common to regard the quantization noise e(n) as a white stochastic process with a uniform distribution in the interval [−Q/2, Q/2], where Q is the quantization step. Assuming uniform quantization in the range [−1, 1],

Q = 2/(2^B − 1),    (9.6)

where B is the number of bits used in the quantization. Let Ree(e^{jωT}) denote the power spectral density (PSD) of the quantization noise. Then,

Ree(e^{jωT}) = Q^2/12 = 1/(3(2^B − 1)^2),    (9.7)

and the noise power at the output of the modulator can be written

R_{ye ye}(e^{jωT}) = Ree(e^{jωT}) |G(e^{jωT})|^2 = (4^K/(3(2^B − 1)^2)) sin^{2K}(ωT/2).    (9.8)
Figure 9.3 Frequency response of noise transfer function G(z) for Σ∆-modulators of different orders.
The assumption that e(n) is independent of x(n) is in reality not well motivated. In contrast, e(n) is highly dependent on x(n), which causes the appearance of idle tones in the output spectrum. In addition, the assumption that the quantization noise is white with a uniform distribution is also not well motivated.
In reality, the input is slow-changing due to being band-limited, which causes the
quantization noise to be highly correlated in time. Both of these effects are well
pronounced for single-bit Σ∆-modulators, which are among the most commonly
used in practice. In Chapter 11, an alternative analysis method is presented, with
which the actual quantization noise for a specific input signal is used to evaluate
the performance of different Σ∆-modulators.
9.1.4 SNR estimation
Assume that the Σ∆-modulator uses an oversampling ratio (OSR) of L, and thus that the input signal is bandlimited to the interval [−π/L, π/L]. The allowed input signal power depends on several things, including the modulator order, number of quantization bits, and the overloading behavior of the modulator, but assume for simplicity that a sine wave of unit amplitude is used. Then the input signal power
Figure 9.4 SNR as function of OSR for single-bit Σ∆-modulators of different orders.
is σx^2 = 1/2. Since the signal transfer function is a pure delay, the output signal power is σyx^2 = σx^2. Now, the signal-to-noise ratio (SNR) at the output due to quantization noise can be written

SNR = σyx^2 / ((1/2π) ∫_{−π/L}^{π/L} R_{ye ye}(e^{jωT}) dω).    (9.9)
Figure 9.4 shows (9.9) evaluated for single-bit Σ∆-modulators of different orders. The oversampling gain is 3 + 6K dB/octave for a Σ∆-modulator of order
K.
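The SNR expression (9.9) can be evaluated numerically using (9.7) and (9.8). The sketch below does so for a single-bit modulator (B = 1) and reproduces the 3 + 6K dB/octave oversampling gain; it relies on the same linear-model assumptions as the analysis above.

import numpy as np

def snr_db(K, L, B=1):
    # Sine input power 1/2 over the in-band noise of (9.8), per (9.9).
    w = np.linspace(-np.pi / L, np.pi / L, 10001)
    rye = 4.0 ** K / (3 * (2 ** B - 1) ** 2) * np.sin(w / 2) ** (2 * K)
    noise = np.sum(rye) * (w[1] - w[0]) / (2 * np.pi)   # (1/2pi) * integral
    return 10 * np.log10(0.5 / noise)

for L in (8, 16, 32):
    print(L, round(snr_db(K=2, L=L), 1))   # about 15 dB per octave for K = 2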
9.2 Modulator structures
In this thesis, three different Σ∆-modulator structures are considered. These are
presented in this section, and are the first order, single feedback Σ∆-modulator
shown in Fig. 9.5, the second order, double feedback Σ∆-modulator shown in
Fig. 9.6, and the fourth order multi-stage Σ∆-modulator (MASH) shown in Fig. 9.7.
The signal and noise transfer functions of the first and second order Σ∆-modulators
Figure 9.5 A first-order, single-feedback Σ∆-modulator.
Figure 9.6 A second-order, double-feedback Σ∆-modulator.
Figure 9.7 A multistage noise-shaping (MASH) Σ∆-modulator.
are given by (9.4) and (9.5), respectively. For the MASH Σ∆-modulator, the signal and noise transfer functions are given by

F_MASH(z) = z^{−2}    (9.10)
G_MASH(z) = 2(1 − z^{−1})^4.    (9.11)

9.3 Modulated parallel sigma-delta ADCs
Traditionally, ADCs and DACs based on Σ∆-modulation have been used primarily for low bandwidth and high-resolution applications such as audio. The
requirements make the architecture perfectly suited for this purpose. However,
Figure 9.8 Architecture of a parallel Σ∆-ADC using time interleaving of the input signal.
Figure 9.9 Architecture of a parallel Σ∆-ADC using Hadamard modulation of the input signal.
Figure 9.10 Architecture of a parallel Σ∆-ADC using frequency-band decomposition of the input signal.
in recent years, advancements in VLSI technology have allowed greatly increased clock frequencies, and Σ∆-ADCs with bandwidths of tens of MHz have been reported [19, 103]. This makes it possible to use Σ∆-ADCs in a wider context, for example in wireless communications. In order to reach such bandwidths with feasible resolutions, sampling frequencies in the GHz range are used. However, higher operating frequencies than this are hard to obtain with current technology. Even with reduced feature sizes, the increased performance of digital circuits comes from increased parallelism rather than increased operating frequencies.
One way to further increase the bandwidth of a Σ∆-ADC is to use several
modulators in parallel, where a part of the input signal is converted in each channel.
Several flavors of such Σ∆-ADCs have been proposed, and these can essentially
be divided into four categories: time-interleaved modulators [34, 35], Hadamard
modulators [35, 40, 41, 61, 62], frequency-band decomposed modulators [29, 32, 35]
and multirate modulators based on block-digital filtering [28, 56, 58, 64].
In the time-interleaved modulator, shown in Fig. 9.8, samples are interleaved in
time between the channels. Each modulator is running at the input sampling rate,
with its input grounded between consecutive samples. This is a simple scheme, but
as interleaving causes aliasing of the spectrum, the channels have to be carefully
matched in order to cancel aliasing in the deinterleaving at the output.
In a Hadamard modulator, shown in Fig. 9.9, the signal is modulated by a
sequence constructed from the rows of a Hadamard matrix. One advantage over the
time-interleaved modulator is an inherent coding gain, which increases the dynamic
range of the ADC [35], whereas a disadvantage is that the number of channels is
restricted to a number for which there exists a known Hadamard matrix. Another
advantage, as is shown in Chapter 10, is the reduced sensitivity to mismatches in
the analog circuitry.
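The modulation algebra behind the Hadamard scheme can be illustrated in a few lines. The toy example below ignores the Σ∆-modulators and channel filters entirely and only shows that modulating a block by the orthogonal ±1 rows and demodulating with the transpose recovers the input, since H Hᵀ = M·I.

import numpy as np
from scipy.linalg import hadamard

M = 4
H = hadamard(M)                  # rows are the +/-1 modulation sequences
x = np.array([0.3, -0.1, 0.7, 0.2])

c = H @ x                        # channel k: modulate by row k and accumulate
x_rec = H.T @ c / M              # demodulate: H H^T = M*I recovers the block
assert np.allclose(x_rec, x)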
The third category of parallel modulators is the frequency-band decomposed
modulators, shown in Fig. 9.10, in which the signal is decomposed in frequency
rather than time. This scheme is insensitive to analog mismatches, but has increased hardware complexity because it requires the use of bandpass modulators.
The idea of the multirate modulators is different, based on a polyphase decomposition of the integrator in one channel, and is not further considered.
9.4 Data rate decimation
A significant part of the complexity, and power dissipation, of a Σ∆-ADC is the
digital decimation filter. This is especially true for the high-speed ADCs considered
in this thesis, and for operation in the GHz range designing the decimation filter is
a significant challenge. Normally, the decimation filter is designed in several stages
to relax the requirements on each stage and reduce the overall complexity [94,100].
For the first stage, a common choice is the cascaded integrator comb (CIC) filter (also known as moving average filter). The transfer function of an N-tap (order N − 1) CIC filter is

H(z) = (1/N) Σ_{i=0}^{N−1} z^{−i} = (1/N) (1 − z^{−N})/(1 − z^{−1}).    (9.12)
If the number of taps, N , is selected as N = M , where M is the decimation rate,
the CIC filter has zeros on the unit circle at 2iπ/M rad for i = 1, 2, . . . , M − 1, i.e.,
the angles that are folded to 0 during the decimation. To increase the attenuation,
several CIC filters are cascaded.
The main reason for the popularity of the CIC filter is the direct implementation
of (9.12) as a cascade of an accumulator and a differentiator [20, 51], depicted in
Fig. 9.11. This implementation has low arithmetic complexity, as for a cascade
of L CIC filters only 2L adders are required. However, the wordlength of the
Figure 9.11 Direct implementation of (9.12) as cascaded accumulator and differentiator.
Figure 9.12 General architecture of a decimation filter implemented using polyphase decomposition and partial product reduction with a carry-save adder tree.
accumulators in a CIC filter is significantly longer than the input wordlength, which
may lead to problems obtaining high throughput as the accumulators operate at
the input sampling rate. This can be alleviated by the use of redundant arithmetic
in the accumulators [76] or parallelizing the accumulators to operate at a lower
sampling rate. However, this increases the complexity of the implementation, which makes it less attractive for operation at high speed.
For decimation filters with short wordlength and high throughput requirements,
it may be significantly more efficient to compute the impulse response of the cascaded filters and realize the resulting linear-phase FIR filter using polyphase decomposition [1, 42, 45, 67]. The realization using a direct form structure is described
in Sec. 8.3, and a general architecture of a decimation filter implemented using
polyphase decomposition is shown in Fig. 9.12. Another advantage of the architecture is the ability to use general FIR filters to allow an arbitrary set of filter
coefficients, optimized for a suitable cost function. This is especially useful when
using higher order Σ∆-modulators with arbitrary zero positioning for the noise
transfer function.
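The impulse response of the cascade mentioned above is obtained by repeated convolution of length-M boxcars, as in the following sketch; the result is the linear-phase FIR filter that the polyphase architecture then realizes.

import numpy as np

def cic_cascade_impulse_response(M, stages):
    # Cascade of `stages` N-tap CIC filters with N = M, i.e. (9.12) to the
    # given power, including the 1/M scaling of each stage.
    h = np.array([1.0])
    box = np.full(M, 1.0 / M)
    for _ in range(stages):
        h = np.convolve(h, box)
    return h

h = cic_cascade_impulse_response(M=4, stages=3)   # length 3*(4-1)+1 = 10
# H(z) is zero at the aliasing frequencies 2*pi*i/M, i = 1, 2, 3:
print(abs(np.polyval(h[::-1], np.exp(2j * np.pi / 4))))   # ~1e-16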
10 Parallel sigma-delta ADCs
In this chapter, a multi-rate formulation of a system of parallel Σ∆-ADCs is presented. The channels use modulation of the input sequence to achieve parallelism,
and also allow the Σ∆-modulators to differ between the channels. The formulation
is used to analyze a system’s sensitivity to mismatches between the channels, including channel gain, general Σ∆-modulator mismatches and modulation sequence
errors. It is shown that certain systems can become noise-free with limited calibration of a subset of the channels, whereas others may require full calibration of all
channels. The multi-rate formulation is also used as a tool to devise a three-channel
parallel ADC that is insensitive to gain mismatches in the channels.
In Sec. 10.1, the proposed multi-rate formulation is presented, and the resulting signal transfer function of the parallel system is derived. In Sec. 10.2, the
model of one channel of the parallel system is described. The model includes analog imperfections such as modulator non-idealities, channel gain, offset errors and
modulation sequence mismatches. In Sec. 10.3, the formulation is used to analyze the sensitivities of a time-interleaved ADC, Hadamard-modulated ADC and a
frequency-band decomposed ADC, and the new insensitive modulation scheme is
presented. Finally, the quantization noise characteristics of a parallel modulated
system are discussed briefly in Sec. 10.4.
This chapter has been published in [16].
Figure 10.1 ADC system using parallel Σ∆-modulators and modulation sequences.
10.1 Linear system model
The system in Fig. 10.1 is considered. In this scheme, the input signal x(n) is first
divided into N channels. In each channel k, k = 0, 1, . . . , N − 1, the signal is first
modulated by the M -periodic sequence ak (n) = ak (n+M ). The resulting sequence
is then fed into a Σ∆-modulator Σ∆k , followed by a digital filter Gk (z). The
output of the filter is modulated by the M -periodic sequence bk (n) = bk (n + M )
which produces the channel output sequence yk (n). Finally, the overall output
sequence y(n) is obtained by summing all channel output sequences. The Σ∆-modulator in each channel works in the same way as an ordinary Σ∆-modulator.
By increasing the channel oversampling, and reducing the passband width of the
channel filter accordingly, more of the shaped noise is removed, and the resolution
is increased. By using several channels in parallel, wider signal bands can be
handled without increasing the input sampling rate to unreasonable values. In
other words, instead of using one single Σ∆-ADC with a very high input sampling
rate, a number of Σ∆-ADCs in parallel provide essentially the same resolution
but with a reasonable input sampling rate. Traditional parallel systems that are
covered by the proposed model include the time-interleaved, Hadamard modulated
and frequency-band decomposed ADCs discussed in Sec. 9.3.
The overall output y(n) is determined by the input x(n), the signal transfer
function of the system, and the quantization noise generated in the Σ∆-modulators.
Using a linear model for analysis, the signal input-to-output relation and noise
input-to-output relation can be analyzed separately. The signal transfer function
from x(n) to y(n) should equal (or at least approximate) a delay in the frequency
band of interest. The main problem in practice is that the overall scheme is subject to channel gain, offset, and modulation sequence level mismatches [4, 35, 57].
Figure 10.2 Equivalent signal transfer models of a channel of the parallel system in Fig. 10.1: (a) model of channel; (b) polyphase decomposition of multipliers; (c) multirate formulation of a channel.
Figure 10.3 Path from xq(m) to yk,r(m) in channel k as depicted in Fig. 10.2(b).
This is where the new general formulation becomes useful, as it gives a relation between the input and output from which one can easily deduce a particular scheme's sensitivity to mismatch errors. The noise contribution, on the other hand, is essentially unaffected by channel mismatches. Therefore, the noise analysis can be handled in the traditional way, as in Sec. 10.4.
10.1.1 Signal transfer function
From the signal input-to-output point of view, the system depicted in Fig. 10.2(a) is
obtained for channel k. Here, each Hk (z) represents a cascade of the corresponding
signal transfer function of the Σ∆-modulator and the digital filter Gk (z). To derive
a useful input-output relation in the z-domain, multirate filter bank theory [94]
is used. As ak (n) and bk (n) are M -periodic sequences, each multiplication can
be modelled as M branches with constant multiplications and with the samples
interleaved between the branches. This is shown in the structure in Fig. 10.2(b),
where

    a_{k,n} = \begin{cases} a_k(0) & \text{for } n = 0 \\ a_k(M-n) & \text{for } n = 1, 2, \ldots, M-1 \end{cases}    (10.1)

    b_{k,n} = b_k(M-1-n) \quad \text{for } n = 0, 1, \ldots, M-1.    (10.2)
Now, consider the system shown in Fig. 10.3, representing the path from xq(m) to yk,r(m) in Fig. 10.2(b). Denoting

    \tilde{H}_k(z) = z^{M-1} H_k(z),    (10.3)

the transfer function from xq(m) to yk,r(m) is given by the first polyphase component in the polyphase decomposition of z^q H̃k(z) z^{-r}, scaled by ak,q bk,r. For p = q − r = 0, 1, . . . , M − 1, the polyphase decomposition of z^p H̃k(z) can be written

    z^p \tilde{H}_k(z) = \sum_{i=0}^{M-1} z^{p-i} \tilde{H}_{k,i}(z^M),    (10.4)

and the first polyphase component is H̃k,p(z), i.e., the pth polyphase component of H̃k(z) as specified by the Type 1 polyphase representation in Sec. 8.3.
Figure 10.4 Equivalent representation of the system in Fig. 10.1 based on the equivalences in Fig. 10.2. P(z) is given by (10.8).
For p = −M + 1, . . . , −1,

    z^p \tilde{H}_k(z) = \sum_{i=0}^{M-1} z^{p-i+M} z^{-M} \tilde{H}_{k,i}(z^M)    (10.5)
and the first polyphase component is z^{-1} H̃k,p+M(z). Returning to the system in Fig. 10.2(b), the transfer functions P_k^{r,q}(z) from xq(m) to yk,r(m) can now be written

    P_k^{r,q}(z) = \begin{cases} b_{k,r}\, \tilde{H}_{k,q-r}(z)\, a_{k,q} & \text{for } q \geq r \\ b_{k,r}\, z^{-1} \tilde{H}_{k,q-r+M}(z)\, a_{k,q} & \text{for } q < r. \end{cases}    (10.6)
The relations can be written in matrix form as

    \mathbf{P}_k(z) = \begin{bmatrix}
    a_{k,0} b_{k,0} \tilde{H}_{k,0}(z) & a_{k,1} b_{k,0} \tilde{H}_{k,1}(z) & \cdots & a_{k,M-1} b_{k,0} \tilde{H}_{k,M-1}(z) \\
    a_{k,0} b_{k,1} z^{-1} \tilde{H}_{k,M-1}(z) & a_{k,1} b_{k,1} \tilde{H}_{k,0}(z) & \cdots & a_{k,M-1} b_{k,1} \tilde{H}_{k,M-2}(z) \\
    a_{k,0} b_{k,2} z^{-1} \tilde{H}_{k,M-2}(z) & a_{k,1} b_{k,2} z^{-1} \tilde{H}_{k,M-1}(z) & \cdots & a_{k,M-1} b_{k,2} \tilde{H}_{k,M-3}(z) \\
    \vdots & \vdots & \ddots & \vdots \\
    a_{k,0} b_{k,M-1} z^{-1} \tilde{H}_{k,1}(z) & a_{k,1} b_{k,M-1} z^{-1} \tilde{H}_{k,2}(z) & \cdots & a_{k,M-1} b_{k,M-1} \tilde{H}_{k,0}(z)
    \end{bmatrix}    (10.7)
and it is thus obvious that one channel of the system can be represented by the
structure in Fig. 10.2(c). In the whole system (Fig. 10.1) a number of such channels
are summed at the output, and the parallel system of N channels can be represented
by the structure in Fig. 10.4, where the matrix P(z) is given by

    \mathbf{P}(z) = \sum_{k=0}^{N-1} \mathbf{P}_k(z).    (10.8)

For convenience, we write (10.7) as

    \mathbf{P}_k(z) = \mathbf{S}_k \cdot \tilde{\mathbf{H}}_k(z)    (10.9)

where the multiplication is element-wise, and where H̃k(z) and Sk are given by
    \tilde{\mathbf{H}}_k(z) = \begin{bmatrix}
    \tilde{H}_{k,0}(z) & \tilde{H}_{k,1}(z) & \cdots & \tilde{H}_{k,M-1}(z) \\
    z^{-1} \tilde{H}_{k,M-1}(z) & \tilde{H}_{k,0}(z) & \cdots & \tilde{H}_{k,M-2}(z) \\
    z^{-1} \tilde{H}_{k,M-2}(z) & z^{-1} \tilde{H}_{k,M-1}(z) & \cdots & \tilde{H}_{k,M-3}(z) \\
    \vdots & \vdots & \ddots & \vdots \\
    z^{-1} \tilde{H}_{k,1}(z) & z^{-1} \tilde{H}_{k,2}(z) & \cdots & \tilde{H}_{k,0}(z)
    \end{bmatrix}    (10.10)
and

    \mathbf{S}_k = \begin{bmatrix}
    a_{k,0} b_{k,0} & a_{k,1} b_{k,0} & \cdots & a_{k,M-1} b_{k,0} \\
    a_{k,0} b_{k,1} & a_{k,1} b_{k,1} & \cdots & a_{k,M-1} b_{k,1} \\
    a_{k,0} b_{k,2} & a_{k,1} b_{k,2} & \cdots & a_{k,M-1} b_{k,2} \\
    \vdots & \vdots & \ddots & \vdots \\
    a_{k,0} b_{k,M-1} & a_{k,1} b_{k,M-1} & \cdots & a_{k,M-1} b_{k,M-1}
    \end{bmatrix},    (10.11)

respectively. Equation (10.11) can equivalently be written as

    \mathbf{S}_k = \mathbf{b}_k^T \mathbf{a}_k    (10.12)
where

    \mathbf{a}_k = \begin{bmatrix} a_{k,0} & a_{k,1} & \cdots & a_{k,M-1} \end{bmatrix}, \quad \mathbf{b}_k = \begin{bmatrix} b_{k,0} & b_{k,1} & \cdots & b_{k,M-1} \end{bmatrix},    (10.13)

and T stands for transpose. Examples of the Sk-matrices and the ak- and bk-vectors are provided for a time-interleaved system, a Hadamard-modulated system, and a frequency-band decomposed system in Sec. 10.3.1, Sec. 10.3.2, and Sec. 10.3.3, respectively.
10.1.2 Alias-free system
With the system represented as above, it is known that it is alias-free, and thus
time-invariant, if and only if the matrix P(z) is pseudocirculant [94]. Under this
condition, the output z-transform becomes
    Y(z) = H_A(z) X(z),    (10.14)
where

    H_A(z) = z^{-M+1} \sum_{k=0}^{N-1} \sum_{i=0}^{M-1} s_k^{0,i} z^{-i} \tilde{H}_{k,i}(z^M) = \sum_{k=0}^{N-1} \sum_{i=0}^{M-1} s_k^{0,i} z^{-i} H_{k,i}(z^M)    (10.15)

with s_k^{0,i} denoting the elements on the first row of Sk. This case corresponds to a Nyquist-sampled ADC, of which two special cases are the time-interleaved ADC [34, 58] and the Hadamard-modulated ADC [40]. These systems are also described in the context of the multirate formulation in Sec. 10.3.1 and Sec. 10.3.2, respectively.
Regarding H̃k(z), it is seen in (10.10) that the matrix is pseudocirculant for an arbitrary H̃k(z). It would thus be sufficient to make Sk circulant for each channel k in order to make each Pk(z) pseudocirculant and end up with a pseudocirculant P(z). Unfortunately, the set of circulant real-valued Sk achievable by the construction in (10.12) is limited, because the rank of Sk is one. However, for purposes of error cancellation between channels, it is beneficial to group the channels into sets where the matrices within each set sum to a circulant matrix. The channel set {0, 1, . . . , N − 1} is thus partitioned into the sets C0, . . . , CI−1, where each sum Σ_{k∈Ci} Sk is a circulant matrix. It is assumed that the modulators and filters are identical for channels belonging to the same partition, Hk(z) = Hl(z) whenever k, l ∈ Ci, and thus H̃k(z) = H̃l(z). The matrix for partition i is denoted H̃0,i(z). Sensitivity to channel mismatches is discussed further in Sec. 10.2.
10.1.3 L-decimated alias-free system
A system is denoted an L-decimated alias-free system if it is alias-free before decimation by a factor of L. A channel of such a system is shown in Fig. 10.5(a). Obviously, the decimation can be performed before the modulation, as shown in Fig. 10.5(b), if the index of the modulation sequence is scaled by a factor of L. Considering the equivalent system in Fig. 10.5(c), it is apparent that the downsampling by L can be moved to after the scalings by bk,l if the delay elements z^{-1} are replaced by L-fold delay elements z^{-L}. The system may then be described as in Fig. 10.5(d), where Pk(z) is defined by (10.6). However, the outputs are taken from every Lth row of Pk(z), such that the first output y_{k,(L−1) mod M}(m) is taken from row L, the second output y_{k,(2L−1) mod M}(m) is taken from row ((2L − 1) mod M) + 1, and so on. It is thus apparent that only rows gcd(L, M) · i, i = 0, 1, 2, . . . are used.

The L-decimated system corresponds to an oversampled ADC. The main observation is that the L-decimated system may be described in the same way as the critically sampled (non-oversampled) system, but relaxations may be allowed on the requirements of the modulation sequences. As only a subset of the rows of P(z) are used, the matrix needs to be pseudocirculant only on these rows. As in the critically sampled case, the channel set {0, 1, . . . , N − 1} is partitioned into sets C0, . . . , CI−1 where the matrix Σ_{k∈Ci} Sk is circulant on the rows gcd(L, M) · i, i = 0, 1, 2, . . . , and H̃k(z) = H̃l(z) = H̃0,i(z) when k, l ∈ Ci. The oversampled Hadamard-modulated system in [41] belongs to this category of the formulation, and another example of a decimated system is given in Sec. 10.3.4.
Figure 10.5 Channel model of L-decimated system: (a) decimation at output; (b) internal decimation; (c) polyphase decomposition of input and output; (d) multirate formulation of a channel, where yk,L(m) denotes the output pertaining to the Lth row of Pk(z).
Figure 10.6 Channel model with nonideal analog circuits.
10.2 Sensitivity to channel mismatches
In this section, the channel model used for the sensitivity analysis is explained.
Figure 10.6 shows one of the channels of a parallel Σ∆-ADC, where several nonidealities resulting from imperfect analog circuits have been included. Difficulties
in realizing the exact values of the analog modulation sequence are modelled by an
additive error term εk (n). The error is assumed to be static, i.e., it depends only on
the value of ak (n), and is therefore a periodic sequence with the same periodicity as
ak (n). The time-varying error εk (n) may be a major concern when the modulation
sequences contain non-trivial elements, i.e., elements that are not −1, 0, or 1. The
trivial elements may be realized without a multiplier by exchanging, grounding, or
passing through the inputs to the modulator, and are for this reason particularly
attractive on the analog side.
A channel-specific gain γk is included in the sensitivity analysis, and analog
imperfections in the modulator are modelled as the transfer function ∆Hk (z). The
modulator nonidealities including channel gain and modulation sequence errors are
analyzed separately in the context of the multirate formulation. In practice, there
is also a channel offset δk which is not suitable for analysis in this context, as it is
signal independent. Channel offsets are commented on in Sec. 10.2.4.
10.2.1 Modulator nonidealities
Assume that the ideal system is alias-free, i.e., the matrix P(z) = Σk Pk(z) is pseudocirculant. Due to analog circuit errors, the transfer function of channel k deviates from the ideal Hk(z) to γk(Hk(z) + ∆Hk(z)), and H̃k(z) is replaced by Ĥk(z) = γk(Hk(z) + ∆Hk(z))z^{M−1}. The transfer matrix for channel k thus becomes P̂k(z) with elements

    \hat{P}_k^{j,i}(z) = \begin{cases} b_{k,j}\, \hat{H}_{k,i-j}(z)\, a_{k,i} & \text{for } i \geq j \\ b_{k,j}\, z^{-1} \hat{H}_{k,i-j+M}(z)\, a_{k,i} & \text{for } i < j \end{cases}    (10.16)

where Ĥk,p(z) are the polyphase components of Ĥk(z). It is apparent that P̂k(z) is pseudocirculant whenever Pk(z) is. Thus a system where all the Sk-matrices are circulant is completely insensitive to modulator mismatches.
In the general case, unfortunately, not all Sk are circulant, and Σk P̂k(z) = Σk Sk · Ĥk(z) does not sum up to a pseudocirculant matrix, as the matrices Ĥk(z) differ between the channels. Partitioning the channel set into the sets Ci, as described in Sec. 10.1.2, and matching the modulators of channels belonging to the same partition Ci, i.e., defining γk = γl and ∆Hk(z) = ∆Hl(z) when k, l ∈ Ci, allows P̂(z) to be written

    \hat{\mathbf{P}}(z) = \sum_{k=0}^{N-1} \mathbf{S}_k \cdot \hat{\mathbf{H}}_k(z) = \sum_{i=0}^{I-1} \hat{\mathbf{H}}_{0,i}(z) \cdot \sum_{k \in C_i} \mathbf{S}_k    (10.17)

and it is apparent that each term in the outer sum is pseudocirculant, and thus that P̂(z) is as well. The system is therefore alias-free, and nonlinear distortion is eliminated.
10.2.2 Modulation sequence errors
It is assumed that the ideal system is alias-free, i.e., that P(z) = Σk Pk(z) is pseudocirculant. Due to difficulties in realizing the analog modulation sequence, the signal is modulated in channel k by the sequence âk = ak + εk rather than the ideal sequence ak. We consider here different choices of the modulation sequences.
Bi-level sequence for an insensitive channel
Assume that an analog modulation sequence with two levels is used for an insensitive channel, i.e., Sk = b_k^T a_k is a circulant matrix. Examples of this type of channel include the first two channels of a Hadamard-modulated system. Assuming that the sequence errors εk depend only on ak, i.e., εk(n1) = εk(n2) when ak(n1) = ak(n2), the modulation vector can be written âk = αk ak + [βk βk · · · βk] for some values of the scaling factor αk and offset term βk. The channel matrix P̂k(z) for the channel modulated with the sequence âk then becomes

    \hat{\mathbf{P}}_k(z) = \mathbf{b}_k^T (\alpha_k \mathbf{a}_k + [\beta_k\ \beta_k\ \cdots\ \beta_k]) \cdot \tilde{\mathbf{H}}_k(z) = \alpha_k \mathbf{S}_k \cdot \tilde{\mathbf{H}}_k(z) + \beta_k \mathbf{B}_k \tilde{\mathbf{H}}_k(z),    (10.18)

where Bk is a diagonal matrix consisting of the elements of bk. The first term is pseudocirculant, and thus the system is insensitive to modulation sequence scaling factors in channel k. The impact of the offset term βk, i.e., the second term, is explained in Sec. 10.2.3.
Bi-level sequence for sensitive channels
Consider one of the subsets Ci in the partition of the channel set. The sum of the Sk-matrices corresponding to the channels in the set, Σ_{k∈Ci} Sk, is a circulant matrix, whereas the constituent matrices are not. Examples of this type of channels are the time-interleaved systems and the Hadamard-modulated systems with more than two channels.
As in the insensitive case, the modulation vectors are written âk = αk ak + [βk βk · · · βk], and the sum of the channel matrices for the channel subset becomes

    \sum_{k \in C_i} \hat{\mathbf{P}}_k(z) = \sum_{k \in C_i} \mathbf{b}_k^T (\alpha_k \mathbf{a}_k + [\beta_k\ \cdots\ \beta_k]) \cdot \tilde{\mathbf{H}}_k(z) = \tilde{\mathbf{H}}_{0,i}(z) \cdot \left( \sum_{k \in C_i} \alpha_k \mathbf{S}_k \right) + \sum_{k \in C_i} \beta_k \mathbf{B}_k \tilde{\mathbf{H}}_k(z),    (10.19)

where Bk is a diagonal matrix consisting of the elements of bk. The first sum is generally not a pseudocirculant matrix, and the channels are thus sensitive to sequence gain errors. If the gains are matched, denoting α0,i = αk = αl when k, l ∈ Ci, the channel matrix sum may be written

    \sum_{k \in C_i} \hat{\mathbf{P}}_k(z) = \alpha_{0,i} \tilde{\mathbf{H}}_{0,i}(z) \cdot \left( \sum_{k \in C_i} \mathbf{S}_k \right) + \sum_{k \in C_i} \beta_k \mathbf{B}_k \tilde{\mathbf{H}}_k(z),    (10.20)

and it is seen that the first term is a pseudocirculant matrix, and the channel set is alias-free. Again, the impact of the offset term βk is explained in Sec. 10.2.3.
Multi-level sequences
If an insensitive channel is modulated with a multi-level sequence âk = ak + εk, the channel matrix becomes

    \hat{\mathbf{P}}_k(z) = \mathbf{b}_k^T (\mathbf{a}_k + \boldsymbol{\varepsilon}_k) \cdot \tilde{\mathbf{H}}_k(z) = \mathbf{S}_k \cdot \tilde{\mathbf{H}}_k(z) + \mathbf{b}_k^T \boldsymbol{\varepsilon}_k \cdot \tilde{\mathbf{H}}_k(z),    (10.21)

which is pseudocirculant only if b_k^T εk is a circulant matrix. Systems with multi-level analog modulation sequences are thus sensitive to level errors.
10.2.3 Modulation sequence offset errors
Consider the modulation sequence offset errors introduced in Sec. 10.2.2. The
channel matrix for a channel with a modulation sequence containing an offset error
can be written as (10.18). Thus the error pertaining to the sequence offset is
additive, and can be modelled as in Fig. 10.7. The signal is thus first filtered
through Hk (z) and then aliased by the system Bk , as Bk is not pseudocirculant
unless the elements in the digital modulation sequence bk are identical. However,
as the signal is first filtered, only signal components in the passband of Hk (z) will
cause aliasing. If the signal contains no information in this band, aliasing will
be completely suppressed. Typically the signal has a guard band either at the
low-frequency or high-frequency region to allow transition bands of the filters, and
the modulator can then be suitably chosen as either a low-pass type or high-pass
type [80], respectively. Errors pertaining to sequence offsets are demonstrated in
Sec. 10.3.1.
Figure 10.7 Model of errors in a parallel system pertaining to sequence offsets.
10.2.4 Channel offset errors
Channel offsets must be removed for each channel in order not to overload the Σ∆-modulator. Offsets affect the system in a nonlinear way and cannot be analyzed
using the multirate formulation. However, the problem has been well investigated
and numerous solutions exist [4, 58, 81].
10.3 Simulation results
In this section, the proposed multi-rate formulation is used to analyze the sensitivity to channel mismatch errors for some different systems. Results are provided
for a time-interleaved ADC, a Hadamard modulated ADC and a frequency-band
decomposed ADC, as described in Sec. 9.3. Also, an example is provided of how the
formulation can be used to derive a new architecture that is insensitive to channel
matching errors.
10.3.1 Time-interleaved ADC
Consider a time-interleaved ADC with four channels. The samples are interleaved between the channels, each encompassing identical second-order low-pass modulators and decimation filters. Ideally, their z-domain transforms may be written

    H_k(z) = H(z) = \begin{cases} z^{-1} & -\pi/4 \leq \omega T \leq \pi/4 \\ 0 & \text{otherwise.} \end{cases}    (10.22)
All modulators are running at the input sampling rate, with their inputs grounded between consecutive samples. Thus the modulation sequences are

    a_0(n) = b_0(n) = 1, 0, 0, 0, \ldots
    a_1(n) = b_1(n) = 0, 1, 0, 0, \ldots    (10.23)
    a_2(n) = b_2(n) = 0, 0, 1, 0, \ldots
    a_3(n) = b_3(n) = 0, 0, 0, 1, \ldots

all periodic with period M = 4. The vectors ak and bk are as defined by (10.13):

    a_0 = b_3 = [ 1  0  0  0 ]
    a_1 = b_0 = [ 0  0  0  1 ]    (10.24)
    a_2 = b_1 = [ 0  0  1  0 ]
    a_3 = b_2 = [ 0  1  0  0 ]
The matrices Sk, defined by (10.12), then become

    \mathbf{S}_0 = \mathbf{b}_0^T \mathbf{a}_0 = \begin{bmatrix} 0&0&0&0 \\ 0&0&0&0 \\ 0&0&0&0 \\ 1&0&0&0 \end{bmatrix} \quad
    \mathbf{S}_1 = \mathbf{b}_1^T \mathbf{a}_1 = \begin{bmatrix} 0&0&0&0 \\ 0&0&0&0 \\ 0&0&0&1 \\ 0&0&0&0 \end{bmatrix}

    \mathbf{S}_2 = \mathbf{b}_2^T \mathbf{a}_2 = \begin{bmatrix} 0&0&0&0 \\ 0&0&1&0 \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix} \quad
    \mathbf{S}_3 = \mathbf{b}_3^T \mathbf{a}_3 = \begin{bmatrix} 0&1&0&0 \\ 0&0&0&0 \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix}
Because the sum of all Sk-matrices is a circulant matrix, the system is alias-free, and the transfer function for the system is given by (10.15) as

    H_A(z) = z^{-1} s_3^{0,1} H_{3,1}(z^4) = z^{-1}    (10.25)

where H3,1(z) = 1 is the second polyphase component in the polyphase decomposition of H(z). The transfer function is thus a simple delay, and the system will digitize the complete spectrum.
As none of the Sk -matrices are circulant, and a circulant matrix can be formed
only by summing all the matrices, the time-interleaved ADC requires matching of
all channels in order to eliminate aliasing. Thus we define C0 = {0, 1, 2, 3}, according to Sec. 10.1.2. The system has been simulated with modulator nonidealities
and errors of bi-level sequences for sensitive channels, as described in Sec. 10.2.
Figure 10.8(a) shows the output spectrum for the ideal case with no mismatches
between channels (γk = 1 for all k). Applying 2% gain mismatch for one of the
channels (γ0 = 0.98, γ1 = γ2 = γ3 = 1), the spectrum in Fig. 10.8(b) results, where
the aliasing components can be clearly seen. In Fig. 10.8(c), the channel gains are
set to one, and a 1% offset error has been added to the first modulation sequence
(β0 = 0.01, β1 = β2 = β3 = 0), which results in aliasing. In Fig. 10.8(d), high-pass
modulators have been used instead, and the distortions disappear, as predicted by
the analysis in Sec. 10.2.3.
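The gain-mismatch mechanism can be illustrated with a strongly simplified signal-path model in Python: the quantization noise and the channel filters are omitted, so interleaving reduces to multiplying x(n) by a 4-periodic gain pattern. This sketch, with assumed example values, shows why a mismatched gain creates the aliasing components of Fig. 10.8(b):

import numpy as np

n = np.arange(4096)
x = np.sin(0.3 * np.pi * n)                 # test tone at omega*T = 0.3*pi
gains = np.array([0.98, 1.0, 1.0, 1.0])     # 2% gain error in channel 0
y = gains[n % 4] * x                        # periodic channel-gain pattern

S = np.abs(np.fft.rfft(y * np.hanning(len(y))))
w = np.linspace(0, np.pi, len(S))
# A 4-periodic gain error has spectral components at multiples of pi/2, so
# images of the tone appear near 0.2*pi, 0.7*pi and 0.8*pi, as in Fig. 10.8(b).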
Figure 10.8 Example 1: Sensitivity of time-interleaved ADC. Output amplitude spectra: (a) simulation using ideal system; (b) simulation with 2% gain mismatch in one channel; (c) simulation with 1% offset error in one modulation sequence; (d) simulation with 1% offset error in one modulation sequence using high-pass modulators instead of low-pass modulators.
10.3.2 Hadamard-modulated ADC
Consider a non-oversampling Hadamard-modulated ADC with eight channels. In this case, every channel filter is an 8th-band filter (Hk(z) = H(z), k = 0, . . . , 7) and the modulation sequences ak(n) and bk(n) are

    a_0(n) = b_0(n) = 1, 1, 1, 1, 1, 1, 1, 1, \ldots
    a_1(n) = b_1(n) = 1, -1, 1, -1, 1, -1, 1, -1, \ldots
    a_2(n) = b_2(n) = 1, 1, -1, -1, 1, 1, -1, -1, \ldots
    a_3(n) = b_3(n) = 1, -1, -1, 1, 1, -1, -1, 1, \ldots
    a_4(n) = b_4(n) = 1, 1, 1, 1, -1, -1, -1, -1, \ldots    (10.26)
    a_5(n) = b_5(n) = 1, -1, 1, -1, -1, 1, -1, 1, \ldots
    a_6(n) = b_6(n) = 1, 1, -1, -1, -1, -1, 1, 1, \ldots
    a_7(n) = b_7(n) = 1, -1, -1, 1, -1, 1, 1, -1, \ldots
The vectors ak and bk become

    a_0 =  b_0 = [ 1  1  1  1  1  1  1  1 ]
    a_1 = -b_1 = [ 1 -1  1 -1  1 -1  1 -1 ]
    a_2 =  b_3 = [ 1 -1 -1  1  1 -1 -1  1 ]
    a_3 = -b_2 = [ 1  1 -1 -1  1  1 -1 -1 ]
    a_4 = [ 1 -1 -1 -1 -1  1  1  1 ],   b_4 = [ -1 -1 -1 -1  1  1  1  1 ]    (10.27)
    a_5 = [ 1  1 -1  1 -1 -1  1 -1 ],   b_5 = [  1 -1  1 -1 -1  1 -1  1 ]
    a_6 = [ 1  1  1 -1 -1 -1 -1  1 ],   b_6 = [  1  1 -1 -1 -1 -1  1  1 ]
    a_7 = [ 1 -1  1  1 -1  1 -1 -1 ],   b_7 = [ -1  1  1 -1  1 -1 -1  1 ]

With Sk = b_k^T a_k, the following matrices can be computed: S0 = 1, the all-ones matrix, and

    \mathbf{S}_1 = \begin{bmatrix}
    -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 \\
    1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
    -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 \\
    1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
    -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 \\
    1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\
    -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 \\
    1 & -1 & 1 & -1 & 1 & -1 & 1 & -1
    \end{bmatrix}

    \mathbf{S}_2 + \mathbf{S}_3 = \begin{bmatrix}
    0 & 2 & 0 & -2 & 0 & 2 & 0 & -2 \\
    -2 & 0 & 2 & 0 & -2 & 0 & 2 & 0 \\
    0 & -2 & 0 & 2 & 0 & -2 & 0 & 2 \\
    2 & 0 & -2 & 0 & 2 & 0 & -2 & 0 \\
    0 & 2 & 0 & -2 & 0 & 2 & 0 & -2 \\
    -2 & 0 & 2 & 0 & -2 & 0 & 2 & 0 \\
    0 & -2 & 0 & 2 & 0 & -2 & 0 & 2 \\
    2 & 0 & -2 & 0 & 2 & 0 & -2 & 0
    \end{bmatrix}

    \mathbf{S}_4 + \mathbf{S}_5 + \mathbf{S}_6 + \mathbf{S}_7 = \begin{bmatrix}
    0 & 4 & 0 & 0 & 0 & -4 & 0 & 0 \\
    0 & 0 & 4 & 0 & 0 & 0 & -4 & 0 \\
    0 & 0 & 0 & 4 & 0 & 0 & 0 & -4 \\
    -4 & 0 & 0 & 0 & 4 & 0 & 0 & 0 \\
    0 & -4 & 0 & 0 & 0 & 4 & 0 & 0 \\
    0 & 0 & -4 & 0 & 0 & 0 & 4 & 0 \\
    0 & 0 & 0 & -4 & 0 & 0 & 0 & 4 \\
    4 & 0 & 0 & 0 & -4 & 0 & 0 & 0
    \end{bmatrix}
It is seen that S0 and S1 are circulant matrices. Also, S2 + S3 is circulant.
Further, the remaining matrices sum to a circulant matrix S4 + S5 + S6 + S7 ,
whereas no smaller subset does. Thus, in order to eliminate aliasing, the channels
are partitioned into the sets C0 = {0}, C1 = {1}, C2 = {2, 3} and C3 = {4, 5, 6, 7}.
The Hadamard-modulated ADC thus contains both insensitive channels 0 and 1,
and sensitive channels 2, . . . , 7.
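The partitioning can be confirmed numerically. A Python sketch, using the natural-order 8x8 Hadamard rows as the sequences of (10.26) and the vector mappings of (10.1)-(10.2); the helper names are illustrative:

import numpy as np

M = 8
H = np.array([[(-1) ** bin(i & j).count("1") for j in range(M)]
              for i in range(M)])            # rows are the sequences a_k(n)

def vec_a(seq):   # a_k = [a_k(0), a_k(M-1), ..., a_k(1)], cf. (10.1)
    return np.array([seq[0]] + [seq[M - n] for n in range(1, M)])

def vec_b(seq):   # b_{k,n} = b_k(M-1-n), cf. (10.2)
    return np.array([seq[M - 1 - n] for n in range(M)])

S = [np.outer(vec_b(H[k]), vec_a(H[k])) for k in range(M)]

def is_circulant(m):
    return all((np.roll(m[i - 1], 1) == m[i]).all() for i in range(1, M))

print(is_circulant(S[0]), is_circulant(S[1]))                 # True True
print(is_circulant(S[2] + S[3]))                              # True
print(is_circulant(sum(S[4:8])))                              # True
print(is_circulant(S[4] + S[5]), is_circulant(S[4] + S[7]))   # spot-check: False False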
Using the model of the ideal system, the spectrum of the output signal is as
shown in Fig. 10.9(a). Figure 10.9(b) shows the output spectrum for the system
with 1% random gain mismatch (γk ∈ [0.99, 1.01]), where the aliasing distortions
are readily seen. Matching the gains of the C2 -channels to each other (setting
γ2 = γ3 ) and the gains of the C3 -channels to each other (setting γ4 = γ5 = γ6 = γ7 ),
the spectrum in Fig. 10.9(c) results, and the distortions disappear. Although the
Hadamard-modulated ADC is less sensitive than the time-interleaved ADC, the
matching requirements for eight-channel systems and above are still severe.
10.3.3 Frequency-band decomposed ADC
For the frequency-band decomposed ADC, the input signal is applied unmodulated to N modulators converting different frequency bands. Consider as an example a four-channel system consisting of a lowpass channel, a highpass channel, and two bandpass channels centered at 3π/8 and 5π/8. As the signal is not modulated,

    \mathbf{a}_k = \mathbf{b}_k = [\, 1\ 1\ 1\ 1 \,]    (10.28)

for all k, and

    \mathbf{S}_k = \begin{bmatrix} 1&1&1&1 \\ 1&1&1&1 \\ 1&1&1&1 \\ 1&1&1&1 \end{bmatrix}

for all k.
Figure 10.9 Example 2: Sensitivity of Hadamard-modulated ADC. Output amplitude spectra: (a) simulation using ideal model; (b) simulation using 1% channel gain mismatch; (c) simulation using gain matching of sensitive channels.
As each Sk-matrix is circulant, the system is insensitive to channel
mismatches. Further, modulation sequence errors are irrelevant in this case, as
the signal is not modulated. The frequency-band decomposed ADC is thus highly
resistant to mismatches. Its obvious drawback, however, is the need to use bandpass
modulators which are more expensive in hardware.
10.3.4 Generation of new scheme
This example demonstrates that the formulation can also be used to devise new
schemes, although a general method is not presented. A three-channel parallel
system using lowpass modulators is designed. The signal is assumed to be in the
frequency band −π/4 < ωT < π/4, and the ADC is thus an oversampled system
and is described according to Sec. 10.1.3 with L = 4 and M = 8.
Using complex modulation sequences, three bands of width π/4 centered at −π/4, 0, and π/4 can be translated to baseband and converted with a lowpass ADC.
Figure 10.10 Example 4: Sensitivity of new scheme. Simulation using 10% channel gain mismatch.
These modulation sequences are a0(n) = 1, a1(n) = exp(jπn/4), a2(n) = exp(−jπn/4), and bk(n) = a_k^*(n). Summing the resultant Sk-matrices yields

    \sum_k \mathbf{S}_k = \mathbf{1} + \begin{bmatrix}
    \sqrt{2} & 2 & \sqrt{2} & 0 & -\sqrt{2} & -2 & -\sqrt{2} & 0 \\
    0 & \sqrt{2} & 2 & \sqrt{2} & 0 & -\sqrt{2} & -2 & -\sqrt{2} \\
    -\sqrt{2} & 0 & \sqrt{2} & 2 & \sqrt{2} & 0 & -\sqrt{2} & -2 \\
    -2 & -\sqrt{2} & 0 & \sqrt{2} & 2 & \sqrt{2} & 0 & -\sqrt{2} \\
    -\sqrt{2} & -2 & -\sqrt{2} & 0 & \sqrt{2} & 2 & \sqrt{2} & 0 \\
    0 & -\sqrt{2} & -2 & -\sqrt{2} & 0 & \sqrt{2} & 2 & \sqrt{2} \\
    \sqrt{2} & 0 & -\sqrt{2} & -2 & -\sqrt{2} & 0 & \sqrt{2} & 2 \\
    2 & \sqrt{2} & 0 & -\sqrt{2} & -2 & -\sqrt{2} & 0 & \sqrt{2}
    \end{bmatrix}    (10.29)

where 1 denotes the all-ones matrix.
Unfortunately, using complex modulation sequences is not practical. However, as the modulators and filters are identical for all channels (Hk(z) = H(z) for all k), any other choice of modulation sequences resulting in the same matrix will perform the same function. Moreover, for a decimated system, relaxations may be allowed on the new modulation sequences. In this case, with decimation by four, it is sufficient to find replacement modulation sequences a′k and b′k such that the sum of the resulting S′k-matrices equals Σ Sk on rows 4 and 8, as gcd(L, M) = 4. One such choice of modulation sequences is
    a'_0 = [ 1  1  1  1  1  1  1  1 ]
    a'_1 = [ 1  1  0 -1 -1 -1  0  1 ]
    a'_2 = [ 1  0  0  0 -1  0  0  0 ]
    b'_0 = [ 0  0  0  1  0  0  0  1 ]    (10.30)
    b'_1 = [ 0  0  0  -\sqrt{2}  0  0  0  \sqrt{2} ]
    b'_2 = [ 0  0  0  (\sqrt{2}-2)  0  0  0  (2-\sqrt{2}) ].
The analog modulation sequences a′k can easily be implemented by switching or
grounding the inputs to the modulators, whereas the nontrivial multiplications in
Figure 10.11 Noise model of parallel system.
b′k can be implemented with high precision digitally. Note that

    \sum_k \mathbf{b}_k'^T \mathbf{a}_k' = \mathbf{1} + \begin{bmatrix}
    0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
    -2 & -\sqrt{2} & 0 & \sqrt{2} & 2 & \sqrt{2} & 0 & -\sqrt{2} \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
    2 & \sqrt{2} & 0 & -\sqrt{2} & -2 & -\sqrt{2} & 0 & \sqrt{2}
    \end{bmatrix},    (10.31)

which is equal to Σ Sk in (10.29) on rows 4 and 8. Note also that the S′k-matrices, given on rows 4 and 8 by

    \begin{bmatrix} b'_{0,3} \\ b'_{0,7} \end{bmatrix} \mathbf{a}_0' = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}

    \begin{bmatrix} b'_{1,3} \\ b'_{1,7} \end{bmatrix} \mathbf{a}_1' = \begin{bmatrix} -\sqrt{2} & -\sqrt{2} & 0 & \sqrt{2} & \sqrt{2} & \sqrt{2} & 0 & -\sqrt{2} \\ \sqrt{2} & \sqrt{2} & 0 & -\sqrt{2} & -\sqrt{2} & -\sqrt{2} & 0 & \sqrt{2} \end{bmatrix}

    \begin{bmatrix} b'_{2,3} \\ b'_{2,7} \end{bmatrix} \mathbf{a}_2' = \begin{bmatrix} (\sqrt{2}-2) & 0 & 0 & 0 & (2-\sqrt{2}) & 0 & 0 & 0 \\ (2-\sqrt{2}) & 0 & 0 & 0 & (\sqrt{2}-2) & 0 & 0 & 0 \end{bmatrix},
are circulant on these rows, and thus the system is insensitive to channel mismatches. This is demonstrated in Fig. 10.10, where the channel gain mismatch is
10% and no aliasing results. However, as three levels are used in the analog modulation sequences a′1 and a′2 , the system is sensitive to mismatches in the modulation
sequences of these channels, as described in Sec. 10.2.
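This construction can be spot-checked numerically. A Python sketch, using the sequences of (10.30); only rows 4 and 8 matter for the L = 4 decimated system:

import numpy as np

r2 = np.sqrt(2)
a = [np.array([1, 1, 1, 1, 1, 1, 1, 1]),
     np.array([1, 1, 0, -1, -1, -1, 0, 1]),
     np.array([1, 0, 0, 0, -1, 0, 0, 0])]
b = [np.array([0, 0, 0, 1, 0, 0, 0, 1]),
     np.array([0, 0, 0, -r2, 0, 0, 0, r2]),
     np.array([0, 0, 0, r2 - 2, 0, 0, 0, 2 - r2])]

S_new = sum(np.outer(bk, ak) for ak, bk in zip(a, b))

# Rows 4 and 8 (indices 3 and 7) of the target sum in (10.29):
row4 = np.array([-2, -r2, 0, r2, 2, r2, 0, -r2])
target = 1 + np.vstack([row4, -row4])   # the leading 1 is S_0's all-ones rows

print(np.allclose(S_new[[3, 7]], target))   # True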
10.4 Noise model of system
The primary purpose of this work is to investigate the signal transfer characteristics of a parallel Σ∆-system. However, a system's noise properties are also affected by the choice of modulation sequences, and therefore a simple noise analysis is included.
Figure 10.12 Noise model of channel k.
A noise model of the parallel Σ∆-system can be depicted as in Fig. 10.11. The
quantization noise qk (n) of channel k is filtered through the noise transfer function
NTFk (z) and filter Gk (z). The filtered noise is then modulated by the sequence
bk (n). The channels are summed to form the output y(n).
In order to determine the statistical properties of the output y(n), channel k is
modeled as in Fig. 10.12. Denoting the spectral density of the quantization noise
of channel k by RQk (ejω ), the spectral densities of the polyphase components yk,m
of the channel output can be written
    R_{y_{k,m}}(e^{j\omega}) = b_{k,m}^2 \sum_{l=0}^{M-1} \left| G_{k,l}(e^{j\omega}) \right|^2 R_{Q_k}(e^{j\omega})    (10.32)

where Gk,l(z) are the polyphase components of the cascaded system NTFk(z)Gk(z). It is seen that the noise power is scaled by the factor b²k,m, and it is thus of interest to keep the amplitudes of the modulation sequences low on the digital side. For example, in the generation of the new modulation scheme in Sec. 10.3.4, alternative choices of a′1 and b′2 would have been

    a'_1 = [ 0  1  0 -1  0 -1  0  1 ]
    b'_2 = [ 0  0  0 -2  0  0  0  2 ].
However, in this case the noise power is larger. This shows that the smaller magnitudes of the digital modulation sequences, as in (10.30), are preferable from a
noise perspective.
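A quick numerical check of the b² scaling in (10.32), as a sketch assuming identical filtered-noise spectra across the branches, so that the channel noise power scales with the sum of the squared sequence levels over one period:

import numpy as np

r2 = np.sqrt(2)
b2_chosen = np.array([0, 0, 0, r2 - 2, 0, 0, 0, 2 - r2])   # from (10.30)
b2_alt    = np.array([0, 0, 0, -2,    0, 0, 0, 2])         # alternative

print(np.sum(b2_chosen ** 2), np.sum(b2_alt ** 2))   # about 0.69 versus 8.0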
11 Sigma-delta ADC decimation filters
In this chapter, the design complexity requirements of decimation filters for Σ∆-ADCs are investigated. For fast Σ∆-ADCs with large oversampling factors, the power dissipation of the decimation filter may become a significant part of that of the whole ADC, and is often overlooked. The complexity of linear-phase FIR filter-based decimators is considered for a wide range of oversampling ratios (OSR) and three different modulator structures. Simulation results show the SNR degradation of the ADC as a function of the transition bandwidth and the filter order.
In Sec. 11.1, the necessary notation is introduced and the filter design methodology is discussed. In Sec. 11.2, filter design results are provided.
This chapter has been published in [17].
11.1 Design considerations
Consider the system shown in Fig. 11.1. An analog signal x(t) is sampled and quantized by a Σ∆-modulator to a low-resolution signal u(n) with a high sampling rate.
Figure 11.1 Σ∆-ADC with digital decimation filter.
Figure 11.2 Output quantization noise spectral density function for a 1-bit first-order sigma-delta: (a) ideal (solid line) and practical (dashed line) decimation filter, and shaped quantization noise before filtering (dotted line); (b) filtered quantization noise using ideal (solid line) and practical (dashed line) decimation filter.
The signal is then filtered and downsampled, resulting in a signal y(m) with high resolution and a low sampling rate.
11.1.1 FIR decimation filters
The filters in this chapter are FIR filters designed in the minimax sense (see
Sec. 8.2). It may be argued that minimizing the energy through a least-squares
design is more appropriate in order to maximize the SNR. However, in communications systems it is common that the suppression of blockers imposes minimax
constraints on the filter, thus making minimax design interesting. The filters are
also designed as single-stage filters, whereas they are in practice normally designed
in several stages to relax the overall requirements. In particular, recursive comb
decimators provide efficient filtering with few arithmetic operations. The results in this chapter do not contradict the fact that such structures are efficient, but rather concentrate on the relation between the stringency of the filter requirements and ADC parameters. As a measure of the stringency of the filter requirements, the filter order is used.
Ideally, a decimation filter would be used that passes through all the frequencies
in the signal bandwidth, while completely suppressing all frequencies outside the
signal bandwidth. Such a filter is shown by the solid line in Fig. 11.2(a). In this
case, the signal is degraded only by the quantization noise in the same band as the
signal. However, in practice the filter has a transition band, as exemplified by the
dashed line in Fig. 11.2(a). Then, the signal quality is reduced also by the noise
within the transition band. Moreover, as seen in the output power spectral density
of the filtered quantization noise in Fig. 11.2(b), the noise in the transition band
can constitute a considerable fraction of the total noise power.
Figure 11.3 Decimation filter specification, with passband ripple δc, stopband ripple δs, passband edge π/OSR, and stopband edge (π/OSR)(1 + ∆).
For FIR filter-based decimators, the filter complexity is to the largest extent determined by the transition bandwidth [49]. Further, a significant part of the quantization noise energy of an oversampled Σ∆-modulator is located in this transition band, in particular when higher-order noise shaping is applied. Therefore, there exists a pronounced trade-off between decimation filter complexity and Σ∆-modulator signal-to-noise ratio (SNR). The work in this chapter presents investigations on how the SNR is degraded as a function of the filter transition bandwidth and filter order for various commonly used Σ∆-modulator architectures and oversampling ratios (OSR). It is also demonstrated that, for a given filter order, there exists an optimum choice of the stopband ripple and stopband edge for equi-ripple filter solutions which minimizes the SNR degradation.
11.1.2 Decimation filter specification
The considered filters are N th-order symmetric linear-phase FIR filters that satisfy
the specification shown in Fig. 11.3 where δc , δs , ωc T = π/OSR, and ωs T =
π(1 + ∆)/OSR, denote the passband ripple, stopband ripple, passband edge and
stopband edge, respectively. The filters are designed in the minimax sense using
the McClellan-Parks-Rabiner (MPR) algorithm. The passband ripple is specified
to be 0.01, but after each design it may be slightly smaller as there generally is a
design margin. This is because the MPR algorithm only handles the ratio between
the passband and stopband ripples.
The stopband edge is related to the passband edge through the relative transition bandwidth ∆, defined as

    \Delta = \frac{\omega_s T - \omega_c T}{\omega_c T},    (11.1)

and shown in Fig. 11.3.
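For reference, a minimax design to this specification can be sketched with the Remez exchange algorithm in scipy; the OSR, ∆, filter order, and band weighting below are example values, not the thesis design points:

import numpy as np
from scipy.signal import remez, freqz

OSR, delta, N = 32, 2.0, 200
wc = 1.0 / OSR                     # passband edge as a fraction of Nyquist
ws = (1.0 + delta) / OSR           # stopband edge according to (11.1)

# The band weights set the ratio between passband and stopband ripples,
# which is the quantity the MPR algorithm actually controls.
h = remez(N + 1, [0, wc, ws, 1.0], [1, 0], weight=[1, 10], fs=2.0)

w, H = freqz(h, worN=8192)
pb = 20 * np.log10(np.abs(H[w <= np.pi * wc]))
print("passband peak-to-peak ripple (dB):", pb.max() - pb.min())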
Figure 11.4 Principle of the extraction of quantization noise from a Σ∆-modulator and the filtering thereof through a time-domain realization of the Σ∆-modulator noise transfer function.
11.1.3 Signal-to-noise-ratio
Consider a Σ∆-modulator with an input signal x(n), output signal y(n), and quantization error e(n), as defined in Sec. 9.2. The output signal and noise power of the Σ∆-modulator after filtering are denoted Pyx and Pye, respectively. Assuming that the input signal x(n) is bandlimited to ωcT, and the passband ripple δc of the filter H(z) is small enough so its effect on the signal power can be neglected, Pyx equals the input signal power, i.e., Pyx = Px. The contribution to the output noise power emanates from the passband, transition band, and stopband regions. The corresponding noise powers are denoted Pye^(pb), Pye^(tb), and Pye^(sb), respectively. The total noise power is thus Pye = Pye^(pb) + Pye^(tb) + Pye^(sb), and the SNR at the output is given by

    \mathrm{SNR} = 10 \log_{10} \frac{P_{yx}}{P_{ye}}    (11.2)

Further, the SNR degradation ∆SNR is defined as the degradation in SNR caused by using a practical filter instead of an ideal lowpass filter. Thus,

    \Delta\mathrm{SNR} = 10 \log_{10} \frac{P_{yx}}{P_{ye}^{(pb)}} - 10 \log_{10} \frac{P_{yx}}{P_{ye}} = 10 \log_{10} \frac{P_{ye}}{P_{ye}^{(pb)}}    (11.3)
Using the linear model discussed in Sec. 9.1.3, the noise power is computed as

    P_e = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{Q^2}{12} \left| G(e^{j\omega T}) \right|^2 \left| H(e^{j\omega T}) \right|^2 d(\omega T),    (11.4)

where Q is the quantization step, G(z) is the noise transfer function of the Σ∆-modulator, and H(z) is the z-transform of the decimation filter. The noise power in
the different bands can be computed analogously by using the appropriate integration limits. However, the linear model tends to be less appropriate for coarse quantization steps. In particular, problems arise for one-bit quantization. Therefore, in
this work, the noise power has been evaluated using the model in Fig. 11.4, where
the extracted quantization error e(n) has been filtered through a time-domain realization of the Σ∆-modulator noise transfer function G(z). The obtained sequence
u(n) is then filtered through the decimation filter in order to get the final output
noise. The noise in the different bands is then obtained from the DFT of the output
noise signal. In Fig. 11.5, the noise powers of the two different cases are compared.
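The following Python sketch illustrates this evaluation method for a first-order single-bit modulator; the error-feedback modulator form and the filter parameters are assumptions chosen for the example, not the thesis models:

import numpy as np
from scipy.signal import lfilter, remez

OSR, n = 32, 1 << 16
x = 0.5 * np.sin(2 * np.pi * np.arange(n) / (8 * OSR))

y = np.empty(n); e = np.empty(n); fb = 0.0
for i in range(n):                   # first-order error-feedback modulator
    w = x[i] - fb                    # input minus delayed quantization error
    y[i] = 1.0 if w >= 0 else -1.0   # single-bit quantizer
    e[i] = y[i] - w                  # extracted quantization error
    fb = e[i]

u = lfilter([1.0, -1.0], [1.0], e)   # e(n) shaped by G(z) = 1 - z^-1
h = remez(511, [0, 1.0 / OSR, 3.0 / OSR, 1.0], [1, 0], fs=2.0)
v = lfilter(h, [1.0], u)             # shaped noise after the decimation filter

P = np.abs(np.fft.rfft(v)) ** 2 / n
f = np.linspace(0.0, 1.0, len(P))    # frequency as a fraction of Nyquist
print("passband noise power:  ", P[f <= 1.0 / OSR].sum())
print("transition band power: ", P[(f > 1.0 / OSR) & (f <= 3.0 / OSR)].sum())
print("stopband noise power:  ", P[f > 3.0 / OSR].sum())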
11.2 Simulation results
For the three Σ∆-modulator topologies shown in Figs. 9.5–9.7, two different investigations have been performed. In the first, the transition bandwidth of the decimation
filter has been varied for different OSR and filter orders and the corresponding
SNR degradation has been calculated. The results of the investigation are plotted
in Fig. 11.6. For a given transition bandwidth and filter order the stopband ripple
is fixed. From the figures, it can be seen that the SNR degradation has a minimum
for a certain choice of ∆.
In the second investigation, the optimal choice of the relative transition bandwidth ∆ has been found for filter orders between 100 and 1000 for different OSR.
The optimal ∆'s and the corresponding SNR degradations for the three Σ∆-modulator topologies are shown in Figs. 11.7–11.9. Decreasing the transition
bandwidth below the optimum causes the SNR to worsen considerably, because of
rapidly decreasing stopband attenuation. Also, for large enough transition bands,
increasing the filter order will yield no significant SNR improvements because essentially all the noise power is from the transition band region. It should be noted
that optimal SNR may occur for a transition bandwidth several times wider than
the passband width.
Figure 11.5 Quantization noise power of Σ∆-modulators after decimation filter, according to the linear noise model and the computed quantization noise, for one to four quantization bits. The OSR is 32. (a) Σ∆-modulators of order one and two; (b) power in different bands of decimation filter, using second-order Σ∆-modulator.
Figure 11.6 SNR degradation as a function of relative transition bandwidth ∆ for the different modulator structures, for various OSR and filter orders N: (a) first-order Σ∆-modulator; (b) second-order Σ∆-modulator; (c) MASH Σ∆-modulator.
Figure 11.7 SNR-optimal choice of ∆ and minimal SNR degradation as functions of the filter order for the first-order Σ∆-modulator: (a) SNR-optimal choice of ∆; (b) minimal SNR degradation.
Figure 11.8 SNR-optimal choice of ∆ and minimal SNR degradation as functions of the filter order for the second-order Σ∆-modulator: (a) SNR-optimal choice of ∆; (b) minimal SNR degradation.
Figure 11.9 SNR-optimal choice of ∆ and minimal SNR degradation as functions of the filter order for the MASH Σ∆-modulator: (a) SNR-optimal choice of ∆; (b) minimal SNR degradation.
12 High-speed digital filtering
In this chapter, the bit-level design of high-speed FIR filters is considered. For
a given filter impulse response, the design process is divided into four steps: the
choice of a filter architecture, the generation of partial products from the input,
the identification of shared subexpressions, and the construction of an efficient
pipelined reduction tree for the partial products. The focus is on the last two parts. For the identification of common subexpressions, a heuristic algorithm is
proposed that identifies possible shared subexpressions before their allocation to
adders in the reduction tree. In the construction of the reduction tree, a proposed
bit-level optimization algorithm minimizes a weighted sum of the number of full
adders, half adders and registers.
In Sec. 12.1, the investigated architectures are presented, including the direct-form structure, the transposed direct-form structure, and a new direct-form structure utilizing partial symmetry adders. In Sec. 12.2, the complexities of the architectures are estimated for different wordlengths and pipeline levels. In Sec. 12.3, the proposed algorithm for partial product redundancy reduction is explained, and in Sec. 12.4.1, the proposed optimization algorithm is discussed. Results comparing the different architectures are finally given in Sec. 12.5.
Secs. 12.1, 12.2 and 12.4.1 have been published in [8], whereas Sec. 12.3 has
been published in [9].
12.1 FIR filter realizations
The two main architectures for FIR filters are the direct-form (DF) architecture
and the transposed direct-form (TF) architecture. These were presented in Sec. 8.4,
and multi-rate structures using partial product generation and reduction trees using carry-save adders were shown. The suitability of different architectures for
high-speed FIR filters is investigated. For each of the considered architectures, the
implementation can be partitioned into three parts: a partial product generation
stage that realizes the filter coefficient multiplications, a carry-save adder (CSA)
reduction tree to efficiently reduce the number of partial products, and a vector
merge adder (VMA) to merge the CSA output to a non-redundant binary form.
In [1,42,45,67], different methods of generating partial products for given filters
were investigated. However, here the focus is rather on the generation of an efficient
pipelined reduction tree. This is done through a formulation as an integer linear
programming (ILP) problem, with which a bit-level optimized reduction tree can
be obtained. As cost function for the optimization algorithm, a weighted sum of
the number of full adders, half adders, and registers is used. The model is similar
to that presented in [59], but was not formulated as an ILP problem there.
Compared with the traditional heuristic methods: Wallace trees [97], Dadda
trees [30], and Reduced Area [6] trees, the bit-level optimization yields better results
for a number of reasons. The aggressive use of half adders in Wallace trees leads
to fast reductions, but generally a more efficient use of half adders is possible.
The Dadda structure uses half adders more restrictively only to try to maximize
the opportunities to use full adders. However, only placing full adders as late as
possible makes the structure unsuitable for pipelining. It is also the case that the
heuristics work well for reduction trees for general multipliers but less so for other
reduction trees. For example, the Reduced Area heuristic is claimed to be optimal
in terms of hardware resources for general multipliers, but simulations provided
in this paper show that this is not necessarily the case for general partial product
trees. Moreover, the heuristics do not consider that partial products might be
added at different levels in the reduction tree, which is the case for several of the
architectures considered in this paper.
It should be noted that the proposed realizations can be applied to reduction trees in other applications as well. One possible application is Merged arithmetic [91], where the results of several multiplications are summed in a carry-save
adder tree. Other examples are high-speed complex multipliers implemented using distributed arithmetic [5], and implementation of general functions as sums of
weighted bit-products [54].
12.1.1 Architectures
The direct-form (DF1) architecture is depicted in Fig. 12.1. In this architecture, a
delay chain at the input provides the algorithmic delays, and an adder tree sums
the partial products generated from the taps. An interesting characteristic of the
direct-form architecture is its ability to utilize the symmetry of linear-phase FIR
Figure 12.1 Direct-form architecture (DF1).
Figure 12.2 Direct-form architecture utilizing coefficient symmetry (DF2).
filter coefficients. This architecture is depicted in Fig. 12.2, and is denoted the
DF2 architecture. The benefit of utilizing the symmetry is that the number of
multiplications is halved. However, in the application of high-speed decimation
filters, this feature is not necessarily efficient because of the need for short critical
paths. The inability to implement the symmetry adders using carry save arithmetic
leads to excessive use of registers in pipelined ripple-carry adders. This can be
readily observed in the experimental results in Sec. 12.5. A solution to this problem
was suggested in [33], but demonstrated in [45] to have limited efficiency. As a
possible alternative solution, dividing the symmetry adders into smaller blocks of
adders as depicted in Fig. 12.3 is considered. This solution is denoted partial
symmetry, and will reduce the register complexity as the carry propagation chain
is broken. However, partial symmetry results in slightly more partial products than
Figure 12.3 Partial symmetry adder.
Figure 12.4 Transposed direct-form architecture (TF).
full symmetry does. The partial symmetry architecture is referred to as the DF3
architecture.
The TF architecture is depicted in Fig. 12.4. For high-speed decimation filters,
the TF architecture may provide a more register-efficient realization, as the algorithmic delay elements are also used as pipelining registers in the summation tree.
Because of the architecture’s limited ability to utilize filter coefficient symmetry,
however, the TF architecture may require more adders than the direct-form architecture. There are two reasons why symmetry cannot be utilized. First, the symmetric multiplier may be connected to a different input because of the polyphase
decomposition. Second, instead of computing each multiplication explicitly, all
generated partial products are merged in one carry-save tree. As the number of
cascaded adders in each stage is restricted, there may be additional CSA stages to
reduce the number of partial products before the VMA. Hence, the output of each
CSA stage is not necessarily represented using at most two bits for each bit weight.
For each of the depicted architectures in Fig. 12.1, Fig. 12.2 and Fig. 12.4, the
structure performs internal decimation by M. M may be 1 in order to describe a single-rate filter.
Figure 12.5 Carry-save adder tree pipelined in J stages.
For an interpolating filter, M = 1 and the filter consists of several
independent reduction trees for the outputs. Normally, the delays at the inputs
are shared between the branches for the DF architectures. A structure may also
contain both several inputs and several outputs in order to describe general multirate filters. However, the results in this thesis are constrained to simple decimation
and interpolation filters.
For both the DF architectures and the TF architecture, the result after partial
product generation is a number of partial products with different bit weights and
different delays. The general structure for the summation tree is shown in Fig. 12.5,
where the carry-save adder is divided into J stages. The stages are separated by
pipeline registers, and inputs are accepted in all stages. Each stage has the structure
shown in Fig. 12.6, allowing a maximum adder depth of K levels. Again, partial
products may be added in every level. Considering the investigated architectures,
for the DF1 architecture all partial products are added in the first level of the
first stage. For the TF architecture partial products are added in the first level of
several stages, as the pipeline registers are also used as algorithmic delays. Finally,
partial products are added in several levels of the first stages for the DF2 and DF3
architectures, as the inputs are delayed by the symmetry adders.
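The mechanics of such a stage can be sketched in Python: full adders take three bits of equal weight to a sum and a carry, half adders take two, and only the full adders reduce the total number of partial products. This greedy packing is illustrative only and is not the bit-level ILP optimization of Sec. 12.4.1:

def reduce_level(cols, use_half_adders=True):
    # cols[i] = number of partial-product bits of weight 2^i.
    # Returns the column heights after one level plus the adder counts.
    out = [0] * (len(cols) + 1)
    fa = ha = 0
    for i, height in enumerate(cols):
        full = height // 3              # greedy full-adder packing
        rest = height % 3
        fa += full
        out[i] += full + rest           # sums plus untouched bits
        out[i + 1] += full              # carries into the next column
        if use_half_adders and rest == 2:
            ha += 1                     # Wallace-style aggressive HA use
            out[i] -= 1                 # two bits become one sum here ...
            out[i + 1] += 1             # ... and one carry in the next column
    while out and out[-1] == 0:
        out.pop()
    return out, fa, ha

# Example: reduce the columns of a small partial-product tree to height two.
cols = [3, 5, 6, 6, 5, 3, 1]
while max(cols) > 2:
    cols, fa, ha = reduce_level(cols)
    print(cols, "FA:", fa, "HA:", ha)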
Figure 12.6 One stage of a carry-save adder tree, with a maximum of K adders in the critical path.
12.1.2 Partial product generation
As the multiplier coefficients are known, it is not required to use general multipliers. Instead, only the partial products corresponding to non-zero coefficient bits are generated. Here, it is shown how partial products are generated using different representations of the coefficients and using both signed and unsigned input data.

An input data X with a wordlength of wd bits can be written as

    X = \sum_{i=0}^{w_d - 1} x_i 2^i    (12.1)

for unsigned data with an input range of 0 ≤ X ≤ 2^{wd} − 1, and

    X = -x_{w_d - 1} 2^{w_d - 1} + \sum_{i=0}^{w_d - 2} x_i 2^i    (12.2)

for signed (two's complement) data with an input range of −2^{wd−1} ≤ X ≤ 2^{wd−1} − 1, where for both (12.1) and (12.2) xi ∈ {0, 1}. Note that the input data is considered to be integer instead of fractional. However, this is only to be consistent with the numbering used later on, where the bits corresponding to the smallest weight (the LSBs) have index 0.

Similarly, the wc-bit coefficients, h(n), can be written as

    h(n) = \sum_{j=0}^{w_c - 1} h_{n,j} 2^j    (12.3)

and

    h(n) = -h_{n,w_c - 1} 2^{w_c - 1} + \sum_{j=0}^{w_c - 2} h_{n,j} 2^j,    (12.4)
Figure 12.7 Resulting partial products when multiplying three-bit signed data with 29: (a) binary representation; (b) signed-digit representation. White dots correspond to negated partial products.
where hn,j ∈ {0, 1}.

Considering the case of unsigned data and unsigned binary coefficients first, the multiplication of the input and a filter coefficient can be written

    X h(n) = \left( \sum_{i=0}^{w_d - 1} x_i 2^i \right) \left( \sum_{j=0}^{w_c - 1} h_{n,j} 2^j \right) = \sum_{j=0}^{w_c - 1} \sum_{i=0}^{w_d - 1} h_{n,j} x_i 2^{i+j}.    (12.5)
Now, as some of the hn,j -bits are known to be zero, only bits corresponding to
non-zero hn,j have to be added.
If instead a signed-digit (SD) representation of the coefficient is used, i.e.,
hn,j ∈ {−1, 0, 1}, the number of non-zero hn,j can be decreased. A signed-digit
representation with a minimum number of non-zero positions is called a minimum
signed-digit (MSD) representation. In general there is no unique MSD representation, but introducing the constraint that no two adjacent positions should both be
non-zero, i.e., hn,j hn,j+1 = 0 will result in the canonic signed-digit (CSD) representation, which is both unique and minimum.
Using a SD representation now requires that partial products can be both added
and subtracted. By enabling subtraction, signed input data and coefficients are also
allowed. (12.5) will now contain partial products which may be both positive and
negative. Note that −a = ā − 1 for a ∈ {0, 1}. Hence, negative partial products
can be handled by inverting the partial product and adding a constant number.
The constant numbers corresponding to all partial products can be merged to
one constant binary number in the filter computation. In Fig. 12.7, the partial products resulting from multiplying a three-bit signed input with the coefficient 29 are illustrated for both binary and CSD representations of the coefficient. The
corresponding constants to add are −4−16−32−64 = −116 and −4−4−8−128 =
−144 for the binary and CSD representation, respectively. It should be noted that
for the TF architecture, the output will be correct only after the first ⌈(N − 1)/M ⌉
samples. If correct output from the first sample is required, one solution is custom
initialization of each stage register.
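A conventional CSD recoding routine can be sketched as follows; the function name is an assumption for the example:

def to_csd(n):
    # Return the CSD digits of integer n, least-significant digit first,
    # each digit in {-1, 0, 1}, with no two adjacent digits non-zero.
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)      # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
        else:
            d = 0
        digits.append(d)
        n = (n - d) // 2
    return digits

# 29 = 11101 in binary, four non-zero bits; CSD gives 32 - 4 + 1,
# i.e. three non-zero digits: [1, 0, -1, 0, 0, 1].
print(to_csd(29))

Each non-zero digit corresponds to one row of partial products, so the CSD recoding directly reduces the size of the reduction tree.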
12.2 Implementation complexity

12.2.1 Adder complexity
As only the full adders reduce the number of partial products, the required number of full adders for the carry-save summation tree can easily be calculated as the difference between the number of generated partial products and the output wordlength. For the DF1 architecture and the TF architecture, the number of generated partial products can be written wd(N + 1)Na, where N is the filter order, and Na is the average number of non-zero digits in the filter coefficients. For the DF2 architecture, the number of coefficients is halved, whereas the input wordlength is increased by one due to the symmetry adders. Thus the number of generated partial products can be written (wd + 1)((N + 1)/2)Na. Finally, for the DF3 architecture, each symmetry adder increases its input wordlength with one, and hence the total number of partial products can be written (wd + ⌈wd/S⌉)((N + 1)/2)Na, assuming that partial symmetry adder groups of S bits are used. Depending on the representation used for coefficient and input data, there may also be a number of constant ones to add.
In all architectures, the required number of output bits, assuming the general
case with P
signed full-scale data and signed coefficients, can be written wout =
wd + log2 |h(k)|, where h(k) is the impulse response. Thus the required number
of full adders can be written
X
NFA,DF1 = NFA,TF = wd ((N + 1)Na − 1) − log2
|h(k)|
(12.6)
for the DF1 architecture and the TF architecture,
X
N +1
NFA,DF2 = (wd + 1)
Na − wd − log2
|h(k)|
2
for the DF2 architecture, and
l w m N + 1 X
d
NFA,DF3 = wd +
Na − wd − log2
|h(k)|
S
2
(12.7)
(12.8)
for the DF3 architecture.
Also, the complexity of the symmetry adders for the DF2 architecture is $\frac{N+1}{2}$ $w_d$-bit adders, resulting in a number of adder cells equal to

$N_{FA,sym,DF2} = (w_d - 1) \frac{N+1}{2}$    (12.9)

and

$N_{HA,sym,DF2} = \frac{N+1}{2}$.    (12.10)
For the DF3 architecture the number of partial symmetry adders is $\left\lceil \frac{w_d}{S} \right\rceil \frac{N+1}{2}$ S-bit adders, resulting in a number of adder cells equal to

$N_{FA,sym,DF3} = \left( w_d - \left\lceil \frac{w_d}{S} \right\rceil \right) \frac{N+1}{2}$    (12.11)

and

$N_{HA,sym,DF3} = \left\lceil \frac{w_d}{S} \right\rceil \frac{N+1}{2}$.    (12.12)
Using these equations, the total required number of adders for the DF architectures can be calculated. It should be noted, however, that the equations
(12.6)–(12.8) do not take into account the half adders that are usually needed to
rearrange the partial products, but nevertheless an approximate condition can be
determined for when the DF2 architecture results in a structure with smaller adder
complexity compared with DF1. This condition is
NFA,DF1 > NFA,DF2 + NFA,sym,DF2 + NHA,sym,DF2 ,
(12.13)
which can be approximated to

$w_d > \frac{N_a}{N_a - 1}$,    (12.14)
if the costs of half and full adders are considered equal. However, as the half
adders of the CSA trees have been ignored, the condition should be considered
as a guideline rather than as a strict rule. In the investigated application where,
typically, both wd and Na are low, utilizing the coefficient symmetry often does
not lead to reduced adder complexity.
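For reference, these counts are easy to evaluate numerically. The sketch below (illustrative Python, with $N_a$ computed from the binary coefficient digits and non-negative coefficients assumed) reproduces $N_a = 1.8$ and the break-even point $w_d > 2.25$ used for the 4-tap filter in Sec. 12.5:

```python
import math

def avg_nonzero_digits(h):
    """Average number of non-zero digits N_a for binary coefficients."""
    return sum(bin(c).count("1") for c in h) / len(h)

def n_fa_df1(wd, h):
    """Full adders for the DF1 and TF architectures, Eq. (12.6)."""
    N, Na = len(h) - 1, avg_nonzero_digits(h)
    return wd * ((N + 1) * Na - 1) - math.log2(sum(h))

def n_fa_df2(wd, h):
    """Full adders for the DF2 architecture, Eq. (12.7)."""
    N, Na = len(h) - 1, avg_nonzero_digits(h)
    return (wd + 1) * (N + 1) / 2 * Na - wd - math.log2(sum(h))

# Cascaded 4-tap moving-average filter, Eq. (12.41):
h4 = [1, 3, 6, 10, 12, 12, 10, 6, 3, 1]
Na = avg_nonzero_digits(h4)                            # 1.8
print("DF2 pays off only for wd >", Na / (Na - 1))     # 2.25, cf. Eq. (12.14)
```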
12.2.2 Register complexity
Regarding the register complexity, it is possible to find expressions that are asymptotically valid. However, for the considered applications these expressions convey
little information, and expressions that are valid for low filter orders and short
wordlengths are difficult to find. Thus, the register complexities due to algorithmic
delays are calculated here, whereas those due to pipelining of the adder trees are
determined experimentally.
For the DF architectures, the algorithmic delays are applied at the input, and
the register complexity due to these can be written
$N_{R,DF1} = w_d N$.    (12.15)
If the symmetry of the coefficients is utilized, the implementation carries an additional complexity due to pipelining of the symmetry adders. The way the pipelining
is done is shown in Fig. 12.8. In addition to the registers needed to restrict the
length of the critical path, registers are also placed at the outputs of the full adders
just before the cuts. The reason for this is the definition of the reduction trees in
Sec. 12.4.1, which does not accept inputs just before pipeline cuts. If an n-bit
ripple-carry adder with a maximum of m adders in the critical path is considered,
the required number of pipeline registers can be written
$N_{R,RC}(n, m) = \sum_{k=1}^{\lfloor n/m \rfloor} 2(n - mk + 1)$.    (12.16)
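Eq. (12.16) is straightforward to sanity-check numerically; a one-line Python version (illustrative only):

```python
def n_r_rc(n, m):
    """Pipeline registers of an n-bit ripple-carry adder with at most m
    adders in the critical path, Eq. (12.16)."""
    return sum(2 * (n - m * k + 1) for k in range(1, n // m + 1))

print(n_r_rc(8, 2))   # 2*(7 + 5 + 3 + 1) = 32 registers
```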
Figure 12.8 Pipelined ripple carry adder.
The register complexity of the DF2 architecture can then be written

$N_{R,DF2} = w_d N + \frac{N+1}{2} N_{R,RC}(w_d, K)$,    (12.17)
where K is the maximum allowed number of adders in cascade. For the partial symmetry case, the number of registers is smaller (for S < w_d). The number of registers, assuming S = nK for an integer n, is

$N_{R,DF3} = w_d N + \frac{N+1}{2} \left( \left\lfloor \frac{w_d}{S} \right\rfloor N_{R,RC}(S, K) + N_{R,RC}(w_d \bmod S, K) \right)$.    (12.18)
For (12.17) and (12.18) it is assumed that no sharing of registers between the algorithmic delays and symmetry adder pipeline registers has been performed. If sharing is considered, the register complexity of the DF2 architecture can be written

$N_{R,DF2} = w_d N + \left\lfloor \frac{w_d}{K} \right\rfloor (N+1) + \sum_{m=0}^{M-1} \sum_{i=0}^{\lceil (N+1-m)/M \rceil - 1} \sum_{k=1+i}^{\lfloor w_d/K \rfloor} (w_d - Kk)$.    (12.19)

If, for simplicity, $w_d = jS$ for an integer j is assumed, the register complexity of the DF3 architecture can be written

$N_{R,DF3} = w_d N + jn(N+1) + j \sum_{m=0}^{M-1} \sum_{i=0}^{\lceil (N+1-m)/M \rceil - 1} \sum_{k=1+i}^{n} (S - Kk)$.    (12.20)
For the TF architecture, the algorithmic delays are merged with the pipeline
registers, and all registers are in the adder tree.
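Building on n_r_rc above, the closed-form register counts (12.17) and (12.18) can be written as the following illustrative sketch (assuming an odd filter order N, so that (N+1)/2 is an integer, and S = nK):

```python
def n_r_df2(wd, N, K):
    """Registers for the DF2 architecture without sharing, Eq. (12.17)."""
    return wd * N + (N + 1) // 2 * n_r_rc(wd, K)

def n_r_df3(wd, N, K, S):
    """Registers for the DF3 architecture, Eq. (12.18)."""
    return wd * N + (N + 1) // 2 * (wd // S * n_r_rc(S, K)
                                    + n_r_rc(wd % S, K))

print(n_r_df2(6, 9, 2), n_r_df3(6, 9, 2, 2))   # 144 vs. 84 registers
```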
Figure 12.9 Structure of original reduction tree.
12.3 Partial product redundancy reduction
Consider the general structure of the original reduction tree as shown in Figs. 12.5
and 12.6. For clarity, the structure is expanded in Fig. 12.9. Let a subexpression denote three partial products with the same weight, on the same stage and
level in the same reduction tree. The sharing is then based on identifying common
subexpressions occurring in several places in the reduction trees, i.e., several occurrences of the same set of three partial products. Thus the sharing identification
is performed before the assignment of partial products to adders. Subexpressions
can be shared in the same tree or between trees, and in the same stage or between stages. However, in order not to affect the delay of the intermediate partial
products, subexpressions can not be shared between different levels. For each
subexpression to be shared, the occurrences of the partial products are removed
from the reduction trees, an adder for the subexpression is introduced, and partial
products corresponding to the sum and carry outputs are added on the level below
each subexpression occurrence. The resulting reduction tree structure is shown in
Fig. 12.10.
Several different reduction trees can share the same full adder. However, there can not be any full adder that has inputs from more than one adder tree. Hence, the search for possible assignments needs to be done inside each adder tree only, but to determine which assignment is most beneficial one should consider all adder
trees. For the first level of the first stage all partial products are from the partial product generation stage.

Figure 12.10 Structure of reduction tree with redundancy reduction.

Figure 12.11 Equivalence transforms for sign normalization of subexpressions.

For later levels there may still be redundancy, as the
same partial product result may end up in several columns. Considering the DF
and TF architectures, adder sharing can not be expected to consistently yield better
results with one of the architectures than the other. In the DF architecture, all
partial products are added in the first level, increasing the number of possible subexpressions that can be considered for sharing. On the other hand, the delay chain
at the input also increases the number of different partial products, decreasing the probability that the same partial products occur in several columns.
If signed data and/or coefficients are used, some of the partial products may be
inverted. However, it is possible to apply transformations to increase the possible
sharing by propagating inversions from the inputs to the outputs. This is illustrated
in Fig. 12.11.
Figure 12.12 Relations between the BITS variables in the ILP problem.
12.3.1 Proposed algorithm
The goal of the proposed algorithm is to identify subexpressions that occur more than once. It can be described as follows (a code sketch is given after the list):
1. Represent the filter coefficients in a suitable representation and generate one
or more adder trees to be computed.
2. For each column in each level in each reduction tree, find the possible subexpressions, normalized in terms of possible inversion.
3. Determine the maximum frequency subexpression. In case of a tie, pick one
at random.
4. If the chosen subexpression has only one occurrence, finish.
5. Assign the subexpression to a full adder, remove the corresponding entries
from the adder trees, and add the sum and carry outputs as partial products
in the corresponding columns at the level below each subexpression occurrence.
6. Go to 2.
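A compact Python sketch of steps 1-6 is given below. It is illustrative only: partial products are plain labels assumed already sign-normalized (cf. Fig. 12.11), ties in step 3 are broken by the Counter ordering rather than at random, and subexpressions are only matched within one level, as required above.

```python
from collections import Counter
from itertools import combinations

def share_subexpressions(columns):
    """Greedy sketch of the sharing algorithm. `columns` maps a cell
    (tree, stage, level, bit) to a list of sign-normalized partial-product
    labels. A subexpression is a set of three labels in one cell; it may
    recur in other cells on the same level, never across levels. Each
    chosen subexpression gets one shared full adder whose sum and carry
    labels are pushed to the level below (step 5)."""
    n_shared = 0
    while True:
        freq = Counter()
        for (tree, stage, level, bit), prods in columns.items():
            for sub in combinations(sorted(set(prods)), 3):
                freq[(level, sub)] += 1
        if not freq or freq.most_common(1)[0][1] < 2:
            break                                  # step 4: nothing recurs
        (lvl, sub), _ = freq.most_common(1)[0]     # step 3: most frequent
        s, c = f"s{n_shared}", f"c{n_shared}"      # shared adder outputs
        n_shared += 1
        for (tree, stage, level, bit), prods in list(columns.items()):
            if level == lvl and all(p in prods for p in sub):
                for p in sub:
                    prods.remove(p)
                columns.setdefault((tree, stage, level + 1, bit), []).append(s)
                columns.setdefault((tree, stage, level + 1, bit + 1), []).append(c)
    return n_shared
```

Each sharing step removes three partial products per occurrence and adds two, so the total partial product count strictly decreases and the loop terminates.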
12.4 ILP optimization

12.4.1 ILP problem formulation
Denote the stage height, i.e., the maximum number of cascaded adders, by K, as
in Fig. 12.6. Denote also the number of stages by J, as in Fig. 12.5. Furthermore,
denote the output wordlength by wout , and the number of input partial products
in each stage j, level k and bit position b by INBITS j (k, b), k ∈ {0, 1, 2, . . . , K −1},
Figure 12.13 Relations between the stages in the ILP problem.
b ∈ {0, 1, 2, . . . , wout − 1}. As variables for the ILP problem, the number of full
adders FAj (k, b) and half adders HAj (k, b) are used, with the same parameter
bounds as INBITS . The resulting number of partial products in each level is
denoted $BITS_j(k, b)$, and is defined by $BITS_0(0, b) = INBITS_0(0, b)$ for the first level
of the first stage. As each full adder reduces three partial products to one of the
same weight and one of the next higher weight, and each half adder converts two
partial products to one of the same weight and one of the next higher, the resulting
number of partial products in each following level can be written
BITS j (k + 1, b) = BITS j (k, b) + INBITS j (k, b) − 2FAj (k, b) − HAj (k, b)
+ FAj (k, b − 1) + HAj (k, b − 1),
(12.21)
for k ∈ {0, 1, 2, . . . , K − 1}, b ∈ {0, 1, 2, . . . , wout − 1}, and with variables equal to
zero for out-of-bounds arguments. The relations between the BITS variables are
depicted in Fig. 12.12. Connections between the stages are defined by
BITS j+1 (0, b) = BITS j (K, b).
(12.22)
Variables REGS j (b) denoting the number of pipeline registers for each stage are
defined by
REGS j (b) = BITS j (K, b).
(12.23)
Thus, registers are added for all signals between the stages, as shown in Fig. 12.13.
The number of adders in each level is limited by the constraint
3FAj (k, b) + 2HAj (k, b) ≤ BITS j (k, b) + INBITS j (k, b),
(12.24)
as the number of adder inputs can not exceed the number of partial products, for
each level k and bit position b. Also, in order to utilize a VMA to sum the output,
the number of output bits from the last stage is limited to 2 by the condition
BITS J−1 (K, b) + INBITS J−1 (K, b) ≤ 2,
(12.25)
for b ∈ {0, 1, 2, . . . , wout − 1}. Costs are defined for full adders, half adders and
registers as CFA , CHA , and CREG , respectively, and the filter structure is optimized
by minimizing the sum

$C = C_{FA} \sum_{j=0}^{J-1} \sum_{k,b} FA_j(k, b) + C_{HA} \sum_{j=0}^{J-1} \sum_{k,b} HA_j(k, b) + C_{REG} \sum_{j=0}^{J-1} \sum_{b} REGS_j(b)$.    (12.26)
The optimization problem as specified by (12.21)–(12.26) does not consider the
length of the VMA. However, it may be possible to significantly reduce the length
by introducing half adders to the least significant bits. The optimization problem
can be modified to achieve a shorter VMA by adding a constraint to limit the
number of output partial products to one for a number m of the least significant
bits. This can be done by the constraint
BITS J−1 (K, b) + INBITS J−1 (K, b) ≤ 1,
(12.27)
for b ∈ {0, 1, 2, . . . , m − 1}.
The problem complexity can be significantly reduced by adding further constraints. In particular, there will never be a reason, in terms of efficient reduction of the number of partial products, not to insert a full adder where at least three partial products are available. Hence, the number of full adders in a given position can be defined based on the number of partial products available as

$FA_k(l, b) = \left\lfloor \frac{BITS_k(l, b) + INBITS_k(l, b)}{3} \right\rfloor$,    (12.28)
which can be formulated as two linear constraints for each variable.
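To make the formulation concrete, the sketch below sets up (12.21)-(12.26) with the open-source PuLP modelling package; PuLP is an assumption here for illustration only (the thesis used SCIP [2] through the generator in [117]), and the refinements (12.27) and (12.28) are omitted. BITS and REGS are kept as linear expressions rather than variables, which is equivalent:

```python
import pulp

def build_tree_ilp(inbits, J, K, wout, c_fa=3, c_ha=2, c_reg=3):
    """Illustrative ILP for one carry-save reduction tree.
    `inbits` maps (j, k, b) to INBITS_j(k, b); missing keys mean zero."""
    prob = pulp.LpProblem("csa_tree", pulp.LpMinimize)
    idx = [(j, k, b) for j in range(J) for k in range(K) for b in range(wout)]
    fa = pulp.LpVariable.dicts("FA", idx, lowBound=0, cat="Integer")
    ha = pulp.LpVariable.dicts("HA", idx, lowBound=0, cat="Integer")

    bits = {(0, 0, b): inbits.get((0, 0, b), 0) for b in range(wout)}
    for j in range(J):
        for k in range(K):
            for b in range(wout):
                avail = bits[(j, k, b)] + inbits.get((j, k, b), 0)
                # Eq. (12.24): adder inputs cannot exceed available bits
                prob += 3 * fa[(j, k, b)] + 2 * ha[(j, k, b)] <= avail
                # Eq. (12.21): partial products on the next level
                carry_in = (fa[(j, k, b - 1)] + ha[(j, k, b - 1)]
                            if b > 0 else 0)
                bits[(j, k + 1, b)] = (avail - 2 * fa[(j, k, b)]
                                       - ha[(j, k, b)] + carry_in)
        if j + 1 < J:
            for b in range(wout):
                # Eqs. (12.22)-(12.23): stage outputs are registered
                bits[(j + 1, 0, b)] = bits[(j, K, b)]
    for b in range(wout):
        # Eq. (12.25): at most two bits per column into the final VMA
        prob += bits[(J - 1, K, b)] + inbits.get((J - 1, K, b), 0) <= 2
    # Eq. (12.26): weighted cost of full adders, half adders and registers
    regs = pulp.lpSum(bits[(j, K, b)] for j in range(J) for b in range(wout))
    prob += (c_fa * pulp.lpSum(fa.values())
             + c_ha * pulp.lpSum(ha.values()) + c_reg * regs)
    return prob
```

Calling prob.solve() with PuLP's bundled CBC backend then returns an adder placement; the default costs match the values used in Sec. 12.5.1.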
It should be noted that the optimization problem, as formulated, is independent
of the coefficient representation, i.e., binary as well as any signed-digit representation may be used. However, if signed digits are used, i.e., if the data is signed or the coefficient contains negative digits, a constant term must be added to
the sum, as discussed in Sec. 12.1.2. As the placement of the term in the tree
is arbitrary, the problem can be modified to insert the bits where they fit well.
How to formulate the optimization problem to accommodate the constant term
is explained in Sec. 12.4.6. In Sec. 12.4.2 to 12.4.5, the presented architectures are
formulated as initial conditions for the optimization problem. In these formulations, the coefficient digits hn,j are defined as in (12.3) or (12.4), with hn,j ∈ {0, 1}
for binary coefficients and hn,j ∈ {−1, 0, 1} for signed-digit coefficients.
12.4.2 DF1 architecture
For the DF1 architecture, all partial products are inserted in the first adder stage. The sum of the partial products is $\sum_{n=0}^{N} \sum_{i=0}^{w_d - 1} \sum_{j=0}^{w_c - 1} |h_{n,j}| 2^{i+j}$. Substituting b = i + j and rearranging the sums allows the number of bit products of weight b to be written

$INBITS_0(0, b) = \sum_{n=0}^{N} \sum_{j=0}^{w_d - 1} |h_{n,b-j}|$.    (12.29)
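For example, (12.29) can be evaluated directly from the coefficient digit vectors; the following sketch (illustrative, using the binary digits of the h4(k) coefficients from (12.41), three-bit data, and an output wordlength of nine bits) tabulates $INBITS_0(0, b)$:

```python
def inbits_df1(h_digits, wd, wout):
    """INBITS_0(0, b) of Eq. (12.29): partial products of weight b when
    all products enter the first stage. `h_digits` holds the coefficient
    digit vectors h_n (LSB first, magnitudes only)."""
    return {b: sum(abs(hn[b - j]) for hn in h_digits for j in range(wd)
                   if 0 <= b - j < len(hn))
            for b in range(wout)}

h4 = [1, 3, 6, 10, 12, 12, 10, 6, 3, 1]
h_digits = [[(c >> i) & 1 for i in range(c.bit_length())] for c in h4]
print(inbits_df1(h_digits, wd=3, wout=9))
```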
12.4.3 DF2 architecture
If the direct-form architecture is modified to utilize coefficient symmetry, the symmetry adders will add additional delay. Thus, the partial products involving bit 0
(the LSB) of the data are added in level 1, the partial products involving bit 1 of
the data are added in level 2, and so on. Generally, the number of initial partial
products in stage j and level k can be written
$INBITS_j(k, b) = \sum_{n=0}^{(N+1)/2} |h_{n,b-Kj-k+1}|$    (12.30)

for $1 \le Kj + k \le w_d - 1$, and

$INBITS_j(k, b) = \sum_{n=0}^{(N+1)/2} \left( |h_{n,b-Kj-k+1}| + |h_{n,b-Kj-k}| \right)$    (12.31)

for $Kj + k = w_d$.
12.4.4 DF3 architecture
For the partial symmetry case, the contributions of the different adders are separated. Assuming that a symmetry width of S adders is used, the partial products can be split into $\lceil w_d/S \rceil$ bins, where the first contains partial products from the S least significant data bits, the next bin contains partial products from the next S least significant data bits, and so on. Denoting the contribution from bin m by $B_j^m(k, b)$, the contributions for $m = 0, 1, 2, \ldots, \lceil w_d/S \rceil - 1$ can be written

$B_j^m(k, b) = \sum_{n=0}^{(N+1)/2} |h_{n,b-Kj-k-mS+1}|$    (12.32)

for $1 \le Kj + k \le w_d - 1$, and

$B_j^m(k, b) = \sum_{n=0}^{(N+1)/2} \left( |h_{n,b-Kj-k-mS+1}| + |h_{n,b-Kj-k-mS}| \right)$    (12.33)

for $Kj + k = w_d$. For $m = \lceil w_d/S \rceil$, the contribution can be written

$B_j^m(k, b) = \sum_{n=0}^{(N+1)/2} |h_{n,b-Kj-k-mS+1}|$    (12.34)

for $1 \le Kj + k \le (w_d \bmod S) - 1$, and

$B_j^m(k, b) = \sum_{n=0}^{(N+1)/2} \left( |h_{n,b-Kj-k-mS+1}| + |h_{n,b-Kj-k-mS}| \right)$    (12.35)

for $Kj + k = w_d \bmod S$. Finally, the combined contribution is the sum of the partial symmetry adder contributions

$INBITS_j(k, b) = \sum_{m=0}^{\lceil w_d/S \rceil} B_j^m(k, b)$.    (12.36)
12.4.5 TF architecture
Denoting the polyphase factor by M, for the TF architecture the first M filter coefficients will be inserted in the last adder stage, the next M coefficients in the stage before, and so on. Thus, the number of initial partial products can be written

$INBITS_{J-j-1}(0, b) = \sum_{n=Mj}^{M(j+1)-1} \sum_{t=0}^{w_d - 1} |h_{n,b-t}|$.    (12.37)

12.4.6 Constant term placement
If either the coefficient or the data contains digits with a negative sign, a constant
compensation term must be added to the carry-save tree. However, these bits may
be placed in an arbitrary stage, and in this section it is explained how the problem
may be modified to place the bits optimally in terms of hardware resources. Define
the constant, in two’s complement representation, as
$C = -c_{w_{out}-1} 2^{w_{out}-1} + \sum_{b=0}^{w_{out}-2} c_b 2^b$,    (12.38)
where cb ∈ {0, 1}. Then define the ILP variables CBITS j (b) ∈ {0, 1} for j ∈
{0, 1, 2, . . . , J − 1}, b ∈ {0, 1, 2, . . . , wout − 1}, and add the constraint
$\sum_{j=0}^{J-1} CBITS_j(b) = c_b$    (12.39)
for b ∈ {0, 1, 2, . . . , wout − 1}. Redefine (12.22) to
BITS j+1 (0, b) = BITS j (K, b) + CBITS j+1 (b)
(12.40)
in order to add the constant bits to the carry-save tree.
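As a small numeric check (illustrative; $w_{out} = 9$ is an assumed output wordlength), the bits $c_b$ of (12.38) follow directly from the two's-complement representation of the merged constant, e.g. C = -144 from the CSD example in Sec. 12.1.2:

```python
def constant_bits(C, wout):
    """Bits c_b of the constant C in wout-bit two's complement,
    Eq. (12.38); Python's arithmetic right shift of a negative number
    yields the bits of C modulo 2^wout."""
    return [(C >> b) & 1 for b in range(wout)]

print(constant_bits(-144, 9))   # [0, 0, 0, 0, 1, 1, 1, 0, 1]
```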
12.5 Results
In this section, the different architectures are compared, and the choice of coefficient representation is investigated. For the energy and area estimations, a VHDL
generator has been used to generate synthesizable VHDL code. The complete
software package with ILP problem and VHDL code generator is available at [117].
12.5.1 Architecture comparison
Two filters have been used to evaluate the optimization algorithm and the relative
performance of the architectures. The filters are based on 4-tap and 16-tap moving
average filters, M = 4 and M = 16, respectively. Both filters consist of three
cascaded filters (L = 3). In all simulations, the numbers of full adders correspond
to those given by (12.6) and (12.7), and the number of registers given by (12.15)
and (12.17) were added to the optimized result for the DF1 and DF2 architectures,
respectively. The filters were optimized using the ILP problem solver SCIP [2] with
the costs CFA = 3, CHA = 2, and CREG = 3. Even though the area of a full adder
is roughly twice that of a half adder, the half adder cost was increased slightly, as the routing associated with one full adder is likely less than that of two half adders.
The optimized filters have been compared with filters obtained using the Reduced Area [6] heuristic. The Reduced Area heuristic is claimed to minimize the
number of registers when used in a pipelined multiplier reduction tree. However,
it is interesting to note that this is in general not true for the bitproduct matrices resulting from filters implemented with carry-save adder trees. Especially for
the TF architecture, the bit-level optimized adder trees may result in significantly
reduced register usage, while also using fewer half adders.
Figure 12.14(a) shows the impact of the pipeline factor on the first filter with
short wordlengths. For the 4-tap filter, Na = 1.8, and according to (12.14) utilizing
the coefficient symmetry does not lead to reduced arithmetic complexity for wd <
2.25, and the DF2 architecture has thus not been included. Also, all filters used six
half adders. It can be seen that the bit-level optimized filters use significantly fewer registers, especially for heavily pipelined implementations. It can also be seen that
the TF architecture has a lower register complexity except for implementations
with large stage height and one bit input.
In Fig. 12.14(b) and Fig. 12.14(c), respectively, the energy dissipation and active
cell area (excluding routing) are shown. The area and energy measures are based
on a 90 nm standard cell library using Synopsys Design Compiler. The energy
estimations are obtained using a zero-delay model and assuming uncorrelated input
data. In the considered application, using a zero-delay model is expected to yield
relevant results as the amount of glitches is small due to the short critical paths.
Also, for a decimation filter for a sigma-delta ADC, the assumption of uncorrelated input data is considered relevant.
In Fig. 12.14(d), the total cost of the optimized filters is shown. This includes additional logic and arithmetic, such as the algorithmic delays for the DF1 architecture and the adders used in the ripple-carry VMA, which are not considered in the optimization. By comparing Fig. 12.14(d) with Figs. 12.14(b) and 12.14(c), it can be concluded that the used cost function is a relevant measure for optimizing both energy dissipation and cell area. Whereas energy dissipation and cell area in general do not have a strong correlation, they can be expected to correlate well when the amount of glitches is small and uncorrelated input data is used. Thus, in the rest of this chapter, only complexity and energy dissipation results will be presented, as cell area and cost behave similarly to energy dissipation.
Figure 12.14 Optimization and synthesis results of FIR CIC filters using the TF and DF1 architectures with M = 4 and L = 3: (a) register complexity comparison; (b) energy dissipation estimation; (c) synthesized cell area; (d) optimized cost.
In Fig. 12.15, the implementation complexity of the 16-tap filter is shown, using
one-bit input data. It is apparent that the bit-level optimized filter for the TF
architecture offers a lower register complexity, while also reducing the number of
half adders.
Figures 12.16(a), 12.16(b) and 12.16(c) show the full adder, half adder and
register complexity, respectively, for increasing input wordlength wd for the 4-tap
filter. The stage height is K = 2. For the 4-tap filter, Na = 1.8, and according
to (12.14) the DF2 architecture has a lower arithmetic complexity for wd > 2.
However, even for wd = 6, the reduction in arithmetic complexity is small compared
to the increase in number of registers, as seen in Fig. 12.16(c). For the DF3
architecture, a symmetry adder group size of S = 2 was used, thus reducing the
number of generated partial products by 25%. However, the register complexity is
still close to the DF2 architecture with full symmetry, as shown in Fig. 12.16(c).
Energy dissipation estimations for the optimized filters are shown in Fig. 12.16(d).
Figure 12.15 Register/HA complexity of bit-level optimized FIR CIC filters with M = 16, L = 3, and wd = 1. The number of half adders is 12 or 13 for all simulations.
12.5.2 Coefficient representation
Different coefficient representations have been investigated for the two filters shown in Fig. 12.17. Using signed-digit coefficients yields a small decrease in energy dissipation. The gain is not larger because the additional constant vector is relatively large compared with the number of bit products for such small filters.
Also, the simulation has not taken into account the fact that adders with a constant
bit input may be simplified. It can be expected that the difference between binary
and MSD coefficients would increase as the data and/or coefficient wordlength
increases.
12.5.3 Subexpression sharing
In this section, adder sharing is evaluated for CIC decimation and interpolation
filters with rate change factors of four and eight. Wordlengths between one and
six bits were used for the decimation filters, and wordlengths between three and
eight bits were used for the interpolation filters. The DF1 and TF structures
were evaluated, and for all filters the resulting adder trees were bit-level optimized
after all shared subexpressions were identified and adders introduced. Binary arithmetic with a critical path of three adders was used in all cases. The savings in register complexity were limited, and thus only the savings in adder complexity
are presented.
Figure 12.16 Complexity of FIR CIC filters with M = 4, L = 3, and K = 2: (a) full adder complexity; (b) half adder complexity; (c) register complexity; (d) energy dissipation of optimized filters.

Three filters were used in cascade, and thus the single-rate impulse responses for the example filters are

$h_4(k) = (1, 3, 6, 10, 12, 12, 10, 6, 3, 1)$, and    (12.41)

$h_8(k) = (1, 3, 6, 10, 15, 21, 28, 36, 42, 46, 48, 48, 46, 42, 36, 28, 21, 15, 10, 6, 3, 1)$,    (12.42)

for the filters of factors 4 and 8, respectively.
Results of adder sharing for the decimation filters are shown in Fig. 12.18(a).
The larger filter offers significantly more sharing potential, which is expected.
Generally, increasing the data wordlength increases the number of partial products
and therefore the sharing potential.
Results of adder sharing for the interpolation filters are shown in Fig. 12.18(b). Since the multiple output branches of interpolation filters cannot be merged into a common adder tree as for the decimation filters, the number of candidate subexpressions is significantly reduced and the sharing potential is smaller. This is especially true for the transposed direct form, where the structural delays in the reduction tree also inhibit merging of partial products from all taps in each output branch. For the TF architecture with interpolation by four there was no sharing potential.

Figure 12.17 Energy dissipation estimations using binary and MSD coefficients for FIR CIC filters with M = 4, L = 4 and M = 8, L = 3.

Figure 12.18 Subexpression sharing results for multirate FIR CIC filters: (a) multirate FIR CIC decimation filters with factors 4 and 8; (b) multirate FIR CIC interpolation filters with factors 4 and 8.
The factor-8 decimation filters in Figs. 12.18(a) and 12.18(b) have also been synthesized to standard cells of a 90 nm CMOS process using Synopsys Design Compiler. Flip-flops, full adders, and half adders were used as leaf components, and the designs were synthesized for a clock frequency of 1.4 GHz. Energy and area estimations are shown in Figs. 12.19(a) and 12.19(b), respectively. It is seen that the redundancy reduction reduces both the energy dissipation and the area.

Figure 12.19 Energy dissipation and area estimations of FIR CIC decimation filters with factor 8: (a) energy dissipation estimation; (b) area estimation.
13 Conclusions and future work
In this chapter, the work presented in this part of the thesis is concluded, and
suggestions for future work are given.
13.1 Conclusions
In part II of this thesis, contributions were made in the area of high-speed analog-to-digital conversion utilizing Σ∆-ADCs. The contributions can be divided into two
parts: a sensitivity analysis of ADC systems using parallel Σ∆-modulators and
the efficient bit-level realization of FIR filters with high speed and low wordlength,
which can typically be used as the first stage decimation filter for a high-bandwidth
Σ∆-ADC.
In Chapter 10, a general model of an ADC using parallel Σ∆-modulators is introduced. The model uses periodic modulation sequences for the different channels,
and multirate filter bank theory is used to define a signal transfer function for the
system. The proposed model comprises several of the traditional parallel systems,
including time-interleaved systems, Hadamard-modulated systems, and frequency-decomposed systems. A channel model including non-linearities of the modulators, gain errors, and offset errors was used, and the different systems were analyzed for matching requirements using the channel model. It was found that the sensitivities differ significantly between the different systems, with the time-interleaved system requiring full matching between all channels, the Hadamard-modulated system requiring
limited matching between subsets of the channels, and the frequency-decomposed
system being inherently insensitive to mismatch problems. In addition, a new
three-channel scheme is introduced that is insensitive to gain errors and modulator non-idealities.
In Chapter 12, the realization of efficient high-speed FIR filters was considered,
with one of the main motivations being their use as decimation filters for wide-bandwidth Σ∆-modulators. Due to the high speed and short wordlength, the popular recursive CIC structures are inefficient, and instead FIR filters realized using polyphase
decomposition, partial product generation and reduction through carry-save adder
trees were considered. The focus is on the efficient realization of partial product
reduction trees. An ILP-based optimization algorithm was proposed, that minimizes a cost function defined as a weighted sum of the number of full adders, half
adders and registers. The optimization algorithm was compared to the Reduced
Area heuristic. Several different architectures, including direct form, transposed
direct form, and direct form utilizing coefficient symmetry for linear-phase filters,
were compared for several example filters and data wordlengths.
13.2 Future work
The following ideas are identified as possibilities for future work:
• The ILP-optimized reduction algorithm could be used also for other highspeed arithmetic. One such example is a multiply-and-accumulate, where
the accumulation is recursive and will be an extra consideration in the optimization.
• The result of the optimization is a fixed structure with a constant filter impulse response. However, in the area of software defined radio, the ability
to trade bandwidth for resolution is needed, which can naturally be done
using a fixed Σ∆-ADC and varying the bandwidth of the decimation filter.
Thus, a method of optimizing an adaptive filter with varying bandwidth is
interesting.
References
[1] H. Aboushady, Y. Dumonteix, M.-M. Louerat, and H. Mehrez, “Efficient
polyphase decomposition of comb decimation filters in Σ∆ analog-to-digital
converters,” IEEE Trans. Circuits Syst. II, vol. 48, no. 10, pp. 898–903, Oct.
2001.
[2] T. Achterberg, “Constraint Integer Programming,” Ph.D. dissertation, Technische Universität Berlin, 2007, available: http://opus.kobv.de/tuberlin/volltexte/2007/1611/.
[3] J. B. Anderson, Digital Transmission Engineering. IEEE Press, 1999.
[4] R. Batten, A. Eshragi, and T. S. Fiez, “Calibration of parallel ∆Σ ADCs,”
IEEE Trans. Circuits Syst. II, vol. 49, no. 6, pp. 390–399, Jun. 2002.
[5] A. Berkeman, V. Öwall, and M. Torkelson, “A low logic depth complex multiplier using distributed arithmetic,” IEEE J. Solid-State Circuits, vol. 35,
no. 4, pp. 656–659, Apr. 2000.
[6] K. Bickerstaff, M. Schulte, and E.E. Swartzlander, Jr., “Reduced area multipliers,” in Proc. Int. Conf. Application-Specific Array Processors, Nov. 1993,
pp. 478–489.
[7] A. Blad and O. Gustafsson, “Energy-efficient data representation in LDPC
decoders,” IET Electron. Lett., vol. 42, no. 18, pp. 1051–1052, Aug. 2006.
[8] ——, “Integer linear programming-based bit-level optimization for high-speed FIR decimation filter architectures,” Springer Circuits, Syst. Signal
Process. - Special Issue on Low Power Digital Filter Design Techniques and
Their Applications, vol. 29, no. 1, pp. 81–101, Feb. 2010.
[9] ——, “Redundancy reduction for high-speed FIR filter architectures based
on carry-save adder trees,” in Proc. Int. Symp. Circuits, Syst., May 2010.
[10] ——, “FPGA implementation of rate-compatible QC-LDPC code decoder,”
in Proc. European Conf. Circuit Theory Design, Aug. 2011.
[11] A. Blad, O. Gustafsson, and L. Wanhammar, “An early decision decoding
algorithm for LDPC codes using dynamic thresholds,” in Proc. European
Conf. Circuit Theory Design, Aug. 2005, pp. 285–288.
[12] ——, “A hybrid early decision-probability propagation decoding algorithm
for low-density parity-check codes,” in Proc. Asilomar Conf. Signals, Syst.,
Comp., Oct. 2005.
[13] ——, “Implementation aspects of an early decision decoder for LDPC codes,”
in Proc. Nordic Event ASIC Design, Nov. 2005.
[14] ——, “An LDPC decoding algorithm utilizing early decisions,” in Proc. National Conf. Radio Science, Jun. 2005.
[15] A. Blad, O. Gustafsson, M. Zheng, and Z. Fei, “Integer linear programming based optimization of puncturing sequences for quasi-cyclic low-density
parity-check codes,” in Proc. Int. Symp. Turbo-Codes, Related Topics, Sep.
2010.
[16] A. Blad, H. Johansson, and P. Löwenborg, “Multirate formulation for mismatch sensitivity analysis of analog-to-digital converters that utilize parallel sigma-delta modulators,” Eurasip J. Advances Signal Process., vol. 2008,
2008, article ID 289184, 11 pages.
[17] A. Blad, P. Löwenborg, and H. Johansson, “Design trade-offs for linear-phase
FIR decimation filters and sigma-delta modulators,” in Proc. XIV European
Signal Process. Conf., Sep. 2006.
[18] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE J. Solid-State Circuits, vol. 37,
no. 3, pp. 404–412, Mar. 2002.
[19] L. J. Breems, R. Rutten, and G. Wetzker, “A cascaded continuous-time Σ∆
modulator with 67-dB dynamic range in 10-MHz bandwidth,” IEEE J. Solid-State Circuits, vol. 39, no. 12, pp. 2152–2160, Dec. 2004.
[20] J. Candy, “Decimation for sigma delta modulation,” IEEE Trans. Commun.,
vol. 29, pp. 72–76, Jan. 1986.
[21] J. C. Candy and G. C. Temes, Oversampling methods for A/D and D/A
conversion. IEEE Press, 1992.
[22] Y.-J. Chang, F. Lai, and C.-L. Yang, “Zero-aware asymmetric SRAM cell
for reducing cache power in writing zero,” IEEE Trans. VLSI Syst., vol. 12,
no. 8, pp. 827–836, Aug. 2004.
[23] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X.-Y. Hu, “Reduced-complexity decoding of LDPC codes,” IEEE Trans. Commun., vol. 53, no. 8,
pp. 1288–1299, Aug. 2005.
[24] E. Choi, S. Suh, and J. Kim, “Rate-compatible puncturing for low-density
parity-check codes with dual-diagonal parity structure,” in Proc. Symp. Personal Indoor Mobile Radio Commun., Sep. 2005, pp. 2642–2646.
[25] S.-Y. Chung, “On the construction of some capacity-approaching coding
schemes,” Ph.D. dissertation, Massachusetts Institute of Technology, Sep.
2000.
[26] S.-Y. Chung, G. D. Forney, T. J. Richardson, and R. Urbanke, “On the
design of low-density parity-check codes within 0.0045 dB of the Shannon
limit,” IEEE Commun. Lett., vol. 5, no. 2, pp. 58–60, Feb. 2001.
[27] S.-Y. Chung, T. J. Richardson, and R. Urbanke, “Analysis of sum-product
decoding of low-density parity-check codes using a Gaussian approximation,”
IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 657–670, Feb. 2001.
[28] F. Colodro, A. Torralba, and M. Laguna, “Time-interleaved multirate sigma-delta modulators,” in Proc. Int. Symp. Circuits, Syst., vol. 6, May 2005, pp.
5581–5584.
[29] R. F. Cormier, T. L. Sculley, and R. H. Bamberger, “Combining subband
decomposition and sigma-delta modulation for wideband A/D conversion,”
in Proc. Int. Symp. Circuits, Syst., vol. 5, May 1994, pp. 357–360.
[30] L. Dadda, “Some schemes for parallel multipliers,” Alta Frequenza, vol. 34,
pp. 349–356, May 1965.
[31] D. Declercq and F. Verdier, “Optimization of LDPC finite precision belief
propagation decoding with discrete density evolution,” in Proc. Int. Symp.
Turbo-Codes, Related Topics, Sep. 2003, pp. 479–482.
[32] G. Ding, C. Dehollain, M. Declercq, and K. Azadet, “Frequency-interleaving
technique for high-speed A/D conversion,” in Proc. Int. Symp. Circuits, Syst.,
vol. 1, May 2003, pp. 857–860.
[33] Y. Dumonteix and H. Mehrez, “A family of redundant multipliers dedicated
to fast computation for signal processing,” in Proc. Int. Symp. Circuits, Syst.,
vol. 5, May 2000, pp. 325–328.
[34] A. Eshraghi and T. S. Fiez, “A time-interleaved parallel ∆Σ A/D converter,”
IEEE Trans. Circuits Syst. II, vol. 50, no. 3, pp. 118–129, Mar. 2003.
[35] ——, “A comparative analysis of parallel delta-sigma ADC architectures,”
IEEE Trans. Circuits Syst. I, vol. 51, no. 3, pp. 450–458, Mar. 2004.
[36] T. Etzion, A. Trachtenberg, and A. Vardy, “Which codes have cycle-free
Tanner graphs?” IEEE Trans. Inf. Theory, vol. 45, no. 6, pp. 2173–2181,
Sep. 1999.
[37] M. Fossorier, “Quasi-cyclic low-density parity-check codes from circulant permutation matrices,” IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1788–1793,
Aug. 2004.
[38] M. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexity iterative
decoding of low density parity check codes based on belief propagation,”
IEEE Trans. Commun., vol. 47, pp. 673–680, May 1999.
[39] R. Gallager, “Low-density parity-check codes,” Ph.D. dissertation, Massachusetts Institute of Technology, 1963.
[40] I. Galton and H. T. Jensen, “Delta-sigma modulator based A/D conversion
without oversampling,” IEEE Trans. Circuits Syst. II, vol. 42, no. 12, pp.
773–784, Dec. 1995.
[41] ——, “Oversampling parallel delta-sigma modulator A/D conversion,” IEEE
Trans. Circuits Syst. II, vol. 43, no. 12, pp. 801–810, Dec. 1996.
[42] Y. Gao, L. Jia, J. Isoaho, and H. Tenhunen, “A comparison design of comb
decimators for sigma-delta analog-to-digital converters,” Springer Analog Integrated Circuits, Signal Process., vol. 22, no. 1, pp. 51–60, Jan. 2000.
[43] M. Good and F. R. Kschischang, “Incremental redundancy via check splitting,” in Proc. Biennial Symp. Commun., 2006, pp. 55–58.
[44] F. Guillod, E. Boutillon, J. Tousch, and J.-L. Danger, “Generic description
and synthesis of LDPC decoders,” IEEE Trans. Commun., vol. 55, no. 11,
pp. 2084–2091, Nov. 2007.
[45] O. Gustafsson and H. Ohlsson, “A low power decimation filter architecture for
high-speed single-bit sigma-delta modulation,” in Proc. Int. Symp. Circuits,
Syst., vol. 2, May 2005, pp. 1453–1456.
[46] J. Ha, J. Kim, D. Klinc, and S. W. McLaughlin, “Rate-compatible punctured
low-density parity-check codes with short block lengths,” IEEE Trans. Inf.
Theory, vol. 52, no. 2, pp. 728–738, 2006.
[47] J. Ha, J. Kim, and S. W. McLaughlin, “Rate-compatible puncturing of low-density parity-check codes,” IEEE Trans. Inf. Theory, vol. IT-50, no. 11, pp.
2824–2836, Nov. 2004.
[48] J. Hagenauer, “Rate-compatible punctured convolutional codes (RCPC
codes) and their applications,” IEEE Trans. Commun., vol. 36, no. 4, pp.
389–400, Apr. 1988.
[49] O. Herrmann, R. L. Rabiner, and D. S. K. Chan, “Practical design rules for
optimum finite impulse response digital filters,” Bell System Technical J.,
vol. 52, no. 6, pp. 769–799, 1973.
[50] D. Hocevar, “A reduced complexity decoder architecture via layered decoding
of LDPC codes,” in Proc. Workshop, Signal Proc. Syst., Oct. 2004, pp. 107–
112.
[51] E. B. Hogenauer, “An economical class of digital filters for decimation and
interpolation,” Proc. Int. Conf. Acoust. Speech, Signal Process., vol. 29, pp.
155–162, Apr. 1981.
[52] C. Howland and A. Blanksby, “A 220mW 1Gb/s 1024-bit rate-1/2 low density
parity check code decoder,” in Proc. Custom Integrated Circuits Conf., May
2001, pp. 293–296.
[53] P. G. A. Jespers, Integrated Converters: D to A and A to D Architectures,
Analysis and Simulation. New York: Oxford University Press, 2001.
[54] K. Johansson, O. Gustafsson, and L. Wanhammar, “Implementation of elementary functions for logarithmic number systems,” IET J. Computers,
Digital Techniques, IET, vol. 2, no. 4, pp. 295–304, Jul. 2008.
[55] S. J. Johnson and S. R. Weller, “Construction of low-density parity-check
codes from Kirkman triple systems,” in Proc. Global Commun. Conf., Nov.
2003, pp. 970–974.
[56] R. Khoini-Poorfard and D. A. Johns, “Time-interleaved oversampling converters,” IET Electron. Lett., vol. 29, no. 19, pp. 1673–1674, Sep. 1993.
[57] ——, “Mismatch effects in time-interleaved oversampling converters,” in
Proc. Int. Symp. Circuits, Syst., vol. 5, May 1994, pp. 429–432.
[58] R. Khoini-Poorfard, L. B. Lim, and D. A. Johns, “Time-interleaved oversampling A/D converters: Theory and practice,” IEEE Trans. Circuits Syst. II,
vol. 44, no. 8, pp. 634–645, Aug. 1997.
[59] K.-Y. Khoo, Z. Yu, and Alan N. Willson, Jr., “Bit-level arithmetic optimization for carry-save additions,” in Proc. Computer-Aided Design, Digest of
Technical Papers, Nov. 1999, pp. 14–18.
[60] J.-A. Kim, S.-R. Kim, D.-J. Shin, and S.-N. Hong, “Analysis of check-node
merging decoding for punctured LDPC codes with dual-diagonal parity structure,” in Proc. Wireless Commun. Netw. Conf., 2007, pp. 572–576.
[61] E. King, F. Aram, T. Fiez, and I. Galton, “Parallel delta-sigma A/D conversion,” in Proc. Custom Integrated Circuits Conf., May 1994, pp. 503–506.
[62] S. K. Kong and W. H. Ku, “Frequency domain analysis of Π∆Σ ADC and its
application to combining subband decomposition and Π∆Σ ADC,” in Proc.
Midwest Symp. Circuits, Syst., vol. 1, Aug. 1997, pp. 226–229.
[63] Y. Kou, S. Lin, and M. P. C. Fossorier, “Low-density parity-check codes
based on finite geometries: A rediscovery and new results,” IEEE Trans. Inf.
Theory, vol. 47, no. 7, pp. 2711–2736, Nov. 2001.
[64] M. Kozak, M. Karaman, and I. Kale, “Efficient architectures for time-interleaved oversampling delta-sigma converters,” IEEE Trans. Circuits Syst.
II, vol. 47, no. 8, pp. 802–810, Aug. 2000.
[65] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the
sum-product algorithm,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–
519, Feb. 2001.
[66] T.-C. Kuo and A. N. Willson, Jr., “A flexible decoder IC for WiMAX QC-LDPC codes,” in Proc. Custom Integrated Circuits Conf., 2008, pp. 527–530.
[67] C. Kuskie, B. Zhang, and R. Schreier, “A decimation filter architecture for
GHz delta-sigma modulators,” in Proc. Int. Symp. Circuits, Syst., vol. 2,
May 1995, pp. 953–956.
[68] Y. Li, M. Elassal, and M. Bayoumi, “Power efficient architecture for (3,6)regular low-density parity-check code decoder,” in Proc. Int. Symp. Circuits,
Syst., vol. IV, May 2004, pp. 81–84.
[69] C.-Y. Lin, C.-C. Wei, and M.-K. Ku, “Efficient encoding for dual-diagonal
structured LDPC codes based on parity bit prediction and correction,” in
Proc. Asia-Pacific Conf. Circuits, Syst., 2008, pp. 1648–1651.
[70] C.-C. Lin, K.-L. Lin, H.-C. Chang, and C.-Y. Lee, “A 3.33Gb/s (1200, 720)
low-density parity-check code decoder,” in Proc. European Solid-State Circuits Conf., Sep. 2005, pp. 211–214.
[71] S. Lin, L. Chen, J. Xu, and I. Djurdjevic, “Near Shannon limit quasi-cyclic
low-density parity-check codes,” in Proc. Global Commun. Conf., May 2003,
pp. 2030–2035.
[72] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman, “Analysis of
low density codes and improved designs using irregular graphs,” in Proc.
Annual Symp. Theory Computing, 1998, pp. 249–258.
[73] D. MacKay, “Good error-correcting codes based on very sparse matrices,” in
Proc. Int. Symp. Inf. Theory, Jun. 1997, p. 113.
[74] ——, “Good error-correcting codes based on very sparse matrices,” IEEE
Trans. Inf. Theory, vol. 45, no. 2, pp. 399–431, Mar. 1999.
[75] D. MacKay and R. Neal, “Near Shannon limit performance of low density
parity check codes,” IET Electron. Lett., vol. 33, no. 6, pp. 457–458, Mar.
1997.
[76] U. Meyer-Baese, S. Rao, J. Ramírez, and A. García, “Cost-effective
Hogenauer cascaded integrator comb decimator filter design for custom ICs,”
IET Electron. Lett., vol. 41, no. 3, pp. 158–160, Feb. 2005.
[77] O. Milenkovic, I. B. Djordjevic, and B. Vasic, “Block-circulant low-density
parity-check codes for optical communication systems,” IEEE J. Sel. Topics
Quantum Electron., vol. 10, no. 2, pp. 294–299, Mar. 2004.
[78] S. K. Mitra, Digital Signal Processing: A Computer Based Approach. McGraw-Hill, 2006.
[79] F. Naessens, V. Derudder, H. Cappelle, L. Hollevoet, P. Raghavan,
M. Desmet, A. AbdelHamid, I. Vos, L. Folens, S. O’Loughlin, S. Singirikonda,
S. Dupont, J.-W. Weijers, A. Dejonghe, and L. Van der Perre, “A 10.37 mm2
675 mW reconfigurable LDPC and Turbo encoder and decoder for 802.11n,
802.16e and 3GPP-LTE,” in Proc. Symp. VLSI Circuits, 2010, pp. 213–214.
[80] V. T. Nguyen, P. Loumeau, and J.-F. Naviner, “Analysis of time-interleaved
delta-sigma analog to digital converter,” in Proc. Veh. Tech. Conf., vol. 4,
Sep. 2002.
[81] S. R. Norsworthy, R. Schreier, and G. C. Temes, Eds., Delta-Sigma Data
Converters: Theory, Design, and Simulation. Wiley-IEEE Press, 1996,
ISBN 0780310454.
[82] H. Ohlsson, B. Mesgarzadeh, K. Johansson, O. Gustafsson, P. Löwenborg,
H. Johansson, and A. Alvandpour, “A 16 GSPS 0.18 µm CMOS decimator
for single-bit Σ∆-modulation,” in Proc. Nordic Event ASIC Design, Nov.
2004, pp. 175–178.
[83] H. Y. Park, J. W. Kang, K. S. Kim, and K. C. Whang, “Efficient puncturing
method for rate-compatible low-density parity-check codes,” IEEE Trans.
Wireless Commun., vol. 6, no. 11, pp. 3914–3919, Nov. 2007.
[84] T. Parks and J. McClellan, “Chebyshev approximation for nonrecursive digital filters with linear phase,” IEEE Trans. Circuit Theory, vol. 19, no. 2, pp.
189–194, 1972.
[85] L. Rabiner and O. Herrmann, “The predictability of certain optimum finite
impulse response digital filters,” IEEE Trans. Circuit Theory, vol. CT-20,
no. 4, pp. 401–408, Jul. 1973.
[86] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 619–637, Feb. 2001.
[87] T. J. Richardson and R. L. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 599–618, Feb. 2001.
[88] R. Schreier and G. C. Temes, Understanding Delta-Sigma Data Converters.
John Wiley & Sons, NJ, 2005.
[89] C. E. Shannon, “A mathematical theory of communication,” Bell System
Technical J., vol. 27, pp. 379–423, 623–656, 1948.
[90] X.-Y. Shih, C.-Z. Zhan, and A.-Y. Wu, “A 7.39 mm2 76mW (1944, 972)
LDPC decoder chip for IEEE 802.11n applications,” in Proc. Asian Solid-State Circuits Conf., Nov. 2008, pp. 301–304.
[91] E. E. Swartzlander, Jr., “Merged arithmetic,” IEEE Trans. Comput., vol. C-29,
no. 10, pp. 946–950, Oct. 1980.
[92] R. M. Tanner, “A recursive approach to low complexity codes,” IEEE Trans.
Inf. Theory, vol. IT-27, no. 5, pp. 533–547, Sep. 1981.
[93] K. Uyttenhove and M. Steyaert, “Speed-power-accuracy trade-off in high-speed ADCs,” IEEE Trans. Circuits Syst. II, vol. 4, pp. 247–257, Apr. 2002.
[94] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood Cliffs,
NJ: Prentice-Hall, 1993.
[95] B. Vasic and O. Milenkovic, “Combinatorial constructions of low-density
parity-check codes for iterative decoding,” IEEE Trans. Inf. Theory, vol. 50,
no. 6, pp. 1156–1176, Jun. 2004.
[96] B. Vasic, K. Pedagani, and M. Ivkovic, “High-rate girth-eight low-density
parity-check codes on rectangular integer lattices,” IEEE Trans. Inf. Theory,
vol. 52, no. 8, pp. 1248–1252, Aug. 2004.
[97] C. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964.
[98] Y. Wang, J. Yedidia, and S. Draper, “Construction of high-girth QC-LDPC
codes,” in Proc. Int. Symp. Turbo-Codes, Related Topics, Sep. 2008, pp. 180–
185.
[99] Z. Wang, Z. Cui, and J. Sha, “VLSI design for low-density parity-check code
decoding,” IEEE Circuits Syst. Mag., vol. 11, no. 1, pp. 52–69, 2011.
[100] L. Wanhammar and H. Johansson, Digital Filters.
2002.
Linköping University,
[101] N. Wiberg, “Codes and decoding on general graphs,” Ph.D. dissertation,
Linköping University, Linköping, Sweden, Jun. 1996.
[102] S. B. Wicker, Error Control Systems.
Prentice Hall, 1995.
[103] N. Yaghini and D. Johns, “A 43mW CT complex ∆Σ ADC with 23MHz
of signal bandwidth and 68.8dB SNDR,” in Proc. Int. Solid-State Circuits
Conf., Digest of Technical Papers, Feb. 2005, pp. 502–503.
[104] E. Yeo, B. Nikolic, and V. Anantharam, “Architectures and implementations
of low-density parity check decoding algorithms,” in Proc. Midwest Symp.
Circuits, Syst., Aug. 2002, pp. III–437 – III–440.
[105] E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam, “VLSI architectures
for iterative decoders in magnetic recording channels,” IEEE Trans. Magn.,
vol. 37, no. 2, pp. 748–755, Mar. 2001.
[106] K. Zhang, X. Huang, and Z. Wang, “A high-throughput LDPC decoder architecture with rate compatibility,” IEEE Trans. Circuits Syst. I, vol. 58,
no. 4, pp. 839–847, Apr. 2011.
[107] T. Zhang and K. K. Parhi, “Joint code and decoder design for
implementation-oriented (3, k)-regular LDPC codes,” in Proc. Asilomar
Conf. Signals, Syst., Comp., Nov. 2001, pp. 1232–1236.
[108] ——, “A 54 Mbps (3,6)-regular FPGA LDPC decoder,” in Proc. Workshop,
Signal Proc. Syst., Oct. 2002, pp. 127–132.
[109] ——, “An FPGA implementation of (3,6) regular low-density parity-check
code decoder,” Eurasip J. Applied Signal Process., vol. 6, pp. 530–542, May
2003.
[110] ——, “Joint (3, k)-regular LDPC code and decoder/encoder design,” IEEE
Trans. Signal Process., vol. 52, no. 4, pp. 1065–1079, Apr. 2004.
[111] T. Zhang, Z. Wang, and K. K. Parhi, “On finite precision implementation of
low density parity check codes decoder,” in Proc. Int. Symp. Circuits, Syst.,
May 2001, pp. 202–205.
[112] E. Zimmermann, G. Fettweis, P. Pattisapu, and P. K. Bora, “Reduced complexity LDPC decoding using forced convergence,” in Proc. Int. Symp. Wireless Personal Multimedia Commun., Sep. 2004.
[113] “ETSI EN 302 307, digital video broadcasting (DVB): Second generation
framing structure, channel coding and modulation systems for broadcasting,
interactive services, news gathering and other broadband satellite applications (DVB-S2),” available:
http://www.dvb.org/technology/standards/.
[114] “ETSI EN 302 755, digital video broadcasting (DVB): Frame structure channel coding and modulation for a second generation digital terrestrial television
broadcasting system (DVB-T2),” available:
http://www.dvb.org/technology/standards/.
[115] “ETSI EN 302 769, digital video broadcasting (DVB): Frame structure channel coding and modulation for a second generation digital transmission system for cable systems (DVB-C2),” available:
http://www.dvb.org/technology/standards/.
[116] “GSFC-STD-9100, low density parity check code for rate 7/8,” available:
http://standards.gsfc.nasa.gov/stds.html.
[117] “High-speed FIR filter generator,” available:
http://www.es.isy.liu.se/software/hsfir/.
[118] “IEEE 802.11n-2009: Wireless LAN medium access control (MAC) and
physical layer (PHY) specifications amendment 5: Enhancements for higher
throughput,” available:
http://standards.ieee.org/getieee802/download/802.11n-2009.pdf.
[119] “IEEE 802.15.3c-2009: Wireless medium access control (MAC) and physical layer (PHY) specifications for high rate wireless personal area networks
(WPANs) amendment 2: Millimeter-wave-based alternative physical layer
extension,” available:
http://standards.ieee.org/getieee802/download/802.15.3c-2009.pdf.
[120] “IEEE 802.16e-2005: Air interface for fixed and mobile broadband wireless
access systems,” available:
http://standards.ieee.org/getieee802/download/802.16e-2005.pdf.
[121] “IEEE 802.3an-2006: Carrier sense multiple access with collision detection
(CSMA/CD) access method and physical layer specifications,” available:
http://standards.ieee.org/getieee802/download/802.3an-2006.pdf.
[122] “National standardization management committee. GB2060-2006. digital
television terrestrial broadcasting transmission system frame structure, channel coding and modulation. Beijing: Standards press of China. Aug. 2006.”