Low-Complexity Multiplierless Constant Rotators Based on Combined Coefficient Selection and Shift-and-Add Implementation

Low-Complexity Multiplierless Constant
Rotators Based on Combined Coefficient
Selection and Shift-and-Add Implementation
(CCSSI)
Mario Garrido Gálvez, Fahad Qureshi and Oscar Gustafsson
Linköping University Post Print
N.B.: When citing this work, cite the original article.
©2014 IEEE. Personal use of this material is permitted. However, permission to
reprint/republish this material for advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or lists, or to reuse any copyrighted
component of this work in other works must be obtained from the IEEE.
Mario Garrido Gálvez, Fahad Qureshi and Oscar Gustafsson, Low-Complexity Multiplierless
Constant Rotators Based on Combined Coefficient Selection and Shift-and-Add
Implementation (CCSSI), 2014, IEEE Transactions on Circuits and Systems Part 1: Regular
Papers, (61), 7, 2002-2012.
http://dx.doi.org/10.1109/TCSI.2014.2304664
Postprint available at: Linköping University Electronic Press
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-109385
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS
Low-Complexity Multiplierless Constant Rotators
Based on Combined Coefficient Selection and
Shift-and-Add Implementation (CCSSI)
Mario Garrido, Member, IEEE, Fahad Qureshi, Member, IEEE, and Oscar Gustafsson, Senior Member, IEEE
Abstract—This paper presents a new approach to design
multiplierless constant rotators. The approach is based on a
combined coefficient selection and shift-and-add implementation
(CCSSI) for the design of the rotators. First, complete freedom
is given to the selection of the coefficients, i.e., no constraints to
the coefficients are set in advance and all the alternatives are
taken into account. Second, the shift-and-add implementation
uses advanced single constant multiplication (SCM) and multiple
constant multiplication (MCM) techniques that lead to low-complexity multiplierless implementations. Third, the design of
the rotators is done by a joint optimization of the coefficient selection and shift-and-add implementation. As a result, the CCSSI
provides an extended design space that offers a larger number
of alternatives with respect to previous works. Furthermore, the
design space is explored in a simple and efficient way.
The proposed approach has wide applications in numerous
hardware scenarios. This includes rotations by single or multiple
angles, rotators in single or multiple branches, and different
scaling of the outputs.
Experimental results for various scenarios are provided. In
all of them, the proposed approach achieves significant improvements with respect to the state of the art.
Index Terms—Rotation, complex multiplier, combined coefficient selection and shift-and-add implementation (CCSSI), adder
minimization, multiple constant multiplication (MCM), shift-and-add, fast Fourier transform.
I. INTRODUCTION
A rotation is a transformation that describes a circular
movement with respect to a point. Many digital signal
processing algorithms calculate rotations of complex numbers
by given angles with respect to the origin. This is the case
for the fast Fourier transform (FFT) [1]–[7], the fast discrete
cosine transform (DCT) [8], [9] and lattice filters [10], [11].
A rotator is the hardware component used to calculate
rotations. There are two main types of rotators: general rotators
and constant rotators. General rotators can carry out a rotation
by any angle, which is provided as an input to the rotator.
They are usually implemented by a complex multiplier or by
the coordinate rotation digital computer (CORDIC) algorithm.
A complex multiplier typically consists of four real multipliers
and two adders. In this case, the rotation is simply calculated
by multiplying the input of the multiplier by the rotation
coefficient [12]. Conversely, the CORDIC algorithm [13]–[17]
is based on breaking down the rotation angle into a series of
micro-rotations by specific angles. These micro-rotations are
carried out by means of shifts and additions, which are simple
to implement in hardware. A review of CORDIC techniques
can be found in [16].
M. Garrido, F. Qureshi and O. Gustafsson are with the Department of
Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden,
e-mails: [email protected], [email protected], [email protected]
Copyright © 2013 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending an email to [email protected]
Constant rotators calculate rotations by specific angles. They
are mainly used to calculate the twiddle factors in FFT architectures [1]–[7]. In constant rotators, the a priori knowledge of
the rotation angles allows for optimizing the implementation
of the rotator. CORDIC-based approaches for constant rotators [18]–[23] are based on selecting stages of the conventional
CORDIC [18]–[20] or increasing the amount of micro-rotation
angles [21]–[23]. Multiplier-based approaches for constant
rotators [4]–[7], [24]–[27] are based on techniques to optimize real-valued constant multiplications [4]–[6], trigonometric identities [7], [24], optimization of the coefficient encoding [25],
[26], and angle generation by a base-3 system [27].
The design of constant rotators is based on two fundamental
elements: the coefficient selection and the shift-and-add implementation. The success of previous approaches is mainly
due to an efficient shift-and-add implementation: On the one
hand, the techniques used in multiplier-based approaches to
implement constant multiplications as shifts and additions are
widely developed [6], [28]–[37]. On the other hand, CORDIC-based approaches rely on using elementary angles that allow
for an efficient shift-and-add implementation.
However, the coefficient selection has hardly been taken
into account. In multiplier-based approaches the coefficients
are traditionally obtained by rounding the sine and cosine
components of the angle. However, it has been shown that an
addition-aware quantization [37] can provide better coefficients.
Likewise, the CORDIC elementary angles have been used
since the CORDIC algorithm was proposed half a century
ago, without questioning if another selection of angles might
provide better results. Now, there are results that demonstrate
the existence of better angle sets than the CORDIC one for
the FFT rotations [27].
This work overcomes the old paradigm for the design of
rotators where the main focus was put on the shift-and-add
implementation. The new perspective presented in this paper
sets the coefficient selection and the shift-and-add implementation as equally important elements in the design of the rotators.
This removes the restrictions set by previous approaches to the
coefficient selection, widening the range of alternatives that
are explored. This enables optimized solutions that cannot be
achieved by using previous approaches.
The main contributions of this work are:
• It presents a new paradigm for the design of constant
  rotators that combines the coefficient selection and the
  shift-and-add implementation in the design process.
• It provides a simple and efficient method to find optimized rotators.
• It can be applied to solve a variety of problems with different demands, including single constant rotation (SCR)
  and multiple constant rotations (MCR).
• Experimental results for different contexts are provided.
  In all cases, the proposed approach achieves significant
  improvements in area and accuracy with respect to the
  current state of the art.
This paper is organized as follows. Section II introduces the
calculation of rotations in fixed-point arithmetic. Section III
reviews previous multiplierless rotators. Section IV presents
the proposed approach. Section V provides experimental results for the approach in multiple contexts. Finally, Section VI
shows the main conclusions.
II. ROTATIONS IN FIXED-POINT ARITHMETIC
A rotation of a complex number x + jy by an angle α can
be described as
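\[
\begin{bmatrix} X \\ Y \end{bmatrix} =
\begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}, \tag{1}
\]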
where X +jY is the result of the rotation. Ideally, the real and
imaginary components of the angle, cos α and sin α, should
be represented with infinite precision. However, in digital
systems, numbers must be represented with a finite number
of bits, which leads to quantization errors. Let C and S be
the coefficients that represent cos α and sin α, respectively.
If C and S use b bits in 2’s complement, then they can be
viewed as integer numbers in the range $[-2^{b-1}, 2^{b-1} - 1]$,
i.e., $C, S \in \mathbb{Z}_b$, where
\[
\mathbb{Z}_b = \{ z \in \mathbb{Z} : -2^{b-1} \le z \le 2^{b-1} - 1 \}. \tag{2}
\]
According to this, a rotation in a digital system can be
described as
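\[
\begin{bmatrix} X_D \\ Y_D \end{bmatrix} =
\begin{bmatrix} C & -S \\ S & C \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}, \tag{3}
\]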
where $X_D + jY_D$ is the result of the rotation and C and S
are obtained from the rotation angle as [12]
\[
\begin{aligned}
C &= R \cdot (\cos\alpha + \epsilon_c) \\
S &= R \cdot (\sin\alpha + \epsilon_s),
\end{aligned} \tag{4}
\]
where $\epsilon_c$ and $\epsilon_s$ are the relative quantization errors of the
cosine and sine components, respectively, and R is the scaling
factor. The output $X_D + jY_D$ is also scaled by R.
For constant rotations we can distinguish between a single constant rotation (SCR) and multiple constant rotations
(MCR). These cases are explained next.
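As a minimal sketch of (3) and (4), the following Python snippet rotates an integer input by an integer coefficient pair and recovers the relative quantization errors for a chosen scaling factor. The function names `rotate_fixed` and `quantization_errors`, the sample coefficient 5 + j4 for 38°, and the choice of R as the coefficient magnitude are illustrative assumptions, not part of the original method.

```python
import math

def rotate_fixed(x, y, C, S):
    """Integer rotation per (3): [X_D; Y_D] = [[C, -S], [S, C]] [x; y]."""
    return C * x - S * y, S * x + C * y

def quantization_errors(C, S, alpha_deg, R):
    """Relative errors from (4): C = R (cos a + eps_c), S = R (sin a + eps_s)."""
    a = math.radians(alpha_deg)
    eps_c = C / R - math.cos(a)
    eps_s = S / R - math.sin(a)
    return eps_c, eps_s

# Example: approximate a 38 degree rotation with the coefficient 5 + j4.
C, S = 5, 4
R = math.hypot(C, S)                  # assumed scaling: magnitude of C + jS
X_D, Y_D = rotate_fixed(100, 0, C, S)
eps_c, eps_s = quantization_errors(C, S, 38.0, R)
eps = math.hypot(eps_c, eps_s)        # combined error sqrt(eps_c^2 + eps_s^2)
```

The output (X_D, Y_D) is the rotated input scaled by R, so a real design would either compensate R or choose coefficients whose R is close to a power of two.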
Fig. 1. Hardware layouts. (a) Single branch, single rotation. (b) Single branch,
multiple rotations. (c) Multiple branches, single rotation for each branch. (d)
Multiple branches, multiple rotations for each branch.
A. Single Constant Rotation (SCR)
This case refers to a rotation by a single angle, which
is shown in Fig. 1(a). In this case, the rotation error is
calculated [12] as $\epsilon = \sqrt{\epsilon_c^2 + \epsilon_s^2}$.
Different optimization problems can be defined for SCR
depending on the scaling required by the rotator. Thus, the
scaling can be fixed, unity or arbitrary depending on the
freedom to choose the scaling factor. In fixed scaling, R is
fixed to a specific value. Unity scaling is a particular case of
fixed scaling in which the rotation has magnitude one or, in
more general terms, $R = 2^q$. This is equivalent to considering
that the binary point is in a different position of the binary
representation. Conversely, arbitrary scaling means that R can
take any value, i.e., no restriction is set to R. For arbitrary
scaling the approximation error is equal to the angular error
only, since R always will take on the optimal value.
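One way to see this, assuming the error definition in (4) and writing the coefficient in polar form $P = C + jS = |P| e^{j\varphi}$ with angular deviation $\Delta = \varphi - \alpha$ (the symbols $\varphi$ and $\Delta$ are introduced here only for illustration), is the following short derivation:

```latex
\[
\epsilon^2(R) = \left| \frac{P}{R} - e^{j\alpha} \right|^2
              = \frac{|P|^2}{R^2} - 2\,\frac{|P|}{R}\cos\Delta + 1 .
\]
\[
\frac{d\,\epsilon^2}{d\,(|P|/R)} = 2\,\frac{|P|}{R} - 2\cos\Delta = 0
\quad\Rightarrow\quad R = \frac{|P|}{\cos\Delta} .
\]
\[
\epsilon_{\min}^2 = \cos^2\Delta - 2\cos^2\Delta + 1 = \sin^2\Delta
\quad\Rightarrow\quad \epsilon_{\min} = |\sin\Delta| .
\]
```

Under these assumptions, once R is free the remaining error depends only on the angular deviation of the coefficient.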
B. Multiple Constant Rotations (MCR)
This case refers to multiple angles that must be optimized
together. This joint optimization happens when there is a
dependency in the scaling of the angles. Given the angles
$\alpha_i$, $i = 1, \ldots, M$, each angle must be approximated by
a complex number $P_i = C_i + jS_i$, where $C_i, S_i \in \mathbb{Z}_b$.
The set of complex numbers $P_i$, $i = 1, \ldots, M$, is called a
kernel. The error of a kernel is calculated as the maximum
of the errors of the angles [12], i.e.,
\[
\epsilon = \max_i(\epsilon(i)) = \max_i \sqrt{\epsilon_c^2(i) + \epsilon_s^2(i)}, \quad i = 1, \ldots, M.
\]
Different optimization problems for MCR can be defined
depending on the scaling that is required and on the hardware
layout. The scaling for multiple angles is classified based on
the relation among the scaling factors of the angles. Uniform
scaling means that R is the same for all the angles, and non-uniform scaling means that different angles may have different
scaling factors. Note that in uniform and non-uniform scaling
the scaling factor is not fixed from the beginning. Conversely,
fixed and unity scaling set a fixed scaling factor. In these cases, the
angles can be treated independently and the problem is reduced
to several SCR problems.
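The kernel error under uniform scaling can be sketched as follows. The snippet evaluates the maximum error of a kernel for a common R and searches for the R that minimizes it. The helper names and the ternary search are illustrative choices, not the authors' procedure; the sample kernel (12 + j3 for 14°, 10 + j8 for 38°) is taken from the design example used later in the paper.

```python
import math

def kernel_error(coeffs, angles_deg, R):
    """Max over angles of |P_i / R - exp(j*alpha_i)| for a common scaling R."""
    return max(
        abs(complex(C, S) / R - complex(math.cos(a), math.sin(a)))
        for (C, S), a in zip(coeffs, map(math.radians, angles_deg))
    )

def best_uniform_scaling(coeffs, angles_deg, iters=200):
    """Ternary search over t = 1/R; each error is convex in t, so is their max."""
    mags = [math.hypot(C, S) for C, S in coeffs]
    lo, hi = 0.5 / max(mags), 2.0 / min(mags)
    err = lambda t: kernel_error(coeffs, angles_deg, 1.0 / t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if err(m1) < err(m2):
            hi = m2
        else:
            lo = m1
    R = 2.0 / (lo + hi)
    return R, kernel_error(coeffs, angles_deg, R)

R, eps = best_uniform_scaling([(12, 3), (10, 8)], [14.0, 38.0])
```

For this kernel the search settles near R = 12.61 with an error close to 1.93 × 10^-2, which agrees with the corresponding row of Table III.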
Depending on the hardware layout, an MCR problem can
target a single rotator that is reconfigured to calculate multiple
rotations (Fig. 1(b)), or several rotators in parallel that require
the same scaling with one (Fig. 1(c)) or several rotations
(Fig. 1(d)) each. The case in Fig. 1(b) is typical in feedback
FFT architectures [2]. The case in Fig. 1(c) is typical in fully
parallel FFT architectures and in some DCT architectures [9].
Finally, the case in Fig. 1(d) is typical in feedforward FFT
architectures [1]. Note that these layouts represent the optimization problem that must be solved, but not the final solution
to it, as the rotators will consist of adders and multiplexers
instead of the multipliers and memories shown in Fig. 1.
III. REVIEW OF MULTIPLIERLESS ROTATORS
In the literature, C and S are usually considered as numbers
in the range [−1, 1]. However, as C and S are quantized to
a certain number of bits, we find it more natural to consider
them as integers in $\mathbb{Z}_b$, as explained in the previous section. In
this section we use this convention to review previous works.
For general rotations, the CORDIC algorithm [13]–[17]
breaks down the rotation angle into a series of k micro-rotations
by the angles $\alpha_k = \pm\tan^{-1}(2^{-k})$. For each stage,
k, the micro-rotation only uses two adders and calculates
\[
\begin{bmatrix} X_D \\ Y_D \end{bmatrix} =
\begin{bmatrix} 2^k & -\delta_k \\ \delta_k & 2^k \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}, \tag{5}
\]
where $\delta_k \in \{-1, 1\}$ determines the direction of the rotation,
and the scaling factor of the stage is $R(k) = \sqrt{2^{2k} + 1}$.
For constant rotators, the extended elementary angle set
(EEAS) CORDIC [21] considers the elementary angles
$\alpha_k = \tan^{-1}(\delta_k 2^{-a_k} + \gamma_k 2^{-b_k})$, where $\delta_k, \gamma_k \in \{-1, 0, 1\}$
and $a_k, b_k \in \mathbb{N}$. Assuming that $b_k > a_k$ and $c_k = b_k - a_k$, the
micro-rotation at stage k is defined by
\[
\begin{bmatrix} X_D \\ Y_D \end{bmatrix} =
\begin{bmatrix} 2^{b_k} & -(\delta_k 2^{c_k} + \gamma_k) \\ \delta_k 2^{c_k} + \gamma_k & 2^{b_k} \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}, \tag{6}
\]
which requires four adders.
In the mixed-scaling-rotation (MSR) CORDIC [22], [23]
the number of adders per stage is $2 \cdot (I_k + J_k + 1)$, and each
stage calculates a rotation by
\[
\begin{bmatrix} X_D \\ Y_D \end{bmatrix} =
\begin{bmatrix}
\sum_{i=0}^{I_k-1} \delta_{k_i} 2^{a_{k_i}} & -\sum_{j=0}^{J_k-1} \gamma_{k_j} 2^{b_{k_j}} \\
\sum_{j=0}^{J_k-1} \gamma_{k_j} 2^{b_{k_j}} & \sum_{i=0}^{I_k-1} \delta_{k_i} 2^{a_{k_i}}
\end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}, \tag{7}
\]
where $\delta_{k_i}, \gamma_{k_j} \in \{-1, 0, 1\}$ and $a_{k_i}, b_{k_j} \in \mathbb{N}$. Contrary to the
conventional CORDIC, in both EEAS CORDIC and MSR-CORDIC, the scaling depends on the rotation angle. Thus,
both approaches present solutions to compensate the scaling.
Other approaches for constant rotations [18]–[20] suggest
selecting a subset of CORDIC stages to approximate the rotation
angle. This reduces both the rotation error and the number of
micro-rotation stages.
Another alternative is to consider an elementary angle set
that is different from that of the CORDIC. This is done in [27],
where all the rotations are generated by combining a small
set of FFT angles. This set fits the rotation angles of the FFT
better than that of the CORDIC, which results in a reduction in
the rotation error, number of adders and latency of the circuit.
Rotators based on techniques to optimize real constant
multiplications [4]–[6] follow a different approach. In this case
the coefficients C and S are obtained by quantizing cos α and
sin α to a certain number of bits, b. This is usually done by
\[
\begin{aligned}
C &= \lfloor 2^b \cos\alpha \rceil \\
S &= \lfloor 2^b \sin\alpha \rceil,
\end{aligned} \tag{8}
\]
where $\lfloor \cdot \rceil$ represents a rounding operation, leading to
\[
\begin{bmatrix} X_D \\ Y_D \end{bmatrix} =
\begin{bmatrix} \lfloor 2^b \cos\alpha \rceil & -\lfloor 2^b \sin\alpha \rceil \\ \lfloor 2^b \sin\alpha \rceil & \lfloor 2^b \cos\alpha \rceil \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}. \tag{9}
\]
The multiplication by C and S is implemented as shift-and-add
operations. A typical approach is to use the canonical
signed digit (CSD) representation [6], [28]. This reduces the
number of non-zero digits with respect to the purely binary
representation and, therefore, the number of adders. Further
simplification is achieved by single constant multiplication
(SCM) techniques [30], [31]. They exploit the redundancy in
the multiplication by a single constant. Additional reduction
in complexity and improvements in accuracy can be obtained
by the addition-aware coefficient quantization method [37].
In addition, as the input of a rotator is multiplied by the real
and imaginary parts of the coefficient simultaneously, both
multiplications can be optimized together [36]. This is done
by multiple constant multiplication (MCM) techniques [32]–[35].
Finally, approaches based on trigonometric identities [7],
[24] search for expressions that are shared among the different
angles. This results in a simplified rotator that includes a
reduced number of adders, multiplexers and real constant
multiplications. For instance, ∀i ∈ Z, any angle α = i · π/8
can be calculated with real multiplications by only cos(π/8)
and/or sin(π/8) [24]. These real constant multiplications are
implemented by CSD [7] or SCM [24] techniques.
From the previous discussion we can note that previous
approaches restrict the set of coefficients used for the rotations:
According to (5), the CORDIC algorithm only calculates
rotations by the coefficients $C + jS = 2^k + j\delta_k = 2^k \pm j$.
For the EEAS CORDIC in (6) the coefficients only take values
$C + jS = 2^{b_k} + j(\delta_k 2^{c_k} + \gamma_k)$. The MSR-CORDIC in (7) only
considers values for C and S whose CSD representations have
I and J non-zero terms, respectively. And in multiplier-based
rotators $C + jS = \lfloor 2^b \cos\alpha \rceil + j\lfloor 2^b \sin\alpha \rceil$, according to (9).
Table I compares previous approaches in terms of coefficient
selection and shift-and-add implementation, which defines the
design space covered by the approach, i.e., the number of
alternative solutions that it explores. Table I also summarizes
the optimization problems that each approach can solve according to Section II, and positions the proposed method, to
be elaborated further in the next section.
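The conventional CORDIC micro-rotation in (5) can be sketched in a few lines of Python. This is only an illustration of the shift-and-add structure under the stated definitions; the function names `cordic_stage` and `stage_scaling` are hypothetical, and the multiplication by `delta` stands in for a hardware add or subtract.

```python
import math

def cordic_stage(x, y, k, delta):
    """Micro-rotation of (5): X_D = 2^k x - delta y, Y_D = delta x + 2^k y."""
    assert delta in (-1, 1)
    # Two shifts and two additions/subtractions; delta only selects the sign.
    return (x << k) - delta * y, delta * x + (y << k)

def stage_scaling(k):
    """Scaling factor of stage k: R(k) = sqrt(2^(2k) + 1)."""
    return math.sqrt(2 ** (2 * k) + 1)

X, Y = cordic_stage(100, 0, 2, 1)
angle = math.atan2(Y, X)           # rotation angle of the stage, atan(2^-2)
gain = math.hypot(X, Y) / 100      # magnitude growth, equal to R(2)
```

The only coefficients such a stage can realize are of the form 2^k ± j, which is the restriction on the coefficient selection discussed in this section.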
IV. PROPOSED MULTIPLIERLESS CONSTANT ROTATORS
The proposed approach presents a new perspective on the
design of multiplierless rotators. Contrary to previous approaches, the proposed approach does not set any restrictions
on C + jS a priori. Instead, the selection of the best coefficients
C + jS is done as a part of the design process, where it is
combined with the shift-and-add implementation (CCSSI).
TABLE I
COMPARISON OF DIFFERENT APPROACHES TO IMPLEMENT ROTATORS BASED ON SHIFT-AND-ADD OPERATIONS.
| Approach | Design space: Coefficient Selection | Design space: Shift-and-Add Optimization | Design space: Design Space Size | Optimization problem: Scaling | Optimization problem: Angle Set |
|---|---|---|---|---|---|
| General Rotators: angles not known a priori. | | | | | |
| Conventional CORDIC [13]–[15] | Small | High (Direct) | Small | Uniform | General Rotations |
| Complex Multiplier | Small | Low | Small | Unity | General Rotations |
| Constant Rotators: angles known a priori. | | | | | |
| Lifting Schemes [29] | Small | Medium (CSD) | Small | Unity | SCR |
| EEAS CORDIC [21] | Medium | High (Direct) | Small | Unity/Arbitrary | SCR |
| MSR-CORDIC [22], [23] | Large | Medium (CSD) | Medium | Unity | SCR |
| CORDIC for Fixed Angles [18]–[20] | Medium | High (Direct) | Medium | Unity/Arbitrary | SCR |
| Trigonometric Identities (CSD) [7] | Medium | Medium (CSD) | Medium | Unity | MCR for FFT |
| Trigonometric Identities (SCM) [24] | Medium | High (MCM) | Medium | Unity | MCR for FFT |
| Base-3 Rotator [27] | Medium | High (SCM, MCM) | Medium | Uniform | MCR for FFT |
| Rotators Using CSD [4], [6] | Small | Medium (CSD) | Small | Unity | MCR |
| Rotators Using MCM [5] | Small | High (MCM) | Small | Unity | MCR |
| CCSSI (Proposed) | Maximum (Complete Freedom) | High (SCM, MCM) | Large | Any | SCR and MCR |
A. Design Process
The proposed approach can solve SCR and MCR problems.
The optimization problem is defined by the angle set, the scaling and the hardware layout. The goal is to obtain coefficients
with the smallest rotation error and the smallest number of
adders.
Once the optimization problem has been defined, the design
method takes as inputs the word length of the coefficient
search space, b, the maximum allowed error, $\epsilon_{max}$, and the
number of adders allowed to perform the rotations. In case
of fixed or unity scaling, the radius, $R_{fixed}$, is provided. The
computation of the rotation error, $\epsilon$, is done as discussed in
Section II, while the computation of the number of adders
is discussed in Sections IV-B1 and IV-B2. When explaining
the method, we will in parallel consider the design of two
different scenarios: an SCR rotator for the angle α = 38°
and an MCR rotator for the angles $\alpha_1$ = 14° and $\alpha_2$ = 38°.
In the latter case, both angles shall have the same scaling. The
design examples will be performed using a word length of five
bits, i.e., $\mathbb{Z}_5 = \{-16, \ldots, 15\}$ according to (2), a maximum
error $\epsilon_{max} = 5 \times 10^{-2}$, and using at most four adders.
Step 1: First, the complete design space, consisting of all
possible finite word length values, is initialized, as illustrated
in Fig. 2(a) for our example. Here, there are $2^{2b-2}$ different
coefficients to consider for each angle, and $(2^{2b-2})^M$ cases
for a kernel of M angles.
Step 2: Select all possible coefficients that differ by at most an
angle $\delta = \sin^{-1}(\epsilon_{max})$ from the required angle(s), i.e.,
\[
\left| \tan^{-1}\!\left(\frac{S_i}{C_i}\right) - \alpha_i \right| < \delta. \tag{10}
\]
This is illustrated in Fig. 2(b). Naturally, for the SCR case,
only the coefficients approximating 38° will be kept. After
this step, the number of alternative coefficients for each angle
is reduced to about $\frac{\tan(\delta)}{\max(\sin\alpha_i, \cos\alpha_i)}\, 2^{2b-2}$.
Step 3: If the scaling is fixed, such as for unity scaling,
the search space is reduced further by selecting coefficients
whose scaling factor is close to $R_{fixed}$. Any coefficient whose
scaling differs by more than $R_{fixed}\,\epsilon_{max}$ from $R_{fixed}$ is discarded.
This is illustrated in Fig. 3 for the case of unity scaling, i.e.,
$R_{fixed} = 2^q$. It should be noted that all coefficients within the
$2^{q-1}$ region will also be present in the $2^q$ region, although
multiplied by a factor of two, which from a shift-and-add
perspective is not significant. The number of coefficients in a
region is about $4 R_{fixed}^2\, \epsilon_{max}^2$.
Fig. 2. Overview of the proposed method using the design example. (a) Initial
coefficient space and required angles (Step 1). (b) Remaining coefficients after
pruning based on the angle (Step 2). (c) Remaining coefficients after pruning
based on the number of adders (Step 4). (d) Valid coefficients after forming
kernels (Step 5). (Each panel plots S versus C over the range 0 to 15.)
Fig. 3. Coefficient selection for unity scaling.
Step 4: The number of adders required to implement each
rotation is determined as explained in Section IV-B1. This
can be stored in a table for all pairs of C and S to speed up
the computation of this step. Coefficients which require more
than the allowed number of adders are discarded. The resulting
coefficients for the example are illustrated in Fig. 2(c). To the
best of the authors' knowledge there are no known equations on
how many adders are needed on average to realize a coefficient
pair as a function of magnitude, which would be required to
estimate the number of remaining coefficients after this step.
Step 5 – SCR: For the SCR case, the algorithm now has
provided a number of candidate coefficients which all are valid
based on the specification. Hence, one can directly evaluate
the coefficients for α = 38° in Fig. 2(c) to come up with the
most suitable coefficient. Typically the one with the smallest
rotation error is selected, as the number of adders is within
the specification boundaries, but different trade-offs can be
considered. The candidates are listed in Table II. It should
be noted that there exist power-of-two multiples of the same
coefficient. Hence, for the SCR case it is actually enough to
keep coefficients in which at least one part is odd after Step
2 (Step 3 if the scaling is fixed).
TABLE II
Step 5 – SCR: REMAINING COEFFICIENTS FOR α = 38°, ACCORDING TO FIG. 2(C).
| α = 38° | R | ε | Add. |
|---|---|---|---|
| 4 + j3 | 5.00 | 1.97 × 10^−2 | 4 |
| 5 + j4 | 6.40 | 1.15 × 10^−2 | 4 |
| 8 + j6 | 10.00 | 1.97 × 10^−2 | 4 |
| 10 + j8 | 12.81 | 1.15 × 10^−2 | 4 |
Step 5 – MCR: For the MCR case, combinations which
have approximately the same radius are found. These can be
initially pruned based on the fact that no two coefficients whose radii
differ by more than twice the maximum error can form a kernel
meeting the specification. For these candidates the maximum
error is determined and those meeting the specification are
kept. For the example, the remaining candidate coefficients
are illustrated in Fig. 2(d). Depending on the hardware layout
constraints, as further discussed in Section IV-B2, further pruning can be done. If not, the final set of candidate coefficients
is obtained. For the example, these are listed in Table III.
Here, it can be noted that some coefficients where both parts
are even are used, and, hence, the same simplification that was
possible for SCR is not possible. Instead, the corresponding
simplification is that in a kernel, at least one of the included
coefficients should have an odd part.
TABLE III
Step 5 – MCR: REMAINING KERNELS ACCORDING TO FIG. 2(D).
| α1 = 14° | α2 = 38° | R | ε | Add. |
|---|---|---|---|---|
| 5 + j | 4 + j3 | 5.10 | 4.69 × 10^−2 | 4 |
| 9 + j2 | 8 + j6 | 9.59 | 4.66 × 10^−2 | 4 |
| 12 + j3 | 10 + j8 | 12.61 | 1.93 × 10^−2 | 4 |
Step 6: Implement the rotator. For SCR the implementation
is straightforward from the shift-and-add realization. This is
illustrated for the α = 38° case in Fig. 4(b). For MCR, if
several rotations are mapped to the same rotator, adder
sharing, as discussed in Section IV-B2, should be applied. In
Fig. 4 the two different realizations for α1 = 14° and α2 =
38° are shown, while the merged rotator is shown in Fig. 5.
Fig. 4. Realization of the rotators by 14° and 38° from the example in
Fig. 2. (a) Rotator by 14° using 12 + j3. (b) Rotator by 38° using 10 + j8.
Fig. 5. Rotator that can rotate either 14° or 38° using 4 adders.
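Steps 1, 2 and 4 of the SCR flow can be sketched in Python for the running example (α = 38°, b = 5, maximum error 5 × 10^-2, at most four adders). This is a rough sketch under stated assumptions, not the authors' implementation: the search is limited to the first quadrant, the error is taken as the sine of the angular deviation (arbitrary scaling), and `am_bound` is a hypothetical stand-in for the SCM/MCM adder counts that only shares identical odd parts and costs each one by its CSD digits. The rotator cost follows the special cases described in Section IV-B1.

```python
import math

def csd_nonzeros(n):
    """Number of non-zero digits in the canonical signed digit form of n >= 0."""
    count = 0
    while n:
        if n & 1:
            n -= 2 - (n & 3)      # digit is +1 if n % 4 == 1, else -1
            count += 1
        n >>= 1
    return count

def am_bound(*constants):
    """Assumed stand-in for AM(.): adders needed when only identical odd
    parts are shared (an upper bound on a true SCM/MCM result)."""
    odd_parts = set()
    for c in constants:
        c = abs(c)
        while c and c % 2 == 0:
            c //= 2
        if c > 1:
            odd_parts.add(c)
    return sum(csd_nonzeros(v) - 1 for v in odd_parts)

def rotator_adders(C, S):
    """Adder cost of a rotator by C + jS, including the special cases."""
    if S == 0:
        return 2 * am_bound(C)
    if C == 0:
        return 2 * am_bound(S)
    if abs(C) == abs(S):
        return 2 * am_bound(C) + 2
    return 2 * am_bound(C, S) + 2

def scr_candidates(alpha_deg, b, eps_max, max_adders):
    alpha = math.radians(alpha_deg)
    delta = math.asin(eps_max)                      # Step 2 angle bound
    found = []
    for C in range(2 ** (b - 1)):                   # Step 1: first quadrant
        for S in range(2 ** (b - 1)):
            if C == 0 and S == 0:
                continue
            dev = abs(math.atan2(S, C) - alpha)
            if dev >= delta:                        # Step 2: angle pruning
                continue
            if rotator_adders(C, S) > max_adders:   # Step 4: adder pruning
                continue
            found.append((C, S, math.hypot(C, S), math.sin(dev)))
    return found

candidates = scr_candidates(38.0, b=5, eps_max=5e-2, max_adders=4)
for C, S, R, eps in candidates:
    print(f"{C} + j{S}: R = {R:.2f}, eps = {eps:.2e}")
```

Under these assumptions the sketch keeps 4 + j3, 5 + j4, 8 + j6 and 10 + j8, the same coefficients as Table II. For larger adder budgets the bound in `am_bound` would be pessimistic, and a proper SCM/MCM algorithm would be needed.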
B. Shift-and-Add Implementation
1) Number of Adders for SCR: The shift-and-add implementation depends on the rotation angle. In general, a rotation
by P = C + jS is calculated according to Fig. 6(a) and the
total number of adders of the shift-and-add implementation is
\[
\mathrm{AR}(P) = 2 \cdot \mathrm{AM}(C, S) + 2, \tag{11}
\]
where AM(C, S) is the number of adders needed to multiply
a real number by C and S simultaneously.
If the rotation coefficient is real, i.e., P = C, the rotator
is reduced to two real multiplications. This case is shown in
Fig. 6(c), and the number of adders is
\[
\mathrm{AR}(P) = 2 \cdot \mathrm{AM}(C), \tag{12}
\]
where AM(C) is the number of adders needed to multiply by
a real number C.
Likewise, if the coefficient is a pure imaginary number, i.e.,
P = jS, the rotation has two real constant multiplications as
shown in Fig. 6(b), and the number of adders is
\[
\mathrm{AR}(P) = 2 \cdot \mathrm{AM}(S). \tag{13}
\]
Finally, if |C| = |S|, which is true for angles α = m·π/2 +
π/4, the structure of the rotator is shown in Fig. 6(d) and the
number of adders is
\[
\mathrm{AR}(P) = 2 \cdot \mathrm{AM}(C, C) + 2 = 2 \cdot \mathrm{AM}(C) + 2. \tag{14}
\]
These special cases require fewer adders than the general case
in Fig. 6(a). This fact is taken into account in order to make
better use of the adders and design simpler rotators.
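As a worked instance of (11), consider the coefficient P = 10 + j8 from the design example. The decomposition below is one possible shift-and-add realization, written only for illustration and not necessarily the exact structure drawn in the figures:

```latex
\[
10x = 2\,(4x + x), \qquad 8x = 2^{3}x
\quad\Rightarrow\quad \mathrm{AM}(10, 8) = 1 ,
\]
\[
\mathrm{AR}(10 + j8) = 2 \cdot \mathrm{AM}(10, 8) + 2 = 4 .
\]
```

Only the product by 5 needs an adder, since all remaining factors are shifts; the result agrees with the four adders reported for this coefficient in Table II.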
According to (11)–(14), the number of adders of a rotation
can be obtained directly from the number of adders for a
real constant multiplication by C, S or both of them. The
adder cost of a rotation is summarized in Table IV as a
function of the rotation angle. SCM techniques [30], [31] are
used to calculate the adder cost of multiplications by a single
real constant, represented by AM(C) and AM(S), and MCM
techniques [32]–[35] are used for multiplications by two real
constants, represented by AM(C, S).
Fig. 6. Structure of the rotator for the cases in Table IV. (a) General case
for which P = C + jS. (b) Rotation by P = jS. (c) Rotation by P = C.
(d) Rotation by P = C + jC.
TABLE IV
ADDER COST OF A ROTATION BY AN ANGLE α.
| Angle | Coefficient | Adder cost |
|---|---|---|
| α = m · π | P = C | 2 · AM(C) |
| α = m · π + π/2 | P = jS | 2 · AM(S) |
| α = m · π/2 + π/4 | P = C ± jC | 2 · AM(C) + 2 |
| Other angles | P = C + jS | 2 · AM(C, S) + 2 |
2) Number of Adders for MCR: The layout of the rotators
influences the total number of adders. For a single angle
in Fig. 1(a), the number of adders is obtained as explained
in Section IV-B1. For multiple branches with one angle per
rotator as in Fig. 1(c), the total number of adders of the kernel,
AK, is equal to the sum of the adders in all the rotators, i.e.,
\[
\mathrm{AK} = \sum_{p} \mathrm{AR}(P_p), \tag{15}
\]
where $P_p$ represents the rotation coefficient in branch p.
When data flows through a single branch and there are
multiple rotation angles as in Fig. 1(b), only one coefficient
is required at a time. This allows for merging the rotations
and sharing the adders among them by using additional
multiplexers. Thus, the total number of adders of the kernel
is set by the coefficient with the highest adder cost, i.e.,
\[
\mathrm{AK} = \max_{i} \{ \mathrm{AR}(P_i) \}, \tag{16}
\]
where $P_i$ represents the coefficients that are merged. Most
rotators admit different implementations, as sometimes additions and subtractions can be carried out in different orders.
This allows for finding shared terms among the coefficients
that reduce the number of multiplexers. For instance, both
Fig. 4(a) and 4(b) include a 1-bit shift at the input. Therefore,
the circuit in Fig. 5 does not need any multiplexer for the
corresponding path. When the number of angles is large,
the rotator may require more multiplexers to merge them.
However, the number of multiplexers can be reduced by not
merging all the rotations, at the cost of a larger number of
adders. For instance, the rotator in Fig. 5 has 4 adders and
6 multiplexers. Instead, the same rotator can be implemented
with 8 adders and 2 multiplexers by implementing the circuits
in Figs. 4(a) and 4(b) and multiplexing their outputs. An
intermediate solution with 6 adders and 4 multiplexers is also
possible.
Finally, the case of several branches and several rotations
in each branch (Fig. 1(d)) is a combination of previous ones:
Each rotator requires the maximum number of adders among
the angles that it has to rotate, and the total number of adders
is the sum of the adders of all the rotators, i.e.,
\[
\mathrm{AK} = \sum_{p} \max_{i} \{ \mathrm{AR}(P_{p,i}) \}, \tag{17}
\]
where $P_{p,i}$ is the ith coefficient of the pth branch.
V. EXPERIMENTAL RESULTS
This section presents experimental results of the proposed
approach in several contexts. The experiments use the MCM
algorithms in [32] and [33] to calculate the number of adders,
and the best result among them is selected. For SCM calculations, the optimum results from [30] have been considered.
The search is done for coefficients that can be represented
with word length up to 20 bits. This provides rotators with
enough accuracy for most applications. If needed, higher
accuracy can be achieved by increasing the maximum word
length used in the search.
A. SCR with Arbitrary Scaling
For SCR with arbitrary scaling, a comparison is done based
on the example in [20]. The work in [20] is based on finding
the optimal sequence of CORDIC rotations. In the example,
rotators for all odd degree angles between 1◦ and 45◦ are
found. Two measures are used for comparison. First, the
number of adders required to obtain an angular error smaller
than 0.04◦ is shown in Fig. 7. As can be seen, the proposed
approach requires six adders for two angles (23◦ and 27◦ )
where the approach in [20] only requires four adders. However,
there are seven angles where the approach in [20] requires
eight adders (four CORDIC rotations), where the proposed
approach only requires six adders. Hence, both the maximum
and average number of adders are reduced using the proposed
approach. Second, the maximum angular errors obtained using
a given number of adders are shown in Table V. Clearly,
the proposed approach results in a significantly smaller error,
especially when more adders are allowed.
TABLE VI
ANGLES WITH UNITY SCALING, 10 ADDERS.
MSR-CORDIC [23]
| TF | P0 | P1 | P = P0 · P1 | ε |
|---|---|---|---|---|
| W_128^1 | 32 + j511 | 55 − j4096 | 2094816 − j102967 | 9.57 × 10^−5 |
| W_128^2 | 256 + j7 | 2031 − j256 | 521728 − j51319 | 1.50 × 10^−4 |
| W_128^3 | 129 − j16 | 126 − j3 | 16206 − j2403 | 7.50 × 10^−5 |
| W_128^4 | 4097 − j1024 | 62 + j3 | 257086 − j51197 | 2.25 × 10^−4 |
| W_128^5 | 513 + j2048 | −j3973 | 8136704 − j2038149 | 6.21 × 10^−5 |
| W_128^6 | 35 | 56 − j17 | 1960 − j595 | 2.59 × 10^−4 |
| W_128^7 | 31 − j4 | 4097 − j896 | 123423 − j44164 | 1.11 × 10^−4 |
| W_128^8 | 30 − j | 511 − j192 | 15138 − j6271 | 9.81 × 10^−5 |
| W_128^9 | 16 + j33 | 4 − j447 | 14815 − j7020 | 9.44 × 10^−4 |
| W_128^10 | 15 − j1024 | 31 + j56 | 57809 − j30904 | 2.38 × 10^−4 |
| W_128^11 | 126 − j | 56 − j33 | 7023 − j4214 | 5.24 × 10^−4 |
| W_128^12 | 257 − j16 | 55 − j32 | 13623 − j9104 | 9.47 × 10^−5 |
| W_128^13 | 8 − j7 | 384 + j31 | 3289 − j2440 | 2.30 × 10^−4 |
| W_128^14 | 2 + j7 | −56 − j129 | 791 − j650 | 6.64 × 10^−4 |
| W_128^15 | 48 − j | 129 − j112 | 6080 − j5505 | 1.31 × 10^−3 |
| W_128^16 | −j193 | 15 + j15 | 2895 − j2895 | 4.52 × 10^−4 |
Constant CORDIC [20]
Proposed
10
5
0
0
5
10
15
20
25
Angle
30
35
40
45
Fig. 7. Number of adders required to obtain an angular error smaller than
0.04◦ for angles 1◦ , 3◦ , 5◦ , . . . , 45◦ .
TABLE V
M AXIMUM A NGULAR E RROR IN D EGREES U SING A G IVEN N UMBER OF
A DDERS FOR A NGLES 1◦ , 3◦ , 5◦ , . . . , 45◦ .
Adders
4
6
8
10
12
Constant CORDIC [20]
1.875
n/a
0.037
n/a
∼ 5 × 10−4
Proposed
1.31
0.0271
8.54 × 10−4
5.37 × 10−6
5.08 × 10−9
B. SCR with Unity Scaling
For SCR with unity scaling, Table VI compares the proposed approach with the MSR-CORDIC [23] for the twiddle
i
factors W128
= cos(2πi/128)−j·sin(2πi/128), i = 1, . . . , 15.
The MSR-CORDIC consists of two stages in series with
coefficients P0 and P1 , leading to a rotation by a coefficient
P = P0 · P1 . Conversely, the proposed approach uses a single
stage. In both approaches, the scaling of each angle is very
close to a power of two, 2q , which provides unity scaling. The
rotation error is the distance from the complex coefficient to
i
2k · W128
. The error is expressed in terms of effective word
length (W LE ), which is defined as the number of bits of the
output that are guaranteed to be accurate, and is calculated as
3
W LE = − log2 √ = − log2 + .
2
2 2
W LE
(18)
Finally, the table includes the coefficient word length (W L).
14.85
14.20
15.20
13.62
15.47
13.41
14.61
14.82
11.55
13.54
12.40
14.87
13.59
12.06
11.07
12.61
WL
21
19
14
18
23
11
17
14
14
16
13
14
12
10
13
12
P
4091 − j201
8152 − j803
259249 − j38432
64287 − j12784
127147 − j31852
3920 − j1189
964 − j345
7568 − j3135
7408 − j3503
3612 − j1931
28101 − j16856
54495 − j36414
6577 − j4880
6334 − j5199
1517 − j1376
46341 − j46336
PROPOSED
−5
1.68 × 10
6.77 × 10−5
2.53 × 10−4
1.58 × 10−4
3.91 × 10−5
9.09 × 10−5
1.40 × 10−4
5.19 × 10−5
3.13 × 10−4
9.38 × 10−5
3.39 × 10−4
8.60 × 10−5
3.51 × 10−4
3.10 × 10−4
3.90 × 10−4
7.55 × 10−5
W LE
WL
17.36
15.35
13.45
14.13
16.14
14.92
14.30
15.73
13.14
14.88
13.03
15.01
12.98
13.16
12.82
15.19
13
14
19
17
18
13
11
14
14
13
16
17
14
14
12
17
The results of both methods consider rotators that use at
most 10 adders. The results for the MSR-CORDIC are taken
from Table III in [23], and represented as C + jS instead of
numbers in the range [−1, 1]. A comparison of the rotation errors
shows that the maximum rotation error is $1.31 \times 10^{-3}$
for the MSR-CORDIC and $3.90 \times 10^{-4}$ for the proposed
approach, i.e., the proposed approach reduces the maximum
rotation error by a factor of 3.36. The mean error is also
reduced from $3.46 \times 10^{-4}$ in the MSR-CORDIC to $1.73 \times 10^{-4}$
in the proposed approach, which is a reduction of 50%.
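As a sanity check on the error measure and on Eq. (18), the following sketch recomputes one entry of Table VI. It assumes that the tabulated error is the distance to $2^k \cdot W_{128}^i$ divided by $2^k$, with $2^k$ taken as the power of two nearest to the coefficient magnitude. That normalization is inferred from the tabulated values rather than stated in the text, and the function names are illustrative.

```python
import cmath
import math

def rotation_error(coeff: complex, i: int, L: int = 128) -> float:
    """Distance from coeff to 2**k * W_L**i, normalized by 2**k (assumed convention)."""
    k = round(math.log2(abs(coeff)))            # nearest power-of-two scaling
    target = cmath.exp(-2j * math.pi * i / L)   # W_L^i = cos(.) - j*sin(.)
    return abs(coeff / 2 ** k - target)

def effective_word_length(eps: float) -> float:
    """Eq. (18): WL_E = -log2(eps / (2*sqrt(2))) = -log2(eps) + 3/2."""
    return -math.log2(eps) + 1.5

eps = rotation_error(4091 - 201j, 1)   # proposed coefficient for i = 1
print(f"{eps:.2e}", f"{effective_word_length(eps):.2f}")
```

Under these assumptions the proposed coefficient 4091 − j201 evaluates to roughly $1.68 \times 10^{-5}$ and 17.36 bits, in line with the first row of the table.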
C. MCR with Uniform Scaling
The FFT calculates rotations by the twiddle factors $W_L^i = e^{-j 2\pi i/L}$, $i = 0, \ldots, L-1$. The number of angles in the set,
$L$, is usually a power of two and its value depends on the
FFT stage, as well as on the radix and decomposition [1], [4],
[38]. Apart from $W_4$, which only involves trivial rotations [1]
and is very simple to implement, $W_8$, $W_{16}$ and $W_{32}$ are the
most common twiddle factors in FFT architectures: Radix-2
FFTs of size $N \geq 32$ calculate $W_8$, $W_{16}$ and $W_{32}$ rotations,
a 4096-point radix-$2^3$ FFT needs $W_8$ rotators at four stages
of the architecture, and a 4096-point radix-$2^4$ FFT needs $W_{16}$
rotators at three stages [1], [38].
The twiddle factors are specific sets of angles generated
by dividing the circumference into L equal parts. This leads to
multiple symmetries in the complex plane. As a result, for an
L-point kernel only M = L/8 + 1 angles in the range [0, π/4]
need to be considered. The remaining rotations can be calculated
from those in [0, π/4] by interchanging the real and imaginary
components of the input and output data and/or the signs of
the outputs. According to this and following the criterion of
previous works, we present the results and the circuits for
rotations in the range [0, π/4]. However, it is important to
keep in mind that a circuit that computes the whole kernel
in [0, 2π] may require two additional real adders, which are
equivalent to a complex adder.
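The symmetry argument can be sketched in a few lines. The example below rotates by any multiple of 2π/L using only the M = L/8 + 1 kernel angles in [0, π/4], plus conjugations and multiplications by powers of j, which only swap components and change signs. The positive rotation direction, the function name and the use of floating-point kernel values are assumptions chosen for illustration; they are not taken from the circuits in the paper.

```python
import cmath
import math

TRIVIAL = (1, 1j, -1, -1j)  # j**q: component swaps and sign changes only

def rotate_with_kernel(z: complex, i: int, L: int) -> complex:
    """Rotate z by 2*pi*i/L using only the L/8 + 1 angles in [0, pi/4]."""
    kernel = [cmath.exp(2j * math.pi * m / L) for m in range(L // 8 + 1)]
    q, r = divmod(i % L, L // 4)   # quadrant index and offset inside the quadrant
    if r <= L // 8:
        w = z * kernel[r]
    else:
        # Angle pi/2 - phi: conjugate the input, rotate by phi,
        # conjugate the output, then apply one extra factor of j.
        w = (z.conjugate() * kernel[L // 4 - r]).conjugate()
        q += 1
    return w * TRIVIAL[q % 4]

z = 3 - 2j
print(rotate_with_kernel(z, 11, 32), z * cmath.exp(2j * math.pi * 11 / 32))
```

The sign changes introduced by the conjugations are what may cost the additional real adders mentioned above when the full range [0, 2π] is required.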
Generally, a scaling in the rotations of the FFT is permissible, as long as it is the same for all the data [12]. Therefore,
uniform scaling is considered in this experiment. Likewise,
this experiment assumes the layout of a single branch with
multiple rotations shown in Fig. 1(b).

Fig. 8. Error of the proposed low-complexity rotators for an 8-point kernel
($W_8$) as a function of the word length and the number of adders.
Fig. 9. Error of the proposed low-complexity rotators for a 16-point kernel
($W_{16}$) as a function of the word length and the number of adders.
Fig. 10. Error of the proposed low-complexity rotators for a 32-point kernel
($W_{32}$) as a function of the word length and the number of adders.
(Plots in Figs. 8 to 10: rotation error, $\epsilon$, on a logarithmic axis versus coefficient word length in bits, from 4 to 20; series: Proposed with 4, 6, 8 and 10 adders, Non-scaled, Upper bound and Lower bound.)

1) Obtaining the Kernels: Figures 8, 9 and 10 show the
proposed results for $W_8$, $W_{16}$ and $W_{32}$, respectively. The
figures show the trade-off between rotation error and number
of adders. They include the upper and lower error bounds,
and the error for non-scaled coefficients. The upper bound is
the worst-case approximation error corresponding to one half
of the weight of the least significant bit (LSB) for both real
and imaginary parts. This upper bound shows the points for
which the effective word length is equal to the word length
of the coefficients, i.e., $WL = WL_E$. The lower bound is the
minimum error that can be achieved for a given incremental
word length. This lower bound is provided in [12]. Finally,
the error for non-scaled coefficients is the error of the kernels
when the coefficients are simply obtained by rounding the sine
and cosine of the angle to the closest values. The search has
been done for coefficient word lengths up to 20 bits.
The upper and lower bounds and the case of non-scaled
coefficients assume a full complex multiplier without adder
optimization. Therefore, any result below the non-scaled case
improves it in both accuracy and number of adders. For
example, in Fig. 8 a $W_8$ kernel with $WL = 14$ that uses
non-scaled coefficients achieves an error of $5.34 \times 10^{-5}$ and
requires four real multipliers and two adders. Conversely, the
proposed approach only needs 6 adders to provide an error of
$4.64 \times 10^{-8}$, i.e., more than three orders of magnitude smaller.

Fig. 11. Circuits for the calculation of $W_8$ rotations. (a) Kernel [543, 384 +
j384], 4 adders. (b) Kernel [577, 408 + j408], 6 adders.
(Circuit diagram outputs: (a) X = 543x or 384x − 384y, Y = 543y or 384y + 384x; (b) X = 577x or 408x − 408y, Y = 577y or 408y + 408x.)
Table VII summarizes the most relevant results from
Figs. 8, 9 and 10. The first columns of the table show
the coefficients that are used for the angles of the kernel,
whereas the following columns include the parameters of the
kernel: normalized error ($\epsilon$), coefficient word length ($WL$) and
number of adders. The error is provided both in linear units
and in effective word length, $WL_E$. Those kernels marked
with (⋆) achieve the lowest rotation error for their word length.
The table shows various efficient alternatives to calculate
accurate rotations with few adders. For instance, W8 with
accuracy of 29.35 bits can be calculated with only 6 adders.
For W16 an accuracy of 22.69 bits is achieved with 10 adders.
Compared to a general complex multiplier, this corresponds
to two adders per real multiplier.
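The kernel errors reported for uniform scaling can be reproduced with a short sketch. It assumes the error of a kernel is the smallest value, over a common real scaling R, of the largest normalized distance $|P_i - R e^{j\alpha_i}|/R$. This definition is an assumption here (the formal one appears earlier in the paper), although it matches the $W_8$ entries checked below. The ternary search and all names are illustrative choices, not the authors' search procedure.

```python
import cmath
import math

def kernel_error(coeffs, angles, iters=200):
    """min over R of max_i |P_i - R*exp(j*a_i)| / R, searched over t = 1/R."""
    targets = [cmath.exp(1j * a) for a in angles]

    def worst(t):
        # The largest normalized distance is a convex function of t.
        return max(abs(t * p - u) for p, u in zip(coeffs, targets))

    mags = [abs(p) for p in coeffs]
    lo, hi = 0.5 / max(mags), 2.0 / min(mags)
    for _ in range(iters):               # ternary search on a convex function
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if worst(m1) < worst(m2):
            hi = m2
        else:
            lo = m1
    return worst((lo + hi) / 2)

def effective_word_length(eps):
    return -math.log2(eps) + 1.5

w8 = [0.0, math.pi / 4]
for kernel in ([543, 384 + 384j], [577, 408 + 408j]):
    eps = kernel_error(kernel, w8)
    print(kernel, f"{eps:.2e}", f"{effective_word_length(eps):.2f}")

# Non-scaled case: round cos/sin after scaling by 2**12 (assumed to correspond to WL = 14).
non_scaled = [complex(round(4096 * math.cos(a)), round(4096 * math.sin(a))) for a in w8]
print(non_scaled, f"{kernel_error(non_scaled, w8):.2e}")
```

Under these assumptions the kernels [543, 384 + j384] and [577, 408 + j408] evaluate to about $5.34 \times 10^{-5}$ and $7.51 \times 10^{-7}$, and the rounded non-scaled kernel also gives about $5.34 \times 10^{-5}$.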
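TABLE VII
DESIGNED LOW-COMPLEXITY ROTATORS FOR FFT.

| TF | Coefficients: 0 | π/16 | π/8 | 3π/16 | π/4 | Properties: $\epsilon$ | $WL_E$ | $WL$ | Add. |
|---|---|---|---|---|---|---|---|---|---|
| $W_8$ | 7 | - | - | - | 5 + j5 | $5.05 \times 10^{-3}$ ⋆ | 9.13 | 4 | 4 |
| $W_8$ | 17 | - | - | - | 12 + j12 | $8.67 \times 10^{-4}$ ⋆ | 11.67 | 6 | 4 |
| $W_8$ | 543 | - | - | - | 384 + j384 | $5.34 \times 10^{-5}$ | 15.69 | 11 | 4 |
| $W_8$ | 577 | - | - | - | 408 + j408 | $7.51 \times 10^{-7}$ ⋆ | 21.84 | 11 | 6 |
| $W_8$ | 6149 | - | - | - | 4348 + j4348 | $4.64 \times 10^{-8}$ | 25.86 | 14 | 6 |
| $W_8$ | 196587 | - | - | - | 139008 + j139008 | $4.14 \times 10^{-9}$ | 29.35 | 19 | 6 |
| $W_8$ | 19601 | - | - | - | 13860 + j13860 | $1.30 \times 10^{-9}$ ⋆ | 31.02 | 16 | 8 |
| $W_8$ | 208885 | - | - | - | 147704 + j147704 | $8.02 \times 10^{-11}$ | 35.04 | 19 | 8 |
| $W_8$ | 275807 | - | - | - | 195025 + j195025 | $6.57 \times 10^{-12}$ ⋆ | 38.65 | 20 | 10 |
| $W_{16}$ | 85 | - | 80 + j32 | - | 60 + j60 | $1.25 \times 10^{-2}$ | 7.82 | 8 | 4 |
| $W_{16}$ | 623 | - | 576 + j238 | - | 441 + j441 | $8.70 \times 10^{-4}$ | 11.67 | 11 | 6 |
| $W_{16}$ | 669 | - | 618 + j256 | - | 473 + j473 | $5.86 \times 10^{-5}$ ⋆ | 15.56 | 11 | 8 |
| $W_{16}$ | 8027 | - | 7416 + j3072 | - | 5676 + j5676 | $2.21 \times 10^{-5}$ | 16.97 | 14 | 8 |
| $W_{16}$ | 21059 | - | 19456 + j8059 | - | 14891 + j14891 | $2.67 \times 10^{-6}$ | 20.01 | 16 | 10 |
| $W_{16}$ | 349093 | - | 322520 + j133592 | - | 246846 + j246846 | $4.19 \times 10^{-7}$ | 22.69 | 20 | 10 |
| $W_{16}$ | 513764 | - | 474656 + j196609 | - | 363286 + j363286 | $8.51 \times 10^{-8}$ | 24.98 | 20 | 12 |
| $W_{32}$ | 75 | 72 + j16 | 72 + j32 | 64 + j40 | 56 + j56 | $4.08 \times 10^{-2}$ | 6.12 | 8 | 4 |
| $W_{32}$ | 173 | 170 + j34 | 160 + j66 | 144 + j96 | 122 + j122 | $2.52 \times 10^{-3}$ | 10.13 | 9 | 6 |
| $W_{32}$ | 209 | 205 + j41 | 193 + j80 | 174 + j116 | 148 + j148 | $1.06 \times 10^{-3}$ ⋆ | 11.39 | 9 | 8 |
| $W_{32}$ | 1159 | 1137 + j226 | 1071 + j444 | 964 + j644 | 820 + j820 | $3.02 \times 10^{-4}$ | 13.19 | 12 | 8 |
| $W_{32}$ | 88600 | 86912 + j17280 | 81856 + j33919 | 73680 + j49248 | 62660 + j62660 | $1.79 \times 10^{-4}$ | 13.95 | 18 | 8 |
| $W_{32}$ | 21197 | 20790 + j4136 | 19584 + j8112 | 17624 + j11776 | 14988 + j14988 | $3.55 \times 10^{-5}$ | 16.28 | 16 | 10 |
| $W_{32}$ | 142009 | 139280 + j27704 | 131199 + j54344 | 118076 + j78896 | 100415 + j100415 | $3.55 \times 10^{-6}$ | 19.60 | 19 | 12 |

⋆: Kernels labeled with (⋆) reach the minimum achievable error for their word length.

Fig. 12. Circuit for the calculation of $W_{16}$ rotations. Kernel [349093,
322520 + j133592, 246846 + j246846], 10 adders.
(Circuit diagram outputs: X = 349093x, 322520x − 133592y or 246846x − 246846y; Y = 349093y, 322520y + 133592x or 246846x + 246846y.)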
2) Shift-and-Add Implementation: Figures 11 and 12 show
the hardware circuits for some kernels in Table VII. They
consist of adders, multiplexers and shifters. In all the implementations the rotation angle is selected using the control
signals of the multiplexers. The different output configurations
are marked with distinct symbols in the figures. For the FFT the
control signals can be generated directly from the bits of a
counter [14]. This removes the necessity of a memory to store
the rotation coefficients.
Figure 11 shows two circuits for W8 , i.e., it considers the
angles α1 = 0 and α2 = π/4. Figure 11(a) shows the kernel
[543, 384 + j384]. This circuit requires 4 adders and achieves
an accuracy of 15.69 bits, as shown in Table VII. Depending
on the configuration of the multiplexers, the circuit multiplies
the input signal either by 543 or by 384 + j384. These multiplications are carried out by taking into account the shift-and-add
representations of the numbers, i.e., $543A = 2^5 \cdot (2^4 A + A) - A$
and $384A = 2^7 \cdot (2^1 A + A)$.
Figure 11(b) shows another option for W8 . In this case, the
circuit considers the kernel [577, 408 + j408]. This circuit
requires two more adders than that in Fig. 11(a). This reduces
the rotation error to $7.51 \times 10^{-7}$, i.e., by approximately two
orders of magnitude or, equivalently, about six additional correct fractional
bits. As a result, both circuits in Figure 11 are efficient
implementations for W8 , and provide a trade-off between
accuracy and hardware resources.
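The shift-and-add decompositions quoted for the kernel of Fig. 11(a) can be checked with a small integer model. This is a behavioral sketch only. The function names are hypothetical, the `select_pi_4` flag stands in for the multiplexer control signals, and the way the four adders are shared between the two configurations is one plausible reading, not a transcription of the circuit.

```python
def times_543(a: int) -> int:
    """543*a = 2**5 * (2**4 * a + a) - a: two adders, shifts are just wiring."""
    t = (a << 4) + a        # 17*a   (first adder)
    return (t << 5) - a     # 543*a  (second adder, used as a subtractor)

def times_384(a: int) -> int:
    """384*a = 2**7 * (2*a + a): one adder."""
    return ((a << 1) + a) << 7

def rotate_w8(x: int, y: int, select_pi_4: bool):
    """Multiply x + jy by 543 (angle 0) or by 384 + j384 (angle pi/4)."""
    if not select_pi_4:
        return times_543(x), times_543(y)          # 2 + 2 adders
    # (x + jy)(384 + j384) = 384(x - y) + j*384(x + y): 2 + 2 adders
    return times_384(x - y), times_384(x + y)

print(rotate_w8(5, -3, False), rotate_w8(5, -3, True))
```

Either configuration uses four additions in this model, which is consistent with the 4 adders reported for this kernel once the hardware is shared through the multiplexers.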
Finally, Fig. 12 shows an example for W16 . This kernel consists of the coefficients [349093, 322520 + j133592,
246846 + j246846]. The kernel achieves a precision of 22.69
correct bits by using 10 adders, as shown in Table VII.
3) Comparison: Figures 13 and 14 compare the proposed
rotators from Table VII with other multiplierless rotators for
$W_{16}$ and $W_{32}$ in the literature. The previous approaches
include rotators based on MCM [5]¹, Booth encoding [25],
trigonometric identities [7], [24], base-3 rotators [27], MSR-CORDIC [23]² and non-redundant CORDIC [13]. The number
of adders in the figures is for rotations in the range [0, π/4].
As noted above, two more real adders are needed to calculate
rotations in [0, 2π].
Figure 13 shows the results for $W_{16}$. Except for the
CORDIC algorithm, which is a general rotator, all the approaches are constant rotators that offer ad-hoc solutions for
$W_{16}$, leading to more accurate results. Among them, the
proposed approach achieves a smaller error and requires fewer adders
than previous approaches. Quantitatively, it reduces the error
by more than one order of magnitude in kernels of 6 and 8
adders and by more than three orders of magnitude for 10 adders.
Furthermore, in order to meet the same precision as the
proposed 10-adder kernel, previous approaches need at least
18 adders, i.e., almost twice the number of adders. For $W_{32}$
rotators in Fig. 14, the proposed approach also outperforms
previous ones with significant reductions in adders and error.
For the evaluation of the total hardware cost, Table VIII
includes the number of 2-input multiplexers used in the
rotators. The table considers rotators for which $WL_E \geq 16$ or,
otherwise, the best cases provided. The hardware cost is calculated considering that the area of a multiplexer is one third of
the area of an adder [39], i.e., HW Cost = Adders + Mux/3.
The latency of the circuits is calculated as the number of
adders in the critical path. The proposed rotators are taken
from Table VII. The proposed 8-adder rotator has the lowest
latency and hardware cost among all the approaches. The
proposed 10-adder rotator achieves the highest accuracy with
low hardware cost and low latency.

¹ For the results presented here, the algorithm in [33] is used, which
generally gives fewer adders compared to the algorithm used originally [5].
² The $W_{16}$ and $W_{32}$ kernels are obtained by combining the $W_{128}^i$ rotations
in Table VI: $W_{16}$ corresponds to i = 0, 8, 16 and $W_{32}$ corresponds to i =
0, 4, 8, 12, 16.

Fig. 13. Error versus number of adders of different $W_{16}$ rotators. (Plot: rotation error, $\epsilon$, on a logarithmic axis versus adder cost, 0 to 30; series: MCM-based [5], R. Booth [25], Trig. Id. [7], Trig. Id. [24], Base-3 [27], CSD-based [4], MSR-CORDIC [23], CORDIC [13] and Proposed.)
Fig. 14. Error versus number of adders of different $W_{32}$ rotators. (Plot: rotation error, $\epsilon$, on a logarithmic axis versus adder cost, 0 to 35; series: MCM-based [5], R. Booth [25], Trig. Id. [24], Base-3 [27], MSR-CORDIC [23], CORDIC [13] and Proposed.)

TABLE VIII
COMPARISON OF $W_{16}$ ROTATORS FOR $WL_E \geq 16$ BITS OR BEST RESULTS PROVIDED.

| Approach | Mux. | Add. | HW Cost | Lat. | Error | $WL_E$ |
|---|---|---|---|---|---|---|
| Base-3 [27] | 16 | 16 | 21.33 | 8 | 5.02e-7 | 21.58 |
| CSD-based [4] | 14 | 30 | 34.67 | 5 | 1.31e-4 | 14.31 |
| Trig. Id. [24] | 8 | 16 | 18.67 | 12 | 4.94e-6 | 19.12 |
| MSR-CORDIC [23] | 14 | 10 | 14.67 | 4 | 2.73e-4 | 13.33 |
| R. Booth [25] | 28 | 22 | 31.33 | 4 | 1.62e-5 | 17.41 |
| MCM-based [5] | 8 | 18 | 20.67 | 4 | 1.62e-5 | 17.41 |
| Proposed, 8 adders | 16 | 8 | 13.33 | 4 | 2.21e-5 | 16.97 |
| Proposed, 10 adders (Fig. 12) | 18 | 10 | 16 | 5 | 4.19e-7 | 22.69 |

VI. CONCLUSION
This paper presents a new approach to design low-complexity multiplierless constant rotators, based on combined
coefficient selection and shift-and-add implementation. This
combination increases the number of alternatives in the design,
which widens the design space with respect to previous works.
The proposed approach applies to many hardware scenarios
where rotations are carried out. These scenarios include rotations by a single angle or multiple angles, rotators in a single or
multiple branches, and uniform, non-uniform or unity scaling.
Experimental results for different contexts are provided. In
all cases, significant reductions in complexity and improvements in accuracy are observed with respect to the state of the
art.

VII. ACKNOWLEDGMENT
The authors would like to thank Dr. Martin Kumm and
Dr. Saied Hemati for their valuable suggestions about the
presentation of this work.
REFERENCES
[1] M. Garrido, J. Grajal, M. A. Sánchez, and O. Gustafsson, “Pipelined
radix-$2^k$ feedforward FFT architectures,” IEEE Trans. VLSI Syst.,
vol. 21, no. 1, pp. 23–32, Jan. 2013.
[2] S. He and M. Torkelson, “Design and implementation of a 1024-point
pipeline FFT processor,” in Proc. IEEE Custom Integrated Circuits
Conf., May 1998, pp. 131–134.
[3] S.-N. Tang, J.-W. Tsai, and T.-Y. Chang, “A 2.4-GS/s FFT processor
for OFDM-based WPAN applications,” IEEE Trans. Circuits Syst. II,
vol. 57, no. 6, pp. 451–455, Jun. 2010.
[4] C.-H. Yang, T.-H. Yu, and D. Markovic, “Power and area minimization
of reconfigurable FFT processors: A 3GPP-LTE example,” IEEE J.
Solid-State Circuits, vol. 47, no. 3, pp. 757–768, Mar. 2012.
[5] W. Han, A. T. Erdogan, T. Arslan, and M. Hasan, “High-performance
low-power FFT cores,” ETRI J., vol. 30, no. 3, pp. 451–460, Jun. 2008.
[6] H. Liu and H. Lee, “A high performance four-parallel 128/64-point
radix-$2^4$ FFT/IFFT processor for MIMO-OFDM systems,” in Proc.
IEEE Asia-Pacific Conf. Circuits Syst., Nov. 2008, pp. 834–837.
[7] J.-Y. Oh and M.-S. Lim, “New radix-2 to the 4th power pipeline FFT
processor,” IEICE Trans. Electron., vol. E88-C, no. 8, pp. 1740–1746,
Aug. 2005.
[8] C. Loeffler, A. Ligtenberg, and G. Moschytz, “Practical fast 1-D DCT
algorithms with 11 multiplications,” in Proc. IEEE Int. Conf. Acoust.
Speech Signal Process., vol. 2, May 1989, pp. 988–991.
[9] Z. Wu, J. Sha, Z. Wang, L. Li, and M. Gao, “An improved scaled DCT
architecture,” IEEE Trans. Consum. Electron., vol. 55, no. 2, pp. 685–
689, May 2009.
[10] A. Gray, Jr. and J. Markel, “Digital lattice and ladder filter synthesis,”
IEEE Trans. Audio Electroacoust., vol. 21, no. 6, pp. 491–500, Jun.
1973.
[11] P. P. Vaidyanathan, “Passive cascaded-lattice structures for low-sensitivity FIR filter design, with applications to filter banks,” IEEE
Trans. Circuits Syst., vol. 33, no. 11, pp. 1045–1064, Nov. 1986.
[12] M. Garrido, O. Gustafsson, and J. Grajal, “Accurate rotations based on
coefficient scaling,” IEEE Trans. Circuits Syst. II, vol. 58, no. 10, pp.
662–666, Oct. 2011.
[13] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE
Trans. Electronic Computing, vol. EC-8, pp. 330–334, Sep. 1959.
[14] M. Garrido and J. Grajal, “Efficient memoryless CORDIC for FFT
computation,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process.,
vol. 2, Apr. 2007, pp. 113–116.
[15] R. Andraka, “A survey of CORDIC algorithms for FPGA based computers,” in Proc. ACM/SIGDA Int. Symp. FPGAs, Feb. 1998, pp. 191–200.
[16] P. K. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, “50
years of CORDIC: Algorithms, architectures, and applications,” IEEE
Trans. Circuits Syst. I, vol. 56, no. 9, pp. 1893–1907, Sep. 2009.
[17] C.-Y. Chen and C.-Y. Lin, “High-resolution architecture for CORDIC
algorithm realization,” in Proc. Int. Conf. Comm. Circuits Syst., vol. 1,
Jun. 2006, pp. 579–582.
[18] Y. Hu and S. Naganathan, “An angle recoding method for CORDIC
algorithm implementation,” IEEE Trans. Comput., vol. 42, no. 1, pp.
99–102, Jan. 1993.
[19] C.-S. Wu and A.-Y. Wu, “Modified vector rotational CORDIC (MVR-CORDIC) algorithm and architecture,” IEEE Trans. Circuits Syst. II, vol. 48, no. 6, pp.
548–561, Jun. 2001.
[20] P. K. Meher and S. Y. Park, “CORDIC designs for fixed angle of
rotation,” IEEE Trans. VLSI Syst., vol. 21, no. 2, pp. 217–228, Feb.
2013.
[21] C.-S. Wu, A.-Y. Wu, and C.-H. Lin, “A high-performance/low-latency
vector rotational CORDIC architecture based on extended elementary
angle set and trellis-based searching schemes,” IEEE Trans. Circuits
Syst. II, vol. 50, no. 9, pp. 589–601, Sep. 2003.
[22] C.-H. Lin and A.-Y. Wu, “Mixed-scaling-rotation CORDIC (MSR-CORDIC) algorithm and architecture for high-performance vector rotational DSP applications,” IEEE Trans. Circuits Syst. I, vol. 52, no. 11,
pp. 2385–2396, Nov. 2005.
[23] S. Y. Park and Y. J. Yu, “Fixed-point analysis and parameter selections of
MSR-CORDIC with applications to FFT designs,” IEEE Trans. Signal
Process., vol. 60, no. 12, pp. 6245–6256, Dec. 2012.
[24] F. Qureshi and O. Gustafsson, “Low-complexity constant multiplication
based on trigonometric identities with applications to FFTs,” IEICE
Trans. Fundamentals, vol. E94-A, no. 11, pp. 324–326, Nov. 2011.
[25] Y.-E. Kim, K.-J. Cho, and J.-G. Chung, “Low power small area modified
Booth multiplier design for predetermined coefficients,” IEICE Trans.
Fundamentals, vol. E90-A, no. 3, pp. 694–697, Mar. 2007.
[26] V. Karkala, J. Wanstrath, T. Lacour, and S. P. Khatri, “Efficient arithmetic
sum-of-product (SOP) based multiple constant multiplication (MCM) for
FFT,” in Proc. IEEE/ACM Int. Comput.-Aided Design Conf., Nov. 2010,
pp. 735–738.
[27] P. Källström, M. Garrido, and O. Gustafsson, “Low-complexity rotators
for the FFT using base-3 signed stages,” in Proc. IEEE Asia-Pacific
Conf. Circuits Syst., Dec. 2012, pp. 519–522.
[28] J. Jedwab and C. Mitchell, “Minimum weight modified signed-digit
representations and fast exponentiation,” Electron. Lett., vol. 25, no. 17,
pp. 1171–1172, Aug. 1989.
[29] S. C. Chan and P. M. Yiu, “An efficient multiplierless approximation
of the fast Fourier transform using sum-of-powers-of-two (SOPOT)
coefficients,” IEEE Signal Process. Lett., vol. 9, no. 10, pp. 322–325,
Oct. 2002.
[30] O. Gustafsson, A. G. Dempster, K. Johansson, M. D. Macleod, and
L. Wanhammar, “Simplified design of constant coefficient multipliers,”
Circuits Syst. Signal Process., vol. 25, no. 4, pp. 225–251, Apr. 2006.
[31] J. Thong and N. Nicolici, “Time-efficient single constant multiplication
based on overlapping digit patterns,” IEEE Trans. VLSI Syst., vol. 17,
no. 9, pp. 1353–1357, Sep. 2009.
[32] A. G. Dempster and M. D. Macleod, “Multiplication by two integers
using the minimum number of adders,” in Proc. IEEE Int. Symp. Circuits
Syst., vol. 2, May 2005, pp. 1814–1817.
[33] O. Gustafsson, “A difference based adder graph heuristic for multiple
constant multiplication problems,” in Proc. IEEE Int. Symp. Circuits
Syst., May 2007, pp. 1097–1100.
[34] Y. Voronenko and M. Püschel, “Multiplierless multiple constant multiplication,” ACM Trans. Algorithms, vol. 3, pp. 1–39, May 2007.
[35] L. Aksoy, E. Günes, and P. Flores, “Search algorithms for the multiple
constant multiplications problem: Exact and approximate,” Microprocess. Microsyst., vol. 34, no. 5, pp. 151–162, Aug. 2010.
[36] M. D. Macleod, “Multiplierless implementation of rotators and FFTs,”
EURASIP J. Appl. Signal Process., vol. 2005, no. 17, pp. 2903–2910,
2005.
[37] O. Gustafsson and F. Qureshi, “Addition aware quantization for low
complexity and high precision constant multiplication,” IEEE Signal
Process. Lett., vol. 17, no. 2, pp. 173–176, Feb. 2010.
[38] F. Qureshi and O. Gustafsson, “Generation of all radix-2 fast Fourier
transform algorithms using binary trees,” in Proc. Europ. Conf. Circuit
Theory Design, Aug. 2011, pp. 677–680.
[39] M. Janssen, F. Catthoor, and H. De Man, “A specification invariant
technique for regularity improvement between flow-graph clusters,” in
Proc. European Design Test Conf., 1996, pp. 138–143.
Mario Garrido received the M.S. degree in electrical engineering and the Ph.D. degree from the Technical University of Madrid (UPM), Madrid, Spain,
in 2004 and 2009, respectively. In 2010 he moved to
Sweden to work as a postdoctoral researcher at the
Department of Electrical Engineering at Linköping
University. Since 2012 he has been an Associate Professor at
the same department.
His research focuses on optimized hardware design for signal processing applications. This includes
the design of hardware architectures for the calculation of transforms, such as the fast Fourier transform (FFT), circuits for
data management, the CORDIC algorithm, and circuits to calculate statistical
and mathematical operations. His research covers high-performance circuits
for real-time computation, as well as designs for low area and low power
consumption.
Fahad Qureshi was born in 1978. He received
the M.Sc. from NED University of Engineering
and Technology in Karachi, Pakistan. In 2012 he
received his Ph.D. from the Division of Electronics
Systems at Linköping University, Sweden.
Qureshi’s research interest is design and implementation of high performance resource flexible
FFTs.
Oscar Gustafsson (S’98–M’03–SM’10) received
the M.Sc., Ph.D., and Docent degrees from
Linköping University, Linköping, Sweden, in 1998,
2003, and 2008, respectively.
He is currently an Associate Professor and Head
of the Electronics Systems Division, Department of
Electrical Engineering, Linköping University. His
research interests include design and implementation
of DSP algorithms and arithmetic circuits. He has
authored and coauthored over 140 papers in international journals and conferences on these topics.
Dr. Gustafsson is a member of the VLSI Systems and Applications and
the Digital Signal Processing technical committees of the IEEE Circuits
and Systems Society. Currently, he serves as an Associate Editor for the
IEEE Transactions on Circuits and Systems Part II: Express Briefs and
Integration, the VLSI Journal. He has served and serves in various positions
for conferences such as ISCAS, PATMOS, PrimeAsia, Asilomar, Norchip,
ECCTD, and ICECS.