TSTE87 Lecture 2

[Figure: NMOS and PMOS transistors]
Complementary metal-oxide-semiconductor (CMOS) technology is commonly used and is considered here
MOS transistors (on p-substrate)
Digital circuits
Oscar Gustafsson
Application Specific Integrated Circuits for
Digital Signal Processing
Lecture 2
Digital signal processing
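Simple analog model
\[
I_D =
\begin{cases}
0, & V_{GS} - V_T < 0 \\
\beta \left( V_{GS} - V_T - \frac{V_{DS}}{2} \right) V_{DS}, & V_{GS} - V_T > V_{DS} \\
\frac{\beta}{2} \left( V_{GS} - V_T \right)^2, & 0 \le V_{GS} - V_T < V_{DS}
\end{cases}
\]
\[ \beta = \frac{\mu \varepsilon W}{T_{ox} L} \]
$\mu$: mobility (electrons or holes)
$\varepsilon$: permittivity of SiO2
$T_{ox}$: gate oxide thickness
$W$, $L$: width and length
$V_T$ is the threshold voltage
Difference between NMOS and PMOS: mobility
Differently sized NMOS and PMOS transistors are required for the same current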
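The three regions of the simple analog model can be written as one small function. This is only a sketch of the model as stated on the slide, using the NMOS sign convention; the numerical values of `BETA`, `V_T` and `V_DD` are assumptions chosen for illustration and are not taken from the lecture.

```python
def drain_current(v_gs, v_ds, v_t, beta):
    """Drain current I_D of the simple three-region MOS model."""
    v_ov = v_gs - v_t                  # overdrive voltage V_GS - V_T
    if v_ov < 0:
        return 0.0                     # transistor is off
    if v_ov > v_ds:
        # linear region
        return beta * (v_ov - v_ds / 2) * v_ds
    # saturation region
    return beta / 2 * v_ov ** 2


# Assumed example values (hypothetical, for illustration only).
BETA = 1e-4   # A/V^2
V_T = 0.5     # V
V_DD = 1.8    # V

i_off = drain_current(0.0, V_DD, V_T, BETA)    # gate low
i_on = drain_current(V_DD, V_DD, V_T, BETA)    # gate high, saturated
i_lin = drain_current(V_DD, 0.1, V_T, BETA)    # small V_DS, linear region
```

With the gate at 0 or `V_DD` and `V_DS = V_DD`, the function returns either 0 or β/2 (V_DD − V_T)², which is the digital view of the transistor as a switch.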
CMOS models
Digital circuits
Today’s topics
Dynamic logic
φ = 0: pre-charge
φ = 1: evaluate
MOS logic
Transistor type   Gate input     Switch
NMOS              high (V_DD)    on
NMOS              low (GND)      off
PMOS              high (V_DD)    off
PMOS              low (GND)      on
The resulting structure can be seen as a switch
For digital circuits, $V_{DS} = V_{DD}$ and $V_{GS} = 0$ or $V_{DD}$
The current $I_D$ is 0 or $\frac{\beta}{2} (V_{DD} - V_T)^2$
CMOS models
CMOS logic gate
Many different logic styles (ways to interconnect transistors)
Static CMOS logic
Logic function $F(A, B, \ldots)$: Boolean algebra
Switching function $S(A, B, \ldots)$: switching algebra
N-transistors: $S_N(A, B, \ldots) = \overline{F(A, B, \ldots)}$
P-transistors: $S_P(\overline{A}, \overline{B}, \ldots) = F(A, B, \ldots)$
MOS logic example
Two-input NAND-gate: $F(A, B) = \overline{AB}$
$S_N = \overline{F(A, B)} = AB$: series connection
$S_P = F(A, B) = \overline{AB} = \overline{A} + \overline{B}$: parallel connection
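The two-input NAND example can be checked by modelling each transistor network as a Boolean switching function. The sketch below is an illustration only; the function names `s_n`, `s_p` and `nand_output` are hypothetical and simply mirror the symbols $S_N$ and $S_P$.

```python
from itertools import product


def s_n(a, b):
    """Pull-down (NMOS) network: two switches in series, conducts when A and B are high."""
    return a and b


def s_p(a, b):
    """Pull-up (PMOS) network: two switches in parallel, conducts when A or B is low."""
    return (not a) or (not b)


def nand_output(a, b):
    pull_up, pull_down = s_p(a, b), s_n(a, b)
    # In static CMOS exactly one of the two networks conducts for every input.
    assert pull_up != pull_down
    return pull_up  # output is high when it is connected to V_DD


truth_table = {(a, b): nand_output(a, b)
               for a, b in product([False, True], repeat=2)}
```

The internal assertion expresses the complementary property: the output is never floating and never shorted between the supply and ground.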
Propagation delay
Discharge the initial charge $Q$ from the load capacitance $C_L$ through an N-transistor: $I_D = I_{max}$ at $t = 0$, $I_D = 0$ at $t = t_d$
Discharge time
\[ Q = C_L V_{DD} = \int_0^{t_d} I_D(t)\,dt \approx \frac{I_{max} t_d}{2} = \frac{\beta}{2} (V_{GS} - V_T)^2 \frac{t_d}{2} \]
\[ t_d = \frac{4 C_L V_{DD}}{\beta (V_{DD} - V_T)^2} \]
Propagation delay: $\tau = -t_d \ln(0.5) \approx 0.69 t_d$
For deep-submicron processes ($\alpha \approx 1.2$ to $1.55$), due to velocity saturation,
\[ t_d = \frac{4 C_L V_{DD}}{\beta (V_{DD} - V_T)^{\alpha}} \]
Power dissipation
Important for:
Battery life time
Cooling
Avoiding metal migration
Energy consumed by charging to $V_{DD}$ and discharging a capacitance $C$
\[ E = C V_{DD}^2 \]
Power consumption
\[ P = a f_{clk} C V_{DD}^2 \]
$a$: switching activity
$f_{clk}$: clock frequency
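As a rough numerical illustration of the delay and power expressions, the sketch below evaluates $t_d$, $\tau$ and $P$. All component values (`C_LOAD`, `V_DD`, `V_T`, `BETA`, the switching activity and the clock frequency) are assumptions picked for the example, not figures from the lecture.

```python
import math


def discharge_time(c_load, v_dd, v_t, beta, alpha=2.0):
    """t_d = 4 C_L V_DD / (beta (V_DD - V_T)^alpha); alpha = 2 is the long-channel case."""
    return 4 * c_load * v_dd / (beta * (v_dd - v_t) ** alpha)


def propagation_delay(t_d):
    """tau = -t_d ln(0.5), approximately 0.69 t_d."""
    return -t_d * math.log(0.5)


def dynamic_power(activity, f_clk, capacitance, v_dd):
    """P = a f_clk C V_DD^2."""
    return activity * f_clk * capacitance * v_dd ** 2


# Assumed example values (hypothetical).
C_LOAD = 10e-15   # F
V_DD = 1.8        # V
V_T = 0.5         # V
BETA = 1e-4       # A/V^2

t_d = discharge_time(C_LOAD, V_DD, V_T, BETA)
tau = propagation_delay(t_d)
power = dynamic_power(0.5, 100e6, C_LOAD, V_DD)
```

Note that for `alpha` other than 2 the unit of `beta` changes accordingly, so the same numerical `BETA` should not be reused across different values of `alpha`.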
Power supply voltage scaling
[Figure: normalized maximum frequency and power as a function of power supply voltage; horizontal axis: power supply voltage (0 to 1), vertical axis: normalized power/frequency (0 to 1), curves: frequency and power]
Power decreases faster than frequency when the power supply voltage is lowered
It is often beneficial to design a circuit that is too fast and then lower the power supply voltage
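The trend behind supply voltage scaling can be sketched from the delay and power expressions: the maximum frequency scales as $1/t_d \propto (V_{DD} - V_T)^{\alpha} / V_{DD}$, and the power as $f V_{DD}^2$. The sketch assumes that the circuit is always clocked at its maximum frequency, a normalized threshold voltage of 0.2 and $\alpha = 2$; these are illustrative assumptions, not the parameters used for the lecture's plot.

```python
def normalized_frequency(v, v_t=0.2, alpha=2.0):
    """f_max(v) / f_max(1), with f_max proportional to (v - v_t)^alpha / v."""
    if v <= v_t:
        return 0.0
    return ((v - v_t) ** alpha / v) / ((1.0 - v_t) ** alpha)


def normalized_power(v, v_t=0.2, alpha=2.0):
    """P(v) / P(1), with P proportional to f_max(v) * v^2."""
    return normalized_frequency(v, v_t, alpha) * v ** 2


# Tabulate a few normalized supply voltages.
scaling = {v: (normalized_frequency(v), normalized_power(v))
           for v in (1.0, 0.8, 0.6, 0.4)}
```

Because the power carries the extra factor $V_{DD}^2$, it always falls below the frequency curve for normalized supply voltages below 1.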
Digital Signal Processing
Analog signal spectrum $X_a(\omega)$
Sampled signal spectrum $X(e^{j\omega T})$, sample period $T$
Fourier transform
\[ X(e^{j\omega T}) = \sum_{n=-\infty}^{\infty} x(nT) e^{-jn\omega T} \]
Z-transform ($z = e^{j\omega T}$)
\[ X(z) = \sum_{n=-\infty}^{\infty} x(nT) z^{-n} \]
Field-Programmable Gate-Arrays (FPGA)
The basic building block of an FPGA is a look-up table (LUT) storing the truth table of the logic function
Typical LUT size (N) between four and six
Optional register at output
Common to have carry logic to speed up addition
Time-domain representation of algorithm
\[ u(n) = x(n) + b\,u(n-1) \]
\[ y(n) = a_0 u(n) + a_1 u(n-1) \]
\[ y(n) = b\,y(n-1) + a_0 x(n) + a_1 x(n-1) \]
$T$: sample delay (algorithmic), $D$: flip-flop/register
Larger FPGAs have dedicated multipliers or even more complicated arithmetic blocks including adder/accumulator and bit-operations
Typical sizes are 18 × 18 or 18 × 25 bits with a 48-bit accumulator
Dedicated large memories are also commonly available
Signal flow graph (block diagram)
FPGA
\[ U(z) = X(z) + b z^{-1} U(z) \Rightarrow U(z) = \frac{X(z)}{1 - b z^{-1}} \]
\[ Y(z) = a_0 U(z) + a_1 z^{-1} U(z) = X(z) \frac{a_0 + a_1 z^{-1}}{1 - b z^{-1}} \]
\[ H(z) = \frac{Y(z)}{X(z)} = \frac{a_0 + a_1 z^{-1}}{1 - b z^{-1}} \]
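The equivalence between the two-equation form with the internal node $u(n)$ and the single difference equation for $y(n)$ can be checked numerically. The sketch below assumes zero initial state; the coefficient values are arbitrary examples, not values from the lecture.

```python
def filter_via_u(x, a0, a1, b):
    """u(n) = x(n) + b u(n-1);  y(n) = a0 u(n) + a1 u(n-1)."""
    u_prev = 0.0
    y = []
    for x_n in x:
        u_n = x_n + b * u_prev
        y.append(a0 * u_n + a1 * u_prev)
        u_prev = u_n                  # the single delay element T
    return y


def filter_direct(x, a0, a1, b):
    """y(n) = b y(n-1) + a0 x(n) + a1 x(n-1)."""
    y_prev = 0.0
    x_prev = 0.0
    y = []
    for x_n in x:
        y_n = b * y_prev + a0 * x_n + a1 * x_prev
        y.append(y_n)
        y_prev, x_prev = y_n, x_n
    return y


# Impulse response for assumed example coefficients.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
h_via_u = filter_via_u(impulse, a0=0.5, a1=0.25, b=0.5)
h_direct = filter_direct(impulse, a0=0.5, a1=0.25, b=0.5)
```

The first form needs one delay element, while the second form needs two (one for $x$, one for $y$); both realize the same $H(z)$.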
Assume that $N$ is a non-prime number such that $N = PQ$
The DFT equation can be rewritten as $P$ $Q$-point DFTs, $N$ twiddle factor multiplications, and $Q$ $P$-point DFTs
\[
X[k_1 + P k_2] = \sum_{n_2=0}^{Q-1}
\underbrace{\left[ \underbrace{\left( \sum_{n_1=0}^{P-1} x[Q n_1 + n_2] W_P^{n_1 k_1} \right)}_{P\text{-point DFT}} W_N^{n_2 k_1} \right]}_{\text{TF mult}}
W_Q^{n_2 k_2}
\]
The outer sum over $n_2$ is the $Q$-point DFT
where $0 \le k_1 \le P - 1$ and $0 \le k_2 \le Q - 1$
Complexity: $P^2 Q + Q^2 P + N = N(P + Q + 1) < N^2$
Z-domain (frequency-domain) representation of algorithms
Fast Fourier Transform (FFT)
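The $N = PQ$ decomposition can be verified against a direct DFT. The following is a minimal sketch that follows the index mapping $n = Q n_1 + n_2$, $k = k_1 + P k_2$ literally; it is written for clarity rather than speed, and the helper names (`twiddle`, `dft_direct`, `dft_pq`) are hypothetical.

```python
import cmath


def twiddle(m, exponent):
    """W_m^exponent with W_m = exp(-j 2 pi / m)."""
    return cmath.exp(-2j * cmath.pi * exponent / m)


def dft_direct(x):
    """Direct N-point DFT, used here only as a reference."""
    n_len = len(x)
    return [sum(x[n] * twiddle(n_len, n * k) for n in range(n_len))
            for k in range(n_len)]


def dft_pq(x, p, q):
    """DFT of length N = P*Q via P-point DFTs, twiddle factors and Q-point DFTs."""
    n_len = p * q
    assert len(x) == n_len
    result = [0j] * n_len
    for k1 in range(p):
        stage = []
        for n2 in range(q):
            # Output k1 of the P-point DFT of the samples x[Q*n1 + n2] ...
            inner = sum(x[q * n1 + n2] * twiddle(p, n1 * k1) for n1 in range(p))
            # ... followed by the twiddle factor W_N^(n2*k1).
            stage.append(inner * twiddle(n_len, n2 * k1))
        # A Q-point DFT over n2 gives the outputs X[k1 + P*k2].
        for k2 in range(q):
            result[k1 + p * k2] = sum(stage[n2] * twiddle(q, n2 * k2)
                                      for n2 in range(q))
    return result


samples = [complex(n, (3 * n) % 5) for n in range(12)]
reference = dft_direct(samples)
decomposed = dft_pq(samples, 3, 4)
max_error = max(abs(r - d) for r, d in zip(reference, decomposed))
```

The loop structure performs $P^2 Q$ multiplications in the inner DFTs, $N$ twiddle multiplications and $Q^2 P$ multiplications in the outer DFTs, matching the complexity expression.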
Signal flow graph (block diagram)
FFT
Discrete Fourier Transform (DFT)
The $N$-point DFT is the Fourier transform of $N$ samples evaluated at $N$ equally spaced points on the unit circle, $\omega T = \frac{2\pi k}{N}$
\[
X(k) = X(e^{j\omega T}) \Big|_{\omega T = \frac{2\pi k}{N}}
= \sum_{n=0}^{N-1} x(n) e^{-j 2\pi n k / N}
= \sum_{n=0}^{N-1} x(n) W_N^{nk}
\]
Twiddle factor: $W_N = e^{-j 2\pi / N}$
Inverse DFT (IDFT)
\[
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) e^{j 2\pi n k / N}
= \frac{1}{N} \sum_{k=0}^{N-1} X(k) W_N^{-nk}
\]
Direct computation requires $N^2$ complex multiplications
Used in spectral analysis, OFDM, RADAR, and more
Block diagram: $x(n)$ → $N$-point DFT → $X(k)$
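The DFT and IDFT definitions translate directly into two nested sums. This sketch uses only the Python standard library and is meant to illustrate the definitions and the role of the twiddle factor $W_N$, not to be efficient.

```python
import cmath


def dft(x):
    """X(k) = sum over n of x(n) W_N^(n k), with W_N = exp(-j 2 pi / N)."""
    n_len = len(x)
    w = cmath.exp(-2j * cmath.pi / n_len)
    return [sum(x[n] * w ** (n * k) for n in range(n_len))
            for k in range(n_len)]


def idft(big_x):
    """x(n) = (1/N) sum over k of X(k) W_N^(-n k)."""
    n_len = len(big_x)
    w = cmath.exp(-2j * cmath.pi / n_len)
    return [sum(big_x[k] * w ** (-n * k) for k in range(n_len)) / n_len
            for n in range(n_len)]


signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, -2.0]
spectrum = dft(signal)
recovered = idft(spectrum)
impulse_spectrum = dft([1.0, 0.0, 0.0, 0.0])
```

Each of the $N$ outputs needs $N$ complex multiplications, which is the $N^2$ cost that motivates the FFT.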
If $P$ and/or $Q$ are non-prime numbers, they can be split further
The most common case is $N = 2^M$
Can be split down to 2-point DFTs
2-point DFT (butterfly operation)
Complexity: $\frac{N}{2} \log_2 N = \frac{NM}{2}$ butterflies, $N(\log_2 N - 1) = N(M - 1)$ complex multipliers
Many (roughly half, depending on the order in which the smaller DFTs are derived) of the complex multipliers will have the trivial coefficient $W_N^0 = 1$
FFT complexity
Rough complexity estimation for radix-2 algorithms
[Figure: number of complex multiplications (10^0 to 10^10) versus FFT length (10^0 to 10^4) for direct computation and FFT]
Note that the FFT is only an efficient (class of) algorithm(s) to compute the DFT
Often the terms are used interchangeably
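The rough complexity estimates can be tabulated for a few power-of-two lengths. The sketch uses the expressions from the slides ($N^2$ for direct computation, $\frac{N}{2}\log_2 N$ butterflies and $N(\log_2 N - 1)$ complex multipliers for radix-2); the chosen lengths are arbitrary examples.

```python
def log2_exact(n):
    """log2 of a power of two, computed with integers only."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def direct_multiplications(n):
    return n * n


def radix2_butterflies(n):
    return n * log2_exact(n) // 2


def radix2_multipliers(n):
    return n * (log2_exact(n) - 1)


complexity = {n: (direct_multiplications(n),
                  radix2_butterflies(n),
                  radix2_multipliers(n))
              for n in (8, 64, 1024)}
```

For a 1024-point transform this gives 1048576 multiplications for direct computation against 9216 multiplier positions for radix-2, roughly half of which are trivial.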
Radix-2 decimation-in-time (DIT) FFT (Cooley-Tukey): split as 2 and N/2, then the N/2 as 2 and N/4, and so on
Radix-2 decimation-in-frequency (DIF) FFT (Cooley-Tukey): split as N/2 and 2, then the N/2 as N/4 and 2, and so on
Radix-$2^2$ DIT/DIF: split as 4 and N/4, then split the 4 as 2 and 2, the N/4 as 4 and N/16, and so on
The TF multiplier between the 2 and 2 terms in the $2^2$ will be a $W_4$ TF multiplier, so only multiplication with 1 and $-j$
Can be generalized as $2^i$ with different resolutions of the different TF multipliers
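Repeatedly splitting off a factor of 2 leads to the familiar recursive formulation. The sketch below is the textbook even/odd-sample recursion, commonly presented as radix-2 decimation-in-time; it is an illustration under that assumption and not necessarily the exact structure drawn in the lecture.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 FFT for lengths that are powers of two."""
    n_len = len(x)
    if n_len == 1:
        return list(x)
    if n_len % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])      # N/2-point DFT of even-indexed samples
    odd = fft_radix2(x[1::2])       # N/2-point DFT of odd-indexed samples
    out = [0j] * n_len
    for k in range(n_len // 2):
        t = cmath.exp(-2j * cmath.pi * k / n_len) * odd[k]   # twiddle factor W_N^k
        out[k] = even[k] + t                 # butterfly (2-point DFT), upper output
        out[k + n_len // 2] = even[k] - t    # butterfly, lower output
    return out


def dft_direct(x):
    """Direct DFT, used only as a reference."""
    n_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_len)
                for n in range(n_len))
            for k in range(n_len)]


data = [complex(n % 3, -n) for n in range(16)]
fft_error = max(abs(a - b) for a, b in zip(fft_radix2(data), dft_direct(data)))
```

Each recursion level contributes $N/2$ butterflies, and there are $\log_2 N$ levels, which gives the $\frac{N}{2}\log_2 N$ butterfly count.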
[Figure: signal flow graph of a 32-point radix-2 FFT with samples x(0) to x(31), twiddle factor multipliers, and outputs in bit-reversed order]
Can be used for any radix-2 algorithm (including $2^i$); the difference is the twiddle factor values
The outputs are provided in bit-reversed order, i.e., the output at row $6_{10} = 00110_2$ corresponds to value $01100_2 = 12_{10}$
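Bit-reversed ordering is easy to reproduce in a few lines. The sketch below reverses the 5-bit index used for a 32-point FFT; the function name `bit_reverse` is a hypothetical helper.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit into the result
        index >>= 1
    return result


# Output order of a 32-point FFT (5 index bits).
order = [bit_reverse(row, 5) for row in range(32)]
```

Row 6 (binary 00110) maps to index 12 (binary 01100), and applying the reversal twice returns the original index.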
A larger DFT can of course be used as the smallest building block, e.g., radix-4
Standard radix-2 algorithms
32-point FFT SFG
FFT algorithms
Case Study I – FFT
In the first case study we will design an FFT processor with the following specifications (note the last-millennium specs...):
1024 points
2000 FFTs/s ⇒ 0.5 ms per FFT
Connected to a computer via a 32-bit bus clocked at 16 MHz
16 + 16 bits (real and imaginary parts) for complex data input and output
Assume that the processor is idle during data transfer (otherwise more memory would be required)
Time to transfer data:
\[ \frac{1024 \cdot 2}{16 \cdot 10^6} = 0.128 \text{ ms} \]
Remaining time for FFT: 0.5 − 0.128 = 0.372 ms
Number of butterfly operations per FFT:
\[ \frac{N}{2} \log_2(N) = 5120 \]
Assume processing element (PE) with bit-serial butterfly and complex multiplier
Number of PEs
24 clock cycles required per operation
Max 220 MHz clock frequency
\[ N_{PE} = \frac{N_{clk} N_{OP}}{t_{FFT} f_{max}} = \frac{24 \cdot 5120}{0.372 \cdot 10^{-3} \cdot 220 \cdot 10^6} \approx 1.5 \]
Use two PEs
Two PEs lead to a minimum clock frequency of 165.2 MHz
For each PE operation we need to read and write two complex values
The memory access rate is
\[ f_{mem} = \frac{(2 + 2) \cdot 5120}{0.372 \cdot 10^{-3}} \approx 55 \cdot 10^6 \text{ complex values/s} \]
Choose to use two memories to avoid multiple accesses per operation to a single memory
Also, assume the memory rate is a multiple of the I/O rate ⇒ 32 MHz
The new time for an FFT is now 0.32 ms and the clock frequency 192 MHz
Finally, with these parameters the resulting throughput is 2232 FFTs/s
Discrete cosine transform (DCT)
DCTs are used in e.g. image coding
Several different DCT transforms proposed
For JPEG, the following 2-D DCT is used
\[
G(u, v) = \alpha(u, v) \sum_{x=0}^{7} \sum_{y=0}^{7} g(x, y)
\cos\left[ \frac{\pi}{8} \left( x + \frac{1}{2} \right) u \right]
\cos\left[ \frac{\pi}{8} \left( y + \frac{1}{2} \right) v \right]
\]
where $\alpha(u, v)$ is a scaling factor
Can be separated into 1-D DCTs that operate first on rows and then on columns
Fast structures available, but not as regular as for the DFT
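The separability of the 8 × 8 2-D DCT can be demonstrated by comparing the double sum with a row pass followed by a column pass. In this sketch the scaling factor $\alpha(u, v)$ is left out (treated as 1), since the lecture does not define it; the test block is arbitrary example data.

```python
import math

N = 8  # block size of the 2-D DCT


def cos_term(i, k):
    """cos[(pi / 8) (i + 1/2) k]."""
    return math.cos(math.pi / N * (i + 0.5) * k)


def dct_2d_direct(g):
    """Unscaled double sum: G(u, v) = sum_x sum_y g(x, y) cos(...) cos(...)."""
    return [[sum(g[x][y] * cos_term(x, u) * cos_term(y, v)
                 for x in range(N) for y in range(N))
             for v in range(N)]
            for u in range(N)]


def dct_1d(seq):
    """Unscaled 1-D DCT of one row or column."""
    return [sum(seq[i] * cos_term(i, k) for i in range(N)) for k in range(N)]


def dct_2d_separable(g):
    """Row pass (over y) followed by a column pass (over x)."""
    rows = [dct_1d(row) for row in g]                     # rows[x][v]
    cols = [dct_1d([rows[x][v] for x in range(N)])        # cols[v][u]
            for v in range(N)]
    return [[cols[v][u] for v in range(N)] for u in range(N)]


block = [[(3 * x + 5 * y) % 17 for y in range(N)] for x in range(N)]
direct = dct_2d_direct(block)
separable = dct_2d_separable(block)
dct_error = max(abs(direct[u][v] - separable[u][v])
                for u in range(N) for v in range(N))
```

The direct form needs 64 multiply-accumulate steps for each of the 64 outputs, whereas the separable form uses 16 one-dimensional transforms of length 8.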
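The arithmetic of Case Study I can be re-derived step by step. The sketch below uses the specification values from the lecture; how the 0.32 ms figure arises is an assumption on my part (two memories, each accessed at 32 MHz, serving the four complex accesses per butterfly), chosen because it reproduces the stated numbers.

```python
import math

# Specification values from the case study.
N_POINTS = 1024
FFT_RATE = 2000          # FFTs per second
BUS_CLOCK = 16e6         # Hz, one complex word (16 + 16 bits) per cycle
CYCLES_PER_OP = 24       # clock cycles per butterfly operation
F_MAX = 220e6            # Hz, maximum PE clock frequency

t_total = 1 / FFT_RATE                              # 0.5 ms per FFT
t_io = 2 * N_POINTS / BUS_CLOCK                     # input plus output transfer
t_fft = t_total - t_io                              # time left for computation
n_op = N_POINTS // 2 * int(math.log2(N_POINTS))     # butterfly operations

n_pe_exact = CYCLES_PER_OP * n_op / (t_fft * F_MAX)
n_pe = math.ceil(n_pe_exact)                        # round up to whole PEs
f_clk_min = CYCLES_PER_OP * n_op / (n_pe * t_fft)   # minimum clock with n_pe PEs
f_mem = (2 + 2) * n_op / t_fft                      # complex values per second

# Assumed reading: two memories, each running at twice the I/O rate.
N_MEMORIES = 2
MEM_CLOCK = 32e6
t_fft_new = (2 + 2) * n_op / (N_MEMORIES * MEM_CLOCK)
f_clk_new = CYCLES_PER_OP * n_op / (n_pe * t_fft_new)
throughput = 1 / (t_fft_new + t_io)                 # FFTs per second
```

Under these assumptions the script gives two PEs, a new computation time of 0.32 ms, a 192 MHz clock and a throughput slightly above the required 2000 FFTs/s.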