A Hardware Accelerator for Digital Holographic Imaging

Design Methodology and Implementation Aspects
Thomas Lenart
Thesis for the degree of
Licentiate in Engineering
2005
Department of Electroscience
Lund University, Sweden
Department of Electroscience
Lund University
Box 118, S-221 00 LUND
SWEDEN
This thesis is set in Computer Modern 11pt,
with the LaTeX documentation system.
© Thomas Lenart
March 2005
Abstract
Today, optical microscopes dominate medical and biological research laboratories. However, microscopes based on the principle of digital holography are emerging as an interesting alternative. The advantage of holography is that both the phase and the amplitude of the light are captured, while a traditional microscope only captures the amplitude. Phase information reveals material and object properties that cannot be seen in a traditional microscope, e.g. the refractive index and the three-dimensional structure of an object. In the holographic microscope, images are captured using a digital image sensor, and the optical lenses of a traditional microscope are replaced by reconstruction algorithms.
This thesis presents a hardware accelerator for image reconstruction in digital holographic imaging. The hardware accelerator executes a reconstruction algorithm, which
transforms the light captured on a digital image sensor into visible images. The reconstruction algorithm, which is based on the FFT, is computationally demanding and
requires a substantial amount of signal processing. Hence, a general-purpose computer
is not capable of meeting the real-time constraints and a custom solution has been developed.
The hardware accelerator is based on a one-dimensional FFT, which has been implemented on silicon, fabricated, and verified. The FFT is used as a central building block
in a two-dimensional signal processing datapath, optimized for executing reconstruction
algorithms. Finally, the accelerator is integrated with a microprocessor, a memory controller, and sensor and monitor interfaces to form a complete system. The system is
synthesized to programmable logic and integrated into a prototype of the holographic
microscope. To our knowledge, no other research project has combined these research
areas before.
The important thing in science is not so much to obtain new facts
as to discover new ways of thinking about them
Sir William Bragg (1862 - 1942)
Contents

Abstract
Preface
Acknowledgment
List of Acronyms

1 Introduction
   1.1 Research project
   1.2 Thesis outline

2 Holography
   2.1 Traditional Holography
   2.2 Digital holography
   2.3 A microscope based on digital holography
   2.4 Reconstruction algorithms
   2.5 System requirements

3 Processing and memory
   3.1 General-purpose processing
   3.2 Special-purpose processing
   3.3 Application-specific processing
   3.4 Memory and bandwidth

4 A flexible FFT core
   4.1 Discrete Fourier transform
   4.2 FFT algorithms
   4.3 Implementation aspects
   4.4 Architecture selection
   4.5 Scaling alternatives
   4.6 Simulations
   4.7 Implementation
   4.8 ASIC Prototyping

5 Streaming hardware accelerator
   5.1 Requirements
   5.2 Architecture
   5.3 External interface

6 Optimizations
   6.1 Internal buffering
   6.2 Memory bandwidth
   6.3 Parameter selection
   6.4 Summary

7 System integration
   7.1 Hardware design
   7.2 Software design
   7.3 Prototype - A Holographic Microscope

8 Conclusions

9 Future work
Preface
This thesis summarizes my academic work in the digital ASIC group at the department of
Electroscience for the Licentiate degree in Engineering. The main contribution to the thesis
is derived from the following publications:
Thomas Lenart, Viktor Öwall, Mats Gustafsson, Mikael Sebesta, and Peter Egelberg, “Accelerating signal processing algorithms in digital holography using an FPGA
platform,” in Proc. of International Conference on Field-Programmable Technology,
ICFPT’03, pp. 387–390, December 15-17, 2003, Tokyo, Japan.
Thomas Lenart and Viktor Öwall, “A 2048 Complex Point FFT Processor using a
Novel Data Scaling Approach,” in Proc. of International Symposium on Circuits and
Systems, ISCAS’03, vol. 4, pp. 45–48, May 25-28, 2003, Bangkok, Thailand.
Background information is derived from the following publications:
Thomas Lenart and Viktor Öwall, “A Reconfigurable System for Image Reconstruction
in Digital Holography,” in Proc. of Swedish System-on-Chip Conference, SSoCC’04,
April 13-14, 2004, Båstad, Sweden.
Mats Gustafsson, Mikael Sebesta, Bengt Bengtsson, Sven-Göran Pettersson, Peter
Egelberg, and Thomas Lenart, “High resolution digital transmission microscopy - a
Fourier holography approach,” in Optics and Lasers in Engineering, vol. 41, issue 3,
pp. 553–563, March 2004.
Thomas Lenart and Viktor Öwall, “A Pipelined FFT Processor using Data Scaling
with Reduced Memory Requirements,” in Proc. of Norchip, NORCHIP’02, pp. 74–
79, November 11-12, 2002, Copenhagen, Denmark.
The following papers are also published, but not considered part of this thesis:
Hugo Hedberg, Thomas Lenart, Henrik Svensson, “A Complete MP3 Decoder on a
Chip,” in Proc. of Microelectronic Systems Education, MSE’05, June 12-13, 2005,
Anaheim, California, USA.
Hugo Hedberg, Thomas Lenart, Henrik Svensson, Peter Nilsson and Viktor Öwall,
“Teaching Digital HW-Design by Implementing a Complete MP3 Decoder,” in Proc.
of Microelectronic Systems Education, MSE’03, pp. 31–32, June 1-2, 2003, Anaheim,
California, USA.
Acknowledgment
Thanks to all the people in the digital ASIC corridor, I really enjoy going to work every
day. During my time at the department, there has never been a day without stimulating
discussions regarding work, teaching and other less work-related topics.
I would like to thank my supervisor Viktor Öwall for his help and constructive criticism. Also
many thanks to my associate supervisor Mats Gustafsson for explaining the reconstruction
algorithms to me over and over again.
Several people have contributed to this work in different ways. In particular, I would like to thank Fredrik Kristensen, Matthias Kamuf, Henrik Svensson, and Hugo Hedberg for commenting on the contents and language of the thesis. I have probably removed a billion "the" by now, along with a bunch of other annoying stuffing words.
When it comes to chip design, I would first of all like to thank Anders Berkeman for being our mentor in ASIC design and also our universal guru on everything else. I would also like to thank Pontus Åström for a great LaTeX template. Over the years, all the people in the corridor have contributed in various ways to improve our design flow, including efforts put into developing courses and course material. I would like to thank everyone who has contributed, since this work has helped us all.
Finally, I would like to thank the people working in the Digital Holography project. In particular, Mikael Sebesta for being the driving force behind the prototype and Bengt Bengtsson
for rapidly and efficiently developing new PCBs to connect the strange and unknown analog
world to the more understandable and logical digital world.
This work has been financed by the Competence Center for Circuit Design (CCCD).
Lund, 2005
Thomas Lenart
List of Acronyms
AHB      Advanced High-performance Bus
AGU      Address Generation Unit
ALU      Arithmetic Logic Unit
AMBA     Advanced Microcontroller Bus Architecture
APB      Advanced Peripheral Bus
API      Application Programming Interface
ASIC     Application-Specific Integrated Circuit
ASIP     Application-Specific Instruction Processor
CCD      Charge-Coupled Device
CMOS     Complementary Metal Oxide Semiconductor
CORDIC   Coordinate Rotation Digital Computer
CPU      Central Processing Unit
DAB      Digital Audio Broadcasting
DAC      Digital-to-Analog Converter
DCT      Discrete Cosine Transform
DFT      Discrete Fourier Transform
DIF      Decimation-In-Frequency
DIT      Decimation-In-Time
DRAM     Dynamic Random-Access Memory
DSP      Digital Signal Processor
DVB      Digital Video Broadcasting
FFT      Fast Fourier Transform
FIFO     First In, First Out
FIR      Finite Impulse Response
FPGA     Field Programmable Gate Array
FSM      Finite State Machine
GDI      Graphics Device Interface
GUI      Graphical User Interface
HAL      Hardware Abstraction Layer
HDL      Hardware Description Language
IFFT     Inverse Fast Fourier Transform
IP       Intellectual Property
LSB      Least Significant Bit
LUT      Lookup Table
MAC      Multiply-Accumulate
MSB      Most Significant Bit
OFDM     Orthogonal Frequency Division Multiplexing
PCB      Printed Circuit Board
PE       Processing Element
RAM      Random-Access Memory
ROM      Read-Only Memory
SDF      Single-path Delay Feedback
SDRAM    Synchronous DRAM
SNR      Signal-to-Noise Ratio
SQNR     Signal-to-Quantization-Noise Ratio
SRAM     Static Random-Access Memory
SRF      Stream Register File
VGA      Video Graphics Array
VHDL     Very high-speed integrated circuit HDL
VLIW     Very Long Instruction Word
Chapter 1
Introduction
In 1981, IBM introduced a personal computer based on the Intel x86 architecture. In
those days, computers were centralized around the microprocessor, not only for computations but also for graphics functionality and access to ports and disk drives. Computers were soon extended with additional processing units, or co-processors, instead of
using a generic microprocessor for every task. One of the first co-processors available
was a math processor supporting floating-point operations, a feature that was not yet
available inside microprocessors. Over time, more specialized accelerators have been
introduced to the market, to be able to meet the requirements in multimedia processing
and other computationally demanding real-time applications.
Today, desktop computers contain specialized cards for all kinds of processing. Graphics cards render 3D graphics in real time, supporting z-buffers, α-channels, and texture mapping, to name a few features. Sound cards handle audio effects such as echo, reverb, and sound rendering using wave tables. In addition to this, plug-in cards containing
programmable logic devices are now available, which makes it possible for anyone to build a custom accelerator. However, the design flow for utilizing the features of programmable logic inside a desktop computer is still a cumbersome procedure, requiring knowledge of both hardware and software design. This will probably change in the future, simplifying the development of hardware accelerators to the same level as writing software programs today.
As processing elements become more complicated and powerful, another issue arises.
Processors today can execute billions of operations per second. However, memory technologies have not evolved as fast, and the result is an increasing gap between processing speed and memory bandwidth [1]. The problem increases significantly when designing high-performance custom hardware accelerators, which require a substantially higher data rate. Improving the communication between the processing elements and memory could in many cases be more important than improving the hardware accelerator itself.
1.1 Research project
The research presented in this work is part of a larger multi-disciplinary project involving several research groups within the fields of optics, algorithm development, and
digital hardware design. The project goal is to construct a microscope based on digital holography. The advantage of using holography is that not only the amplitude of
the light is captured, but also the phase. The phase information reveals object properties that cannot be seen in a traditional microscope, e.g. refractive index and the
three-dimensional structure of an object.
This work presents a hardware accelerator for capturing and reconstructing images
from a holographic microscope. To solve a specific problem, a custom hardware accelerator is more efficient than a general-purpose computer. Custom-designed accelerators
can either be implemented in programmable logic, using a field programmable gate array (FPGA), or as an application-specific integrated circuit (ASIC). The result from
this work has been integrated into a prototype of the holographic microscope, based
on an FPGA platform. The research project on digital holography has resulted in a
start-up company, Phase Holographic Imaging AB, or PHI.
1.2 Thesis outline
This thesis presents the work on a hardware accelerator for digital holography. Since
this work is closely connected to the application, a brief background on holography is
presented in Chapter 2. This includes traditional holography, digital holography, and
how to reconstruct information from captured holographic images. This chapter also
specifies the real-time requirements for the hardware accelerator. Chapter 3 gives an
introduction to different processing units and their efficiency, followed by an overview of
memories, discussing storage management, bandwidth limitations, and access pattern
scheduling. The implementation work is divided into three chapters. The fast Fourier
transform (FFT) is a central part of many reconstruction algorithms. Therefore, the
work on a one-dimensional, high-precision pipeline FFT core is presented in Chapter 4.
In Chapter 5, the FFT is reused when designing a two-dimensional FFT and constructing a pipeline with the required computational units for image reconstruction. The
optimizations based on the design decisions in Chapter 5 are presented in Chapter 6.
Chapter 7 presents the integration of the hardware accelerator into an embedded system, connecting external components for capturing and viewing holographic images.
As a part of Chapter 7, a prototype of a complete holographic microscope is presented.
The prototype includes the embedded system together with the optics, sensor, and interface to an external computer. Finally, conclusions and future work are presented in
Chapter 8 and Chapter 9.
Chapter 2
Holography
In 1947, the British scientist Dennis Gabor invented a method to photographically create a three-dimensional recording of a scene [2]. However, Gabor needed a light source with a single frequency and a constant phase shift, properties known as monochromatic and coherent light, respectively. Since no such device was available at the time, he initially used a mercury lamp with spatial and temporal filtering, which significantly degraded the image quality. A suitable coherent light source first became available in 1960, an invention named Light Amplification by Stimulated Emission of Radiation, or laser for short.
2.1 Traditional Holography
A holographic setup is based on a coherent light source, as shown in Figure 2.1. The interference pattern between light from a reference wave and light reflected from an object illuminated with the same light source is captured on a photographic film. Interference between the two wave fronts amplifies or cancels the light at each point on the holographic film, called constructive and destructive interference, respectively.
A recorded hologram has certain properties that distinguish it from a traditional photograph. In a normal camera, the amplitude of the light is captured, and the developed photograph is directly visible. The photographic film in a holographic setup captures the interference, or phase difference, between two waves. Hence, depth information is stored in the hologram. By illuminating the developed photographic film with the same reference light as used during the recording phase, the original image is reconstructed and appears three-dimensional.
In a camera, a lens is used to focus the light onto the film. The location of the lens
determines the focus position, and the characteristics of the lens determine the focal
depth. This causes focus to be sharp only at a certain distance, limited by the properties
of the lens. Since a hologram can be captured without lenses, the focal depth is high,
and focus is extremely sharp regardless of the distance to the object.
2.2 Digital holography
In digital holography, a high-resolution image sensor replaces the photographic film. A
computer algorithm is used to calculate a visible image based on the digital recordings
[3]. One advantage over traditional holography is that image capturing only takes a
Figure 2.1: The reflected light from the scene and the reference light create an interference pattern on a photographic film.
fraction of a second, instead of developing a photographic film. The downside is that the
sensor resolution is about 300 pixels/mm, whereas the photographic film contains 3000
lines/mm. However, the resolution of digital image sensors is continuously increasing,
improving the quality and usefulness of digital holography.
2.3 A microscope based on digital holography
Digital holography can be used as an alternative to conventional microscopy and has
several interesting features, e.g. improved depth focus and the possibility of generating
three-dimensional and phase contrast images [4]. Since not only amplitude but also
phase information of the object is recorded, special material characteristics can be determined that are useful in for instance biomedical applications [5]. A few examples are
to determine the refractive index, object thickness, total volume, or volume distribution. While a conventional microscope uses lenses, the holographic setup uses a signal
processing algorithm, where a parameter change in the algorithm will instantly reconfigure the digital lens into any shape or focal length. Figure 2.2 shows the holographic
microscope setup.
2.4 Reconstruction algorithms
A reconstruction algorithm processes the captured wave patterns on the sensor to create
a visible image. With three exposures, images of the hologram, object, and reference
are gathered. First, the hologram is captured by measuring the interference between the
object and the reference source. Then, by blocking either the object or reference beam,
the reference and object light are measured. By subtracting the reference light ψ_r and the object light ψ_o from the hologram ψ_h, the only remaining component is the interference pattern ψ, given by

ψ(ρ) = ψ_h(ρ) − ψ_o(ρ) − ψ_r(ρ).    (2.1)
Figure 2.2: The experimental holography setup, comprising a 10 mW He-Ne laser, mirrors, a polarization beam-splitter cube, half-wave plates, polarizers, a shutter, an iris diaphragm, the object, a beam-splitter cube, a GRIN lens, and the image sensor. The reference, object, and interference patterns are captured on a high-resolution digital image sensor, instead of on a photographic film as in traditional holography. The images are captured by blocking the reference or object beam.
To simplify the equations, the vector ρ is used for specifying the (x, y) position on the
sensor. A visible image is reconstructed from ψ(ρ), which is the object field in the sensor
plane, by retrofocusing the light to the object plane. The reconstruction algorithm, or
inversion algorithm, can retrofocus the light captured on the sensor to an arbitrary
object plane, which makes it possible to change focus position in the object. This is
equivalent to manually moving the object up and down in a traditional microscope.
Two reconstruction algorithms are presented in the following sections: first the convolution algorithm and then a simplified version based on the far-field approximation. The simplified version requires less computational power and is therefore of high interest, at the cost of slightly reduced image quality. A comparison between the algorithms is presented in Section 2.4.3.
2.4.1 Convolution algorithm
The convolution algorithm is compute-intensive but accurate. Images are reconstructed
by computing the Rayleigh-Sommerfeld diffraction integral as
Ψ(r′) = ∫_sensor ψ(ρ) e^(−ik|r−r_r|) e^(−ik|r−r′|) dS_ρ.    (2.2)
Figure 2.3 illustrates the relation between the sensor and object planes. By specifying a point of origin, r′ represents the vector to any point in the object plane, r is the vector to any point in the sensor plane, and r_r is the position of the reference light source. Accordingly, |r − r_r| is the distance between a point on the sensor and the reference position, while |r − r′| represents the distance between points in the object plane and pixels in the sensor plane. The integral can be expressed as a convolution, which in the frequency domain can be simplified to a matrix multiplication as

Ψ(r′) = ψ′ ∗ G = F⁻¹(F(ψ′) · F(G)),    (2.3)
Figure 2.3: (a) Definition of vectors from the origin of coordinates to the sensor surface, reference point, and object plane. (b) Sensor and object planes. Modifying the z-position of the object plane changes the focus position.
where

ψ′(ρ) = ψ(ρ) e^(−ik√(|z−z_r|² + |ρ−ρ_r|²))

and

G(ρ) = e^(−ik√(|z−z′|² + |ρ|²)).
The three-dimensional vectors representing the distance between the sensor and object planes have been divided into a z component and an orthogonal two-dimensional vector ρ representing the x and y positions as

r = ρ + z ẑ.
The distance z′ specifies the location of the image plane to be reconstructed, whereas z is a constant value. The discrete version of the integral with an equidistant grid, equal to the sensor pixel size, generates a discrete convolution that can be evaluated with the FFT. The size of the two-dimensional FFT must be at least the sum of the sensor size and the object size in each direction. Higher resolution is achieved by shifting the coordinates a fraction of a pixel size ∆x, and combining the partial results into a larger holographic image. A larger image requires several iterations and increases the reconstruction time by a factor N² for a sub-pixel distance of (∆x/N, ∆y/N). One image reconstruction requires three FFT calculations, while each additional evaluation only requires two FFT calculations, since only G(ρ) will change. For fast low-resolution reconstructions, the transform of G(ρ) can be precalculated to avoid one FFT operation. The results from the calculations, as well as the recorded images, can be found in Figure 2.4.
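The frequency-domain evaluation of Equation 2.3 can be sketched in a few lines of NumPy. This is an illustrative model only, not the thesis implementation; the function name and the toy input arrays are assumptions:

```python
import numpy as np

def convolution_reconstruct(psi_prime, G):
    """Evaluate Equation 2.3: Psi = psi' * G = F^-1(F(psi') . F(G))."""
    # Element-wise (matrix) multiplication in the frequency domain
    # replaces the spatial-domain convolution.
    return np.fft.ifft2(np.fft.fft2(psi_prime) * np.fft.fft2(G))

# Tiny sanity check: for these toy inputs the circular convolution
# produces the constant 5 in every position.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.0, 1.0], [1.0, 0.0]])
out = convolution_reconstruct(a, b)
print(np.round(out.real, 6))  # every entry equals 5 for this toy input
```

In the actual algorithm, F(G) would be precalculated and reused across evaluations, since only the image term changes between reconstructions.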
Figure 2.4: (a) The reference image ψ_r. (b) The object image ψ_o. (c) The interference pattern ψ_h. (d) Reconstructed holographic image Ψ.
2.4.2 Simplified far-field approximation
The convolution algorithm requires three two-dimensional FFTs to be evaluated, as expressed in Equation 2.3. First, ψ′(ρ) and G(ρ) are transformed into the frequency domain. After a matrix multiplication between the two, the result is transformed back to the time domain. This section presents a simplified reconstruction algorithm that only requires a single FFT to be evaluated.
By observing that |r′| ≪ |r|, the Fraunhofer or far-field approximation simplifies the distance between the sensor plane and the object plane as

|r − r′| ≈ r − r · r′/r.
This assumption simplifies the integral expression in Equation 2.2 to

Ψ(r′) = ∫_sensor ψ(ρ) e^(−ik|r−r_r|) e^(−ik|r−r′|) dS_ρ
      ≈ ∫_sensor ψ(ρ) e^(−ik(|r−r_r|−r)) e^(−ik r·r′/r) dS_ρ
      = ∫_sensor ψ(ρ) e^(−ik(|r−r_r|−r)) e^(−ik zz′/r) e^(−ik ρ·r′/r) dS_ρ
      = ∫_sensor ψ′(ρ) e^(−ik ρ·r′/r) dS_ρ.
Except for the factor 1/r, this is the definition of the two-dimensional Fourier transform. Before carrying out the Fourier transformation, ψ′(ρ) has to be calculated as

ψ′(ρ) = ψ(ρ) e^(−ik(|r−r_r|−r)) e^(−ik zz′/r).
Figure 2.5: Comparison of image quality when reconstructing the inner pattern on the test target. (a) Convolution algorithm with 6 interpolations in each direction, requiring 73 FFTs to be evaluated. (b) Simplified far-field approximation evaluated with a single FFT.
The expression can be further simplified by removing the first exponential term since
it will only affect the location of the object in the reconstructed image. Instead, the
coordinates of the object location can be modified after reconstruction. The final result
is an image reconstruction algorithm containing only one FFT as
Ψ(r′) ≈ F(ψ(ρ) e^(−ik zz′/r)).    (2.4)
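The full simplified algorithm, from the three exposures in Equation 2.1 to the single transform in Equation 2.4, maps naturally onto a few NumPy operations. The sketch below is illustrative only: the function and parameter names, the toy array sizes, and the treatment of r as the per-pixel distance from the origin to each sensor pixel are assumptions, not the hardware implementation described later in the thesis:

```python
import numpy as np

def reconstruct_far_field(psi_h, psi_o, psi_r, k, z, z_prime, dx):
    """Sketch of Psi(r') ≈ F(psi(rho) * exp(-i*k*z*z'/r)) (Equation 2.4)."""
    # Interference pattern from the three exposures (Equation 2.1).
    psi = psi_h - psi_o - psi_r
    # Distance r from the origin to each sensor pixel (r = rho + z*z_hat),
    # with dx as the sensor pixel pitch.
    n = psi.shape[0]
    coords = (np.arange(n) - n // 2) * dx
    xg, yg = np.meshgrid(coords, coords)
    r = np.sqrt(xg**2 + yg**2 + z**2)
    # Apply the phase factor, then a single 2-D FFT.
    return np.fft.fft2(psi * np.exp(-1j * k * z * z_prime / r))

# Toy usage with random 64x64 "captures" and illustrative optics values
# (633 nm He-Ne wavelength, ~300 pixels/mm sensor pitch):
rng = np.random.default_rng(0)
psi_h, psi_o, psi_r = (rng.random((64, 64)) for _ in range(3))
image = np.abs(reconstruct_far_field(psi_h, psi_o, psi_r,
                                     k=2 * np.pi / 633e-9, z=0.1,
                                     z_prime=1e-3, dx=3.3e-6))
print(image.shape)
```

Changing z_prime refocuses the reconstruction to a different object plane, which is the software equivalent of moving the sample up and down in a traditional microscope.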
2.4.3 Algorithm comparison
Producing a visible image requires three images to be captured and processed using a
reconstruction algorithm. The convolution algorithm needs three FFTs to be evaluated
for the first image and two for each additional interpolation, whereas the simplified
version only requires one FFT. However, the simplified reconstruction algorithm will
degrade the image quality due to the mathematical approximations. Figure 2.5 shows
a quality comparison between the two algorithms, reconstructing the inner pattern on
a test target from Figure 2.4(d). There is a difference in smoothness, but the resolution
is about the same. The convolution algorithm is here interpolated 6 times in each direction, which requires 1 + 2 × 6² = 73 FFTs (one extra FFT in the first iteration).
The simplified reconstruction algorithm still only requires one FFT to be evaluated.
Comparing the image quality and computational requirements, the simplified algorithm
is chosen for implementation.
2.5 System requirements
For the holographic microscope to be useful, images have to be reconstructed in real-time. The term real-time is somewhat vague and calls for further explanation. For the
human eye to interpret a series of images as motion video, a rate of about 25 frames per
second (fps) is required. In a microscope, the response time between moving a sample
and getting the visual feedback from the motion should be as short as possible. A longer
response time is justified for images with higher resolution and quality. The system
frame rate is limited by two factors: the sensor frame rate and the reconstruction time. These limitations are discussed in the following sections, which also specify the requirements for the hardware accelerator in terms of resolution, frame rate, and other properties.
2.5.1 Digital sensors
The frame rate of image sensors is a limiting factor for the current application. In digital
cameras, it is often not critical that it takes one second to transfer the high-resolution
image from the sensor array to the memory. The delay is due to the number of pixels on
the sensor array, currently somewhere in the range of 5-10 million pixels. However, in
a digital video camera the real-time constraints require 25 fps for the images to appear
as motion video. To satisfy this condition, resolution is substantially reduced, normally
in the range of 0.5-1 million pixels. Since the images are continuously changing, the
observer is less sensitive to the low resolution.
A problem in meeting the real-time constraints in digital holography is that the best of two worlds needs to be combined: high resolution and high frame rate. However, lower
resolution is sufficient when scanning and moving the sample. When the interesting
part of the sample is located, a high-resolution image can be reconstructed.
CMOS versus CCD
Digital image sensors can be divided into two groups, Charge-Coupled Devices (CCD)
and Complementary Metal-Oxide-Semiconductor (CMOS). CCD sensors are characterized by high resolution and quality but also high manufacturing cost. They require
external circuits to access the sensor array and convert the analog charge into a digital
value, shifted out row-by-row. A problem with CCD sensors is that charges leak between adjacent pixels, causing a saturated pixel to bloom (overflow) and affect closely
located pixels.
CMOS sensors are still not available in the same high resolution as CCD sensors.
However, they have many other advantages. CMOS sensors can be fabricated in standard manufacturing facilities, which significantly reduces the cost and increases the
yield. Circuitry for accessing the sensor array is integrated on-chip, avoiding external
components. The on-chip circuitry also enables random access to regions of interest,
rather than reading out the complete sensor array. CMOS sensors have anti-blooming features, which are very important in digital holographic imaging. Finally, the power consumption is lower for CMOS sensors, although this is mainly relevant for mobile and battery-driven applications.
2.5.2 Image reconstruction
In this section, an initial specification is outlined based on the selected algorithm. One
parameter is the sensor size. A larger sensor requires more processing, but generates a
higher image quality. By evaluating available image sensors on the market, the maximum size of the array is set to 2048 × 2048 pixels, which also specifies the size of the
Fourier transform. Another parameter is the clock frequency, which linearly affects the
computation time. Initially, the clock frequency is estimated at 250 MHz, a reasonable
and modest number for any modern fabrication process. However, when designing an
FPGA platform for prototyping and evaluation, a much lower value has to be assumed.
• FFT size – 2048 points.
• Frame rate – 25 fps.
• Clock frequency – 250 MHz.
• Architectural selection – To be decided.
The transform size, frame rate, and clock frequency have been specified, but the architectural decisions are still open. Selecting an architecture is a trade-off between
computation time and hardware resources. According to the specification, each frame
has a clock cycle (cc) budget of
250 MHz / 25 fps = 10 Mcc.
With these assumptions, a two-dimensional FFT has to be evaluated in 10 million clock cycles. Due to the extremely large transform size, the implementation of the two-dimensional FFT is separated into one-dimensional FFTs, first applied over the rows and then over the columns [6]. Hence, 2 × 2048 transforms of size 2048 must be computed. The requirement for the one-dimensional FFT is therefore
FFT_cc = 10 Mcc / (2 × 2048) ≈ 2400 cc.    (2.5)
When selecting the architecture of the one-dimensional FFT, the computation time
must be in the range of 2400 clock cycles. The architecture evaluation and selection is
presented in Chapter 4.
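The row-column decomposition and the cycle budget above can be checked with a short NumPy sketch. This is illustrative only; a 64-point toy transform stands in for the 2048-point hardware FFT:

```python
import numpy as np

# Cycle budget from the specification: 250 MHz / 25 fps = 10 Mcc per frame,
# shared by 2 x 2048 one-dimensional transforms.
cycles_per_frame = 250e6 / 25
cycles_per_fft = cycles_per_frame / (2 * 2048)  # about 2441 cc, i.e. ~2400

def fft2_row_column(x):
    """2-D FFT via 1-D FFTs: first over the rows, then over the columns."""
    return np.fft.fft(np.fft.fft(x, axis=1), axis=0)

# The row-column method matches a direct 2-D FFT.
x = np.random.default_rng(1).random((64, 64))
assert np.allclose(fft2_row_column(x), np.fft.fft2(x))
print(round(cycles_per_fft))
```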
Chapter 3
Processing and memory
This chapter presents a brief background on processors and memories as a motivation
to build application-specific hardware accelerators. The chapter is divided into general-purpose, special-purpose, and application-specific processing. Microprocessors are limited by the sequential nature of execution. Designing an application-specific accelerator
can significantly increase the computational efficiency by exploiting the parallelism in
an algorithm.
The second part of this chapter addresses memories. As the speed of execution increases, the amount of data transferred to and from the storage space increases as well.
Internal memory is fast but expensive in terms of area. External memory provides a
larger storage space, but data must be transferred efficiently to maximize memory bandwidth and minimize latency. The non-uniform access time of modern high-capacity
memories requires data to be transferred in blocks, known as burst transfers. Memory access patterns are therefore extremely important, and sometimes require special
scheduling units.
3.1 General-purpose processing
The most common processing unit is probably the microprocessor, the central core
in desktop computers, workstations, laptops, and servers. Common for all processors
is their programmability. The microprocessor, or central processing unit (CPU), is
designed for general-purpose processing to solve all kinds of tasks. Since the execution is sequential, it has limited ability to exploit parallelism at the algorithmic
level. However, there are several ways of finding and exploiting parallelism at the instruction level. The implementation differs between processor architectures, e.g. very
long instruction words (VLIW) containing multiple instructions, superscalar processors
with several computational pipelines, and even processors supporting thread parallelism
(hyper-threading).
VLIW processors exploit the instruction level parallelism by loading and executing
parallel instructions. Every instruction word contains a set of independent instructions,
which can be executed in parallel. The instruction level parallelism is determined at
compile-time and requires more complex compilers. Examples of VLIW architectures
are the 128-bit Crusoe and 256-bit Efficēon processors from Transmeta [7]. Another
example is the TriMedia core from Philips.
Figure 3.1: (a) Block diagram of a 3-tap FIR filter. (b) MAC unit suitable for
implementing a FIR filter.
In contrast to the VLIW architecture, superscalar processors dynamically determine
how to parallelize the execution. This is achieved by analyzing the program and even
reordering instructions to find more concurrency. Processors such as the Intel Pentium,
AMD Athlon, and PowerPC are based on superscalar pipelines. Pentium 4 also supports
hyper-threading, i.e. finding and exploiting parallelism between separate threads.
3.2 Special-purpose processing
Despite all improvements and special features, microprocessors are still generic computational units. For special-purpose processing, several alternatives to microprocessors
exist. For example, a digital signal processor (DSP) includes features to improve execution of signal processing algorithms, while stream processors reduce the processing
time in media applications such as video compression and graphics rendering.
3.2.1 Digital signal processors
A digital signal processor has much in common with a microprocessor but contains additional features. In signal processing algorithms, a set of frequently used operations
can be isolated. A common example is multiplication followed by accumulation, present
in for instance FIR filters [8]. In a DSP, this operation is executed in a single clock
cycle using the multiply-accumulate (MAC) operation, as shown in Figure 3.1. Another
useful feature is the possibility of scaling a result from the arithmetic unit. By connecting a barrel shifter to the output, scaling can be applied without time penalty. In
signal processing, overflow causes serious problems when values exceed the maximum wordlength. In a conventional CPU, values wrap around on overflow and corrupt
the computation result. DSPs often use saturation logic, which prevents the value from
wrapping and causes less damage to the final result. In the MAC unit, overflow is avoided by
adding guard bits to the accumulation register.
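A saturating MAC of the kind described above can be sketched in software; the widths here (16-bit operands, a 40-bit accumulator, i.e. a 32-bit product plus 8 guard bits) are illustrative and not taken from a specific DSP.

```python
# Sketch of a DSP-style MAC with a saturating, guard-bit-extended accumulator.
ACC_BITS = 40                       # 32-bit product + 8 guard bits (illustrative)
ACC_MAX = 2**(ACC_BITS - 1) - 1
ACC_MIN = -2**(ACC_BITS - 1)

def mac(acc, x, c):
    """One multiply-accumulate step: saturate instead of wrapping around."""
    acc += x * c
    return max(ACC_MIN, min(ACC_MAX, acc))

def fir(samples, coeffs):
    """3-tap FIR filter as in Figure 3.1: one MAC per tap and input sample."""
    out = []
    history = [0] * len(coeffs)     # delay line x(n), x(n-1), x(n-2)
    for s in samples:
        history = [s] + history[:-1]
        acc = 0
        for x, c in zip(history, coeffs):
            acc = mac(acc, x, c)
        out.append(acc)
    return out

print(fir([1, 0, 0, 0], [3, 2, 1]))  # impulse response -> [3, 2, 1, 0]
```

On a DSP each `mac` call corresponds to a single-cycle instruction; the saturation keeps an overflowing sum pinned at the maximum value instead of flipping sign.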
The memory architecture of a DSP differs from that of a CPU. Consider a MAC operation
that executes in one clock cycle. In this case, two operands are required every clock cycle,
which demands two memory accesses in a single clock cycle. Therefore, DSPs often
have more than one on-chip memory, single- or dual-ported, to supply the arithmetic
units with data. In addition to the internal memories, some DSPs even have more than
one external memory interface. Another issue is computational overhead due to address
Figure 3.2: The Imagine stream processor. Kernels are operating concurrently,
communicating through the stream register file (SRF).
generation. Instead of calculating addresses using an ALU, and wasting computational
power, special address generation units (AGUs) take care of memory addressing. This
is useful when addresses follow special patterns, for example bit-reversed order or modulo
counters. Instead of calculating such addresses in software, these features are supported by the AGU.
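The two address patterns mentioned, bit-reversed indexing and modulo counting, can be modeled in a few lines; an AGU generates such sequences in hardware, one address per cycle, but the arithmetic is the same.

```python
# Two address sequences an AGU typically generates directly.

def bit_reverse(index, bits):
    """Reverse the 'bits' least significant bits of 'index' (FFT reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]

def modulo_counter(start, step, size, count):
    """Addresses for a circular buffer of 'size' words."""
    addr = start
    for _ in range(count):
        yield addr
        addr = (addr + step) % size

print(list(modulo_counter(5, 1, 8, 6)))  # [5, 6, 7, 0, 1, 2]
```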
3.2.2 Stream processors
The number of applications with multimedia processing increases continuously. Applications span from audio decompression such as MP3, to computationally demanding real-time video compression and three-dimensional graphics rendering. Traditional
CPUs or DSPs are not suitable for these tasks, because they lack the required computational efficiency for real-time applications.
Media applications often involve operations on streams of data with a high level of local
dependency. This makes it possible to parallelize media applications and design
clusters of concurrent processors with separate local storage, referred to as a stream
processor [9]. Each cluster contains kernels with a local register file (LRF), enabling a
high bandwidth during execution. The kernels are connected to a stream register file
(SRF) to handle communication between clusters. The SRF also communicates with
external memories, storing global data available to all kernels. As an example, the
Imagine stream processor [10] contains eight ALU clusters, connected to an SRF. An
overview of the system can be found in Figure 3.2.
The drawback with conventional CPUs is their fixed structure with a computational
unit, register file, cache, and main memory. In a stream processor, each arithmetic kernel
works on local data, while streams can pass between kernels using an SRF. Therefore, data
is exchanged where bandwidth is high, i.e. close to the kernels. Instead of communicating
through the main memory, data is shuffled among kernels. This approach reduces the
bandwidth between memory and processing unit and increases the processing capacity.
Figure 3.3: The MP3 decoding process and software profiling showing the time
required in each step. Profiling is based on the ISO/IEC 11172-3 floating-point
reference model.
3.3 Application-specific processing
Application-specific means that the architecture is designed to optimize a specific function or algorithm. The computational efficiency is likely to improve if available hardware
resources match the algorithmic requirements. As an example of application-specific
processing, the MP3 decoding algorithm is considered [11]. MP3, or MPEG1 layer 3, is
a compression algorithm for sound. A block diagram is presented in Figure 3.3, divided
into eight functional blocks. When executed on a microprocessor, blocks are processed
one at a time, decoding the bitstream into audio samples. The graph in Figure 3.3 illustrates the software profiling [12]. The profiling shows the processing time for each block,
and the most time-consuming blocks are the first candidates for hardware acceleration.
There are several approaches to accelerating the decoding algorithm, ranging from
accelerating commonly used operations, to complete functions, to implementing the entire
algorithm in hardware. When mapping the algorithm directly to hardware, flexibility is
lost but computational efficiency gained. Compared to a programmable solution, there
is no longer an overhead in reading and executing instructions.
3.3.1 Extended instruction set
One approach to acceleration is to extend the instruction set of an ordinary microprocessor, referred to as an application-specific instruction set processor (ASIP). With
basic knowledge about the MP3 decoding blocks, it would be suitable to extend the
processor with instructions such as a MAC operation for the subband synthesis, or a butterfly operation for the IMDCT. However, changing the instruction pipeline requires a
modified processor architecture. An easier and more common approach is to place the
hardware extension parallel to the processor pipeline, as shown in Figure 3.4(a).
3.3.2 Accelerator on bus
Another approach is to isolate complete functions and build application-specific hardware accelerators. In a computer system, hardware accelerators are usually connected
to the internal bus. Figure 3.4(b) shows a configuration with microprocessor and accelerator on a bus. Many companies develop and market accelerators for various bus
architectures, often referred to as intellectual property (IP).
Figure 3.4: (a) Co-processor connected to a CPU pipeline. (b) Hardware
accelerator connected to a bus.
The profiling in Figure 3.3 reveals that the microprocessor spends more than half
the time on subband synthesis. By mapping this function to hardware the overall processing time can be reduced by more than 50%, according to the software profiling. The
hardware accelerator and microprocessor operate concurrently, hence the total execution
time is the maximum of the two. A well-designed hardware accelerator must be balanced against the microprocessor for maximum performance with minimum resources.
In this case, the accelerator must be fast enough to finish execution in the same time
as the microprocessor, or idle time is introduced. If the accelerator is too fast, it requires more resources than necessary, without any gain in total execution time.
When choosing to accelerate other functions of the algorithm as well, the time and
resource allocation has to be analyzed again. This is referred to as hardware/software
partitioning.
3.4 Memory and bandwidth
Data processing requires storage space for input vectors, intermediate values, and results. A faster processing unit requires a higher bandwidth to transfer information to
and from memory. When the efficiency of the computational elements increases, more
efficient memory management is also required. Modern architectures are based on a
memory hierarchy, with the fastest memory located close to the computational units,
see Figure 3.5. In a microprocessor, this memory is referred to as a cache, a temporary
storage space for frequently accessed information. The high-speed memory is located
on-chip and can sustain high bandwidth requirements. When more storage space is
required, it must be located off-chip. This limits the bandwidth in several ways, which
is further discussed in Section 3.4.2. Accessing off-chip memory is time and power
expensive, hence the communication with external memory should be minimized.
3.4.1 Internal memories
Along with the computational units in an integrated circuit, on-chip memory stores
frequently accessed information and intermediate results between calculations. In a
Figure 3.5: The memory hierarchy: register file (≈1 KB, ≈100 GB/s), cache/internal
memory (≈1 MB, ≈10 GB/s), and external memory (≈1 GB, ≈1 GB/s).
microprocessor there are several levels of memory hierarchy. Closest to the execution unit
is the register file, a bank of registers for storing results. The register file stores only
data and not instructions, and is usually implemented using a multi-port memory or a
small bank of flip-flops.
Caches and internal memories improve both the bandwidth and the latency for a
data access. Caches and other on-chip memories are often based on static random
access memory (SRAM). For SRAMs, the access time to arbitrary memory elements is
uniform, and the pattern of which elements are accessed is not of importance from a
performance perspective. For demanding applications, several separate memories could
supply the computational units with data.
3.4.2 External memories
Due to the limited amount of on-chip memory, external memory is required when a larger
storage space is needed. External memory banks are designed to store significantly more
information, and hence demand a higher density of memory elements. High-density
memory devices are based on dynamic random access memory (DRAM) cells. Today,
different versions of synchronous DRAM (SDRAM) are commonly used in consumer electronics. These devices are burst oriented, and have a non-uniform access time. Still,
many applications have substantial memory requirements [13], and have to cope with
the non-uniform access time by reordering memory operations or physical location of
data in the memory [14].
External memories increase the storage space, but there are several penalties. Since
the memory is placed off-chip, internal signals must be routed from a pad (external
connection), over a printed circuit board (PCB) and into another circuit. This will
significantly increase the delay and limit the maximum data rate. The number of data
bits is also limited, since each connection requires separate pads, which are expensive
in terms of chip area. The function of the pad is to amplify weak on-chip signals to
communicate with off-chip components. Therefore, sending a signal off-chip requires
more current than internal communication, and results in higher power consumption.
Modern DRAMs are based on several individual memory banks, and the memory address is separated into rows and columns, as shown in Figure 3.6. The three-dimensional
organization of modern memory devices results in non-uniform access time. The memory banks are accessed independently, but the two-dimensional memory array in each
bank is more complicated. To access a memory element, the corresponding row has to
be selected. Data in the selected row is then transferred to the row buffer. From the
row buffer, data is accessed at high speed and with uniform access time for any access
pattern. When data from a different row is needed, the current row has to be closed,
by precharging the bank, before the next row can be activated and transferred to the
row buffer.

Figure 3.6: Modern SDRAM is divided into three dimensions: banks, rows, and
columns.
Memory bandwidth of modern DRAM is highly dependent on the access pattern.
Accessing different banks or columns inside a row has a low latency, while accessing
data in different rows has a high latency. When processing several memory streams, a
centralized memory access scheduler can optimize the overall performance by reordering
the memory accesses [15]. Latency for individual transfers may increase, but the goal
is to minimize average latency and maximize overall throughput.
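The impact of row hits versus row misses can be illustrated with a toy cost model; the cycle counts below are invented for illustration and do not correspond to any specific SDRAM device.

```python
# Toy model of SDRAM access cost on a single bank: a row hit costs one cycle,
# a row miss pays for precharge and activate before the access.
ROW_HIT = 1
ROW_MISS = 3 + 3 + 1   # precharge + activate + access (illustrative numbers)

def access_cost(addresses, row_size=1024):
    """Total cycles for an address sequence, tracking the currently open row."""
    open_row = None
    cycles = 0
    for addr in addresses:
        row = addr // row_size
        cycles += ROW_HIT if row == open_row else ROW_MISS
        open_row = row
    return cycles

sequential = list(range(64))               # burst-friendly: stays in one row
strided = [i * 1024 for i in range(64)]    # worst case: new row every access
print(access_cost(sequential), access_cost(strided))  # 70 448
```

Even in this simplified model, the strided pattern costs over six times as many cycles for the same amount of data, which is why a scheduler that groups accesses by row pays off.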
Chapter 4
A flexible FFT core
This chapter presents the implementation of a flexible pipeline FFT core, which is used
as the central building block in a hardware accelerator for digital holographic imaging
[16]. The design is based on a flexible and highly configurable hardware description,
which supports several different architectures. The FFT core can be configured at runtime to calculate forward or inverse transforms for any number of points up to the
maximum transform size, which is selected at design time.
The chapter starts with a description of the discrete Fourier transform (DFT) before
introducing the more efficient fast Fourier transform (FFT). Different architectures for
hardware implementation are presented, followed by scaling alternatives and important
considerations when dealing with pipeline architectures. The implementation is then
described in detail along with simulations and results. Finally, measurements from the
fabricated design are presented.
4.1 Discrete Fourier transform
The discrete Fourier transform is a commonly used operation in digital signal processing [6]. Typical applications are linear filtering, correlation, and spectrum analysis. The
Fourier transform is also found in modern communication systems using digital modulation techniques such as orthogonal frequency division multiplexing (OFDM). OFDM
is used in several wireless network standards, for example 802.11a [17] and 802.11g [18],
as well as in audio and video broadcasting using DAB and DVB. Another example is
GPS receivers, which use the DFT to modify the spectrum and suppress interference [19].
The DFT is defined as

    X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn}, \qquad 0 \le k < N,               (4.1)

where

    W_N = e^{-i 2\pi/N}.                                                     (4.2)

Evaluating Equation 4.1 requires N MAC operations for each transformed value in X,
or N^2 operations for the complete DFT. Changing transform size significantly affects
computation time, e.g. calculating a 1024-point Fourier transform requires a thousand
times more work than a 32-point DFT. An efficient way of computing the DFT is
therefore of great importance, and is presented in the next section.
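Equation 4.1 can be evaluated directly; the sketch below spells out the N MAC operations per output value, which is what makes the direct DFT O(N²).

```python
# Direct evaluation of Equation 4.1: N complex MACs per output, N^2 in total.
import cmath

def dft(x):
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)   # W_N from Equation 4.2
    return [sum(x[n] * W**(k * n) for n in range(N)) for k in range(N)]

X = dft([1, 2, 3, 4])
print([round(v.real, 6) + 1j * round(v.imag, 6) for v in X])
```

For N = 1024 this inner sum runs about a million times, against roughly a thousand for N = 32, matching the thousandfold difference quoted above.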
4.2 FFT algorithms
A more efficient way of computing the DFT is to use the FFT [20]. The FFT is
a decomposition of an N -point DFT into successively smaller DFT transforms. The
concept of breaking down the original problem into smaller sub-problems is known as
a divide-and-conquer approach. The original sequence can for example be divided into
N = r_1 · r_2 · … · r_q, where each r is a prime. For practical reasons, the r values are often
chosen equal, creating a more regular structure. As a result, the DFT size is restricted
to N = r^q, where r is called the radix or decomposition factor. Most decompositions are
based on a radix value of 2, 4, or even 8 [21]. Consider the following decomposition of
Equation 4.1, known as radix-2:

    X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn}                                    (4.3)
         = \sum_{n=0}^{N/2-1} x(2n) W_N^{k(2n)} + \sum_{n=0}^{N/2-1} x(2n+1) W_N^{k(2n+1)}
         = \underbrace{\sum_{n=0}^{N/2-1} x_{even}(n) W_{N/2}^{kn}}_{\mathrm{DFT}_{N/2}(x_{even})}
           + W_N^k \underbrace{\sum_{n=0}^{N/2-1} x_{odd}(n) W_{N/2}^{kn}}_{\mathrm{DFT}_{N/2}(x_{odd})}.
The original N-point DFT has been divided into two N/2-point DFTs, a procedure that can
be repeated on the smaller transforms. The complexity is thus reduced from
O(N^2) to O(N log_2 N). The decomposition in Equation 4.3 is called decimation-in-time (DIT), since the input x(n) is decimated by a factor of 2 when divided into an
even and an odd sequence. Combining the results from the transforms requires a scaling
and add operation. Another common approach is known as decimation-in-frequency
(DIF), splitting the input sequence into x_1 = {x(0), x(1), …, x(N/2 − 1)} and
x_2 = {x(N/2), x(N/2 + 1), …, x(N − 1)}. The summation now yields

    X(k) = \sum_{n=0}^{N/2-1} x(n) W_N^{kn} + \sum_{n=N/2}^{N-1} x(n) W_N^{kn}         (4.4)
         = \sum_{n=0}^{N/2-1} x_1(n) W_N^{kn} + \underbrace{W_N^{kN/2}}_{(-1)^k} \sum_{n=0}^{N/2-1} x_2(n) W_N^{kn},

where W_N^{kN/2} can be extracted from the second summation since it only depends on the value
of k, and is expressed as (−1)^k.

Figure 4.1: (a) Butterfly operation and scaling. (b) The radix-2 decimation-in-frequency FFT algorithm divides an N-point DFT into two separate N/2-point
DFTs.

This expression divides, or decimates, X(k) into two
groups depending on whether (−1)^k is positive or negative. That is, one equation
calculates the even values and one calculates the odd values, as in
    X(2k) = \sum_{n=0}^{N/2-1} \left[ x_1(n) + x_2(n) \right] W_{N/2}^{kn}             (4.5)
          = \mathrm{DFT}_{N/2}(x_1(n) + x_2(n))

and

    X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x_1(n) - x_2(n) \right] W_N^n W_{N/2}^{kn}     (4.6)
            = \mathrm{DFT}_{N/2}((x_1(n) - x_2(n)) W_N^n).
Equation 4.5 calculates the sum of two sequences, while Equation 4.6 calculates the
difference and then scales the result. This kind of operation, adding and subtracting
the same two values, is commonly referred to as a butterfly due to its butterfly-like shape
in the flow graph, shown in Figure 4.1(a). Sometimes, scaling is also considered to be
a part of the butterfly operation. The flow graph in Figure 4.1(b) represents the computations from Equation 4.5 and Equation 4.6, where each decomposition step requires
N/2 butterfly operations.
In Figure 4.1(b), the output sequence from the FFT appears scrambled. The binary
output index is bit-reversed, i.e. the most significant bits (MSB) have changed place with
the least significant bits (LSB), e.g. 11001 becomes 10011. To unscramble the sequence,
bit-reversed indexing is required.
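The DIF recursion and the bit-reversed unscrambling above can be sketched as a software model; this is for illustration only and does not reflect the folded hardware structure discussed later.

```python
# Sketch of an in-place radix-2 decimation-in-frequency FFT following
# Equations 4.5 and 4.6, with bit-reversed reordering of the output.
import cmath

def fft_dif(x):
    x = list(x)
    N = len(x)                               # N must be a power of 2
    half = N // 2
    while half >= 1:                         # one pass per decomposition step
        for start in range(0, N, 2 * half):
            for n in range(half):
                a = x[start + n]
                b = x[start + n + half]
                x[start + n] = a + b                        # Equation 4.5
                w = cmath.exp(-2j * cmath.pi * n / (2 * half))
                x[start + n + half] = (a - b) * w           # Equation 4.6
        half //= 2
    # unscramble: X(k) ends up at the bit-reversed position of k
    bits = N.bit_length() - 1
    out = [0] * N
    for i in range(N):
        out[int(format(i, f"0{bits}b")[::-1], 2)] = x[i]
    return out

print(fft_dif([1, 1, 1, 1, 0, 0, 0, 0])[0])  # X(0) = sum of the input = 4
```

Each `while` iteration corresponds to one column of butterflies in Figure 4.1(b), which is exactly the structure the pipeline architecture folds onto cascaded hardware stages.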
Figure 4.2: (a) Radix-4 butterfly. (b) Radix-2² butterfly.
4.3 Implementation aspects
In Section 4.2, the sequence is decomposed into two groups using a radix-2 butterfly,
which is possible when the transform size is a power of 2. When the transform size
is a power of 4, more hardware-efficient decomposition algorithms exist. This section
presents the radix-4 and the radix-2² algorithms, which both reduce the hardware requirements compared to the radix-2 algorithm. However, to calculate an FFT size not
supported by radix-4, for example 2048, both radix-4 decompositions and a radix-2 decomposition are needed, since N = 2048 = 2^1 · 4^5. A short description of the FFT
algorithms is presented below.
Radix-2
The radix-2 algorithm can be used when the size of the transform is N = 2^q, where q is
an integer. The algorithm requires q decomposition steps, each computing N/2
butterfly operations using a radix-2 (R-2) butterfly, shown in Figure 4.1(a).
Radix-4
When the size is N = 4^q, the radix-4 algorithm can be used. Radix-4 decomposition reduces the number of complex multiplications, at the penalty of increasing the number
of complex additions. The complex radix-4 butterfly is found in Figure 4.2(a).
Radix-2²
Another decomposition similar to radix-4 is the radix-2² algorithm, which simplifies the
complex radix-4 butterfly into four radix-2 butterflies. On a flow graph level, radix-4 and radix-2² require the same number of resources. The difference between the
algorithms becomes evident when folding is applied, which is shown later. The radix-2²
(R-2²) butterfly is found in Figure 4.2(b).
4.3.1 Algorithm mapping
There are several ways of mapping an algorithm to hardware. Three approaches are
discussed and evaluated in this section: direct-mapped hardware, a pipeline structure,
Figure 4.3: (a) Flow graph of a 4-point radix-2 FFT. (b) Mapping of the 4-point
FFT using two radix-2 butterfly units with delay feedback memories.
and time-multiplexing using a single butterfly unit. Direct-mapped hardware basically
means that each processing unit in the flow graph is implemented using a unique arithmetic unit. Normally, using this approach in a large and complex algorithm is not
desirable due to the huge amount of hardware resources required. The alternative is to
fold operations onto the same block of hardware, an approach that saves resources at
the cost of increased computation time.
Figure 4.3(a) shows a 4-point radix-2 FFT. Each stage consists of two butterfly
operations, hence the direct-mapped hardware implementation requires 4 butterfly units
and 2 complex multiplication units, numbers that will increase with the size of the
transform. Folding the algorithm vertically, as shown in Figure 4.3(b), reduces hardware
complexity by reusing computational units, a structure often referred to as a pipeline
FFT [22]. A pipeline structure of the FFT is constructed from cascaded butterfly blocks.
When the input is in sequential order, each butterfly operates on samples x_n and x_{n+N/2},
hence a delay buffer of size N/2 is required in the first stage. This is referred to as a
single-path delay feedback (SDF). In the second stage, the transform size is N/2, hence the
delay feedback memory is N/4. In total, this sums up to N − 1 words in the delay
buffers.
Folding the pipeline architecture horizontally as well reduces the hardware to a single
time-multiplexed butterfly and complex multiplier. This approach saves arithmetic
resources, but still requires the same amount of storage space as a pipeline architecture.
The penalty is further reduced calculation speed, since all decompositions are mapped
onto a single computational unit.
To summarize, which architecture to use is closely linked to the required computational speed and available hardware and memory resources. A higher computational
speed normally requires more hardware resources, a trade-off that has to be decided
before the actual implementation work begins.
Another important parameter associated with the architectural decisions is the
wordlength. An increased wordlength improves the computational quality, measured
in signal-to-quantization-noise-ratio (SQNR), but increases the hardware cost as well.
This trade-off is known as wordlength optimization and is usually specified as a generic
parameter during the design phase.
Table 4.1: Properties for different FFT architectures. Multipliers and adders are complex valued. The number of clock cycles depends on the transform length N.

Hardware architecture      | Adders     | Multipliers         | Memory | Cycles
---------------------------|------------|---------------------|--------|----------
Direct-mapped radix-2      | N log2 N   | (N/2)(log2 N − 1)   | 0      | –
Direct-mapped radix-4      | 2N log4 N  | (3N/4)(log4 N − 1)  | 0      | –
Direct-mapped radix-2²     | 2N log4 N  | (3N/4)(log4 N − 1)  | 0      | –
Pipeline radix-2           | 2 log2 N   | log2 N − 1          | N − 1  | N − 1
Pipeline radix-4           | 8 log4 N   | log4 N − 1          | N − 1  | N − 1
Pipeline radix-2²          | 4 log4 N   | log4 N − 1          | N − 1  | N − 1
Time-multiplexed radix-2   | 2          | 1                   | N      | N log2 N
Time-multiplexed radix-4   | 8          | 1                   | N      | N log4 N
Time-multiplexed radix-2²  | 4          | 1                   | N      | N log4 N
4.4 Architecture selection
The hardware structures described in Section 4.3 are summarized in Table 4.1, which
also presents requirements for implementations based on radix-2, radix-2², and radix-4.
Based on the specification in Section 2.5, an appropriate architecture can be selected by
evaluating the speed and area requirements. The maximum transform size is N = 2048,
and the total number of clock cycles for a single one-dimensional transform must be in
the range of FFTcc ≈ 2400 clock cycles.
4.4.1 Evaluation
With N = 2048, Table 4.1 can be used to select an appropriate architecture. The direct-mapped hardware architecture will generate the fastest implementation, but also the
most area-consuming one. With the current selection of N, a radix-2 architecture requires
22528 adders and 10240 multipliers. This is not feasible to implement on a chip today.
A direct-mapped hardware architecture is only realistic when N is small.
The time-multiplexed architecture has a fixed number of adders and multipliers
regardless of the FFT size. However, the number of clock cycles does not comply with
the specification, requiring over 20000 clock cycles for a single transform.
Hardware requirements for the pipeline radix-2 architecture are 22 adders and 10
multipliers, which makes pipeline architectures suitable for hardware implementation.
The computation time is also in the range of the specified value, namely 2047 clock
cycles. The pipeline architecture is therefore a suitable trade-off between computational
speed and area.
Radix-2² and radix-4 require fewer complex multipliers than radix-2. Furthermore,
radix-2² requires fewer adders than radix-4. Hence, the selected architecture for this
project is based on the radix-2² single-path delay feedback (R2²SDF) algorithm [22].
Since the transform size is 2048, an extra radix-2 unit is added to support transform
lengths of size N = 2^q. The total pipeline is constructed from one radix-2 block and five
radix-2² blocks, shown in Figure 4.4. The hardware requirements are 22 adders and 5
multipliers.
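The comparison above can be reproduced by evaluating the Table 4.1 formulas for the radix-2 variants at N = 2048.

```python
# Table 4.1 formulas for the radix-2 variants, evaluated at N = 2048.
from math import log2

N = 2048
log2N = int(log2(N))                         # 11 decomposition steps

direct = (N * log2N, (N // 2) * (log2N - 1))   # adders, multipliers
pipeline = (2 * log2N, log2N - 1, N - 1)       # adders, multipliers, cycles
timemux = (2, 1, N * log2N)                    # adders, multipliers, cycles

print(direct)    # (22528, 10240): far too much arithmetic to put on chip
print(pipeline)  # (22, 10, 2047): meets the ~2400 cc budget
print(timemux)   # (2, 1, 22528): too slow for the specification
```

Only the pipeline row satisfies both constraints at once, which is the trade-off the text settles on.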
Figure 4.4: A 2048-point pipeline FFT constructed from one radix-2 butterfly
and five radix-2² butterflies. A radix-2² butterfly is constructed from two
radix-2 butterflies separated by a trivial multiplication with −j.
4.5 Scaling alternatives
For application-specific hardware implementations, fixed-point is the most commonly
used format due to the simple implementation of arithmetic units. When designing
a datapath, the wordlength of the arithmetic units is adjusted to support the required
dynamic range and precision. Adjusting the wordlength can be done manually or by
using a fixed-point simulation tool [23]. This is often performed by estimating the range
of the floating-point values, and then converting the program into a fixed-point model. In
a fixed-point model, the result of an addition or subtraction requires an increased
wordlength to avoid overflow, which changes the dynamic range. The result of a multiplication is usually rounded or truncated to avoid a significantly increased wordlength,
hence causing a quantization error.
Independent of the dynamic range and the distribution of the input signal, the
quantization energy P_q is nearly constant for a fixed-point value. When the energy in
the input signal P_x varies, this affects the SQNR, defined as

    SQNR_{dB} = 10 \cdot \log_{10} \frac{P_x}{P_q}.                          (4.7)
Therefore, precision in the calculations depends on properties of the input signal, caused
by uniform resolution over the total dynamic range. Fixed-point arithmetic has low
complexity, but usually requires an increased wordlength due to the trade-off between
dynamic range and precision.
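The effect can be illustrated numerically with Equation 4.7: quantize a sine wave to a fixed number of fractional bits and measure the SQNR for two amplitudes. The 12-bit resolution is an illustrative choice, not a value from this design.

```python
# Measure the SQNR of fixed-point quantization (Equation 4.7) for a sine wave.
import math

def sqnr_db(amplitude, frac_bits, samples=10000):
    step = 2.0 ** -frac_bits                 # uniform quantization step
    p_x = p_q = 0.0
    for i in range(samples):
        x = amplitude * math.sin(2 * math.pi * i / samples)
        q = round(x / step) * step           # fixed-point rounding
        p_x += x * x                         # signal energy
        p_q += (x - q) ** 2                  # quantization noise energy
    return 10 * math.log10(p_x / p_q)

# The resolution is uniform over the dynamic range, so a weaker input
# signal loses SQNR even though the noise power stays roughly constant:
print(round(sqnr_db(1.0, 12)), round(sqnr_db(0.01, 12)))
```

A 40 dB drop in input amplitude costs roughly 40 dB of SQNR here, which is the fixed-point behavior that floating-point representations avoid.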
Floating-point arithmetic increases the dynamic range by expressing numbers with a
mantissa m and an exponent e, represented with M and E bits, respectively. By changing the quantization steps, the energy in the error signal will follow the energy in the
input signal, and the SQNR will remain relatively constant over a large dynamic range.
Compared to fixed-point numbers, some bits are reserved to represent the exponent,
lowering the precision but increasing the dynamic range. Arithmetic units implemented
using floating-point representation increase in complexity, since the format requires
mantissa alignment prior to computation and normalization of the result [24]. In the
following subsections, different scaling alternatives are presented, followed by a comparison
in terms of hardware requirements and precision in Section 4.6.

Figure 4.5: (a) Unit for aligning input operands for addition/subtraction. (b)
Normalization unit selecting the most significant bits after a multiplication.
LZOC is a leading zeros/ones counter to determine the shift amount.

Figure 4.6: Building blocks for a hybrid floating-point implementation. (a)
Modified butterfly containing an align unit on the input. (b) Modified complex
multiplier containing a normalization unit on the output.
4.5.1 Hybrid floating-point
A simplified scheme for floating-point representation of complex numbers is to use a
single exponent for the real and imaginary part. Besides reduced complexity in the
arithmetic units, especially the complex multiplication, the total wordlength for a complex number is reduced from 2(M + E) to 2M + E bits. This representation is referred
to as a hybrid floating-point format.
Supporting hybrid floating-point includes pre- and post-processing in the arithmetic
building blocks. Adding and subtracting values requires an alignment of input operands,
a low-complexity operation performed by right-shifting the smaller value by the difference between the exponents. The multiplication with a twiddle factor, represented with T bits, is not affected by misaligned operands but produces an M + T bit full-precision mantissa, which requires normalization. Normalization is performed by finding the M
most significant bits from the full precision result using a leading zeros/ones counter
(LZOC). The alignment and normalization units are found in Figure 4.5.
In the general case, multiplication also requires adding the exponent values. Since
the FFT twiddle factors are fixed-point values, this operation can be discarded in the
current design. The building blocks for hybrid floating-point are shown in Figure 4.6,
where MR-2 stands for modified radix-2 butterfly.
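As an illustration of these operations, the following bit-true sketch models the alignment and normalization units in software. The helper names and the choice M = 10 are illustrative assumptions, not taken from the implementation, and mantissas are plain Python integers rather than hardware registers.

```python
# Sketch of the hybrid floating-point helpers: a complex value is a tuple
# (re, im, e) with M-bit mantissas and one shared exponent, value = m * 2**e.

M = 10  # mantissa bits per real/imaginary part (assumed, as in Figure 4.10)

def align(a, b):
    """Right-shift the mantissas of the operand with the smaller exponent so
    both complex operands share the larger exponent (cf. Figure 4.5(a))."""
    (ar, ai, ae), (br, bi, be) = a, b
    if ae >= be:
        d = ae - be
        return (ar, ai, ae), (br >> d, bi >> d, ae)
    d = be - ae
    return (ar >> d, ai >> d, be), (br, bi, be)

def normalize(re, im, e, width):
    """Keep the M most significant bits of a full-precision product
    (cf. Figure 4.5(b)): count redundant sign bits (the LZOC), shift the
    mantissas, and adjust the shared exponent so the value is preserved."""
    mag = max(abs(re), abs(im), 1)
    lead = width - 1 - mag.bit_length()   # redundant sign bits
    re, im = re << lead, im << lead       # left-justify the larger mantissa
    e -= lead
    drop = width - M                      # bits to discard (truncation)
    return re >> drop, im >> drop, e + drop
```

A quick check: `normalize(3000, -2000, 0, 22)` returns mantissas `(375, -250)` with exponent 3, i.e. the original values 3000 and -2000 up to truncation.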
Figure 4.7: A simplified view of an iterative FFT processor using block floating-point. The peak value is monitored by the max unit, controlling the normalizing unit connected to the input of the butterfly.
4.5.2 Block floating-point
Block floating-point (BFP) combines the advantages of simple fixed-point arithmetic
with floating-point dynamic range. Instead of having an exponent for each value, one
single exponent is assigned to a group of values. There are two major reasons for using block floating-point: reduced memory requirements and lower arithmetic complexity. However, the output signal quality depends on the block size and the characteristics of the input signal [25]. The dynamic range and memory requirements for fixed- and floating-point representations are straightforward. These properties are more complex for block floating-point and depend on the block size. A large block size decreases the precision of low-amplitude signals when the block contains a high peak value, but consumes less
memory.
Representing data in block floating-point only requires one single exponent, valid
for a block of values. The exponent is based on the largest value in the block, hence
all values in a block must be analyzed before a common exponent can be determined.
For a time-multiplexed implementation of the FFT, using a single butterfly unit, this
will not cause any problems. Figure 4.7 shows an architecture containing a single
memory and butterfly unit. The data is processed one radix-r butterfly stage at a
time, requiring log_r N iterations for an N-point FFT sequence. In each iteration, all data is transferred from the main memory, processed, and then written back to the memory. The output from the complex multiplication is monitored to find a common exponent, and the data is scaled accordingly in the next iteration. The drawback
is that the wordlength of the memory must hold the values in extended precision from
the complex multiplication unit, adding a number of guard bits that will increase the
memory requirements.
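The block scaling itself can be sketched as follows. The function name and the wordlength are illustrative, not taken from the implementation; the sketch shows how one common exponent, chosen from the block's peak, makes small values in the same block lose precision.

```python
# Sketch of block floating-point quantization as used in the iterative FFT
# of Figure 4.7: one common exponent per block, based on the block's peak.

def bfp_quantize(block, mantissa_bits):
    """Return (mantissas, exponent) so that each value ~= m * 2**exponent."""
    peak = max(abs(v) for v in block)
    exp = 0
    limit = 1 << (mantissa_bits - 1)      # max magnitude of a signed mantissa
    while peak >> exp >= limit:           # scale down until the peak fits
        exp += 1
    return [v >> exp for v in block], exp
```

For example, with 8-bit mantissas a block containing the peak -3000 forces a common exponent of 5, so the value 100 is stored as the coarse mantissa 3.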
4.5.3 Convergent block floating-point
When applying block floating-point to an architecture with cascaded butterfly units,
scaling becomes more complicated. To cope with this situation, an algorithm known
as convergent block floating-point (CBFP) is proposed in [26]. By placing buffers between intermediate stages, data can be rescaled using block floating-point before it is
propagated to the next FFT stage, as shown in Figure 4.8. Since the pipeline FFT
divides the sequence into smaller butterfly operations, the block size will decrease as
Figure 4.8: An example of convergent block floating-point. The buffer after
each complex multiplier selects a common exponent for a group of values,
allowing fixed-point butterfly units. The first buffer can be removed to save
storage space, but will have a negative impact on the quality.
Figure 4.9: The buffer after each complex multiplier selects a common exponent
for a small group of values. The optimal value is evaluated from Equation 4.8.
it propagates through the pipeline. Consider the architecture in Figure 4.4. After the
first butterfly stage, two sequences of 1024 samples are propagated. Each radix-r stage
produces r groups using separate exponents. As the data propagates, the groups will
be further divided into smaller groups, until each value has its own exponent. Hence
the name convergent block floating-point. The buffering of data between each stage
requires a large amount of memory. The intermediate buffer is placed after the complex
multiplication and normalization is performed at the output. In practical applications,
the first intermediate buffer is often omitted to save storage space, but this leads to a
reduced SQNR.
4.5.4 Co-optimization
As previously mentioned, convergent block floating-point places intermediate buffers
between each radix-r block to rescale a group of values. Hybrid floating-point uses one
exponent for each complex value, and the intermediate buffers become obsolete. This
section describes how these architectures can be co-optimized, using hybrid floating-point extended with small intermediate buffers to reduce the storage space for exponents.
The hybrid floating-point solution requires Nx (2M + E) bits in the delay feedback,
where Nx is the length of the FIFO. Using a block size D requires an intermediate buffer
with D full precision words. At the same time, the number of exponents decreases with
a factor of D to N E/D. For a radix-2 stage, the memory requirements are
Figure 4.10: Memory requirements as a function of D for radix-2² blocks with an input FFT length from 64 to 1024. The memory for each stage includes the intermediate buffer connected to its input. D = 1 corresponds to the direct hybrid floating-point approach. M = 10, E = 4.
N(2M + E/D) + 2D(M + T),                    (4.8)

where the first term is the delay feedback (FIFO) storage and the second term is the intermediate buffer (BUF).
For a radix-2² stage, the delay feedback for the second butterfly should be added to the equation. The resulting architecture is presented in Figure 4.9, with small intermediate buffers and a separate exponent feedback memory. The MR-2 building block is the modified butterfly found in Figure 4.6(a). For each radix-r stage, an optimal value of D can be found. Figure 4.10 shows how the memory requirements depend on the intermediate buffer size. Each graph represents the memory required to implement a radix-2² stage with an intermediate buffer connected to its input.
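Equation 4.8 can be evaluated directly to reproduce the trade-off plotted in Figure 4.10. The sketch below assumes M = 10 and E = 4 as in the figure; the twiddle wordlength T = 12 is my own assumption, and a different T shifts the optimum.

```python
# Sketch evaluating Equation 4.8: memory for one radix-2 stage as a
# function of the intermediate block size D. Parameter defaults are
# assumptions for illustration (M = 10, E = 4 as in Figure 4.10; T = 12).

def stage_memory(n, d, m=10, e=4, t=12):
    """FIFO bits plus intermediate-buffer bits for block size d."""
    fifo = n * (2 * m + e / d)     # one exponent shared by d complex values
    buf = 2 * d * (m + t)          # d full-precision complex words
    return fifo + buf

def best_block_size(n, **kw):
    """Search D over powers of two, as on the x-axis of Figure 4.10."""
    sizes = [1 << i for i in range(10)]   # D = 1 .. 512
    return min(sizes, key=lambda d: stage_memory(n, d, **kw))
```

With these assumed parameters the optimum for an FFT length of 1024 lands at a small block size (D = 8), illustrating why the co-optimized architecture needs only modest intermediate buffers.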
4.6 Simulations
A simulation tool to evaluate different FFT architectures has been designed. The simulation environment can extract precision, dynamic range, memory requirements, estimated synthesis results, and chip size based on architectural descriptions. The user can specify the number of bits for representing mantissas, twiddle factors, and exponents. Furthermore, the user selects the FFT size, the rounding type, and the stimuli type, producing random values, sine waves, peaks, or signal constellations used in, for example, OFDM.
Figure 4.11: Decreasing the energy in a random-value input signal only affects the architecture where scaling is not applied in the initial stage. Signal level = 1 means utilizing the full dynamic range.
Four architectures have been evaluated. The first is hybrid floating-point with bidirectional support. Next are two versions of convergent block floating-point: one with scaling starting after the first stage (CBFP full) and one with scaling starting after the second stage (CBFP limited). Omitting the first intermediate buffer saves a large amount of storage space, but has a negative effect on quality. The last architecture is a co-optimization between hybrid floating-point and CBFP, using smaller intermediate buffers. Table 4.2 compares the required memory and the SQNR for a random-value signal utilizing the full dynamic range. The intermediate buffers used in CBFP consume a large amount of memory, a problem that is avoided by the co-optimized architecture. Figure 4.11 shows the result of changing the energy in the input signal. In this case, the variations only affect the CBFP implementation with scaling applied later in the pipeline. Figure 4.12 shows the result when applying signals with a large crest factor, i.e. the ratio between the peak and the RMS value of the input. For a signal
Table 4.2: Memory elements and SQNR for different architectures, based on a flexible 2048-point radix-2² implementation with 10-bit input.

Scaling method     FIFO    BUF     Total    SQNR     Bidir.
Hybrid FP          49040   0       49040    45.3 dB  Yes
CBFP (limited)     45716   14690   60406    43.7 dB  No
CBFP (full)        45716   60016   105732   43.9 dB  No
Co-optimization    45779   1584    47363    44.3 dB  No
Figure 4.12: Decreasing the energy in a random-value input signal with peak values utilizing the full dynamic range. This affects all block-scaling architectures, and the SQNR depends on the block size. The co-optimized architecture performs better than convergent block floating-point, since it has a smaller block size through the pipeline.
s(n) with N samples, the crest factor (CF) is defined as
CF = max|s(n)| / √( (1/N) Σn |s(n)|² ).                    (4.9)
In this case, both CBFP implementations are affected due to the large block size in the
beginning of the pipeline. Hybrid floating-point is not affected by signal statistics, since
every value is scaled individually. The SQNR for the co-optimized solution is located
between hybrid floating-point and CBFP since the block size is in the range of 4 to 16
values.
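The crest factor of Equation 4.9 is straightforward to compute. The following helper is a direct transcription; magnitudes are used throughout, so the sketch also works for complex samples.

```python
# Crest factor as in Equation 4.9: ratio of peak to RMS value.
import math

def crest_factor(s):
    peak = max(abs(v) for v in s)
    rms = math.sqrt(sum(abs(v) ** 2 for v in s) / len(s))
    return peak / rms
```

A constant signal gives CF = 1, and a full-scale sine gives CF = √2 ≈ 1.41; the pre-processed holographic images discussed above sit far higher, which is what favors the per-value scaling of hybrid floating-point.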
For the current application, it is desirable to use an architecture capable of producing
a high quality result for signals with high crest factors. Images captured in the digital
holographic setup are pre-processed before the inverse FFT is applied. The crest factor
after pre-processing is large, which requires an accurate FFT core to produce a high
quality result.
4.7 Implementation
A pipeline FFT core using hybrid floating-point, based on the radix-2² decimation-in-frequency algorithm [22], has been implemented, fabricated, and verified. The pipelined implementation requires one radix-2 unit, 5 radix-2² units, and 5 complex multiplication
Figure 4.13: (a) Registers are used for short delays and single port memories
for long delays. (b) Simulation of area requirements for different approaches.
and normalization blocks. The initial radix-2 stage is required since the size of the FFT
is not a power of 4. This section presents internal building blocks and the fabricated
ASIC.
4.7.1 Butterfly and complex multiplier
There are two radix-2 butterfly stages in each radix-22 stage that calculate the sum
and the difference between the input values and the delayed output sequence from the
single-path delay feedback. The butterfly has two different modes, one for calculating
and one for transferring data to and from the delay feedback. When scale factors are
used, it must be possible to align the inputs if they do not share the same exponent. An
equalizer unit performs the alignment of input values to the butterfly stage. When the
butterfly is filling or draining the delay feedback, the equalizer unit only propagates the
current input values. The equalizer compares the exponents of the two inputs to detect whether the values are aligned. If there is a difference, the smaller input value is right-shifted by the same number of bits as the difference between the two exponents. The aligned values are propagated to the butterfly unit.
The output from the complex multiplier is normalized and sent to the next FFT
stage. At the same time as the value is shifted by the normalizing unit, the exponent is
incremented accordingly.
4.7.2 Delay feedback considerations
Delay feedback units provide a temporary storage space for the input sequence to each
butterfly, i.e. the FIFOs in Figure 4.4. The delay feedback can be implemented either
as registers or as memories (RAMs), as shown in Figure 4.13(a). For shorter sequences,
serially connected flip-flops are used as delay elements. As the delay increases, this
Figure 4.14: The bidirectional pipeline of a 16-point FFT. The input/output
to the left is in normal bit order. The input/output from the right is in
bit-reversed order.
approach is no longer area and power efficient. One solution is to use SRAM memories,
single or dual port, and address pointers to keep track of read and write locations. To
keep computational units supplied with data, one read and one write operation has
to be performed each clock cycle. A dual port memory approach allows simultaneous
read and write operations with simple control logic. However, dual port memories
are both larger and consume more power than single port memories. Instead it is
possible to use two single port memories, alternating between read and write each clock
cycle. This approach can be further simplified by using only one single port memory
with double wordlength to hold two consecutive values in a single location, alternating
between reading two values in one cycle and writing two values in the next cycle. This
scheme requires temporary storage to synchronize the dataflow. In this design, the
latter single-port memory approach is used for delay feedback exceeding the length of
8 values. Simulation has shown this to be a good trade-off to minimize area [27]. A
comparison between different approaches is presented in Figure 4.13(b). In the current
0.35 µm CMOS process, the dividing line between flip-flops and memories is located at
approximately 250 bits. Changing to a different process implies a new evaluation to
find the optimal solution.
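The alternating single-port scheme can be captured in a small behavioral model. This is a software sketch of the dataflow only, not the RTL; the class name is invented for illustration, and the temporary read/write registers model the synchronization storage mentioned above.

```python
# Behavioral model of a length-n delay feedback built from ONE single-port
# memory of n/2 double-width words: the memory alternates between reading a
# pair of values in one cycle and writing a pair in the next.

class SinglePortDelayFeedback:
    def __init__(self, n):
        assert n % 2 == 0
        self.n = n
        self.mem = [(0, 0)] * (n // 2)  # each location holds two values
        self.addr = 0
        self.wbuf = []                  # values waiting to be written as a pair
        self.rbuf = []                  # values already read, waiting to be output

    def push_pop(self, x):
        """Insert x; return the value delayed by n cycles (zeros initially)."""
        self.wbuf.append(x)
        if not self.rbuf:                        # read cycle: fetch a pair
            self.rbuf = list(self.mem[self.addr])
        out = self.rbuf.pop(0)
        if len(self.wbuf) == 2:                  # write cycle: store a pair
            self.mem[self.addr] = (self.wbuf[0], self.wbuf[1])
            self.wbuf = []
            self.addr = (self.addr + 1) % (self.n // 2)
        return out
```

Feeding the sequence 0, 1, 2, ... into a length-4 instance returns each sample exactly four cycles later, while the memory sees only one access (a read or a write of a double word) per cycle.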
4.7.3 Bidirectional pipeline
If the wordlength is constant in the pipeline, the dataflow can be reversed, as shown in Figure 4.14. The bidirectional FFT has several applications. Replacing the FFT/IFFT used in, for example, an OFDM transceiver with a bidirectional pipeline can minimize the buffering required for inserting and removing the cyclic suffix, as proposed in [28]. Today, wordlength-optimized one-directional FFTs are commonly used, increasing the wordlength through the pipeline to maintain accuracy. Implementations based on CBFP are also proposed in [29], but both solutions only work in one direction. Another application for a bidirectional pipeline is when evaluating a one- or
two-dimensional convolution using the FFT as
x ∗ h = F⁻¹(F(x) · F(h)).                    (4.10)
If the forward transform generates data in bit-reversed order, the input to the reverse transform should also be bit-reversed. In the first step, the dataflow is from left to right. In the second step, the dataflow is from right to left. Both the input and the output of the convolution are in normal bit order, hence no reorder buffers are required. The
FPGA prototype is based on a bidirectional pipeline.
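The convolution identity itself can be checked numerically with Equation 4.10. The sketch below uses a plain textbook radix-2 FFT rather than the hardware pipeline, so it only illustrates the mathematics; `circular_convolve` is an illustrative name.

```python
# Circular convolution via x * h = F^{-1}(F(x) . F(h)), Equation 4.10,
# using a simple recursive radix-2 FFT (power-of-two lengths only).
import cmath

def fft(x, inverse=False):
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def circular_convolve(x, h):
    n = len(x)
    X, H = fft(x), fft(h)
    y = fft([a * b for a, b in zip(X, H)], inverse=True)
    return [v / n for v in y]   # 1/n scaling of the inverse transform
```

Convolving with a one-sample delay, h = [0, 1, 0, 0], circularly shifts the input by one position, as expected for circular convolution.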
Figure 4.15: Chip photo of the 2048 complex point FFT core fabricated in a
0.35 µm 5ML CMOS process. The core size is 2632 × 2881 µm2 connected to
58 I/O pads and 26 power pads.
4.8 ASIC Prototyping
A 2048-point FFT supporting hybrid floating-point has been fabricated in a 0.35 µm
five metal layer CMOS process from AMI Semiconductor, and a chip photo is shown in
Figure 4.15. The core size is 2632 × 2881 µm2 connected to 58 I/O pads and 26 power
pads. The FFT chip contains 11 delay feedback buffers, one for each butterfly unit.
Seven on-chip RAMs are used as delay buffers, with approximately 49 Kbits of storage
capacity. The four smallest buffers are implemented using flip-flops, which have been
evaluated to be a good solution in terms of chip area. Twiddle factors for the complex
multipliers are stored in three ROMs, with a size of approximately 47 Kbits in total.
RAMs and the ROMs can be seen along the sides of the chip in Figure 4.15.
A pattern generator and a logic analyzer were used to confirm functionality and
measure the power characteristics of the prototypes. By converting the test vectors
from the preceding netlist simulations, the exact same test vectors are used for prototype
verification. The comparison between the output results from the prototype and the
netlist simulation showed that the circuits were functional up to 50 MHz using a supply voltage of 1.8 V. The power consumption of the core was measured to be 234 mW at 1.8 V and 50 MHz.
Chapter 5
Streaming hardware accelerator
This chapter presents a generic and flexible hardware accelerator for general signal and
image processing, which is targeted for digital holographic imaging. The developed
accelerator, referred to as Xstream, contains a datapath with several modules for
reordering, scaling, rotating, and transforming data. The FFT described in Chapter 4
is a central building block of the accelerator.
5.1 Requirements
The reconstruction algorithms described in Chapter 2 define the requirements for the
accelerator. First, Equation 2.1 states that the recorded images are to be subtracted
to obtain the interference pattern. Then, looking at Equation 2.4, each element in the
interference pattern is rotated with a phase factor before Fourier transformation.
Figure 5.1 shows the actual dataflow and how the algorithm is executed. Figure 5.1(a) combines and subtracts the images, rotates each pixel vector with a phase
factor, and calculates one-dimensional FFTs over the rows. Figure 5.1(b) takes the
result from the first operation and calculates one-dimensional FFTs over the columns.
The largest value is stored in a register before writing the sequence back to memory.
Figure 5.1: Processing dataflow. Grey boxes indicate memory transfers and
white boxes indicate computation. (a) Combine, rotate, and one-dimensional
FFT over rows. (b) One-dimensional FFT over columns and store maximum
value. (c) Calculate magnitude or phase and scale result into pixels.
Figure 5.2: Functional units in the Xstream accelerator. Grey boxes indicate units that contain local buffers. Processing units communicate using a handshake protocol.
In the final step, shown in Figure 5.1(c), the vector magnitude or phase is calculated.
The result is also scaled to a pixel value, using the largest value from the previous
iteration.
5.2 Architecture
Xstream is constructed from a pipeline with several hardware units, mapping the
dataflow in Figure 5.1 to hardware. The name Xstream denotes that the accelerator
operates on data streams, and could potentially be reconfigured to support multiple
input and output streams. Focus is on maximizing computational efficiency with parallel
execution, and optimizing global bandwidth using burst transfers. Processing units
communicate using a handshake protocol, where each produced value is acknowledged
by the receiver. Without flow control, buffers could overrun with corrupted data as a
result. The pipeline is configured using an external controller, e.g. a microprocessor.
Each unit can be individually configured, and a global register allows units to be enabled
and disabled.
Figure 5.2 shows the Xstream pipeline, and the processing direction is from left to
right. The first unit combines recorded images and calculates the interference pattern.
The combine unit is constructed from a reorder unit to extract pixels from the same
position in each image, and a subtract unit. Each individual pixel in the interference
pattern is actually complex-valued, with the imaginary part set to zero. This two-dimensional pixel vector is then rotated with a phase factor. The implementation is
based on a coordinate rotation digital computer (CORDIC) operating in rotation mode.
The CORDIC algorithm and implementation is described in Section 5.2.2. The next
operation is the Fourier transformation, presented in Chapter 4. A two-dimensional
FFT can be evaluated using a one-dimensional FFT, separately applied over rows and
columns. Considering the size of the two-dimensional FFT, data must be transferred
to the main memory between row and column processing. A buffer is required to
unscramble the FFT output since it is produced in bit-reversed order, and a spectrum
shift operation is also required to center the zero-frequency component. The CORDIC
unit is finally reused to generate magnitude or phase images from the complex-valued
result produced by the FFT. The result is scaled using the complex multiplier in the
Figure 5.3: (a) The three captured images and the rotation factor α. (b) The
interleaved images stored in memory and the interleaved memory map with
M pixels in each row, where the buffer can hold 4M pixels. Grey indicates the
amount of data copied to the input buffer in a single transfer. In this example
the interleaving is 4 groups with 8 pixels in each group.
pipeline, from floating-point format back to fixed-point numbers that represent an 8-bit
greyscale image. The following sections present the computational units in detail.
5.2.1 Image combine
The first operation is to combine the captured images and calculate the interference
pattern as
ψ(x, y) = ψh(x, y) − ψo(x, y) − ψr(x, y).                    (5.1)
First, consider the images to be stored in separate memory regions, as shown in Figure 5.3(a). The read operation would then either be single reads accessing position (x, y) in each image, or several burst transfers from separate memory locations. In the former case, simply accessing the images pixel-by-pixel from a burst-oriented memory is inefficient, since it requires three single transfers for each pixel. The latter approach requires several burst transfers, reading from different parts of the memory. Burst transfers are fast, but the scheme is more complex, since reading and reordering information becomes a complicated procedure of transferring data from three separate memory areas simultaneously. Accessing the data in one single burst would be desirable, and
thus a memory reorder scheme is proposed in the following section.
Memory interleaving
One approach to combine images is to reorder the physical placement in memory. Instead of storing images in separate memory regions, they can be interleaved, as shown in
Figure 5.3(b). Here, data is fetched in a single burst transfer and internally reordered.
The transfer fills an input buffer, which reorders the information using a modified address generator to extract image information pixel-by-pixel. Using this scheme, both
storing captured images and reading images from memory can be performed with single
burst transfers.
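The interleaved layout and the combine step can be sketched as follows. The placement of the rotation factors α as the fourth interleaved stream is my reading of Figure 5.3(b), and all names are illustrative.

```python
# Sketch of the interleaved layout of Figure 5.3(b): the three images and the
# rotation factors are stored in alternating groups of m pixels, so one burst
# of 4*m words brings in matching data from all four streams.

def interleave(psi_h, psi_o, psi_r, alpha, m):
    """Build the interleaved memory image from four equal-length streams."""
    mem = []
    for g in range(0, len(psi_h), m):
        for stream in (psi_h, psi_o, psi_r, alpha):
            mem.extend(stream[g:g + m])
    return mem

def combine_burst(mem, burst_start, m):
    """Consume one 4*m burst and emit (psi, alpha) pairs pixel by pixel,
    where psi = psi_h - psi_o - psi_r, as in Equation 5.1."""
    buf = mem[burst_start:burst_start + 4 * m]   # the 4m-pixel input buffer
    h, o, r, a = (buf[i * m:(i + 1) * m] for i in range(4))
    return [(h[i] - o[i] - r[i], a[i]) for i in range(m)]
```

Both writing captured frames and reading them back then touch only contiguous addresses, which is the property the burst-oriented memory needs.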
Figure 5.4: (a) CORDIC in rotation mode, z → 0. (b) CORDIC in vector mode, y → 0. Other functions are evaluated by changing from circular rotations to linear or hyperbolic rotations.
In addition to combining the three images, a phase factor is required for each pixel.
The phase factor is propagated to the CORDIC unit to rotate the interference pattern.
Assuming that the images are interleaved with a partial size of M pixels, the buffer
must be able to hold 4M pixels. A transfer to the internal reorder buffer now includes
information from each image, and the buffer output is reordered to produce a stream of
pixels from the same location. Simulation results for the proposed method are presented
in Chapter 6.
5.2.2 CORDIC
The second operation is to rotate each pixel in the interference pattern with a phase
factor. The implementation is based on a CORDIC, which is a convergence method for
evaluating trigonometric functions using a fixed number of iterations [30]. Each iteration
step is assigned a pre-calculated angle α(i), associated with the iteration number. The CORDIC algorithm does not perform true rotations, i.e. rotations preserving the magnitude of the vector being rotated. Instead it uses pseudo-rotations, which reduce the complex rotation equations from
x(i+1) = x(i) cos α(i) − y(i) sin α(i) = (x(i) − y(i) tan α(i)) / (1 + tan² α(i))^1/2
y(i+1) = y(i) cos α(i) + x(i) sin α(i) = (y(i) + x(i) tan α(i)) / (1 + tan² α(i))^1/2
z(i+1) = z(i) − α(i)

to the simpler equations

x(i+1) = x(i) − y(i) tan α(i)
y(i+1) = y(i) + x(i) tan α(i)
z(i+1) = z(i) − α(i)
by removing the factor 1/(1 + tan² α(i))^1/2. Hence, the final result is scaled by a known expansion factor K, which is the product of the expansion factors of all iteration steps as
Figure 5.5: (a) A basic CORDIC building block. (b) The internal hardware structure. The constant and shift factor are unique for each block.
K = ∏i (1 + tan² α(i))^1/2 .
As long as the scale factor is known, it can always be compensated for later on. To reduce the complexity even further, α(i) is chosen such that the multiplication with tan α(i) becomes a simple operation. Setting tan α(i) = di 2^−i, where di ∈ {−1, 1}, results in simple bit-shift and add/subtract operations. The angles α(i) can be precomputed and stored in a lookup table (LUT). The limitation is that it is no longer possible to rotate by an arbitrary angle. The rotation is instead the sum of a set of fixed angles, where the signs depend on di.
Note that CORDIC is not limited to two-dimensional calculations. Several units can be
connected together to evaluate even three-dimensional rotations [31].
CORDIC modes
CORDIC can operate in two different modes, and the mode determines the function to
be evaluated, as shown in Figure 5.4. In rotation mode, di is chosen such that z → 0,
i.e. di = sign(z (i) ). In rotation mode, CORDIC can evaluate for example x sin(z) and
x cos(z), which is actually a rotation of x with the angle z. The complex multipliers in
the FFT actually perform rotations and can therefore also be evaluated with a CORDIC
unit [32]. In vector mode, di is chosen such that y → 0, by setting di = −sign(x(i) y (i) ).
In vector mode, the vector magnitude and phase can be evaluated from a complex value,
which is used to convert the complex result from the FFT into visible images.
Implementation
Instead of performing iterations over time, the design can be constructed by connecting
a number of identical stages in sequence. Figure 5.5(a) shows a CORDIC stage, and
Figure 5.5(b) shows the actual implementation, which requires three adders and control
logic for mode selection and di condition. A pipeline implementation of the CORDIC
algorithm is chosen since it can produce one value each clock cycle, which is matched
to the throughput of the one-dimensional FFT core.
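A minimal software model of rotation mode is shown below. Floating-point multiplications by 2^−i stand in for the hardware right shifts, and the iteration count is an arbitrary choice; the gain K is compensated afterwards, as described above.

```python
# Sketch of rotation-mode CORDIC (z -> 0) using the pseudo-rotation update
# from Section 5.2.2. The result is scaled by K = prod(sqrt(1 + 2**-2i)).
import math

def cordic_rotation(x, y, z, n=24):
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0          # drive the residual angle to zero
        x, y, z = (x - d * y * 2.0 ** -i,    # pseudo-rotation, no 1/sqrt factor
                   y + d * x * 2.0 ** -i,
                   z - d * math.atan(2.0 ** -i))
    return x, y

def cordic_gain(n=24):
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n))
```

Rotating (1, 0) by z = 0.5 and dividing by the gain recovers cos(0.5) and sin(0.5), which is exactly the x sin(z) / x cos(z) evaluation mentioned for rotation mode.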
Figure 5.6: The transpose logic uses the DMA controllers, the AGU units and
the internal buffer to transpose large matrices. In the figure, a 32 × 32 matrix
is transposed by individually transposing and relocating 4 × 4 macro blocks.
Each macro block contains an 8 × 8 matrix.
5.2.3 Two-dimensional FFT
A two-dimensional FFT can be calculated by first performing one-dimensional transforms over the rows, and then applying the one-dimensional transform over the columns of the result. However, the memory access pattern when transforming columns would cause a serious performance loss, since new rows are constantly accessed, which prevents burst transfers. An equivalent procedure to the row/column approach is to
transpose the information between the computations. This means that the FFT is
actually applied two times over rows in the memory with an intermediate transpose operation. If the transpose operation combined with FFTs over rows is performed faster
than FFTs over columns, the overall calculation time is reduced. To evaluate this approach, a fast transpose unit is required. Simulation results and a comparison between
the two approaches are presented in Chapter 6.
Transpose unit
The transpose operation is critical for system performance. Considering the transformation size, data cannot be stored on-chip and has to be transferred to the off-chip
main memory. The access pattern for a transpose operation is normally reading rows
and writing columns, or vice versa. Reading or writing a row will not cause performance
problems, but the access pattern of reading or writing columns will significantly degrade
performance.
Performance can be improved by changing the memory access pattern [33]. The
transpose operation of a matrix can be broken down into a number of smaller transpose
operations combined with block relocation as
H = [ h11  h12 ; h21  h22 ],    H^T = [ h11^T  h21^T ; h12^T  h22^T ].                    (5.2)
The matrix can be further divided using a divide-and-conquer approach, all the way
down to single elements. Breaking it into single elements is not desirable, since all
accesses will become single read and write operations. However, if the matrix is divided
into smaller blocks that fit into a cache memory, burst operations can be applied for
Figure 5.7: (a) Xstream connected to a single bus. (b) Xstream connected
to separate busses. Throughput is increased without affecting behaviour.
both reading and writing. Memory burst length depends on the size of the transpose
buffer. The transpose operation is illustrated in Figure 5.6, and the simulation results
are presented in Chapter 6.
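The block decomposition of Equation 5.2 leads directly to the following sketch. The loop structure and block size are illustrative, and the model ignores the DMA/AGU machinery; it only shows that every block is read and written in burst-friendly row order.

```python
# Sketch of the blocked transpose of Figure 5.6: the matrix is split into
# b x b macro blocks that each fit the on-chip buffer; every block is read
# row by row, transposed locally, and written to the mirrored position.

def blocked_transpose(a, b):
    """Transpose the square matrix a (list of lists) using b x b blocks."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for bi in range(0, n, b):
        for bj in range(0, n, b):
            # read block (bi, bj) row by row (each row is one burst) ...
            for i in range(b):
                for j in range(b):
                    # ... and write it transposed at the mirrored block
                    out[bj + j][bi + i] = a[bi + i][bj + j]
    return out
```

In hardware, the burst length equals the block row length, so the block size is set by the transpose buffer, as noted above.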
5.3 External interface
Xstream communicates using a standard bus interface to transfer data to and from the
pipeline. Since the accelerator is based on a pipeline of computational units, there is one
input interface and one output interface. Figure 5.7 shows two possible configurations
for the system. The input and output can either be connected to the same bus, which
would limit the performance, or to separate busses and memory banks. However, the problem with multiple memory interfaces is the large number of input and output signals required. Each 32-bit SDRAM bank requires approximately 50 connections for data, address, and control signals. Designing a system with several memory interfaces is pad-consuming, hence the prototype presented in Chapter 7 uses a single interface due to board limitations.
Data is transferred using DMA modules, as shown in Figure 5.8. After a DMA
transfer is initiated, the transfer is carried out by directly accessing the external memory.
The DMA modules are connected to one or several on-chip busses, and arbitration is
used to select the bus master. The DMA interface is constructed from three blocks: the bus interface, a buffer module, and a configuration interface. One side connects to the external bus, while the other side connects to the internal dataflow protocol.
5.3.1 Bus interface
The Xstream bus interface is compatible with the advanced high-speed bus (AHB)
protocol which is a part of the AMBA specification [34]. Configuration data is transferred over the advanced peripheral bus (APB). The bus interface acts as a master
on the bus, reading and writing data using burst accesses of any length. This is an
important feature, which allows fast access to burst oriented memories.
The interface can access not only a linear address space, but also two-dimensional memory arrays. By specifying a base address (addr), the transfer size (hsize, vsize), and the space between individual rows (space), the bus interface can access a two-dimensional matrix of any size inside a two-dimensional address space of any size. The control register contains signals to start and stop a transfer, and to specify the burst size.

Figure 5.8: (a) The DMA interface connects the internal dataflow protocol to the external bus using a buffer and an AMBA bus interface. (b) The DMA interface can access a two-dimensional matrix inside a two-dimensional memory array. space is the distance to the next row.
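The two-dimensional addressing can be sketched as a small model. The register names (addr, hsize, vsize, space) follow the text; the assumption that space is the gap between the end of one row and the start of the next is illustrative, and the callback stands in for one AHB burst:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the 2D DMA address generation. Assumes word
   addressing and that `space` is the gap between the end of one row
   and the start of the next (the exact register semantics may differ). */
uint32_t row_start(uint32_t addr, uint32_t hsize, uint32_t space,
                   uint32_t row)
{
    return addr + row * (hsize + space);
}

/* Walk a vsize x hsize matrix: one burst of hsize words per row. */
void dma_2d(uint32_t addr, uint32_t hsize, uint32_t vsize, uint32_t space,
            void (*burst)(uint32_t start, uint32_t len))
{
    for (uint32_t row = 0; row < vsize; row++)
        burst(row_start(addr, hsize, space, row), hsize);
}
```

With hsize = 8 and space = 24, consecutive rows start 32 words apart, so the whole matrix is transferred as vsize bursts of 8 words each.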
5.3.2 Buffering
Bursting data requires internal buffering to guarantee that neither overflow nor underrun will occur on the dataflow side during the transfer. Therefore, the DMA interface
contains a data buffer, shown in Figure 5.9, separating the bus from the dataflow modules. The size of the buffer determines the maximum burst size.
The buffer contains a format transformation unit that allows splitting of words into
sub-word transfers on the read side and combining sub-words into words on the write
side. This is useful when processing pixels from the sensor, for example, calculating
the vector magnitude of the resulting image and then rescaling the value into a pixel
(8 bits). The pixels are individually processed, but combined into words (groups of 4 pixels) in the buffer before being transferred to memory.
Another useful feature is the possibility of reordering information in the buffer. The
buffer can be divided into 2 or 4 sections, and data is transferred by sending the first word
in each section, then the second word from each section and so on. In this addressing
mode, information from different data sets can be processed together; this is used, for example, in the combine unit and for calculating convolutions.
Note that in a system containing a single bus, the burst length should not exceed the buffer size. Exceeding the buffer size could potentially create a deadlock situation, for example when a read access is stalled, waiting for data to be written back to memory before accepting new data. In this situation, a bus architecture supporting split transfers has to be used, changing bus master even if the current transfer has not completely finished. In a system that contains more than one internal bus, the burst size is allowed to exceed the buffer size by inserting idle cycles on the bus when the dataflow network is empty or unable to accept data.

Figure 5.9: The DMA buffer can reorder data in several ways and extract (split and combine) sub-words for 8-, 16- and 32-bit processing. The buffer can operate in both directions, depending on whether it is used as a read buffer or a write buffer.
5.3.3 AGU interface
The DMA interface also supports an external address generation unit (AGU). It can
automatically restart the DMA transfer from a new base address when the previous
transfer has completed. Hence, there is no latency between transfers. This is useful
when accessing several blocks of information inside a larger memory space, e.g. a matrix
of blocks. An illustrative example is the transpose operation, which relocates blocks of data inside a matrix: the block is the actual transfer, and the matrix is the current address space.
Chapter 6
Optimizations
This chapter presents results from the architectural decisions taken in Chapter 5. Focus is on optimizing the image combine unit and the two-dimensional FFT by changing memory access patterns, and on minimizing internal buffering by sharing functionality.
Modifying the original memory access patterns increases the memory transfer rate,
while internal buffers are reused and shared between operations in order to save chip
area.
6.1 Internal buffering
Buffering is required on an algorithmic level as well as on an architectural level. On
an algorithmic level, buffers are used as time or sample delays. An example where delay
is part of the algorithm is a reverb filter. A reverb filter creates the effect of sound
reflected against walls in a room. The algorithm combines direct sound from the source
with reverberated sound of reflections, in other words the delayed sound or echo. Hence,
delay buffers are a part of the algorithm. Buffers are also required on an architectural
level, for example when folding an algorithm. Taking the pipeline FFT as an example,
delay feedback buffers are inserted to supply the butterfly units with the correct input
sequence, calculating x(n) + x(n + N/2). To access samples n and n + N/2 at the same time requires the sequence to be delayed by N/2 samples, if data appears on the input in sequential order.
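The delay-feedback reasoning above can be sketched behaviourally. The model below is a simplification, not the hardware: it assumes N = 8 real-valued samples and omits twiddle factors and pipelining, showing only how the N/2-element delay line lets x(n) and x(n + N/2) meet at the butterfly:

```c
#include <assert.h>

#define N 8  /* illustrative frame length */

/* Behavioural model of a radix-2 delay-feedback first stage: the
   first N/2 samples fill the delay line; when the second half of the
   frame arrives, the butterfly forms x(n) + x(n+N/2) and
   x(n) - x(n+N/2). Twiddle factors and pipelining are omitted. */
void sdf_stage1(const int *x, int *sum, int *diff)
{
    int delay[N / 2];
    for (int n = 0; n < N / 2; n++)
        delay[n] = x[n];                 /* first half fills the delay line */
    for (int n = 0; n < N / 2; n++) {
        sum[n]  = delay[n] + x[n + N / 2];
        diff[n] = delay[n] - x[n + N / 2];
    }
}
```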
The Xstream accelerator, shown in Figure 6.1, requires several buffers for storing,
processing, and reordering information. However, by moving functionality between
blocks, it is shown that buffers can be removed and memory requirements reduced.
The unoptimized pipeline is described in Section 6.1.1, and possible optimizations are
presented in Section 6.1.2. Buffers in the pipeline depend on two values, the minimum block transfer size Nburst, and the maximum transform size NFFT. Based on simulations, the actual values for these constants are selected in Section 6.3.
6.1.1 Original pipeline
The unoptimized pipeline requires buffering in almost every unit. Starting with the
DMA interface, data is read into a small buffer with the same size as the burst length
Nburst. Data is then reordered, to extract information pixel-by-pixel, which also requires a buffer. After this, data is propagated to the CORDIC and complex multiplier, which are non-buffered processing units and only need internal pipeline registers. The next unit is a one-dimensional pipeline FFT, requiring delay feedback elements. For a two-dimensional transform, a buffer is added to transpose the sequence before applying the one-dimensional FFT again. A buffer is also required for bit-reversing the FFT output, an operation that requires the complete sequence to be stored. FFT spectrum shift requires half the sequence to be stored in a buffer. Finally, the result is cached in the output DMA interface, before being written back to main memory. The original buffer requirements in the Xstream pipeline can be found in Table 6.1.

Figure 6.1: Functional units in the Xstream accelerator.
6.1.2 Optimized pipeline
By moving functionality between blocks, buffer space can be saved. Starting with the
reorder operation in the image combine block, this function is moved to the DMA input
buffer. The only modification required is that the DMA input buffer should support both
linear and interleaved addressing mode. The internal feedback in the one-dimensional
FFT can not be removed or combined with any other block, since it is constantly being
accessed. However, the transpose buffer for two-dimensional FFT transforms is reused
for bit-reversal and FFT spectrum shift. Those are in fact only two different addressing modes, which are supported by the transpose buffer instead. A shared buffer saves a large amount of memory, since bit-reverse and FFT shift require the complete sequence and half the sequence, respectively, to be stored. Table 6.2 shows the optimizations, moving buffers between functional units, and the memory requirements are found in Figure 6.5. The addressing mode for each operation is presented in Table 6.1, and is further explained in the next section.
Table 6.1: Required buffering for each processing block. Without optimization, the total memory is the sum of all buffers. Each buffer requires a different addressing mode, depending on the function that the block is evaluating.

Unit      Buffer               Buffer size   Addressing mode
DMA I/F   Input buffer         Nburst        Linear
Combine   Reorder              Nburst        Interleaved
FFT       FFT delay feedback   NFFT - 1      FIFO
Buffer    Transpose            Nburst^2      Col-row swap
Buffer    Bit-reverse          NFFT          Bit-reversed
Buffer    Spectrum shift       NFFT/2        FFT shift
DMA I/F   Output buffer        Nburst        Linear
Table 6.2: Buffers can be reused for several operations. Buffer memory can also be shared between units to save storage space. The actual buffer size is the maximum of each individual operation.

Unit      Buffer size           Addressing mode
DMA I/F   Nburst                Linear, Interleaved
FFT       NFFT - 1              FIFO
Buffer    max(Nburst^2, NFFT)   Col-row swap, Bit-reversed, FFT shift, Linear
Addressing modes
The addressing mode controls the data output sequence from the buffer. Figure 6.2
shows how the address bits are rearranged in each mode. Data is always written to
the buffer in linear addressing mode. When reading from the buffer, the address mode
controls the sequence order. Mapping functionality onto the same buffers, as described
in the last section, requires each buffer to support several modes, as stated in Table 6.1.
The input buffer supports linear addressing and data interleaving. The buffer unit
supports linear, bit-reversed, row and column swap for the transpose operation, and
FFT spectrum shift.
Both bit-reverse and FFT spectrum shift depend on the transform size, which requires the addressing modes to be flexible. Since bit-reverse and FFT spectrum shift are often used in conjunction, this addressing mode can be optimized. The spectrum
shift inverts the MSB, as shown in Figure 6.2(d), and the location of the MSB depends
on the transform size. However, in bit-reversed addressing, the MSB is actually the
LSB, and the LSB is always statically located. By reordering the operations, the cost
for moving the spectrum is a single inverter.
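The four address transformations can be sketched as pure functions of the linear write counter. The bit widths are illustrative (a 32-entry buffer for modes (a), (b) and (d), a 64-entry buffer for the 8 × 8 transpose), and the exact bit rotation assumed for the interleaved mode is a reconstruction, not taken verbatim from the design:

```c
#include <assert.h>
#include <stdint.h>

/* (a) Bit-reversed addressing over `bits` address bits. */
uint32_t bit_reverse(uint32_t a, int bits)
{
    uint32_t r = 0;
    for (int i = 0; i < bits; i++)
        r = (r << 1) | ((a >> i) & 1u);
    return r;
}

/* (b) Interleaved read, 4 groups of 8 words: emit word i of group 0,
   then word i of group 1, and so on (counter bits rotated). */
uint32_t interleave_4x8(uint32_t a)
{
    return ((a & 3u) << 3) | (a >> 2);
}

/* (c) Transpose of an 8 x 8 matrix: swap the row and column fields
   of the address (a = 8*row + col). */
uint32_t colrow_swap(uint32_t a)
{
    return ((a & 7u) << 3) | (a >> 3);
}

/* (d) FFT spectrum shift: invert the MSB of the address. */
uint32_t fft_shift(uint32_t a, int bits)
{
    return a ^ (1u << (bits - 1));
}

/* Combined bit-reverse + spectrum shift: in bit-reversed order the
   input LSB becomes the output MSB, so shifting the spectrum costs
   a single inverter on a0, independent of the transform size. */
uint32_t bitrev_shift(uint32_t a, int bits)
{
    return bit_reverse(a ^ 1u, bits);
}
```

The last function is the optimization described above: bitrev_shift(a, bits) equals fft_shift(bit_reverse(a, bits), bits), but the inverter sits on the statically located LSB instead of a size-dependent MSB.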
Figure 6.2: Buffers support multiple addressing modes and can be used for several operations. a_n represents the address bits, and the input address is a linear counter. (a) Bit-reversed addressing. (b) Interleaving with 4 groups and 8 values in each group. (c) Transpose for an 8 × 8 matrix by swapping rows and columns. (d) FFT spectrum shift by inverting the MSB.
[Figure: clock cycles per pixel vs. burst length Nburst (4 to 128), comparing separated and interleaved image storage, and the write/read components of the interleaved case.]
Figure 6.3: Average number of clock cycles for each produced pixel, including both write and read operations to and from memory. The dashed lines show the read and write parts of the interleaved version.
6.2 Memory bandwidth
To achieve a high memory transfer rate, the access pattern to the external memory is important. The memory has a non-uniform access time, which is described in Section 3.4.
The combine unit increases the bandwidth by interleaving data for faster access. The
transpose unit in the two-dimensional FFT increases the bandwidth by operating on
smaller blocks of data, applying the divide-and-conquer approach.
6.2.1 Image combine unit
The combine module, described in Section 5.2.1, uses interleaving to increase overall
bandwidth. Storing the images interleaved enables small burst transfers to be used.
When reading the interleaved images, the burst length is equal to the buffer size, filling
the complete buffer before reordering of the output. After a transfer, the buffer contains
a set of pixels from four different images, and a simple reorder operation outputs the
information pixel-by-pixel. By storing the images interleaved, burst transfers can be
used both for storing and reading the images from memory. A comparison of the
different approaches is shown in Figure 6.3.
Figure 6.3 shows how the reorder operation depends on the burst size. For a small
burst length, it is faster to store the images separately. When the burst length increases,
interleaving becomes a suitable alternative.
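As a small illustration (assuming four images and one element per word, which is a simplification of the packed-pixel format described in Section 5.3.2), the interleaved layout reduces to a fixed index mapping:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative interleaved layout for the four captured images:
   element p of image k is stored at index 4*p + k. One burst of
   Nburst consecutive words then delivers Nburst/4 elements from
   every image, instead of four separate short bursts. */
size_t interleaved_index(size_t p, size_t k)
{
    return 4 * p + k;
}
```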
[Figure: clock cycles per element vs. burst length Nburst (1 to 128) for the row-column FFT, the row-transpose-row FFT, and the transpose operation alone.]
Figure 6.4: Number of clock cycles per element for calculating a two-dimensional FFT using the two methods. The transpose operation is also shown separately. Using a small burst size, the transpose operation dominates the processing time.
6.2.2 Two-dimensional FFT
The assumption in Section 5.2.3 is that transposing the matrix and processing over rows is faster than processing the matrix over columns. However, transposing the large matrix at once would be expensive in time, since it involves either reading or writing columns. By dividing the matrix into smaller blocks, each block can be individually transposed and relocated.
Read and write operations are then performed in burst transfers and data is stored and
transposed in a local buffer. Performance of the transpose operation depends on the
internal buffer size.
Figure 6.4 shows a simulation of the row-column FFT and the row-transpose-row
FFT. For the latter approach, the transpose operation is required, also shown separately in the graph. When the burst length is short, the transpose overhead dominates
the computation time. When the burst length increases, the row-transpose-row FFT
improves the overall performance.
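The blocked transpose can be sketched in software, assuming an n × n matrix and a tile size b that divides n. Each b × b tile is read row-wise and written row-wise to the mirrored tile position, so both sides of the transfer can use bursts of length b:

```c
#include <assert.h>

/* Blocked transpose: move the matrix tile by tile so that reads and
   writes both access full rows (burst-friendly). In the hardware the
   tile lives in the local transpose buffer; here it is implicit. */
void transpose_blocked(const int *src, int *dst, int n, int b)
{
    for (int bi = 0; bi < n; bi += b)           /* tile row    */
        for (int bj = 0; bj < n; bj += b)       /* tile column */
            for (int i = 0; i < b; i++)         /* rows inside the tile */
                for (int j = 0; j < b; j++)
                    dst[(bj + j) * n + (bi + i)] =
                        src[(bi + i) * n + (bj + j)];
}
```

Larger b means longer bursts but a larger local buffer, which is exactly the trade-off simulated in Figure 6.4.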
6.3 Parameter selection
Nburst and NFFT are selected based on simulations from the previous sections. NFFT is the maximum transform size and was chosen as 2048 in Chapter 2. Hence, it remains to select a value for Nburst, which is a trade-off between required buffer size and memory bandwidth.

The shared buffer unit in the pipeline must be at least the size of NFFT to support the bit-reverse operation. It is reused for the transpose operation, which requires Nburst^2 elements. The buffer size is hence the maximum of the two. The delay feedback memory in the FFT can not be shared and will not be included in the simulations. The total memory requirements for the optimized pipeline are therefore

    MEM = Nburst + max(Nburst^2, NFFT)    (6.1)

[Figure: memory words (0 to 14K) vs. Nburst (1 to 128) for the unoptimized and optimized pipelines.]
Figure 6.5: Memory requirements in words for different values of Nburst when NFFT = 2048. The delay feedback memory is not included since it can not be shared with other units.
Figure 6.5 shows how the memory requirements depend on Nburst. When Nburst^2 > NFFT, the memory requirements increase rapidly, with a high area cost and a relatively low performance improvement according to Figure 6.3 and Figure 6.4. However, selecting Nburst = √2048 ≈ 45 complicates the transpose operation since it is not a power of 2. Selecting Nburst = 32 satisfies both conditions and has a relatively small overhead according to the simulations.
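Equation 6.1 can be evaluated directly; the small sketch below reproduces the trade-off (values in memory words, delay feedback memory excluded as in the figure):

```c
#include <assert.h>
#include <stdint.h>

/* Total buffer memory of the optimized pipeline, Equation 6.1:
   one DMA buffer of Nburst words plus the shared buffer, sized for
   the larger of the transpose (Nburst^2) and bit-reverse (NFFT)
   requirements. */
uint32_t mem_words(uint32_t nburst, uint32_t nfft)
{
    uint32_t transpose = nburst * nburst;
    return nburst + (transpose > nfft ? transpose : nfft);
}
```

With NFFT = 2048, the total stays close to 2K words for Nburst up to 32, and grows quadratically once Nburst exceeds √2048, matching Figure 6.5.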
6.4 Summary
Nburst is chosen to minimize the buffer requirements and optimize the memory transfer rate. With NFFT = 2048, Nburst is chosen as 32. From Figure 6.4, it can be seen that this generates an overhead of about 20% in the transpose operation. Compared to a solution with uniform access time, the overhead for a two-dimensional row-transpose-row FFT is about 60% due to the extra transpose operation, while the overhead for a traditional row-column FFT is about 550%.
Chapter 7
System integration
This chapter presents the integration of the Xstream accelerator into an embedded system and a prototype of the holographic microscope. The functionality of the embedded system is to capture, reconstruct, and present holographic images. The architecture is based on a SPARC-compatible microprocessor, the Xstream accelerator, and a memory controller for connecting external memory. Two interfaces provide functionality for capturing images from an external sensor device and for presenting reconstructed images on a monitor.
The chapter is divided into two sections. The first part describes hardware design,
and how the Xstream accelerator is integrated together with a microprocessor and
external interfaces. The second part describes software development and how to develop platform-independent source code using a hardware abstraction layer.
7.1 Hardware design
Embedded systems are, in contrast to general-purpose computers, designed for one specific task. They can be found in all kinds of electronic devices such as watches, DVD players, video games, ATMs, and car electronics. A system designed for the specific task of image reconstruction in digital holography can thus also be referred to as an embedded system.
To fully utilize the functionality of the Xstream hardware accelerator, it is integrated together with a CPU core and a memory controller to form a complete embedded
system for digital holographic imaging [35]. The system also includes a controller to
connect an external sensor device and a VGA generator to connect an external monitor.
Figure 7.1 shows the complete system, using an internal bus architecture.
7.1.1 Integrated soft processor
A soft microprocessor core, designed for embedded applications, extends system flexibility and programmability. The embedded microprocessor does not have the same computational efficiency as a hardware accelerator, but is useful for controlling and configuring on-chip modules. The term soft processor means that the design is available as a soft macro, i.e. as source code. Unlike hard macros, which are placed and routed designs for a specific fabrication process, soft macros can still be configured and modified. The soft macro has lower performance and logic density, but is available for fabrication in any technology and for rapid prototyping with programmable logic.

Figure 7.1: The system is based on a CPU (LEON), a hardware accelerator, a high-speed memory controller, an image sensor interface and a VGA controller.
The processor is used for configuration of the internal units and for controlling the
user interface. It is distributed under the name LEON, and is compatible with the
32-bit SPARC V8 architecture. It was originally developed by the European Space
Agency, but is now maintained by Gaisler Research [36]. LEON is chosen since it is
available as open-source code, uses free development tools and is continuously updated
and improved. The processor has an AMBA bus interface and an extended configuration
for both FPGA and ASIC synthesis. When targeting FPGA devices, an alternative
solution could be the MicroBlaze soft processor from Xilinx [37]. This would probably require fewer FPGA resources, but would limit the design to Xilinx devices only.
Another approach would be to use the embedded IBM PowerPC 405 available on Xilinx
Virtex-II Pro. However, the choice of CPU is not crucial for the project itself, since it is merely used for control and configuration.
7.1.2 External memory interface
The memory controller interface provides the system with 128 MB of external SDRAM.
This rather large memory bank is required to store high-resolution images from the
sensor as well as intermediate calculation results and tables. The hardware accelerator,
sensor interface and VGA controller have direct memory access through the DMA units.
The highest priorities on the bus are assigned to the sensor and VGA controllers, which
have the most demanding real-time constraints, since data is continuously streaming
and a delay would cause data loss. The hardware accelerator is insensitive to delays and is therefore assigned a lower priority.
7.1.3 Sensor interface
The sensor interface is configurable to support several different image sensor devices.
With a programmable vertical and horizontal synchronization delay and image size, it
is possible to capture images from any location on the sensor. For compatibility with various image sensors, the interface supports a resolution of up to 12 bits. Table 7.1 shows the specification of the two sensor devices currently used in the project. The sensor is clocked at a lower frequency than the system, a value configurable from the sensor interface. When the acquire signal is asserted, the sensor device starts the integration phase. After the image has been captured, data streams out from the sensor. Output signals from the sensor indicate the start of a new frame, a new line, and when pixels are valid. The interface is illustrated in Figure 7.2. The acquire signal requests the sensor to capture an image. Pixel data is produced by the sensor and synchronized using frame, line, and pixel clocks. The CMOS sensor has on-chip circuitry and can be configured using a serial interface supporting the inter-integrated circuit (I2C) protocol.

Figure 7.2: The sensor interface can be configured to work with a range of sensor devices. The information from the sensor is transferred to main memory by the DMA controller. Grey boxes represent external components.
7.1.4 Monitor interface
The result from the image reconstruction is presented on an external monitor, and the
integrated VGA controller produces pixel data and synchronization signals. An external digital-to-analog converter (DAC) is connected between the VGA controller and the monitor, as
shown in Figure 7.3. This circuit also contains a color palette, translating the 8-bit
values from the VGA controller to 24-bit (3 × 8) color. The palette is configured using
a parallel I/O interface.
The other side of the VGA controller is connected to the internal AHB bus using a
DMA controller, continuously streaming video data from a specific location in the main
memory, referred to as the video area. The DMA base address indicates the location from which data is copied, and the DMA transfer size is equal to the dimensions of the screen. High-level functionality such as printing text on the screen and drawing bitmap images is handled in software. The graphical user interface is hence fully software controlled.

Table 7.1: Supported sensor devices. NSx, NSy and ∆x are the number of pixels and the pixel size in the x and y directions, respectively.

Device           NSx    NSy    ∆x = ∆y   Precision   Frame Rate
KAF3200E (CCD)   2184   1472   6.8 µm    12 bits     2 fps
LM9638 (CMOS)    1032   1312   6.0 µm    10 bits     18 fps

Figure 7.3: The VGA interface connects to an external monitor to display images from the microscope. Grey boxes represent external components.
7.2 Software design
Software design is a major part of the project. The main purpose of the software is to
execute a reconstruction program on the embedded processor, which utilizes the hardware resources for capturing, processing and presenting images. The software program
is also capable of emulating hardware resources when they are not available. Hence,
the system can run on any computer with a C compiler. The software also includes a
floating-point model, which can be activated to compare the results from the computations with the full-precision result. The layers between the software application and
the actual hardware are shown in Figure 7.4.
7.2.1 Hardware API
The application programming interface (API) is a set of instructions or rules that enables different parts of a system to communicate. Three APIs have been defined to simplify communication between the hardware accelerator, sensor and monitor device. The
function calls are found in Table 7.2. Applications use the API to communicate with
hardware, e.g. capturing images from the sensor or drawing on the screen. Higher
level functions are defined on top of the API, e.g. the graphics device interface (GDI).
The GDI supports additional graphical functions for printing text and defining fonts,
manipulating bitmaps and opening a basic terminal window. These functions are useful
not only for creating a graphical user interface, but also for debugging the program.
7.2.2 Hardware abstraction layer
The reconstruction software is intended to execute on the embedded processor, but it is
more convenient to develop and debug the application on a computer. The only problem is that the hardware accelerator and sensor are not available, and must be emulated. To
simplify the development, hardware and software are separated using a hardware abstraction layer (HAL). The software side of the HAL is the API from the previous section, and the hardware side implements platform-specific functionality. With this approach, it is possible to transparently change the underlying hardware without affecting system behaviour. Another reason for using a hardware abstraction layer is that software and hardware can be developed in parallel, since they communicate using a well-defined interface.

Figure 7.4: Hardware abstraction layer separating the software from the platform-specific parts of the system.
On the embedded system, platform-specific drivers control the underlying hardware through configuration registers. On a desktop computer, hardware functionality must be completely emulated. Switching between real and emulated hardware is done transparently at compile time, based on the current platform.
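The compile-time switch can be sketched as follows. The macro and function names are illustrative, not the project's actual identifiers; the application only sees the common HAL call, while the platform decides which implementation is compiled in:

```c
#include <assert.h>

/* Compile-time HAL selection (sketch). TARGET_LEON is an assumed
   platform macro; the application always calls hal_capture(). */
#ifdef TARGET_LEON
static int hal_capture(int *dst, int n)
{
    /* Embedded target: program the sensor DMA registers (omitted). */
    (void)dst; (void)n;
    return -1;
}
#else
static int hal_capture(int *dst, int n)
{
    /* Host emulation: fill the frame with a stand-in test pattern,
       where the real emulation reads a bitmap image from a file. */
    for (int i = 0; i < n; i++)
        dst[i] = i & 0xFF;
    return 0;
}
#endif
```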
The emulation procedure is different for each hardware function. On the computer, capturing images from the sensor is implemented by reading a bitmap image from a file. From an application point of view, the behaviour of the system is the same. Output
to a monitor is emulated with a graphical window, as shown in Figure 7.5. Graphics
Table 7.2: API function calls for the hardware accelerator, sensor and monitor device.

API       Functions
Xstream   fft1D(int* dst, int* src, int length, int params)
          fft2D(matrix* dst, matrix* src, int params)
          memcpy(int* dst, int* src, int size)
          transpose(int* dst, int* src, dim size)
          clear(matrix* dst)
          execute(matrix* dst, matrix* src, int params)
Xgrab     get(SensorDevice* dev, matrix* dst, rec region)
Xdraw     init(screen* s, int resolution)
          clear(screen* s)
          putpixel(screen* s, point position, pixel value)
          getpixel(screen* s, point position)
          draw(screen* s, bitmap* bmp, point position)
          setpalette(palette* pal)
Figure 7.5: The platform-independent software running in Microsoft Visual Studio. The hardware accelerator, sensor and monitor are emulated in software.
is rendered directly on the screen instead of being copied to the video memory area.
The functionality of the hardware accelerator is completely emulated in software, and supports one- and two-dimensional FFTs and other functions specified by the API.
7.2.3 Memory management
Some of the C library functions can cause problems in the interface between hardware
and software. An example is the malloc function, which allocates memory on the heap.
The problem with malloc is that it can allocate memory starting from any arbitrary
memory address. However, DMA units transfer blocks of data, which are consecutive
memory addresses. SDRAMs are indexed using rows and columns, and a burst transfer
spanning over two rows is not valid and results in unpredictable behaviour. Hence,
the start address of a memory area should be aligned to the burst size. Since the
burst size is not a fixed value, it is more reliable to align the start address to the first
column in a row. The modified malloc function simply allocates slightly more memory than required, and then selects a start address inside this memory space that satisfies the alignment condition.
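The over-allocate-and-align scheme can be sketched as below. ROW_BYTES (1 KB) is an assumed SDRAM row size, not the board's actual value; the original pointer is stashed just below the aligned one so the block can later be released:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ROW_BYTES 1024u  /* assumed SDRAM row size */

/* Row-aligned allocation sketch: over-allocate, round the returned
   pointer up to a row boundary, and store the original pointer just
   below it for free_aligned(). */
void *malloc_aligned(size_t size)
{
    uint8_t *raw = malloc(size + ROW_BYTES + sizeof(void *));
    if (!raw)
        return NULL;
    uintptr_t p = (uintptr_t)(raw + sizeof(void *));
    p = (p + ROW_BYTES - 1) & ~(uintptr_t)(ROW_BYTES - 1);
    ((void **)p)[-1] = raw;      /* remember the original pointer */
    return (void *)p;
}

void free_aligned(void *ptr)
{
    if (ptr)
        free(((void **)ptr)[-1]);
}
```

A DMA burst starting at the returned address then never begins in the middle of an SDRAM row.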
Figure 7.6: (a) The prototype containing optics and the embedded system
for image reconstruction. The system is controlled using an external laptop
running a graphical user interface. The resulting images are presented on a
monitor. (b) Cross-section of the holographic microscope.
7.3 Prototype - A Holographic Microscope
The embedded system is integrated in a prototype of a holographic microscope, shown
in Figure 7.6(a). The light source is a 633 nm He-Ne laser, connected to a single mode
optical fiber with a core diameter of 4 µm. The laser beam is focused onto the fiber core
using a lens, to fully utilize the intensity from the source. At the other end, the optical fiber is connected to a beam splitter, dividing the light into two separate fibers for the
object and reference light, respectively. The light intensity in each fiber is controlled
using an optical attenuator and can also be completely blocked using an electric shutter.
The electric shutters are used when capturing the images, blocking one beam at a time.
A beam splitter cube combines the light from the fibers, one fiber illuminating the
object and one used as a point reference source. The sensor is mounted close to the
beam splitter cube and connected to the embedded system, previously described in this
chapter. A cross-section of the holographic microscope is shown in Figure 7.6(b).
7.3.1 FPGA platform
The embedded system runs inside an FPGA on a custom-designed development board. The board contains an interface to the sensor device and a connection to an external monitor. A single 128 MB 32-bit external memory is connected to the FPGA. This limits
the bandwidth compared to a dual memory configuration, but is chosen for practical
design reasons due to limited expansion possibilities.
The embedded system controls the electric signals to the sensor and the shutters.
Before the images can be captured, the sensor must be configured with an active window,
a black level compensation value, integration time, and readout directions. The active
window specifies where on the sensor to capture information. Selecting a small active window reduces the computation time, but also the quality. Black level compensation is a value to compensate for the natural offset of the pixel array; in other words, black should be represented with the value 0. The integration time controls the time during which the sensor array is illuminated by light. A longer integration time generates brighter images, but can cause the pixels to saturate. A short integration time results in a darker image with a lower signal-to-noise ratio. Selecting an integration time that utilizes the full dynamic range is important for the resulting quality.

Figure 7.7: The system is controlled through the graphical user interface (GUI), to capture images, reconstruct images and download results. The white window represents the sensor device, and the boxes show the active window of the sensor and the location of the object and the reference point.
7.3.2 User interface
The prototype can be connected to an external computer to control the functionality.
From a graphical interface, users can control the microscope to capture images, reconstruct holographic images, and download the result. The graphical user interface is
shown in Figure 7.7. Initially, a user selects a working area, i.e. the active window on
the sensor. The FFT size is then automatically adjusted to cover the selected image
area, zero-padding the unused space. When a user starts the reconstruction processes,
the prototype will begin to capture and process images. The image is reconstructed
using the hardware accelerator, and presented on an external monitor. The result and
intermediate images can be downloaded to a computer for further processing. Examples of images captured and processed with the holographic microscope (HoloMicro™) are presented in Figure 7.8.
Figure 7.8: Images captured in the holographic microscope. (a) Amplitude image of a fruit-fly wing. (b) Extracted depth information from a phase image of a slightly damaged microscope slide. (c) Unwrapped and background-corrected phase image of a mixture of KCl and NaCl crystals. The crystals are separated by their refractive index and identified as bright and dark objects, respectively.
7.3.3 Results
A comparison between the prototype, a standard computer, and an ASIC solution is
shown in Table 7.3. The last two rows in the table show recent related work scaled
to the same transform size, and the speed index measures the effectiveness per MHz
for the different solutions. The overhead represents time used by the processor and the
monitor interface, continuously transferring data from memory to an external monitor.
As the clock frequency increases, the overhead is reduced.
Scaling the clock frequency and frame rate between the prototype and a standard
computer shows that the hardware accelerator in the prototype is 120 times more efficient. The computer is running nearly 50 times faster, hence the actual speed-up is
about 2.5. An ASIC solution is more suitable since it allows a significantly increased
clock frequency. Based on a dual bus and memory architecture, this solution nearly
complies with the initial specification in Chapter 2, reconstructing holographic images
at 20 frames per second.
Table 7.3: Comparison between the FPGA prototype, a standard computer, and an
ASIC solution. The transform size is 2048 × 2048. Speed index is fps/freq, normalized
to 1.0 for the prototype.

Target          Technology      Frequency (MHz)   Mem I/F   Overhead   Rate (fps)   Speed index
PC              Pentium-III     1130              Single    0          0.2          0.008
FPGA            Virtex-E 1000   24                Single    25%        0.5          1.0
FPGA            Virtex-E 1000   24                Dual      25%        1.5          3.0
ASIC            0.13 µm         250               Dual      2.5%       20           3.8
Miyamoto [38]   0.35 µm         133               Single    0          2.6          0.9
Uzun [39]       Virtex-E 2000   35                Quad      0          2            2.7
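As a sanity check, the speed index and the efficiency figures quoted above can be recomputed directly from the frame rates and clock frequencies in the table:

```python
# Speed index = frame rate (fps) / clock frequency (MHz),
# normalized so the FPGA prototype scores 1.0 (values from Table 7.3).
fpga = 0.5 / 24       # prototype: 0.5 fps at 24 MHz
pc   = 0.2 / 1130     # Pentium-III: 0.2 fps at 1130 MHz
asic = 20.0 / 250     # ASIC estimate: 20 fps at 250 MHz

print(round(pc / fpga, 3))    # PC speed index, about 0.008
print(round(asic / fpga, 1))  # ASIC speed index, about 3.8
print(round(fpga / pc))       # per-MHz efficiency gain, ~118 (quoted as 120)
```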
Chapter 8
Conclusions
A hardware accelerator for digital holographic imaging has been presented, along with a
fully functional prototype of a holographic microscope. The hardware platform includes
a soft processor with a burst oriented memory controller, sensor and monitor interface,
and a two-dimensional signal processing datapath. Data is efficiently transferred using
DMA units, capable of performing two-dimensional burst operations. All components,
except for the soft processor, have been developed in this work.
The aim has been to construct a hardware accelerator with high computation efficiency,
low on-chip memory requirements, and a high bandwidth to external memory. This
thesis cover the work on developing a one-dimensional FFT core, constructing a twodimensional hardware accelerator, and integrating the building blocks on system level.
A flexible one-dimensional FFT has been fabricated in a 0.35 µm five metal layer CMOS
process from AMI Semiconductor. The FFT core supports a hybrid floating-point format, generating high precision with low area requirements. Several memory optimization schemes are presented, which show that it is possible to save memory by implementing an FFT with fixed-point input and hybrid floating-point output. Furthermore,
an architecture combining convergent block floating-point with hybrid floating-point
can save buffer memory by selecting an optimum block size for each delay feedback
memory. It is also shown that a bi-directional pipeline is suitable for convolution by
avoiding the memory-demanding bit-reverse operation.
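To illustrate the kind of saving involved, the sketch below models plain block floating-point: a block of wide fixed-point samples is stored as narrow mantissas plus a single shared exponent. This is a generic illustration only; the hybrid and convergent block floating-point formats developed in the thesis differ in detail, and all parameter names here are assumptions.

```python
def bfp_encode(block, mant_bits=10, word_bits=16):
    """Encode a block of `word_bits` fixed-point samples as
    `mant_bits` mantissas plus one shared exponent (shift)."""
    peak = max((abs(x) for x in block), default=0)
    # largest left-shift that keeps the peak inside the word
    shift = 0
    while peak and (peak << (shift + 1)) < (1 << (word_bits - 1)):
        shift += 1
    drop = word_bits - mant_bits          # bits discarded per sample
    mants = [(x << shift) >> drop for x in block]
    return mants, shift, drop

def bfp_decode(mants, shift, drop):
    """Reconstruct approximate sample values from the block."""
    return [(m << drop) >> shift for m in mants]
```

Only the mantissas are stored per sample; the exponent is amortized over the whole block, which is where the buffer-memory saving comes from.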
The two-dimensional accelerator requires several buffers for storing, processing and reordering information. However, it is shown that by moving functionality between blocks,
buffers can be removed and memory requirements reduced. Several operations can share
a single buffer by supporting multiple addressing modes, for example bit-reversed, spectrum shift, and matrix transpose.
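The shared-buffer idea can be illustrated as a single address generator with selectable read-out modes. The mode set and interface below are illustrative only, not the actual hardware:

```python
def bit_reverse(i, bits):
    """Reverse the `bits` least significant bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def address(i, mode, bits, cols=None):
    """Map a linear index to a buffer address for one of several
    read-out orders sharing the same physical buffer."""
    n = 1 << bits
    if mode == "linear":
        return i
    if mode == "bitrev":                  # undo FFT output ordering
        return bit_reverse(i, bits)
    if mode == "shift":                   # spectrum shift: swap halves
        return i ^ (n >> 1)
    if mode == "transpose":               # row-major <-> column-major
        rows = n // cols
        return (i % cols) * rows + i // cols
    raise ValueError(mode)
```

Because every mode is a permutation of the same address space, one physical memory serves all of them, and no intermediate reordering buffer is needed.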
Modern dynamic memories are burst oriented and have a non-uniform access time, which
depends on the memory access pattern. Several optimization schemes are presented to
increase the memory bandwidth, for example data interleaving and using a divide-andconquer approach for transposing large data sets. In the former case, storing the images
interleaved enables burst transfers for storing and reading the holographic images from
memory. In the latter case, the memory bandwidth is increased by introducing a fast
transpose unit. It is shown that transposing the memory and applying FFTs over rows
is faster than applying the FFT over columns. The efficient transpose unit drastically
improves the performance of the two-dimensional FFT.
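The divide-and-conquer transpose can be sketched in software as a tiled copy: each small tile is read and written as a contiguous unit, which maps onto burst transfers far better than a column-at-a-time walk through memory. The block size here is arbitrary; in hardware it would be matched to the memory burst length:

```python
def blocked_transpose(a, b=4):
    """Transpose an n x n matrix (list of lists) tile by tile.
    Each b x b tile is moved as one unit, so both the reads and
    the writes touch short contiguous (burst-friendly) runs."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(0, n, b):
        for j in range(0, n, b):
            # copy one tile, transposed, to the mirrored position
            for r in range(i, min(i + b, n)):
                for c in range(j, min(j + b, n)):
                    out[c][r] = a[r][c]
    return out
```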
Chapter 9
Future work
The resolution and frame rate of digital image sensors are continuously increasing. With
improved quality, holographic microscopes could seriously challenge existing technologies and provide additional features. A larger sensor device requires more processing to
reconstruct holographic images, hence an even greater need for hardware acceleration.
A digital holographic microscope has been constructed as a proof of concept. It is based
on a relatively small sensor device and a hardware platform based on an FPGA instead
of a fabricated ASIC. The prototype is therefore limited in image quality and reconstruction time. The next step is to migrate to a high-resolution sensor, moving from a modest 1.3 million pixels to 6.6 million pixels; this sensor is currently under evaluation.
On the hardware side, two approaches are considered. The first is to fabricate the hardware accelerator in a modern technology process, for example UMC 0.13 µm, which is available to the university. Fabricating an ASIC would result in reconstruction hardware that complies with the speed requirements in Chapter 2. The other approach is
to continue using programmable logic, but integrating the hardware accelerator in a
standard computer. PC cards that contain programmable logic are currently available
for both desktop computers and laptops. The advantage is that it combines the best of
two worlds, the generic microprocessor and the dedicated hardware accelerator. Hence,
functionality not supported in hardware can instead be evaluated in software.
In this work, memory optimizations are two-dimensional while modern dynamic memories actually have a three-dimensional organization. Implementing a memory controller
with access scheduling could improve memory bandwidth even further. Exploring different external memory configurations would also be of great interest.
Bibliography
[1] D. Burger, J. Goodman, and A. Kagi, “Limited bandwidth to affect processor
design,” IEEE Micro, vol. 17, no. 6, pp. 55–62, Nov. 1997.
[2] W. E. Kock, Lasers and Holography. 180 Varick Street, New York 10014: Dover Publications Inc., 1981.
[3] U. Schnars and W. Jueptner, Digital Holography. Berlin, Heidelberg: Springer, 2005.
[4] M. Gustafsson, M. Sebesta, B. Bengtsson, S.-G. Pettersson, P. Egelberg, and
T. Lenart, “High resolution digital transmission microscopy - a Fourier holography approach,” Optics and Lasers in Engineering, vol. 41(3), pp. 553–563, Mar.
2004.
[5] W. Xu, M. Jericho, I. Meinertzhagen, and H. Kreuzer, “Digital in-line holography
for biological applications,” Cell Biology, vol. 98, no. 20, Sept. 2001.
[6] E. O. Brigham, The Fast Fourier Transform and its Applications. Englewood Cliffs, New Jersey: Prentice-Hall, 1988.
[7] Transmeta Corporation, "Transmeta Efficeon TM8620 Processor," http://www.transmeta.com, Sept. 2004.
[8] K. K. Parhi, VLSI Digital Signal Processing Systems. 605 Third Avenue, New York 10158: John Wiley and Sons, 1999.
[9] U. Kapasi, S. Rixner, W. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. Owens,
“Programmable stream processors,” Computer, vol. 36, no. 8, pp. 54–62, Aug. 2003.
[10] U. Kapasi, W. Dally, S. Rixner, J. Owens, and B. Khailany, “The Imagine Stream
Processor,” in Proc. of IEEE International Conference on Computer Design: VLSI
in Computers and Processors, Freiburg, Germany, Sept. 16-18 2002, pp. 282–288.
[11] ISO/IEC 11172-3, “Coding of moving pictures and associated audio for digital
storage media,” Part 3 - Audio, ISO Standard, 1993.
[12] T. Lenart and S. Gadd, “A Hardware Accelerated MP3 Decoder with Bluetooth
Streaming Capabilities,” Master Thesis, Lund University, Sweden, 2001.
[13] T. Gleerup, H. Holten-Lund, J. Madsen, and S. Pedersen, “Memory architecture
for efficient utilization of SDRAM: a case study of the computation/memory access
trade-off,” in Proceedings of the 8th International Workshop on Hardware/Software
Codesign, San Diego, CA USA, May 3-5 2000, pp. 51–55.
[14] W. Xu, M. Jericho, I. Meinertzhagen, and H. Kreuzer, “An efficient memory arbitration algorithm for a single chip MPEG2 AV decoder,” IEEE Transactions on
Consumer Electronics, vol. 47, no. 3, Aug. 2001.
[15] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens, “Memory access scheduling,” in Proc. of the 27th International Symposium on Computer Architecture, Vancouver, BC Canada, June 10-14 2000, pp. 128–138.
[16] T. Lenart and V. Öwall, “A 2048 complex point FFT processor using a novel
data scaling approach,” in Proc. of IEEE International Symposium on Circuits
and Systems (ISCAS’03), Bangkok, Thailand, May 25-28 2003, pp. 45–48.
[17] IEEE 802.11a, “High-speed Physical Layer in 5 GHz Band,” http://ieee802.org,
1999.
[18] IEEE 802.11g, “High-speed Physical Layer in 2.4 GHz Band,” http://ieee802.org,
2003.
[19] P. Capozza, B. Holland, T. Hopkinson, and R. Landrau, “A single-chip narrowband frequency-domain excisor for a Global Positioning System (GPS) receiver,”
IEEE Journal of Solid-State Circuits, vol. 35, no. 3, pp. 401–411, Mar. 2000.
[20] J. Cooley and J. Tukey, “An algorithm for machine calculation of complex Fourier
series," Mathematics of Computation, vol. 19, pp. 297–301, Apr. 1965.
[21] K. Zhong, H. He, and G. Zhu, “An ultra high-speed FFT processor,” in International Symposium on Signals, Circuits and Systems, Lasi, Romania, July 10-11
2003, pp. 37–40.
[22] S. He, “Concurrent VLSI Architectures for DFT Computing and Algorithms for
Multi-output Logic Decomposition,” Ph.D. dissertation, Lund University, Lund,
Sweden, 1995.
[23] S. Kim, K.-I. Kum, and W. Sung, “Fixed-Point Optimization Utility for C and
C++ Based Digital Signal Processing Programs,” IEEE Transactions on Circuits
and Systems—Part II: Analog and Digital Signal Processing, vol. 45, no. 11, pp.
1455–1464, Nov. 1998.
[24] G. Even and W. J. Paul, “On the Design of IEEE Compliant Floating Point Units,”
IEEE Transactions on Computers, vol. 49, no. 5, pp. 398–413, May 2000.
[25] K. Kalliojärvi and J. Astola, "Roundoff Errors in Block-Floating-Point Systems,"
IEEE Transactions on Signal Processing, vol. 44, no. 4, pp. 783–790, Apr. 1996.
[26] E. Bidet et al., “A Fast Single-Chip Implementation of 8192 Complex Point FFT,”
IEEE Journal of Solid-State Circuits, vol. 30, pp. 300–305, Mar. 1995.
[27] A. Berkeman, “ASIC Implementation of a Delayless Acoustic Echo Canceller,”
Ph.D. dissertation, Lund University, Lund, Sweden, 2002.
[28] F. Kristensen, “Design and Implementation of Flexible OFDM Hardware,” Licentiate Thesis, Lund University, Sweden, 2004.
[29] Christophe Del Toso et al., “0.5-µm CMOS Circuits for Demodulation and Decoding of an OFDM-Based Digital TV Signal Conforming to the European DVB-T
Standard,” IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1781–1792,
Nov. 1998.
[30] B. Parhami, Computer Arithmetic. 198 Madison Avenue, New York 10016: Oxford
University Press, 2000.
[31] T. Lang and E. Antelo, “High-throughput 3D Rotations and Normalizations,” in
Proc. of Signals, Systems and Computers, Pacific Grove, CA USA, Nov. 4-7 2001,
pp. 846–851.
[32] C.-S. Wu and A.-Y. Wu, “Modified vector rotation CORDIC algorithm and its
application to FFT,” in Proc. of IEEE International Symposium on Circuits and
Systems (ISCAS’00), Geneva, Switzerland, May 28-31 2000, pp. 529–532.
[33] M. R. Portnoff, “An Efficient Method for Transposing Large Matrices and Its Application to Separable Processing of Two-Dimensional Signals,” IEEE Transactions
on Image Processing, vol. 2, no. 1, pp. 122–124, Jan. 1993.
[34] ARM Ltd., “AMBA Specification - Advanced Microcontroller Bus Architecture,”
http://www.arm.com, 1999.
[35] T. Lenart, V. Öwall, M. Gustafsson, M. Sebesta, and P. Egelberg, “Accelerating signal processing algorithms in digital holography using an FPGA platform,”
in Proc. of International Conference on Field-Programmable Technology, FPT’03,
Tokyo, Japan, Dec. 15-17 2003, pp. 387–390.
[36] J. Gaisler, “A portable and fault-tolerant microprocessor based on the SPARC
v8 architecture,” in Proc. of Dependable Systems and Networks, Washington DC,
USA, June 23-26 2002, pp. 409–415.
[37] Xilinx, “MicroBlaze Soft Processor Core,” http://www.xilinx.com/microblaze.
[38] N. Miyamoto, L. Karnan, K. Maruo, K. Kotani, and T. Ohmi, "A Small-Area
High-Performance 512-Point 2-Dimensional FFT Single-Chip Processor,” in Proc.
of European Solid-State Circuits (ESSCIRC’03), Estoril, Portugal, Sept. 16-18
2003, pp. 603–606.
[39] I. Uzun, A. Amira, and F. Bensaali, "A reconfigurable coprocessor for high-resolution image filtering in real time," in Proc. of the 10th IEEE International
Conference on Electronics, Circuits and Systems, Sharjah, United Arab Emirates,
Dec. 14-17 2003, pp. 192–195.