Implementing Fast MRI Gridding on GPUs via CUDA
Anthony Gregerson
University of Wisconsin – Madison
agregerson@wisc.edu
Abstract
Modern graphics processing units (GPUs) have
made high-performance SIMD designs available to
consumers at commodity prices. This has made them
an attractive platform for parallel applications;
however, developing efficient general-purpose code
for graphics-optimized architectures has proven
challenging. To explore the challenges and
opportunities of exploiting general-purpose GPU
processing, we implemented the non-equispaced
Fast-Fourier Transform algorithm, commonly known
as ‘gridding’, on a Geforce 8800 GPU using Nvidia’s
CUDA framework. We found that
optimizations in thread scheduling, data structures,
and memory access patterns could accelerate a naïve
GPU implementation by over 400%. Our optimized
GPU implementation showed a 114X speedup of the
convolution-interpolation kernel compared to the
same kernel running on an Intel Core 2 Duo CPU
when running on a Geforce 8800 GTX, and a 17.5X
speedup when running on a Geforce 8400M GS.
1. Overview
Modern medical imaging techniques, such as magnetic resonance imaging (MRI), involve computationally rich image manipulation processes. Recently, functional MRI (fMRI) studies have become a very active research topic in neuroscience, generating hundreds of papers each year. Because fMRI studies often monitor changing conditions and complex interactions, they frequently require high-powered multiprocessor systems that can provide sufficient floating point throughput to render images in real time. While high floating point
performance can be had from supercomputers and
clusters, these options may be prohibitively
expensive for some researchers.
While the floating point performance of
commodity CPUs has increased steadily, this has
been dwarfed by the improvements in modern
graphics processing units (GPUs). While CPUs and
GPUs offered comparable floating point performance
in early 2003, by the end of 2006, Nvidia’s G80 GPU
was capable of nearly 60 times the maximum
theoretical floating point performance of Intel’s Core
2 Duo [1]. This rapid improvement spurred
researchers to begin adapting medical imaging
algorithms to GPUs.
Recent attempts at porting MRI algorithms to the DirectX and OpenGL graphics APIs have shown the potential for integer-factor performance gains, but they have been limited by programming models that were not designed for general-purpose computation [3]. To encourage the use of GPUs as general-purpose computation engines,
Nvidia released the Compute Unified Device
Architecture (CUDA) in 2007. CUDA provides a
standardized architectural model and framework for
general-purpose GPU development.
Although CUDA targets general-purpose application development, early research has shown that achieving high efficiency on the CUDA platform requires careful program design and detailed knowledge of the architecture, even for relatively simple programs [9]. To explore the challenges of developing for CUDA in greater depth, we examine the optimization of the MRI gridding algorithm. Gridding is a computationally intensive process used in MRI image reconstruction. The gridding algorithm takes MRI measurements, which are recorded in Fourier k-space by a magnetic sensor rotating around the patient, and translates them into a 2-dimensional image [2].
2. Algorithm Description
2.1 - NFFT & Convolution-Interpolation
The MRI scan commonly samples values in
Fourier k-space on a spiral trajectory in either a
cylindrical or spherical coordinate system. To convert
these samples into a 2 or 3-dimensional image, an
inverse Fourier transform must be applied to the data
samples. However, the inverse Fourier transform is
an O(n²) operation that does not scale well to the
large datasets used in MRI scanning. It is much more
efficient to use an inverse FFT, which is an
O(n log n) operation, but standard FFT algorithms
cannot be applied to data sampled on cylindrical or
spherical coordinate systems, because they require
that the dataset be sampled on an equispaced
coordinate system. Transforming data which has been
sampled on a non-Cartesian coordinate system
requires a process variously known as the
Generalized FFT, Non-equispaced FFT (NFFT), or
Gridding. Gridding consists of the following three steps, outlined in the sketch below:

1. Discrete 2- or 3-dimensional convolution-interpolation of the data set onto a Cartesian coordinate system
2. Inverse FFT on the Cartesian set
3. Convolution kernel density compensation (also known as deapodization)
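As a high-level orientation, the sketch below shows how these three steps could be sequenced on the host. This is not the code evaluated in this paper: the convolution and deapodization kernels appear only as illustrative prototypes (a convolution-interpolation kernel is listed in the appendix), the launch configuration is arbitrary, and Nvidia's cuFFT library is shown as one possible choice for step 2.

#include <cufft.h>

/* Illustrative k-space sample type: a coordinate plus a complex value */
typedef struct { float x; float y; cufftComplex value; } sample_t;

/* Steps 1 and 3 are assumed to be provided elsewhere; only prototypes are shown */
__global__ void convolve_interpolate(const sample_t *samples, int n,
                                     cufftComplex *grid, int grid_size);
__global__ void deapodize(cufftComplex *image, int grid_size);

/* Host-side outline of the three-step gridding pipeline (device pointers assumed) */
void grid_reconstruct(const sample_t *d_samples, int n,
                      cufftComplex *d_grid, int grid_size)
{
    /* 1. Convolution-interpolation of the samples onto a Cartesian grid */
    convolve_interpolate<<<256, 32>>>(d_samples, n, d_grid, grid_size);

    /* 2. Inverse FFT on the Cartesian grid (in place) */
    cufftHandle plan;
    cufftPlan2d(&plan, grid_size, grid_size, CUFFT_C2C);
    cufftExecC2C(plan, d_grid, d_grid, CUFFT_INVERSE);
    cufftDestroy(plan);

    /* 3. Convolution kernel density compensation (deapodization) of the image */
    deapodize<<<256, 32>>>(d_grid, grid_size);
}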
Discrete 2-dimensional convolution is commonly
used in image processing. Discrete convolution
involves taking a single element from the input matrix or data set, multiplying it by a specially weighted constant matrix (referred to as the convolution kernel or window function), and adding the resulting matrix to the overlapping entries in the output matrix. The convolution window is then 'dragged' across all the elements in the input set. Depending on the window function used, discrete convolution can perform functions such as anti-aliasing, edge detection, sharpening, or softening. To perform interpolation, a symmetrical, center-weighted window function such as a Gaussian or 1st-order Kaiser-Bessel function is used.
2.2 – Software Implementation
A straightforward implementation of the
algorithm takes the following form in C:
for(i = 0; i < n; i++)
{
    /* Upper-left corner of the m x m window centered on input sample i */
    int x = ceil(in[i].x - m/2);
    int y = ceil(in[i].y - m/2);
    float val = in[i].value;

    /* Accumulate the weighted sample into the overlapping output entries */
    for(j = 0; j < m; j++)
        for(k = 0; k < m; k++)
            out[x+j][y+k] += val * kernel[j][k];
}
where the input data set contains n items and the
convolution kernel is size m x m. The algorithm is
characterized by a large number of memory
operations and significant potential data dependences
between outer loop iterations, limiting the application
of loop-unrolling optimizations and instruction reordering.
Assuming m >> 1 and ignoring loop overhead, the code above contains approximately 3nm² memory accesses and 2nm² arithmetic operations. For a machine with a perfect cache, such that the only requests serviced by main memory are compulsory cache misses, the minimum number of main memory accesses for a densely populated input set is n + r², where r is the side length of the output image (assuming square images). For image processing applications, r² ~ n, suggesting that an ideal caching system can reduce the number of memory accesses by a factor of m².
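To spell out the arithmetic behind these counts: each inner-loop iteration reads one output entry, reads one kernel weight, and writes one output entry (three memory accesses), and performs one multiply and one add (two arithmetic operations), over n·m² iterations. Up to constant factors, a perfect cache therefore reduces main memory traffic by O(m²):

\[
\text{memory accesses} \approx 3nm^2, \qquad \text{arithmetic operations} \approx 2nm^2
\]
\[
\frac{3nm^2}{n + r^2} \;\approx\; \frac{3nm^2}{2n} \;=\; \frac{3}{2}\,m^2 \;=\; O(m^2) \quad \text{when } r^2 \sim n .
\]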
3. Compute Unified Device
Architecture
3.1 – Overview
CUDA is outlined in detail in Nvidia’s
CUDA Programming Guide [1]. CUDA 1.0 capable
devices are based on Nvidia’s G80 GPU, first
introduced in the Geforce 8800 series graphics cards.
The G80 is a SIMD multiprocessor
architecture with a total of 128 stream processors
(SPs) grouped into 16 streaming multiprocessors
(SMs). Each CUDA kernel is organized as a multi-dimensional grid of thread blocks, with each block
being a multi-dimensional grouping of threads. The
dimensional aspect of blocks and grids is an
abstraction that can be used to facilitate the
calculation of thread IDs and array strides. Threads
are assigned to an SM at the block granularity. Each
SM executes threads in groups of up to 32, known as
warps. Each warp executes logically in SIMD
lockstep, though the G80 implementation physically
interleaves execution of quarter-warps in a 4-way
SMT manner.
CUDA is a host-coprocessor architecture. A
general-purpose CPU typically acts as the host
processor and a GPU acts as the coprocessor (known
as the ‘device’ in CUDA terminology). CUDA
provides a C-like programming environment with
extensions to differentiate between code and data
structures meant for the host and device execution.
The host is responsible for initializing the device,
transferring data between system and device memory,
and initiating execution of device kernels.
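A minimal sketch of this host/device division of labor is shown below; it is unrelated to the gridding code itself, and the kernel, names, and sizes are arbitrary. The host allocates device memory, copies input data across, launches a kernel, and copies the results back.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Trivial device kernel: scale each element of an array */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));                  /* device (global) memory */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); /* host -> device transfer */

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                 /* host initiates device kernel */

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost); /* device -> host transfer */
    printf("%f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}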
3.2 – Memory Hierarchy
CUDA devices have access to five different
types of memory: global, shared, local, constant, and
texture.
Global memory is a large memory
accessible by all threads. All device kernel code and
data must fit within global memory. The global
memory on the Geforce 8800 is 768MB, has a 64KB
block size, and a latency of 400-600 cycles [1].
Shared memory is local to each SM, and
data in shared memory may be shared between
threads belonging to the same thread block. If
multiple blocks are scheduled to the same SM, shared
memory will be evenly partitioned between them.
The size of the shared memory is 16KB per SM on
the G80. Each shared memory is 16-banked and has a 4-cycle latency.
Local memory provides the same latency
benefits as shared memory, but is local to the SPs and
can only be accessed on a per-thread basis. Local
memory is typically used as a fast scratchpad.
The constant memory (also referred to as the
symbol memory) is a special partition of global
memory meant to hold constant values. It is read-only
from the device scope and can only be written to by
the host. The constant memory is 2-level, with
distributed, hardware-managed caches in each SM.
Because the constant memory is read-only, these
caches do not have any coherence protocol. On the
G80, each constant cache is 8KB and has a low
access latency.
Texture memory is similar to constant
memory, but also includes support for special
optimized texture transformation operations. This
extra support comes at the cost of higher latency than
constant memory for normal read operations.
In addition to the five memory types, each
SM has access to a pool of 8192 registers, which are divided evenly among the threads executing in a block.
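The code below sketches how these memory spaces appear in CUDA C source. It assumes a launch with 256 threads per block, and texture memory is omitted because it is accessed through the separate texture API; the kernel itself is only illustrative.

__constant__ float coeff[16];              /* constant memory: read-only from device code */

__global__ void memory_spaces(const float *in, float *out)  /* in and out point into global memory */
{
    __shared__ float tile[256];            /* shared memory: visible to all threads in the block */
    float scratch;                         /* per-thread value, held in a register or local memory */

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];             /* stage a value in fast on-chip shared memory */
    __syncthreads();

    scratch = tile[threadIdx.x] * coeff[threadIdx.x % 16];
    out[i] = scratch;                      /* result goes back to global memory */
}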
3.3 – Computational Units
Each of the G80’s SMs contains 8 scalar
ALUs optimized for 32-bit single precision floating
point arithmetic. Each ALU is capable of executing a
floating point multiply or addition in a single cycle,
but because of the 4-way SMT execution of each
warp, an addition or multiplication instruction takes 4
cycles to complete on all threads.
CUDA 1.x devices do not support double
precision floating point arithmetic. All double
precision operations are automatically converted into
single precision. This is an important limitation
which restricts the class of scientific applications
which can be ported to CUDA. Double precision
support is scheduled for CUDA 2.x devices [1].
3.4 – Thread Scheduling Mechanisms
Each SP can support up to 96 active threads,
for a total of 12,288 threads on the Geforce 8800.
Threads are lightweight and have very low scheduling overhead, making CUDA devices with a large number of SMs similar to data-parallel architectures. The schedulers are capable of putting threads to sleep to tolerate memory latencies; however, each thread within a warp must execute in lock-step. In the case of divergent control paths
within threads of a single block, each instruction is
executed in every thread, with vector masks used to
disable instructions on untaken control paths.
4. CUDA Optimization Strategies
The G80 architecture provides a maximum theoretical throughput of over 300 GFLOPs; however, this is only achievable for kernels with a very large ratio of computation to memory access. As described in Section 2.2, the convolution-interpolation algorithm requires a large number of reads and writes. Because accesses to global memory have roughly two orders of magnitude higher latency than floating point operations, Amdahl's law suggests that the main optimization goal should be to increase the efficiency of memory accesses.
Our strategy for optimizing memory
accesses has three main components: making use of
cheap thread scheduling to tolerate memory stalls,
using the shared memory as a software-managed
cache and assigning work to maximize locality, and
organizing data structures to maximize bandwidth.
4.1 – Dividing Work Among Threads &
Exploiting Locality
The work in the convolution algorithm can
be allocated to hardware either by dividing the input
data set among threads or by dividing the output
matrix into sub-matrices. Additionally, work can be
assigned on a fine-grained or coarse-grained level.
This gives rise to four distinct division strategies.
An example of a fine-grained, input-driven
assignment is shown in Figure 1. This assignment
strategy is the simplest to implement, as a different
block is assigned to each input point. This method
generates a large number of thread blocks, which is
favorable for thread scheduling. However, if the input data set is densely distributed on the output matrix, there will be a large amount of data sharing between blocks, particularly as the size of the window function increases. Because data sharing between blocks is only possible via slow global memory, and because it is difficult to coordinate threads in different blocks, this method is not favorable.

Fig. 1 – Fine-grained, input-driven work assignment. The squares represent the area of the output grid read and written by each block. Overlapping area represents data shared between blocks.
Figure 2 shows a coarse-grained, input-driven assignment in which multiple consecutive sample points are assigned to the same thread block. This assignment method has low overhead and shares much more data among threads in the same block, offering greater locality. For these reasons, it is the preferred method of assigning work for architectures with hardware-managed caches. While CUDA does not provide a hardware-managed cache for global memory, it is desirable to temporarily write the output matrix values to shared memory, as they must be read multiple times in each block. However, the spiral trajectory of MRI sampling creates an irregularly shaped output sub-matrix when assigning work based on the input data set. This irregular shape makes it difficult to assign work to different threads within the block without using a diverging control scheme or complicated arithmetic on the thread ID, adding a large amount of overhead to a CUDA implementation.

Fig. 2 – Coarse-grained, input-driven assignment
Figure 3 demonstrates fine-grained, output-driven assignment. In this method, a thread block is assigned to each point on the output grid and reads from any input samples within the range of the window function. This implementation shares no data between blocks and has a Pareto-optimal number of memory write operations [7]; however, it incurs a large amount of read overhead because each thread must search for input data values within range of the output grid point. For sparse data sets it might be worthwhile to store the dataset in cacheable constant memory; however, for sparse data sets it could also be difficult to find enough data points in range to completely populate a warp, leading to performance degradation from unutilized processors.
Figure 4 shows the final configuration, coarse-grained, output-driven assignment (which we refer to as sector-based assignment). This is the method chosen in our implementation. It has the good data locality properties of coarse-grained assignment while allowing for simplified control schemes and thread assignment because of the regular shape of the output sector. The major disadvantage of this scheme is that data points must be sorted into the appropriate output sector a priori; however, this can be done with efficient CPU preprocessing prior to launching the device kernel.

Fig. 3 – Fine-grained, output-driven assignment. Red dots indicate samples within range of the three blocks shown.

Fig. 4 – Coarse-grained, output-driven assignment

The calculations shown in Section 2.2 suggest that a perfect caching system could reduce the number of accesses to global memory by a factor of 256 for a convolution window size of 16. However, CUDA does not provide any cache for global memory. While the constant and texture memory partitions are cacheable, these memories are read-only and thus cannot be used to store the output matrix, which accounts for half of all memory references.

To overcome the lack of caching on the global memory, we use the per-SM shared memory as a software-managed cache. Using shared memory
as a cache creates many challenges for kernels in
which data may be shared between different thread
blocks. There is no coherence mechanism to ensure
that the same memory location is not ‘dirty’ in the
shared memory on multiple SMs. Further, it is
impossible to implement such a mechanism as CUDA
1.x only allows thread synchronization on a per-block
basis. This means that the programmer must exercise
great care in algorithm design to ensure correctness
when attempting to share data between blocks. To
overcome this challenge, our implementation pads
the portion of the output matrix belonging to each
block by an amount equal to half the size of the
window function. This padded area represents an area
of the output matrix that belongs to another block.
Each block can write to its padded area safely. When all blocks have completed computation, each block writes out the portion of the output matrix it owns. When this is complete, the padded areas are added to the corresponding entries of the output matrix. Because CUDA 1.x does not support global synchronization, the addition of the padded areas is currently performed by the host.
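A sketch of this host-side merge is shown below. It assumes a buffer layout that is not part of the listing in the appendix: each of the blocks_per_row × blocks_per_row blocks stores its (S + 2P) × (S + 2P) padded tile contiguously in tiles[], where S is the sector size and P is half the window size, and the output grid is zero-initialized beforehand. Only one float plane is shown; the real and imaginary planes are merged the same way.

/* Host-side accumulation of padded sector tiles into the final output grid.
   Assumed layout: tile (by, bx) is stored row-major at
   tiles + (by*blocks_per_row + bx) * tile_dim * tile_dim, with tile_dim = S + 2*P. */
void merge_padded_tiles(const float *tiles, float *grid, int grid_size,
                        int S, int P, int blocks_per_row)
{
    int tile_dim = S + 2 * P;
    for (int by = 0; by < blocks_per_row; by++) {
        for (int bx = 0; bx < blocks_per_row; bx++) {
            const float *tile = tiles +
                (size_t)(by * blocks_per_row + bx) * tile_dim * tile_dim;
            for (int ty = 0; ty < tile_dim; ty++) {
                int gy = by * S + ty - P;   /* rows in the halo map outside the owned sector */
                if (gy < 0 || gy >= grid_size) continue;
                for (int tx = 0; tx < tile_dim; tx++) {
                    int gx = bx * S + tx - P;
                    if (gx < 0 || gx >= grid_size) continue;
                    grid[gy * grid_size + gx] += tile[ty * tile_dim + tx];
                }
            }
        }
    }
}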
4.2 – Thread Scheduling
Because stalling, starting, and scheduling
threads is inexpensive on CUDA, launching large
numbers of threads is a general method of tolerating
memory latencies for any device kernel. Achieving a
large number of active threads can be a complex
problem. Because threads are scheduled in groups of
blocks, it is necessary to create enough threads that
there are multiple blocks per SM. Nvidia suggests
creating at least 2 blocks per SM [1], which equates
to 32 blocks on the Geforce 8800. In the case of the
convolution kernel, each thread block can be
considered to represent a sector of the output matrix.
Assuming a square matrix, it is preferable to use a
power-of-two grid size that is a perfect square.
Increasing the number of blocks decreases the
amount of the output matrix owned by each block,
limiting the amount of locality that can be exploited
by storing the output matrix block in shared memory.
Conversely, the maximum sector size is limited by
the amount of shared memory available to each
block. Each element of the output matrix is 8-bytes in
size, representing a single-precision complex number.
Each shared memory is 16 KB, but this size must be
divided among the number of blocks assigned to each
SM. Assuming the minimum recommended value of
2 blocks per SM, each block gets 8 KB of memory,
representing 1K entries. The largest square matrix
that can fit in this area is 32 x 32. Increasing the
sector size beyond this amount requires complex
management of the shared memory resources within
each block. The alternative is to schedule only one block per SM. This would allow 2K entries per sector, which could accommodate up to a 45 x 45 matrix, offering more locality and less sharing between blocks at the expense of the ability to swap in additional blocks to tolerate global memory latencies.
The optimal approach depends on the specific
application as the importance of large sectors is
influenced by the density of the input data set and the
size of the convolution window.
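The shared-memory budget arithmetic above can be captured in a few lines; the helper below is purely illustrative. With 16 KB per SM, 8-byte complex entries, and two blocks per SM it yields a 32 x 32 sector, and with one block per SM a 45 x 45 sector.

#include <math.h>
#include <stdio.h>

/* Largest square sector (in grid points) that fits in the shared memory
   available to one block. */
int max_sector_side(int shared_bytes_per_sm, int blocks_per_sm, int bytes_per_element)
{
    int bytes_per_block = shared_bytes_per_sm / blocks_per_sm;
    int elements = bytes_per_block / bytes_per_element;
    return (int)floor(sqrt((double)elements));
}

int main(void)
{
    printf("2 blocks/SM: %d\n", max_sector_side(16 * 1024, 2, 8));  /* prints 32 */
    printf("1 block/SM:  %d\n", max_sector_side(16 * 1024, 1, 8));  /* prints 45 */
    return 0;
}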
4.3 –Data Structures
Both the global memory and shared memory
performance can benefit from optimization of the size
and location of data structures in memory.
While the global memory has relatively
large cycle latency, it also has a very high bandwidth
- 86 GB/s on the Geforce 8800 GTX. This high
bandwidth can be used to amortize latency costs by
consolidating the reads and writes of the threads in
the same warp. CUDA allows for coalescing reads
and writes provided the data structures and thread
access patterns meet certain criteria. Each thread in a
16-thread half-warp must be accessing its word such
that all the threads access a permutation of words in a
64- or 128-byte contiguous block. Further, the thread which accesses the word aligned at the beginning of the 64-byte boundary must have a thread ID x-coordinate value of zero in the thread block [1]. This
can be achieved without great difficulty because the
CUDA programming extensions provide pragmas for
automatically aligning data structures in memory
with arbitrary boundary lengths. To ensure that the
thread ID requirements were met, we structured our
thread grid to have a length of 16, such that all
groups of 16 threads would have one thread with a
thread ID of zero.
The shared memory has 16 banks, allowing
all 32 threads in a warp to access memory in a 4-
cycle read operation provided there are no bank
conflicts. Shared memory addressing uses a 4-byte offset between banks, the size of one single-precision floating point number. To ensure that there are no bank conflicts when accessing a data structure in shared memory, the size of the structure in units of 4 bytes must be co-prime to 16. Initially our implementation used an 8-byte datatype that represented a complex number containing single-precision real and imaginary parts. Accessing this structure resulted in only 50% bank utilization. To solve this problem, we divided our output array into two separate arrays: one holding real values and the other holding imaginary values. This arrangement changed the element size within each structure from 8 bytes to 4 bytes and improved bank utilization to 100%.
In cases of more complicated data structures, CUDA
provides a special memory transfer operation that
will automatically pad the data structures and return
the optimal stride for accessing the structure; however, manual optimizations may be more space-efficient and allow for a larger working set within shared memory.
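The sketch below illustrates the difference; it is not the appendix code. With an array of 8-byte complex structs in shared memory, threads t and t+8 of a half-warp land in the same bank (a two-way conflict, matching the 50% utilization observed), while splitting the real and imaginary parts into separate 4-byte arrays gives each of the 16 threads of a half-warp its own bank.

typedef struct { float real; float img; } complex_t;   /* 8-byte element */

__global__ void bank_utilization_demo(void)
{
    __shared__ complex_t tile_aos[256];   /* 8-byte stride: bank = (2*t) mod 16, so t and t+8 collide */
    __shared__ float tile_real[256];      /* 4-byte stride: bank = t mod 16, conflict-free */
    __shared__ float tile_img[256];

    int t = threadIdx.x;
    tile_aos[t].real = 0.0f;              /* two-way bank conflict within a half-warp */
    tile_real[t] = 0.0f;                  /* each thread of the half-warp hits a distinct bank */
    tile_img[t] = 0.0f;
}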
4.4 – Control & Computations
The SIMD nature of CUDA favors simple
non-divergent control schemes. CUDA devices
handle control divergence using a method similar to
vector masking. Each processor must execute all
instructions, including those in control paths not
active for the processor. Inactive instructions simply
are not committed. This means that complex and
diverging control schemes may degrade performance
by creating wasted execution cycles in which a
processor may do no useful work.
While Amdahl's Law suggests that most optimization effort should be spent on memory latency, there is room for some large-scale computational optimizations. The primary optimization is offline computation of the convolution window function. Because the window function is constant for a given window size, it can be computed offline and stored as a constant array.
Rather than computing an expensive operation such
as the Bessel function, the processor can simply store
the kernel values in a special array that is indexed by
the distance between an input sample point and the
grid point it is being interpolated onto. The processor
then only needs to calculate the distance between the
sample and each grid point within the kernel window,
quantize it, and use it to index into the pre-computed kernel array.
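Condensed from the listing in the appendix, the scheme looks as follows: the host tabulates the window function once at GRANULARITY spacing, copies it into constant memory, and the device replaces each window-function evaluation with a quantized-distance lookup.

#include <math.h>

#define KERNEL_RADIUS 8
#define GRANULARITY 0.5f

/* Host: tabulate the Gaussian window once, offline (distance normalized by the radius) */
void build_kernel_table(float *table, int entries)
{
    for (int i = 0; i < entries; i++) {
        float dist = (i * GRANULARITY) / (float)KERNEL_RADIUS;
        table[i] = expf(-(dist * dist) / 2.0f);
    }
}

/* Device: the pre-computed table lives in constant memory */
__constant__ float const_kernel[4 * KERNEL_RADIUS * KERNEL_RADIUS];

/* Weight for a sample at offset (dx, dy) from a grid point: compute the
   distance, quantize it, and use it as an index into the table. */
__device__ float kernel_weight(float dx, float dy)
{
    float dist = sqrtf(dx * dx + dy * dy);
    return const_kernel[(int)(dist / GRANULARITY)];
}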
4.5 – Using Shared vs. Constant Memory
Given that constant memory provides a hardware-managed distributed cache with fast access, it would initially seem to be ideal for storing the pre-computed convolution kernel; however, this is not the case. While the shared memory has 16 banks for handling multiple requests in each cycle, we found that multiple requests to different addresses in the constant memory cache were serialized. We performed some preliminary tests using random and structured access patterns to constant memory, and found that the latency of constant memory was approximately equal to that of shared memory when all threads were accessing the same location, but around three times slower than shared memory when each thread was accessing a random address.
Since accesses into the pre-computed kernel
are based on the distance between the sample and
each grid point within the kernel window, they rarely
hit on the same memory address in the same cycle.
This suggests that we can get better performance by
loading the kernel into shared memory rather than
using the constant memory. In practice, we found that
because kernel read operations make up a relatively
small fraction of computation time, there is only a
slight performance benefit to storing the kernel in
shared memory, and this benefit is offset by the
reduced working set that can be fit in shared memory.
For these reasons, we chose to use constant memory
to store the kernel.
5. Methodology
We implemented the convolution-interpolation algorithm in CUDA using some of the major optimization strategies described earlier. To gauge the efficiency of each strategy, we measured the improvement in computation time after applying each optimization to a naïve implementation using fine-grained, input-driven work assignment. We also compared the performance of the fully optimized CUDA implementation to an optimized single-threaded CPU convolution library. CUDA device code was run on a Geforce 8800 GTX and the CPU implementation was run on a 2.0 GHz Intel Core 2 Duo processor. All tests used a pool of 5000 samples uniformly distributed on a 512 x 512 grid. A Gaussian convolution kernel with a window size of 8 was used with 10X oversampling.
Additionally, we ran the same optimized
device code on the Geforce 8400M GS, a GPU based
on the Nvidia G86 core. The Geforce 8400M GS is a
low-cost integrated graphics chip featuring only 2
SMs and a memory bandwidth of 9.6 GB/s [8]. Our
purpose was to determine if code optimized for one
CUDA device would scale well to another.
6. Results

The relative speed-ups of the five CUDA-based optimizations are shown in Table 1.

Optimization                                                    Speedup
Caching input samples in shared memory                          40%
Aligning data structures for coalesced reads/writes
  and maximum bank utilization                                  70%
Pre-computing the convolution kernel and storing it
  in constant memory space                                      20%
Sector-driven assignment and use of shared memory
  as a software-managed cache                                   310%
Increasing the blocks per SM from 1 to 2                        35%

Table 1: Individual speedup based on optimization

The sector-driven division of work, which enabled the use of the shared memory as a software-managed cache for the output grid, produced the most significant performance increase, more than tripling the speed of the convolution-interpolation algorithm compared to the naïve implementation, which used only global memory.

Aligning data structures to optimize memory bandwidth and improve shared memory bank utilization also provided a large performance increase, improving the baseline performance by 70%. Doubling the number of blocks per SM resulted in an unexpectedly large performance increase of 35%, indicating that block scheduling is an effective way of tolerating memory latencies.

The performance of the CUDA implementation running on the Geforce 8800 GTX and the Geforce 8400M GS is compared to the CPU implementation in Table 2.

Device                 Speedup vs CPU
Geforce 8800 GTX       114X
Geforce 8400M GS       17.5X

Table 2 – Speedup of CUDA implementations vs CPU implementation
The Geforce 8800 device showed an
impressive performance increase compared to the
Core 2 Duo, achieving a speedup of 114X; however, this speedup only corresponds to a throughput of approximately 20 GFLOPs, a small fraction of the maximum theoretical throughput for the 8800. This gap shows that our program's performance is still dominated by memory latency.
The Geforce 8400M GS produced
surprisingly good results, considering it is one of the
least powerful CUDA-capable devices available.
Performance also scaled well: compared to the 8800, the 8400 has one-eighth the processors and roughly one-ninth the memory bandwidth, yet performed only 6.5 times slower. This better-than-expected scaling may be attributed to the fact that the
8400 used the fine-grained, input-based block
assignment, which resulted in many more blocks per
SM. This allowed for more efficient block
scheduling.
7. Conclusions
Nvidia’s CUDA coupled with the Geforce
8800 GTX produced impressive speed increases on
the convolution-interpolation algorithm used in MRI
gridding. In addition to medical imaging,
convolution-interpolation is an algorithm with similar
characteristics to many general-purpose image
processing operations.
While the performance improvement is impressive, it should be noted that several of the
choices in the testing – such as the large kernel
window size and choice of an output grid size that
would fit within the distributed shared memory –
were more favorable to the data parallel capabilities
of the CUDA device, and that more generalized
testing would likely show a lesser improvement in
performance. However, the main purpose of this
paper was not to determine the absolute speed
difference between CPU and GPU implementations,
but rather to highlight optimization strategies
employed when porting an image processing
algorithm to CUDA.
With Intel and AMD developing their own
large-scale SIMD processor efforts [4] [5], the
creation of a standardized architectural model and
development environment for these massively SIMD
processors will be required for wide-scale adoption.
CUDA represents Nvidia’s attempt to create such an
environment. Whether it will be adopted as a general
standard is an open question. While this work has
shown that CUDA can be used to achieve large
speedups over traditional CPU architectures for data
parallel algorithms with a relatively small re-engineering effort, several of the optimization
strategies, such as the data structure layout, thread
scheduling, and use of shared vs. constant memory,
are closely tied to the underlying hardware
architecture.
CUDA still lacks some key features that
could be used to make the adaptation of gridding
algorithms easier. In particular the lack of a
coherence-backed cache system coupled with limited
support for atomic operations and global
synchronization make correct implementation of
algorithms with a large amount of data sharing a
challenge. To some degree, these limitations may be
addressed in future revisions of CUDA. CUDA 1.1
already adds support for some atomic operations on
1.1-compatible devices. Future revisions may bring
new opportunities for CUDA development.
8. Future Work
Our optimizations for 2D gridding could be
extended to support 3D imaging. Since interpolation
is an inherently lossy process, it would be worthwhile
to explore methods of using inexact calculations to
accelerate the processing algorithms, such as using a
fast estimation for the distance between a sample and
a grid point rather than an expensive Pythagorean
computation.
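One simple variant of this idea, not implemented in this work and shown only as an assumption of how such a shortcut could look, is to tabulate the window against squared distance, which removes the square root from the inner loop entirely at the cost of a non-uniformly sampled table.

#define KERNEL_RADIUS 8
#define SQ_TABLE_ENTRIES 256

/* Window values tabulated against squared distance on [0, 2 * KERNEL_RADIUS^2] */
__constant__ float const_kernel_sq[SQ_TABLE_ENTRIES];

__device__ float kernel_weight_sq(float dx, float dy)
{
    float dist_sq = dx * dx + dy * dy;                           /* no sqrtf needed */
    float scale = SQ_TABLE_ENTRIES / (2.0f * KERNEL_RADIUS * KERNEL_RADIUS);
    int idx = (int)(dist_sq * scale);
    return const_kernel_sq[idx < SQ_TABLE_ENTRIES ? idx : SQ_TABLE_ENTRIES - 1];
}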
While this study has focused on
performance and computational efficiency, it would
be useful to compare the power consumption and
instruction power efficiency between CPU and GPU
implementations.
Acknowledgements
The author would like to thank Doug Janes
for his advice on discrete convolution optimizations
and grid topologies.
References
[1] NVIDIA CUDA Compute Unified Device Architecture
Programming Guide,
http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf,
Accessed March 25, 2008
[2] H. Schomberg, and J. Timmer, “The gridding method
for image reconstruction by Fourier transformations,”
IEEE Transactions on Medical Imaging, vol. 14(3), pp.
596-607, 1995.
[3] T. Schiwietz, T. Chang, P. Speier & R. Westermann,
“MR image reconstruction using the GPU,”
Proceedings of the SPIE, vol. 6142, pp. 1279-1290,
2006.
[4] Intel IDF Preview: Tukwilla, Dunnington, Nehalem and
Larrabee, http://www.pcper.com/article.php?aid=534,
Accessed March 27, 2008.
[5] AMD Financial Analyst Day,
http://download.amd.com/Corporate/MarioRivasDec2007AMDAnalystDay.pdf, Accessed March 27, 2008.
[6] T.S. Sorensen, T. Schaeffter, K.O. Noe and M. Schacht-Hansen, "Fast Gridding on Commodity Graphics Hardware," Proceedings of Workshop on Non-Cartesian MRI, ISMRM: Arizona, 2007.
[7] T.S. Sorensen, T. Schaeffter, K.O. Noe and M. S.
Hansen, “Accelerating the Nonequispaced Fast Fourier
Transform on Commodity Graphics Hardware,” IEEE
Transactions on Medical Imaging, vol. 27, 538-547,
2007.
[8] Geforce 8400M,
http://www.nvidia.com/object/geforce_8400M.html,
Accessed May 07, 2008.
[9] S. Ryoo, C. Rodrigues, S. Baghsorkhi, S. Stone, D.
Kirk, and W. Hwu. “Optimization Principles and
Application Performance Evaluation of a Multithreaded
GPU Using CUDA,” PPoPP 2008, ACM: Salt Lake
City, 2008.
Appendix
Partial source code listing for CUDA implementation
convInt.h
/* Header for convInt */
#ifndef _CONVINT_
#define _CONVINT_
#define KERNEL_RADIUS 8
#define GRANULARITY 0.5f
#define IVG 2 //Must be (GRANULARITY)^-1 (should be int)
#define GRID_SIZE 512
#define NUM_SAMPLES 8192
#define THREADS_PER_TBLOCK 32
#define BLOCK_SIZE 32
#define NUM_TBLOCKS 256 //Must be (GRID_SIZE/BLOCK_SIZE)^2
/* Represents complex number in rectangular coordinates */
typedef struct __align__(8) {
float real;
float img;
} complex;
/* Datapoint from MRI scan. Contains an X-Y coordinate corresponding to
Fourier k-space and a complex value */
typedef struct __align__(16) {
float x;
float y;
complex data;
} kdata;
__constant__ float const_kernel[4*KERNEL_RADIUS*KERNEL_RADIUS];
#endif //_CONVINT_
convInt2.cu
/* 2-Dimensional Convolution-Interpolation on CUDA */
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
/* Program constants, macros, and structs declared here */
#include <convInt.h>
/* CUDA Utilities */
#include <cutil.h>
/* GPU kernel code */
#include <convInt_kernel2.cu>
/* Calculates the k-space Gaussian weighting for a given distance */
float gaussian(float dist)
{
return expf(-(dist*dist)/2.0f);
}
/* GPU kernel wrapper
Initializes device kernel and executes data transactions between
device and host */
void conv(kdata* dataset, complex* out_grid, float* conv_kernel)
{
/* Initialize CUDA Device */
CUT_DEVICE_INIT();
/* Copy convolution kernel to symbol memory on device */
CUDA_SAFE_CALL(cudaMemcpyToSymbol(const_kernel, conv_kernel,
4*KERNEL_RADIUS*KERNEL_RADIUS*sizeof(float)) );
/* Allocate global device memory for data list & copy data */
kdata* data_pt;
int memsize = NUM_SAMPLES * sizeof(kdata);
CUDA_SAFE_CALL(cudaMalloc((void**)&data_pt, memsize) );
CUDA_SAFE_CALL(cudaMemcpy(data_pt, dataset, memsize, cudaMemcpyHostToDevice) );
/* Allocate global device memory for grid */
complex* grid_pt;
memsize = GRID_SIZE * GRID_SIZE * sizeof(complex);
CUDA_SAFE_CALL(cudaMalloc((void**)&grid_pt, memsize) );
/* Initialize & start GPU timer */
unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer) );
CUT_SAFE_CALL(cutStartTimer(timer) );
/* Compute execution configuration */
dim3 dimBlock(THREADS_PER_TBLOCK, 1);
dim3 dimGrid(NUM_TBLOCKS, 1);
/* Launch computation on GPU */
devcon<<<dimGrid, dimBlock>>>(data_pt, grid_pt);
CUT_CHECK_ERROR("Kernel execution failed");
/* Read data from device into host memory */
CUDA_SAFE_CALL(cudaMemcpy(out_grid, grid_pt, memsize, cudaMemcpyDeviceToHost) );
printf("Output check: %f\n", out_grid[10].real);
/* Stop timer, display result, and clean-up */
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Processing time: %f (ms) \n", cutGetTimerValue(timer));
CUT_SAFE_CALL(cutDeleteTimer(timer));
/* Clean-up device memory */
cudaFree(data_pt);
cudaFree(grid_pt);
}
/* Host-Main */
int main()
{
/* Pre-compute convolution-interpolation lookup table */
/* Allocate lookup table */
const int OS_KERNEL_RADIUS = (int) (KERNEL_RADIUS/GRANULARITY) + 1;
const int KERNEL_ENTRIES = 4*OS_KERNEL_RADIUS*OS_KERNEL_RADIUS;
float *conv_kernel = (float *)malloc((KERNEL_ENTRIES)*sizeof(float));
if(conv_kernel == NULL) {printf("Memory allocation error\n"); exit(1);}
/* Populate table */
float distance = 0.0f;
for(int i = 0; i < KERNEL_ENTRIES; i++)
{
conv_kernel[i] = gaussian((distance/((float) KERNEL_RADIUS)));
distance = distance + GRANULARITY;
}
/* Generate data set */
srand(1338);
kdata *dataset = (kdata *) malloc(NUM_SAMPLES * sizeof(kdata));
if(dataset == NULL) {printf("Memory allocation error\n"); exit(1);}
/* Generates a random point within the GRID with given granularity
points are sorted by their target block */
int blocks_per_row = GRID_SIZE / BLOCK_SIZE;
int num_blocks = blocks_per_row * blocks_per_row;
int samples_per_block = NUM_SAMPLES / num_blocks;
for(int i = 0; i < num_blocks; i++)
{
int x_start = (i % blocks_per_row) * BLOCK_SIZE;
int y_start = (i / blocks_per_row) * BLOCK_SIZE;
for(int j = 0; j < samples_per_block; j++)
{
kdata gtmp;
gtmp.x = x_start + 0.001f + GRANULARITY * (float) (rand() % ((BLOCK_SIZE
- 1) * IVG ));
gtmp.y = y_start + 0.001f + GRANULARITY * (float) (rand() % ((BLOCK_SIZE
- 1) * IVG ));
gtmp.data.real = rand() / (float) (RAND_MAX);
gtmp.data.img = rand() / (float) (RAND_MAX);
dataset[i*samples_per_block + j] = gtmp;
}
}
/* Allocate padded output grid */
complex *out_array = (complex *) calloc((GRID_SIZE*GRID_SIZE), sizeof(complex));
if(out_array == NULL) {printf("Calloc Memory allocation error\n"); exit(1);}
/* Perform convolution-interpolation */
conv(dataset, out_array, conv_kernel);
/* Cleanup Host Memory */
free(conv_kernel);
free(dataset);
free(out_array);
/* Finished */
printf("Program completed successfully\n");
exit(0);
}
convInt_kernel2.cu
/* Device kernel */
#include <convInt.h>
/* Calculates the index into the kernel lookup table for set of points */
__device__ int get_kernel_index(int x, int y, float xf, float yf)
{
float del_x = xf - x;
float del_y = yf - y;
float dist = sqrtf(del_x * del_x + del_y * del_y);
return (int) (dist / GRANULARITY);
}
/* Device convolultion-interpolation */
__global__ void devcon(kdata* dataset, complex* out_array)
{
int block_start = (NUM_SAMPLES / NUM_TBLOCKS) * blockIdx.x; /* first sample owned by this block */
int length = NUM_SAMPLES / NUM_TBLOCKS; /* samples assigned to each block */
int blocks_per_row = GRID_SIZE / BLOCK_SIZE;
int x_start = (blockIdx.x % blocks_per_row) * BLOCK_SIZE;
int y_start = (blockIdx.x / blocks_per_row) * BLOCK_SIZE;
/* Grid sub-section in shared memory; zeroed before accumulation */
__shared__ complex local_block[BLOCK_SIZE*BLOCK_SIZE];
for(int i = threadIdx.x; i < BLOCK_SIZE*BLOCK_SIZE; i = i + THREADS_PER_TBLOCK)
{
local_block[i].real = 0.0f;
local_block[i].img = 0.0f;
}
__syncthreads();
/* Perform convolution-interpolation */
for(int i = threadIdx.x; i < length; i = i + THREADS_PER_TBLOCK)
{
/* Each thread processes one sample from this block's presorted portion of the data set */
kdata datasample = dataset[block_start + i];
/* Find convolution points. */
int window_x1 = max(0, (int)ceil((datasample.x - x_start) - KERNEL_RADIUS));
int window_x2 = min((BLOCK_SIZE - 1), (int)floor((datasample.x - x_start) +
KERNEL_RADIUS));
int window_y1 = max(0, (int)ceil((datasample.y - y_start) - KERNEL_RADIUS));
int window_y2 = min((BLOCK_SIZE - 1), (int)floor((datasample.y - y_start) +
KERNEL_RADIUS));
/* Apply interpolation kernel to points within window */
for(int j = window_x1; j < window_x2 + 1; j++)
{
for(int k = window_y1; k < window_y2 + 1; k++)
{
int out_idx = j*BLOCK_SIZE + k;
int lidx = get_kernel_index(j, k, (datasample.x - x_start),
(datasample.y - y_start));
float conv_mul = const_kernel[lidx];
local_block[out_idx].real += datasample.data.real * conv_mul;
local_block[out_idx].img += datasample.data.img * conv_mul;
/* Ensure all threads finished updating their grid point before
attempting to read from the next point */
__syncthreads();
}
}
}
/* Copy local block back to global memory; the row stride of the output is the full grid width */
__syncthreads();
for(int i = 0; i < BLOCK_SIZE; i = i + 1)
for(int j = threadIdx.x; j < BLOCK_SIZE; j = j + THREADS_PER_TBLOCK)
out_array[(y_start + i)*GRID_SIZE + x_start + j] =
local_block[i*BLOCK_SIZE + j];
}