
GPGPUs and
their programming
Sándor Szénási
Aug 2013
(ver 1.1)
©Sándor Szénási
Table of contents
1. Introduction
2. Programming model
1. Basics of CUDA environment
2. Compiling and linking
3. Platform model
4. Memory model
5. Execution model
3. Programming interface
1. Using Visual Studio
2. Compute capabilities
3. CUDA language extensions
4. Asynchronous Concurrent Execution
5. CUDA events
6. Unified Virtual Address Space
Table of contents (2)
4. Optimization techniques
1. Using shared memory
2. Using atomic instructions
3. Occupancy considerations
4. Parallel Nsight
5. CUDA libraries
1. CUBLAS library
6. CUDA versions
1. CUDA 4 features
2. CUDA 5 features
7. References
1. Introduction
Computational power of GPUs
• GPUs have enormous computational power (mainly in the field of single
precision arithmetic)
Figure 1.1 [11]
Figure 1.4 [7]
Real-world applications
• GPU computing applications developed on the CUDA architecture by
programmers, scientists, and researchers around the world.
• See more at CUDA Community Showcase
Figure 1.2
http://www.nvidia.com/object/cuda-apps-flash-new-changed.html#
Graphical Processing Units
• A Graphical Processing Unit (GPU) is a specialized electronic circuit designed
to rapidly manipulate and alter memory to accelerate the building of images
in a frame buffer intended for output to a display [1].
• Modern GPUs are very efficient at manipulating computer graphics, especially
in 3D rendering. These functions are usually available through some standard
APIs, like:
◦ OpenGL (www.opengl.org)
◦ Direct3D (www.microsoft.com)
Shaders
• A shader is a computer program or a hardware unit that is used to do shading (the production of appropriate levels of light and darkness within an image) [2]
• Older graphics cards utilize separate processing units for the main tasks:
◦ Vertex shader – the purpose is to transform each vertex’s 3D position in
the virtual space to a 2D position on the screen
◦ Pixel shader – the purpose is to compute color and lightness information for all pixels on the screen (based on textures, lighting etc.)
◦ Geometry shader – the purpose is to generate new primitives and modify
the existing ones
Unified Shader Model
Figure 1.3 [3]
• Older graphics cards utilize separate processing units for each shader type
• It’s hard to optimize the number of the different shaders because different
tasks need different shaders:
◦ Task 1. :
geometry is quite simple
complex light conditions
◦ Task 2.:
geometry is complex
texturing is simple
Unified Shader
• Later shader models reduced the
differences between the physical
processing units (see SM 2.x and SM 3.0)
• Nowadays graphics cards usually contain only one kind of processing unit that is capable of every task. These units are flexibly schedulable to a variety of tasks
• The Unified Shader Model uses a consistent instruction set across all shader types. All shaders have almost the same capabilities – they can read textures and data buffers and perform the same set of arithmetic instructions [4]
What is GPGPU
• The Unified Shader Model means that the GPU uses the same processor cores to implement all functions. These are simple processing units with a small set of instructions.
• Therefore graphics card manufacturers can increase the number of execution units. Nowadays a GPU usually has ~1000 units.
• Consequently GPUs have massive computing power. It is worth utilizing this computing power not only in the area of computer graphics:
GPGPU: General-Purpose Computing on Graphics Processor Units
Programmable graphics cards
• At first it was a hard job to develop software components for graphics cards: the developer had to use the shaders' own low-level languages.
• Nowadays the graphics card manufacturers support software developers with convenient development frameworks:
◦ Nvidia CUDA
◦ ATI Stream
◦ OpenCL
GPGPU advantages
• Outstanding peak computing capacity
• Favorable price/performance ratio
• Scalable with the ability of multi-GPU development
• Dynamic development (partly due to the gaming industry)
GPU disadvantages
• Running sequential algorithms on GPUs is not efficient
→ we have to implement a parallel version, which is not a trivial task (and not always worth it: calculating a factorial, etc.)
• GPU execution units are less independent than CPU cores
→ the peak performance is available only in some special (especially data parallel) tasks
• Graphics cards have a separate memory region and the GPU cannot directly access the system memory. Therefore we usually need some memory transfers before the real processing
→ we have to optimize the number of these memory transfers. In some cases these transfers make the whole GPU solution unusable
• GPGPU programming is a new area, therefore the tools are less mature and the development time and cost are significantly higher
CPU-GPGPU comparison
• It is visible in Figure 1.4 that in the case of CPUs, most of the die area is used by the cache. In the case of GPUs, the amount of cache memory is minimal; most of the die area is used by the execution units
• To improve execution efficiency, GPUs employ a very useful feature: latency hiding. A load from device memory takes hundreds of cycles to complete (without cache). During this interval, instructions dependent on the fetched values block the thread. Utilizing the fast context-switching feature, the execution units can start working on other threads
→ to utilize this feature, the number of threads must be greater than the number of execution units
Figure 1.4 [5]
Memory architecture
• In the case of CPUs, we usually don’t care about the memory architecture; we use only the global system memory and registers
• In practice there are some other memory levels (different kinds of cache memories), but the CPU handles these automatically
• In the case of GPUs the developer must know the whole memory architecture
→ sometimes it is worth loading the often-requested variables into some faster memory areas (manually handling the cache mechanism)
Figure 1.5 [3] (Figure 4.2.1, Nvidia CUDA Programming Guide v2.0)
SIMT execution
• Sources of parallelism (SIMD < SIMT < SMT) [25]
◦ In SIMD, elements of short vectors are processed in parallel
◦ In SMT, instructions of several threads are run in parallel
◦ SIMT is somewhere in between - an interesting hybrid between vector
processing and hardware threading
• In the case of the well-known SIMD instructions, the developer must ensure that all the operands are in the right place and format. In the case of SIMT execution, the execution units can access different addresses in the global memory
• It is possible to use conditions with SIMT execution, but the branches of the condition will be executed sequentially:
→ Try to avoid conditions and loops in GPU code
Figure 1.6 [7]
2. Programming model
2. PROGRAMMING MODEL
2.1 Basics of CUDA environment
CUDA environment
• CUDA (Compute Unified Device Architecture) is the compute engine in Nvidia
graphics processing units or GPUs, that is accessible to software developers
through industry standard programming languages
• Free development framework, downloadable for all developers
• Similar to C / C++ programming languages
Releases
• 2007 June – CUDA 1.0
• 2008 Aug. – CUDA 2.0
• 2010 March – CUDA 3.0
• 2011 May – CUDA 4.0
• 2012 Oct. – CUDA 5.0
Supported GPUs
• Nvidia GeForce series
• Nvidia GeForce mobile series
• Nvidia Quadro series
• Nvidia Quadro mobile series
• Nvidia Tesla series
Required components
• Appropriate CUDA compatible
Nvidia graphics driver
• CUDA compiler
To compile .cu programs
• CUDA debugger
To debug GPU code
• CUDA profiler
To profile GPU code
• CUDA SDK
Sample applications,
documentation
Download CUDA
• CUDA components are available from:
https://developer.nvidia.com/cuda-downloads
Figure 2.1.1
CUDA platform overview
• The CUDA language is based on the C/C++ languages (host and device
code), but there are other alternatives (Fortran etc.)
• The CUDA environment contains some function libraries that simplify
programming (FFT, BLAS)
• Hardware abstraction mechanism hides the details of the GPU architecture
◦ It simplifies the high-level programming model
◦ It makes it easy to change the GPU architecture in the future
Figure 2.1.2 [5]
Separate host and device code
• Programmers can mix GPU code
with general-purpose code for
the host CPU
• Common C/C++ source code
with different compiler forks
for CPUs and GPUs
• The developer can choose the
compiler of the host code
Parts of the CUDA programming interface
C language extensions
• A minimal set of extensions to the C language, that allow the programmer to
target portions of the source code for execution on the device
◦ function type qualifiers to specify whether a function executes on the
host or on the device and whether it is callable from the host or from the
device
◦ variable type qualifiers to specify the memory location on the device of a
variable
◦ a new directive to specify how a kernel is executed on the device from
the host
◦ built-in variables that specify the grid and block dimensions and the block
and thread indices
Runtime library
• The runtime library is split into:
◦ a host component, that runs on the host and provides functions to control
and access the compute devices
◦ a device component, that runs on the device and provides device-specific
functions
◦ a common component, that provides built-in types, and a subset of the C
library that are supported in both host and device code
CUDA software stack
• The CUDA software stack is composed of several layers as illustrated in Figure 2.1.3:
◦ device driver
◦ application programming interface (API) and its runtime
◦ additional libraries (two
higher-level mathematical
libraries of common usage)
• Programmers can reach all the
three levels depending on
simplicity/efficiency requirements
• It is not recommended to use more than one of these levels within one component
• In these lessons we will always use the “CUDA Runtime” level. At this level we can utilize the features of the GPU (writing/executing kernels etc.) and the programming is quite simple.
Figure 2.1.3 [5]
Main steps of the CUDA development
• Analysis of the task
• Implement the C/C++ code
• Compile/link the source code
Analyzing the task
• Unlike with traditional programs, in addition to selecting the right solution we have to find the well-parallelizable parts of the algorithm
• The ratio of parallelizable/non-parallelizable parts can be a good indicator of whether it is worth creating a parallel version or not
• Sometimes we have to optimize the original solution (decrease the number of
memory transfers/kernel executions) or create an entirely new one
Implementing the C/C++ code
• In practice we have only one source file, but it contains both the CPU and the GPU source code:
◦ Sequential parts for the CPU
◦ Data Parallel parts for the GPU
Compiling and linking
• The CUDA framework contains several utilities, therefore compiling and linking usually means only the execution of the nvcc compiler
2. PROGRAMMING MODEL
2.2 Compiling and linking
CUDA compilation process details
Input
• One source file contains the CPU and
GPU codes (in our practice in C/C++
language)
Compilation
• The EDG preprocessor parses the source
code and creates different files for the
two architectures
• For the host CPU, EDG creates standard
.cpp source files, ready for compilation
with either the Microsoft or GNU C/C++
compiler
• For Nvidia’s graphics processors, EDG
creates a different set of .cpp files (using
Open64)
Output
• The output can be an object file, a linked
executable file, .ptx code etc..
Figure 2.2.1 [2]
Main parameters of the nvcc compiler (1)
Usage of the compiler
• Default path (in case of x64 Windows installation):
c:\CUDA\bin64\nvcc.exe
• Usage:
nvcc [options] <inputfile>
Specifying the compilation phase:
• --compile(-c)
Compile each .c/.cc/.cpp/.cxx/.cu input file into an object file
• --link(-link)
This option specifies the default behavior: compile and link all inputs
• --lib(-lib)
Compile all inputs into object files (if necessary) and add the results to the
specified output library file
• --run(-run)
This option compiles and links all inputs into an executable, and executes it
• --ptx(-ptx)
Compile all .cu/.gpu input files to device-only .ptx files. This step discards the host code for each of these input files
Main parameters of the nvcc compiler (2)
Setting directory information
• --output-directory <directory>(-odir)
Specify the directory of the output file
• --output-file <file>(-o)
Specify name and location of the output file. Only a single input file is
allowed when this option is present in nvcc non-linking/archiving mode
• --compiler-bindir <directory> (-ccbin)
Specify the directory in which the compiler executable (Microsoft Visual
Studio cl, or a gcc derivative) resides. By default, this executable is
expected in the current executable search path
• --include-path <include-path>(-I)
Specify include search paths
• --library <library>(-l)
Specify libraries to be used in the linking stage. The libraries are searched
for on the library search paths that have been specified using option '-L'
• --library-path <library-path>(-L)
Specify library search paths
Main parameters of the nvcc compiler (3)
Options for steering GPU code generation:
• --gpu-name <gpu architecture name> (-arch)
Specify the name of the NVIDIA GPU to compile for. This can either be a
'real' GPU, or a 'virtual' ptx architecture. The architecture specified with
this option is the architecture that is assumed by the compilation chain up
to the ptx stage.
Currently supported compilation architectures are: virtual architectures
compute_10, compute_11, compute_12, compute_13, compute_20,
compute_30, compute_35; and GPU architectures sm_10, sm_11, sm_12,
sm_13, sm_20, sm_21, sm_30, sm_35
• --gpu-code <gpu architecture name> (-code)
Specify the name of NVIDIA GPU to generate code for. Architectures
specified for options -arch and -code may be virtual as well as real, but the
'code' architectures must be compatible with the 'arch' architecture. This
option defaults to the value of option '-arch'.
Currently supported GPU architectures: sm_10, sm_11, sm_12, sm_13,
sm_20, sm_21, sm_30, and sm_35
• --device-emulation(-deviceemu)
Generate code for the GPGPU emulation library
Main parameters of the nvcc compiler (4)
Miscellaneous options for guiding the compiler driver:
• --profile (-pg)
Instrument generated code/executable for use by gprof (Linux only)
• --debug (-g)
Generate debug information for host code.
• --optimize<level>(-O)
Specify optimization level for host code
• --verbose(-v)
List the compilation commands generated by this compiler driver, but do not suppress their execution
• --keep (-keep)
Keep all intermediate files that are generated during internal compilation steps
• --host-compilation <language>
Specify C vs. C++ language for host code in CUDA source files.
Allowed values for this option: 'C','C++','c','c++'.
Default value: 'C++'
Compiling example
C:\CUDA\bin64\nvcc.exe
-ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"
-I"C:\CUDA\include"
-I"c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include"
-I"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc„
-L"c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\lib\amd64“
--host-compilation C++
--link
--save-temps
"d:\hallgato\CUDA\sample.cu"
Overview of compilation
Figure 2.2.2: nvcc.exe splits sample.cu into host code (sample.cpp1.ii, compiled by cl.exe into sample.obj) and device code (sample.ptx, compiled by ptxas.exe into sample_sm_10.cubin); the object files and the libraries are then linked into sample.exe
2. PROGRAMMING MODEL
2.3 Platform model
CUDA platform model
• As visible in Figure 2.3.1, the CUDA environment assumes that all the threads are executed on a separate device
• Therefore we have to separate the host machine (responsible for memory allocations, thread handling) and the device (responsible for the execution of the threads)
Figure 2.3.1 [5]
Asynchronous execution
• One host can control more than one CUDA device
• In the case of Fermi and later cards, one device can run more than one thread group (kernel) in parallel
• In the case of Kepler and later cards, any kernel can start other kernels
Inside one CUDA device
• Figure 2.3.2 illustrates the CUDA hardware model for a device
• Every device contains one or more multiprocessors, and these multiprocessors contain one or (more frequently) more SIMT execution units
Inside one multiprocessor
• SIMT execution units
• Registers
• Shared memory (available for all threads)
• Read-only constant and texture cache
Figure 2.3.2 [5]
Device management
Number of CUDA compatible devices
• The result of the cudaGetDeviceCount function is the number of CUDA compatible devices

int deviceCount;
cudaGetDeviceCount(&deviceCount);

• The function will store the number of CUDA compatible devices into the passed deviceCount variable
Select the active CUDA compatible device
• This function is used to select the device associated with the host thread. A device must be selected before any __global__ function or any function from the runtime API is called
• The parameter of this function is the number of the selected device (numbering starts with 0)

int deviceNumber = 0;
cudaSetDevice(deviceNumber);

• If this call is missing, the framework will automatically select the first available CUDA device
• The result of the function will affect the entire host thread
Detailed information about devices
The CUDA framework contains a class structure named cudaDeviceProp, to
store the detailed information of the devices. The main fields of this structure
are:
cudaDeviceProp structure
name
totalGlobalMem
sharedMemPerBlock
regsPerBlock
totalConstMem
warpSize
maxThreadsPerBlock
maxThreadsDim
maxGridSize
clockRate
minor, major
multiprocessorCount
deviceOverlap
2012.12.30
Name of the device
Size of the global memory
Size of the shared memory per block
Number of registers per block
Size of the constant memory
Size of the warps
Maximum number of threads by block
Maximum dimension of thread blocks
Maximum grid size
Clock frequency
Version numbers
Number of multiprocessors
Is the device capable to overlapped read/write
szenasi.sandor@nik.uni-obuda.hu
34
Acquire the detailed information about devices
• The result of the cudaGetDeviceProperties function is the previously introduced cudaDeviceProp structure.
• The first parameter of the function is a pointer to an empty cudaDeviceProp structure. The second parameter is the identifier of the device (numbering starts with 0)

int deviceNumber = 1;
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, deviceNumber);
Exam 2.3.1
Write out the number of available devices.
List the names of these devices.
List the detailed data of a user-selected device.
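A possible solution sketch for this exercise (host-only code; error checking omitted, output format chosen freely):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Number of CUDA compatible devices: %d\n", deviceCount);

    // list the name of every device
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("%d: %s\n", i, prop.name);
    }

    // detailed data of a user-selected device
    int selected = 0;
    printf("Select a device: ");
    scanf("%d", &selected);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, selected);
    printf("Name: %s\n", prop.name);
    printf("Global memory: %lu bytes\n", (unsigned long)prop.totalGlobalMem);
    printf("Shared memory per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors: %d\n", prop.multiprocessorCount);
    return 0;
}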
2. PROGRAMMING MODEL
2.4 Memory model
The memory concept
• Thread level
◦ Private registers (R/W)
◦ Local memory (R/W)
• Block level
◦ Shared memory (R/W)
◦ Constant memory (R)
• Grid level
◦ Global memory (R/W)
◦ Texture memory (R)
Device-host communication
• The global, constant and texture memory
spaces can be read from or written to by
the CPU and are persistent across kernel
launches by the same application
Figure 2.4.1 [5]
CUDA memory model – global memory
• Has the lifetime of the application
• Accessible for all blocks/threads
• Accessible for the host
• Readable/writeable
• Large
• Quite slow
Declaration
• Use the __device__ keyword
• Example:
__device__ float *devPtr;
__device__ float devPtr[1024];
Figure 2.4.2 [5]
CUDA memory model – constant memory
• Has the lifetime of the application
• Accessible for all blocks/threads
• Accessible for the host
• Readable/writeable for the host
• Readable for the device
• Cached
Declaration
• Use the __constant__ keyword
• Example:
__constant__ float *devPtr;
__constant__ float devPtr[1024];
Figure 2.4.3 [5]
CUDA memory model – texture memory
• Has the lifetime of the application
• Accessible for all blocks/threads
• Accessible for the host
• Readable/writeable for the host
• Readable for the device
• Available for image manipulating functions (texturing etc.); not a common byte-based array
Declaration
• Not discussed here
Figure 2.4.4 [5]
CUDA memory model – shared memory
• Has the lifetime of the block
• Accessible for all threads in this block
• Not accessible for the host
• Readable/writeable for threads
• Quite fast
• Size is strongly limited (see kernel start)
Declaration
• Use the __shared__ keyword
• Example:
__shared__ float *devPtr;
__shared__ float devPtr[1024];
• Dynamic allocation example – to place the arrays short array0[128] and float array1[64] in dynamically allocated shared memory:

extern __shared__ float array[];
__device__ void func() {
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
}
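The size of such a dynamically allocated shared region is passed at kernel launch as the third execution configuration parameter (see the kernel start syntax later). A minimal sketch with a hypothetical __global__ kernel; gridSize and blockSize stand for the launch configuration:

__global__ void funcKernel() {
    extern __shared__ float array[];   // sized at launch time
    // ... use array as shown above ...
}

// host side: reserve 128 shorts + 64 floats of shared memory per block
funcKernel<<<gridSize, blockSize, 128 * sizeof(short) + 64 * sizeof(float)>>>();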
Figure 2.4.5 [5]
CUDA memory model - registers
• Has the lifetime of the thread
• Accessible only for the owner thread
• Not accessible for the host/other threads
• Readable/writeable
• Quite fast
• Limited number of registers
• Not dedicated registers; the GPU has a fixed-size register set
Declaration
• Default storage area for device variables
• Example
__global__ void kernel() {
    int regVar;
}
Figure 2.4.6 [5]
CUDA memory model – local memory
• Has the lifetime of the thread
• Accessible only for the owner thread
• Not accessible for the host/other threads
• Readable/writeable
• Quite slow
Declaration
• Looks like a normal register, but these variables are stored in the "global" memory
• If there is not enough space in the registers, the compiler will automatically create the variables in local memory
• Example
__global__ void kernel() {
    int regVar;
}
Figure 2.4.7 [5]
Physical implementation of the CUDA memory model
Dedicated hardware memory
• The compiler will map here the
◦ registers,
◦ shared memory
• ~1 cycle
Device memory without cache
• The compiler will map here the
◦ local variables,
◦ global memory
• ~100 cycle
Device memory with cache
• The compiler will map here the
◦ constant memory,
◦ texture memory,
◦ instruction cache
• ~1-10-100 cycle
Figure 2.4.8 [5] (Figure 3.3.1, Programming Massively Parallel Processors course): multiprocessors with registers, shared memory, constant cache and texture cache, connected to the device memory
Memory handling
Static allocation
• Variables are declared as usual in C languages
• The declaration contains one of the previously introduced keywords (__device__, __constant__ etc.)
• The variable is accessible as usual in C languages; we can use it as an operand, a function parameter etc.
Dynamic allocation
• The CUDA runtime library has several memory handling functions. With these functions we can
◦ allocate memory
◦ copy memory
◦ free memory
• The memory is accessible via pointers
• Pointer usage is the same as common in C languages, but it is important to note that the device has a separate address space (device and host memory pointers are not interchangeable)
CUDA memory regions
Grouped by visibility (Figure 2.4.9)
• Accessible from the host as well: Global memory, Constant memory, Texture memory
• Accessible only on the device: Registers, Local memory, Shared memory

Grouped by accessibility (Figure 2.4.10)
Memory region | Allocation         | Host access | Device access
Global        | Dynamic allocation | R/W         | R/W
Constant      | Dynamic allocation | R/W         | R
Texture       | Dynamic allocation | R/W         | R
Shared        | Static allocation  | -           | R/W
Registers     | Static allocation  | -           | R/W
Local memory  | Static allocation  | -           | R/W
Dynamic allocation – allocate memory
• Programmer can allocate and deallocate linear memory with the
appropriate functions in the host code
• The cudaMalloc function allocates device memory, parameters:
◦ address of a pointer to the allocated object
◦ size of the allocated object (bytes)
• For example, to allocate a float vector with size 256:
float *devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(float));
Free device memory
• Programmer can free allocated device memory regions with the cudaFree function
• The only parameter of the function is a pointer to the object
float *devPtr = ...;
cudaFree(devPtr);
Transfer in device memory
• Programmer can copy data between the host and the devices with the cudaMemcpy function
• Required parameters:
◦ destination pointer
◦ source pointer
◦ number of bytes to copy
◦ direction of memory transfer
• Valid values for direction
◦ host → host
(cudaMemcpyHostToHost)
◦ host → device
(cudaMemcpyHostToDevice)
◦ device → host
(cudaMemcpyDeviceToHost)
◦ device → device
(cudaMemcpyDeviceToDevice)
Figure 2.4.11 [5]
float *hostPtr = ...;
float *devPtr = ...;
cudaMemcpy(devPtr, hostPtr, 256 * sizeof(float), cudaMemcpyHostToDevice);
Pinned memory
• In the host side we can allocate pinned memory. This memory object is
always stored in the physical memory, therefore the GPU can fetch it
without the help of the CPU
• Non-pinned memory can be stored in swap (in practice, on the hard drive), therefore it can cause page faults on access. So the driver needs to check every access
• To use asynchronous memory transfers the memory must be allocated by
the special CUDA functions:
◦ cudaHostAlloc()
◦ cudaFreeHost()
• It has several benefits:
◦ Copies between pinned memory and device memory can be performed
concurrently with kernel execution for some devices
◦ Pinned memory can be mapped to the address space of the device on
some GPUs
◦ On systems with a front-side bus, bandwidth of memory transfer is
higher in case of using pinned memory in the host
• Obviously the OS cannot allocate as much page-locked memory as pageable memory, and using too much page-locked memory can decrease the overall system performance
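A minimal sketch of allocating and releasing pinned host memory with the functions above (error handling omitted; the buffer size is an arbitrary example):

float *hostPtr;
// allocate 256 floats of page-locked (pinned) host memory
cudaHostAlloc((void**)&hostPtr, 256 * sizeof(float), cudaHostAllocDefault);
// ... use hostPtr, e.g. as the source/destination of cudaMemcpyAsync ...
cudaFreeHost(hostPtr);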
Zero-copy memory
• A special version of the pinned memory is the zero-copy memory. In this
case we don’t need to transfer data from host to the device, the kernel
can directly access the host memory
• Also called mapped memory, because in this case this memory region is mapped into the CUDA address space
• Useful when
◦ the GPU has no memory and uses the system RAM
◦ the host side wants to access to data while kernel is still running
◦ the data does not fit into GPU memory
◦ we want to execute enough calculation to hide the memory transfer
latency
• Mapped memory is shared between host and device therefore the
application must synchronize memory access using streams or events
• The CUDA device properties structure has information about the capabilities of the GPU: canMapHostMemory = 1 if the mapping feature is available
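A minimal zero-copy sketch, assuming canMapHostMemory is 1 (the kernel name and size are illustrative):

cudaSetDeviceFlags(cudaDeviceMapHost);                   // must be set before the device is initialized
float *hostPtr, *devPtr;
cudaHostAlloc((void**)&hostPtr, 256 * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);   // device-side alias of the same buffer
kernel<<<1, 256>>>(devPtr);                              // the kernel accesses host memory directly
cudaDeviceSynchronize();
cudaFreeHost(hostPtr);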
Portable pinned memory
• Pinned memory that is made available to all host threads (useful in multi-GPU environments)
2. PROGRAMMING MODEL
2.5 Execution model
CUDA execution model - threads
• Each thread has a unique ID. So each
thread can decide what data to work on
• It can be
◦ 1 dimensional
◦ 2 dimensional (Fig. 2.5.1)
◦ 3 dimensional
• Thread ID is available in the kernel via
threadIdx variable
• In case of multidimensional index space,
the threadIdx is a structure with the
following fields:
◦ threadIdx.x
◦ threadIdx.y
◦ threadIdx.z
Figure 2.5.1: examples of 1, 2 and 3 dimensional thread index spaces
CUDA thread blocks
• CUDA devices have a limit on the maximum number of threads executable in parallel. The index space of a complex task can be greater than this limit (for example, a maximum of 512 threads ↔ a 100x100 matrix = 10000 threads)
• In these cases the device will split the entire index space into smaller thread blocks. The scheduling mechanism will process all of these blocks and will decide the processing order (one-by-one or, in the case of more than one multiprocessor, in parallel)
• The hierarchy of blocks is called the grid
Block splitting method
• In CUDA, the framework will create, initialize and start all of the threads. The
creation, initialization of the blocks is the framework’s task too.
• The programmer can influence this operation via the following parameters
(kernel start parameters):
◦ Number of threads within a single block (1,2 or 3 dimension)
◦ Number of blocks in the grid (1 or 2 dimension)
CUDA thread block indexes
• Thread block also have a unique ID. So
a thread can reach the owner block
data
• It can be
◦ 1 dimensional
◦ 2 dimensional (Fig. 2.5.2)
◦ 3 dimensional (Fermi and after)
• Block ID is available in the kernel via
blockIdx variable
• In case of a multidimensional index space, the blockIdx is a structure with the following fields:
◦ blockIdx.x
◦ blockIdx.y
◦ blockIdx.z
Figure 2.5.2 [5]: a device running Grid 1, a 3x2 grid of thread blocks; Block (1, 1) is expanded to show its 5x3 threads
Global and local indices
Local identifier
• Every thread has a local identifier, stored in the previously introduced threadIdx variable
• This number shows the thread's place within the block
• The identifier of the "first" thread is (based on the block dimensions): 0 or [0,0] or [0,0,0]
Global identifier
• In case of more than one block, the local identifier is not unique anymore
• Knowing the identifier of the block that owns the thread (the previously introduced blockIdx variable) and the size of the blocks (blockDim variable), we can calculate the global identifier of the thread:
e.g. Global_x_component = blockIdx.x * blockDim.x + threadIdx.x
• The programmer cannot send unique parameters to the threads (for example, which matrix element to process). Therefore the thread must use its unique global identifier to get its actual parameters
Some useful formulas
• Size of the index space: Gx, Gy
(derived from the problem space)
• Block size: Sx, Sy
(based on the current hardware)
• Number of threads: Gx * Gy
(number of all threads)
• Global identifiers: (0..Gx - 1, 0..Gy - 1)
(unique identifier for all threads)
• Number of blocks: (Wx, Wy) = ((Gx - 1) / Sx + 1, (Gy - 1) / Sy + 1)
(number of blocks for the given block size)
• Global identifier: (gx, gy) = (wx * Sx + sx, wy * Sy + sy)
• Block identifier: (wx, wy) = ((gx - sx) / Sx, (gy - sy) / Sy)
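The same formulas as a code sketch (the 2D index space and block size values are illustrative):

// host side: launch configuration for a Gx x Gy index space
const int Gx = 1000, Gy = 800;                        // size of the index space
const int Sx = 16, Sy = 16;                           // block size
dim3 blockSize(Sx, Sy);
dim3 gridSize((Gx - 1) / Sx + 1, (Gy - 1) / Sy + 1);  // Wx, Wy

// device side: global identifier of the current thread
// int gx = blockIdx.x * blockDim.x + threadIdx.x;
// int gy = blockIdx.y * blockDim.y + threadIdx.y;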
Create a kernel
• A CUDA kernel looks like a simple C function, but there are some significant differences:
◦ there are some special keywords
◦ there are some special variables available in the function's body (the previously mentioned threadIdx etc.)
◦ it is not directly callable from the host code; there is a special kernel invocation syntax
CUDA keywords to mark functions
__device__
◦ Executed in: device
◦ Callable from: device
__global__
◦ Executed in: device
◦ Callable from: host
__host__
◦ Executed in: host
◦ Callable from: host
Start a kernel
• Any host function can call a kernel using the following syntax:
Kernel_name<<<Dg, Db, Ns, S>>>(parameters)
where:
• Dg – grid size
A dim3 structure, that contains the size of the grid
Dg.x * Dg.y = number of blocks
• Db – block size
A dim3 structure, that contains the size of the blocks
Db.x * Db.y * Db.z = number of threads within a single block
• Ns – size of the shared memory (optional parameter)
A size_t variable, that contains the size of the allocated shared memory for each block
• S – stream (optional parameter)
A cudaStream_t variable, that contains the stream associated to the
command
Built-in types
dim3 type
• In case of kernel start the size of the grid and the size of the blocks are
stored in a dim3 variable. In case of the grid this is a 1 or 2 dimensional, in
case of blocks this is a 1, 2 or 3 dimensional vector
• Example for usage of dim3 variables:
dim3 meret;
meret = 10;
meret = dim3(10, 20);
meret = dim3(10, 20, 30);
size_t type
• Unsigned integer. Used to store memory sizes
cudaStream_t type
• Identifies a stream. In practice an unsigned integer value
Kernel implementation
• The following example shows a simple kernel implementation (multiply all
values in the vector by 2):
__global__ void vectorMul(float* A)
{
int i = threadIdx.x;
A[i] = A[i] * 2;
}
• The __global__ keyword indicates that the device will execute the function
• In the case of kernel functions, there must not be any return value
• The name of the kernel is vectorMul
• The function has one parameter: the address of the vector
• As is clearly visible, the kernel does not have any information about the execution parameters (how many threads, how many blocks etc.)
• As discussed before, the kernel can use the threadIdx variable to determine
which vector element to multiply
Kernel invocation
• If the size of the vector is not greater than the maximum number of threads per block, one block is enough to process the entire data space
• We use 1x1 grid size (first parameter)
• We use 200x1 block size (second parameter)
float*A = ...
... Transfer data ...
vectorMul<<<1, 200>>>(A);
... Transfer results ...
• With these execution parameters the device will create one block and 200
threads
• The local identifiers of the threads will be one dimensional numbers from 0 to
199
• The identifier of the block will be 0
• The block size will be 200
Using multiple-block kernel
• If we want to process 2000 items, which is more than the maximum number of threads in a single block, we have to create more than one block in the device:
__global__ void vectorMul(float* A, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N)
{
A[i] = A[i] * 2;
}
}
• In the first line the kernel calculates its global identifier. This will be a globally unique number for each thread in each block
Invoking a multiple-block kernel
• If we want to process 1000 elements and the maximum block size is 512 (with Compute Capability 1.0), we can use the following parameters:
• 4 blocks (identifiers are 0, 1, 2 and 3)
• 250 threads (local identifiers are 0 .. 249)
float*A = ...
... Transfer data ...
vectorMul<<<4, 250>>>(A, 1000);
... Transfer results ...
• If we don’t know the number of elements at compile time, we can calculate
the correct block and thread numbers (N – vector size, BM – chosen block
size):
◦ Number of blocks: (N-1) / BM + 1
◦ Size of blocks: BM
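As a sketch, the same calculation in host code (using the vectorMul kernel above; BM is the chosen block size):

const int BM = 256;                     // chosen block size
int blocks = (N - 1) / BM + 1;          // number of blocks
vectorMul<<<blocks, BM>>>(A, N);        // threads with index >= N simply return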
Create the entire application
Exam 3.3.1
Create a CUDA application to solve the following problems (a possible solution sketch follows below):
• List the name of all CUDA compatible devices
• The user can choose one of them
• Allocate a vector A with size N
• Fill the A vector with random data
• Move these values to the GPU global memory
• Create and start a kernel to calculate A = A * 2
Use blocks of size BlockN
• Move back the results to A in system memory
• Write out the result to the screen
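A possible solution sketch (error checking omitted; N and BlockN are example values):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1000
#define BlockN 256

__global__ void vectorMul(float *A, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        A[i] = A[i] * 2;
}

int main() {
    // list the devices and let the user choose one
    int deviceCount, selected = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("%d: %s\n", i, prop.name);
    }
    scanf("%d", &selected);
    cudaSetDevice(selected);

    // fill the host vector with random data
    float A[N];
    for (int i = 0; i < N; ++i)
        A[i] = (float)rand() / RAND_MAX;

    // allocate device memory, copy, run the kernel, copy back
    float *devA;
    cudaMalloc((void**)&devA, N * sizeof(float));
    cudaMemcpy(devA, A, N * sizeof(float), cudaMemcpyHostToDevice);
    vectorMul<<<(N - 1) / BlockN + 1, BlockN>>>(devA, N);
    cudaMemcpy(A, devA, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devA);

    // write out the result
    for (int i = 0; i < N; ++i)
        printf("%f\n", A[i]);
    return 0;
}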
3. Programming interface
3. PROGRAMMING INTERFACE
3.1 Using Visual Studio
Visual Studio capabilities
• Latest CUDA versions support Visual Studio 2008/2010
• After installing CUDA some new functions appear in Visual Studio
◦ New project wizard
◦ Custom build rules
◦ CUDA syntax highlighting
◦ Etc.
Figure 3.1.1
New project wizard
• Select File/New/Project/
Visual C++/CUDA[64]/
CUDAWinApp
• Click “Next” on the
welcome screen
New project wizard
Figure 3.1.2
• Select application type
◦ Windows application
◦ Console application – we will use this option in our examples
◦ DLL
◦ Static library
• Select header files for
◦ ATL
◦ MFC
• Set additional options
◦ Empty project
◦ Export symbols
◦ Precompiled header
• Click “Finish” to generate
an empty CUDA project
Custom build rules
Figure 3.1.3
• Right click on project name, and select “Custom build rules”
• There are one or more CUDA custom build rules in the appearing list
• Select the appropriate one based on the following:
◦ whether you want to use the runtime API or the driver API
◦ CUDA version
CUDA related project properties
• Select the project, click "Project properties", then click "CUDA Build Rule"
• There are several options in multiple tabs (debug symbols, GPU arch., etc.)
• These are the same options as discussed in the nvcc compiler options part
• The "Command Line" tab shows the actual compiling parameters
Figure 3.1.4
3. PROGRAMMING INTERFACE
3.2 Compute capabilities
Compute capability (1)
• The differences between newer and older graphics cards are more than the number of execution units and the speed of the processing elements. Often there are really dramatic changes in the whole CUDA architecture. The compute capability is a sort of hardware version number.
• The compute capability of a device is defined by a major revision number and
a minor revision number.
• Devices with the same major revision number are of the same core
architecture
Details for hardware versions
• Compute capability 1.0
◦ The maximum number of threads per block is 512
◦ The maximum sizes of the x-, y-, and z-dimension of a thread block are 512, 512, and 64, respectively
◦ The maximum size of each dimension of a grid of thread blocks is 65535
◦ The warp size is 32 threads
◦ The number of registers per multiprocessor is 8192
◦ The amount of shared memory available per multiprocessor is 16 KB
organized into 16 banks
◦ The total amount of constant memory is 64 KB
◦ The cache working set for constant memory is 8 KB per multiprocessor
Compute capability (2)
• Compute capability 1.0 (cont.)
◦ The cache working set for constant memory is 8 KB per multiprocessor
◦ The cache working set for texture memory varies between 6 and 8 KB per
multiprocessor
◦ The maximum number of active blocks per multiprocessor is 8
◦ The maximum number of active warps per multiprocessor is 24
◦ The maximum number of active threads per multiprocessor is 768
◦ For a texture reference bound to a one-dimensional CUDA array, the maximum width is 2^13
◦ For a texture reference bound to a two-dimensional CUDA array, the maximum width is 2^16 and the maximum height is 2^15
◦ For a texture reference bound to linear memory, the maximum width is 2^27
◦ The limit on kernel size is 2 million PTX instructions
◦ Each multiprocessor is composed of eight processors, so that a
multiprocessor is able to process the 32 threads of a warp in four clock
cycles
• Compute capability 1.1
◦ Support for atomic functions operating on 32-bit words in global memory
Compute capability (3)
• Compute capability 1.2
◦ Support for atomic functions operating in shared memory and atomic
functions operating on 64-bit words in global memory
◦ Support for warp vote functions
◦ The number of registers per multiprocessor is 16384
◦ The maximum number of active warps per multiprocessor is 32
◦ The maximum number of active threads per multiprocessor is 1024
• Compute capability 1.3
◦ Support for double-precision floating-point numbers
• Compute capability 2.0
◦ 3D grid of thread blocks
◦ Floating point atomic functions (addition)
◦ __ballot() function is available (warp vote)
◦ __threadfence_system() function is available
◦ __syncthreads_count() function is available
◦ __syncthreads_and() function is available
◦ __syncthreads_or() function is available
◦ Maximum dimension of a block is 1024
◦ Maximum number of threads per block is 1024
Compute capability (4)
• Compute capability 2.0 (cont)
◦ Warp size is 32
◦ Maximum threads per multiprocessors is 1536
◦ Number of 32 bit registers per multiprocessors is 32K
◦ Number of shared memory banks is 32
◦ Amount of local memory per thread is 512KB
• Compute capability 3.0
◦ Atomic functions operating on 64-bit integer values in shared memory
◦ Atomic addition operating on 32-bit floating point values in global and
shared memory
◦ __ballot()
◦ __threadfence_system()
◦ __syncthreads_count()
◦ __syncthreads_and()
◦ __syncthreads_or()
◦ Surface functions
◦ 3D grid of thread blocks
◦ Maximum number of resident blocks per multiprocessor is 16
◦ Maximum number of resident warps per multiprocessor is 64
◦ Maximum number of resident threads per multiprocessor is 2048
Compute capability (5)
• Compute capability 3.0 (cont)
◦ Number of 32-bit registers per multiprocessor is 64K
• Compute capability 3.5
◦ Funnel Shift
◦ Maximum number of 32-bit registers per thread is 255
Device parameters (1)
Device name                                  | Number of MPs | Compute capability
GeForce GTX 280                              | 30            | 1.3
GeForce GTX 260                              | 24            | 1.3
GeForce 9800 GX2                             | 2x16          | 1.1
GeForce 9800 GTX                             | 16            | 1.1
GeForce 8800 Ultra, 8800 GTX                 | 16            | 1.0
GeForce 8800 GT                              | 14            | 1.1
GeForce 9600 GSO, 8800 GS, 8800M GTX         | 12            | 1.1
GeForce 8800 GTS                             | 12            | 1.0
GeForce 8500 GT, 8400 GS, 8400M GT, 8400M GS | 2             | 1.1
GeForce 8400M G                              | 1             | 1.1
Tesla S1070                                  | 4x30          | 1.3
Tesla C1060                                  | 30            | 1.3
Tesla S870                                   | 4x16          | 1.0
Tesla D870                                   | 2x16          | 1.0
Tesla C870                                   | 16            | 1.0
Quadro Plex 1000 Model S4                    | 4x16          | 1.0
Quadro FX 1700, FX 570, NVS 320M, FX 1600M   | 4             | 1.1
GeForce GTX 480                              | 15            | 2.0
GeForce GTX 470                              | 14            | 2.0
Device parameters (2)
Device name        | Compute capability
GeForce GT 610     | 2.1
GeForce GTX 460    | 2.1
GeForce GTX 560 Ti | 2.1
GeForce GTX 690    | 3.0
GeForce GTX 670MX  | 3.0
GeForce GT 640M    | 3.0
Tesla K20X, K20    | 3.5
• More details can be found at
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
3. PROGRAMMING INTERFACE
3.3 CUDA language extensions
CUDA language extensions
• The CUDA source is similar to a standard C or C++ source code and the development steps are the same too. The nvcc compiler does most of the job (separating the CPU and GPU code, compiling these sources, linking the executable); this is invisible to the programmer
• There are some special operations for creating kernels, executing kernels etc. These are usually extended keywords and functions, but most of them look like standard C keywords and functions
• CUDA source code can be C or C++ based, in practice we will use standard C
language in these lessons
• The runtime library split into:
◦ host component, that runs on the host and provides functions to control
and access the compute devices
◦ device component, that runs on the device and provides device-specific
functions
◦ common component, that provides built-in types, and a subset of the C
library that are supported in both host and device code
Common component – new variable types
Built-in vector types
• New built-in types for vectors:
◦ char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4
◦ short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4
◦ int1, uint1, int2, uint2, int3, uint3, int4, uint4
◦ long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4
◦ float1, float2, float3, float4, double2
• For example, int4 means a vector of 4 integers
• The components of the vectors are accessible via the x, y, z, w fields (according to the dimension of the vector)
• All of these vectors have a constructor function named make_<type>. For example: int2 make_int2(int x, int y)
dim3 type
• This type is an integer vector type based on uint3 that is used to specify
dimensions
• When defining a variable of type dim3, any component left unspecified is
initialized to 1
Common component – available functions
Mathematical functions
• Kernels run in the device therefore most of the common C functions are
unavailable (I/O operations, complex functions, recursion etc.)
• CUDA supports most of the C/C++ standard library mathematical functions.
When executed in host code, a given function uses the C runtime
implementation if available
◦ basic arithmetic
◦ Sin/cos etc.
◦ Log, sqrt etc.
Time functions
• The clock() function can be used to measure the runtime of kernels. The signature of this function:
clock_t clock();
• The return value is the actual value of a continuously incrementing counter (based on the clock frequency)
• Provides a measure for each thread of the number of clock cycles taken by
the device to completely execute the thread, but not of the number of clock
cycles the device actually spent executing thread instructions.
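A sketch of using clock() inside a kernel (the measured work and the timing array are illustrative):

__global__ void timedKernel(float *data, clock_t *timing) {
    clock_t start = clock();
    data[threadIdx.x] *= 2.0f;                 // the work being measured
    clock_t stop = clock();
    if (threadIdx.x == 0)
        timing[blockIdx.x] = stop - start;     // cycles measured by thread 0 of each block
}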
Device component - built-in variables
gridDim
• Type: dim3
• Contains the dimensions of the grid
blockIdx
• Type : uint3
• Contains the block index within the grid
blockDim
• Type : dim3
• Contains the dimensions of the block
threadIdx
• Type : uint3
• Contains the thread index within the block
warpSize
• Type : int
• Contains the warp size in threads
Device component - functions
Fast mathematical functions
• For some of the functions, a less accurate, but faster version exists in the
device runtime component
• It has the same name prefixed with __, like:
__fdividef, __sinf, __cosf, __tanf, __sincosf,
__logf, __log2f, __log10f, __expf, __exp10f, __powf
• The common C functions are also available, but it is recommended to use the
functions above:
◦ Faster, based on the hardware units
◦ Less accurate
Synchronization within a block
• void __syncthreads()
◦ effect: synchronizes all threads in a block. Once all threads have reached
this point, execution resumes normally
◦ scope: threads in a single block
• __syncthreads is allowed in conditional code but only if the conditional
evaluates identically across the entire thread block, otherwise the code
execution is likely to hang or produce unintended side effects
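A minimal sketch of the typical pattern: stage data in shared memory, synchronize, then use the data written by other threads (launched with a single block of 256 threads; the sizes are illustrative):

__global__ void reverseBlock(float *A) {
    __shared__ float temp[256];               // one element per thread
    int i = threadIdx.x;
    temp[i] = A[i];                           // each thread loads one element
    __syncthreads();                          // wait until the whole block has loaded
    A[i] = temp[blockDim.x - 1 - i];          // safe: temp is fully populated
}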
Device component – atomic functions
• An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory:
atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicInc,
atomicDec, atomicCAS, atomicAnd, atomicOr, atomicXor
• The operation is atomic in the sense that it is guaranteed to be performed
without interference from other threads
• They impair the efficiency of parallel algorithms
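A minimal atomicAdd sketch, e.g. building a 256-bin histogram in global memory (the bin calculation is illustrative):

__global__ void histogram(const int *input, int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[input[i] % 256], 1);  // concurrent increments of the same bin stay correct
}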
Warp vote functions
• Compute Capability 1.2 and after
• int __all(int condition)
Evaluates predicate for all threads of the warp and returns non-zero if and
only if predicate evaluates to non-zero for all of them
• int __any(int condition)
Evaluates predicate for all threads of the warp and returns non-zero if and
only if predicate evaluates to non-zero for any of them
Host component - functions
• Device handling functions
◦ See next chapter
• Context handling functions
◦ See next chapter
• Memory handling functions
◦ See next chapter
• Program module handling functions
◦ See next chapter
• Kernel handling functions
◦ See next chapter
Error handling
• cudaError_t cudaGetLastError()
Result is the error code of the last command
• const char* cudaGetErrorString(cudaError_t error)
Result is the detailed description of an error code
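These two functions are often wrapped in a small checking helper; a sketch (the macro name and the checked calls are our own examples):

#define CUDA_CHECK(call)                                            \
    {                                                               \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess)                                     \
            printf("CUDA error: %s\n", cudaGetErrorString(err));    \
    }

// usage with a hypothetical devPtr buffer and kernel
CUDA_CHECK(cudaMalloc((void**)&devPtr, 256 * sizeof(float)));
kernel<<<1, 256>>>(devPtr);
CUDA_CHECK(cudaGetLastError());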
3. PROGRAMMING INTERFACE
3.4 Asynchronous Concurrent Execution
Streams
Figure 3.4.1 [12]
• Applications manage concurrency through streams
• A stream is a sequence of commands (possibly issued by different host
threads) that execute in order. Different streams, on the other hand, may
execute their commands out of order with respect to one another or
concurrently; this behavior is not guaranteed and should therefore not be
relied upon for correctness [11]
• Streams support concurrent execution
◦ Operations in different streams may run concurrently
◦ Operations in different streams may be interleaved
Creating/destroying streams
• Stream is represented by a cudaStream_t type
• Create a stream with cudaStreamCreate function
◦ Parameters: pStream – pointer to a new stream identifier
cudaStream_t stream;
cudaStreamCreate(&stream);
• Destroy stream with cudaStreamDestroy function
◦ Parameters: pStream – stream to destroy
cudaStreamDestroy(stream);
• Common pattern to create/destroy an array of streams
cudaStream_t stream[N];
for (int i = 0; i < N; ++i)
cudaStreamCreate(&stream[i]);
for (int i = 0; i < N; ++i)
cudaStreamDestroy(stream[i]);
Using streams
• Some CUDA functions have an additional stream parameter
◦ cudaError_t cudaMemcpyAsync(
void *dst,
const void *src,
size_t count,
enum cudaMemcpyKind kind,
cudaStream_t stream = 0)
◦ Kernel launch:
Func<<< grid_size, block_size, shared_mem, stream >>>
• Concurrent execution may need some other requirements
◦ Async memory copy to different directions
◦ Page locked memory
◦ Enough device resources
• In case of a missing stream parameter the CUDA runtime uses the default stream (identified by 0)
◦ Used when no stream is specified
◦ Completely synchronous host to device calls
◦ Exception: GPU kernels are asynchronous with host by default if stream
parameter is missing
Using streams example
cudaStream_t stream1, stream2;
cudaStreamCreate ( &stream1) ;
cudaStreamCreate ( &stream2) ;
cudaMalloc ( &dev1, size ) ;
cudaMallocHost ( &host1, size ) ;
cudaMalloc ( &dev2, size ) ;
cudaMallocHost ( &host2, size ) ;
cudaMemcpyAsync ( dev1, host1, size, H2D, stream1 ) ;
kernel2 <<< grid, block, 0, stream2 >>> ( …, dev2, … ) ;
kernel3 <<< grid, block, 0, stream1 >>> ( …, dev1, … ) ;
cudaMemcpyAsync ( host2, dev2, size, D2H, stream2 ) ;
...
• All stream1 and stream2 operations will run concurrently
• Data used by concurrent operations should be independent
Stream synchronization
• Synchronize everything with cudaDeviceSynchronize()
blocks host until all CUDA calls are complete
cudaDeviceSynchronize();
• Synchronize to a specific stream with cudaStreamSynchronize
◦ Parameters: stream – stream to synchronize
cudaStreamSynchronize(stream);
• Programmer can create specific events within streams for synchronization
Operations implicitly followed by a synchronization
• Page-locked memory allocation
◦ cudaMallocHost
◦ cudaHostAlloc
• Device memory allocation
◦ cudaMalloc
• Non-async version of memory operations
◦ cudaMemcpy
◦ cudaMemset
• Change to L1/shared memory configuration
◦ cudaDeviceSetCacheConfig
Stream scheduling [12]
• Fermi hardware has 3 queues
◦ 1 Compute Engine queue
◦ 2 Copy engine queues
– Host to device copy engine
– Device to host copy engine
• CUDA operations are dispatched to devices in the sequence they were issued
◦ Placed in the relevant queue
◦ Stream dependencies between engine queues are maintained but lost
within an engine queue
• CUDA operation is dispatched from the engine queue if
◦ Preceding calls in the same stream have completed,
◦ Preceding calls in the same queue have been dispatched, and
◦ Resources are available
• CUDA kernels may be executed concurrently if they are in different streams
◦ Thread blocks for a given kernel are scheduled if all thread blocks for
preceding kernels have been scheduled and there still are SM resources
available
• Note a blocked operation blocks all other operations in the queue, even in
other streams
Concurrency support
• Compute Capability 1.0
◦ Support only for GPU/CPU concurrency
• Compute Capability 1.1
◦ Supports asynchronous memory copies
– Check asyncEngineCount device property
• Compute Capability 2.0
◦ Supports concurrent GPU kernels
– Check concurrentKernels device property
◦ Supports bidirectional memory copies based on the second copy engine
– Check asyncEngineCount device property
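A sketch of checking these properties at runtime before relying on concurrency:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.asyncEngineCount > 0)
    printf("Asynchronous/overlapped memory copies are supported\n");
if (prop.concurrentKernels)
    printf("Concurrent kernel execution is supported\n");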
Blocked Queue example
• Two streams with the following operations
◦ Stream1: HDa1, HDb1, K1, DH1
◦ Stream2: DH2
Figure 3.4.2 [12]
Blocked Kernel example
• Two streams with the following operations
◦ Stream1: Ka1, Kb1
◦ Stream2: Ka2, Kb2
Figure 3.4.3 [12]
3. PROGRAMMING INTERFACE
3.5 CUDA events
Create and destroy a new event
• The cudaEventCreate function creates a new CUDA event
cudaError_t cudaEventCreate(cudaEvent_t *event)
◦ The first parameter of the function is an event object pointer
◦ The function will create a new event object the passed pointer will
reference to this
◦ The result of the function is the common CUDA error code
◦ An example
cudaEvent_t test_event;
cudaEventCreate(&test_event);
• There is an advanced version of this function, called
cudaEventCreateWithFlags (see CUDA documentation)
• The cudaEventDestroy function destroys a CUDA event object
cudaError_t cudaEventDestroy(cudaEvent_t event)
◦ The first parameter the already existing event object to destroy
◦ An example:
cudaEvent_t test_event;
cudaEventCreate(&test_event);
cudaEventDestroy(test_event);
Record an event
• The cudaEventRecord function records an already existing event in a specified
stream
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0)
◦ The first parameter is the event to record
◦ The second parameter is the stream in which to record the event
• The event is recorded after all preceding operations in the given stream have
been completed (in case of zero stream it is recorded after all preceding
operations in the entire CUDA context have been completed)
• cudaEventQuery() and/or cudaEventSynchronize() must be called to determine when the event has actually been recorded (since this function call is asynchronous)
• If the event has been recorded, then this will overwrite the existing state
cudaEvent_t test_event;
cudaEventCreate(&test_event);
cudaEventRecord(test_event, 0); // use with zero stream
cudaEventRecord(test_event, stream); // use with non-zero stream
Synchronize an event
• The cudaEventSynchronize function synchronizes an event. It will wait until
the completion of all device operations preceding the most recent call to
cudaEventRecord() in the given stream
cudaError_t cudaEventSynchronize(cudaEvent_t event)
◦ The first parameter is the event to wait for
• If cudaEventRecord has not been called on the specified event the function will
return immediately
• Waiting for the event will cause the calling CPU thread to block until the event
has been completed by the device
cudaEvent_t start_event, end_event;
cudaEventCreate(&start_event);
cudaEventCreate(&end_event);
cudaEventRecord(test_event, 0);
call_kernel<<<…, …>>>(...);
cudaEventRecord(end_event, 0);
cudaEventSynchronize(start_event);
cudaEventSynchronize(end_event);
szenasi.sandor@nik.uni-obuda.hu
101
Check an event
• The cudaEventQuery function returns information about an event
cudaError_t cudaEventQuery(cudaEvent_t event)
◦ The first parameter is the event to check for
• Query the status of all device work preceding the most recent call to
cudaEventRecord()
◦ If this work has successfully been completed by the device, or if
cudaEventRecord() has not been called on event, then cudaSuccess is
returned
◦ If this work has not yet been completed by the device then
cudaErrorNotReady is returned
cudaEvent_t test_event;
…
if (cudaEventQuery(test_event) == cudaSuccess) {
    … event has been finished …
} else {
    … event has not been finished …
}
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
102
Synchronization with events
• The cudaStreamWaitEvent function will block a stream until an event finishes
cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, unsigned int flags)
◦ The first parameter is the stream to block
◦ The second parameter is the event to wait on
◦ The third parameter is an optional flag (must be 0)
• Makes all future work submitted to stream wait until event reports completion
before beginning execution. This synchronization will be performed efficiently
on the device
• The event may be from a different context than stream, in which case this
function will perform cross-device synchronization
• The stream will wait only for the completion of the most recent host call to
cudaEventRecord() on event
• If stream is NULL, any future work submitted in any stream will wait for event
to complete before beginning execution. This effectively creates a barrier for
all future work submitted to the device on this thread
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
103
Synchronization with events (example)
cudaEvent_t event;
cudaEventCreate(&event);
cudaMemcpyAsync(d_in, in, size, H2D, stream1);
cudaEventRecord(event, stream1);
cudaMemcpyAsync(out, d_out, size, D2H, stream2);
cudaStreamWaitEvent(stream2, event, 0);
kernel<<< , , , stream2>>>(d_in, d_out);
asynchronousCPUmethod( … );
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
104
Calculate elapsed time between two events
• The cudaEventElapsedTime computes the elapsed time between two finished
events
cudaError_t cudaEventElapsedTime(float *ms, cudaEvent_t start, cudaEvent_t end)
◦ The first parameter is a float pointer. The result will be stored into this
variable
◦ Start event is the first event
◦ End event is the second event
• cudaEventRecord() must be called for both events
• Both of the events must be in finished state
• Do not create the events with the cudaEventDisableTiming flag if you want to measure time
• If timing is not necessary, for better performance use:
cudaEventCreateWithFlags(&event, cudaEventDisableTiming)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
105
Calculate elapsed time (example)
cudaEvent_t start_event, end_event;
cudaEventCreate(&start_event);
cudaEventCreate(&end_event);

cudaEventRecord(start_event, 0);
kernel<<<..., ...>>>(...);
cudaEventRecord(end_event, 0);

cudaEventSynchronize(start_event);
cudaEventSynchronize(end_event);
float elapsed_ms;
cudaEventElapsedTime(&elapsed_ms, start_event, end_event);
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
106
3. PROGRAMMING INTERFACE
3.6 Unified Virtual Address Space
CUDA Unified Virtual Address Management
• Unified virtual addressing (UVA) is a memory address management system
enabled by default in CUDA 4.0 and later releases on Fermi and Kepler GPUs
running 64-bit processes. The design of UVA memory management provides a
basis for the operation of RDMA for GPUDirect [11]
Figure 3.6.1 [9]
• In the CUDA VA space, addresses can be:
◦ GPU – page backed by GPU memory. Not accessible from the host
◦ CPU – page backed by CPU memory. Accessible from the host and the GPU
◦ Free – reserved for future CUDA allocations
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
108
Unified Virtual Address Space
• UVA means that a single memory address is used for the host and all the
devices
• CPU and GPU use the same unified virtual address space
◦ The driver can determine from an address where data resides (CPU, GPU,
one of the GPUs)
◦ Allocations still reside on the same device (in case of multi-GPU
environments)
• Availability
◦ CUDA 4.0 or later
◦ Compute Capability 2.0 or later
◦ 64-bit operating system
• A pointer can reference an address in
◦ global memory on the GPU
◦ system memory on the host
◦ global memory on another GPU
• Applications may query whether the unified address space is used for a particular device by checking the unifiedAddressing device property
(CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING)
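• A minimal sketch of such a check with the runtime API (device 0 assumed, names illustrative):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.unifiedAddressing) {
    // host and device pointers share a single virtual address space on this device
}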
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
109
Unified Virtual Address Space – check availability
• Which memory a pointer points to – host memory or any of the device
memories – can be determined from the value of the pointer using
cudaPointerGetAttributes( )
void* A;
cudaPointerAttributes attr;
cudaPointerGetAttributes( &attr, A );
• The result of this function is a cudaPointerAttributes structure:
struct cudaPointerAttributes {
enum cudaMemoryType memoryType;
int device;
void *devicePointer;
void *hostPointer;
}
◦ memoryType identifies the physical location of the memory associated
with pointer ptr. It can be cudaMemoryTypeHost for host memory or
cudaMemoryTypeDevice for device memory
◦ device is the device against which ptr was allocated
◦ devicePointer is the device pointer alias through which the memory
referred to by ptr may be accessed on the current device
◦ hostPointer is the host pointer alias through which the memory referred to
by ptr may be accessed on the host
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
110
Peer to peer communication between devices
Figure 3.6.2 [10]
• UVA memory copy
• P2P memory copy
• P2P memory access
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
111
Using Unified Addressing and P2P transfer
• All host memory allocated using cuMemAllocHost() or cuMemHostAlloc() is
directly accessible from all devices that support unified addressing
• The pointer value is the same in the host and in the device side, so it is not
necessary to call any functions (cudaHostGetDevicePointer())
• All pointers are unique, so it is not necessary to specify which memory space a pointer belongs to when calling cudaMemcpy() or any other copy function. The cudaMemcpy functions still need a transfer-direction parameter; with UVA this can simply be cudaMemcpyDefault, and the runtime will determine the location of each pointer from its value
cudaMemcpyHostToHost
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
cudaMemcpyDefault
• Enables libraries to simplify their interfaces
• Note that this will transparently fall back to a normal copy through the host if
P2P is not available
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
112
Peer-to-peer memory transfer between GPUs
• Check for P2P access between GPUs [10]:
cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid_1, gpuid_0);
• Enable peer access between GPUs:
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
• We can use UVA memory copy:
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);
• Stop peer access:
cudaSetDevice(gpuid_0);
cudaDeviceDisablePeerAccess(gpuid_1);
cudaSetDevice(gpuid_1);
cudaDeviceDisablePeerAccess(gpuid_0);
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
113
Peer-to-peer memory access between GPUs
• System requirements are the same as P2P memory transfer
• Same checking steps [10]:
cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid_1, gpuid_0);
• Same initialization steps:
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
• Same shutdown steps:
cudaSetDevice(gpuid_0);
cudaDeviceDisablePeerAccess(gpuid_1);
cudaSetDevice(gpuid_1);
cudaDeviceDisablePeerAccess(gpuid_0);
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
114
Peer-to-peer memory access kernel
• A well-known kernel that copies an array from source to destination:
__global__ void CopyKernel(float *src, float *dst)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    dst[idx] = src[idx];
}
• We can start a kernel with different parameters:
CopyKernel<<<blocknum, threadnum>>>(gpu0_buf, gpu0_buf);
CopyKernel<<<blocknum, threadnum>>>(gpu1_buf, gpu1_buf);
CopyKernel<<<blocknum, threadnum>>>(gpu1_buf, gpu0_buf);
CopyKernel<<<blocknum, threadnum>>>(gpu0_buf, gpu1_buf);
• Due to UVA the kernel knows whether its argument is from another GPU
memory/host memory/local memory.
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
115
CUDA Unified Virtual Address summary
• Faster memory transfers between devices
• Device-to-device memory transfers with less host overhead
• Kernels in a device can access memory of other devices (read and write)
• Memory addressing on different devices (other GPUs, host memory)
• Requirements
◦ 64-bit OS and application (Windows TCC)
◦ CUDA 4.0
◦ Fermi GPU
◦ Latest drivers
◦ GPUs need to be on same IOH
More information about UVA
• CUDA Programming Guide 4.0
◦ 3.2.6.4 Peer-to-Peer Memory Access
◦ 3.2.6.5 Peer-to-Peer Memory Copy
◦ 3.2.7 Unified Virtual Address Space
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
116
4. Optimization techniques
4. OPTIMIZATION TECHNIQUES
4.1 Using shared memory
Optimization strategies
• Memory usage
◦ Use registers
◦ Use shared memory
◦ Minimize CPU-GPU data transfers
◦ Processing data instead of moving it (move code to the GPU)
◦ Group data transfers
◦ Special memory access patterns (we don’t discuss)
• Maximize parallel execution
◦ Maximize GPU parallelism
– Hide memory latency by running as many threads as possible
◦ Use CPU-GPU parallelism
◦ Optimize block size
◦ Optimize number of blocks
◦ Use multiple-GPUs
• Instruction level optimization
◦ Use float arithmetic
◦ Use low precision
◦ Use fast math functions (see the sketch after this list)
◦ Minimize divergent warps
– Branch conditions
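◦ For example, the single-precision intrinsics __sinf(), __expf() and __fdividef() trade accuracy for speed; a small sketch (not from the original examples, FastMath and its parameters are illustrative):
__global__ void FastMath(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = __fdividef(__sinf(a[i]), __expf(a[i])); // fast approximate sin, exp and divide
}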
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
119
Matrix multiplication
Exam 4.1.1
Create a CUDA application that solves the following problem: multiply two-dimensional (N×N) matrices on the GPU
• N is a constant in the source code
• Allocate memory for 3 N×N matrices (A, B, C)
• Fill the A matrix with numbers (for example: ai,j = i + j)
• Fill the B matrix with numbers (for example: bi,j = i - j)
• Allocate 3 N×N matrices in the global memory of the graphics card (devA, devB, devC)
• Move the input data to the GPU: A → devA, B → devB
• Execute a kernel to calculate devC = devA * devB
• Move the results back to the system memory: devC → C
• List the values of the C matrix to the screen
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
120
Multi-dimensional matrix in global memory
• We can use multi-dimensional arrays in C programs, but these are obviously
stored in a linear memory area
• For example a 4x4 matrix in the memory:
(Figure: a 4×4 matrix A and its two-dimensional array in memory – the rows are stored one after another, so the linear order is a0,0 a0,1 a0,2 a0,3 a1,0 a1,1 … a3,2 a3,3)
Access elements of a multi-dimensional array
• We know the address of the first item in the array and the size of each element. In this case we can use the following formula:
address(arow,col) = address(a0,0) + (row * col_count + col) * item_size
• The CUDA kernel gets only the starting address of the array, so we have to use this formula to access the elements
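• A small helper following this formula (a sketch; Index and col_count are only illustrative names):
__host__ __device__ inline int Index(int row, int col, int col_count) {
    return row * col_count + col;  // element offset; the byte address is base + offset * item_size
}
// usage inside a kernel, for an N-column matrix: devC[Index(indy, indx, N)] = sum;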
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
121
Multi-dimensional matrix multiplication
• If one thread processes one item in the matrix, we need as many threads as
the number of matrix elements. A relatively small 30x30 matrix needs 900
threads in GPU, therefore we have to use multiple blocks
• Therefore we have to use the block identifier in the kernel. The improved
kernel for the devC = devA * devB matrix multiplication:
__global__ static void MatrixMul(float *devA, float *devB, float *devC) {
    int indx = blockIdx.x * blockDim.x + threadIdx.x;
    int indy = blockIdx.y * blockDim.y + threadIdx.y;

    if (indx < N && indy < N) {
        float sum = 0;
        for (int i = 0; i < N; i++) {
            sum += devA[indy * N + i] * devB[i * N + indx];
        }
        devC[indy * N + indx] = sum;
    }
}
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
122
Multi-dimensional matrix in the GPU memory
• Initialization, memory allocation
cudaSetDevice(0);
float A[N][N], B[N][N], C[N][N]; float *devA, *devB, *devC;
cudaMalloc((void**) &devA, sizeof(float) * N * N);
cudaMalloc((void**) &devB, sizeof(float) * N * N);
cudaMalloc((void**) &devC, sizeof(float) * N * N);
• Move input data
cudaMemcpy(devA, A, sizeof(float) * N * N, cudaMemcpyHostToDevice);
cudaMemcpy(devB, B, sizeof(float) * N * N, cudaMemcpyHostToDevice);
• Invoke the kernel
dim3 grid((N - 1) / BlockN + 1, (N - 1) / BlockN + 1);
dim3 block(BlockN, BlockN);
MatrixMul<<<grid, block>>>(devA, devB, devC);
cudaThreadSynchronize();
• Move the results back, free memory
cudaMemcpy(C, devC, sizeof(float) * N * N, cudaMemcpyDeviceToHost);
cudaFree(devA); cudaFree(devB); cudaFree(devC);
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
123
Aligned arrays
• In some cases the number of columns in one matrix row differs from the size of the row as stored in memory. This can speed up access for technical reasons (for example, with the real memory row size we can use faster multiplications, or we can better utilize the GPU memory controllers)
• A simple 5×5 matrix with 8-item alignment:
(Figure: each row of A occupies 8 slots in memory – the first 5 hold ai,0 … ai,4 and the remaining 3 are padding)
Access elements in case of aligned storage
• The formula is similar but we use the aligned row size:
address(arow,col) = address(a0,0) + (row * aligned_row_size + col) * item_size
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
124
Aligned memory management
• The CUDA runtime library has several functions to manage aligned memory.
The following function allocates an aligned memory area:
cudaMallocPitch(void** devPtr, size_t *pitch, size_t width, size_t height)
◦ devPtr – pointer to the aligned memory
◦ pitch – the actual (aligned) size of one row in bytes
◦ width – size of one matrix row in bytes
◦ height – number of matrix rows
• Similarly to the linear memory management, the start address of the allocated object will be stored in the devPtr variable
• The pitch is not an input value; it is one of the outputs of the function. The CUDA library will determine the optimal value (based on the array and device properties)
• The size of a matrix row (width) is given in bytes
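• A minimal allocation sketch (devA and pitch are illustrative names):
float *devA; size_t pitch;
cudaMallocPitch((void**)&devA, &pitch, N * sizeof(float), N);  // width in bytes, height in rows
// one row starts at ((char*)devA) + row * pitch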
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
125
Copy aligned memory
• Because of the different alignment, the normal linear memory transfer functions are not usable for pitched memory regions
• The following CUDA function transfers data from one region to another
cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch,
size_t width, size_t height, enum cudaMemcpyKind kind)
◦ dst – destination pointer
◦ dpitch – destination pitch value
◦ src – source pointer
◦ spitch – source pitch value
◦ width – size of one row of the 2-dimensional array (in bytes)
◦ height – number of rows of the 2-dimensional array
◦ kind – transfer direction
– host → host (cudaMemcpyHostToHost)
– host → device (cudaMemcpyHostToDevice)
– device → host (cudaMemcpyDeviceToHost)
– device → device (cudaMemcpyDeviceToDevice)
• In case of simple, non-aligned arrays the pitch value is simply the row size in bytes
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
126
Matrix multiplication with aligned arrays
Exam 4.1.2
Create a CUDA application that solves the following problem: multiply two-dimensional (N×N) matrices on the GPU
• N is a constant in the source code
• Allocate memory for 3 N×N matrices (A, B, C)
• Fill the A matrix with numbers (for example: ai,j = i + j)
• Fill the B matrix with numbers (for example: bi,j = i - j)
• Allocate 3 N×N pitched arrays in the global memory of the graphics card (devA, devB, devC)
• Move the input data to the GPU: A → devA, B → devB
• Execute a kernel to calculate devC = devA * devB
• Move the results back to the system memory: devC → C
• List the values of the C matrix to the screen
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
127
Kernel with aligned arrays
• The multiplier is the pitch value instead of the matrix column number
• The pitch is given in bytes, therefore in case of typed pointers we have to correct its value by dividing by sizeof(item_type)
• devC = devA * devB source code:
__global__ static void MatrixMul(float *devA, float *devB, float *devC, size_t pitch) {
    int indx = blockIdx.x * blockDim.x + threadIdx.x;
    int indy = blockIdx.y * blockDim.y + threadIdx.y;

    if (indx < N && indy < N) {
        float sum = 0;
        for (int i = 0; i < N; i++) {
            sum += devA[indy * pitch/sizeof(float) + i] * devB[i * pitch/sizeof(float) + indx];
        }
        devC[indy * pitch/sizeof(float) + indx] = sum;
    }
}
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
128
Invoke kernel with aligned arrays
• Initialization, allocate arrays
cudaSetDevice(0);
float A[N][N], B[N][N], C[N][N]; float *devA, *devB, *devC; size_t pitch;
cudaMallocPitch((void**) &devA, &pitch, sizeof(float) * N, N);
cudaMallocPitch((void**) &devB, &pitch, sizeof(float) * N, N);
cudaMallocPitch((void**) &devC, &pitch, sizeof(float) * N, N);
• Transfer input data (we assume pitch value is the same)
cudaMemcpy2D(devA, pitch, A, sizeof(float) * N, sizeof(float) * N, N, cudaMemcpyHostToDevice);
cudaMemcpy2D(devB, pitch, B, sizeof(float) * N, sizeof(float) * N, N, cudaMemcpyHostToDevice);
• Kernel invocation
dim3 grid((N - 1) / BlockN + 1, (N - 1) / BlockN + 1);
dim3 block(BlockN, BlockN);
MatrixMul<<<grid, block>>>(devA, devB, devC, pitch);
cudaThreadSynchronize();
• Transfer results, free memory
cudaMemcpy2D(C, sizeof(float) * N, devC, pitch, sizeof(float) * N, N, cudaMemcpyDeviceToHost);
cudaFree(devA); cudaFree(devB); cudaFree(devC);
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
129
Using shared memory
• Matrix multiplication uses a relatively small number of arithmetic operations compared to the number of memory transfers
• We need as many operations as possible to hide the latency caused by memory transfers (the GPU tries to reschedule the execution units during memory latencies, but without a lot of operations this is not possible)
• Our goal is to increase the ratio of arithmetic operations to memory transfers
Available solutions
• Increase parallelism (in this case it is not possible)
• Decrease the number of memory transfers (in practice this means manually
programmed caching)
◦ holding as many variables in registers as possible
◦ using the shared memory
• Find another solution
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
130
Tiled matrix multiplication
• One input cell is necessary for the calculation of more than one output cell. In the non-optimized version of the algorithm, more than one thread will read the same input value from global memory
• It would be practical to harmonize these threads' work:
◦ divide the entire output matrix into small regions (tiles)
◦ allocate shared memory for one region
◦ in the region, every thread loads the corresponding value from the input matrices to the shared memory
◦ every thread calculates one partial result based on the values in the shared memory
• The size of the shared memory is limited, therefore the steps above usually have to be executed in more than one step. We have to divide the input matrices into multiple tiles, and at the end of the kernel execution we have to sum up the partial results of these tiles
• In the latter case it is necessary to synchronize the threads. Every thread must wait until all of the other threads have loaded their values from global memory to the shared memory, and after that the threads must wait again until all of them have finished the calculation before loading the next value
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
131
Matrix multiplication
Exam 4.1.3
Create a CUDA application that solves the following problem: multiply two-dimensional (N×N) matrices on the GPU
• N is a constant in the source code
• Allocate memory for 3 N×N matrices (A, B, C)
• Fill the A matrix with numbers (for example: ai,j = i + j)
• Fill the B matrix with numbers (for example: bi,j = i - j)
• Allocate 3 N×N matrices in the global memory of the graphics card (devA, devB, devC)
• Move the input data to the GPU: A → devA, B → devB
• Execute a kernel to calculate devC = devA * devB with the tile technique
• Move the results back to the system memory: devC → C
• List the values of the C matrix to the screen
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
132
Optimized matrix multiplication
(Figure: tiles As and Bs in shared memory are built from matrices A and B in global memory; the result goes to C)
1. Division into tiles. In this case 3×3 regions, 3×3 threads
2. Every thread copies one value from the global memory to the shared memory
3. Synchronization
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
133
Optimized matrix multiplication (2)
4. Every thread calculates one cell's result in the shared memory
5. Synchronization
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
134
Optimized matrix multiplication (3)
6. Load the next tiles
7. Synchronization
8. Threads do the multiplication again and add the result to the already existing results
9. Synchronization
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
135
Optimized matrix multiplication (4)
6. Load the next tile
7. Synchronization
8. Threads do the multiplication again; the result is added to the already existing partial result
9. Synchronization
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
136
Optimized matrix multiplication (5)
10. Every thread copies the result to the result matrix C
11. When all of the blocks have finished, the C matrix contains the final result
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
137
Optimized matrix multiplication source code
• Kernel invocation is the same as in the non-optimized version
__global__ static void MatrixMul(float *devA, float *devB, float *devC) {
    __shared__ float Ashared[BlockN][BlockN];
    __shared__ float Bshared[BlockN][BlockN];
    int indx = blockIdx.x * blockDim.x + threadIdx.x;
    int indy = blockIdx.y * blockDim.y + threadIdx.y;
    float c = 0;
    for (int k = 0; k < N / BlockN; k++) {
        Ashared[threadIdx.y][threadIdx.x] = devA[k * BlockN + threadIdx.x + indy * N];
        Bshared[threadIdx.y][threadIdx.x] = devB[indx + (k * BlockN + threadIdx.y) * N];
        __syncthreads();
        for (int i = 0; i < BlockN; i++) {
            c += Ashared[threadIdx.y][i] * Bshared[i][threadIdx.x];
        }
        __syncthreads();
    }
    devC[indx + indy * N] = c;
}
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
138
Comparing runtime of original and tiled algorithms
• Horizontal axis: size of matrix (N)
• Vertical axis: runtime (second)
(Chart: runtime of the original ("Eredeti" = original) and the tiled ("Optimalizált" = optimized) kernels for matrix sizes N = 40…200; runtime axis 0–1 s.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
139
4. OPTIMIZATION TECHNIQUES
4.2 Using atomic instructions
Atomic operations
• Atomic operations are operations which are performed without interference
from any other threads. Atomic operations are often used to prevent race
conditions which are common problems in multithreaded applications [8].
• In case of some tasks we need atomic operations, for example:
◦ sum/average of a data structure
◦ min/max item of a data structure
◦ Count of some items in a data structure
◦ etc.
Possible solutions
• We can implement some of these tasks in parallel environment (for example,
is there any special item in the data structure?)
• But some of them are hard to parallelize (for example, finding the minimum value in the data structure)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
141
CUDA atomic instructions
• The atomic instructions of the CUDA environment can solve the race
conditions mentioned before. When using atomic instructions the hardware
will guarantee the serialized execution
• Operand location
◦ variable in global memory
◦ variable in shared memory
• Operand size
◦ 32bit integer (Compute Capability 1.1)
◦ 64bit integer (Compute Capability 1.2)
Performance notes
• If two threads perform an atomic operation at the same memory address at
the same time, those operations will be serialized. This will slow down the
kernel execution
• In case of some special tasks we cannot avoid atomic instructions. But in most cases, if possible, we should try to find another solution. The goal is to use as few atomic instructions as possible.
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
142
CUDA atomic instructions
• The first parameter of atomic instructions is usually a memory address (in global or shared memory), the second parameter is an integer
• int atomicAdd(int* address, int val)
Reads the 32-bit or 64-bit word old located at the address in global or shared
memory, computes (old + val), and stores the result back to memory at the
same address. These three operations are performed in one atomic
transaction. The function returns old
• int atomicSub(int* address, int val)
Reads the 32-bit word old located at the address in global or shared memory,
computes (old -val), and stores the result back to memory at the same
address. These three operations are performed in one atomic transaction.
The function returns old
• int atomicExch(int* address, int val);
Reads the 32-bit or 64-bit word old located at the address in global or shared
memory and stores val back to memory at the same address. These two
operations are performed in one atomic transaction. The function returns old
• int atomicMin(int* address, int val);
Reads the 32-bit word old located at the address in global or shared memory,
computes the minimum of old and val, and stores the result back to memory
at the same address. These three operations are performed in one atomic
transaction. The function returns old
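• A tiny illustration of atomicAdd (a sketch, not part of the original examples; hist is assumed to be a zero-initialized array of 256 int counters in global memory):
__global__ void Histogram(unsigned char *data, int n, int *hist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&hist[data[i]], 1);  // serialized only when threads hit the same bin
}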
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
143
CUDA atomic instructions (2)
• int atomicMax(int* address, int val);
Reads the 32-bit word old located at the address in global or shared memory,
computes the maximum of old and val, and stores the result back to memory
at the same address. These three operations are performed in one atomic
transaction. The function returns old
• unsigned int atomicInc(unsigned int* address, unsigned int val)
Reads the 32-bit word old located at the address in global or shared memory,
computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory
at the same address. These three operations are performed in one atomic
transaction. The function returns old
• unsigned int atomicDec(unsigned int* address, unsigned int val)
Reads the 32-bit word old located at the address in global or shared memory,
computes (((old == 0) | (old > val)) ? val : (old-1)), and stores the result
back to memory at the same address. These three operations are performed
in one atomic transaction. The function returns old
• int atomicCAS(int* address, int compare, int val)
Compare And Swap: reads the 32-bit or 64-bit word old located at the
address in global or shared memory, computes (old == compare ? val : old),
and stores the result back to memory at the same address. These three
operations are performed in one atomic transaction. The function returns old
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
144
CUDA atomic bitwise instructions
• int atomicAnd(int* address, int val)
Reads the 32-bit word old located at the address in global or shared memory,
computes (old & val), and stores the result back to memory at the same
address. These three operations are performed in one atomic transaction. The
function returns old
• int atomicOr(int* address, int val)
Reads the 32-bit word old located at the address in global or shared memory,
computes (old | val), and stores the result back to memory at the same
address. These three operations are performed in one atomic transaction. The
function returns old
• int atomicXor(int* address, int val)
Reads the 32-bit word old located at the address in global or shared memory,
computes (old ^ val), and stores the result back to memory at the same
address. These three operations are performed in one atomic transaction. The
function returns old
Exam 4.2.1
Create a CUDA application to solve the following problem. Find the minimal
value from a randomly filled vector (length: N). Use the atomic operations!
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
145
Find the minimal value of a vector – using global memory
• The source code is really simple. Every thread calls the atomicMin instruction and passes the array element selected by its thread identifier
• In this implementation the first item of the array will contain the minimal value of the array
__global__ static void MinSearch(int *devA) {
    int indx = blockIdx.x * blockDim.x + threadIdx.x;
    atomicMin(devA, devA[indx]);
}
• As visible, this kernel can run in a multi-block execution context; the atomic instructions are usable in this environment (note that atomicMin works on integer operands, hence the int array)
Exam 4.2.2
Try to speed up the existing algorithm. Use the shared memory instead of
global memory.
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
146
Find the minimal value of a vector – using shared memory
• First we have to initialize the localMin variable; the first thread in every block will do this
• In the next step, every thread checks the value indexed by its thread identifier
• After the next synchronization, the first thread will compare the local minimum to the global minimum (every block has a local minimum)
__global__ static void MinSearch(int *devA) {
    __shared__ int localMin;
    int indx = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0) localMin = devA[blockIdx.x * blockDim.x];
    __syncthreads();
    atomicMin(&localMin, devA[indx]);
    __syncthreads();
    if (threadIdx.x == 0) atomicMin(devA, localMin);
}
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
147
Comparing runtime of global and shared memory usage
• Horizontal axis: size of vector (N)
• Vertical axis: runtime (second)
(Chart: runtime of the global-memory ("Eredeti" = original) and the shared-memory ("Optimalizált" = optimized) versions for vector sizes N = 5000…40000; runtime axis 0–10 s.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
148
Parallelization inside the block
• We have to try to avoid atomic instructions. It would be better to find a parallelizable solution. We have to divide the task of each block into smaller parts
• First load a part of the global memory into the block's shared memory: every thread loads one value from the global memory to the shared memory array
• Inside the block every thread compares two values and stores the smaller one into the vector cell with the smaller index
• In the next iteration we will check only the smaller items
• In the last step we have the minimal value of the block. We only have to find the global minimum (same as before)
Exam 4.2.3
Create an algorithm based on the idea above.
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
149
Parallel minimum search - loading
• One example: N = 24, BlockN = 4 (number of threads)
• Every block allocates one array in the shared memory (size is BlockN*2)
• Every thread in every block loads 2 values from the global memory and stores the smaller one
• If we have empty spaces we have to fill them with some values
• Synchronization
(Figure: the threads of Block 0 store Min(A0,A1), Min(A2,A3), …, Min(A14,A15) into the shared array; Block 1 stores Min(A16,A17), …, Min(A22,A23) and A24, and pads the remaining cells with A0.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
150
Parallel minimum search – find minimum of block
• Every thread does log2(BlockN) iterations. In every iteration the threads do the following operation:
Sx = Min(S2x, S2x+1)
• At the end of the last iteration, the first value of the array will be the smallest one
• After that we find the global minimum
◦ using atomic instructions
◦ we store the minimum values of the blocks into another vector and redo the minimum search on this vector (this solution is better in case of a large block number)
(Figure: reduction tree – the shared cells S0…S7 are reduced pairwise to Min(S0,S1), Min(S2,S3), Min(S4,S5), Min(S6,S7), then to Min(S0…S3) and Min(S4…S7), and finally to the block minimum.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
151
Parallel minimum search - kernel
__global__ static void MinSearch(int *devA) {
    __shared__ int localMin[BlockN*2];
    int blockSize = BlockN;
    int itemc1 = threadIdx.x * 2;
    int itemc2 = threadIdx.x * 2 + 1;

    for (int k = 0; k <= 1; k++) {
        int blockStart = blockIdx.x * blockDim.x * 4 + k * blockDim.x * 2;
        int loadIndx = threadIdx.x + blockDim.x * k;
        if (blockStart + itemc2 < N) {
            int value1 = devA[blockStart + itemc1];
            int value2 = devA[blockStart + itemc2];
            localMin[loadIndx] = value1 < value2 ? value1 : value2;
        } else
            if (blockStart + itemc1 < N)
                localMin[loadIndx] = devA[blockStart + itemc1];
            else
                localMin[loadIndx] = devA[0];
    }
    __syncthreads();
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
152
Parallel minimum search – kernel (2)
    while (blockSize > 0) {
        int locMin = localMin[itemc1] < localMin[itemc2] ? localMin[itemc1] : localMin[itemc2];
        __syncthreads();
        localMin[threadIdx.x] = locMin;
        __syncthreads();
        blockSize = blockSize / 2;
    }
    if (threadIdx.x == 0) atomicMin(devA, localMin[0]);
}
• A more optimized version is available at
http://developer.download.nvidia.com/compute/cuda/1.1Beta/x86_website/projects/reduction/doc/reduction.pdf
• Block size must be a power of 2 (2^n)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
153
Comparing runtime of atomic and parallel version
• Horizontal axis: size of vector (N)
• Vertical axis: runtime (second)
(Chart: runtime of the atomic ("Optimalizált 1") and the parallel-reduction ("Optimalizált 2") versions for vector sizes N = 5000…40000; runtime axis 0–0.3 s.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
154
Comparing of CPU and GPU implementation
• Horizontal axis: size of vector (N)
• Vertical axis: runtime (second)
(Chart: runtime of the CPU implementation and the optimized GPU version ("Optimalizált 2") for vector sizes N = 10000…200000; runtime axis 0–0.6 s.)
Values do not contain transfer time from CPU to GPU!
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
155
4. OPTIMIZATION TECHNIQUES
4.3 Occupancy considerations
Execution overview
• Problem space is divided into blocks
◦ Grid is composed of independent blocks
◦ Blocks are composed of threads
• Instructions are executed per warp
◦ In case of Fermi, 32 threads form a warp
◦ Fermi can have 48 active warps per SM (1536 threads)
◦ Warp will stall if any of the operands is not ready
• To avoid latency
◦ Switch between contexts while warps are stalled
◦ Context switching latency is very small
• Registers and shared memory are allocated for a block as long as the block is
active
◦ Once a block is active it will stay active until all threads completed in that
block
◦ Registers/shared memory do not need to be stored/reloaded on a context switch
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
157
Occupancy
• Occupancy is the ratio of active processing units to available processing units
Occupancy = Active Warps / Maximum Number of Warps
• Occupancy is limited by:
◦ Max Warps or Max Blocks per Multiprocessor
◦ Registers per Multiprocessor
◦ Shared memory per Multiprocessor
• Occupancy = Min( register occ., shared mem occ., block size occ.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
158
Occupancy and registers
• Fermi has 32K registers per SM
• The maximum number of threads is 1536
• For example, if a kernel uses 40 registers per thread:
◦ Number of active threads: 32K / 40 = 819
◦ Occupancy: 819 / 1536 = 0,53
• In this case the number of registers limits the occupancy (meanwhile there
are some unused resources in the GPU)
• Goal: try to limit the register usage
◦ Check register usage: compile with --ptxas-options=-v
◦ Limit register usage: compile with --maxrregcount
• For example, in case of 21 registers:
◦ Number of active threads: 32K / 21 = 1560
◦ Occupancy: 1560 / 1536 = ~1
◦ This means only that the number of registers will not limit the occupancy (it still highly depends on other resources)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
159
Occupancy and shared memory
• Size of shared memory is configurable in Fermi
◦ 16K shared memory
◦ 48K shared memory (we use this configuration in the examples)
• For example, if a kernel uses 64 bytes of shared memory per thread:
◦ Number of active threads: 48K / 64 = 768
◦ Occupancy: 768 / 1536 = 0,5
• In this case the size of shared memory limits the occupancy (meanwhile
there are some unused resources in the GPU)
• Goal: try to limit the shared memory usage
◦ Check shared memory usage: compile with --ptxas-options=-v
◦ Limit shared memory usage
– Use lower shared memory in kernels (kernel invocation)
– Use appropriate L1/Shared configuration in case of Fermi
• For example, in case of 32 bytes of shared memory:
◦ Number of active threads: 48K / 32 = 1536
◦ Occupancy: 1536 / 1536 = 1
◦ This means only that the size of shared memory will not limit the occupancy (it still highly depends on other resources)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
160
Occupancy and block size
• Each SM can have up to 8 active blocks
• There is a hardware based upper limit for block size
◦ Compute Capability 1.0 – 512
◦ Compute Capability 2.0 – 1024
• Lower limit is 1 but small block size will limit the total number of threads
• For example,
◦ Block size: 128
◦ Active threads in one SM: 128 * 8 = 1024
◦ Occupancy: 1024 / 1536 = 0,66
• In this case the block size limits the occupancy (meanwhile there are some
unused resources in the GPU)
• Goal: try to increase the block size (kernel invocation parameter)
• For example,
◦ Block size: 192
◦ Active threads in one SM: 192 * 8 = 1536
◦ Occupancy: 1536 / 1536 = 1
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
161
CUDA Occupancy calculator
• A CUDA tool to investigate the occupancy
• In practice it is an Excel sheet, located in „NVIDIA GPU Computing SDK
x.x\C\tools\CUDA_Occupancy_Calculator.xls”
• Input data:
◦ Hardware configuration
– Compute Capability
– Shared Memory Config
◦ Resource usage
– Threads per block
– Registers per thread
– Shared memory per block
• Output data:
◦ Active threads per MP
◦ Active warps per MP
◦ Active thread blocks per MP
◦ Occupancy of each MP
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
162
CUDA Occupancy calculator - example
Hardware configuration
Resource usage
Occupancy details
Physical limits
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
163
CUDA Occupancy calculator – impact of varying block size
(Chart: Impact of Varying Block Size – multiprocessor warp occupancy (0–48 warps) as a function of threads per block (0–1024); the selected block size is 256.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
164
CUDA Occupancy calculator – impact of varying register count
(Chart: Impact of Varying Register Count Per Thread – multiprocessor warp occupancy (0–48 warps) as a function of registers per thread (0–128); the selected register count is 16.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
165
CUDA Occupancy calculator – impact of varying shared memory
(Chart: Impact of Varying Shared Memory Usage Per Block – multiprocessor warp occupancy (0–48 warps) as a function of shared memory per block (0–49152 bytes); the selected shared memory usage is 4096 bytes.)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
166
Block size considerations [18]
• Choose number of threads per block as a multiple of warp size
• Avoid wasting computation on under-populated warps
• Optimize block size
◦ More thread blocks – better memory latency hiding
◦ Too many threads per block – fewer registers per thread; kernel invocation can fail if too many registers are used
• Heuristics
◦ Minimum: 64 threads per block
– Only if multiple concurrent blocks
◦ 192 or 256 threads a better choice
– Usually still enough registers to compile and invoke successfully
◦ This all depends on your computation!
– Experiment!
• Try to maximize occupancy
◦ Increasing occupancy does not necessarily increase performance
◦ But, low-occupancy multiprocessors cannot adequately hide latency on
memory-bound kernels
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
167
4. OPTIMIZATION TECHNIQUES
4.4 Parallel Nsight
Parallel Nsight
• Debugger for GPGPU development
• Available only for registered users(?):
http://www.nvidia.com/object/nsight.html
• Available editions
◦ Visual Studio Edition
https://developer.nvidia.com/nvidia-nsight-visual-studio-edition
◦ Nsight Eclipse Edition
• Main features
◦ Visual Studio/Eclipse support
◦ PTX/SASS Assembly Debugging
◦ CUDA Debugger (debug kernels directly)
◦ Use conditional breakpoints
◦ View GPU memory
◦ Graphics debugger
◦ Profiler functions
• Hardware requirements
◦ Analyzer – single GPU system
◦ CUDA Debugger – dual GPU system
◦ Direct3D Shader Debugger – two separate GPU systems
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
169
Kernel debugging
• Main steps for local debugging
◦ Start Nsight Monitor
(All Programs > NVIDIA Corporation > Nsight Visual Studio Edition 2.2 >
Nsight Monitor)
◦ Set breakpoint
Like setting breakpoint in CPU code
◦ Start CUDA debugging in Visual Studio
(Nsight/Start CUDA debugging)
◦ Debugger will stop at the breakpoint
◦ All the common debugger commands are available
– Step over
– Step into
– Etc.
• Remote debugging
◦ We do not discuss
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
170
Watch GPU memory regions
• Nsight supports the Visual Studio „Memory” window for examining the
contents of GPU memory
◦ Shared memory
◦ Local memory
◦ Global memory
• To show a memory region, select Debug/Windows/Memory
◦ In case of kernel debugging just enter the name of the variable or the direct address
◦ In case of direct addresses use the following keywords: __shared__,
__local__, __device__
◦ For example: (__shared__ float*)0
• The common Visual Studio functions are also available
◦ Watch window to check kernel variables
◦ Move the cursor over a variable to see the actual value
• Built-in CUDA variables are also available
◦ threadIdx
◦ blockIdx
◦ blockDim
◦ gridDim
◦ etc.
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
171
CUDA Debug Focus
• Some variables in CUDA belong to a context
◦ Registers and local memory to threads
◦ Shared memory to blocks
• To see a variable's actual value, the developer must define the owner thread (block index and thread index)
◦ Select Nsight/Windows/CUDA Debug Focus
◦ Set block index
◦ Set thread index
• Watch window/quick watch etc. will show information about the variables of
the corresponding thread
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
172
CUDA Device Summary
• An overview about the state of the available devices
◦ Select Nsight/Windows/CUDA Device Summary
◦ Select a device from the list
◦ Lots of static and runtime parameters are displayed on the right
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
173
CUDA Device Summary - grid
• An overview about the state of available devices
◦ Select Nsight/Windows/CUDA Device Summary
◦ Select a grid from the list
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
174
CUDA Device Summary - warp
• An overview about the state of available devices
◦ Select Nsight/Windows/CUDA Device Summary
◦ Select a running warp
• Developer can check the current state of all running warps
• SourceFile/SourceLine can be very useful to understand the execution mechanism
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
175
Debugging PTX code
• Check the Tools/Options/Debugging options
◦ Select “Enable Address Level Debugging”
◦ Select “Show disassembly if source is not available”
• When the CUDA debugger is stopped
◦ Select “Go to Disassembly”
◦ The PTX code appears (SASS code is also available)
• Debugging is the same as CPU applications
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
176
Using the memory checker
• The CUDA Memory Checker detects problems in global and shared memory.
If the CUDA Debugger detects an MMU fault when running a kernel, it will not
be able to specify the exact location of the fault. In this case, enable the
CUDA Memory Checker and restart debugging, and the CUDA Memory
Checker will pinpoint the exact statements that are triggering the fault [22]
• Select Nsight/Options/CUDA
◦ Set “Enable Memory Checker” to true
• Launch the CUDA debugger and run the application
◦ During the execution if the kernel tries to write to an invalid memory
location (for example in case of arrays) the debugger will stop
◦ The debugger will stop before the execution of this instruction
• The CUDA memory checker will write results to the Output window
◦ Launch parameters
◦ Number of detected problems
◦ GPU state in these cases
– Block index
– Thread index
– Sourcecode line number
◦ Summary of access violations
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
177
CUDA memory checker result
================================================================================
CUDA Memory Checker detected 2 threads caused an access violation:
Launch Parameters
  CUcontext    = 003868b8
  CUstream     = 00000000
  CUmodule     = 0347e780
  CUfunction   = 03478980
  FunctionName = _Z9addKernelPiPKiS1_
  gridDim      = {1,1,1}
  blockDim     = {5,1,1}
  sharedSize   = 0
  Parameters:
  Parameters (raw):
    0x05200000 0x05200200 0x05200400
GPU State:
  Address   Size  Type    Block  Thread  blockIdx  threadIdx  PC      Source
  ----------------------------------------------------------------------------
  05200018     4  adr st      0       3  {0,0,0}   {3,0,0}    0000f0  d:\sandbox\nsighttest\nsighttest\kernel.cu:12
  05200020     4  adr st      0       4  {0,0,0}   {4,0,0}    0000f0  d:\sandbox\nsighttest\nsighttest\kernel.cu:12
Summary of access violations:
================================================================================
Parallel Nsight Debug
Memory Checker detected 2 access violations.
error = access violation on store
blockIdx = {0,0,0}
threadIdx = {3,0,0}
address = 0x05200018
accessSize = 4
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
178
Possible error codes and meanings
• CUDA memory checker error codes:
CUDA memory checker error codes
mis ld   – misaligned access during a memory load
mis st   – misaligned access during a memory store
mis atom – misaligned access during an atomic memory transaction (an atomic function was passed a misaligned address)
adr ld   – invalid address during a memory load
adr st   – invalid address during a memory store (attempted write to a memory location that was out of range, also sometimes referred to as a limit violation)
adr atom – invalid address during an atomic memory transaction (an atomic function attempted a memory access at an invalid address)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
179
5. CUDA libraries
5. CUDA LIBRARIES
5.1 CUBLAS library
CUBLAS Library
• BLAS: Basic Linear Algebra Subprograms [14]
Basic Linear Algebra Subprograms (BLAS) is a de facto application
programming interface standard for publishing libraries to perform basic
linear algebra operations such as vector and matrix multiplication. Heavily
used in high-performance computing, highly optimized implementations of
the BLAS interface have been developed by hardware vendors such as by
Intel and Nvidia
• CUBLAS: CUDA BLAS library
CUBLAS is an implementation of the BLAS library based on the CUDA driver and framework. It has some easy-to-use data types and functions. The library is self-contained at the API level, so direct interaction with the CUDA driver is unnecessary
• Technical details
◦ The interface to the CUBLAS library is the header file cublas.h
◦ Applications using CUBLAS need to link against the CUBLAS DSO (the DLL cublas.dll for Windows applications) when building for the device,
◦ and against the emulation DSO (the DLL cublasemu.dll for Windows applications) when building for device emulation.
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
182
Developing CUBLAS based applications
• Step 1 - Create CUBLAS data structures
◦ CUBLAS provides functions to create and destroy objects in the GPU
space
◦ There are no special types (like matrix or vector types); the library functions usually need typed pointers to the data structures
• Step 2 - Fill structures with data
◦ There are some functions to handle data transfers between the system
memory and the GPU memory
• Step 3 - Call CUBLAS function(s)
◦ The developer can call a CUBLAS function, or a sequence of these
functions
• Step 4 - Retrieve results to system memory
◦ Finally the developer can copy the results back from the GPU memory to system memory.
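• A minimal sketch of these four steps using a couple of level-1 functions described later in this chapter (legacy cublas.h API; error checking omitted, h_x and d_x are illustrative names):
float h_x[1000];                                      // filled by the host application
float *d_x;
cublasInit();                                         // step 1: initialize the library
cublasAlloc(1000, sizeof(float), (void**)&d_x);       // step 1: create the GPU-side object
cublasSetVector(1000, sizeof(float), h_x, 1, d_x, 1); // step 2: upload the data
int maxIdx = cublasIsamax(1000, d_x, 1);              // step 3: call CUBLAS core functions (1-based index!)
float absSum = cublasSasum(1000, d_x, 1);             // step 3
// step 4: vector/matrix results would be copied back with cublasGetVector()
cublasFree(d_x);
cublasShutdown();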
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
183
CUBLAS function result
• The type cublasStatus is used for function status returns
• CUBLAS helper functions return status directly, while the status of CUBLAS
core functions can be retrieved via cublasGetError( ) function
• Currently, the following values are defined:
CUBLAS error codes
CUBLAS_STATUS_SUCCESS          – operation completed successfully
CUBLAS_STATUS_NOT_INITIALIZED  – CUBLAS library not initialized
CUBLAS_STATUS_ALLOC_FAILED     – resource allocation failed
CUBLAS_STATUS_INVALID_VALUE    – unsupported numerical value was passed to function
CUBLAS_STATUS_ARCH_MISMATCH    – function requires an architectural feature absent from the architecture of the device
CUBLAS_STATUS_MAPPING_ERROR    – access to GPU memory space failed
CUBLAS_STATUS_EXECUTION_FAILED – GPU program failed to execute
CUBLAS_STATUS_INTERNAL_ERROR   – an internal CUBLAS operation failed
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
184
CUBLAS helper functions
• cublasStatus cublasInit( )
Initializes the CUBLAS library: it allocates hardware resources for accessing
GPU. It must be called before any other CUBLAS functions
Return values:
◦ CUBLAS_STATUS_ALLOC_FAILED: if resources could not be allocated
◦ CUBLAS_STATUS_SUCCESS: if CUBLAS library initialized successfully
• cublasStatus cublasShutdown( )
Shuts down the CUBLAS library: it releases the hardware resources used on the CPU side
Return values:
◦ CUBLAS_STATUS_NOT_INITIALIZED: if CUBLAS library was not initialized
◦ CUBLAS_STATUS_SUCCESS: CUBLAS library shut down successfully
• cublasStatus cublasGetError( )
Returns the last error that occurred on invocation of any of the CUBLAS core
functions (helper functions return the status directly, the core functions do
not)
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
185
CUBLAS memory handling
• cublasStatus cublasAlloc(int n, int elemSize, void **ptr)
Creates an object in GPU memory space capable of holding an array of n elements, where each element's size is elemSize bytes. The result of the function is the common status code; the ptr pointer points to the newly allocated memory space
• cublasStatus cublasFree(const void *ptr)
Deallocates the object in the GPU memory referenced by the ptr pointer
• cublasStatus cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy)
The function copies n elements from a vector in the system memory (pointed to by the x reference) to the y vector in the GPU memory (pointed to by the y reference). The storage spacing between elements is incx in the source vector and incy in the destination vector
• cublasStatus cublasGetVector(int n, int elemSize, const void *x, int incx, void *y, int incy)
Similar to the cublasSetVector function. It copies n elements from a vector in the GPU memory (pointed to by the x reference) to the y vector in the system memory (pointed to by the y reference). The storage spacing between elements is incx in the source vector and incy in the destination vector.
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
186
BLAS functions overview
• The BLAS functionality is divided into three levels: 1, 2 and 3
• The CUBLAS framework uses the same division method as the original BLAS
library
• BLAS level 1 functions
◦ This level contains vector operations of the form y ← αx + y, as well as scalar dot products and vector norms, among other things
◦ Functions are grouped into subgroups by the operand types
– Single-precision BLAS1 functions
– Single-precision complex BLAS1 functions
– Double-precision BLAS1 functions
– Double-precision complex BLAS1 functions
• BLAS level 2 functions
◦ This level contains matrix–vector operations, solving triangular systems of equations, among other things
• BLAS level 3 functions
◦ This level contains matrix–matrix operations, including the widely used general matrix multiply (GEMM) operation
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
187
Some CUBLAS level 1 functions
• int cublasIsamax(int n, const float *x, int incx)
Finds the smallest index of the maximum element (result is 1-based
indexing!)
Parameters:
◦ n: number of elements in input vector
◦ x: single-precision vector with n elements
◦ incx: storage spacing between elements of x
Error codes:
◦ CUBLAS_STATUS_NOT_INITIALIZED: if CUBLAS library was not initialized
◦ CUBLAS_STATUS_ALLOC_FAILED: if function could not allocate reduction
buffer
◦ CUBLAS_STATUS_EXECUTION_FAILED: if function failed to launch on
GPU
• float cublasSasum(int n, const float *x, int incx)
Computes the sum of the values of the elements in the vector
…
• See the CUBLAS library documentation for full list of available functions
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
188
Some CUBLAS level 2 functions
• void cublasSsbmv(char uplo, int n, int k, float alpha, const float *A, int lda, const float *x, int incx, float beta, float *y, int incy)
Performs the following matrix-vector operation:
y = alpha * A * x + beta * y
where
◦ alpha, beta – scalars
◦ x, y – vectors
◦ A – matrix
• void cublasStrsv(char uplo, char trans, char diag, int n, const float *A, int lda, float *x, int incx)
Solves a system of equations with a triangular matrix
…
• See the CUBLAS library documentation for full list of available functions
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
189
Some CUBLAS level 3 functions
• void cublasSgemm(char transa, char transb, int m, int n, int k, float alpha, const float *A, int lda, const float *B, int ldb, float beta, float *C, int ldc)
Performs the following matrix-matrix operation:
C = alpha * op(A) * op(B) + beta * C
(where op(x) = x or op(x) = xT)
where
◦ alpha, beta – scalars
◦ lda, ldb, ldc – leading dims
◦ A, B, C – matrices
◦ if transa = ”T” then op(A) = AT
◦ if transb = ”T” then op(B) = BT
• See the CUBLAS library documentation for full list of available functions
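• A sketch of calling it for C = A * B with column-major N×N matrices already resident in GPU memory (d_A, d_B, d_C are illustrative device pointers):
cublasSgemm('N', 'N', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);
cublasStatus status = cublasGetError();  // core functions report errors this way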
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
190
6. CUDA versions
6. CUDA VERSIONS
6.1 CUDA 4 features
CUDA 4.0 features
Share GPUs accross multiple threads
• Easier porting of multi-threaded applications. CPU threads can share one GPU
(OpenMP etc.)
• Launch concurrent kernels from different host threads (eliminates context switching overhead)
• New, simple context management APIs. Old context migration APIs still
supported
One thread can access all GPUs
• Each host thread can access all GPUs
(CUDA had a „1 thread – 1 GPU” limitation before)
• Single-threaded application can use multi-GPU features
• Easy to coordinate work across multiple GPUs
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
193
Set current device
• All CUDA operations are issued to the “current” GPU (except asynchronous P2P
memory copies)
• To select the current device, use cudaSetDevice()
cudaError_t cudaSetDevice(int device)
◦ First parameter is the number of the device
• Any device memory subsequently allocated from this host thread using
cudaMalloc(), cudaMallocPitch() or cudaMallocArray() will be physically
resident on device
• Any host memory allocated from this host thread using cudaMallocHost() or
cudaHostAlloc() or cudaHostRegister() will have its lifetime associated with
device
• Any streams or events created from this host thread will be associated with
device
• Any kernels launched from this host thread using the <<< >>> operator or
cudaLaunch() will be executed on device
• This call may be made from any host thread, to any device, and at any time
• This function will do no synchronization with the previous or new device, and
should be considered a very low overhead call
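• A minimal illustration (d_buf, size, grid, block and MyKernel are placeholder names):
cudaSetDevice(1);                  // device 1 becomes the current device
float *d_buf;
cudaMalloc((void**)&d_buf, size);  // allocated on device 1
MyKernel<<<grid, block>>>(d_buf);  // executed on device 1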
2012.12.30
szenasi.sandor@nik.uni-obuda.hu
194
Current device - streams, events
• Streams and events are per device
◦ Streams are created in the current device
◦ Events are created in the current device
• NULL stream (or 0 stream)
◦ Each device has its own default stream
◦ Default streams of different devices are independent
• Using streams and events
◦ Streams can contain only events of the same device
• Using current device
◦ Calls to streams are available only when the appropriate device is current
Multi-GPU example
• Synchronization between devices
• eventB belongs to streamB and device 1
• When cudaEventSynchronize() is called, the current device is device 0 (it is still legal to synchronize on another device's event)
cudaStream_t streamA, streamB;
cudaEvent_t eventA, eventB;

// device 0: create its own stream and event
cudaSetDevice( 0 );
cudaStreamCreate( &streamA );
cudaEventCreate( &eventA );

// device 1: create its own stream and event
cudaSetDevice( 1 );
cudaStreamCreate( &streamB );
cudaEventCreate( &eventB );

// launch on device 1 and record eventB in streamB
kernel<<<..., ..., streamB>>>(...);
cudaEventRecord( eventB, streamB );

// back on device 0: wait for device 1's event, then launch into streamA
cudaSetDevice( 0 );
cudaEventSynchronize( eventB );
kernel<<<..., ..., streamA>>>(...);
Using multiple CPU threads
• In case of multiple CPU threads of the same process
◦ GPU handling is the same as in a single-threaded environment
◦ Every thread can select the current device
◦ Every thread can communicate with any GPU
◦ The process has a single address space, which all of its threads can access
(a sketch of the one-thread-per-GPU pattern follows this list)
• In case of multiple processes
◦ Processes have their own memory address spaces
◦ It is as if the processes were running on different nodes
◦ Therefore some CPU-side messaging is needed (e.g. MPI)
• In case of different nodes
◦ The CPUs have to handle the communication
◦ From the GPUs' perspective it is the same as the single-node case
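• A hedged sketch of the one-thread-per-GPU pattern using OpenMP (the Work
kernel is illustrative only, and the code assumes at least one GPU per thread;
compile with nvcc -Xcompiler -fopenmp, or /openmp with Visual Studio):

#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>

__global__ void Work(float *p) { p[threadIdx.x] += 1.0f; }

int main()
{
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

    // one CPU thread per GPU; all threads share the process address space
    #pragma omp parallel num_threads(deviceCount)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);              // each thread selects its own current device

        float *d;
        cudaMalloc(&d, 256 * sizeof(float));
        cudaMemset(d, 0, 256 * sizeof(float));
        Work<<<1, 256>>>(d);
        cudaDeviceSynchronize();         // waits only for this thread's device

        printf("thread %d finished on device %d\n", dev, dev);
        cudaFree(d);
    }
    return 0;
}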
Vector multiplication with multiple GPUs - kernel
• Simple kernel to multiply all items in the array by 2
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

#define N 100
#define blockN 10
#define MaxDeviceCount 4

__global__ static void VectorMul(float *A, int NperD) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < NperD) {
        A[i] = A[i] * 2;
    }
}
Vector multiplication with multiple GPUs – memory allocation
• Get information about devices and allocate memory in all devices
int main(int argc, char* argv[]) {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Available devices:\n");
    cudaDeviceProp properties[MaxDeviceCount];
    for (int di = 0; di < deviceCount; di++) {
        cudaGetDeviceProperties(&properties[di], di);
        printf("'%d' - %s\n", di, properties[di].name);
    }

    float A[N], oldA[N];
    for (int i = 0; i < N; i++) {
        A[i] = i; oldA[i] = A[i];
    }

    int NperD = N / deviceCount;
    float* devA[MaxDeviceCount];
    for (int di = 0; di < deviceCount; di++) {
        cudaSetDevice(di);
        cudaMalloc((void**)&devA[di], sizeof(float) * NperD);
    }
Vector multiplication with multiple GPUs – kernel invocation
• Select one of the devices
• Copy the appropriate part of the input array to the device
• Start a kernel in the selected device
• Copy back the results to the host memory
• Repeat the steps above for every device
• After this, synchronize the devices
(the listing below uses the blocking cudaMemcpy; an asynchronous variant is sketched after the listing)

for (int di = 0; di < deviceCount; di++) {
    cudaSetDevice(di);
    cudaMemcpy(devA[di], &A[di * NperD], sizeof(float) * NperD,
               cudaMemcpyHostToDevice);
    dim3 grid((NperD - 1) / blockN + 1);
    dim3 block(blockN);
    VectorMul<<<grid, block>>>(devA[di], NperD);
    cudaMemcpy(&A[di * NperD], devA[di], sizeof(float) * NperD,
               cudaMemcpyDeviceToHost);
}
cudaThreadSynchronize();  // waits for the current device; the blocking cudaMemcpy calls above already ensure completion
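• A hedged sketch of the asynchronous variant mentioned in the bullets above. It
assumes the host array A was allocated as page-locked memory (e.g. with
cudaHostAlloc) and uses one stream per device, so copies and kernels of the
different devices can overlap. It replaces the loop and the final synchronization
of the listing above:

cudaStream_t streams[MaxDeviceCount];
for (int di = 0; di < deviceCount; di++) {
    cudaSetDevice(di);
    cudaStreamCreate(&streams[di]);
    cudaMemcpyAsync(devA[di], &A[di * NperD], sizeof(float) * NperD,
                    cudaMemcpyHostToDevice, streams[di]);
    dim3 grid((NperD - 1) / blockN + 1);
    dim3 block(blockN);
    VectorMul<<<grid, block, 0, streams[di]>>>(devA[di], NperD);
    cudaMemcpyAsync(&A[di * NperD], devA[di], sizeof(float) * NperD,
                    cudaMemcpyDeviceToHost, streams[di]);
}
for (int di = 0; di < deviceCount; di++) {
    cudaSetDevice(di);
    cudaStreamSynchronize(streams[di]);  // wait for this device's copies and kernel
    cudaStreamDestroy(streams[di]);
}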
Vector multiplication with multiple GPUs – cleanup and results
• Free all memory objects in the devices
• Print out the results

for (int di = 0; di < deviceCount; di++) {
    cudaFree(devA[di]);
}
for (int i = 0; i < N; i++) {
    printf("A[%d] = \t%f\t%f\n", i, oldA[i], A[i]);
}
6. CUDA VERSIONS
6.2 CUDA 5 features
CUDA 5.0 features [26]
Dynamic Parallelism
• GPU threads can dynamically spawn new threads, allowing the GPU to adapt to
the data. By minimizing round trips between the GPU and the CPU, dynamic
parallelism greatly simplifies parallel programming, and it enables GPU
acceleration of a broader set of popular algorithms, such as those used in
adaptive mesh refinement and computational fluid dynamics applications.
GPU-Callable Libraries
• A new CUDA BLAS library allows developers to use dynamic parallelism for
their own GPU-callable libraries. They can design plug-in APIs that allow other
developers to extend the functionality of their kernels, and allow them to
implement callbacks on the GPU to customize the functionality of third-party
GPU-callable libraries.
• The “object linking” capability provides an efficient and familiar process for
developing large GPU applications by enabling developers to compile multiple
CUDA source files into separate object files and link them into larger
applications and libraries (a minimal sketch of this workflow follows this list)
GPUDirect Support for RDMA
• Enables direct communication between GPUs and other PCI-E devices, and
supports direct memory access between network interface cards and the GPU.
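• A minimal sketch of the object linking (separate compilation) workflow; the file
names and the scale() helper are made up for this illustration (object files are
.obj instead of .o when building with Visual Studio):

// helper.cu  –  compiled separately:  nvcc -arch=sm_35 -dc helper.cu
__device__ float scale(float x)
{
    return 2.0f * x;
}

// main.cu  –  compiled separately:  nvcc -arch=sm_35 -dc main.cu
// then linked (device link is done by nvcc):  nvcc -arch=sm_35 helper.o main.o -o app
extern __device__ float scale(float x);

__global__ void ScaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = scale(data[i]);   // the device function is resolved at device link time
    }
}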
Dynamic parallelism
__device__ float buf[1024];

__global__ void dynamic(float *data)
{
    int tid = threadIdx.x;
    if (tid % 2)
        buf[tid/2] = data[tid] + data[tid+1];
    __syncthreads();
    if (tid == 0) {
        // child kernel launched from the device ("launch" is a __global__ kernel defined elsewhere)
        launch<<< 128, 256 >>>(buf);
        cudaDeviceSynchronize();   // waits for the child grid launched by this thread
    }
    __syncthreads();
    // device-side copy (size and kind parameters added here to make the call complete)
    cudaMemcpyAsync(data, buf, 1024 * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();
}

Dynamic parallelism example [27]
• Programmer can use a kernel launch <<< >>> inside any kernel
• Launch is per-thread
• __syncthreads() includes all launches by any thread in the block
7. References
References
[1] Wikipedia – Graphics processing unit
http://en.wikipedia.org/wiki/Graphics_processing_unit
[2] Wikipedia – Shader
http://en.wikipedia.org/wiki/Shader
[3] S. Patidar, S. Bhattacharjee, J. M. Singh, P. J. Narayanan: Exploiting the Shader
Model 4.0 Architecture
http://researchweb.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf
[4] Wikipedia – Unified shader model
http://en.wikipedia.org/wiki/Unified_shader_model
[5] CUDA Programming Guide
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf
[6] S. Baxter: GPU Performance
http://www.moderngpu.com/intro/performance.html
[7] K. Fatahalian: From Shader Code to a Teraflop: How Shader Cores Work
http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf
References (2)
[8] CUDA tutorial 4 – Atomic Operations
http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations
[9] Developing a Linux Kernel Module using RDMA for GPUDirect
http://www.moderngpu.com/intro/performance.html
[10] T. C. Schroeder: Peer-to-Peer & Unified Virtual Addressing
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_GPUDirect_uva.pdf
[11] CUDA C Programming Guide
http://docs.nvidia.com/cuda/cuda-c-programming-guide
[12] S. Rennich: CUDA C/C++ Streams and Concurrency
http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
[13] P. Micikevicius: Multi-GPU Programming
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_multi_gpu.pdf
[14] Wikipedia – Basic Linear Algebra Subprograms
http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
References (3)
[15] NVIDIA CUBLAS
http://developer.nvidia.com/cublas
[16] CUBLAS Library
http://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cudadoc/CUBLAS_Library.pdf
[17] J. Luitjens, S. Rennich: CUDA Warps and Occupancy
http://developer.download.nvidia.com/CUDA/training/cuda_webinars_WarpsAndOccupancy.pdf
[18] C. Zeller: CUDA Performance
http://gpgpu.org/static/s2007/slides/09-CUDA-performance.pdf
[19] NVIDIA’s Next Generation: Fermi
http://www.nvidia.pl/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[20] Tom R. Halfhill: Parallel Processing with CUDA
http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf
References (4)
[21] David Kirk, Wen-Mei Hwu: Programming Massively Parallel Processors courses
http://courses.ece.uiuc.edu/ece498/al/
[22] NVIDIA Nsight Visual Studio Edition 2.2 User Guide
http://http.developer.nvidia.com/NsightVisualStudio/2.2/Documentation/UserGuide/HTML/Nsight_Visual_Studio_Edition_User_Guide.htm
[23] Memory Consistency
http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf
[25] SIMD < SIMT < SMT: parallelism in NVIDIA GPUs
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
[26] CUDA 5.0 production released
http://gpuscience.com/software/cuda-5-0-production-released/
[27] S. Jones: Introduction to Dynamic Parallelism
http://on-demand.gputechconf.com/gtc/2012/presentations/S0338-GTC2012-CUDA-Programming-Model.pdf