CUDA Programming
1
Basics of CUDA Programming
Weijun Xiao
Department of Electrical and Computer Engineering
University of Minnesota
2
Outline
• What’s GPU computing?
• CUDA programming model
• Basic Memory Management
• Basic Kernels and Execution
• CPU and GPU Coordination
• CUDA debugging and profiling
• Conclusions
3
What is a GPU?
• Graphics Processing Unit
[Figure: logical representation of visual information → GPU → output signal]
4
Performance Gap between GPUs and CPUs
5
GPU = Fast Parallel Machine
• GPU speed is increasing at a faster pace than Moore’s Law.
• This is a consequence of the data-parallel streaming aspects of the GPU.
• The gaming market stimulates the development of GPUs.
• GPUs are cheap! Put enough together, and you can get a super-computer.
So can we use the GPU for general-purpose computing?
6
Sure, thousands of Applications
• Large matrix/vector operations (BLAS)
• Protein Folding (Molecular Dynamics)
• FFT (signal processing)
• VMD (Visual Molecular Dynamics)
• Speech Recognition (Hidden Markov Models, Neural nets)
• Databases
• Sort/Search
• Storage
• MRI
• …
7
Why are We Interested in GPU?
• High-Performance Computing
• High Parallelism
• Low Cost
• GPUs are Programmable
• GPGPU
8
Growth and Development of GPU
• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Before CUDA, programmed through graphics API
– GPU in every PC and workstation – massive volume and potential impact
[Figure: GFLOPS over time, GPUs vs. CPUs. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
9
GeForce 8800
16 highly threaded MPs, 128 cores, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
[Figure: GeForce 8800 architecture — Host, Input Assembler, Thread Execution Manager, texture units, parallel data caches, load/store units, Global Memory]
10
Tesla C2050
14 MPs, 448 cores, 1.03 TFLOPS single precision / 515 GFLOPS double precision, 3 GB GDDR5 DRAM with ECC, 144 GB/s memory bandwidth, PCIe 2 x16 (8 GB/s bandwidth to CPU)
11
GPU Languages
• Assembly
• Cg (NVIDIA)
- C for Graphics
• GLSL (OpenGL)
- OpenGL Shading Language
• HLSL (Microsoft)
- High-level Shading language
• Brook C/C++ (AMD)
• CUDA (NVIDIA)
• OpenCL
12
How Did GPGPU Work before CUDA?
• Follow graphics pipeline
• Pretend to be graphics
• Take advantage of the massive parallelism of the GPU
• Disguise data as textures or geometry
• Disguise algorithms as render passes
• Fool the graphics pipeline into doing computation
13
CUDA Programming Model
• Compute Unified Device Architecture
• Simple and General-Purpose Programming
Model
• Standalone driver to load computation
programs into GPU
• Graphics-free API
• Data sharing with OpenGL buffer objects
• Easy to use, with a low learning curve
14
CUDA – C with no shader
limitations!
• Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code
Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args);
...
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args);
...
15
CUDA Devices and Threads
• A compute device
  – Is a coprocessor to the CPU or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
  – Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
16
Extended C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image)
{
  __shared__ float region[M];
  ...
  region[threadIdx] = image[i];
  ...
  __syncthreads()
  ...
  image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes)

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
17
Compiling a CUDA Program
• NVCC splits a C/C++ CUDA application into CPU code and PTX code
• Parallel Thread eXecution (PTX)
  – Virtual machine and ISA
  – Programming model
  – Execution resources and state
• A PTX-to-target compiler then generates target code for the physical GPU (G80, …)

Example: the source
  float4 me = gx[gtid];
  me.x += me.y * me.z;
compiles to PTX such as
  ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
  mad.f32           $f1, $f5, $f3, $f1;
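For a concrete picture (commands assumed as an illustration, not from the slides), a typical build and a way to inspect the PTX that NVCC generates:

  nvcc -arch=sm_20 -o myapp myapp.cu            # compile host + device code into one executable
  nvcc -arch=sm_20 -ptx myapp.cu -o myapp.ptx   # emit the intermediate PTX for inspection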
18
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID: 0 1 2 3 4 5 6 7 …
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
  …
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
19
Thread Blocks: Scalable
Cooperation
• Divide monolithic thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (a sketch follows at the end of this slide)
  – Threads in different blocks cannot cooperate
Thread Block 0, Thread Block 1, …, Thread Block N - 1 (each with threadID 0 … 7), all running:
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
  …
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
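A minimal sketch (assumed example, not from the slides) of what intra-block cooperation looks like: the threads of one block combine their inputs through shared memory, with __syncthreads() as the barrier between steps.

__global__ void block_sum(const float *in, float *block_totals)
{
  __shared__ float partial[256];                  // assumes blockDim.x == 256 (a power of two)
  int tid = threadIdx.x;
  partial[tid] = in[blockIdx.x * blockDim.x + tid];
  __syncthreads();

  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride)
      partial[tid] += partial[tid + stride];
    __syncthreads();                              // barrier before the next step
  }
  if (tid == 0)
    block_totals[blockIdx.x] = partial[0];        // one result per block
}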
20
Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
[Figure 3.2: An example of CUDA thread organization — the Host launches Kernel 1 on Grid 1 with Blocks (0,0), (1,0), (0,1), (1,1), and Kernel 2 on Grid 2; Block (1,1) contains Threads (0,0,0) through (3,1,0). Courtesy: NVIDIA]
21
CUDA Memory Model
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long latency access
[Figure: Host reads/writes device Global Memory; within the Grid, each Block (0,0) and (1,0) has its own Shared Memory, and each Thread (0,0), (1,0) has its own Registers]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
22
Basic Memory Management
23
Memory Spaces
• CPU and GPU have separate memory spaces
– Data is moved across PCIe bus
– Use functions to allocate/set/copy memory on GPU
• Very similar to corresponding C functions
• Pointers are just addresses
– Can’t tell from the pointer value whether the address is on
CPU or GPU
– Must exercise care when dereferencing:
• Dereferencing CPU pointer on GPU will likely crash
• Same for vice versa
24
GPU Memory Allocation / Release
• Host (CPU) manages device (GPU) memory:
– cudaMalloc (void ** pointer, size_t nbytes)
– cudaMemset (void * pointer, int value, size_t count)
– cudaFree (void* pointer)
int n = 1024;
int nbytes = 1024*sizeof(int);
int * d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes);
cudaFree(d_a);
25
Data Copies
• cudaMemcpy( void *dst, void *src, size_t nbytes,
enum cudaMemcpyKind direction);
– returns after the copy is complete
– blocks CPU thread until all bytes have been copied
– doesn’t start copying until previous CUDA calls complete
• enum cudaMemcpyKind
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice
• Non-blocking memcopies are provided
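A brief sketch of the non-blocking variant (assumed example; h_a, d_a, nbytes are placeholder names): cudaMemcpyAsync returns immediately and orders the copy within a stream; pinned host memory is needed for the copy to be truly asynchronous.

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMallocHost((void**)&h_a, nbytes);             // pinned (page-locked) host allocation
  cudaMemcpyAsync(d_a, h_a, nbytes,
                  cudaMemcpyHostToDevice, stream);  // returns to the CPU immediately
  // ... independent CPU work can proceed here ...
  cudaStreamSynchronize(stream);                    // wait for the copy to finish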
26
Code Walkthrough 1
• Allocate CPU memory for n integers
• Allocate GPU memory for n integers
• Initialize GPU memory to 0s
• Copy from GPU to CPU
• Print the values
27
Code Walkthrough 1
#include <stdio.h>
int main()
{
int dimx = 16;
int num_bytes = dimx*sizeof(int);
int *d_a=0, *h_a=0; // device and host pointers
28
Code Walkthrough 1
#include <stdio.h>
int main()
{
int dimx = 16;
int num_bytes = dimx*sizeof(int);
int *d_a=0, *h_a=0; // device and host pointers
h_a = (int*)malloc(num_bytes);
cudaMalloc( (void**)&d_a, num_bytes );
if( 0==h_a || 0==d_a )
{
printf("couldn't allocate memory\n");
return 1;
}
29
Code Walkthrough 1
#include <stdio.h>
int main()
{
int dimx = 16;
int num_bytes = dimx*sizeof(int);
int *d_a=0, *h_a=0; // device and host pointers
h_a = (int*)malloc(num_bytes);
cudaMalloc( (void**)&d_a, num_bytes );
if( 0==h_a || 0==d_a )
{
printf("couldn't allocate memory\n");
return 1;
}
cudaMemset( d_a, 0, num_bytes );
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
30
Code Walkthrough 1
#include <stdio.h>
int main()
{
int dimx = 16;
int num_bytes = dimx*sizeof(int);
int *d_a=0, *h_a=0; // device and host pointers
h_a = (int*)malloc(num_bytes);
cudaMalloc( (void**)&d_a, num_bytes );
if( 0==h_a || 0==d_a )
{
printf("couldn't allocate memory\n");
return 1;
}
cudaMemset( d_a, 0, num_bytes );
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
for(int i=0; i<dimx; i++)
printf("%d ", h_a[i] );
printf("\n");
free( h_a );
cudaFree( d_a );
return 0;
}
31
Basic Kernels and Execution
on GPU
32
CUDA Function Declarations
                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()        device               device
__global__ void  KernelFunc()        device               host
__host__   float HostFunc()          host                 host

• __global__ defines a kernel function
  – Must return void
• __device__ and __host__ can be used together
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
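As a small illustration (assumed example, not from the slides), combining __device__ and __host__ lets one helper be compiled for both sides; square and scale_all are hypothetical names.

__host__ __device__ float square(float x)
{
  return x * x;                             // same source, compiled for CPU and GPU
}

__global__ void scale_all(float *a)         // kernel: runs on device, launched from host
{
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  a[idx] = square(a[idx]);                  // calls the device-compiled version
}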
33
CUDA Function Declarations (cont.)
• __device__ functions cannot have their address taken
• For functions executed on the device:
  – Can only access GPU memory
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
34
Code Walkthrough 2
• Build on Walkthrough 1
• Write a kernel to initialize integers
• Copy the result back to CPU
• Print the values
35
Kernel Code (executed on GPU)
__global__ void kernel( int *a )
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
a[idx] = 7;
}
36
Launching kernels on GPU
• Launch parameters:
– grid dimensions (up to 2D), dim3 type
– thread-block dimensions (up to 3D), dim3 type
– shared memory: number of bytes per block
• for extern smem variables declared without size
• Optional, 0 by default
– stream ID
• Optional, 0 by default
dim3 grid(16, 16);
dim3 block(16,16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);
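A minimal sketch (assumed example, not from the slides) of the third launch parameter: it sets the number of bytes backing an extern __shared__ array declared without a size.

__global__ void staged_copy(int *out, const int *in)
{
  extern __shared__ int tile[];                     // size fixed at launch time
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  tile[threadIdx.x] = in[idx];
  __syncthreads();
  out[idx] = tile[threadIdx.x];
}

// One int of shared memory per thread in the block:
// staged_copy<<<grid, block, block.x * sizeof(int)>>>(d_out, d_in);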
37
#include <stdio.h>
__global__ void kernel( int *a )
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
a[idx] = 7;
}
int main()
{
int dimx = 16;
int num_bytes = dimx*sizeof(int);
int *d_a=0, *h_a=0; // device and host pointers
h_a = (int*)malloc(num_bytes);
cudaMalloc( (void**)&d_a, num_bytes );
if( 0==h_a || 0==d_a )
{
printf("couldn't allocate memory\n");
return 1;
}
cudaMemset( d_a, 0, num_bytes );
dim3 grid, block;
block.x = 4;
grid.x = dimx / block.x;
kernel<<<grid, block>>>( d_a );
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
for(int i=0; i<dimx; i++)
printf("%d ", h_a[i] );
printf("\n");
free( h_a );
cudaFree( d_a );
return 0;
}
38
Kernel Variations and Output
__global__ void kernel( int *a )
{
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
39
Code Walkthrough 3
• Build on Walkthrough 2
• Write a kernel to increment n×m integers
• Copy the result back to CPU
• Print the values
40
Kernel with 2D Indexing
__global__ void kernel( int *a, int dimx, int dimy )
{
int ix = blockIdx.x*blockDim.x + threadIdx.x;
int iy = blockIdx.y*blockDim.y + threadIdx.y;
int idx = iy*dimx + ix;
a[idx] = a[idx]+1;
}
41
__global__ void kernel( int *a, int dimx, int dimy )
{
  int ix = blockIdx.x*blockDim.x + threadIdx.x;
  int iy = blockIdx.y*blockDim.y + threadIdx.y;
  int idx = iy*dimx + ix;
  a[idx] = a[idx]+1;
}

int main()
{
  int dimx = 16;
  int dimy = 16;
  int num_bytes = dimx*dimy*sizeof(int);
  int *d_a=0, *h_a=0; // device and host pointers
  h_a = (int*)malloc(num_bytes);
  cudaMalloc( (void**)&d_a, num_bytes );
  if( 0==h_a || 0==d_a )
  {
    printf("couldn't allocate memory\n");
    return 1;
  }
  cudaMemset( d_a, 0, num_bytes );
  dim3 grid, block;
  block.x = 4;
  block.y = 4;
  grid.x = dimx / block.x;
  grid.y = dimy / block.y;
  kernel<<<grid, block>>>( d_a, dimx, dimy );
  cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
  for(int row=0; row<dimy; row++)
  {
    for(int col=0; col<dimx; col++)
      printf("%d ", h_a[row*dimx+col] );
    printf("\n");
  }
  free( h_a );
  cudaFree( d_a );
  return 0;
}
42
Blocks must be independent
• Any possible interleaving of blocks should be valid
– presumed to run to completion without pre-emption
– can run in any order
– can run concurrently OR sequentially
• Blocks may coordinate but not synchronize
– shared queue pointer: OK
– shared lock: BAD … can easily deadlock
• Independence requirement gives scalability
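A minimal sketch (assumed example, not from the slides) of the "shared queue pointer" style of coordination: blocks claim work items through an atomic counter, so no ordering between blocks is ever assumed.

__device__ unsigned int next_item = 0;             // global queue pointer

__global__ void process_queue(float *items, unsigned int n)
{
  while (true) {
    unsigned int i = atomicAdd(&next_item, 1);     // claim the next unprocessed item
    if (i >= n)
      return;                                      // queue drained
    items[i] *= 2.0f;                              // stand-in for real work
  }
}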
43
Blocks must be independent
• Thread blocks can run in any order
– Concurrently or sequentially
– Facilitates scaling of the same code across many devices
[Figure: scalability]
44
Coordinating CPU and GPU
Execution
45
Synchronizing GPU and CPU
• All kernel launches are asynchronous
– control returns to CPU immediately
– kernel starts executing once all previous CUDA calls
have completed
• Memcopies are synchronous
– control returns to CPU once the copy is complete
– copy starts once all previous CUDA calls have
completed
• cudaThreadSynchronize()
– blocks until all previous CUDA calls complete
• Asynchronous CUDA calls provide:
– non-blocking memcopies
– ability to overlap memcopies and kernel execution
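A short sketch of the pattern these rules imply (assumed example; kernel, d_data, h_data, nbytes, do_cpu_work are placeholder names):

  kernel<<<grid, block>>>(d_data);          // asynchronous: control returns at once
  do_cpu_work();                            // overlaps with kernel execution
  cudaMemcpy(h_data, d_data, nbytes,
             cudaMemcpyDeviceToHost);       // waits for the kernel, then blocks the CPU
  // or, to wait explicitly without copying:
  cudaThreadSynchronize();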
46
CUDA Error Reporting to CPU
• All CUDA calls return error code:
– except kernel launches
– cudaError_t type
• cudaError_t cudaGetLastError(void)
– returns the code for the last error (“no error” has a code)
• char* cudaGetErrorString(cudaError_t code)
– returns a null-terminated character string describing the
error
printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
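A common convenience (a sketch, not part of the slides) is to wrap each runtime call so failures are reported with file and line; CUDA_CHECK is a hypothetical macro built only on the calls above (assumes <stdio.h> and <stdlib.h>).

#define CUDA_CHECK(call)                                    \
  do {                                                      \
    cudaError_t err = (call);                               \
    if (err != cudaSuccess) {                               \
      printf("%s:%d: %s\n", __FILE__, __LINE__,             \
             cudaGetErrorString(err));                      \
      exit(1);                                              \
    }                                                       \
  } while (0)

// Usage: CUDA_CHECK( cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost) );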
47
Device Management
• CPU can query and select GPU devices
– cudaGetDeviceCount( int* count )
– cudaSetDevice( int device )
– cudaGetDevice( int *current_device )
– cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
– cudaChooseDevice( int *device, cudaDeviceProp* prop )
• Multi-GPU setup:
– device 0 is used by default
– one CPU thread can control one GPU
• multiple CPU threads can control the same GPU
– calls are serialized by the driver
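A minimal sketch (assumed example) that enumerates devices with the calls listed above and then selects one:

  int count = 0;
  cudaGetDeviceCount(&count);
  for (int dev = 0; dev < count; dev++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("device %d: %s, %d multiprocessors\n",
           dev, prop.name, prop.multiProcessorCount);
  }
  cudaSetDevice(0);                         // device 0 is the default anyway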
48
CUDA Debugging and Profiling
49
What’s cuda-gdb?
• All-in-one debugging tool
• Host and CUDA codes
• Extension to Linux gdb
• 32/64-bit Linux
• 4.0 release
50
Debug Compilation
• Compile with -g -G
  nvcc -g -G foo.cu -o foo
• Fermi
  -gencode arch=compute_20,code=sm_20
• Makefile
• CUDA-GDB error: undefined reference to '$gpu_registers' (2.2 beta or previous)
• Ptxvars.cu
  nvcc "/usr/local/cuda/bin/ptxvars.cu" -g -G --host-compilation=c -c -definealways-macro _DEVICE_LAUNCH_PARAMETERS_H__ -Xptxas -fext
51
Extension to GDB
• Debug both host and GPU code seamlessly
• GPU memory is treated as an extension of host memory
• GPU threads and blocks are treated as extensions of host threads
• Breakpoints at any host and/or device function symbol or source file line number
• Single-step individual warps
52
Debug commands
• thread <<<(BX,BY),(TX,TY,TZ)>>>
  thread <<<170>>>
  thread <<<2,(10,10)>>>
• cuda block (n,m) thread (x,y,z)
• info cuda state (replace state with devices, kernels, system, warp, sm, …)
53
Debugging Commands(cont.)
• break
• print
• continue
• next
• step
• quit
• set args
…
• GDB quick reference
http://users.ece.utexas.edu/~adnan/gdb-refcard.pdf
54
Example code
• 8-bit bit reverse
• 00011101 -> 10111000
• 10010111 -> 11101001
55
Algorithms
r = 0;
for (int i = 0; i < 8; i++) {
  r = r << 1;
  if (x % 2) r += 1;
  x = x >> 1;
}

x = (((0xf0 & x) >> 4) | ((0x0f & x) << 4));
x = (((0xcc & x) >> 2) | ((0x33 & x) << 2));
x = (((0xaa & x) >> 1) | ((0x55 & x) << 1));
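A small host-only sketch (assumed example, hypothetical names) that runs both versions above on every 8-bit value and checks that they agree:

#include <stdio.h>

unsigned char reverse_loop(unsigned char x)
{
  unsigned char r = 0;
  for (int i = 0; i < 8; i++) {             // shift-and-test version
    r = (unsigned char)(r << 1);
    if (x % 2) r += 1;
    x >>= 1;
  }
  return r;
}

unsigned char reverse_masks(unsigned char x)
{
  x = (unsigned char)(((0xf0 & x) >> 4) | ((0x0f & x) << 4));
  x = (unsigned char)(((0xcc & x) >> 2) | ((0x33 & x) << 2));
  x = (unsigned char)(((0xaa & x) >> 1) | ((0x55 & x) << 1));
  return x;
}

int main(void)
{
  for (int v = 0; v < 256; v++)
    if (reverse_loop((unsigned char)v) != reverse_masks((unsigned char)v))
      printf("mismatch at %d\n", v);
  printf("0x1d -> 0x%02x\n", reverse_masks(0x1d));  // 00011101 -> 10111000 (0xb8)
  return 0;
}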
56
Code
 1 #include <stdio.h>
 2 #include <stdlib.h>
 3
 4 // Simple 8-bit bit reversal compute test
 5
 6 #define N 256
 7
 8 __global__ void bitreverse(unsigned int *data)
 9 {
10   unsigned int *idata = data;
11
12   unsigned int x = idata[threadIdx.x];
13
14   x = ((0xf0f0f0f0 & x) >> 4) | ((0x0f0f0f0f & x) << 4);
15   x = ((0xcccccccc & x) >> 2) | ((0x33333333 & x) << 2);
16   x = ((0xaaaaaaaa & x) >> 1) | ((0x55555555 & x) << 1);
17
18   idata[threadIdx.x] = x;
19 }
20
21 int main(void)
22 {
23   unsigned int *d = NULL; int i;
24   unsigned int idata[N], odata[N];
25
26   for (i = 0; i < N; i++)
27     idata[i] = (unsigned int)i;
28
29   cudaMalloc((void**)&d, sizeof(int)*N);
30   cudaMemcpy(d, idata, sizeof(int)*N,
31              cudaMemcpyHostToDevice);
32
33   bitreverse<<<1, N>>>(d);
34
35   cudaMemcpy(odata, d, sizeof(int)*N,
36              cudaMemcpyDeviceToHost);
37
38   for (i = 0; i < N; i++)
39     printf("%u -> %u\n", idata[i], odata[i]);
40
41   cudaFree((void*)d);
42   return 0;
43 }
57
Cuda-gdb supported platform
• Host platform
  – X11 cannot be running on the GPU used for debugging
  – One GPU: disable X11
  – Two or more GPUs: run X11 on a GPU that is not used for debugging
• GPU requirements
  – All CUDA-enabled GPUs except 8800GTS, 8800GTX, 8800 Ultra, FX4600, and FX5600
58
Debugging example code
• Step 1
nvcc –g –G bitreverse.cu –o bitreverse
• Step 2
cuda-gdb ./bitreverse
• Step 3
Set breakpoints (break main, break bitreverse, break 18)
• Step 4
Run CUDA application
(cuda-gdb) run
• Step 5
Continue and watch variables
(cuda-gdb) continue
(cuda-gdb) thread
(cuda-gdb) print x
59
Profiling Tools
• CUDA memcheck
• Occupancy Calculator
• Visual Profiler
60
CUDA Visual Profiler
61
CUDA Counters
62
Profiler Counters for Fermi
• branch, divergent branch
• instruction issued, instruction executed
• sm cta launched
• gld request, gst request
• local load, local store
• share load, share store
• warps launched, threads launched
• l1 global load hit, l1 global load miss
• l1 local load hit, l1 local load miss
• l1 local store hit, l1 local store miss
• l1 share bank conflicts
• uncached global load transaction
• global store transaction
• l2 read requests, l2 write requests
• l2 read misses, l2 write misses
• dram reads, dram writes
• tex cache requests, tex cache misses
63
Memory throughput
• Compute capability < 2.0
  Global read throughput  = ((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC / gputime
  Global write throughput = ((gst_32*32) + (gst_64*64) + (gst_128*128)) * TPC / gputime
• Compute capability >= 2.0
  Global read throughput  = (dram reads * 32) / gputime
  Global write throughput = (dram writes * 32) / gputime
• Gmem overall throughput = read throughput + write throughput
• Tesla C2050: theoretical bandwidth 144 GB/s
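As an illustration with made-up numbers: if a profile reports dram reads = 1,000,000 and dram writes = 500,000 over gputime = 1,000 µs, then read throughput = 1,000,000 × 32 B / 1,000 µs = 32 GB/s and write throughput = 16 GB/s, so overall global-memory throughput ≈ 48 GB/s, about one third of the C2050's 144 GB/s theoretical peak.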
64
Conclusions
• GPU as an accelerator for HPC
• CUDA programming model
• CUDA threads and kernels
• CUDA example codes
• CUDA debugging and profiling