GPU and CUDA

GPU COMPUTING
Introduction
Available processing power (2014)
- Nvidia Tesla K20X: single prec. 3950 GFLOPS, double prec. 1310 GFLOPS
- Intel Core i7-3900: 187 GFLOPS

Available processing power (today)
- Nvidia Tesla P100 (Pascal): single prec. 10600 GFLOPS, double prec. 5300 GFLOPS
- Intel Xeon E7-8870 v3: single prec. ~1720 GFLOPS, double prec. ~860 GFLOPS
CPU vs. GPU
GeForce 8800 (2007)
- 16 highly threaded Streaming Multiprocessors (SMs), 128 Streaming Processors (SPs)
- 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
[Figure: G80 block diagram - Host, Input Assembler, Thread Execution Manager, SMs with Parallel Data Caches and Texture units, Load/store units, Global Memory]
G80 Characteristics
- 367 GFLOPS peak performance (25-50 times that of contemporary high-end microprocessors)
- 265 GFLOPS sustained
- Massively parallel, 128 cores, 90 W
- Massively threaded, sustains 1000s of threads per app
- 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publicly until I triple check those numbers."
- John Stone, VMD group, Physics UIUC
2010 - Fermi Architecture
- ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
- 230 GB/s DRAM bandwidth
- 16 SM x 32 SP = 512 cores
- (and a true cache)
2012 – Kepler Architecture
- ~3.9 TFLOPS (SP) / ~1300 GFLOPS (DP)
- 250 GB/s DRAM bandwidth
- 15 SM x (192 single prec. + 64 double prec.) SP = 3840 "cores"
- (dynamic parallelism)
Nvidia Pascal GP100 GPU Architecture
- An array of 6 Graphics Processing Clusters (GPCs)
- Each GPC has 10 Streaming Multiprocessors (SMs)
- Each SM has 64 CUDA Cores
- 3840 single precision CUDA Cores in total
- A total of 4096 KB of L2 cache
Future Apps Reflect a Concurrent World
- Exciting applications in future mass computing markets have traditionally been considered "supercomputing applications"
  - Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
  - These "super-apps" represent and model the physical, concurrent world
- Various granularities of parallelism exist, but...
  - the programming model must not hinder parallel implementation
  - data delivery needs careful management
Stretching Traditional Architectures
- Traditional parallel architectures cover some super-applications
  - DSP, GPU, network apps, scientific computing
- The game is to grow mainstream architectures "out" or domain-specific architectures "in"
  - CUDA is the latter
[Figure: traditional applications covered by current architectures vs. new applications covered by domain-specific architectures, with obstacles in between]
CUDA Programming Model
Before CUDA
- Dealing with graphics API
  - Working with the corner cases of the graphics API
- Addressing modes
  - Limited texture size/dimension
- Shader capabilities
  - Limited outputs
- Instruction sets
  - Lack of integer & bit ops
- Communication limited
  - Between pixels
  - No scatter to memory locations
[Figure: pre-CUDA fragment shader model - per-thread Input Registers feed a Fragment Program with per-shader Texture, Constants and Temp Registers, writing Output Registers to per-context FB Memory]
CUDA
- "Compute Unified Device Architecture"
- General purpose programming model
  - User kicks off batches of threads on the GPU
  - GPU = dedicated super-threaded, massively data parallel co-processor
- Targeted software stack
  - Compute oriented drivers, language, and tools
- Driver for loading computation programs onto the GPU
  - Standalone driver, optimized for computation
  - Interface designed for compute: graphics-free API
  - Data sharing with OpenGL buffer objects
  - Guaranteed maximum download & readback speeds
  - Explicit GPU memory management
CUDA basics
- The computing system consists of:
  - a HOST running serial or modestly parallel C code
  - one or more DEVICES running kernel C code, exploiting massive data parallelism
    - the program property whereby many arithmetic operations can be safely performed on the data structures simultaneously
[Figure: execution alternates between SERIAL CODE on the HOST and PARALLEL KERNELS on the DEVICES]
CUDA Devices and threads
- A compute device
  - is a coprocessor to the CPU or host
  - has its own DRAM (device memory)
  - runs many threads in parallel
  - is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - very little creation overhead
  - GPU needs 1000s of threads for full efficiency
    - a multi-core CPU needs only a few
The thread hierarchy (bottom-up)
- A KERNEL is a C function that, when called, is executed N times in parallel by N different CUDA THREADS.
- Threads are organized in BLOCKS:
  - Threads in the same block share the same processor and its resources.
  - On current GPUs, a block may contain at most 1024 threads.
  - Threads within a block can cooperate by sharing data through shared memory.
  - Threads have a 1/2/3-dimensional identifier: threadIdx.x, threadIdx.y, threadIdx.z
- Blocks are organized in GRIDS:
  - The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
  - A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice (see the indexing sketch after this list).
  - Blocks have a 1/2-dimensional identifier: blockIdx.x, blockIdx.y
  - Blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores (automatic scalability).
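A minimal sketch (not from the slides) of how a 16x16 block size is typically used: each thread derives its global row and column from blockIdx, blockDim and threadIdx, with a bounds check because the grid is rounded up. The kernel name and the scaling operation are illustrative assumptions.

__global__ void scaleMatrix(float* m, int rows, int cols, float factor)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    if (row < rows && col < cols)                      // guard: the grid may overshoot
        m[row * cols + col] *= factor;
}

// Host-side launch: enough 16x16 blocks to cover the whole matrix
// dim3 block(16, 16);
// dim3 grid((cols + 15) / 16, (rows + 15) / 16);
// scaleMatrix<<<grid, block>>>(d_m, rows, cols, 2.0f);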
The thread hierarchy (top-down)
- A GRID is a piece of work that can be executed by the GPU
  - A 2D array of BLOCKS
- A BLOCK is an independent sub-piece of work that can be executed in any order by a Streaming Multiprocessor (SM)
  - A 3D array of threads
  - The max number of threads in a block depends on the hardware.
- A THREAD is the minimal unit of work.
  - All the threads execute the same KERNEL function
  - Threads are grouped in WARPS of 32 for scheduling, i.e. they are the minimal units of scheduling
Fake “Hello World!” example
__global__ void kernel( void ) { }

int main( void ) {
    kernel<<<2,3>>>();
    printf( "Hello, World!\n" );
    return 0;
}

- __global__ declares a kernel function to be run on the GPU
- kernel<<<2,3>>>(); runs 2 blocks with 3 threads each, all executing the function kernel
Fake “Hello World!” example
__global__ void kernel( void ) { }

int main( void ) {
    // set the grid and block sizes
    dim3 dimGrid(3,1);    // a 3x1 array of blocks
    dim3 dimBlock(2,2,2); // a 2x2x2 array of threads
    // invoke the kernel
    kernel<<<dimGrid, dimBlock>>>();
    printf( "Hello, World!\n" );
    return 0;
}

- dim3 is the data type used to declare grid and block sizes
Memory
- Global Memory is the on-board device memory
  - Data transfers occur between Host memory and Global memory
  - It is accessible by any thread
  - It is (relatively) costly
  - Constant memory (64 KB) supports read-only access by the GPU, with short latency
- Shared Memory is shared by the threads in the same block
  - Provides fast access but is very limited in size
- Registers are private to threads
[Figure: memory hierarchy - Host memory; device Global Memory; per-block Shared Memory; per-thread Registers]
- cudaMalloc()
  - Allocates memory in the device Global Memory
  - Parameters: Address of a pointer to the allocated object, Size of the allocated object
- cudaFree()
  - Frees memory from device Global Memory
  - Parameters: Pointer to freed object
- cudaMemcpy()
  - Memory data transfer
  - Parameters: Pointer to destination, Pointer to source, Number of bytes copied, Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
  - Asynchronous transfer
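A minimal end-to-end sketch of these three calls (an illustration, not code from the slides; the buffer name and size are assumptions), including the error code that every CUDA runtime call returns:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 256;
    const size_t size = N * sizeof(float);
    float h_data[N];                          // host buffer
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;

    float* d_data = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_data, size);       // allocate device Global Memory
    if (err != cudaSuccess) { printf("%s\n", cudaGetErrorString(err)); return 1; }

    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels operating on d_data here ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_data);                         // release the device memory
    return 0;
}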
“Hello World!” example (1)
// Host function
int main(int argc, char** argv) {
    // desired output
    char str[] = "Hello World!";
    // mangle contents of output
    // the null character is left intact
    for(int i = 0; i < 12; i++)
        str[i] -= i;
    // allocate memory on the device
    char *d_str;
    int size = sizeof(str);
    cudaMalloc((void**)&d_str, size);
    // copy the string to the device
    cudaMemcpy(d_str, str, size, cudaMemcpyHostToDevice);
    // set the grid and block sizes
    dim3 dimGrid(3);  // 3 blocks
    dim3 dimBlock(4); // 4 threads
    // invoke the kernel
    helloWorld<<< dimGrid, dimBlock >>>(d_str);
    // retrieve the results
    cudaMemcpy(str, d_str, size, cudaMemcpyDeviceToHost);
    // free allocated memory
    cudaFree(d_str);
    printf("%s\n", str);
    return 0;
}
“Hello World!” example (2)
__global__
void helloWorld(char* str)
{
    // determine where in the thread grid we are
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // unmangle output
    str[idx] += idx;
}
Summary on threads, blocks, etc.
Function declaration                 Executed on the:    Only callable from the:
__device__ float DeviceFunc()        device              device
__global__ void  KernelFunc()        device              host
__host__   float HostFunc()          host                host

- __global__ defines a kernel function
- __device__ functions:
  - No recursion
  - No static variables inside the function
  - No variable number of arguments
- __device__ and __host__ can be used together (see the sketch after this list)
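A brief sketch (assumption, not from the slides) of combining __host__ and __device__, so the same helper compiles for both sides; clampf and clampKernel are hypothetical names:

__host__ __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void clampKernel(float* v, int n)   // device-side use of the helper
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = clampf(v[i], 0.0f, 1.0f);
}
// On the host, the same clampf() can be called like any ordinary C function.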
Invoking a kernel function
__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);    // 5000 blocks
dim3 DimBlock(4, 8, 8);   // 256 threads per block
KernelFunc<<< DimGrid, DimBlock >>>(...);

- Kernel calls are asynchronous (see the sketch after this list)
- gridDim.x/y, blockIdx.x/y, blockDim.x/y/z, threadIdx.x/y/z identify threads and blocks in a grid within a kernel function
- Sizes and limitations (Compute capability 1.0):
  - Warp size = 32; Threads per block = 512; Warps per SM = 24; Blocks per SM = 8; Threads per SM = 768
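Because kernel calls are asynchronous, the host must synchronize before relying on their results. A minimal sketch under that observation (the kernel and wrapper names are our assumptions):

__global__ void myKernel(int* data) { /* ... */ }

void launch(int* d_data)
{
    dim3 DimGrid(100, 50);      // 5000 blocks
    dim3 DimBlock(4, 8, 8);     // 256 threads per block
    myKernel<<<DimGrid, DimBlock>>>(d_data);    // the call returns immediately
    cudaDeviceSynchronize();    // block the host until the kernel has finished
}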
Automatic (Transparent) Scalability
- Do we need to take care of the device computing power (number of SMs)?
  - A Grid contains a set of independent blocks, which can be executed in any order.
  - No, because the block scheduler can re-arrange blocks accordingly.
[Figure: the same 8-block kernel grid scheduled over time on a device with 2 SMs (four waves of 2 blocks) and on a device with 4 SMs (two waves of 4 blocks)]
Vector add example
Objective
[Figure: the input vectors are partitioned among blocks, and within each block among threads]
Vector add (1)
// Device kernel
__global__
void blockthreadAdd( int* a,
                     int* b,
                     int* c)
{
    // position in the block/grid
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    c[i] = a[i] + b[i];
}

- Kernel definition
- Each thread handles a single cell of the input/output arrays
Vector add (2)
// Host function
int main(int argc, char** argv) {
    int *a, *b, *c;           // vectors
    int N = 1024;             // vectors length
    int size = N*sizeof(int); // memory size
    // allocate memory
    a = new int[N];
    b = new int[N];
    c = new int[N];
    // initialize
    for (int i=0; i<N; i++) {
        a[i] = rand() % 1000;
        b[i] = rand() % 1000;
    }

- Allocate memory on host
- Initialize with random numbers
Vector add (3)
    // allocate memory on device
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);
    // copy input on device
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    // launch kernels with 8 threads each
    blockthreadAdd<<<N/8,8>>>(dev_a, dev_b, dev_c);
    // copy result on host
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

- Allocate memory on device
- Copy input from host memory to device memory
- Invoke kernel: the number of threads depends on the input size
- Get results back
Vector add (4)
    // free memory on device
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    // free memory on host
    delete [] a;
    delete [] b;
    delete [] c;
    return 0;
}

- Clean up memory on device and on host
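The launch blockthreadAdd<<<N/8,8>>>(...) above assumes N is a multiple of the block size. A bounds-checked variant (a sketch, not from the slides) handles any N by rounding the number of blocks up and guarding the last block:

__global__ void blockthreadAddSafe(int* a, int* b, int* c, int n)
{
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < n)                 // guard the last, partially filled block
        c[i] = a[i] + b[i];
}

// Host side: round the number of blocks up instead of assuming divisibility
// blockthreadAddSafe<<<(N + 7) / 8, 8>>>(dev_a, dev_b, dev_c, N);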
. the end .

In order to compile this
- You must download and install the CUDA SDK
- Compile similarly as with g++:
  - nvcc helloworld.cu --gpu-architecture=sm_52
- Make sure to read the nvcc manual and choose the proper architecture
Dot product example
Objective
a[ |||||||||||||||||||||||| ]
* * * * * * * * * * * * * * * * * * * * * * * * *
b[ |||||||||||||||||||||||| ]
SUM
c[ ]
- How to split the work among blocks and threads?
  1. Every block should compute a partial sum
  2. Sum the partial sums
Pseudo-code
// Device kernel
__global__
void dotproduct( int* a, int* b, int* c) {
    // position in the block/grid
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    // compute the product and store it in block-shared memory
    // wait for all threads to compute their product
    // have one thread in the block compute the block sum
    // safely add the block sum to the global sum
}
Pseudo-code
#define THREADS_PER_BLOCK 512
// Device kernel
__global__
void dotproduct( int* a, int* b, int* c) {
    __shared__ int temp[THREADS_PER_BLOCK];
    // position in the block/grid
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    // compute the product and store it in block-shared memory
    temp[threadIdx.x] = a[i]*b[i];
    // wait for all threads to compute their product
    // have one thread in the block compute the block sum
    // safely add the block sum to the global sum
}
Pseudo-code
#define THREADS_PER_BLOCK 512
// Device kernel
__global__
void dotproduct( int* a, int* b, int* c) {
    __shared__ int temp[THREADS_PER_BLOCK];
    // position in the block/grid
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    // compute the product and store it in block-shared memory
    temp[threadIdx.x] = a[i]*b[i];
    // wait for all threads to compute their product
    __syncthreads();
    // have one thread in the block compute the block sum
    // safely add the block sum to the global sum
}
Pseudo-code
#define THREADS_PER_BLOCK 512
// Device kernel
__global__
void dotproduct( int* a, int* b, int* c) {
    __shared__ int temp[THREADS_PER_BLOCK];
    // position in the block/grid
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    // compute the product and store it in block-shared memory
    temp[threadIdx.x] = a[i]*b[i];
    // wait for all threads to compute their product
    __syncthreads();
    // have one thread in the block compute the block sum
    if (threadIdx.x==0) {
        int sum = 0;
        for (int j=0; j<THREADS_PER_BLOCK; j++)
            sum += temp[j];
    }
    // safely add the block sum to the global sum
}
Pseudo-code
#define THREADS_PER_BLOCK 512
// Device kernel
__global__
void dotproduct( int* a, int* b, int* c) {
    __shared__ int temp[THREADS_PER_BLOCK];
    // position in the block/grid
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    // compute the product and store it in block-shared memory
    temp[threadIdx.x] = a[i]*b[i];
    // wait for all threads to compute their product
    __syncthreads();
    // have one thread in the block compute the block sum
    if (threadIdx.x==0) {
        int sum = 0;
        for (int j=0; j<THREADS_PER_BLOCK; j++)
            sum += temp[j];
        // safely add the block sum to the global sum
        atomicAdd(c, sum);
    }
}
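A possible host-side launch for the kernel above. This is a sketch under assumptions not stated in the slides: N is a multiple of THREADS_PER_BLOCK, the device buffers are already allocated and filled, and *dev_c is cleared first because the kernel accumulates into it with atomicAdd().

void launchDotProduct(int* dev_a, int* dev_b, int* dev_c, int N)
{
    cudaMemset(dev_c, 0, sizeof(int));    // clear the global accumulator
    dotproduct<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    int result = 0;
    // cudaMemcpy blocks until the kernel has finished writing dev_c
    cudaMemcpy(&result, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dot product = %d\n", result);
}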
Summary
- Atomic functions
  - perform a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory
  - atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicAnd(), atomicOr(), atomicXor()
- __syncthreads()
  - Barrier for the threads in the same block, possibly in different warps.
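A minimal sketch (not from the slides) of another atomic from the list above: atomicMax() keeps the largest element proposed by any thread. The kernel name and the assumption that *result starts at INT_MIN are ours.

__global__ void arrayMax(const int* data, int n, int* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax(result, data[i]);   // read-modify-write on a word in global memory
}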
Memory (2)
- Each thread can:
  - Read/write per-thread registers (automatic variables)
  - Read/write per-block shared memory (__shared__ qualifier)
  - Read/write per-grid global memory (__device__ qualifier)
  - Read per-grid constant memory (__constant__ qualifier)
[Figure: memory hierarchy - Grid with Blocks (0,0) and (1,0); per-thread Registers, per-block Shared Memory, Global Memory, Constant Memory, Host]
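A short sketch (assumption, not from the slides) showing where each qualifier places a variable; launch with at most 256 threads per block.

__constant__ float coeff[16];          // per-grid constant memory, read-only on the GPU
__device__   int   blockCount;         // per-grid global memory

__global__ void demo(float* out)
{
    __shared__ float tile[256];        // per-block shared memory
    int i = threadIdx.x;               // automatic variables live in registers
    if (i == 0)
        atomicAdd(&blockCount, 1);     // one increment per block, in global memory
    tile[i] = coeff[i % 16];
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = tile[i];
}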
Advanced Thread and Memory Management
Fermi Dual Warp Scheduler (DWS)
- The SM schedules threads in groups of 32 parallel threads called warps
- Each SM features two warp schedulers and two instruction dispatch units
- The DWS selects two warps and issues one instruction from each warp to a group of sixteen cores
- The Kepler architecture provides 4 warp schedulers, each able to dispatch 2 independent instructions per cycle
Thread divergence
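A minimal sketch (assumption, not from the slides) of thread divergence: when threads of the same warp take different branches, the warp executes both paths serially with part of its threads masked off each time.

__global__ void divergent(int* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)          // even and odd lanes of the same warp diverge
        data[i] *= 2;
    else
        data[i] += 1;
}

// Branching on a per-warp quantity instead, e.g. (threadIdx.x / warpSize) % 2,
// avoids divergence because all 32 threads of a warp take the same path.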
Parallel sum
[Figure: tree-based parallel sum - partial sums computed by mini-warps 1-4, then combined]

Prefix sum
[Figure: prefix sum computed across mini-warps 1-2]
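A shared-memory tree reduction, our reconstruction of the "parallel sum" pattern sketched in the figures (not code from the slides): at each step half of the active threads add a partner's element, so 512 values are summed in 9 steps instead of 511 sequential additions. Assumes blockDim.x == 512.

__global__ void blockSum(const int* in, int* out)
{
    __shared__ int temp[512];                 // one element per thread of the block
    int tid = threadIdx.x;
    temp[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            temp[tid] += temp[tid + stride];  // pairwise partial sums
        __syncthreads();                      // every thread must reach this barrier
    }
    if (tid == 0)
        out[blockIdx.x] = temp[0];            // one partial sum per block
}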
Global Memory
- Accesses to global memory from a warp are split into half-warp requests
- Each half-warp request is coalesced into a small number of memory "transactions"
- The coalescing depends on the device compute capability
Global Memory
- Compute Capability 1.0 and 1.1
  - The size of the words accessed by the threads must be 4, 8, or 16 bytes;
  - If this size is:
    - 4, all 16 words must lie in the same 64-byte segment,
    - 8, all 16 words must lie in the same 128-byte segment,
    - 16, the first 8 words must lie in the same 128-byte segment and the last 8 words in the following 128-byte segment;
  - Threads must access the words in sequence: the k-th thread in the half-warp must access the k-th word.
Global Memory
- Compute Capability 1.2 and 1.3
  - Threads can access any words in any order, including the same words, and a single memory transaction is issued for each segment addressed by the half-warp:
  - Find the memory segment that contains the address requested by the active thread with the lowest thread ID. The segment size depends on the size of the words accessed by the threads:
    - 32 bytes for 1-byte words,
    - 64 bytes for 2-byte words,
    - 128 bytes for 4-, 8- and 16-byte words.
  - Find all other active threads whose requested address lies in the same segment.
  - Reduce the transaction size, if possible:
    - If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes;
    - If the transaction size is 64 bytes and only the lower or upper half is used, reduce the transaction size to 32 bytes.
  - Carry out the transaction and mark the serviced threads as inactive.
  - Repeat until all threads in the half-warp are serviced.
Global Memory
- Compute Capability 2.x
  - Memory accesses are cached
  - A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory.
  - If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently:
    - Two memory requests, one for each half-warp, if the size is 8 bytes,
    - Four memory requests, one for each quarter-warp, if the size is 16 bytes.
  - Note that threads can access any words in any order, including the same words.
Shared Memory (2.0)
- 32 memory banks organized such that successive words reside in different banks
- Each bank has a bandwidth of 32 bits per 2 clock cycles
- 32 adjacent words are accessed in parallel from 32 different memory banks
- A bank conflict occurs if two threads access different words within the same bank
- When multiple threads access the same word:
  - A broadcast occurs in case of a read
  - Only one thread writes (which one is undetermined)
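A sketch (assumption, not from the slides) of the classic padding trick against bank conflicts, here in a tiled transpose of a square matrix whose side is a multiple of 32, launched with 32x32 blocks:

__global__ void transposeTile(const float* in, float* out, int width)
{
    __shared__ float tile[32][33];            // 33, not 32: consecutive rows start in
                                              // different banks, so column reads do not conflict
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    int tx = blockIdx.y * 32 + threadIdx.x;   // coordinates of the transposed tile
    int ty = blockIdx.x * 32 + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}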
References
- Parallel Programming, Ch. 7: General Purpose GPU Programming