ROCm - GPGPU-10

ROCm: An Open Platform for GPU Computing Exploration

Ben Sander, Senior Fellow, Radeon Open Compute
Gregory Stoner, Senior Director, Radeon Open Compute

February 2017

REVOLUTION IN GPU COMPUTING
Radeon Open Compute Platform (ROCm)
A modern heterogeneous HPC and hyperscale accelerator platform for large-scale systems

Performance: Rich foundation built for latency reduction and throughput optimization
Open: First fully "Open Source" professional GPU accelerator computing solution
Hyper Scale: Built from the ground up to service multiple accelerators in a node and across the rack
Introducing ROCm Software Platform
A new, fully "Open Source" foundation for hyperscale and HPC-class GPU computing

Graphics Core Next Headless Linux® 64-bit Driver
• Large memory single allocation
• Peer-to-Peer Multi-GPU
• Peer-to-Peer with RDMA
• Systems management API and tools

HSA Drives Rich Capabilities Into the ROCm Hardware and Software
• User mode queues
• Architected Queuing Language
• Flat memory addressing
• Atomic memory transactions
• Process concurrency & preemption

Rich Compiler Foundation For HPC Developers
• LLVM native GCN ISA code generation
• Offline compilation support
• Standardized loader and code object format
• GCN ISA assembler and disassembler
• Full documentation of the GCN ISA

"Open Source" Tools and Libraries
• Rich set of "Open Source" math libraries
• Tuned deep learning frameworks
• Optimized parallel programming frameworks
• CodeXL profiler and GDB debugging
ROCm Programming Model Options

HIP – Convert CUDA to portable C++
• Single-source Host+Kernel
• C++ Kernel Language
• C Runtime
• Platforms: AMD GPU, NVIDIA (same performance as native CUDA)
When to use it?
• Port existing CUDA code
• Developers familiar with CUDA
• New project that needs portability to AMD and NVIDIA

HCC – True single-source C++ accelerator language
• Single-source Host+Kernel
• C++ Kernel Language
• C++ Runtime
• Platforms: AMD GPU
When to use it?
• New projects where a true C++ language is preferred
• Use features from the latest ISO C++ standards

OpenCL – Khronos industry-standard accelerator language
• Split Host/Kernel
• C99-based Kernel Language
• C Runtime
• Platforms: CPU, GPU, FPGA
When to use it?
• Port existing OpenCL code
• New project that needs portability to CPU, GPU, and FPGA
HIP: Key Features

 Strong support for the most commonly used parts of the CUDA API
  ‒ Streams, events, memory allocation/deallocation, profiling
  ‒ HIP includes driver API support (modules and contexts)
 Full C++ support including templates, namespaces, classes, lambdas
  ‒ AMD's open-source GPU compiler based on near-tip clang+llvm
  ‒ Supports C++11, C++14, some C++17 features
 Hipified code is portable to AMD/ROCm and NVIDIA/CUDA
  ‒ On CUDA, developers can use native CUDA tools (nvcc, nvprof, etc.)
  ‒ On ROCm, developers can use native ROCm tools (hcc, rocm-prof, CodeXL)
 HIP ecosystem includes hipBLAS, hipFFT, hipRNG, MIOpen
 Hipify tools automate the translation from CUDA to HIP
  ‒ Developers should expect some final cleanup and performance tuning
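To make the features above concrete, here is a minimal HIP vector-add, a sketch written against the HIP API of this deck's vintage (hipLaunchParm/hipLaunchKernel, as in the CAFFE example on the next slide); the program, names, and sizes are illustrative and not from the deck:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // Kernel: HIP of this era passed hipLaunchParm as the first argument
    __global__ void vadd(hipLaunchParm lp, int n, const float* a, const float* b, float* c) {
        int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *hA = new float[n], *hB = new float[n], *hOut = new float[n];
        for (int i = 0; i < n; i++) { hA[i] = 1.0f * i; hB[i] = 2.0f * i; }

        float *dA, *dB, *dC;
        hipMalloc((void**)&dA, bytes);
        hipMalloc((void**)&dB, bytes);
        hipMalloc((void**)&dC, bytes);
        hipMemcpy(dA, hA, bytes, hipMemcpyHostToDevice);
        hipMemcpy(dB, hB, bytes, hipMemcpyHostToDevice);

        const int threads = 256;
        hipLaunchKernel(HIP_KERNEL_NAME(vadd), dim3((n + threads - 1) / threads),
                        dim3(threads), 0, 0, n, dA, dB, dC);

        hipMemcpy(hOut, dC, bytes, hipMemcpyDeviceToHost);
        printf("hOut[12] = %6.2f (expect %6.2f)\n", hOut[12], 3.0f * 12);

        hipFree(dA); hipFree(dB); hipFree(dC);
        delete[] hA; delete[] hB; delete[] hOut;
        return 0;
    }

The same source compiles with nvcc on NVIDIA and hcc on AMD, which is the portability claim of the slide.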
Hipification of CUDA Kernel (CAFFE)

CUDA (before):

    namespace caffe {
    template <typename Dtype>
    __global__ void BNLLForward(const int n, const Dtype* in, Dtype* out) {
      for (int index = blockIdx.x * blockDim.x + threadIdx.x;
           index < (n); index += blockDim.x * gridDim.x) {
        out[index] = in[index] > 0 ?
            in[index] + log(1. + exp(-in[index])) :
            log(1. + exp(in[index]));
      }
    }

HIP (after automated HIPIFY) – C++ features and math library calls unchanged:

    namespace caffe {
    template <typename Dtype>
    __global__ void BNLLForward(hipLaunchParm lp, const int n, const Dtype* in, Dtype* out) {
      for (int index = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
           index < (n); index += hipBlockDim_x * hipGridDim_x) {
        out[index] = in[index] > 0 ?
            in[index] + log(1. + exp(-in[index])) :
            log(1. + exp(in[index]));
      }
    }
Hipification of CUDA Runtime APIs (CAFFE)

CUDA (before):

    void SyncedMemory::async_gpu_push(const cudaStream_t& stream) {
      CHECK(head_ == HEAD_AT_CPU);
      if (gpu_ptr_ == NULL) {
        cudaGetDevice(&gpu_device_);
        cudaMalloc(&gpu_ptr_, size_);
        own_gpu_data_ = true;
      }
      const cudaMemcpyKind put = cudaMemcpyHostToDevice;
      cudaMemcpyAsync(gpu_ptr_, cpu_ptr_, size_, put, stream);
      // Assume caller will synchronize on the stream
      head_ = SYNCED;
    }

HIP (after automated HIPIFY):

    void SyncedMemory::async_gpu_push(const hipStream_t& stream) {
      CHECK(head_ == HEAD_AT_CPU);
      if (gpu_ptr_ == NULL) {
        hipGetDevice(&gpu_device_);
        hipMalloc(&gpu_ptr_, size_);
        own_gpu_data_ = true;
      }
      const hipMemcpyKind put = hipMemcpyHostToDevice;
      hipMemcpyAsync(gpu_ptr_, cpu_ptr_, size_, put, stream);
      // Assume caller will synchronize on the stream
      head_ = SYNCED;
    }
Porting with the hipify Tool

CUDA source → hipify (~99%+ automatic conversion) → portable HIP C++ → developer cleanup and tuning

 Developer maintains the HIP port
 Resulting C++ code runs on NVIDIA or AMD GPUs
HIP Compilation Process

Portable HIP C++ (Kernels + HIP API) compiles down one of two paths:

NVIDIA path (HIP->CUDA header):
 HIP API implemented as inlined calls to the CUDA Runtime
 Compute kernels mostly unchanged
 Code compiled with NVCC (same as CUDA) into a CUDA executable
 Can use nvprof, the CUDA debugger, and other native tools

AMD path (HIP->HC header):
 HIP API implemented with a lightweight HIP runtime
 Uses HCC's hc::accelerator, hc::accelerator_view, hc::completion_future; some calls go directly into the ROCm runtime (ROCr)
 Compute kernels mostly unchanged
 Code compiled with HCC into an HCC executable
 Can use the CodeXL profiler/debugger

 Source portable, not binary portable
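To make the NVIDIA path concrete, the HIP->CUDA header largely reduces to type aliases and thin inline wrappers, so HIP code pays no overhead over native CUDA. The following is a hypothetical sketch of that approach, not the actual hip_runtime.h source (which covers far more cases):

    // Hypothetical sketch of the HIP->CUDA header approach (illustrative only)
    #include <cuda_runtime.h>

    typedef cudaStream_t hipStream_t;          // HIP types alias CUDA types
    #define hipMemcpyHostToDevice cudaMemcpyHostToDevice

    inline cudaError_t hipMalloc(void** ptr, size_t size) {
        return cudaMalloc(ptr, size);          // inlines to the CUDA call: zero overhead
    }

    inline cudaError_t hipMemcpyAsync(void* dst, const void* src, size_t size,
                                      cudaMemcpyKind kind, hipStream_t stream) {
        return cudaMemcpyAsync(dst, src, size, kind, stream);
    }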
ROCm: Deep Learning Gets HIP
Complexity of application porting: CAFFE
Bringing a faster path for deep learning applications to AMD GPUs

The Challenge: CAFFE
• Popular machine-learning framework
• Tip version on GitHub has 55,000+ lines of code
• GPU-accelerated with CUDA

Results:
• 99.6% of code unmodified or automatically converted
• Port required less than 1 week of developer time
• Supports all CAFFE features (multi-GPU, P2P, FFT filters)
• HIPCAFFE is the fastest CAFFE on AMD hardware – 1.8X faster than CAFFE/OpenCL

Lines of code changed (AMD internal data):
• OpenCL port: 32,227 manual
• HIP port: 219 manual, 688 automatic
HCC: Heterogeneous Compute Compiler

Architecture:
 Built on open-source CLANG/LLVM
 Single-source compiler for both CPU & GPU
 Standard object code can be linked with g++, clang, icc
 Performance optimized for accelerators:
  ‒ Explicit and implicit data movement
  ‒ Scratchpad memories
  ‒ Asynchronous commands

Dialects:
 HC
  ‒ C++ runtime: hc::accelerator, hc::accelerator_view, hc::completion_future
  ‒ Kernels launched with parallel_for_each around a lambda expression
 ISO C++
  ‒ C++17 Parallel Standard Template Library
  ‒ Next steps include executors and concurrency controls
 OpenMP
  ‒ OpenMP 3.1 support for CPU
  ‒ OpenMP 4.5 GPU offload shown at SC2016
HCC Example – “HC” Syntax
AUTOMATIC MEMORY MANAGEMENT VIA ARRAY_VIEW

    #include <hc.hpp>
    #include <cstdio>

    int main(int argc, char *argv[])
    {
      int sizeElements = 1000000;

      // Alloc auto-managed array_views
      hc::array_view<double> A(sizeElements);
      hc::array_view<double> B(sizeElements);
      hc::array_view<double> C(sizeElements);

      // Initialize host memory
      for (int i=0; i<sizeElements; i++) {
        A[i] = 1.618 * i;
        B[i] = 3.142 * i;
      }

      // Tell runtime not to copy CPU host data.
      C.discard_data();

      // Launch kernel onto default accelerator.
      // HCC runtime ensures that A and B are available on
      // the accelerator before kernel launch:
      hc::parallel_for_each(hc::extent<1>(sizeElements),
                            [=](hc::index<1> idx) [[hc]] {
        // Kernel is the lambda passed to parallel_for_each
        int i = idx[0];
        C[i] = A[i] + B[i];
      });

      // Check result. Because C is an array_view, the HCC runtime
      // will copy C back to the host at first access:
      for (int i=0; i<sizeElements; i++) {
        double ref = 1.618 * i + 3.142 * i;
        if (C[i] != ref) {
          printf("error:%d computed=%6.2f, reference=%6.2f\n", i, C[i], ref);
        }
      }
    }
HCC Example – “HC” Syntax
EXPLICIT MEMORY MANAGEMENT VIA ARRAY

    #include <hc.hpp>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char *argv[])
    {
      int sizeElements = 1000000;

      // Alloc GPU arrays
      hc::array<double> Ad(sizeElements);
      hc::array<double> Bd(sizeElements);
      hc::array<double> Cd(sizeElements);

      // Alloc host memory
      double *Ah = (double*)malloc(sizeElements * sizeof(double));
      double *Bh = (double*)malloc(sizeElements * sizeof(double));
      double *Ch = (double*)malloc(sizeElements * sizeof(double));

      // Initialize host memory
      for (int i=0; i<sizeElements; i++) {
        Ah[i] = 1.618 * i;
        Bh[i] = 3.142 * i;
      }

      // Copy host data to GPU
      hc::copy(Ah, Ad);
      hc::copy(Bh, Bd);

      // Launch kernel onto default accelerator
      // (hc::array objects must be captured by reference):
      hc::parallel_for_each(hc::extent<1>(sizeElements),
                            [&Ad, &Bd, &Cd](hc::index<1> idx) [[hc]] {
        int i = idx[0];
        Cd[i] = Ad[i] + Bd[i];
      });

      hc::copy(Cd, Ch); // Copy results GPU to host

      // Check result
      for (int i=0; i<sizeElements; i++) {
        double ref = 1.618 * i + 3.142 * i;
        if (Ch[i] != ref) {
          printf("error:%d computed=%6.2f, reference=%6.2f\n", i, Ch[i], ref);
        }
      }
    }
HCC “HC” Mode
KEY FEATURES

 Many core structures similar to C++AMP
  ‒ Implementation uses the "hc" namespace
  ‒ hc::accelerator_view, hc::array_view, hc::completion_future
  ‒ With expanded capabilities…
 Controls over asynchronous kernel and data commands (see the sketch after this list)
  ‒ hc::parallel_for_each returns hc::completion_future
  ‒ Asynchronous copy commands
  ‒ C++17 then, when_any, when_all for managing device-side dependencies [under development]
 Memory management
  ‒ Approachable hc::array_view for managed memory and implicit synchronization
  ‒ Explicit pointer-based memory allocation (am_alloc / am_free)
 Relaxed language restrictions
  ‒ Removes the C++AMP "restrict" qualifier
  ‒ Supports a rich set of C++ language features and data types
  ‒ Advanced C++ language features (virtual functions, recursion, etc.) [under development]
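A minimal sketch of the asynchronous controls named above, assuming hc::copy_async and the completion_future returned by hc::parallel_for_each behave as the slide describes (this is illustrative, not code from the deck):

    #include <hc.hpp>
    #include <vector>

    int main() {
        const int n = 1000000;
        std::vector<float> host(n, 1.0f);
        hc::array<float> dev(n);

        // Asynchronous copy command: returns immediately with a completion_future
        hc::completion_future copied = hc::copy_async(host.begin(), host.end(), dev);

        // ... independent host work could overlap with the copy here ...

        copied.wait();  // ensure data is resident before launching the kernel

        // parallel_for_each itself returns a completion_future
        hc::completion_future done =
            hc::parallel_for_each(hc::extent<1>(n), [&dev](hc::index<1> idx) [[hc]] {
                dev[idx[0]] *= 2.0f;
            });
        done.wait();
        return 0;
    }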
ISO C++17 Parallel STL

 Proposal for the C++17 Parallelism Tech Spec – approved in the Jacksonville ISO meeting!
 Formalization of ideas in the TBB, NVIDIA Thrust, and Bolt libraries
 Execution policy (usage sketch after this list)
  ‒ New first parameter to the PSTL function
  ‒ par indicates the algorithm can be run in parallel
  ‒ sort(data.begin(), data.end());      // STL
    sort(par, data.begin(), data.end()); // PSTL
 Can accelerate and run on GPU or multicore CPU
  ‒ Abstraction allows use of architecture-specific optimizations (workgroups, LDS)
 Next steps:
  ‒ Executors to control where (which device)
  ‒ Provide std::future to track status

Parallel algorithms in the Standard Template Library: adjacent_find, all_of, any_of, copy, copy_if, copy_n, count, count_if, equal, exclusive_scan, fill, fill_n, find, find_end, find_first_of, find_if, find_if_not, for_each, for_each_n, generate, generate_n, includes, inclusive_scan, inplace_merge, is_heap, is_partitioned, is_sorted, is_sorted_until, lexicographical_compare, max_element, merge, min_element, minmax_element, mismatch, move, none_of, nth_element, partial_sort, partial_sort_copy, partition, partition_copy, reduce, remove, remove_copy, remove_copy_if, remove_if, replace, replace_copy, replace_copy_if, reverse, reverse_copy, rotate, rotate_copy, search, search_n, set_difference, set_intersection, set_symmetric_difference, set_union, sort, stable_partition, stable_sort, swap_ranges, transform, uninitialized_copy, uninitialized_copy_n, uninitialized_fill, uninitialized_fill_n, unique, unique_copy
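A short usage sketch of the policy-based calls above. The header and namespace follow the Parallelism TS as shipped in experimental toolchains of the time; exact names varied by implementation, so treat them as assumptions (standard C++17 later settled on <execution> and std::execution::par):

    #include <experimental/algorithm>
    #include <experimental/numeric>
    #include <vector>

    int main() {
        using namespace std::experimental::parallel;

        std::vector<float> data(1 << 20, 1.0f);

        sort(par, data.begin(), data.end());                      // parallel sort
        float sum = reduce(par, data.begin(), data.end(), 0.0f);  // parallel reduction
        (void)sum;
        return 0;
    }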
ISO C++: Template Library for Parallel For Loops

 http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0075r1.pdf
 Proposed for C++20 and currently under discussion
 Provides straightforward porting of OpenMP #pragma loops into C++
 Key advantage over Parallel STL is that the "position" (i) inside the loop can be easily determined
 for_loop, for_loop_strided, reductions, inductions (see the reduction sketch below)
 Similar to PSTL, the par policy can be extended with executors to control where/how the kernel is executed

    // Proposed ISO C++ parallel for_loop:
    void saxpy_ref(int n, float a, float x[], float y[]) {
      for_loop(par, 0, n, [&](int i) {
        y[i] += a * x[i];
      });
    }
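Since the list above also names reductions, here is a hedged sketch of a P0075-style dot product. The reduction helper follows the proposal's draft API, which was still in flux, so treat the exact signature as an assumption:

    // Hedged sketch of a P0075-style reduction: `reduction` wires up a private
    // accumulator per execution agent and combines them with the given operation.
    #include <functional>

    float dot_ref(int n, const float* x, const float* y) {
        float sum = 0.0f;
        for_loop(par, 0, n,
                 reduction(sum, 0.0f, std::plus<float>()),
                 [&](int i, float& s) {   // s is this agent's private accumulator
                     s += x[i] * y[i];
                 });
        return sum;
    }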
ISO: Concurrency TS

 GPU architecture basics
  ‒ Memory-based queues used to schedule and execute commands
  ‒ Commands include data copies, parallel execution "kernels", dependencies, configuration
  ‒ Hardware-based dependency resolution
 hc::completion_future
  ‒ Based on C++ std::future
  ‒ Returned by asynchronous commands
  ‒ Extends "then" to schedule device-side commands with no host intervention (host-side sketch below):
    copy(…).then(for_each(…)).then(copy(…));
  ‒ HCC implementation identifies accelerator commands via specialization and leverages GPU hardware
  ‒ Efficiently wait for dependencies and signal completion – all without host intervention
 when_all, when_any (N4501)
  ‒ Combine futures, return another future, in a single function
  ‒ Can leverage dependency resolution hardware
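A hedged sketch of the `then` chaining described above, using std::experimental::future from the Concurrency TS to show the shape of the idea on the host side; the HCC device-side version extends hc::completion_future analogously. This is not the deck's code, and the load_async producer is hypothetical:

    #include <experimental/future>
    #include <vector>

    // Hypothetical producer: wraps ready data in a Concurrency TS future
    std::experimental::future<std::vector<float>> load_async() {
        std::experimental::promise<std::vector<float>> p;
        p.set_value(std::vector<float>(1024, 1.0f));
        return p.get_future();
    }

    void pipeline() {
        auto done = load_async()                                 // "copy in"
            .then([](std::experimental::future<std::vector<float>> f) {
                auto v = f.get();
                for (auto& x : v) x *= 2.0f;                     // "kernel"
                return v;
            })
            .then([](std::experimental::future<std::vector<float>> f) {
                auto v = f.get();                                // "copy out"
                // ... hand results back to the rest of the host code ...
            });
        done.wait();
    }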
Delivering An Open Platform For GPU Computing
A language-neutral solution to match developer needs as heterogeneous programming models evolve

GCN compiler (open source):
 Direct-to-ISA code generation
 GCN ISA documentation
 CLANG/LLVM based
 GCN assembler

Compiler stack: a CLANG front end feeds two LLVM back ends – the GCN compiler (LLVM optimization passes + GCN target, emitting GCN assembly for the GPU code) and the CPU compiler (LLVM optimization passes + CPU ISA target for the CPU code).

Runtime stack: Language Runtime API → ROCr System Runtime API (alongside UCX for communication) → ROCk/AMDGPU Driver → Linux OS.
Benefits from the Open Source Community

Given the same source with a typo, compare the NVIDIA NVCC closed-source compiler to the AMD HCC open-source compiler (built on clang):

    typedef float MyFloat_t;
    __global__ void scale(MyFloat_t *c, MyFloat_t *a)
    {
      const Myflaot_t scalar = 3.0;   // note the deliberate typo: "Myflaot_t"
      const int i = blockDim.x * blockIdx.x + threadIdx.x;
      c[i] = scalar * a[i];
    }

NVCC:

    typo_type.cpp(8): error: identifier "Myflaot_t" is undefined

HCC:

    typo_type.cpp:8:11: error: unknown type name 'Myflaot_t'; did you mean 'MyFloat_t'?
      const Myflaot_t scalar = 3.0;
            ^~~~~~~~~
            MyFloat_t
    typo_type.cpp:3:15: note: 'MyFloat_t' declared here
    typedef float MyFloat_t;
ROCm Supports OpenCL™
OpenCL 1.2+: a new core foundation that best leverages the ROCr runtime

 OpenCL 1.2 compatible runtime with the OpenCL 2.0 kernel language
 New GCN ISA LLVM code generator
 Supports GCN ISA assembly optimization, assembler, disassembler, inline ASM
 Supports offline, ahead-of-time compilation
 Register allocation and occupancy controls

Key features:
 Coarse-grain SVM
 C11 atomics
 OpenCL 2.0 images support
 Latency-to-compute optimization
 User-mode DMA – dual engines with async transfer, user-mode queue support
Innovation by Terminology?

Term            | HIP                   | HC                                            | OpenCL 1.2
----------------|-----------------------|-----------------------------------------------|-----------------------
Device          | int deviceId (0..n-1) | hc::accelerator                               | cl_device
Queue           | hipStream_t           | hc::accelerator_view                          | cl_command_queue
Event           | hipEvent_t            | hc::completion_future                         | cl_event
Memory          | void *                | void *; hc::array; hc::array_view             | cl_mem
Grid            | grid                  | extent                                        | NDRange
Block           | block                 | tile                                          | work-group
Thread          | thread                | thread                                        | work-item
Warp            | warp                  | wavefront                                     | sub-group
Device Kernel   | __global__            | lambda inside hc::parallel_for_each or [[hc]] | __kernel
Kernel Launch   | hipLaunchKernel       | hc::parallel_for_each                         | clEnqueueNDRangeKernel
Atomic Builtins | atomicAdd             | hc::atomic_fetch_add                          | atomic_add
Precise Math    | cos(f)                | hc::precise_math::cos(f)                      | cos(f)
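To ground the table, here is the same histogram increment written in HIP's CUDA-style vocabulary and in HC's lambda vocabulary. This is a hedged sketch assembled from the names in the table; the histo_* helpers are hypothetical and each fragment belongs in its own translation unit:

    // HIP (CUDA vocabulary): __global__ kernel, atomicAdd, hipLaunchKernel
    __global__ void histo_hip(hipLaunchParm lp, int* bins, const int* data, int n) {
        int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
        if (i < n) atomicAdd(&bins[data[i]], 1);
    }
    // launch: hipLaunchKernel(HIP_KERNEL_NAME(histo_hip), dim3(blocks),
    //                         dim3(256), 0, 0, bins, data, n);

    // HC vocabulary: lambda kernel in hc::parallel_for_each, hc::atomic_fetch_add
    void histo_hc(hc::array_view<int> bins, hc::array_view<const int> data, int n) {
        hc::parallel_for_each(hc::extent<1>(n), [=](hc::index<1> idx) [[hc]] {
            hc::atomic_fetch_add(&bins[data[idx[0]]], 1);
        });
    }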
Extending Support To A Broader Hardware Ecosystem
The ROCm "Open Source" platform brings a rich foundation to these new ecosystems

 AMD64 support: AMD "Zen", Intel Xeon E5 v3/v4
 ARM® AArch64 support: Cavium ThunderX
 IBM OpenPOWER support: IBM POWER8

ROCm is being built to support next-generation I/O interfaces: AMD is a founding member of GenZ, CCIX, and OpenCAPI.
MIOpen

 Open-source, optimized deep learning GPU kernels for OpenCL and HIP
 Describes operations as functions on tensors
  ‒ Data as 4-D tensors
  ‒ Convolutions
  ‒ Pooling
  ‒ Softmax
  ‒ Normalization
  ‒ Activation functions
 Support for major machine intelligence frameworks including CAFFE, TensorFlow, Torch [under development]
 Example: a convolution
[Figure: the example convolution expressed as a matrix product – the filter weights and an unrolled form of the input image multiply to give the output image; see the sketch below]
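The figure depicts convolution lowered to a matrix multiply, commonly implemented via im2col. Below is a minimal single-channel sketch of that idea, assuming a k×k filter, stride 1, and no padding; it is illustrative only, not MIOpen code:

    #include <vector>

    // im2col: unroll each k×k input patch into a column, so the convolution
    // becomes one (1 x k*k) * (k*k x outH*outW) matrix product.
    std::vector<float> conv2d_im2col(const std::vector<float>& img, int H, int W,
                                     const std::vector<float>& w, int k) {
        const int outH = H - k + 1, outW = W - k + 1;
        // Unrolled-input matrix: rows are filter taps, columns are output pixels
        std::vector<float> cols(k * k * outH * outW);
        for (int r = 0; r < k; ++r)
            for (int c = 0; c < k; ++c)
                for (int y = 0; y < outH; ++y)
                    for (int x = 0; x < outW; ++x)
                        cols[((r * k + c) * outH + y) * outW + x] =
                            img[(y + r) * W + (x + c)];
        // GEMM: weights (1 x k*k) times cols (k*k x outH*outW) = output image
        std::vector<float> out(outH * outW, 0.0f);
        for (int t = 0; t < k * k; ++t)
            for (int p = 0; p < outH * outW; ++p)
                out[p] += w[t] * cols[t * outH * outW + p];
        return out;
    }

Casting convolution as GEMM is what lets a library reuse one highly tuned matrix-multiply kernel across many layer shapes.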
Open-Source Computing – Who Wins?

 Developers: community delivers superior tools; first access to new language features
 Applications: source access enables control and optimization; innovate above the infrastructure
 Research: ROCm – the first open GPU compiler
 Customers: value and request open solutions
Some ROCm Research Opportunities
OPEN-SOURCE COMPILER AND RUNTIME

 GPU register allocation and optimization
  ‒ Large register files (e.g., 256 registers/thread)
  ‒ Complex relationship between IPC and occupancy
  ‒ Distinct scalar and vector registers; uniform access is an important optimization
  ‒ The ROCm LLVM compiler exposes the full compiler stack, including the register allocator and scheduler
 Feedback-directed optimization
  ‒ The best way to identify optimal code generation is to run the code
  ‒ Can we capture appropriate state from one or more runs and use it to influence future compilation?
 Dynamic parallelism done right
  ‒ "Architected Queuing Language": a standard command-packet architecture that enables GPUs to send work to themselves or to other GPUs
Some ROCm Research Opportunities
OPEN-SOURCE KERNEL DRIVER AND LIBRARIES

 Peer-to-peer communication
  ‒ Large-BAR access from other PCIe devices to all of the GPU's memory
  ‒ Enables interesting experimentation with other open-source device drivers (FPGAs, NVMe, NICs, etc.)
 Memory management
  ‒ Recent GPUs include automated migration of data to the GPU
  ‒ Enables a single unified pool of memory from the developer's perspective
  ‒ Many heuristics and optimization opportunities for deciding when to migrate
 MIOpen
  ‒ Innovate with new algorithms, layer fusion, tuning, understanding
Where To Go Deeper On ROCm
https://radeonopencompute.github.io/index.html
Open Source Professional Computing Solution
Foundation For Direct Access To The Hardware
Delivering Choice in Programming Models
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product
and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing
manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or
revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof
without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE.
IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES
ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.
ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, FirePro and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. ARM is a registered trademark of ARM Limited in the UK and other
countries. PCIe is a registered trademark of PCI-SIG Corporation. OpenCL and the OpenCL logo are trademarks of Apple, Inc. and used by
permission of Khronos. OpenVX is a trademark of Khronos Group, Inc. Other names are for informational purposes only and may be trademarks
of their respective owners. Use of third party marks / names is for informational purposes only and no endorsement of or by AMD is intended or
implied.
radeonopencompute.github.io