Report
Boillod-Cerneux France
WS 2010/2011
GPU - PETSc for the neutron transport equation

Thanks
I thank G. Haase for his help with the outcome of this project and for backing me in my work during the whole semester.
Outline
Introduction
0. CEA
I. Presentation of neutron transport equation
  1.1 Context
  1.2 Neutron transport equation
  1.3 Neutron diffusion equation
II. The linear system to solve
  2.1 Discretization of neutron diffusion equation
  2.2 Domain decomposition
  2.3 The linear system
  2.4 The MINOS solver
  2.5 The linear system used for our project
III. Introduction to KSP PETSc methods
  3.1 The numerical methods
IV. Introduction to Multigrid with PETSc
  4.1 The Multigrid concept
  4.2 Multigrid with PETSc
V. PETSc-CUDA presentation
  5.1 PETSc with CUDA
VI. PETSc used in the neutronic domain
  6.1 Multigrid code
  6.2 Multigrid results
  6.3 KSP code
  6.4 KSP results
VII. GPU used in the neutronic domain
  7.1 Introduction
  7.2 Conjugate gradient with CUDA-CPU
  7.3 Conjugate gradient with Cublas
  7.4 Results
Conclusion
Bibliography
Webography
Abstract. This paper is about optimising the solution of the neutron transport equation. We aim to implement the conjugate gradient method with GPU programming, in order to speed up the resolution of the neutron transport equation. We will study different implementations of the conjugate gradient method, in the CUDA and CUBLAS languages and also with PETSc-CUDA. We will then discuss the multigrid implementation with PETSc, and finally conclude on the obtained performances.
Key words. Parallel computing, eigenproblem, conjugate gradient method, multigrid, Graphics Processing Unit (GPU)
Introduction
As computer power increases, scientists want to solve more and more complicated problems. They need to optimize both memory requirements and computation time, which is commonly called «High Performance Computing».
HPC is applied to science as well as to business or data mining, and holds a leading place in every domain that uses computer science. More than a concept, HPC is nowadays a standard and is used in several domains, such as petrology, seismology, economics, and nuclear engineering.
Many codes developed 10 or 15 years ago need to be rewritten in order to take advantage of parallel computers. This is the case of MINOS, a CEA neutronics code. MINOS' target is to solve the neutron transport equation in order to compute the power produced by a power plant.
Indeed, the control of the nuclear reaction is based on the neutron transport equation, which is why its resolution is a major purpose.
Moreover, the computation time is essential, so as to intervene quickly to reduce or increase the nuclear reaction, depending on the energy needs.
Although MINOS is considerably optimised, CEA is still doing research to improve it. Recent research on GPU programming shows that using graphics cards for part of the calculations can considerably speed up the computation. Several languages exist for programming graphics cards, such as CUDA or CUBLAS. As GPU performance is very satisfying, some libraries and languages such as PETSc or Python are working on improving their performance by using GPU programming. PETSc is a free library which is well suited to factorizing and solving linear systems using parallelism. Indeed, PETSc uses MPI (or Open MPI) and offers many functions for linear and non-linear systems.
The subject of this project is to implement the conjugate gradient method with several languages using GPU programming, in order to use it later for solving the neutron transport equation in one dimension, since in this case the matrix is symmetric and positive definite. We will use the CUDA and CUBLAS languages, and the PETSc library with CUDA, as it provides many numerical methods for solving linear systems, notably several versions of the conjugate gradient method.
The organization of this paper is as follows. First, we present the neutron transport equation and explain how we obtain a linear system from it. We then focus on the codes written for the conjugate gradient method and their structure. We next discuss multigrid programming with PETSc. Finally, we study the performances we obtain and conclude.
0. CEA
Based in Île-de-France, CEA Saclay is a national and European research centre. CEA carries out applied research as well as fundamental research. With 5000 researchers, CEA Saclay contributes to improving the optimization and the operation of the French nuclear plant fleet, and does research on managing nuclear waste while respecting the environment. To do so, CEA bases its research on simulation, especially with computer science.
I. Presentation of neutron transport equation
The neutron transport equation is fundamental in neutronics. Indeed, scientists must solve it in order to understand the nuclear reaction in a core. In this part, we first present the issue and then the neutron transport equation. We end with the neutron diffusion equation, which is linked to the former.
1.1 Context
In a nuclear reactor, free neutrons collide with atoms. Two scenarios are possible:
- the neutron is absorbed by the atom (an absorption reaction),
- or the atom releases two or three other free neutrons: that is a fission reaction.
Suppose that there are N fissions at time t. We call «Keff» the number of neutrons released by a fission. In case of fission, we obtain N·Keff fissions at time t+1 and N·Keff² fissions at time t+2.
We deduce that at time t+k (k a natural integer) we obtain N·Keff^k fissions: (N·Keff^k) is a geometric sequence (with common ratio Keff) which converges only if Keff < 1.
When Keff < 1 (resp. Keff > 1), the nuclear reaction is called sub-critical (resp. super-critical).
When Keff = 1, it is called a critical reaction.
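For illustration only, this chain-reaction argument can be restated compactly: with N_k the number of fissions after k steps,

\[
N_k = N \cdot K_{\mathrm{eff}}^{\,k}, \qquad k \in \mathbb{N},
\]

so the sequence (N_k) decreases towards zero if Keff < 1, stays constant if Keff = 1, and grows without bound if Keff > 1.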
The coefficient Keff determines whether the nuclear reaction will decrease or increase in the future. Solving the neutron transport equation gives us the value of Keff, which is why we have to solve it as fast as possible.
On the picture above, Keff is equal to 3 for each fission.
The fission phenomenon produces heat, which creates water vapour. This vapour is used to drive turbines, which then produce electricity.
Having presented the general operation of a nuclear core, we will now study the neutron transport equation.
1.2 Neutron transport equation
The neutron transport equation was established by L. Boltzmann in 1872 and is also called Boltzmann's equation¹. It can be presented as follows:

¹ Reference: Paul Reuss, Précis de neutronique.

Where:
Φ is the neutron flux (mol/m³),
and the current is expressed in mol/(m²·s).
This equation cannot be solved except under particular conditions, for example in a stationary state, where Fick's laws (established in 1855) apply. Before introducing Fick's laws, let us recall the definition of diffusion: «a system tends to homogenize the concentrations of its chemical elements: this natural phenomenon is called diffusion.»
First Fick's law: it relates the diffusive flux to the concentration field, by postulating that the flux goes from regions of high concentration to regions of low concentration. The magnitude of the flux is proportional to the concentration gradient. In one dimension, we obtain the following equation:
Where:
D is the diffusion coefficient (m²/s).
Second Fick's law: it predicts how diffusion causes the concentration field to change with time. It is represented by the following equation:
Where:
x is the position.
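The two formulas themselves are not reproduced above; for completeness, the standard one-dimensional forms of Fick's laws (with J the diffusive flux, C the concentration and D the diffusion coefficient) are

\[
J = -D\,\frac{\partial C}{\partial x} \quad \text{(first law)},
\qquad
\frac{\partial C}{\partial t} = D\,\frac{\partial^{2} C}{\partial x^{2}} \quad \text{(second law)}.
\]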
The use of Fick's law in a reactor core leads to the diffusion approximation: Boltzmann's equation can be simplified and approximated by the neutron diffusion equation.
We will now focus on this equation, which depends on 7 variables only.
1.3 Neutron diffusion equation
We saw previously the definition of diffusion. Illustrated with neutrons, this means that in a core the dense neutron areas tend to populate the sparse neutron areas. In the neutron diffusion equation, we introduce a new variable:
Where:
p is the current.
The neutron diffusion equation is as follows:
Where:
Sφ are the scattering sources,
Sf are the fission sources,
R is the domain,
σa is the absorption coefficient,
σf is the fission coefficient,
λ is Keff.
² Reference: Pierre Guérin's thesis.
In fact, this equation doesn't consider the angular variable.
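The equation itself is not reproduced above. As a hedged sketch only, using the variables listed above, the mixed (flux-current) form of the diffusion problem reads approximately

\[
\vec p = -D\,\nabla\varphi,
\qquad
\nabla\!\cdot\vec p + \sigma_a\,\varphi \;=\; \frac{1}{\lambda}\,S_f + S_\varphi
\quad \text{on } R,
\]

where the fission source S_f itself involves σ_f and the flux; the exact form used by MINOS should be taken from the referenced thesis.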
Now that we have presented the neutron diffusion equation and justified its utility, we will focus on the corresponding linear system. Indeed, it is more convenient to work with the linear system, because of the large panel of numerical methods available to solve it.
II. The linear system to solve
In this part, we present the method used to discretize the diffusion equation and the domain decomposition method. Then we explain the obtained linear system and, finally, we present the MINOS solver, the CEA program which carries out the resolution of this linear system.
2.1 Discretization of neutron diffusion equation
Let us consider the previous neutron diffusion equation¹, written as the set of coupled equations (1.1)-(1.3).
These coupled equations are discretized with the Raviart-Thomas finite element method. Here we detail how the linear system is obtained from these equations.
We consider a current p and a flux φ satisfying the neutron diffusion equation. We multiply the first equation by a vector q and the second one by a function φ (φ being a square-integrable function). We write this system in its variational form, in order to discretize the set of equations (1.1) and (1.3) with a finite element method, and finally obtain the system (2.1)-(2.3).
We now apply Green's formula to equation (2.1), which yields the system (3.1)-(3.3).
We employ the Raviart-Thomas finite element method to discretize the different functional spaces: the discretized spaces of the current and of the flux are denoted by Wh and Vh. The discretization of (3.1) and (3.3) then writes as the system (4.1)-(4.3).
Introducing the corresponding matrix notation, we obtain the following matrix system, satisfied by the unknowns P (current) and φ (flux):

    B φ - A P = 0
    T φ = S - B^T P

with the matrices A, B, T and the right-hand side S coming from the discretized bilinear forms. This system can be rewritten with P as the single unknown: φ = T^-1 (S - B^T P), and replacing this expression of φ in the first equation gives the system

    (B T^-1 B^T + A) P = B T^-1 S.

The largest eigenvalue of this linear system is Keff.
In the Raviart-Thomas case, px and py are uncoupled and the system writes:

    B = Bx + By,      A = | Ax   0  |
                          | 0    Ay |

We set Wx = Ax + Bx Tx^-1 Bx^T and Wy = Ay + By Ty^-1 By^T. Finally, we obtain the following linear system, whose matrix is

    | Wx               Bx Tx^-1 By^T |
    | By Ty^-1 Bx^T    Wy            |
To solve this linear system, CEA currently factorizes the previous matrix with the Cholesky method and solves it with a block Gauss-Seidel method.
In our specific case, the domain decomposition method is also used, to solve the diffusion equation more precisely.
2.2 Domain decomposition
On each subdomain, we intervene only on the boundary conditions. Note that in our case we consider a Cartesian mesh. To explain how we decompose the domain, we illustrate it with a scheme: we consider two subdomains, R1 and R2, and the normal vectors n1 and n2 associated with each subdomain.
In the neutron transport equation case, we rewrite the interface conditions on the current p as follows:

    p1^n · n1 + α φ1 = - p2^(n-1) · n2
    p2^n · n2 + α φ2 = - p1^(n-1) · n1

and introduce them in the system (5.1)-(5.2).
With the same operations as detailed in the previous paragraph, we obtain a linear system which is very similar to the one mentioned before. Indeed, this domain decomposition method is an iterative one: consequently, only the right-hand side of the system is modified at each iteration.
All the previous operations lead to a linear system of the form Ax = b. It is easier to solve this kind of system, because many numerical methods are available for it. In the next paragraph, we will focus on the obtained linear system, whose solution returns the coefficient Keff.
2.3 The linear system
The following linear system¹ is the result of applying the domain decomposition and the discretization method to the neutron diffusion equation. This system is a sequential linear system, with the matrix

    | Ax + Bx Tx^-1 Bx^T    Bx Tx^-1 By^T      |
    | By Ty^-1 Bx^T         Ay + By Ty^-1 By^T |

¹ Reference: Pierre Guérin's thesis.
Characteristics of the matrices:
H' is symmetric positive definite,
T is diagonal, but it depends on the choice of Wh and Vh,
B is bidiagonal,
A is a dense matrix (A is also called the current-coupling matrix),
W has the same profile as A,
S is the right-hand side.
It follows that the largest eigenvalue of this linear system is Keff: solving this linear system is a major issue for the control of a nuclear reaction.
Currently, CEA uses a Cholesky method to factorize the matrix H' and then a Gauss-Seidel method to solve the linear system.
As the matrix size is huge, parallelization is used to implement the solver of this linear system. In particular, the MINOS solver is a parallelized code, developed in C++, which carries out this resolution.
2.4 The MINOS solver
The MINOS solver contains different C++ programs, which are optimised. Indeed, the C++ structure offers stability and is well adapted to parallelization. We will focus only on the representation of the linear system in MINOS.
2.4.1 Matrix storage
The matrix H' is structured as follows: every row contains only three non-zero elements, and every block shares one element with its neighbour. The matrix H' is stored as a flat profile: we keep the off-diagonal values of each block (as the matrix is symmetric, only one triangle needs to be stored) in an array called 'matCur'; the 'mut' array contains the number of off-diagonal elements of each row; and 'matCurD' contains all the diagonal values. Note also that H' is symmetric positive definite.
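As an illustration only, this storage can be pictured with a small C structure. The array names matCur, mut and matCurD come from the description above; the exact layout (here, entries assumed to sit just left of the diagonal) is an assumption and may differ from the real MINOS code.

#include <stddef.h>

/* Hypothetical sketch of the "flat profile" storage described above. */
typedef struct {
    int     n;        /* number of rows                                  */
    double *matCurD;  /* diagonal values, one per row                    */
    double *matCur;   /* stored off-diagonal values, row after row       */
    int    *mut;      /* number of stored off-diagonal entries per row   */
} FlatProfileMatrix;

/* y = H' x, exploiting the symmetry of H' (illustrative only). */
static void flat_profile_matvec(const FlatProfileMatrix *m,
                                const double *x, double *y)
{
    int offset = 0;
    for (int i = 0; i < m->n; ++i)
        y[i] = m->matCurD[i] * x[i];
    for (int i = 0; i < m->n; ++i) {
        for (int k = 0; k < m->mut[i]; ++k) {
            int    j = i - m->mut[i] + k;   /* assumed: entries hug the diagonal */
            double v = m->matCur[offset + k];
            y[i] += v * x[j];               /* stored (lower) part               */
            y[j] += v * x[i];               /* mirrored (upper) part             */
        }
        offset += m->mut[i];
    }
}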
As mentioned before, the linear system is solved with a Gauss-Seidel method. In every Gauss-Seidel step, the linear system corresponds to a decomposition of the H' matrix for each direction, and these directional systems are solved with a Cholesky method.
MINOS iterates the Cholesky resolution until the error ε becomes smaller than a value fixed by the user. Currently, the error is bounded by the following limit¹:

¹ Reference: Pierre Guérin's thesis.

where S^(n+1) (resp. S^n) is the result obtained at iteration n+1 (resp. n). At present, the error is of the order of 10^-5.
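The formula for this limit is missing from the present copy; a plausible reading, consistent with the description of S^(n+1) and S^n above, is a relative criterion of the form

\[
\frac{\lVert S^{\,n+1} - S^{\,n}\rVert}{\lVert S^{\,n}\rVert} \;<\; \varepsilon,
\qquad \varepsilon \approx 10^{-5},
\]

but this should be checked against the referenced thesis.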
The MINOS solver uses parallelization to solve the linear system; more precisely, it uses MPI (Open MPI), which reduces the computation time.
2.4.2 Parallelisation
The data distribution is parallelised with MPI functions: there currently exists one method to distribute the data in parallel. All the matrix and vector data are distributed over the processes: the data are split according to the number of processes, and each process performs its calculations on the part of the data it received. It has been shown that the communication time becomes larger than the computation time as soon as very large matrices and a large number of processes are used.
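As a minimal sketch of such a row-block distribution (illustrative only, not the actual MINOS code; the function name and the splitting rule are assumptions), each process can derive the block of rows it owns from its rank:

#include <mpi.h>

/* Illustrative row-block partition: process `rank` of `size` owns
 * rows [start, start+local_n) of an n-row matrix. */
static void local_row_range(int n, int rank, int size,
                            int *start, int *local_n)
{
    int base = n / size;      /* rows every process gets               */
    int rest = n % size;      /* the first `rest` processes get one more */
    *local_n = base + (rank < rest ? 1 : 0);
    *start   = rank * base + (rank < rest ? rank : rest);
}

int main(int argc, char **argv)
{
    int rank, size, start, local_n, n = 1000000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    local_row_range(n, rank, size, &start, &local_n);
    /* each process would now receive and work on rows [start, start+local_n) */
    MPI_Finalize();
    return 0;
}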
2.5 The linear system used for our project
As I cannot access the exact CEA matrix and vectors, I chose to use another symmetric positive definite matrix. Moreover, the storage format of the CEA matrix is not optimal, so a considerable amount of time would be lost translating it into a PETSc storage format. We therefore modify the matrix structure a little and use a tridiagonal symmetric matrix. This makes the computation times easier to compare, because the time needed to translate the matrix format does not interfere.
The aim of this project is to implement the conjugate gradient method in order to solve the linear system with it. We implement it using graphics card programming, so that we can speed up the computation. In what follows, we briefly study the PETSc library and detail the KSP methods proposed to solve our linear system.
III Introduction to KSP PETSc methods
PETSc is an abbreviation for «Portable, Extensible Toolkit for Scientific Computation». This library provides functions and many tools to implement numerical methods for solving linear and non-linear systems. PETSc is particularly adapted to sparse and dense systems as well as to large-scale systems, because most PETSc functions are parallelized. In this paragraph, we introduce the KSP objects used to solve linear systems.
3.1 The numerical methods
In this part, we present the numerical methods we chose to solve our linear system, and first explain them theoretically.

The conjugate gradient method:
This method is an effective means of solving linear systems whose coefficient matrix is symmetric and positive definite.
To begin, we consider a linear system Ax = b, where A is symmetric and positive definite. We fix a vector x(0) and successively calculate x(1), x(2), x(3), ..., generating a sequence {x(k)} which approximates the solution x of Ax = b. For the preconditioner, we decompose A as A = L + D + L^T, where L is a strictly lower triangular matrix and D a diagonal matrix of the same size, and we build the matrix M = (D + L) D^-1 (D + L)^T.
We then apply the following algorithm:

    p(0) = r(0) = b - A x(0)
    solve M r'(0) = r(0)
    until convergence, do:
        a(k)   = (r(k), r'(k)) / (p(k), A p(k))
        x(k+1) = x(k) + a(k) p(k)
        r(k+1) = r(k) - a(k) A p(k)
        solve M r'(k+1) = r(k+1)
        b(k)   = (r(k+1), r'(k+1)) / (r(k), r'(k))
        p(k+1) = r'(k+1) + b(k) p(k)

We have converged when the residual r(k+1) = b - A x(k+1) is (numerically) zero.
This method corresponds to the KSP_TYPE « CG ».
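For reference, the unpreconditioned variant of this algorithm (i.e. M = I) can be written in a few lines of C. This is only a didactic sketch for a dense matrix; the project itself relies on the PETSc implementation of CG.

#include <stdlib.h>
#include <math.h>

/* Unpreconditioned conjugate gradient for a dense SPD matrix A (n x n),
 * solving A x = b starting from the initial guess already stored in x. */
static void cg_dense(int n, const double *A, const double *b,
                     double *x, double tol, int maxit)
{
    double *r  = malloc(n * sizeof *r);
    double *p  = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    double rr = 0.0;
    for (int i = 0; i < n; ++i) {            /* r = b - A x, p = r */
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += A[i*n+j] * x[j];
        r[i] = b[i] - s;
        p[i] = r[i];
        rr  += r[i] * r[i];
    }
    for (int k = 0; k < maxit && sqrt(rr) > tol; ++k) {
        double pAp = 0.0;
        for (int i = 0; i < n; ++i) {        /* Ap = A p and (p, Ap) */
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i*n+j] * p[j];
            Ap[i] = s;
            pAp  += p[i] * s;
        }
        double alpha = rr / pAp;
        double rr_new = 0.0;
        for (int i = 0; i < n; ++i) {        /* update solution and residual */
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rr_new += r[i] * r[i];
        }
        double beta = rr_new / rr;
        for (int i = 0; i < n; ++i)          /* new search direction */
            p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
}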
Conjugate gradient method on the normal equations:
The CGNE solver also applies the CG iterative method, but to the normal equations, without explicitly forming the matrix A^T A.
This method corresponds to the KSP_TYPE «CGNE». As we saw, the KSP objects in PETSc are particularly adapted to solving linear systems.
However, PETSc also implements the multigrid concept, as it is used more and more in the HPC domain. PETSc offers a DMMG object, which creates a multigrid hierarchy well adapted to linear or non-linear systems. In what follows, we study the DMMG concept in PETSc.
IV Introduction to Multigrid with PETSc
As recalled in the previous section, PETSc provides functions and many tools to implement numerical methods for linear and non-linear systems, and most of its functions are parallelized. In this paragraph, we introduce the multigrid approach for solving linear systems, and then explain how PETSc implements it.
4.1 The Multigrid concept
To introduce the multigrid concept, we consider the classic example: the Poisson equation in one dimension,

    -Δu = f on [0,1],                                  (1.1)
    u(0) = u(1) = 0 (boundary conditions).             (1.2)

We apply the (second order) finite difference method to (1.1) and obtain the following linear system:

    (-u(i+1) + 2u(i) - u(i-1)) / h² = f(i),            (2.1)

where h is the grid size, generally h = 1/(n+1).
We can rewrite these equations in matrix form:

    A u = f,                                           (3.1)
    u = (u1, u2, ..., uN-1)^T,                         (3.2)
    u(0) = u(N) = 0,                                   (3.3)
    f = (f1, f2, ..., fN-1)^T.                         (3.4)

The discretized form of equation (3.1) on a grid of size h is the linear system

    A_h u_h = f_h.                                     (4.1)
Applying the Jacobi or Gauss-Seidel algorithm to a linear system like (3.1) determines the solution vector u through updates of the form

    u(i) = (u(i+1) + u(i-1) + h² f(i)) / 2,  i = 1, ..., N.        (5.1)

Note that the Gauss-Seidel method is like the Jacobi one, except that it uses for u(i+1) the most recently computed value (updated within each iteration), and the updates are repeated until convergence. The system Au = f admits an exact solution u; for an approximation v, we define the error as

    e = u - v.                                          (6.1)
The best way to study the different frequency components of the error is to consider the Fourier modes; they are used as initial data for the Gauss-Seidel iteration. By considering Fourier modes of several frequencies, we can establish the two following facts:
- the error of classic iterative methods is smoothed at every iteration: by decomposing the error into a Fourier series, we conclude that, on a mesh with grid size h, the high frequencies are damped better than the low ones;
- the low-frequency components of the error on a fine mesh appear as high frequencies on a coarse mesh, and are therefore damped there by the iterative method.
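This smoothing property can be quantified in the simpler case of the weighted Jacobi method (stated here only as an illustration; the report itself uses Gauss-Seidel): for the 1D model problem, one sweep multiplies the error component on the k-th Fourier mode by

\[
\lambda_k \;=\; 1 - 2\,\omega\,\sin^{2}\!\Big(\frac{k\pi h}{2}\Big), \qquad k = 1,\dots,N-1,
\]

so with ω = 2/3 every high-frequency mode (kπh ≥ π/2) is damped by at least a factor of 3 per sweep, while the smooth modes (kπh close to 0) are barely reduced.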
We illustrate the steps of a multigrid resolution of Au = f (a small C sketch of the two-grid version is given after this list):
- we perform several Gauss-Seidel iterations on A_h u_h = f_h, obtaining a vector v_h which approximates u on the finest grid (pre-smoothing);
- we compute the associated residual r_h = f_h - A_h v_h;
- we restrict this residual to a coarser grid (with grid size 2h) using a restriction operator R;
- we solve the coarse equation A_2h (u_2h - v_2h) = A_2h e_2h = r_2h;
- we apply a prolongation (or interpolation) operator to e_2h to obtain e_h on the finest grid;
- we correct the approximation accordingly: v_h = v_h + e_h;
- we perform several Gauss-Seidel iterations on A_h u_h = f_h, taking v_h as the initial solution.
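To fix ideas, here is a small C sketch of the two-grid version of this cycle for the 1D model problem above (illustrative only; the sweep counts and helper names are arbitrary, and the actual project relies on the PETSc DMMG machinery presented below):

#include <stdlib.h>

/* Gauss-Seidel sweeps for -u'' = f on a grid with n interior points,
 * spacing h, homogeneous Dirichlet values (u[0] = u[n+1] = 0). */
static void gauss_seidel(int n, double h, double *u, const double *f, int sweeps)
{
    for (int s = 0; s < sweeps; ++s)
        for (int i = 1; i <= n; ++i)
            u[i] = 0.5 * (u[i-1] + u[i+1] + h*h*f[i]);
}

/* One two-grid correction cycle; n is odd so the coarse grid has (n-1)/2 points. */
static void two_grid_cycle(int n, double h, double *u, const double *f)
{
    int    nc = (n - 1) / 2;
    double H  = 2.0 * h;
    double *r  = calloc(n + 2,  sizeof *r);
    double *rc = calloc(nc + 2, sizeof *rc);
    double *ec = calloc(nc + 2, sizeof *ec);

    gauss_seidel(n, h, u, f, 3);                      /* pre-smoothing            */

    for (int i = 1; i <= n; ++i)                      /* residual r = f - A u     */
        r[i] = f[i] - (-u[i-1] + 2.0*u[i] - u[i+1]) / (h*h);

    for (int I = 1; I <= nc; ++I)                     /* full-weighting restriction */
        rc[I] = 0.25 * (r[2*I-1] + 2.0*r[2*I] + r[2*I+1]);

    gauss_seidel(nc, H, ec, rc, 50);                  /* coarse solve A_2h e = r_2h (many sweeps) */

    for (int I = 1; I <= nc; ++I) {                   /* linear interpolation + correction */
        u[2*I]   += ec[I];
        u[2*I-1] += 0.5 * (ec[I-1] + ec[I]);
    }
    u[n] += 0.5 * ec[nc];                             /* last fine point, next to the boundary */

    gauss_seidel(n, h, u, f, 3);                      /* post-smoothing           */

    free(r); free(rc); free(ec);
}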
We will now focus our attention on multigrid with PETSc, since PETSc offers methods to handle it.
4.2 Multigrid with PETSc
Before using multigrid methods with PETSc, one should know that multigrid in PETSc is considered as a preconditioner and not as a standalone solver; this can however be changed by using the PETSc KSP methods. The multigrid concept is represented by a structure in PETSc, the DMMG structure, which interacts with the DA structure, KSP, PC and some other PETSc types (such as Mat, Vec, ...). More generally, DMMG represents an object-oriented framework programming style: each object is virtually created by DMMG, and DMMG runs them.
With PETSc, the multigrid methods work as follows. Each solver (the smoothers and the coarse grid solver) is represented by a KSP object. Moreover, we have to use a PCType of PCKSP for the composite preconditioners if we decide to employ the KSP methods.
Under these conditions, the DMMG (i.e. multigrid with PETSc) manages the construction of the multigrid preconditioners and multigrid solvers. The biggest advantage of PETSc is that we can easily use PETSc multigrid to solve linear systems (using KSP methods) or non-linear systems (using SNES methods).
The PETSc library also provides code to generate all the parameters of the multigrid, such as the right-hand side (only available for linear systems), but also the Jacobian matrices, the interpolation or restriction operators, and the function evaluations for a given level of discretization (only available for non-linear systems). All of these are accessed with PETSc functions, which are easy to handle.
What exactly does the DMMG do? It creates a KSP (in case of a linear system) or SNES (non-linear system) object and sets the PCType by using the DMMG functions, which can access the PC associated with the DMMG. For each level, the vectors, the restriction (or interpolation) functions and the matrices are created and filled. Note that, to introduce the values of the initial solution, we use the DA structure (which allows the creation of logically rectangular meshes in 1, 2 or 3 dimensions). However, this structure is not available with CUDA; it is replaced by the DM one.
The DM structure works in the same way: it creates a KSP or SNES object, then sets the preconditioner type; the vectors, restriction (or interpolation) functions and matrices are created and filled for each level.
We will now explain how CUDA is used with the DMMG object.
V PETSc-CUDA presentation
Nowadays PETSc is available with CUDA. However, this version of PETSc is only available in petsc-dev; it will be included in the next PETSc release (3.2). The main advantage of PETSc is that the programmer does not have to implement any CUDA code: the programming is still done with PETSc functions, and CUDA appears only through the arguments passed when we run the code. This is still the same idea of PETSc: one code, but several ways of running it. The other advantage is that anyone can code on the GPU, without deep knowledge of CUDA.
5.1 PETSc with CUDA
As the manipulation of CUDA with PETSc is really simple, the installation of petsc-dev is still the most complicated part of the work! Before installing the petsc-dev version, one should check which versions of cusp, thrust and CUDA are already installed on the computer. Currently, thrust version 1.4.0 and cusp 0.4.0 work with PETSc-CUDA, which is a problem, as it is complicated to find the right thrust version (there is no version history on the official website).
The petsc-dev version is currently installed on fermi, but there is still a compatibility problem with the thrust versions. Consequently, I could not run my KSP code with CUDA, as one of the necessary libraries was not correctly installed.
However, I could perfectly run my multigrid code: the DMMG object is particularly adapted to using CUDA. The main advantage of PETSc is that you implement your code once and then run it with the PETSc-CUDA options: one code, but two ways to run it — this is the concept of PETSc-CUDA.
Some operations are directly run on the GPU:
Some operations are directly run on the GPU :
MatMult(...),
KSPSolve(...),
VecAXPY(...),
VecWAYPX(...),
as well as other vector operations.
Currently, we have to use the Jacobi preconditioner if we want to run our code on the GPU. Most of the KSP types are also available on the GPU.
As mentioned before, the programmer does not have to write any CUDA code: we use the classic PETSc functions and simply run the code with the PETSc-CUDA options. For example, one should run a PETSc code with -da_vec_type cuda if the vector associated with the DA/DMMG is to live on the GPU.
Now that we have studied PETSc-CUDA, as well as the behaviour of CUDA with PETSc, let us look at the different PETSc codes.
VI PETSc used in neutronic domain
The MINOS solver is currently parallelized to optimize the computation time. PETSc, however, offers functions which are already parallelized: the user does not have to deal with the MPI communications, which are optimized, nor with the CUDA implementation. Besides, the matrix storage formats are optimized and adapted to systems such as the neutron transport one. We first present the PETSc multigrid applied to a system similar to the neutron transport equation (briefly presented in 2.5), then the codes using only KSP functions to solve the same system.
6.1 Multigrid code
We implemented a program to solve the linear system (presented in 2.5) using the DMMG functions. This choice is justified by the PETSc-CUDA implementation: as said before, the DMMG options allow a PETSc code to run on the GPU.
The code is organized as follows:
void ksp_Results(KSP ksp), which prints the KSP results,
void ksp_Options(KSP ksp), which sets the different options of the KSP,
PetscErrorCode ComputeInitialSolution(DMMG dmmg), which computes the initial solution for the DMMG,
PetscErrorCode ComputeRHS_VecSet(DMMG dmmg, Vec b), which computes the right-hand side for the DMMG,
PetscErrorCode ComputeMatrix_A(DMMG dmmg, Mat jac, Mat A), which computes the matrix associated with the DMMG.
We will not detail ksp_Results(KSP ksp) and ksp_Options(KSP ksp), as they are similar to those of the KSP code presented in 6.3.
The function ComputeInitialSolution(DMMG dmmg) uses the DA structure to access the initial solution: in terms of performance, this is much better. We insert the values into the vector through the DA structure.
PetscErrorCode ComputeInitialSolution(DMMG dmmg) {
  DA          da = (DA) dmmg->dm;
  Vec         x  = (Vec) dmmg->x;
  PetscInt    mx,my,xs,ys,xm,ym,i,j;
  PetscScalar **array;

  DAGetInfo(da, 0, &mx, &my, 0,0,0,0,0,0,0,0);
  DAGetCorners(da,&xs,&ys,0,&xm,&ym,0);
  DAVecGetArray(da, x, &array);
  for (j=ys; j<ys+ym; j++){
    for(i=xs; i<xs+xm; i++){
      array[j][i] = 1.0;
    }
  }
  DAVecRestoreArray(da, x, &array);
  VecAssemblyBegin(x);
  VecAssemblyEnd(x);
  return(0);
}
In the function ComputeRHS_VecSet(DMMG dmmg, Vec b), we use the VecSet(...) function, as it also offers a good performance gain.
PetscErrorCode ComputeRHS_VecSet(DMMG dmmg,Vec b) {
PetscPrintf(PETSC_COMM_WORLD,"in rhs VecSet\n");
VecSet(b,1.0);
VecAssemblyBegin(b);
VecAssemblyEnd(b);
return(0);
}
The function ComputeMatrix_A(DMMG dmmg, Mat jac, Mat A) inserts the values into the matrix associated with the DMMG. We use MatSetValuesStencil to insert the values, because this function is particularly adapted to sparse matrices and uses the grid indices.
PetscErrorCode ComputeMatrix_A(DMMG dmmg,Mat jac, Mat A) {
  DA          da = (DA) dmmg->dm;
  PetscInt    m,n;
  PetscInt    i,j,mx,my,xm,ym,xs,ys;
  MatStencil  row, col[3];
  PetscScalar v[3];

  DAGetInfo(da, 0, &mx, &my, 0,0,0,0,0,0,0,0);
  DAGetCorners(da,&xs,&ys,0,&xm,&ym,0);
for(i=xs; i<xs+xm; i++){
row.i=i;
for (j=ys; j<ys+ym; j++){
row.j=j;
if(i==xs){
v[0] = 1/(abs(i-j)+1.0);
col[0].i = i; col[0].j = j;
v[1] = 1/(abs(i-j)+1.0);
col[1].i = i; col[1].j = j+1;
MatSetValuesStencil(A,1,&row,2,col,v,INSERT_VALUES);
}else{
if(j==ys){
v[0] = 1/(abs(i-j)+1.0);
col[0].i = i; col[0].j = j;
v[1] = 1/(abs(i-j)+1.0);
col[1].i = i; col[1].j = j-1;
MatSetValuesStencil(A,1,&row,2,col,v,INSERT_VALUES);
}else{
if(j>=(i-1) && j<=(i+1)){
v[0] = 1/(abs(i-j)+1.0);
col[0].i = i; col[0].j = j;
v[1] = 1/(abs(i-j)+1.0);
col[1].i = i; col[1].j = j-1;
v[2] = 1/(abs(i-j)+1.0);
col[2].i = i; col[2].j = j+1;
MatSetValuesStencil(A,1,&row,3,col,v,INSERT_VALUES);
}
}
}
}
}
MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);
return(0);
}
int main(int argc,char **argv) {
  DMMG      *dmmg;
  DA        da;     /* the DA is assumed to be created beforehand (e.g. with DACreate2d); its creation is not shown here */
  PetscReal norm;
  PC        pc;

  PetscInitialize(&argc,&argv,(char *)0,help);
  PetscOptionsSetFromOptions();
  DMMGCreate(PETSC_COMM_WORLD,1,PETSC_NULL,&dmmg);
  DMMGSetDM(dmmg,(DM)da);
  ComputeInitialSolution(*dmmg);
  DMMGSetKSP(dmmg,ComputeRHS_VecSet,ComputeMatrix_A);
  ksp_Options(DMMGGetKSP(dmmg));
  KSPGetPC(DMMGGetKSP(dmmg),&pc);
  PCSetType(pc,PCJACOBI);
  PCFactorSetShiftType(pc,MAT_SHIFT_POSITIVE_DEFINITE);
  DMMGSetUp(dmmg);
  DMMGSolve(dmmg);
  VecAssemblyBegin(DMMGGetRHS(dmmg));
  VecAssemblyEnd(DMMGGetRHS(dmmg));
  ksp_Results(DMMGGetKSP(dmmg));
  VecAssemblyBegin(DMMGGetx(dmmg));
  VecAssemblyEnd(DMMGGetx(dmmg));
  DMMGView(dmmg,PETSC_VIEWER_STDOUT_WORLD);
  MatMult(DMMGGetJ(dmmg),DMMGGetx(dmmg),DMMGGetr(dmmg));
  VecAXPY(DMMGGetr(dmmg),-1.0,DMMGGetRHS(dmmg));
  VecNorm(DMMGGetr(dmmg),NORM_2,&norm);
  DMMGDestroy(dmmg);
  DADestroy(da);
  PetscFinalize();
  return 0;
}
We chose to run this code with three different solvers: the classic conjugate gradient (CG), the conjugate gradient squared method (CGS) and, finally, CGNE, which corresponds to the conjugate gradient method applied to the normal equations. However, the performance obtained with the CGS solver is similar to the CG one, so we will not detail the result tables for the CGS solver.
6.2 Multigrid Results
CGNE
Grid size: 10 000
■ Without Cuda
level  Time    Flops       Flops/sec   DMMG iterations  Residual norm
1      1.278   21920000    17150000    87               0.0150379
2      1.310   33440000    25520000    66               0.0185555
3      1.350   51980000    38510000    51               0.022592
4      1.403   80050000    57070000    39               0.0277488
5      1.595   124200000   77890000    30               0.0339612
6      1.730   192500000   111300000   23               0.0417193
7      2.148   305100000   142100000   18               0.0504998
8      2.886   482400000   167200000   14               0.061509
9      4.266   772900000   181200000   11               0.074435
10     6.691   1162000000  173700000   8                0.0959991
11     11.300  1812000000  160400000   6                0.121133

Megaflop rates per function (without Cuda):

level  VecDot  VecNorm  VecAXPY  VecWAXPY  KSPSolve  PCApply  MatMult  MatMultTranspose
1      2085    1797     1680     984       898       456      682      719
2      2080    1827     1698     992       904       454      679      735
3      2117    1829     1683     987       896       452      679      717
4      2101    1834     1638     981       886       444      676      713
5      1720    1793     1584     948       792       398      605      630
6      1412    1761     1376     892       735       351      589      608
7      1206    1614     1191     826       679       285      589      609
8      1147    1561     1166     763       641       258      571      587
9      1101    1556     1122     729       613       241      550      578
10     1054    1567     1078     702       592       230      553      577
11     1095    1572     1114     737       592       234      557      584

■ With Cuda

level  Time     Flops       Flops/sec   DMMG iterations  Residual norm
1      1.35900  22800000    16780000    87               0.0150379
2      1.35600  34780000    25650000    66               0.0185555
3      1.37000  54060000    39450000    51               0.022592
4      1.47100  83250000    56580000    39               0.0277488
5      1.54800  129200000   83470000    30               0.0339612
6      1.63100  200200000   122700000   23               0.0417193
7      1.93500  317300000   164000000   18               0.0504998
8      2.49600  501600000   200900000   14               0.061509
9      3.59600  803700000   223500000   11               0.074435
10     5.66400  1208000000  213300000   8                0.0959991
11     9.79500  1884000000  192300000   6                0.121133

Megaflop rates per function (with Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      109     124      1360     1243     322       718      2577     339
2      200     240      2487     2338     516       1215     3466     382
3      352     475      4007     4054     744       1692     4382     402
4      562     929      5268     5567     939       1821     5182     409
5      799     1791     6671     7363     1080      1749     5522     416
6      1152    3420     7790     9137     1238      1687     5686     446
7      1568    6200     8448     10488    1334      1527     5649     464
8      1895    10521    8633     11373    1343      1334     5352     463
9      2150    16289    8454     11805    1322      1130     4907     463
10     2312    22647    7874     12011    1259      463      4188     463
11     2401    28023    7187     12112    1185      727      3624     463
For each PETSc function, we observe that the Megaflop rate increases with the number of levels, which is completely normal given the size of the matrix at each level.
For the VecDot function, the first levels (1 to 5) show a considerable difference in Megaflop rate: at the first level, the difference is about a factor of 20 (in favour of the run without Cuda), while at the biggest level the rate with Cuda is only about 2 times better.
For the VecNorm function, the gain is impressive: at level 11, the Megaflop rate with Cuda is more than 10 times the rate without Cuda.
We reach the same conclusion for VecAXPY and VecAYPX: the gain is about 7 to 8 times.
The gain obtained on KSPSolve is disappointing, as it is only about 2 times better with Cuda. However, the MatMult function also shows good performance gains, about 3 to 10 times better with Cuda (the gain differs for each level).
The gain obtained on the PCApply function is also very significant, but we cannot conclude on a general gain, as it is quite different for each level (about 2 to 3 times).
What is surprising is that the time and the global Megaflop rates are not so different with or without Cuda. We reached the same conclusion with a PETSc example available in the PETSc tutorial. However, this conclusion no longer holds when the grid size increases.
grid size: 100 000
■ without Cuda
level  Time    Flops       Flops/sec   DMMG iterations  Residual norm
1      1.392   89200000    64070000    35               0.0301563
2      1.539   139400000   90610000    27               0.0368426
3      1.798   219800000   122300000   21               0.0447776
4      2.291   340600000   148700000   16               0.0553796
5      3.125   522200000   167100000   12               0.0694722
6      4.840   885400000   183000000   10               0.0802987
7      7.973   1452000000  182100000   8                0.0959991
8      13.770  2265000000  164400000   6                0.121133

Megaflop rates per function (without Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      2074    1817     1613     980      862       428      651      698
2      1608    1826     1564     917      769       390      587      613
3      1329    1736     1366     895      720       322      599      607
4      1195    1601     1189     821      653       268      557      587
5      1134    1556     1139     752      633       260      563      583
6      1121    1563     1109     754      625       251      570      589
7      1142    1569     1116     756      611       242      560      584
8      1112    1570     1124     719      593       227      570      590

■ With Cuda

level  Time      Flops       Flops/sec   DMMG iterations  Residual norm
1      0.09203   92800000    100800000   35               0.0301563
2      0.99900   145000000   145200000   27               0.0368426
3      1.19400   228600000   191400000   21               0.0447776
4      1.58500   354200000   223500000   16               0.0553796
5      2.30900   543000000   235200000   12               0.0694722
6      3.75400   920600000   245200000   10               0.0802987
7      6.58600   1509000000  229200000   8                0.0959991
8      11.84000  2354000000  198800000   6                0.121133

Megaflop rates per function (with Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      544     1140     5443     5750     849       1602     4833     344
2      823     2172     6835     7805     1017      1500     5220     377
3      1150    4114     7962     9568     1136      1515     5375     397
4      1418    7063     7868     10732    1156      1273     5062     397
5      1636    10353    7949     11452    1142      1061     4459     397
6      1805    15393    7754     11821    1140      925      4122     405
7      1885    21564    7261     11995    1089      781      3691     400
8      1940    27020    6431     12060    1034      638      3073     406
For a 100 000 grid size, we can observe a time gain (from less than 10 seconds to 2 seconds). However, the global Megaflop rates are still comparable; the difference is not very significant. The gains in Megaflop rate are roughly the same as before. For PCApply, we now observe a gain of about 3 to 4 times.
grid size: 1 000 000
■ Without Cuda
level  Time   Flops       Flops/sec   DMMG iterations  Residual norm
1      2.129  392000000   184100000   15               0.0582586
2      3.147  594000000   188800000   11               0.074435
3      5.200  998000000   191900000   9                0.0873468
4      9.021  1606000000  178000000   7                0.106906

Megaflop rates per function (without Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      1143    1557     1160     763      640       261      559      583
2      1123    1556     1133     777      632       257      567      589
3      1137    1565     1137     751      616       245      557      583
4      1088    1571     1132     701      592       229      552      580

■ With Cuda

level  Time     Flops       Flops/sec   DMMG iterations  Residual norm
1      1.38700  408000000   294200000   15               0.0582586
2      2.26000  618000000   273500000   11               0.074435
3      4.04800  1038000000  256400000   9                0.0873468
4      7.45700  1670000000  223900000   7                0.106906

Megaflop rates per function (with Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      1523    8181     8232     11018    1162      1230     4971     397
2      1725    11965    7846     11641    1151      968      4353     407
3      1852    17456    7527     11911    1111      853      3947     399
4      1906    23417    6914     12042    1059      701      3388     400
For a 1 000 000 grid size, the time gain is about 1 to 2 seconds. The gains on the Vec operations are still very satisfying (still about 20 times for VecAYPX).
grid size: 10 000 000
■ Without Cuda
level  Time   Flops       Flops/sec   DMMG iterations  Residual norm
1      6.663  1670000000  250700000   6                0.121133

Megaflop rates per function (without Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      1139    1568     1140     781      597       231      566      584

■ With Cuda

level  Time     Flops       Flops/sec   DMMG iterations  Residual norm
1      5.07600  1740000000  342800000   6                0.121133

Megaflop rates per function (with Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      1884    25236    6519     12054    1026      632      3057     404
We reach the same conclusion as before for the 10 000 000 grid size: the gains for each function are still the same.
We can conclude, regarding the previous results, that Cuda with PETSc offers very good performance gains in Megaflop rate for each detailed function. However, the global time and Megaflop rates are disappointing; we were expecting more.
[Summary charts: time, global Megaflop rates, Megaflop rates for each function, and iteration counts.]
We ran the same code with a different solver, CG, to compare the results with the CGNE solver. The question was: does the solver have an influence on the results? Can we obtain different gains just because of the solver?
CG
Grid size: 10 000
■ Without Cuda
level  Time    Flops       Flops/sec   DMMG iterations  Residual norm
1      1.326   63820000    48130000    354              0.000998738
2      1.361   90650000    66620000    251              0.00140858
3      1.404   128100000   91270000    177              0.00199748
4      1.486   182900000   123100000   126              0.00280598
5      1.620   259400000   160100000   89               0.00397251
6      1.885   369100000   195800000   63               0.00561196
7      2.355   531000000   225500000   45               0.00785674
8      3.155   762700000   241800000   32               0.0110485
9      4.565   1111000000  243300000   23               0.0153719
10     7.020   1577000000  224600000   16               0.0220971
11     11.840  2416000000  204100000   12               0.0294628

Megaflop rates per function (without Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      2123    1837     1689     992      1057      463      679
2      2058    1850     1703     983      1057      461      680
3      2117    1845     1695     989      1059      459      678
4      2116    1848     1689     985      1047      450      668
5      2044    1828     1605     979      984       409      618
6      1602    1780     1470     907      909       379      596
7      1213    1666     1210     816      804       302      587
8      1134    1554     1168     758      751       259      569
9      1078    1553     1103     722      712       237      552
10     1091    1567     1100     717      704       231      556
11     1099    1571     1114     737      696       224      554

■ With Cuda

level  Time     Flops       Flops/sec   DMMG iterations  Residual norm
1      0.1464   67370000    46000000    354              0.000998738
2      0.1429   95690000    66980000    251              0.00140858
3      0.1396   135200000   96890000    177              0.00199748
4      0.1410   193100000   136900000   126              0.00280598
5      0.1447   273800000   189200000   89               0.00397251
6      0.1553   389600000   250900000   63               0.00561196
7      0.1793   560500000   312500000   45               0.00785674
8      0.2245   804900000   358500000   32               0.0110485
9      0.3199   1172000000  366500000   23               0.0153719
10     0.5056   1664000000  329000000   16               0.0220971
11     11.3200  2549000000  225200000   12               0.0294628

Megaflop rates per function (with Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      125     125      1455     1272     339       720      3003
2      241     242      2608     2362     640       1231     4232
3      480     483      4237     4040     1213      1835     5542
4      920     945      5890     5492     6694      2122     2129
5      1752    1820     7609     7554     3450      2121     7458
6      3203    3476     9027     9417     4974      1997     7738
7      5541    6399     9840     10662    6196      1732     7673
8      8522    10679    10074    11448    6655      1433     7330
9      11878   16723    9974     11868    6410      1127     6744
10     14657   22921    9480     12093    5590      853      5836
11     16682   28443    8852     12143    4395      648      4609
Clearly, the Megaflop gains for each function are the same, but we notice a significant difference in the time, which is better than with the CGNE solver: the CG solver is about ten times faster than the CGNE solver.
grid size: 100 000
■ without Cuda
level  Time    Flops       Flops/sec   DMMG iterations  Residual norm
1      1.484   202600000   136500000   112              0.00315673
2      1.660   291000000   175300000   80               0.00441942
3      1.962   410200000   209100000   56               0.00631345
4      2.540   591000000   232600000   40               0.00883883
5      3.443   837400000   243200000   28               0.0126269
6      5.075   1215000000  239400000   20               0.0176777
7      8.019   1740000000  217000000   14               0.0252538
8      13.700  2559000000  186800000   10               0.0353553

Megaflop rates per function (without Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      2091    1838     1675     984      1036      443      661
2      1947    1832     1592     960      970       409      611
3      1520    1778     1337     895      883       361      596
4      1197    1598     1200     810      769       273      557
5      1134    1548     1130     766      742       257      563
6      1123    1561     1121     747      733       252      569
7      1139    1569     1122     800      721       239      560
8      1109    1571     1117     725      690       213      571

■ With Cuda

level  Time       Flops       Flops/sec   DMMG iterations  Residual norm
1      0.8964     213900000   238600000   112              0.00315673
2      0.9385     307200000   327300000   80               0.00441942
3      0.1080     433000000   401000000   56               0.00631345
4      0.1386     623800000   450200000   40               0.00883883
5      0.1974     883800000   447700000   28               0.0126269
6      0.3223     1282000000  397900000   20               0.0176777
7      0.5622     1836000000  326500000   14               0.0252538
8      1.047e+01  2700000000  257800000   10               0.0353553

Megaflop rates per function (with Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      1127    1154     6367     6145     2471      1988     6880
2      2119    2222     8042     8133     3868      1979     7425
3      3805    4143     8976     9661     5225      1740     7417
4      6226    7208     9655     10817    6026      1457     7018
5      9617    12466    9895     11604    6452      1262     6860
6      11685   15780    9330     11849    5471      894      5551
7      14331   21499    8702     12021    4718      660      5022
8      16117   27049    7901     12053    3845      508      4114
For the 100 000 grid size, we conclude on the same gain rates for each function. The time gain is about 2 times with Cuda, except for the last levels.
grid size: 1 000 000
■ Without Cuda
level  Time   Flops       Flops/sec   DMMG iterations  Residual norm
1      2.398  658000000   274400000   36               0.00982093
2      3.507  966000000   275500000   26               0.0135982
3      5.462  1366000000  250100000   18               0.0196419
4      9.228  2022000000  219100000   13               0.0271964

Megaflop rates per function (without Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      1134    1566     1158     760      750       266      558
2      1128    1559     1132     750      740       254      567
3      1137    1567     1129     753      726       246      557
4      1074    1571     1114     728      697       228      555

■ With Cuda

level  Time    Flops       Flops/sec   DMMG iterations  Residual norm
1      0.1152  695000000   603100000   36               0.00982093
2      0.1909  1020000000  534400000   26               0.0135982
3      0.3441  1442000000  419100000   18               0.0196419
4      0.6458  2134000000  330500000   13               0.0271964

Megaflop rates per function (with Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      7122    8613     9739     11045    6184      1368     7029
2      9733    12627    9656     11653    5986      1066     6462
3      12508   17454    9165     11941    5280      809      5483
4      14987   23321    8531     12007    4497      629      4656
With the 1 000 000 grid size, we can observe that the computation time is much better with Cuda: the difference is very significant.
grid size: 10 000 000
■ Without Cuda
level  Time   Flops       Flops/sec   DMMG iterations  Residual norm
1      7.121  2260000000  317400000   12               0.0294628

Megaflop rates per function (without Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      1141    1571     1143     746      698       225      544

■ With Cuda

level  Time    Flops       Flops/sec   DMMG iterations  Residual norm
1      0.4003  2390000000  597000000   12               0.0294628

Megaflop rates per function (with Cuda):

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      15595   25216    8366     12063    4330      588      4580
One can also conclude, for each grid size, about the Megaflop rate of KSPSolve: it is much higher than the one obtained with CGNE, and for each grid size the Megaflop rate is at least 2 times better with Cuda.
[Summary charts: time, global Megaflop rates, Megaflop rates for each function, and iteration counts.]
PETSc has the advantage that we can pass the KSP type as a parameter, so we have one code for three different solvers.
We will study the developed code in the next part.
6.3 KSP code
The code is implemented as follows: we established two functions, Translate_Matrix_H(int matrix_size), which creates the matrix and inserts its values, and CG_Solve(Mat mat, Vec b), which solves the linear system Ax = b and stores the solution in the right-hand side vector in order to save memory.
Let us first detail the function Translate_Matrix_H(...).
We pass the size of the matrix as a parameter, so we can allocate the non-zero structure for the PETSc storage format. Indeed, by providing the right non-zero structure, we obtain a good performance gain.
The array nnz (PetscInt nnz[matrix_size]) contains the number of non-zero elements for each row: in our case, every row contains three non-zero elements, except the first and last ones, which contain two. The array v (PetscScalar v[3]) contains the non-zero elements to insert for each row. Note that we use the PETSc types for performance reasons. We then fill the nnz array and create the matrix with the following functions:
MatCreateSeqAIJ(PETSC_COMM_WORLD,matrix_size,
matrix_size,PETSC_DECIDE,nnz,&mat);
MatSetOption(mat,MAT_SYMMETRIC,PETSC_TRUE);
MatZeroEntries(mat);
MatSetFromOptions(mat);
With these functions, we create a sparse symmetric matrix (because of the option). The function MatSetFromOptions(...) applies the options passed as command-line arguments.
We then insert the values with the function MatSetValues(...); it is very efficient because we insert all the non-zero elements of a row at once. We used MatSetValue(...) before, which inserts just one value, and observed that the computation time was very long, so we optimized the code by inserting all the values row by row.
Mat Translate_Matrix_H(int matrix_size)
{
  Mat mat;
  PetscInt i;
  PetscInt idxm[1],idxn[3];
  PetscScalar v[3];
  PetscInt nnz[matrix_size];
  PetscFunctionBegin;
MatCreate(PETSC_COMM_WORLD,&mat);
nnz[0] = 2;
nnz[matrix_size-1] = 2;
for(i=1;i<matrix_size-1;i++)
{
nnz[i] = 3;
}
MatCreateSeqAIJ(PETSC_COMM_WORLD,matrix_size,
matrix_size,PETSC_DECIDE,nnz,&mat);
MatSetOption(mat,MAT_SYMMETRIC,PETSC_TRUE);
MatZeroEntries(mat);
MatSetFromOptions(mat);
idxm[0] = 0; idxn[0] = 0; v[0] = 1.0; idxn[1] = 1; v[1] = 0.5;
MatSetValues(mat,1,idxm,2,idxn,v,INSERT_VALUES);
for(i=1;i<matrix_size-1;i++){
idxm[0] = i;
idxn[0] = i-1;
v[0] = 0.5;
idxn[1] = i;
v[1] = 1.0;
idxn[2] = i+1;
v[2] = 0.5;
MatSetValues(mat,1,idxm,3,idxn,v,INSERT_VALUES);
}
idxm[0] = matrix_size-1;
idxn[0] = matrix_size-2;
v[0] = 0.5;
idxn[1] = matrix_size-1;
v[1] = 1.0;
MatSetValues(mat,1,idxm,2,idxn,v,INSERT_VALUES);
MatAssemblyBegin(mat,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(mat,MAT_FINAL_ASSEMBLY);
return mat;
}
We now detail the function which solves the linear system. The function CG_Solve(Mat mat, Vec b) takes the matrix and the right-hand side vector as parameters.
We create the KSP context with the function KSPCreate(...), then attach a preconditioner with KSPGetPC(...). We can also add options with KSPSetFromOptions(...): it takes into account the arguments passed when we run the code, such as the KSP type (CG, CGNE or CGS). The function KSPSolve(...) solves the linear system Ax = b; in our case, the final solution is stored in the right-hand side vector b.
KSP CG_Solve(Mat mat, Vec b)
{
  PC                 pc;
  KSP                ksp;
  KSPConvergedReason reason;
  PetscReal          normVector;
  int                its;
PetscFunctionBegin;
KSPCreate(MPI_COMM_WORLD,&ksp);
KSPSetOperators(ksp,mat,mat,DIFFERENT_NONZERO_PATTERN);
KSPSetInitialGuessNonzero(ksp,PETSC_TRUE);
KSPGetPC(ksp,&pc);
PCSetType(pc,PCJACOBI);
PCFactorSetShiftType(pc,MAT_SHIFT_POSITIVE_DEFINITE);
KSPSetFromOptions(ksp);
KSPSetNormType(ksp,KSP_NORM_PRECONDITIONED);
KSPSetUp(ksp);
KSPSetTolerances(ksp,0.000010,0.000000,10000.000000,10000);
KSPSolve(ksp,b,b);
KSPGetConvergedReason(ksp,&reason);
if (reason==KSP_DIVERGED_INDEFINITE_PC) {
  PetscPrintf(PETSC_COMM_WORLD,"\nDivergence because of indefinite preconditioner;\nRun the executable again but with '-pc_factor_shift_type POSITIVE_DEFINITE' option.\n");
} else if (reason<0) {
  PetscPrintf(PETSC_COMM_WORLD,"\nOther kind of divergence: this should not happen: %d\n",(int)reason);
} else {
  KSPGetIterationNumber(ksp,&its);
  printf("\nConvergence in %d iterations.\n",(int)its);
}
VecNorm(b,NORM_2,&normVector);
return ksp;
}
int main(int argc,char **argv){
  PetscInt       n = 10;
  PetscMPIInt    size;
  PetscErrorCode ierr;
  int            matrix_size = 524288;
  Mat            A;
  Vec            b;
  KSP            ksp;
PetscInitialize(&argc,&argv,PETSC_NULL,help);
MPI_Comm_size(PETSC_COMM_WORLD,&size);
A=Translate_Matrix_H(matrix_size);
ierr = VecCreate(PETSC_COMM_WORLD,&b);CHKERRQ(ierr);
ierr = VecSetSizes(b,PETSC_DECIDE,matrix_size);CHKERRQ(ierr);
ierr = VecSetFromOptions(b);CHKERRQ(ierr);
ierr = VecSet(b,1.0);CHKERRQ(ierr);
ksp=CG_Solve(A,b);
KSPDestroy(ksp);
MatDestroy(A);
VecDestroy(b);
PetscFinalize();
return 0;
}
In our case, we run this code with three different solvers (CG, CGNE and CGS), selected at run time through the -ksp_type option.
6.4 KSP Results
We will present only some results, as this code could not be run with Cuda.
CG
Matrix size : 10 000
Convergence in 354 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cg
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=10000, cols=10000
total: nonzeros=29998, allocated nonzeros=29998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
****************************************************************************************************************
WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this
document
****************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

                     Max        Max/Min   Avg        Total
Time (sec):          1.329e+00  1.00000   1.329e+00
Objects:             1.300e+01  1.00000   1.300e+01
Flops:               6.382e+07  1.00000   6.382e+07  6.382e+07
Flops/sec:           4.803e+07  1.00000   4.803e+07  4.803e+07
MPI Messages:        0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Message Lengths: 0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Reductions:      1.300e+01  1.00000
Matrix size : 100 000
Convergence in 112 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cg
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=100000, cols=100000
total: nonzeros=299998, allocated nonzeros=299998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
****************************************************************************************************************
WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this
document
****************************************************************************************************************
                     Max        Max/Min   Avg        Total
Time (sec):          1.472e+00  1.00000   1.472e+00
Objects:             1.300e+01  1.00000   1.300e+01
Flops:               2.026e+08  1.00000   2.026e+08  2.026e+08
Flops/sec:           1.376e+08  1.00000   1.376e+08  1.376e+08
MPI Messages:        0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Message Lengths: 0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Reductions:      1.300e+01  1.00000
Matrix size : 1000 000
Convergence in 36 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cg
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=1000000, cols=1000000
total: nonzeros=2999998, allocated nonzeros=2999998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
********************************************************************************
WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this
document
********************************************************************************
                     Max        Max/Min   Avg        Total
Time (sec):          2.306e+00  1.00000   2.306e+00
Objects:             1.300e+01  1.00000   1.300e+01
Flops:               6.580e+08  1.00000   6.580e+08  6.580e+08
Flops/sec:           2.853e+08  1.00000   2.853e+08  2.853e+08
MPI Messages:        0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Message Lengths: 0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Reductions:      1.300e+01  1.00000
CGNE
Matrix size : 10 000
Convergence in 150 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cgne
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=10000, cols=10000
total: nonzeros=29998, allocated nonzeros=29998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
****************************************************************************************************************
WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this
document
****************************************************************************************************************
                     Max        Max/Min   Avg        Total
Time (sec):          1.299e+00  1.00000   1.299e+00
Objects:             1.400e+01  1.00000   1.400e+01
Flops:               3.773e+07  1.00000   3.773e+07  3.773e+07
Flops/sec:           2.905e+07  1.00000   2.905e+07  2.905e+07
MPI Messages:        0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Message Lengths: 0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Reductions:      1.400e+01  1.00000
Matrix size : 100 000
Convergence in 61 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cgne
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=100000, cols=100000
total: nonzeros=299998, allocated nonzeros=299998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
********************************************************************************
WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this
document
********************************************************************************
                     Max        Max/Min   Avg        Total
Time (sec):          1.447e+00  1.00000   1.447e+00
Objects:             1.400e+01  1.00000   1.400e+01
Flops:               1.548e+08  1.00000   1.548e+08  1.548e+08
Flops/sec:           1.070e+08  1.00000   1.070e+08  1.070e+08
MPI Messages:        0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Message Lengths: 0.000e+00  0.00000   0.000e+00  0.000e+00
MPI Reductions:      1.400e+01  1.00000
Matrix size : 1000 000
Convergence in 25 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cgne
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=1000000, cols=1000000
total: nonzeros=2999998, allocated nonzeros=2999998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
PETSc performance summary:
                      Max          Max/Min    Avg          Total
Time (sec):           2.409e+00    1.00000    2.409e+00
Objects:              1.400e+01    1.00000    1.400e+01
Flops:                6.480e+08    1.00000    6.480e+08    6.480e+08
Flops/sec:            2.690e+08    1.00000    2.690e+08    2.690e+08
MPI Messages:         0.000e+00    0.00000    0.000e+00    0.000e+00
MPI Message Lengths:  0.000e+00    0.00000    0.000e+00    0.000e+00
MPI Reductions:       1.400e+01    1.00000
CGS
Matrix size : 10 000
Convergence in 40 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cgs
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=10000, cols=10000
total: nonzeros=29998, allocated nonzeros=29998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
PETSc performance summary:
                      Max          Max/Min    Avg          Total
Time (sec):           1.285e+00    1.00000    1.285e+00
Objects:              1.700e+01    1.00000    1.700e+01
Flops:                1.249e+07    1.00000    1.249e+07    1.249e+07
Flops/sec:            9.720e+06    1.00000    9.720e+06    9.720e+06
MPI Messages:         0.000e+00    0.00000    0.000e+00    0.000e+00
MPI Message Lengths:  0.000e+00    0.00000    0.000e+00    0.000e+00
MPI Reductions:       1.700e+01    1.00000
Matrix size : 100 000
Convergence in 19 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cgs
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=100000, cols=100000
total: nonzeros=299998, allocated nonzeros=299998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
PETSc performance summary:
                      Max          Max/Min    Avg          Total
Time (sec):           1.353e+00    1.00000    1.353e+00
Objects:              1.700e+01    1.00000    1.700e+01
Flops:                5.980e+07    1.00000    5.980e+07    5.980e+07
Flops/sec:            4.419e+07    1.00000    4.419e+07    4.419e+07
MPI Messages:         0.000e+00    0.00000    0.000e+00    0.000e+00
MPI Message Lengths:  0.000e+00    0.00000    0.000e+00    0.000e+00
MPI Reductions:       1.700e+01    1.00000
Matrix size : 1 000 000
Convergence in 9 iterations.
****************Results : KSP***************
Let's see the ksp context
KSP Object:
type: cgs
maximum iterations=10000
tolerances: relative=1e-05, absolute=0, divergence=10000
left preconditioning
using nonzero initial guess
using PRECONDITIONED norm type for convergence test
PC Object:
type: jacobi
linear system matrix = precond matrix:
Matrix Object:
type: seqaij
rows=1000000, cols=1000000
total: nonzeros=2999998, allocated nonzeros=2999998
total number of mallocs used during MatSetValues calls =0
not using I-node routines
**************************************
PETSc performance summary:
                      Max          Max/Min    Avg          Total
Time (sec):           1.909e+00    1.00000    1.909e+00
Objects:              1.700e+01    1.00000    1.700e+01
Flops:                2.880e+08    1.00000    2.880e+08    2.880e+08
Flops/sec:            1.508e+08    1.00000    1.508e+08    1.508e+08
MPI Messages:         0.000e+00    0.00000    0.000e+00    0.000e+00
MPI Message Lengths:  0.000e+00    0.00000    0.000e+00    0.000e+00
MPI Reductions:       1.700e+01    1.00000
Conclusion
Comparison of CG and CGNE:
CGNE converges better than the CG solver (roughly twice as fast).
CG, matrix size 1 000 000
              Max          Max/Min    Avg          Total
Time (sec):   2.306e+00    1.00000    2.306e+00
Objects:      1.300e+01    1.00000    1.300e+01
Flops:        6.580e+08    1.00000    6.580e+08    6.580e+08
Flops/sec:    2.853e+08    1.00000    2.853e+08    2.853e+08
CGNE, matrix size 1 000 000
              Max          Max/Min    Avg          Total
Time (sec):   1.447e+00    1.00000    1.447e+00
Objects:      1.400e+01    1.00000    1.400e+01
Flops:        1.548e+08    1.00000    1.548e+08    1.548e+08
Flops/sec:    1.070e+08    1.00000    1.070e+08    1.070e+08
Otherwise, the only significant difference concerns the flop rate. We illustrate it here with a matrix of size 1 000 000, but the same observation holds for size 10 000 (CG reaches roughly twice the rate of CGNE).
Comparison of CG and CGS
The CGS solver converges better than CG (it needs less than a third of the iterations).
CG, matrix size 1 000 000
              Max          Max/Min    Avg          Total
Time (sec):   2.306e+00    1.00000    2.306e+00
Objects:      1.300e+01    1.00000    1.300e+01
Flops:        6.580e+08    1.00000    6.580e+08    6.580e+08
Flops/sec:    2.853e+08    1.00000    2.853e+08    2.853e+08
CGS, matrix size 1 000 000
              Max          Max/Min    Avg          Total
Time (sec):   1.909e+00    1.00000    1.909e+00
Objects:      1.700e+01    1.00000    1.700e+01
Flops:        2.880e+08    1.00000    2.880e+08    2.880e+08
Flops/sec:    1.508e+08    1.00000    1.508e+08    1.508e+08
Regarding the computation times, we cannot conclude on a difference between the two solvers. However, the flop rates are much better for the CG solver.
Comparison of CGS and CGNE
Regarding the number of iterations needed for convergence, CGS is better than CGNE (not quite twice as good).
Considering the flop rates, we see that they are similar.
CGNE, matrix size 1 000 000
              Max          Max/Min    Avg          Total
Time (sec):   1.447e+00    1.00000    1.447e+00
Objects:      1.400e+01    1.00000    1.400e+01
Flops:        1.548e+08    1.00000    1.548e+08    1.548e+08
Flops/sec:    1.070e+08    1.00000    1.070e+08    1.070e+08
CGS, matrix size 1 000 000
              Max          Max/Min    Avg          Total
Time (sec):   1.909e+00    1.00000    1.909e+00
Objects:      1.700e+01    1.00000    1.700e+01
Flops:        2.880e+08    1.00000    2.880e+08    2.880e+08
Flops/sec:    1.508e+08    1.00000    1.508e+08    1.508e+08
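For reference, the three solvers compared above differ only in the KSP type selected. The sketch below shows one minimal way such a run can be configured with PETSc; it is illustrative code, not the exact program of section 6.3, and it assumes the KSP context and its operators are created elsewhere. The performance tables are obtained by running the program with the -log_summary option.

#include <petscksp.h>

PetscErrorCode run_solver(KSP ksp, Vec b, Vec x, KSPType ksp_type)
{
    PetscErrorCode ierr;
    PC             pc;

    ierr = KSPSetType(ksp, ksp_type);CHKERRQ(ierr);                 /* KSPCG, KSPCGNE or KSPCGS         */
    ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
    ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);                   /* Jacobi preconditioning, as above */
    ierr = KSPSetTolerances(ksp, 1.e-5, PETSC_DEFAULT, PETSC_DEFAULT, 10000);CHKERRQ(ierr);
    ierr = KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);CHKERRQ(ierr);
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    ierr = KSPView(ksp, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);   /* prints the KSP context shown above */
    return 0;
}

Called for instance as run_solver(ksp, b, x, KSPCGNE), it reproduces the kind of run whose output is reported above.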
VII GPU used in neutronic domain
The aim of this project was to implement the conjugate gradient method on the GPU, using CUDA but also Cublas. We previously studied the performances of Cublas and CUDA on scalar-product examples, and it appeared that CUDA offers a good performance gain, though still less than Cublas. In this context, we decided to apply both to the neutronics domain, where the expected gains in time and Megaflops rate should be even larger.
In what follows, we study the architecture of the codes and explain our implementation choices.
Then we briefly describe the CUDA code and the Cublas one, and finally present the obtained results.
7.1 Introduction
In order to implement a professional code, we decided to use the following architecture.
We have six folders.
The Makefile contained in the main folder drives the compilation of all the codes in the subsidiary folders. It contains the variables necessary to export the compiler path, and the commands to run the compilation in every subsidiary folder.
In each subsidiary folder, except the src/ one, we build static libraries: all the Makefiles in the subsidiary folders compile the codes so that we obtain static libraries. This will later be changed to build dynamic libraries, since the codes in Clock and Matrix_Market should not be modified anymore.
The src/ folder contains the main.cu code, which includes all the generated libraries.
This structure is justified by the fact that we use a lot of different functions, so both the programmer and the user can handle the code easily. Currently, the user only has to modify the following variable in the main Makefile:
export FOLDER_HOME = /home/phd_hpc/fboillod/double_prec
It must point to the main folder of the project. Note that the paths to the compiler and to the CUDA installation folder must also be adapted.
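To fix ideas, here is a hypothetical skeleton of src/main.cu, only meant to illustrate how the libraries built in the subsidiary folders fit together; every header and function name in it (clock.h, matrix_market.h, mm_load_matrix, clock_now, ...) is an assumption made for this sketch, not the project's exact interface.

#include <cstdio>
#include <cuda_runtime.h>
#include "clock.h"           /* timing helpers, built in the Clock/ folder        */
#include "matrix_market.h"   /* matrix loader, built in the Matrix_Market/ folder */
#include "cg_cuda.h"         /* conjugate gradient, CUDA version (libcgcuda)      */
#include "cg_cublas.h"       /* conjugate gradient, Cublas version                */

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    int n = 0;
    double *A_h = mm_load_matrix(argv[1], &n);     /* read a Matrix Market file */

    double *A_d;
    cudaMalloc((void **)&A_d, (size_t)n * n * sizeof(double));
    cudaMemcpy(A_d, A_h, (size_t)n * n * sizeof(double), cudaMemcpyHostToDevice);
    /* ... allocate x_d, b_d and the work vectors the same way ... */

    double t0 = clock_now();
    /* call one of the CG versions provided by the static libraries, e.g.
       GPU_CG_CUBLAS(A_d, x_d, b_d, ...); */
    printf("Elapsed time: %f s\n", clock_now() - t0);
    return 0;
}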
For the compilation, we have the following scheme.
The main difficulty with this architecture concerned the languages: for example, the Makefile for main.cu has to mix several languages (C++ and CUDA). We represented this by the following scheme.
This project is already available on the gpu4u and fermi computers.
We will now study the various functions used to implement the conjugate gradient.
7.2 Conjugate gradient with CUDA-CPU
First, we wanted to implement the conjugate gradient exclusively on the GPU. For lack of time I could not finish this code, as it was not stable enough. I therefore decided to implement CG using both the CPU and the GPU.
All the functions presented in what follows are in the libcgcuda.cu file.
To realise the CG function, we use the matrix/vector product, which is performed on the GPU. We also implemented two versions of the scalar product: one using exclusively the GPU, and one computing the element-wise products on the GPU while the summation of the elements is done on the CPU. We also implemented a function
__global__ void GPU_addvec(double *a, double alpha, double *b, double beta, const int N)
which performs the following operation:
a[ ] = alpha*a[ ] + beta*b[ ]
This function runs entirely on the GPU.
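As an illustration, here is a minimal sketch of what such a kernel and the hybrid scalar product can look like; the kernel bodies, the launch configuration and the names GPU_multvec, dot_hybrid, tmp_d and tmp_h are assumptions made for this sketch, not necessarily the exact code of libcgcuda.cu.

__global__ void GPU_addvec(double *a, double alpha, double *b, double beta, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per vector entry */
    if (i < N)
        a[i] = alpha * a[i] + beta * b[i];
}

/* Hybrid scalar product: element-wise products on the GPU, summation on the CPU.
   tmp_d and tmp_h are preallocated device and host buffers of size N. */
__global__ void GPU_multvec(const double *x, const double *y, double *tmp, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        tmp[i] = x[i] * y[i];
}

double dot_hybrid(const double *x_d, const double *y_d, double *tmp_d, double *tmp_h, int N)
{
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    GPU_multvec<<<blocks, threads>>>(x_d, y_d, tmp_d, N);
    cudaMemcpy(tmp_h, tmp_d, N * sizeof(double), cudaMemcpyDeviceToHost);
    double s = 0.0;
    for (int i = 0; i < N; ++i)                      /* final reduction on the CPU */
        s += tmp_h[i];
    return s;
}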
The main difficulty in implementing this code was debugging it, since host functions cannot be called from device code.
7.3 Conjugate gradient with Cublas
The Cublas version of the conjugate gradient was easier to implement, because it can be debugged more quickly. Indeed, the function cublasGetVector(...) lets us copy a device vector back into a host vector.
Besides, the Cublas functions are easy to handle. To implement the conjugate gradient, we use the following Cublas functions:
cublasDcopy(n, b_d, 1, h_d, 1);
This function copies the vector b into the vector h.
cublasDgemv('N', n, n, 1.0, A_d, n, h_d, 1, 0.0, temp_Ah, 1);
This function performs the matrix/vector product temp_Ah = 1.0 * A*h + 0.0 * temp_Ah.
cublasDscal(n, v, h_d, 1);
This function scales the vector h by the scalar v, i.e. h = v * h.
cublasDaxpy(n, -1.0, g_d, 1, h_d, 1);
This function performs the operation h = h - 1.0 * g.
norm_cub = cublasDdot(n, x_d, 1, x_d, 1);
This function computes the dot product of x with itself, i.e. the squared norm of x.
The function we created is the following:
void GPU_CG_CUBLAS(double *A_d, double *x_d, double *b_d, double *vecnul_d, int n, int LOOP, double epsilon, const int nBytes, int *its_cublas, dim3 dimGrid)
A is the matrix (on the device), b the right-hand side, and x the initial solution. vecnul is a device vector containing only zero values, since Cublas vectors are not initialized when they are created. n is the size of the vectors (and of the matrix, which is square). LOOP is the maximum number of iterations we allow, and epsilon is the tolerance we fixed to stop the iteration (0.0001). its_cublas returns the number of iterations at convergence, and the remaining arguments are used for memory allocation.
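Putting these calls together, here is a minimal sketch of a conjugate gradient loop written with the legacy Cublas API; it is not the exact GPU_CG_CUBLAS code. It assumes that cublasInit() has already been called, that A_d is a dense n x n symmetric positive definite matrix stored column-major on the device, and that r_d, p_d and Ap_d are preallocated device work vectors (illustrative names).

#include <cublas.h>

void CG_cublas_sketch(const double *A_d, double *x_d, const double *b_d,
                      double *r_d, double *p_d, double *Ap_d,
                      int n, int LOOP, double epsilon, int *its)
{
    /* r = b - A*x ; p = r */
    cublasDcopy(n, b_d, 1, r_d, 1);
    cublasDgemv('N', n, n, -1.0, A_d, n, x_d, 1, 1.0, r_d, 1);
    cublasDcopy(n, r_d, 1, p_d, 1);

    double rr = cublasDdot(n, r_d, 1, r_d, 1);
    int k = 0;
    while (k < LOOP && rr > epsilon * epsilon) {
        cublasDgemv('N', n, n, 1.0, A_d, n, p_d, 1, 0.0, Ap_d, 1);  /* Ap = A*p        */
        double alpha = rr / cublasDdot(n, p_d, 1, Ap_d, 1);
        cublasDaxpy(n,  alpha, p_d, 1, x_d, 1);                     /* x = x + alpha*p  */
        cublasDaxpy(n, -alpha, Ap_d, 1, r_d, 1);                    /* r = r - alpha*Ap */
        double rr_new = cublasDdot(n, r_d, 1, r_d, 1);
        cublasDscal(n, rr_new / rr, p_d, 1);                        /* p = beta*p       */
        cublasDaxpy(n, 1.0, r_d, 1, p_d, 1);                        /* p = p + r        */
        rr = rr_new;
        k++;
    }
    *its = k;
}

The stopping test ||r|| <= epsilon is written as rr <= epsilon*epsilon to avoid an extra square root at every iteration.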
7.4 Results
We ran the previous code on two computers: fermi (GeForce GTX 480) and gpu4u (GeForce GTX 280).
Fermi:
Gpu4u:
Comparisons:
The first chart shows the Cublas Gflops rates for each graphic card used.
The second chart shows the Gflops/sec rate of the Cuda functions for each graphic card used.
We can conclude that our Cuda code is roughly twice as fast on the fermi computer. Regarding the Cublas code, the performance gain is not as satisfying.
We have now studied all the codes we implemented to solve our linear system with the conjugate gradient. We will now conclude on this project.
Conclusion
During my previous placement at CEA, we had already studied the PETSc library and its impact on the MINOS solver. We concluded that PETSc gave better convergence performance, but that in terms of computation time it was much slower.
The issue was now to study the conjugate gradient on the graphics card: to do so, we wanted to use PETSc, as we had worked with it before, but also to program the conjugate gradient with Cuda and Cublas.
We concluded that Cublas offers better performances than Cuda. However, the Cuda code benefits much more from the fermi graphic cards than the Cublas code does.
Regarding PETSc, we conclude on a very good performance gain for the Multigrid objects when using the GPU. The computation time of the PETSc code, however, does not differ as significantly as the Megaflops rates obtained (with and without GPU).
The next step is to study our KSP code with the GPU; to do so, it is necessary to reinstall the PETSc development version. We also developed a matrix interface which allows loading any matrix available on the Matrix Market website. In the future, this program will be improved and this interface will be used.
Bibliography
Thesis of M. Pierre Guérin
Thesis of M. Berlier
PETSc User Manual
CUDA User Manual
CUBLAS User Manual
Webography
IDRIS website: http://www.idris.fr/data/cours/parallel/mpi/choix_doc.html
PETSc website: http://www.mcs.anl.gov/petsc/petsc-as/