Boillod-Cerneux France
WS 2010/2011
GPU - PETSc for the neutron transport equation

Acknowledgements: I thank G. Haase for his help with the outcome of my project, and for backing me in my work during the whole semester.

Outline
Introduction
0. CEA
I. Presentation of the neutron transport equation
  1.1 Context
  1.2 Neutron transport equation
  1.3 Neutron diffusion equation
II. The linear system to solve
  2.1 Discretization of the neutron diffusion equation
  2.2 Domain decomposition
  2.3 The linear system
  2.4 The MINOS solver
  2.5 The linear system used for our project
III Introduction to the PETSc KSP methods
  3.1 The numerical methods
IV Introduction to multigrid with PETSc
  4.1 The multigrid concept
  4.2 Multigrid with PETSc
V PETSc-CUDA presentation
  5.1 PETSc with CUDA
VI PETSc used in the neutronics domain
  6.1 Multigrid code
  6.2 Multigrid Results
  6.3 KSP code
  6.4 KSP Results
VII GPU used in the neutronics domain
  7.1 Introduction
  7.2 Conjugate gradient with CUDA-CPU
  7.3 Conjugate gradient with Cublas
  7.4 Results
Conclusion
Bibliography
Webography

Abstract. This paper is about optimising the solution of the neutron transport equation. We implement the conjugate gradient method on the GPU, in order to speed up the solution of the neutron transport equation. We study different implementations of the conjugate gradient, in the CUDA and CUBLAS languages, and also with PETSc-CUDA. We then discuss a multigrid implementation with PETSc, and finally conclude on the obtained performances.

Key words. Parallel computing, eigenproblem, conjugate gradient method, multigrid, Graphics Processing Unit

Introduction

As computer power increases, scientists want to solve more and more complicated problems. They need to optimize both memory requirements and computation time, which is commonly called "High Performance Computing". HPC is applied to science as well as business or data mining, and holds a leading place in every domain that uses computer science. More than a concept, HPC is nowadays a standard and is used in several domains, such as petrology, seismology, economics and nuclear engineering. Many codes developed 10 or 15 years ago need to be rewritten in order to take advantage of parallel computers. This is the case of MINOS, a CEA neutronics code. MINOS' target is to solve the neutron transport equation in order to compute the power produced by a power plant. Indeed, the control of the nuclear reaction is based on the neutron transport equation, which is why its resolution is a major purpose. Moreover, the computation time is essential, so as to intervene quickly to damp or enhance the nuclear reaction according to the energy demand. Although MINOS is considerably optimised, CEA is still doing research to improve it. Recent research on GPU programming shows that delegating a part of the calculations to graphics cards can considerably speed up the computation time. Several languages exist for programming graphics cards, such as CUDA or CUBLAS. As the performance of GPUs is really satisfying, some libraries and languages such as PETSc or Python are working on improving their performance by using GPU programming. PETSc is a free library which is well suited to factorizing and solving linear systems in parallel.
Indeed, PETSc uses MPI (for instance Open MPI) and offers many functions for linear and non-linear systems. The subject of this project is to implement the conjugate gradient method in several languages using GPU programming, in order to use it later for solving the neutron transport equation in one dimension; in that case, the matrix is symmetric and positive definite. We will use the CUDA and CUBLAS languages, and the PETSc library with CUDA, as PETSc provides many numerical methods for solving linear systems, notably several versions of the conjugate gradient method. The organization of this paper is as follows. First, we present the neutron transport equation, and then explain how we obtain a linear system from this equation. We next focus on the codes written for the conjugate gradient method and on their structure. We then discuss multigrid programming with PETSc. Finally, we study the performances we obtain, and conclude.

0. CEA

Based in Île-de-France, CEA Saclay is a national and European research center. CEA carries out applied as well as fundamental research. With 5000 researchers, CEA Saclay contributes to optimizing the national fleet of nuclear plants and their operation, and does research on managing nuclear waste while respecting the environment. To do so, CEA bases its research on simulation, especially with computer science.

I. Presentation of the neutron transport equation

The neutron transport equation is fundamental in neutronics. Indeed, scientists must solve it in order to understand the nuclear reaction in a core. In this part, we first present the context and then the neutron transport equation. We end with the neutron diffusion equation, which is linked to the former.

1.1 Context

In a nuclear reactor, free neutrons collide with atoms. Two scenarios are possible: the neutron is absorbed by the atom (an absorption reaction), or the atom releases two or three other free neutrons (a fission reaction). Suppose that there are N fissions at time t. We call Keff the number of neutrons released by a fission. In case of fission, we obtain N·Keff fissions at time t+1 and N·Keff² fissions at time t+2. We deduce that at time t+k (k a natural integer), we obtain N·Keff^k fissions: (N·Keff^k) is a geometric sequence (of common ratio Keff), which converges only if Keff < 1. When Keff < 1 (resp. Keff > 1), the nuclear reaction is said to be sub-critical (resp. super-critical). When Keff = 1, it is said to be critical. The coefficient Keff thus determines whether the nuclear reaction will decrease or increase in the future. Solving the neutron transport equation gives us the value of Keff, which is why we have to solve it as fast as possible. In the picture (not reproduced here), each fission releases three neutrons, so Keff is equal to 3. The fission phenomenon produces heat, which creates water vapor; this vapor drives turbines, which then produce electricity. Having presented the general operation of a nuclear core, we now study the neutron transport equation.

1.2 Neutron transport equation

The neutron transport equation was established by L. Boltzmann in 1872, and is also called Boltzmann's equation¹. In it, Φ is the neutron flux (mol/m³) and the current is expressed in mol/(m²·sec); a standard statement of the equation is given below. This equation cannot be solved in general; it can only be simplified under particular conditions, for example in the stationary case, using Fick's laws (established in 1855).

¹ Reference: Paul Reuss, Précis de neutronique.
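The equation itself did not survive in this copy of the document; for reference, here is a standard statement of the time-dependent transport equation (the notation is the usual one, after Reuss, and not necessarily the author's):

\[
\frac{1}{v}\,\frac{\partial \phi}{\partial t}
+ \boldsymbol{\Omega}\cdot\nabla\phi
+ \Sigma_t(\mathbf{r},E)\,\phi(\mathbf{r},E,\boldsymbol{\Omega},t)
= \int_{4\pi}\!\int_0^{\infty} \Sigma_s(\mathbf{r},E'\!\to\!E,\boldsymbol{\Omega}'\!\cdot\!\boldsymbol{\Omega})\,
\phi(\mathbf{r},E',\boldsymbol{\Omega}',t)\,dE'\,d\Omega'
+ S(\mathbf{r},E,\boldsymbol{\Omega},t)
\]

The flux thus depends on seven variables: three for the position r, two for the direction Ω, one for the energy E and one for the time t.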
Before introducing Fick's laws, let us recall the definition of diffusion: "a system tends to homogenize the concentrations of its chemical elements: this natural phenomenon is called diffusion."

Fick's first law relates the diffusive flux to the concentration field, by postulating that the flux goes from regions of high concentration to regions of low concentration, with a magnitude proportional to the concentration gradient. In one dimension, we obtain:

J = - D ∂c/∂x

where D is the diffusion coefficient (m²/sec), c the concentration and J the diffusive flux.

Fick's second law predicts how diffusion causes the concentration field to change with time:

∂c/∂t = D ∂²c/∂x²

where x is the position and t the time.

The use of Fick's laws in a reactor core leads to the diffusion approximation: Boltzmann's equation can be simplified and approximated by the neutron diffusion equation. We will now focus on this equation, which no longer involves the angular variable.

1.3 Neutron diffusion equation

We saw the definition of diffusion above; illustrated with neutrons, it means that in a core the densely populated neutron areas tend to populate the sparse neutron areas. In the neutron diffusion equation, we introduce a new variable, the current:

p = - D ∇φ

The neutron diffusion equation then reads²:

∇·p + σa φ = Sφ + (1/λ) Sf    in R

where Sφ are the scattering sources, Sf = σf φ are the fission sources, R is the domain, σa is the absorption coefficient, σf is the fission coefficient, and λ is Keff. As said above, this equation does not consider the angular variable. Now that we have presented the neutron diffusion equation and justified its utility, we focus on the corresponding linear system. Indeed, it is more convenient to work with a linear system, because of the large panel of numerical methods available to solve it.

² Reference: Pierre Guérin's thesis.

II. The linear system to solve

In this part, we present the method used to discretize the diffusion equation, and the domain decomposition method. Then we explain the obtained linear system, and finally we present the MINOS solver, the CEA program that carries out the resolution of the linear system.

2.1 Discretization of the neutron diffusion equation

Let us consider the neutron diffusion equation above, written as the coupled system (see Pierre Guérin's thesis):

D^-1 p + ∇φ = 0          in R    (1.1)
∇·p + σa φ = S           in R    (1.2)
boundary conditions on ∂R        (1.3)

These coupled equations are discretized with the Raviart-Thomas finite element method. Here we detail how to obtain the linear system from these equations. We consider p and φ satisfying the neutron diffusion equation. We multiply the first equation by a test vector q, and the second by a test function ψ (ψ is a square-integrable function), and write the system in its variational form, in order to discretize the set of equations (1.1)-(1.3) with a finite element method. We obtain the following system:

∫R D^-1 p·q + ∫R (∇φ)·q = 0         for all q    (2.1)
∫R (∇·p) ψ + ∫R σa φ ψ = ∫R S ψ     for all ψ    (2.2)
boundary conditions                               (2.3)

We now apply Green's formula to equation (2.1):

∫R D^-1 p·q - ∫R φ (∇·q) = 0        for all q    (3.1)
∫R (∇·p) ψ + ∫R σa φ ψ = ∫R S ψ     for all ψ    (3.2)
boundary conditions                               (3.3)

(the boundary term vanishes thanks to the boundary conditions). We employ the Raviart-Thomas finite element method to discretize the different functional spaces.
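As an illustration (our notation, not necessarily the author's), on a rectangular cell K of the Cartesian mesh the lowest-order Raviart-Thomas space for the current is

\[
RT_0(K) = \left\{ \mathbf{q} = \begin{pmatrix} a + b\,x \\ c + d\,y \end{pmatrix} :\; a,b,c,d \in \mathbb{R} \right\},
\]

whose degrees of freedom are the normal components of q on the four edges, while the flux φ is approximated by one constant value per cell. This is the standard lowest-order choice; the project may use a higher-order variant.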
The Raviart-Thomas method introduces a discretized space Wh for the current and a discretized space Vh for the flux. Writing P and φ for the coefficient vectors of the discrete current and flux in these spaces, the discretization of (3.1)-(3.3) gives equations (4.1)-(4.3), i.e. the following matrix system:

B φ - A P = 0
T φ + B^T P = S

This system can be rewritten with P as the single unknown: from the second equation, φ = T^-1 (S - B^T P); replacing this expression of φ in the first equation gives:

(B T^-1 B^T + A) P = B T^-1 S

The largest eigenvalue of this linear system yields the coefficient Keff. In the Raviart-Thomas case, px and py are uncoupled, and the system writes B = Bx + By and

A = | Ax  0  |
    | 0   Ay |

We set Wx = Ax + Bx Tx^-1 Bx^T and Wy = Ay + By Ty^-1 By^T. Finally, we obtain a linear system with the following matrix:

| Wx              Bx Tx^-1 By^T |
| By Ty^-1 Bx^T   Wy            |

To solve this linear system, CEA currently factorizes the previous matrix with the Cholesky method, and solves it with a block Gauss-Seidel. In our specific case, the domain decomposition method is also used to solve the diffusion equation more precisely.

2.2 Domain decomposition

In each subdomain, we intervene on the boundary conditions only. Note that in our case we consider a Cartesian mesh. To explain how we decompose the domain, consider two subdomains R1 and R2 and the normal vectors n1, n2 associated with each subdomain. In the neutron transport equation case, we rewrite the interface conditions on the current p at iteration n as follows:

p1^n · n1 + α φ1^n = - p2^(n-1) · n2    (5.1)
p2^n · n2 + α φ2^n = - p1^(n-1) · n1    (5.2)

and introduce them into the system above. With the same operations as detailed in the previous paragraph, we obtain a linear system very similar to the one mentioned before. Indeed, this domain decomposition method is an iterative one: in consequence, only the right-hand side of the system is modified. All the previous operations yield a linear system of the form Ax = b. It is easier to solve this kind of system, because many numerical methods are available for it. In the next paragraph, we focus on the obtained linear system, which returns the coefficient Keff.

2.3 The linear system

The following linear system¹ is the result of applying the domain decomposition and discretization methods to the neutron diffusion equation. This system is a sequential linear system; its matrix, denoted H', is:

H' = | Ax + Bx Tx^-1 Bx^T    Bx Tx^-1 By^T       |
     | By Ty^-1 Bx^T         Ay + By Ty^-1 By^T  |

¹ Reference: Pierre Guérin's thesis.

Characteristics of the matrices: H' is symmetric positive definite; T is diagonal, but this depends on the choice of Wh and Vh; B is bidiagonal; A is a dense matrix (A is also called the current-coupling reference matrix); W has the same profile as A; S is the right-hand side. The largest eigenvalue of this linear system yields Keff: solving this linear system is a major issue for the control of a nuclear reaction. Currently, CEA implements a Cholesky method to factorize the matrix H' and then the Gauss-Seidel method to solve the linear system. As the matrix size is huge, parallelization is used to implement the solver of this linear system. In particular, the MINOS solver is a parallelized code, developed in C++, which performs this resolution.

2.4 The MINOS solver

The MINOS solver contains different C++ programs, which are optimised. Indeed, the C++ structure offers stability and is well suited for parallelization. We will focus on the representation of the linear system in MINOS only.
2.4.1 Matrix storage

The matrix H' has the following structure: every row contains only three non-zero elements, and consecutive blocks share one element. The matrix H' is stored in a flat-profile format. We keep the off-diagonal values of each block (only one side, as the matrix is symmetric) in an array called 'matCur'. The 'mut' array contains the number of off-diagonal elements of each row, and 'matCurD' contains all the diagonal values. Note also that H' is symmetric positive definite.
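The description above suggests a layout like the following sketch. This is one plausible interpretation of the matCur/mut/matCurD arrays, written for illustration only; the actual MINOS data structures may differ.

/* Sketch of a symmetric flat-profile storage, following the description
   above: matCurD holds the diagonal, matCur the off-diagonal values row
   by row, and mut[i] the number of off-diagonal entries of row i.
   Hypothetical layout, not the real MINOS code. */
typedef struct {
    int     n;        /* matrix order                          */
    double *matCurD;  /* n diagonal values                     */
    double *matCur;   /* packed off-diagonal values            */
    int    *mut;      /* number of off-diagonal values per row */
} ProfileMat;

/* y = H' x, exploiting symmetry: each stored entry (i,j), j < i, also
   contributes as (j,i). The entries of row i are assumed contiguous in
   matCur and adjacent to the diagonal (band/profile assumption). */
static void profile_matvec(const ProfileMat *m, const double *x, double *y)
{
    int off = 0;
    for (int i = 0; i < m->n; ++i)
        y[i] = m->matCurD[i] * x[i];
    for (int i = 0; i < m->n; ++i) {
        int first = i - m->mut[i];        /* first stored column of row i */
        for (int k = 0; k < m->mut[i]; ++k) {
            int    j = first + k;
            double v = m->matCur[off++];
            y[i] += v * x[j];             /* stored (lower) part  */
            y[j] += v * x[i];             /* mirrored (upper) part */
        }
    }
}

Storing only one triangle halves the memory, and the mirrored update in the inner loop lets a matrix-vector product reuse each stored entry twice.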
As mentioned before, the linear system is solved with a Gauss-Seidel method. In every Gauss-Seidel step, the linear system corresponds to a decomposition of the H' matrix for each direction. These per-direction systems are solved with a Cholesky method. MINOS iterates the Cholesky resolution until the error ε is smaller than a value fixed by the user. Currently, the error is defined by¹:

ε = || S(n+1) - S(n) || / || S(n) ||

where S(n+1) (resp. S(n)) is the result obtained at iteration n+1 (resp. n). Right now, the error is of order 10^-5. The MINOS solver uses parallelization to solve the linear system; more precisely, it uses MPI, which accelerates the computation.

¹ Reference: Pierre Guérin's thesis.

2.4.2 Parallelisation

The data distribution is parallelised with MPI functions: there currently exists one method to distribute the data in parallel. All the data of the matrices and vectors are distributed over all processes: the data are split according to the number of processes, and each process computes on the part of the data it receives. It has been observed that, as soon as we use very large matrices and a large number of processes, the communication time becomes larger than the computation time.

2.5 The linear system used for our project

As I cannot access the exact CEA matrix and vectors, I chose to use another symmetric positive definite matrix. Moreover, the storage format of the CEA matrix is not optimal for PETSc, so considerable time would be lost translating it into a PETSc storage format. We therefore slightly modify the matrix structure, and use a tridiagonal symmetric matrix. This makes the comparison of computation times fairer, because the time to translate the matrix format does not interfere. The aim of this project is to implement the conjugate gradient method, in order to solve the linear system with it. We implement it using graphics card programming, so as to speed up the computation. In what follows, we briefly study the PETSc library and detail the KSP methods proposed to solve our linear system.

III Introduction to the PETSc KSP methods

PETSc is an abbreviation for "Portable, Extensible Toolkit for Scientific computation". This library provides functions and many tools to implement numerical methods for solving linear and non-linear systems. PETSc is particularly suitable for sparse and dense systems as well as large-scale systems, because most of PETSc's functions are parallelized. In this paragraph, we introduce the KSP objects for solving linear systems.

3.1 The numerical methods

In this part, we present the numerical methods we chose to solve our linear system, starting with their theory.

The conjugate gradient method: this method is an effective means to solve linear systems whose coefficient matrix is symmetric and positive definite. We consider a linear system Ax = b, where A is symmetric and positive definite. We fix a vector x(0), and successively calculate x(1), x(2), x(3), ..., generating a sequence {x(k)} that approximates the solution x of Ax = b. For the preconditioner, we decompose A as A = L + D + L^T, where L is a strictly lower triangular matrix and D a diagonal matrix of the same size as L, and form the matrix M = (D+L) D^-1 (D+L)^T. We then apply the following algorithm:

r(0) = b - A x(0) ;  solve M r'(0) = r(0) ;  p(0) = r'(0)
until convergence, do:
  a(k)    = (r(k), r'(k)) / (p(k), A p(k))
  x(k+1)  = x(k) + a(k) p(k)
  r(k+1)  = r(k) - a(k) A p(k)        ( = b - A x(k+1) )
  solve M r'(k+1) = r(k+1)
  b(k)    = (r(k+1), r'(k+1)) / (r(k), r'(k))
  p(k+1)  = r'(k+1) + b(k) p(k)

We have converged when r(k+1) = 0 (in practice, when its norm is small enough). This method corresponds to the KSP type "CG".

Conjugate gradient method on the normal equations: the CGNE solver also applies the CG iterative method, but to the normal equations A^T A x = A^T b, without explicitly forming the matrix A^T A. This method corresponds to the KSP type "CGNE".
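To make the algorithm concrete, here is a minimal, self-contained C sketch of it, applied to the tridiagonal test matrix of section 2.5 (1.0 on the diagonal, 0.5 off-diagonal). It is illustrative only: for brevity the preconditioner is the Jacobi one (M = diag(A), which is the identity here) rather than the M = (D+L)D^-1(D+L)^T defined above, and the stopping test uses the residual norm.

#include <stdio.h>
#include <math.h>

#define N 10000

static void matvec(const double *x, double *y)   /* y = A x, A tridiagonal */
{
    for (int i = 0; i < N; ++i) {
        y[i] = 1.0 * x[i];
        if (i > 0)     y[i] += 0.5 * x[i - 1];
        if (i < N - 1) y[i] += 0.5 * x[i + 1];
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main(void)
{
    static double x[N], r[N], z[N], p[N], Ap[N];
    /* x(0) = 0, so r(0) = b = (1,...,1); z = M^-1 r with M = diag(A) = I */
    for (int i = 0; i < N; ++i) { x[i] = 0.0; r[i] = 1.0; z[i] = r[i]; p[i] = z[i]; }
    double rz = dot(r, z);
    for (int k = 0; k < 10000 && sqrt(dot(r, r)) > 1e-5; ++k) {
        matvec(p, Ap);
        double alpha = rz / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (int i = 0; i < N; ++i) z[i] = r[i];       /* z = M^-1 r (Jacobi) */
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < N; ++i) p[i] = z[i] + beta * p[i];
    }
    printf("||r|| = %g\n", sqrt(dot(r, r)));
    return 0;
}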
As we saw, the KSP objects in PETSc are particularly suitable for solving linear systems. However, PETSc has also developed the multigrid concept, as it is more and more used in the HPC domain. PETSc offers a DMMG object, which creates multigrids, well adapted to linear and non-linear systems. In what follows, we study the DMMG concept with PETSc.

IV Introduction to multigrid with PETSc

In this section, we introduce the multigrid approach for solving linear systems, and then explain how PETSc implements it.

4.1 The multigrid concept

To introduce the multigrid concept, we consider the classic example of the Poisson equation in one dimension:

-Δu = f on [0,1]                        (1.1)
u(0) = u(1) = 0 (boundary conditions)   (1.2)

We apply the (second-order) finite difference method to (1.1), and obtain the following linear system:

( - u(i+1) + 2 u(i) - u(i-1) ) / h² = f(i)    (2.1)

where h is the grid size, generally h = 1/(n+1). We can rewrite these equations in matrix form:

A u = f                        (3.1)
u = (u1, u2, ..., uN-1)^T      (3.2)
u(0) = u(N) = 0                (3.3)
f = (f1, f2, ..., fN-1)^T      (3.4)

The discretized form of equation (3.1) on the grid of size h is the linear system:

Ah uh = fh    (4.1)

Applying the Jacobi or Gauss-Seidel algorithm to a linear system like (3.1) determines the solution vector u through updates of the form:

u(i) = ( u(i+1) + u(i-1) + h² f(i) ) / 2 ;  i = 1, ..., N    (5.1)

Note that the Gauss-Seidel method is like the Jacobi one, except that it uses the values already updated during the current sweep; the iteration is repeated until convergence. The system Au = f admits an exact solution u. We define the error of an approximation v as:

e = u - v    (6.1)

The best way to study the different frequency components of the error is to consider the Fourier modes, introduced as initial data for the Gauss-Seidel iteration. By considering the Fourier modes for several frequencies, we can establish the two following facts: the error of classic iterative methods is smoothed at every iteration; and, decomposing the error as a Fourier series, the high frequencies on a mesh of grid size h are damped better than the low ones. The low-frequency components of the error on a fine mesh appear as high frequencies on a coarse mesh, and are therefore damped there by the iterative method. The steps of a multigrid resolution of Au = f are as follows (a code sketch of this cycle is given after the list):

- We perform several Gauss-Seidel iterations on Ah uh = fh, obtaining a vector vh which approximates u on the finest grid (pre-smoothing).
- We calculate the associated residual rh = fh - Ah vh.
- We transfer this residual to a coarser grid (with 2h as grid size) by using a restriction operator R.
- We solve the following coarse equation: A2h (u2h - v2h) = A2h e2h = r2h.
- We apply a prolongation (or interpolation) operator to e2h to compute eh on the finest grid.
- We update the approximation accordingly: vh = vh + eh.
- We perform several more Gauss-Seidel iterations on Ah uh = fh, taking vh as the initial guess (post-smoothing).
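The following minimal C sketch implements this cycle recursively (a V-cycle) for the 1-D Poisson problem above, with Gauss-Seidel smoothing, full-weighting restriction and linear interpolation. The grid sizes (n = 2^L - 1 interior points), sweep counts and transfer operators are illustrative choices, not PETSc's.

#include <stdlib.h>

static void gauss_seidel(double *u, const double *f, int n, double h, int sweeps)
{
    for (int s = 0; s < sweeps; ++s)
        for (int i = 0; i < n; ++i) {
            double ul = (i > 0)     ? u[i - 1] : 0.0;  /* u(0) = 0 */
            double ur = (i < n - 1) ? u[i + 1] : 0.0;  /* u(1) = 0 */
            u[i] = 0.5 * (ul + ur + h * h * f[i]);     /* update (5.1) */
        }
}

/* One V-cycle on -u'' = f with n interior points, n = 2^L - 1. */
static void vcycle(double *u, const double *f, int n, double h)
{
    if (n == 1) { u[0] = 0.5 * h * h * f[0]; return; } /* coarsest: direct solve */
    gauss_seidel(u, f, n, h, 3);                       /* pre-smoothing          */
    int nc = (n - 1) / 2;                              /* coarse interior points */
    double *r  = calloc(n,  sizeof *r);
    double *rc = calloc(nc, sizeof *rc);
    double *ec = calloc(nc, sizeof *ec);
    for (int i = 0; i < n; ++i) {                      /* r = f - A u            */
        double ul = (i > 0) ? u[i-1] : 0.0, ur = (i < n-1) ? u[i+1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - ul - ur) / (h * h);
    }
    for (int i = 0; i < nc; ++i)                       /* full-weighting restriction */
        rc[i] = 0.25 * (r[2*i] + 2.0 * r[2*i+1] + r[2*i+2]);
    vcycle(ec, rc, nc, 2.0 * h);                       /* coarse-grid correction */
    for (int i = 0; i < nc; ++i) {                     /* linear interpolation   */
        u[2*i+1] += ec[i];
        u[2*i]   += 0.5 * ec[i];
        u[2*i+2] += 0.5 * ec[i];
    }
    gauss_seidel(u, f, n, h, 3);                       /* post-smoothing         */
    free(r); free(rc); free(ec);
}

/* Usage example: vcycle(u, f, 1023, 1.0/1024); repeated until the residual is small. */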
We now focus our attention on multigrid with PETSc, as PETSc offers methods to handle it.

4.2 Multigrid with PETSc

Before using multigrid methods with PETSc, one should know that multigrid in PETSc is treated as a preconditioner and not as a standalone solver; however, this can be changed by using the PETSc KSP methods. The multigrid concept is represented by a structure in PETSc, the DMMG structure, which interacts with the DA structure, KSP, PC and some other PETSc types (such as Mat, Vec, ...). More generally, DMMG follows an object-oriented framework programming style: each object is virtually created by DMMG, and DMMG runs them. With PETSc, the multigrid methods work as follows: each solver (the smoothers and the coarse-grid solver) is represented by a KSP object. Moreover, we have to use a PCType of PCKSP for the composite preconditioners if we decide to employ the KSP methods. Under these conditions, the DMMG (i.e. multigrid with PETSc) manages the construction of the multigrid preconditioners and multigrid solvers. The biggest advantage of PETSc is that we can easily use PETSc multigrid to solve linear systems (using the KSP methods) or non-linear systems (using the SNES methods). The PETSc library also provides code to generate all the parameters of the multigrid, such as the right-hand side (only available for linear systems), the Jacobian matrices, the interpolation or restriction operators, and the function evaluations for a given level of discretization (only available for non-linear systems). All of these are accessed with PETSc functions, which are easy to handle. What exactly does the DMMG do? It creates a KSP (in case of a linear system) or SNES (non-linear system) object, and sets the PCType by using the DMMG functions, which can access the PC associated with the DMMG. For each level, the vectors, the restriction (or interpolation) functions and the matrices are created and filled. Note that, to set the values of the initial solution, we use the DA structure (which allows the creation of logically rectangular meshes in 1, 2 or 3 dimensions). However, the DA structure is not available with CUDA; it is replaced by the DM one. The DM structure works in the same way: it creates a KSP or SNES object, then sets the preconditioner type; the vectors, the restriction (or interpolation) functions and the matrices are created and filled for each level. We will now explain how CUDA is used with the DMMG object.

V PETSc-CUDA presentation

Nowadays PETSc is available with CUDA. However, this version of PETSc is only available in petsc-dev; it will be included in the next PETSc release (3.2). The main advantage of PETSc is that the programmer does not have to write any CUDA code: the programming is still done with PETSc functions, and CUDA appears only through the arguments passed when running the code. This is still the same idea as PETSc in general: one code, but several ways of running it. The other advantage is that anyone can code for the GPU, without deep knowledge of CUDA.

5.1 PETSc with CUDA

As the manipulation of CUDA with PETSc is really simple, installing the petsc-dev version is still the most complicated part of the work! One should check, before installing the petsc-dev version, which versions of cusp, thrust and CUDA are already installed on the computer. Currently, thrust version 1.4.0 and cusp 0.4.0 work with PETSc-CUDA, which is a problem, as it is complicated to find the right thrust version (there is no version history on the official website). The petsc-dev version is currently installed on fermi, but there is still a compatibility problem with the thrust versions. In consequence, I could not run my KSP code with CUDA, as one of the necessary libraries was not correctly installed. However, I could perfectly run my multigrid code: the DMMG object is particularly well adapted to using CUDA. The main advantage of PETSc is that you implement your code once, and then run it with the PETSc-CUDA options: one code, but two ways to run it, is the concept of PETSc-CUDA. Some operations are directly run on the GPU: MatMult(...), KSPSolve(...), VecAXPY(...), VecWAXPY(...), as well as other vector operations. For now, we have to use the Jacobi preconditioner if we want to run our code on the GPU. Most of the KSP types are also available on the GPU. As mentioned before, the programmer does not have to write any CUDA code: we use the classic PETSc functions, and just run the code with the PETSc-CUDA options. For example, one should run a PETSc code with the option -da_vec_type cuda if we want the vector associated with the DA/DMMG to be stored and processed on the GPU.
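For instance, a typical launch combining the options mentioned above could look like the following line (the executable name is hypothetical; -ksp_type, -pc_type and -log_summary are standard PETSc options, and -da_vec_type cuda is the option quoted above):

./multigrid_solver -ksp_type cg -pc_type jacobi -da_vec_type cuda -log_summary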
Now that we have studied PETSc-CUDA and the behaviour of CUDA with PETSc, let us see the different PETSc codes.

VI PETSc used in the neutronics domain

The MINOS solver is currently parallelized to optimize the computation time. PETSc, however, offers functions which are already parallelized: the user does not have to deal with the MPI communications, which are optimized, nor with the CUDA implementation. Besides, the matrix storage formats are optimized and adapted to systems such as the neutron transport one. We first present the PETSc multigrid applied to a system similar to the neutron transport equation (briefly presented in 2.5), then the codes using only KSP functions to solve the same system.

6.1 Multigrid code

We implemented a program to solve the linear system (presented in 2.5) by using the DMMG functions. This choice is justified by the PETSc-CUDA implementation: as said before, the DMMG options allow running PETSc code on the GPU. The code is organized around the following functions:
- void ksp_Results(KSP ksp); which prints the KSP results,
- void ksp_Options(KSP ksp); which sets the different options of the KSP,
- PetscErrorCode ComputeInitialSolution(DMMG dmmg); which computes the initial solution for the DMMG,
- PetscErrorCode ComputeRHS_VecSet(DMMG dmmg, Vec b); which computes the right-hand side for the DMMG,
- PetscErrorCode ComputeMatrix_A(DMMG dmmg, Mat jac, Mat A); which computes the matrix associated with the DMMG.

We will not detail ksp_Results and ksp_Options, as they are similar to those of the KSP code presented in section 6.3.

The function ComputeInitialSolution(DMMG dmmg) uses the DA structure to access the initial solution, which is much better in terms of performance. We insert the values into the vector by using the DA structure:

PetscErrorCode ComputeInitialSolution(DMMG dmmg)
{
  DA          da = (DA) dmmg->dm;
  PetscInt    mx, my, xs, ys, xm, ym, i, j;
  PetscScalar **array;
  Vec         x = (Vec) dmmg->x;

  DAGetInfo(da, 0, &mx, &my, 0,0,0,0,0,0,0,0);
  DAGetCorners(da, &xs, &ys, 0, &xm, &ym, 0);
  DAVecGetArray(da, x, &array);
  for (j=ys; j<ys+ym; j++) {
    for (i=xs; i<xs+xm; i++) {
      array[j][i] = 1.0;            /* constant initial guess */
    }
  }
  DAVecRestoreArray(da, x, &array);
  VecAssemblyBegin(x);
  VecAssemblyEnd(x);
  return(0);
}

In the function ComputeRHS_VecSet(DMMG dmmg, Vec b), we use the VecSet(...) function, as it also shows good performance:

PetscErrorCode ComputeRHS_VecSet(DMMG dmmg, Vec b)
{
  PetscPrintf(PETSC_COMM_WORLD, "in rhs VecSet\n");
  VecSet(b, 1.0);
  VecAssemblyBegin(b);
  VecAssemblyEnd(b);
  return(0);
}

The function ComputeMatrix_A(DMMG dmmg, Mat jac, Mat A) inserts the values into the matrix associated with the DMMG. We use MatSetValuesStencil to insert all the values, because this function is particularly adapted to sparse matrices and uses the grid indices:

PetscErrorCode ComputeMatrix_A(DMMG dmmg, Mat jac, Mat A)
{
  DA          da = (DA) dmmg->dm;
  PetscInt    i, j, mx, my, xm, ym, xs, ys;
  MatStencil  row, col[3];
  PetscScalar v[3];

  DAGetInfo(da, 0, &mx, &my, 0,0,0,0,0,0,0,0);
  DAGetCorners(da, &xs, &ys, 0, &xm, &ym, 0);
  for (i=xs; i<xs+xm; i++) {
    row.i = i;
    for (j=ys; j<ys+ym; j++) {
      row.j = j;
      if (i==xs) {
        v[0] = 1/(abs(i-j)+1.0); col[0].i = i; col[0].j = j;
        v[1] = 1/(abs(i-j)+1.0); col[1].i = i; col[1].j = j+1;
        MatSetValuesStencil(A,1,&row,2,col,v,INSERT_VALUES);
      } else if (j==ys) {
        v[0] = 1/(abs(i-j)+1.0); col[0].i = i; col[0].j = j;
        v[1] = 1/(abs(i-j)+1.0); col[1].i = i; col[1].j = j-1;
        MatSetValuesStencil(A,1,&row,2,col,v,INSERT_VALUES);
      } else if (j>=(i-1) && j<=(i+1)) {
        v[0] = 1/(abs(i-j)+1.0); col[0].i = i; col[0].j = j;
        v[1] = 1/(abs(i-j)+1.0); col[1].i = i; col[1].j = j-1;
        v[2] = 1/(abs(i-j)+1.0); col[2].i = i; col[2].j = j+1;
        MatSetValuesStencil(A,1,&row,3,col,v,INSERT_VALUES);
      }
    }
  }
  MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);
  return(0);
}

int main(int argc, char **argv)
{
  DMMG      *dmmg;
  DA        da;
  PC        pc;
  PetscReal norm;

  PetscInitialize(&argc,&argv,(char *)0,help);
  PetscOptionsSetFromOptions();
  DMMGCreate(PETSC_COMM_WORLD,1,PETSC_NULL,&dmmg);
  /* NB: da must have been created (e.g. with DACreate2d) before this point */
  DMMGSetDM(dmmg,(DM)da);
  ComputeInitialSolution(*dmmg);
  DMMGSetKSP(dmmg,ComputeRHS_VecSet,ComputeMatrix_A);
  ksp_Options(DMMGGetKSP(dmmg));
  KSPGetPC(DMMGGetKSP(dmmg),&pc);
  PCSetType(pc,PCJACOBI);
  PCFactorSetShiftType(pc,MAT_SHIFT_POSITIVE_DEFINITE);
  DMMGSetUp(dmmg);
  DMMGSolve(dmmg);
  VecAssemblyBegin(DMMGGetRHS(dmmg));
  VecAssemblyEnd(DMMGGetRHS(dmmg));
  ksp_Results(DMMGGetKSP(dmmg));
  VecAssemblyBegin(DMMGGetx(dmmg));
  VecAssemblyEnd(DMMGGetx(dmmg));
  DMMGView(dmmg,PETSC_VIEWER_STDOUT_WORLD);
  MatMult(DMMGGetJ(dmmg),DMMGGetx(dmmg),DMMGGetr(dmmg));  /* r = A x     */
  VecAXPY(DMMGGetr(dmmg),-1.0,DMMGGetRHS(dmmg));          /* r = A x - b */
  VecNorm(DMMGGetr(dmmg),NORM_2,&norm);                   /* residual    */
  DMMGDestroy(dmmg);
  DADestroy(da);
  PetscFinalize();
  return 0;
}

We chose to run this code with three different solvers: the classic conjugate gradient (CG), the conjugate gradient squared method (CGS) and finally CGNE, the conjugate gradient on the normal equations. However, the performances obtained with the CGS solver are similar to the CG ones, so we will not detail the result tables for the CGS solver.
6.2 Multigrid Results

CGNE — grid size: 10 000

Without CUDA:

level  Time    Flops       Flops/sec   iterations  residual norm
1      1.278   21920000    17150000    87          0.0150379
2      1.310   33440000    25520000    66          0.0185555
3      1.350   51980000    38510000    51          0.022592
4      1.403   80050000    57070000    39          0.0277488
5      1.595   124200000   77890000    30          0.0339612
6      1.730   192500000   111300000   23          0.0417193
7      2.148   305100000   142100000   18          0.0504998
8      2.886   482400000   167200000   14          0.061509
9      4.266   772900000   181200000   11          0.074435
10     6.691   1162000000  173700000   8           0.0959991
11     11.300  1812000000  160400000   6           0.121133

Mflop/s per function, without CUDA:

level  VecDot  VecNorm  VecAXPY  VecWAXPY  KSPSolve  PCApply  MatMult  MatMultTranspose
1      2085    1797     1680     984       898       456      682      719
2      2080    1827     1698     992       904       454      679      735
3      2117    1829     1683     987       896       452      679      717
4      2101    1834     1638     981       886       444      676      713
5      1720    1793     1584     948       792       398      605      630
6      1412    1761     1376     892       735       351      589      608
7      1206    1614     1191     826       679       285      589      609
8      1147    1561     1166     763       641       258      571      587
9      1101    1556     1122     729       613       241      550      578
10     1054    1567     1078     702       592       230      553      577
11     1095    1572     1114     737       592       234      557      584

With CUDA:

level  Time     Flops       Flops/sec   iterations  residual norm
1      1.35900  22800000    16780000    87          0.0150379
2      1.35600  34780000    25650000    66          0.0185555
3      1.37000  54060000    39450000    51          0.022592
4      1.47100  83250000    56580000    39          0.0277488
5      1.54800  129200000   83470000    30          0.0339612
6      1.63100  200200000   122700000   23          0.0417193
7      1.93500  317300000   164000000   18          0.0504998
8      2.49600  501600000   200900000   14          0.061509
9      3.59600  803700000   223500000   11          0.074435
10     5.66400  1208000000  213300000   8           0.0959991
11     9.79500  1884000000  192300000   6           0.121133

Mflop/s per function, with CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      109     124      1360     1243     322       718      2577     339
2      200     240      2487     2338     516       1215     3466     382
3      352     475      4007     4054     744       1692     4382     402
4      562     929      5268     5567     939       1821     5182     409
5      799     1791     6671     7363     1080      1749     5522     416
6      1152    3420     7790     9137     1238      1687     5686     446
7      1568    6200     8448     10488    1334      1527     5649     464
8      1895    10521    8633     11373    1343      1334     5352     463
9      2150    16289    8454     11805    1322      1130     4907     463
10     2312    22647    7874     12011    1259      463      4188     463
11     2401    28023    7187     12112    1185      727      3624     463

For each PETSc function, we observe that the Mflop rate increases with the number of levels, which is completely normal given the matrix size at each level. For the first levels (1 to 5), the VecDot rate differs considerably: at level 1 the difference is about a factor 20 in favour of the CPU, whereas at the highest level the CUDA rate is about 2 times better. For the VecNorm function the gain is impressive: at level 11, the Mflop rate with CUDA is more than 10 times the rate without CUDA. We reach the same conclusion for VecAXPY and VecAYPX, where the gain is about 7 to 8 times. The gain obtained on KSPSolve is disappointing, as it is only about 2 times better with CUDA. The MatMult function, however, also shows good gains, about 3 to 10 times better with CUDA (the gain differs at each level). The gain on the PCApply function is also significant, but we cannot state a general figure, as it varies per level (about 2 to 3 times better). What is surprising is that the total time and the global Mflop rate are not so different with or without CUDA. We reached the same conclusion with a PETSc example available in the PETSc tutorial. However, this observation no longer holds when the grid size increases.
Grid size: 100 000

Without CUDA:

level  Time    Flops       Flops/sec   iterations  residual norm
1      1.392   89200000    64070000    35          0.0301563
2      1.539   139400000   90610000    27          0.0368426
3      1.798   219800000   122300000   21          0.0447776
4      2.291   340600000   148700000   16          0.0553796
5      3.125   522200000   167100000   12          0.0694722
6      4.840   885400000   183000000   10          0.0802987
7      7.973   1452000000  182100000   8           0.0959991
8      13.770  2265000000  164400000   6           0.121133

Mflop/s per function, without CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      2074    1817     1613     980      862       428      651      698
2      1608    1826     1564     917      769       390      587      613
3      1329    1736     1366     895      720       322      599      607
4      1195    1601     1189     821      653       268      557      587
5      1134    1556     1139     752      633       260      563      583
6      1121    1563     1109     754      625       251      570      589
7      1142    1569     1116     756      611       242      560      584
8      1112    1570     1124     719      593       227      570      590

With CUDA:

level  Time     Flops       Flops/sec   iterations  residual norm
1      0.92030  92800000    100800000   35          0.0301563
2      0.99900  145000000   145200000   27          0.0368426
3      1.19400  228600000   191400000   21          0.0447776
4      1.58500  354200000   223500000   16          0.0553796
5      2.30900  543000000   235200000   12          0.0694722
6      3.75400  920600000   245200000   10          0.0802987
7      6.58600  1509000000  229200000   8           0.0959991
8      11.84000 2354000000  198800000   6           0.121133

Mflop/s per function, with CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      544     1140     5443     5750     849       1602     4833     344
2      823     2172     6835     7805     1017      1500     5220     377
3      1150    4114     7962     9568     1136      1515     5375     397
4      1418    7063     7868     10732    1156      1273     5062     397
5      1636    10353    7949     11452    1142      1061     4459     397
6      1805    15393    7754     11821    1140      925      4122     405
7      1885    21564    7261     11995    1089      781      3691     400
8      1940    27020    6431     12060    1034      638      3073     406

For a 100 000 grid size, we can observe a time gain at every level. However, the global Mflop rates are still comparable; the difference is not so significant. The per-function Mflop gains are about the same as before; for PCApply, we now observe that it is about 3 to 4 times better.

Grid size: 1 000 000

Without CUDA:

level  Time   Flops       Flops/sec   iterations  residual norm
1      2.129  392000000   184100000   15          0.0582586
2      3.147  594000000   188800000   11          0.074435
3      5.200  998000000   191900000   9           0.0873468
4      9.021  1606000000  178000000   7           0.106906

Mflop/s per function, without CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      1143    1557     1160     763      640       261      559      583
2      1123    1556     1133     777      632       257      567      589
3      1137    1565     1137     751      616       245      557      583
4      1088    1571     1132     701      592       229      552      580

With CUDA:

level  Time     Flops       Flops/sec   iterations  residual norm
1      1.38700  408000000   294200000   15          0.0582586
2      2.26000  618000000   273500000   11          0.074435
3      4.04800  1038000000  256400000   9           0.0873468
4      7.45700  1670000000  223900000   7           0.106906

Mflop/s per function, with CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      1523    8181     8232     11018    1162      1230     4971     397
2      1725    11965    7846     11641    1151      968      4353     407
3      1852    17456    7527     11911    1111      853      3947     399
4      1906    23417    6914     12042    1059      701      3388     400

For a 1 000 000 grid size, the time gain is about 1 to 2 seconds. The gains on the Vec operations are still very satisfactory (about 15 to 20 times for VecAYPX).

Grid size: 10 000 000

Without CUDA:

level  Time   Flops       Flops/sec   iterations  residual norm
1      6.663  1670000000  250700000   6           0.121133

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      1139    1568     1140     781      597       231      566      584

With CUDA:

level  Time     Flops       Flops/sec   iterations  residual norm
1      5.07600  1740000000  342800000   6           0.121133

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult  MatMultTranspose
1      1884    25236    6519     12054    1026      632      3057     404

We reach the same conclusion as before for the 10 000 000 grid size: the gains of each function are still the same.
We can conclude, regarding the previous results, that CUDA with PETSc offers very good performance gains in Mflop rate (for each detailed function). However, the global time and Mflop rates are disappointing; we were expecting more.

(Summary charts: time, global Mflop rate, per-function Mflop rates, iteration count — not reproduced here.)

We ran the same code with a different solver, CG, to compare the results with the CGNE solver. The question was: does the solver have an influence on the results? Can we obtain different gains just because of the solver?

CG — grid size: 10 000

Without CUDA:

level  Time    Flops       Flops/sec   iterations  residual norm
1      1.326   63820000    48130000    354         0.000998738
2      1.361   90650000    66620000    251         0.00140858
3      1.404   128100000   91270000    177         0.00199748
4      1.486   182900000   123100000   126         0.00280598
5      1.620   259400000   160100000   89          0.00397251
6      1.885   369100000   195800000   63          0.00561196
7      2.355   531000000   225500000   45          0.00785674
8      3.155   762700000   241800000   32          0.0110485
9      4.565   1111000000  243300000   23          0.0153719
10     7.020   1577000000  224600000   16          0.0220971
11     11.840  2416000000  204100000   12          0.0294628

Mflop/s per function, without CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      2123    1837     1689     992      1057      463      679
2      2058    1850     1703     983      1057      461      680
3      2117    1845     1695     989      1059      459      678
4      2116    1848     1689     985      1047      450      668
5      2044    1828     1605     979      984       409      618
6      1602    1780     1470     907      909       379      596
7      1213    1666     1210     816      804       302      587
8      1134    1554     1168     758      751       259      569
9      1078    1553     1103     722      712       237      552
10     1091    1567     1100     717      704       231      556
11     1099    1571     1114     737      696       224      554

With CUDA (the times are recovered from the Flops and Flops/sec columns, the decimal point of the printed figures being inconsistent with them):

level  Time     Flops       Flops/sec   iterations  residual norm
1      1.4646   67370000    46000000    354         0.000998738
2      1.4286   95690000    66980000    251         0.00140858
3      1.3954   135200000   96890000    177         0.00199748
4      1.4105   193100000   136900000   126         0.00280598
5      1.4472   273800000   189200000   89          0.00397251
6      1.5529   389600000   250900000   63          0.00561196
7      1.7936   560500000   312500000   45          0.00785674
8      2.2452   804900000   358500000   32          0.0110485
9      3.1978   1172000000  366500000   23          0.0153719
10     5.0578   1664000000  329000000   16          0.0220971
11     11.3200  2549000000  225200000   12          0.0294628

Mflop/s per function, with CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      125     125      1455     1272     339       720      3003
2      242     241      2608     2362     640       1231     4232
3      483     480      4237     4040     1213      1835     5542
4      945     920      5890     5492     6694      2122     2129
5      1820    1752     7609     7554     3450      2121     7458
6      3476    3203     9027     9417     4974      1997     7738
7      6399    5541     9840     10662    6196      1732     7673
8      10679   8522     10074    11448    6655      1433     7330
9      16723   11878    9974     11868    6410      1127     6744
10     22921   14657    9480     12093    5590      853      5836
11     28443   16682    8852     12143    4395      648      4609

Clearly, the per-function Mflop gains are the same as with CGNE. The overall times with CUDA are comparable to the CGNE ones, even though CG needs far more iterations; the KSPSolve Mflop rates, on the other hand, are much higher with CG than with CGNE.
Grid size: 100 000

Without CUDA:

level  Time    Flops       Flops/sec   iterations  residual norm
1      1.484   202600000   136500000   112         0.00315673
2      1.660   291000000   175300000   80          0.00441942
3      1.962   410200000   209100000   56          0.00631345
4      2.540   591000000   232600000   40          0.00883883
5      3.443   837400000   243200000   28          0.0126269
6      5.075   1215000000  239400000   20          0.0176777
7      8.019   1740000000  217000000   14          0.0252538
8      13.700  2559000000  186800000   10          0.0353553

Mflop/s per function, without CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      2091    1838     1675     984      1036      443      661
2      1947    1832     1592     960      970       409      611
3      1520    1778     1337     895      883       361      596
4      1197    1598     1200     810      769       273      557
5      1134    1548     1130     766      742       257      563
6      1123    1561     1121     747      733       252      569
7      1139    1569     1122     800      721       239      560
8      1109    1571     1117     725      690       213      571

With CUDA (times recovered from the Flops and Flops/sec columns, as above):

level  Time     Flops       Flops/sec   iterations  residual norm
1      0.8964   213900000   238600000   112         0.00315673
2      0.9385   307200000   327300000   80          0.00441942
3      1.0800   433000000   401000000   56          0.00631345
4      1.3860   623800000   450200000   40          0.00883883
5      1.9740   883800000   447700000   28          0.0126269
6      3.2230   1282000000  397900000   20          0.0176777
7      5.6220   1836000000  326500000   14          0.0252538
8      10.4700  2700000000  257800000   10          0.0353553

Mflop/s per function, with CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      1127    1154     6367     6145     2471      1988     6880
2      2119    2222     8042     8133     3868      1979     7425
3      3805    4143     8976     9661     5225      1740     7417
4      6226    7208     9655     10817    6026      1457     7018
5      9617    12466    9895     11604    6452      1262     6860
6      11685   15780    9330     11849    5471      894      5551
7      14331   21499    8702     12021    4718      660      5022
8      16117   27049    7901     12053    3845      508      4114

For the 100 000 grid size, we conclude on the same gain rates for each function. The time is about 2 times better with CUDA, except for the last levels.

Grid size: 1 000 000

Without CUDA:

level  Time   Flops       Flops/sec   iterations  residual norm
1      2.398  658000000   274400000   36          0.00982093
2      3.507  966000000   275500000   26          0.0135982
3      5.462  1366000000  250100000   18          0.0196419
4      9.228  2022000000  219100000   13          0.0271964

Mflop/s per function, without CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      1134    1566     1158     760      750       266      558
2      1128    1559     1132     750      740       254      567
3      1137    1567     1129     753      726       246      557
4      1074    1571     1114     728      697       228      555

With CUDA (times recovered as above):

level  Time    Flops       Flops/sec   iterations  residual norm
1      1.1520  695000000   603100000   36          0.00982093
2      1.9090  1020000000  534400000   26          0.0135982
3      3.4410  1442000000  419100000   18          0.0196419
4      6.4580  2134000000  330500000   13          0.0271964

Mflop/s per function, with CUDA:

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      7122    8613     9739     11045    6184      1368     7029
2      9733    12627    9656     11653    5986      1066     6462
3      12508   17454    9165     11941    5280      809      5483
4      14987   23321    8531     12007    4497      629      4656

With the 1 000 000 grid size, we can observe that the computation time is clearly better with CUDA: the difference is significant.
Grid size: 10 000 000

Without CUDA:

level  Time   Flops       Flops/sec   iterations  residual norm
1      7.121  2260000000  317400000   12          0.0294628

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      1141    1571     1143     746      698       225      544

With CUDA (time recovered as above):

level  Time    Flops       Flops/sec   iterations  residual norm
1      4.0030  2390000000  597000000   12          0.0294628

level  VecDot  VecNorm  VecAXPY  VecAYPX  KSPSolve  PCApply  MatMult
1      15595   25216    8366     12063    4330      588      4580

One can also conclude, for every grid size, about the KSPSolve Mflop rate: it is much higher than the one with CGNE, and for each grid size it is at least 2 times better with CUDA.

(Summary charts: time, global Mflop rate, per-function Mflop rates, iteration count — not reproduced here.)

PETSc presents the advantage that we can pass the KSP type as a parameter, so we have one code for three different solvers. We study the developed code in the next part.

6.3 KSP code

The code is implemented as follows: we established two functions, Translate_Matrix_H(int matrix_size), which creates the matrix and inserts its values, and CG_Solve(Mat mat, Vec b), which solves the linear system Ax = b and stores the solution in the right-hand-side vector, to save memory.

Let us first detail the function Translate_Matrix_H(...). We pass the size of the matrix as a parameter, so we can allocate the non-zero structure for the PETSc storage format; indeed, by declaring the right non-zero structure, we obtain a good performance gain. The array nnz (PetscInt nnz[matrix_size]) contains the number of non-zero elements of each row: in our case, every row contains three non-zero elements, except the first and last ones, which contain two. The array v (PetscScalar v[3]) contains the non-zero elements to insert for each row. Note that we use the PETSc types, for performance reasons. We then fill the nnz array and create the matrix by using the following functions:

MatCreateSeqAIJ(PETSC_COMM_WORLD,matrix_size,matrix_size,PETSC_DECIDE,nnz,&mat);
MatSetOption(mat,MAT_SYMMETRIC,PETSC_TRUE);
MatZeroEntries(mat);
MatSetFromOptions(mat);

With these functions, we create a sparse matrix, declared symmetric through the options. The function MatSetFromOptions(...) applies the options passed as arguments at run time. We then insert the values by using the function MatSetValues(...); it performs well because we insert all the non-zero elements of a row at once. We had used MatSetValue(...) before, which inserts just one value, and observed that the computation time was very long, so we optimized the code by inserting all the values of each row at once.
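Concretely, the values inserted by the function below produce the symmetric tridiagonal test matrix of section 2.5:

\[
H = \begin{pmatrix}
1   & 0.5 &        &     \\
0.5 & 1   & 0.5    &     \\
    & \ddots & \ddots & \ddots \\
    &     & 0.5    & 1
\end{pmatrix}
\]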
Mat Translate_Matrix_H(int matrix_size)
{
  Mat         mat;
  PetscInt    i, idxm[1], idxn[3];
  PetscScalar v[3];
  PetscInt    nnz[matrix_size];

  PetscFunctionBegin;
  nnz[0] = 2;
  nnz[matrix_size-1] = 2;
  for (i=1; i<matrix_size-1; i++) { nnz[i] = 3; }
  MatCreateSeqAIJ(PETSC_COMM_WORLD,matrix_size,matrix_size,PETSC_DECIDE,nnz,&mat);
  MatSetOption(mat,MAT_SYMMETRIC,PETSC_TRUE);
  MatZeroEntries(mat);
  MatSetFromOptions(mat);
  /* first row: diagonal 1.0 and one off-diagonal 0.5 */
  idxm[0] = 0;
  idxn[0] = 0; v[0] = 1.0;
  idxn[1] = 1; v[1] = 0.5;
  MatSetValues(mat,1,idxm,2,idxn,v,INSERT_VALUES);
  /* interior rows: 0.5, 1.0, 0.5 */
  for (i=1; i<matrix_size-1; i++) {
    idxm[0] = i;
    idxn[0] = i-1; v[0] = 0.5;
    idxn[1] = i;   v[1] = 1.0;
    idxn[2] = i+1; v[2] = 0.5;
    MatSetValues(mat,1,idxm,3,idxn,v,INSERT_VALUES);
  }
  /* last row */
  idxm[0] = matrix_size-1;
  idxn[0] = matrix_size-2; v[0] = 0.5;
  idxn[1] = matrix_size-1; v[1] = 1.0;
  MatSetValues(mat,1,idxm,2,idxn,v,INSERT_VALUES);
  MatAssemblyBegin(mat,MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(mat,MAT_FINAL_ASSEMBLY);
  return mat;
}

We now detail the function which solves the linear system. The function CG_Solve(Mat mat, Vec b) takes the matrix and the right-hand-side vector as parameters. We create the KSP context by using the function KSPCreate(...), then attach a preconditioner with KSPGetPC(...). We can also add options with KSPSetFromOptions(...): it takes into account the arguments passed when running the code, such as the KSP type (CG, CGNE or CGS). The function KSPSolve(...) solves the linear system Ax = b; in our case, the final solution is stored in the right-hand-side vector b.

KSP CG_Solve(Mat mat, Vec b)
{
  PC                 pc;
  KSP                ksp;
  KSPConvergedReason reason;
  PetscReal          normVector;
  int                its;

  PetscFunctionBegin;
  KSPCreate(MPI_COMM_WORLD,&ksp);
  KSPSetOperators(ksp,mat,mat,DIFFERENT_NONZERO_PATTERN);
  KSPSetInitialGuessNonzero(ksp,PETSC_TRUE);
  KSPGetPC(ksp,&pc);
  PCSetType(pc,PCJACOBI);
  PCFactorSetShiftType(pc,MAT_SHIFT_POSITIVE_DEFINITE);
  KSPSetFromOptions(ksp);
  KSPSetNormType(ksp,KSP_NORM_PRECONDITIONED);
  KSPSetUp(ksp);
  KSPSetTolerances(ksp,0.000010,0.000000,10000.000000,10000);
  KSPSolve(ksp,b,b);
  KSPGetConvergedReason(ksp,&reason);
  if (reason==KSP_DIVERGED_INDEFINITE_PC) {
    PetscPrintf(PETSC_COMM_WORLD,"\nDivergence because of indefinite preconditioner;\nRun the executable again but with '-pc_factor_shift_type POSITIVE_DEFINITE' option.\n");
  } else if (reason<0) {
    PetscPrintf(PETSC_COMM_WORLD,"\nOther kind of divergence: this should not happen: %d\n",reason);
  } else {
    KSPGetIterationNumber(ksp,&its);
    printf("\nConvergence in %d iterations.\n",(int)its);
  }
  VecNorm(b,NORM_2,&normVector);
  return ksp;
}

int main(int argc, char **argv)
{
  PetscMPIInt    size;
  PetscErrorCode ierr;
  int            matrix_size = 524288;
  Mat            A;
  Vec            b;
  KSP            ksp;

  PetscInitialize(&argc,&argv,PETSC_NULL,help);
  MPI_Comm_size(PETSC_COMM_WORLD,&size);
  A = Translate_Matrix_H(matrix_size);
  ierr = VecCreate(PETSC_COMM_WORLD,&b);CHKERRQ(ierr);
  ierr = VecSetSizes(b,PETSC_DECIDE,matrix_size);CHKERRQ(ierr);
  ierr = VecSetFromOptions(b);CHKERRQ(ierr);
  ierr = VecSet(b,1.0);CHKERRQ(ierr);
  ksp = CG_Solve(A,b);
  KSPDestroy(ksp);
  MatDestroy(A);
  VecDestroy(b);
  PetscFinalize();
  return 0;
}

In our case, we run this code with the three different solvers (CG, CGNE and CGS).

6.4 KSP Results

We present only some results, as this code could not be run with CUDA.

CG — matrix size: 10 000. Convergence in 354 iterations.
**************** Results: KSP ***************
KSP Object:
  type: cg
  maximum iterations=10000
  tolerances: relative=1e-05, absolute=0, divergence=10000
  left preconditioning
  using nonzero initial guess
  using PRECONDITIONED norm type for convergence test
PC Object:
  type: jacobi
  linear system matrix = precond matrix:
  Matrix Object:
    type: seqaij
    rows=10000, cols=10000
    total: nonzeros=29998, allocated nonzeros=29998
    total number of mallocs used during MatSetValues calls = 0
    not using I-node routines
*********************************************

PETSc performance summary:

                     Max        Max/Min  Avg        Total
Time (sec):          1.329e+00  1.00000  1.329e+00
Objects:             1.300e+01  1.00000  1.300e+01
Flops:               6.382e+07  1.00000  6.382e+07  6.382e+07
Flops/sec:           4.803e+07  1.00000  4.803e+07  4.803e+07
MPI Messages:        0.000e+00  0.00000  0.000e+00  0.000e+00
MPI Message Lengths: 0.000e+00  0.00000  0.000e+00  0.000e+00
MPI Reductions:      1.300e+01  1.00000

(These are sequential runs: the MPI rows are zero in every summary, so they are omitted hereafter.)

CG — matrix size: 100 000. Convergence in 112 iterations. The KSP context is the same as above, with rows=100000, cols=100000, nonzeros=299998 (allocated: 299998).

                     Max        Max/Min  Avg        Total
Time (sec):          1.472e+00  1.00000  1.472e+00
Objects:             1.300e+01  1.00000  1.300e+01
Flops:               2.026e+08  1.00000  2.026e+08  2.026e+08
Flops/sec:           1.376e+08  1.00000  1.376e+08  1.376e+08

CG — matrix size: 1 000 000. Convergence in 36 iterations. The KSP context is the same, with rows=1000000, cols=1000000, nonzeros=2999998 (allocated: 2999998).
                     Max        Max/Min  Avg        Total
Time (sec):          2.306e+00  1.00000  2.306e+00
Objects:             1.300e+01  1.00000  1.300e+01
Flops:               6.580e+08  1.00000  6.580e+08  6.580e+08
Flops/sec:           2.853e+08  1.00000  2.853e+08  2.853e+08

CGNE — matrix size: 10 000. Convergence in 150 iterations. The KSP context is the same as for CG except for type: cgne; matrix rows=10000, cols=10000, nonzeros=29998.

                     Max        Max/Min  Avg        Total
Time (sec):          1.299e+00  1.00000  1.299e+00
Objects:             1.400e+01  1.00000  1.400e+01
Flops:               3.773e+07  1.00000  3.773e+07  3.773e+07
Flops/sec:           2.905e+07  1.00000  2.905e+07  2.905e+07

CGNE — matrix size: 100 000. Convergence in 61 iterations. Same KSP context, with rows=100000, cols=100000, nonzeros=299998.

                     Max        Max/Min  Avg        Total
Time (sec):          1.447e+00  1.00000  1.447e+00
Objects:             1.400e+01  1.00000  1.400e+01
Flops:               1.548e+08  1.00000  1.548e+08  1.548e+08
Flops/sec:           1.070e+08  1.00000  1.070e+08  1.070e+08

CGNE — matrix size: 1 000 000. Convergence in 25 iterations.
Same KSP context, with rows=1000000, cols=1000000, nonzeros=2999998.

                     Max        Max/Min  Avg        Total
Time (sec):          2.409e+00  1.00000  2.409e+00
Objects:             1.400e+01  1.00000  1.400e+01
Flops:               6.480e+08  1.00000  6.480e+08  6.480e+08
Flops/sec:           2.690e+08  1.00000  2.690e+08  2.690e+08

CGS — matrix size: 10 000. Convergence in 40 iterations. The KSP context is the same as for CG except for type: cgs; matrix rows=10000, cols=10000, nonzeros=29998.

                     Max        Max/Min  Avg        Total
Time (sec):          1.285e+00  1.00000  1.285e+00
Objects:             1.700e+01  1.00000  1.700e+01
Flops:               1.249e+07  1.00000  1.249e+07  1.249e+07
Flops/sec:           9.720e+06  1.00000  9.720e+06  9.720e+06

CGS — matrix size: 100 000. Convergence in 19 iterations. Same KSP context, with rows=100000, cols=100000, nonzeros=299998.
    Time (sec): 1.353e+00   Objects: 17   Flops: 5.980e+07   Flops/sec: 4.419e+07   MPI reductions: 17

Matrix size 1 000 000 (nonzeros 2 999 998): convergence in 9 iterations.
    Time (sec): 1.909e+00   Objects: 17   Flops: 2.880e+08   Flops/sec: 1.508e+08   MPI reductions: 17

Conclusion

Recall that cgne applies the conjugate gradient to the normal equations (A^T A x = A^T b), while cgs is Sonneveld's conjugate gradient squared method.

Comparison of CG and CGNE. CGNE converges better than the CG solver (roughly twice as fast).

CG, matrix size 1 000 000:
    Time (sec): 2.306e+00   Objects: 13   Flops: 6.580e+08   Flops/sec: 2.853e+08

CGNE, matrix size 1 000 000:
    Time (sec): 1.447e+00   Objects: 14   Flops: 1.548e+08   Flops/sec: 1.070e+08

The only clear difference concerns the flop rate. We illustrate it here with matrices of size 1 000 000, but the same holds at size 10 000: CG sustains roughly twice the flop rate of CGNE.

Comparison of CG and CGS. The CGS solver converges better than CG (more than three times better).

CG, matrix size 1 000 000:
    Time (sec): 2.306e+00   Objects: 13   Flops: 6.580e+08   Flops/sec: 2.853e+08

CGS, matrix size 1 000 000:
    Time (sec): 1.909e+00   Objects: 17   Flops: 2.880e+08   Flops/sec: 1.508e+08

Regarding the run times, we cannot conclude on a difference between the two solvers; the flop rates, however, are much better for the CG solver.

Comparison of CGS and CGNE. Regarding the number of iterations to convergence, CGS is better than CGNE (not quite twice as good), while the flop rates are similar.
CGNE, matrix size 1 000 000:
    Time (sec): 1.447e+00   Objects: 14   Flops: 1.548e+08   Flops/sec: 1.070e+08

CGS, matrix size 1 000 000:
    Time (sec): 1.909e+00   Objects: 17   Flops: 2.880e+08   Flops/sec: 1.508e+08

VII GPU used in neutronic domain

The aim of this part of the project was to implement the conjugate gradient method on the GPU, using CUDA but also CUBLAS. We had previously studied the performance of CUBLAS and CUDA on scalar-product examples: CUDA offers a good performance gain, though still less than CUBLAS. In this context we decided to apply both to the neutronics domain, where the gains in time and in Megaflop rate should be even larger. In what follows we study the architecture of the codes and explain our implementation choices; we then briefly describe the CUDA code and the CUBLAS one, and finally present the results obtained.

7.1 Introduction

In order to implement a professional code, we chose the following architecture (the folder-tree figure is not reproduced here). The project contains six folders. The Makefile in the main folder drives the compilation of all the codes in the subfolders: it defines the variables needed to export the compiler path, and the commands that launch the compilation in every subfolder. Each subfolder except src/ builds a static library: all the subfolder Makefiles compile their code into static libraries. This will later be changed to dynamic libraries, since the codes in Clock and Matrix_Market no longer need to be modified. The src/ folder contains the code main.cu, which includes all the generated libraries. This structure is justified by the large number of different functions we use; it lets both the programmer and the user handle the code easily. Currently the user only has to modify the following variable in the main Makefile:

export FOLDER_HOME = /home/phd_hpc/fboillod/double_prec

that is, the path to the main folder of the project. Note that the paths to the compiler and to the CUDA installation folder must also be adapted. The main difficulty with this architecture concerned the languages: the Makefile for main.cu, for example, has to mix several languages (C++ and CUDA); the compilation scheme and the language-mixing scheme are not reproduced here. The project is currently available on the gpu4u and fermi computers. We now study the various functions used to implement the conjugate gradient.

7.2 Conjugate gradient with CUDA-CPU

First we wanted to implement the conjugate gradient exclusively on the GPU. For lack of time I could not finish that code, as it was not stable enough, so I decided to implement CG using both the CPU and the GPU. All the functions presented below are in the file libcgcuda.cu. The CG function relies on a matrix/vector product performed on the GPU. We also implemented two versions of the scalar product: one running exclusively on the GPU, and one where the element-wise products are computed on the GPU while the final summation is done on the CPU. We also implemented the kernel

__global__ void GPU_addvec(double *a, double alpha, double *b, double beta, const int N)

which performs the operation a[] = alpha*a[] + beta*b[] entirely on the GPU. A sketch of this kernel and of the hybrid scalar product is given below.
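The report gives only the signature of GPU_addvec and a description of the hybrid scalar product, so the following is a plausible sketch of both, compiled with nvcc as in libcgcuda.cu. It is written under our own assumptions: the kernel body, the name GPU_mulvec, the block size of 256 threads and the host-side summation are illustrative, not the project code.

    /* a[] = alpha*a[] + beta*b[], entirely on the GPU: one thread per element */
    __global__ void GPU_addvec(double *a, double alpha, double *b, double beta, const int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)                    /* guard: the grid may be larger than N */
            a[i] = alpha * a[i] + beta * b[i];
    }

    /* hybrid scalar product, step 1: element-wise products on the GPU */
    __global__ void GPU_mulvec(const double *a, const double *b, double *c, const int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) c[i] = a[i] * b[i];
    }

    /* hybrid scalar product, step 2: copy back and sum on the CPU */
    double hybrid_dot(const double *a_d, const double *b_d, double *c_d, double *c_h, int N)
    {
        int threads = 256;
        int blocks  = (N + threads - 1) / threads;
        GPU_mulvec<<<blocks, threads>>>(a_d, b_d, c_d, N);
        cudaMemcpy(c_h, c_d, N * sizeof(double), cudaMemcpyDeviceToHost);
        double s = 0.0;
        for (int i = 0; i < N; i++) s += c_h[i];
        return s;
    }

GPU_addvec would be launched the same way, for example GPU_addvec<<<blocks, threads>>>(a_d, alpha, b_d, beta, N);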
The main difficulty in implementing this code was debugging it, since host functions cannot be called from GPU code.

7.3 Conjugate gradient with CUBLAS

The CUBLAS version of the conjugate gradient was easier to implement, because it can be debugged more quickly: the function cublasGetVector(...) copies a device vector back into a host vector, and the CUBLAS functions are generally easy to handle. To implement the conjugate gradient function we use the following CUBLAS calls:

cublasDcopy (n, b_d, 1, h_d, 1);
    copies the vector b into h.

cublasDgemv ('N', n, n, 1.0, A_d, n, h_d, 1, 0.0, temp_Ah, 1);
    performs the matrix/vector operation temp_Ah = 1.0 * A*h + 0.0 * temp_Ah.

cublasDscal (n, v, h_d, 1);
    scales the vector h by the scalar v: h = v * h.

cublasDaxpy (n, -1.0, g_d, 1, h_d, 1);
    performs the operation h = -1.0 * g + h.

norm_cub = cublasDdot (n, x_d, 1, x_d, 1);
    computes the dot product of x with itself, that is, the squared norm of x.

The function we created has the following prototype:

void GPU_CG_CUBLAS (double *A_d, double *x_d, double *b_d, double *vecnul_d, int n, int LOOP, double epsilon, const int nBytes, int *its_cublas, dim3 dimGrid)

A_d is the matrix (on the device), b_d the right-hand side and x_d the initial solution. vecnul_d is a device vector containing only zero values, used because CUBLAS vectors are not initialised when they are created. n is the size of the vectors (and of the square matrix). LOOP is the maximum number of iterations we allow, and epsilon the tolerance at which the iteration is stopped (0.0001). its_cublas returns the number of iterations at convergence; the remaining parameters are used for memory handling. A sketch of the iteration assembled from these calls is given below.
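The report lists the CUBLAS building blocks but not the loop that assembles them, so here is a minimal sketch of one way to write the iteration from exactly those calls, using the legacy CUBLAS v1 API of 2010/2011. It is our reconstruction, not the project code: the parameter list is simplified (vecnul_d, nBytes and dimGrid are dropped), cublasInit() is assumed to have been called by the caller, and the project's update order may differ. Following the text, g_d plays the role of the gradient (residual) and h_d of the search direction.

    #include <cublas.h>

    void GPU_CG_CUBLAS(double *A_d, double *x_d, double *b_d, int n,
                       int LOOP, double epsilon, int *its_cublas)
    {
        double *g_d, *h_d, *Ah_d;             /* device workspace vectors */
        cublasAlloc(n, sizeof(double), (void**)&g_d);
        cublasAlloc(n, sizeof(double), (void**)&h_d);
        cublasAlloc(n, sizeof(double), (void**)&Ah_d);

        cublasDcopy(n, b_d, 1, g_d, 1);
        cublasDgemv('N', n, n, 1.0, A_d, n, x_d, 1, -1.0, g_d, 1);  /* g = A*x - b */
        cublasDcopy(n, g_d, 1, h_d, 1);
        cublasDscal(n, -1.0, h_d, 1);                               /* h = -g      */

        double gg = cublasDdot(n, g_d, 1, g_d, 1);                  /* ||g||^2     */
        int k;
        for (k = 0; k < LOOP && gg > epsilon * epsilon; k++) {
            cublasDgemv('N', n, n, 1.0, A_d, n, h_d, 1, 0.0, Ah_d, 1); /* Ah = A*h */
            double rho = gg / cublasDdot(n, h_d, 1, Ah_d, 1);          /* step size */
            cublasDaxpy(n,  rho, h_d,  1, x_d, 1);                     /* x = x + rho*h    */
            cublasDaxpy(n,  rho, Ah_d, 1, g_d, 1);                     /* g = g + rho*Ah   */
            double gg_new = cublasDdot(n, g_d, 1, g_d, 1);
            cublasDscal(n, gg_new / gg, h_d, 1);                       /* h = gamma*h      */
            cublasDaxpy(n, -1.0, g_d, 1, h_d, 1);                      /* h = -g + gamma*h */
            gg = gg_new;
        }
        *its_cublas = k;

        cublasFree(g_d); cublasFree(h_d); cublasFree(Ah_d);
    }

The stopping test on ||g||^2 > epsilon^2 corresponds to the fixed tolerance epsilon = 0.0001 mentioned above.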
7.4 Results

We ran the previous code on two computers: fermi (GeForce GTX 480) and gpu4u (GeForce GTX 280). (The two results charts, Fermi and Gpu4u, are not reproduced here.) The first chart shows the CUBLAS Gflop rates for each graphics card used; the second shows the Gflop/sec rate of our CUDA function for each graphics card. We can conclude that our CUDA code is roughly twice as fast on the fermi computer. For the CUBLAS code, the performance gain is less satisfying. Having examined all the codes we implemented to solve our linear system with the conjugate gradient, we now conclude on this project.

Conclusion

During my previous placement at CEA, we had already studied the PETSc library and its impact on the MINOS solver. We concluded that PETSc gave better convergence behaviour, but that in terms of computation time it was much slower. The issue this time was to study the conjugate gradient using the graphics card: we wanted to use PETSc, with which we had worked before, but also to program the conjugate gradient directly in CUDA and CUBLAS. We conclude that CUBLAS offers better performance than CUDA; however, the CUDA code profits much more from Fermi-generation graphics cards than the CUBLAS code does. Regarding PETSc, we observe a very good performance gain for the multigrid objects when the GPU is used; the gain in computation time of the PETSc code is, however, not as significant as the gain in Megaflop rate (with versus without GPU). The next step is to study our KSP code on the GPU; to do so, the PETSc development version must be reinstalled. We also developed a matrix interface which can load any matrix available on the Matrix Market website; in the future the program will be improved and this interface will be used.

Bibliography
Thesis of Pierre Guérin
Thesis of Berlier
PETSc user manual
CUDA user manual
CUBLAS user manual

Webography
IDRIS website: http://www.idris.fr/data/cours/parallel/mpi/choix_doc.html
PETSc website: http://www.mcs.anl.gov/petsc/petsc-as/
