Large-scale time parallelization for molecular dynamics problems

JOHANNES BULIN

Master's Thesis in Scientific Computing (30 ECTS credits)
Master Programme in Scientific Computing (120 credits)
Royal Institute of Technology, Stockholm, Sweden, 2013
Supervisor at KTH: Michael Schliephake
Examiner: Michael Hanke
TRITA-MAT-E 2013:44
ISRN-KTH/MAT/E--13/44--SE
Royal Institute of Technology, School of Engineering Sciences
KTH SCI, SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Abstract

As modern supercomputers draw their power from the sheer number of cores, an efficient parallelization of programs is crucial for achieving good performance. When one tries to solve differential equations in parallel, this is usually done by parallelizing the computation of one single time step. As the speedup of such parallelization schemes is usually limited, e.g. by the spatial size of the problem, additional parallelization in time may be useful to achieve better scalability.

This thesis will introduce two well-known schemes for time parallelization, namely the waveform relaxation method and the parareal algorithm. These methods are then applied to a molecular dynamics problem, which is a useful test example since the number of required time steps is high while the number of unknowns is relatively low. Afterwards it is investigated how these methods can be adapted to large-scale computations.

Referat (Swedish abstract, translated)

Large-scale time parallelization for molecular dynamics

Modern supercomputers use a large number of processors to achieve high performance. It is therefore necessary to parallelize programs in an efficient way. When solving differential equations, one usually parallelizes the computation of a single time step. The speedup of such programs is often limited, for example by the size of the problem. By additionally parallelizing in time, better scalability can be achieved.
This thesis presents two well-known algorithms for time parallelization: waveform relaxation and parareal. These methods are used to solve a molecular dynamics problem where the time domain is large compared to the number of unknowns. Finally, some improvements that enable large-scale computations are investigated.

Definitions and abbreviations

• ODE – ordinary differential equation
• PDE – partial differential equation
• WR – waveform relaxation
• MD – molecular dynamics
• diag(A) denotes the diagonal matrix that has the same entries on the diagonal as A.
• ẏ denotes the first derivative in time of the function y, and ÿ the second derivative in time.
• P always denotes the number of available processors. When algorithms are discussed, p_0, ..., p_(P−1) are used to identify each of the P processors.
• C_F denotes the cost of an evaluation of the function F, i.e. the required flops or the time on a specified system.
• J_f(y, t) denotes the Jacobian matrix of a given function f(y, t).

Contents

1 Introduction
2 Time parallelization methods
  2.1 Waveform relaxation method
    2.1.1 Algorithm
    2.1.2 Parallel waveform relaxation
    2.1.3 Numerical properties
    2.1.4 Comments
  2.2 Parareal algorithm
    2.2.1 Algorithm
    2.2.2 Numerical properties
  2.3 Solving the toyproblem in a time-parallel way
    2.3.1 Waveform relaxation
    2.3.2 Parareal algorithm
3 Introduction to molecular dynamics
  3.1 Force fields and potentials
  3.2 Time stepping
  3.3 The molecular dynamics problem
  3.4 Using time-parallel methods
    3.4.1 Waveform relaxation
    3.4.2 Parareal algorithm
4 Improved methods for the molecular dynamics problem
  4.1 Choice of the force splitting for the waveform relaxation
  4.2 Improving the parareal algorithm
    4.2.1 Using windowing
    4.2.2 Choice of the coarse operators
    4.2.3 A multilevel parareal algorithm
5 Evaluation
  5.1 Choice of the force splitting for the waveform relaxation
  5.2 Improving the parareal algorithm
    5.2.1 Using windowing
    5.2.2 Choice of the coarse operators
    5.2.3 A multilevel parareal algorithm
6 Conclusions
Bibliography

Chapter 1

Introduction

Due to the structure of modern (super-)computers, parallelization of algorithms has become important in order to solve certain problems faster. As big compute clusters can easily have several thousand cores, it is important to write programs that can actually use this large number of processors in an efficient way. In this thesis the usefulness of parallelization in time for time-dependent differential equations stemming from molecular dynamics will be investigated. The main focus is on the behavior of these time-parallel methods for a large number of cores.
Parallelization means that one tries to adapt a program in such a way that several processors cooperate in order to solve a given problem. Reasons to parallelize a program can be limited memory on a single machine or the desire to solve problems faster. In this thesis the focus is on the latter, i.e. how to make programs faster by using more processors. Before we start we need to define two elementary terms:

Definition 1. For a given problem let T* be the time that the fastest known serial solver (i.e. on one processor) needs to solve the problem. Let furthermore T_P be the time that the currently investigated algorithm needs to solve the problem using P processors. Then the speedup S_P is defined as

    S_P = T*/T_P    (1.1)

and the efficiency E_P is given by

    E_P = S_P/P = T*/(P·T_P).    (1.2)

A lot of different algorithms can be accelerated by using parallelization, but with very different results. One reason for this is that introducing parallelism usually requires some kind of communication between the processors. A very simple example of this is the calculation of the sum of a large array a_1, ..., a_n in parallel (see figure 1.1): usually, the array is split into several parts and each part is assigned to a single processor. Now, every processor calculates the sum of its assigned subarray. Finally, these subtotals have to be merged together in order to get the final sum of the whole array. To do this, the processors must exchange their local sums with each other.

The big problem is that communication tends to be very slow. If the processors reside on different computers, then the data that has to be exchanged must travel through some kind of network, which is slow compared to the computational speed of a processor. Furthermore, each communication has a startup time (latency) before the actual data exchange starts. Therefore it is usually better to send large chunks of data at once instead of sending a lot of smaller data packages.
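The local-sum-then-merge pattern described above can be sketched in a few lines. The following is a schematic (serial) Python illustration, not actual message passing; the helper name `parallel_sum` is chosen here for illustration:

```python
def parallel_sum(a, P):
    """Schematic parallel sum: split a into P chunks, compute local sums
    (these could run on P processors independently), then merge them.
    The final merge is the step that requires communication."""
    n = len(a)
    chunk = (n + P - 1) // P  # ceil(n / P) elements per processor
    local_sums = [sum(a[p * chunk:(p + 1) * chunk]) for p in range(P)]
    return sum(local_sums)    # merge of the P subtotals

print(parallel_sum(list(range(1, 101)), 4))  # -> 5050
```

In a real implementation each local sum would be computed on its own processor and only the P subtotals (one small message each) would be exchanged, which is exactly why the merge step dominates once P grows.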
Figure 1.1. Calculating the sum of an array on two processors: p_0 computes the partial sum of a_1, ..., a_(n/2) and p_1 that of a_(n/2+1), ..., a_n; after one communication step both processors know the total sum.

It is also quite common that parallel algorithms have some so-called sequential parts that cannot be executed in parallel. These sequential parts limit the maximal possible speedup, as using more processors will not reduce their execution time. These factors combined cause a quite typical behavior of many parallel algorithms that is shown in figure 1.2: one can see that the growth rate of the speedup declines for a higher number of processors. Finally a point is reached where using additional processors does not make the program run any faster, but often even slower. The reason for this behavior is that the communication time is negligible in the beginning, as the workload per processor is high enough to hide the communication cost. When the number of processors increases, the computational workload per processor usually decreases. The communication time, however, sometimes even increases when more processors are added. Therefore the communication part of the algorithm becomes the predominant part of the runtime.

The sequential part of the algorithm does also limit the speedup. The famous Amdahl's law states that the speedup is bounded by

    S_P ≤ 1/(s + (1 − s)/P)    (1.3)

where s is the fraction of the algorithm that cannot be executed in parallel.

Figure 1.2. Example of a speedup curve (S_P plotted against P; the curve flattens out and peaks well before P = 500)

As this thesis will focus on the parallel solution of time-dependent differential equations, it is worth having a look at how these problems are parallelized. Probably the most common way to do so is to parallelize the computation of one single time step. If the differential equation has a spatial representation, as is the case for PDEs and some ODEs, then the problem can usually be decomposed in space in such a way that the resulting sub-problems can be solved efficiently in parallel.
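Amdahl's bound (1.3) is easy to evaluate numerically; the short sketch below shows how quickly the speedup saturates at 1/s even for a small serial fraction (the function name `amdahl_bound` is illustrative):

```python
def amdahl_bound(s, P):
    """Upper bound on the speedup S_P from Amdahl's law (eq. 1.3),
    where s is the serial fraction and P the number of processors."""
    return 1.0 / (s + (1.0 - s) / P)

# With a 5% serial part the speedup can never exceed 1/0.05 = 20,
# no matter how many processors are used:
for P in (10, 100, 1000):
    print(P, round(amdahl_bound(0.05, P), 2))
```

This reproduces the qualitative shape of figure 1.2: the bound grows almost linearly for small P and then flattens towards 1/s.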
Figure 1.3. Simple decomposition in space: the 2D domain Ω is split into sub-domains Ω_0, ..., Ω_5, and Ω_i is assigned to p_i.

Figure 1.3 shows a very simple decomposition of the 2D domain Ω into several sub-domains Ω_i which can be used to solve PDEs in parallel. First, each processor p_i is assigned a part Ω_i of the full domain. When one uses explicit time-stepping methods and simple differentiation stencils, then the solution in each Ω_i can be calculated using only the points in Ω_i and the boundary points of the neighbouring subdomains. Therefore it is sufficient to assign Ω_i to p_i and communicate the points near the boundary to the neighbouring processors in order to solve the problem.

If implicit methods have to be used, more complicated – and usually iterative – parallelization approaches are required. One example of these methods is parallel multi-grid methods[21], which solve the (non-)linear system of equations arising in each time step using a multi-grid approach. A different idea is used for Schwarz methods[12], which assign artificial boundary conditions to the sub-domains Ω_i in figure 1.3. This makes it possible to calculate the solution in these sub-domains in parallel. After each iteration data is exchanged and the boundary conditions are updated until the global solution has converged.

Unfortunately, parallelization in space cannot always be applied. Some properties of the problem, like a small spatial dimension compared to the number of processors, can limit the speedup similar to the situation in figure 1.2. Furthermore, it may actually be impossible to parallelize in space if the differential equation does not have a spatial representation, which is the case for some ODEs. An obvious alternative would be to use parallelism on the level of the basic numerical methods. If the time-stepping for example corresponds to the solution of a linear system Ax = b, this system can be solved using parallel linear equation solvers.
Otherwise one can exploit parallelism inside the time-stepping scheme itself, for example by using parallel Runge-Kutta methods or parallel prediction-correction algorithms[10]. The speedup that one can obtain with the two last-mentioned methods is usually quite small[10][32].

An approach to overcome these problems might be to use parallelization in time. This means that one tries to calculate the solution at several time steps at once, compared to other methods where usually all processors try to calculate the solution in a single time step. In some problems, the time domain is much "bigger" than the spatial domain, for example in molecular dynamics problems. For this kind of problem it is common to have a quite low number of unknown functions but several billion time steps. As soon as spatial or other kinds of parallelization do not yield good results anymore, it might be advantageous to parallelize in the time dimension. However, most problems are sequential in time, since the underlying physics is sequential in time, so it may be quite difficult to obtain reasonable speedups using this kind of parallelization. In this thesis I will try to apply time parallelization to two differential equations to see how efficient it actually is and whether there are ways to improve the performance.

Both differential equations can be written as ODEs of the type

    ẏ = f(y, t),    t ∈ [0, T]    (1.4)
    y(0) = y_0.

The first equation that will be solved is a simple linear toyproblem. The other one is a non-linear equation which is derived from molecular dynamics. The term molecular dynamics describes a group of mathematical simulations that are used to predict the movement of atoms or molecules and corresponding quantities like temperature. As already mentioned, these simulations require a lot of time steps while the usefulness of spatial parallelization may be limited due to an insufficient problem size.
Furthermore, these problems are usually non-linear and sensitive to small deviations during the solution. Therefore it is reasonable to use these kinds of equations as test problems instead of relying on relatively simple linear equations.

Time-parallel solvers have been investigated for at least 50 years[14] and have already been successfully applied to certain problems. Unfortunately there are very few papers available that investigate the scaling properties of time-parallel algorithms for a higher number of cores, i.e. more than 50 cores. The fastest supercomputer in the beginning of 2013, the TITAN at the Oak Ridge National Laboratory, has about 300,000 cores plus additional accelerators[3], so it would be interesting to know how time-parallel methods behave for at least one hundred cores. This thesis will therefore focus on the question of how time-parallel methods behave when such larger numbers of cores are used. To test the algorithms, the supercomputer Lindgren at PDC[2] was used and computational time was provided by the CRESTA project[1].

In the next chapter, two well-known time-parallel methods are presented. As it will turn out, these methods do not become faster as soon as a certain number of processors is used, even if the time domain is sufficiently large. Therefore, some work will be done in chapter 4 to find ways to allow scaling even after this threshold.

Chapter 2

Time parallelization methods

In the previous chapter time-parallel methods were motivated. This chapter will give a short overview of these methods and will explain two algorithms in more detail. The usual approach, i.e. using all available processors to calculate one single time step, will be called sequential time-stepping in this chapter. Time-parallel methods differ from these sequential schemes in that the solutions at several time steps are calculated in parallel.
Several different algorithms that can be called time-parallel exist, for example space-time-parallel multigrid methods. These methods have for example been used in [19] to solve parabolic equations by using the multigrid technique in space and time. This means that one does not use multigrid to compute the approximate solution y_i in one single time step by solving a linear equation A(y_i) = b. Instead, one calculates several time steps at once by adding the time as an additional dimension to the multigrid solving step. Thus, the space-time-parallel multigrid has to solve a bigger system of equations: Ã(y_i, ..., y_(i+k)) = B̃.

A different approach has been presented in [34]: here, implicit time-stepping is used and the resulting equations that have to be solved in each time step are solved using an iterative solver. The calculation of the next time step is started before the iterative solver has finally converged, which leads to a time-parallel algorithm.

For very special problem classes additional time-parallel methods exist, for example [29] for planetary movements, [20] for certain parabolic partial differential equations and [35] for molecular dynamics problems¹. These special algorithms can achieve quite good speedups, even for a higher number of processors. For general methods the results have usually been quite sobering: the maximal achievable speedup is usually quite small and only a handful of cores can be used usefully.

The algorithms that will be investigated here are the waveform relaxation method and the parareal algorithm. These two algorithms were chosen because they are quite general, i.e. they can theoretically be applied to almost every type of ODE, of course with varying success.

¹ This algorithm does not use a purely mathematical approach to solve these problems in parallel. Instead, a database with a lot of previous molecular dynamics simulations is used to predict the state in the next time steps.
This is important as the molecular dynamics problem that is presented in the next chapter lacks a lot of nice properties, like linearity or the possibility to easily create coarse-grid approximations. Furthermore, both methods can in theory converge on very large time intervals, which might be necessary as large-scale time parallelization should be achieved. These two methods shall be presented now.

In order to verify the theory and to show some features of the algorithms I will use a simple toyproblem. This first example is a linear ODE that has been derived from the 2D heat equation using finite differences. It has the form

    ẏ = Ay + f(t),    t ∈ [0, T], y ∈ R^(n²)    (2.1)
    y(0) = 0

where n = 30 and A is the block tridiagonal matrix

    A = (n+1)² · [ B  I                ]
                 [ I  B  I             ]
                 [    ·.  ·.  ·.       ]
                 [        I  B  I      ]
                 [           I  B      ]  ∈ R^(n²×n²).

Here, I denotes the identity matrix of size n × n and B is defined as

    B = [ −4   1                ]
        [  1  −4   1            ]
        [     ·.   ·.   ·.      ]
        [          1  −4   1    ]
        [              1  −4    ]  ∈ R^(n×n).

The right-hand side f is chosen as the time-dependent vector² with components

    f_i(t) = 10 sin(75πt) cos(6π·i/n²),    i = 1, ..., n².

² This right-hand side does not correspond to any real problem. It was simply chosen in such a way that the function values will neither become too large nor converge to zero when t becomes larger. Otherwise calculations on bigger time intervals would be meaningless.

2.1 Waveform relaxation method

The waveform relaxation method is an algorithm that can be used to solve ODEs in an iterative way. It was first published in 1982[24] as a method to solve problems related to integrated circuits. Sometimes it is not seen as a pure time-parallel method but as a method that exploits parallelism across the system[10]. It is presented as a time-parallel method here because it is still possible to calculate several time steps simultaneously in one iteration. Furthermore, it also allows ODEs that have no spatial representation to be solved in parallel.
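The matrices A and B and the right-hand side f of the toyproblem (2.1) can be assembled directly from their definitions. The following sketch builds them as dense nested lists; the function names are illustrative, and a small n can be used for experimentation instead of the n = 30 from the thesis:

```python
import math

def toy_matrix(n):
    """Dense N x N matrix A = (n+1)^2 * blocktridiag(I, B, I) from eq. (2.1),
    where B = tridiag(1, -4, 1) is n x n and N = n^2."""
    N = n * n
    s = (n + 1) ** 2
    A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        A[i][i] = -4.0 * s                  # diagonal entries of the B blocks
        if i % n != n - 1:
            A[i][i + 1] = 1.0 * s           # superdiagonal of B (stops at block edges)
        if i % n != 0:
            A[i][i - 1] = 1.0 * s           # subdiagonal of B
        if i + n < N:
            A[i][i + n] = 1.0 * s           # identity block above the diagonal
        if i >= n:
            A[i][i - n] = 1.0 * s           # identity block below the diagonal
    return A

def toy_rhs(t, n):
    """Right-hand side with components f_i(t) = 10 sin(75*pi*t) cos(6*pi*i/n^2)."""
    N = n * n
    return [10.0 * math.sin(75 * math.pi * t) * math.cos(6 * math.pi * i / N)
            for i in range(1, N + 1)]
```

The resulting A is symmetric and couples each grid point to its four neighbours, as expected for a 5-point finite-difference Laplacian.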
2.1.1 Algorithm

The idea of the WR algorithm is to approximate the solution of equation (1.4) in an iterative fashion. In each iteration k it calculates the approximate solution y^(k+1) by solving the modified problem

    ẏ^(k+1) = F(y^(k+1), y^(k), t),    t ∈ [0, T]    (2.2)
    y^(k+1)(0) = y_0

instead of equation (1.4). The right-hand side F(x, y, t) has to be chosen such that F(y, y, t) = f(y, t) ∀y. If this holds true, as well as a couple of other conditions that will be presented later, the iterates y^(k) converge towards the actual solution y* of equation (1.4).

In order to implement the WR algorithm on a computer, one has to discretize the time in the usual way, resulting in a number of discrete time points

    0 = t_0 < t_1 < · · · < t_m = T.

One also has to define the function values that should be approximated:

    y_i^(k) ≈ y^(k)(t_i).

Now one has to define the initial waveform y_i^(0) ∀i. The easiest approach when no additional information is available is to set

    y_i^(0) = y_0,

i.e. to choose a constant initial waveform. Then one can calculate the values y_i^(k+1) using the already known values y_i^(k) ∀i and y_j^(k+1), j < i. The whole WR method is also summed up in algorithm 1.

Algorithm 1 Waveform relaxation method for solving equation (1.4)
  Choose appropriate function F(x, y, t)
  for i = 0, 1, ..., m do
    y_i^(0) = y_0
  end for
  k ← 1
  while not converged do
    y_0^(k) = y_0
    for i = 1, ..., m do
      Calculate y_i^(k) by applying a time-stepping scheme to equation (2.2)
    end for
    k = k + 1
  end while

2.1.2 Parallel waveform relaxation

It is not completely obvious how the waveform relaxation can be parallelized – especially in a time-parallel fashion. The first step is similar to space-parallel methods: the components of y_i^(k) ∀i are assigned to different processors. Now one can choose the modified right-hand side F in such a way that each processor can calculate its part of y_i^(k+1) without requiring those components of y_i^(k+1) that are located on another processor. This can for example be achieved by choosing

    F(x, y, t) = (F(x, y, t)_1, F(x, y, t)_2, ..., F(x, y, t)_n)^T

with

    F(x, y, t)_i = f((y_1, ..., y_(i−1), x_i, y_(i+1), ..., y_n), t).

This results in the so-called Jacobi waveform relaxation. By using this decoupling, each processor can calculate its part of y_i^(k+1) on a part of the time domain (a so-called window) without communicating with other processors. After each iteration, i.e. when y_i^(k+1) has been calculated ∀i, all y_i^(k+1) have to be distributed to the other processors in order to calculate y_i^(k+2). Compared to normal space-parallel methods this has the advantage that the number of communications is lower, as data is exchanged once per iteration and not in every time step. This can be advantageous when the communication latency is prohibiting higher speedups. On the other hand, this method usually needs several iterations to converge, so it can easily be much slower than space-parallel methods when the number of iterations is too high.

Remark 1. The WR algorithm is usually not beneficial for explicit methods. As F must satisfy F(x, x, t) = f(x, t), each single iteration is at least as costly as solving the problem sequentially when one ignores communication costs. For implicit methods the WR method is more useful, as F can be chosen in such a way that solving the system of equations in each time step becomes cheaper. An example of that would be to choose F as

    F(x, y, t) = f(y) − diag(J_f(y))(x − y)

if f is non-linear. This replaces the solution of a non-linear equation at each time step by the solution of a linear system, which is much cheaper. This kind of force splitting is also called Waveform Newton.

2.1.3 Numerical properties

An important question is always whether a numerical method actually converges to the right solution.
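Before turning to the convergence theory, the serial WR loop with the Jacobi splitting can be made concrete. The sketch below uses forward Euler time-stepping on a small linear test system ẏ = My, which is an assumption for illustration (the function names are not from the thesis); for this explicit scheme each sweep makes one additional time step exact, so after enough iterations the WR iterates coincide with the serial forward Euler trajectory:

```python
def jacobi_wr(M, y0, dt, m, iters):
    """Jacobi waveform relaxation for y' = M y with forward Euler:
    in each sweep, component i is advanced using its own new values,
    while the off-diagonal coupling uses the previous iterate (the old
    waveform).  Starts from the constant initial waveform y_i^(0) = y0."""
    n = len(y0)
    old = [list(y0) for _ in range(m + 1)]
    for _ in range(iters):
        new = [list(y0)]
        for t in range(m):
            yt = new[t]
            new.append([yt[i] + dt * (M[i][i] * yt[i]
                        + sum(M[i][j] * old[t][j] for j in range(n) if j != i))
                        for i in range(n)])
        old = new
    return old

def serial_euler(M, y0, dt, m):
    """Reference: ordinary (serial) forward Euler for y' = M y."""
    n = len(y0)
    traj = [list(y0)]
    for t in range(m):
        yt = traj[t]
        traj.append([yt[i] + dt * sum(M[i][j] * yt[j] for j in range(n))
                     for i in range(n)])
    return traj
```

In a parallel implementation each component (or block of components) of the inner list comprehension would live on its own processor, and only the waveforms `old` would be exchanged once per sweep.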
It is also interesting to know how fast the algorithm converges in that case. It can actually be shown that the waveform relaxation method converges superlinearly to the actual solution. A theorem and a corresponding proof can be found for example in [8]; the theorem will briefly be recalled here:

Theorem 1. Let the function F(x, y, t) satisfy F(v, v, t) = f(v, t) ∀v, where f is the function from equation (1.4). Define also y* as the exact solution of the problem in equation (1.4). If F satisfies the Lipschitz condition

    ||F(y*, y*, t) − F(x̃, ỹ, t)|| ≤ K||y* − x̃|| + L||y* − ỹ||,    K, L ∈ R    (2.3)

for all t and x̃, ỹ ∈ {y^(k)(t) : k = 0, ..., ∞}, and if the differential equation (2.2) is solved exactly, then one can bound the error e^(k) = y* − y^(k) by

    sup_{0≤t≤T} ||e^(k)(t)|| ≤ (LT)^k/k! · e^(KT) sup_{0≤t≤T} ||e^(0)(t)||.    (2.4)

Usually it is of course not possible to solve equation (2.2) exactly; one uses numerical methods instead. Then it is necessary to know whether the solution of the WR algorithm converges towards the solution that one obtains by solving the original problem (1.4) using the same time-stepping. For some time-stepping methods it has been proved that this is the case, for example for forward and backward Euler[8]:

Theorem 2. Let Y_1, ..., Y_l be the solution at times t_1, ..., t_l that one obtains by using serial time-stepping with the forward Euler method. If the conditions from theorem 1 are satisfied, then the iterates y_1^(k), ..., y_l^(k) of the forward Euler WR algorithm converge superlinearly to Y_1, ..., Y_l. By defining e_i^(k) = Y_i − y_i^(k) one obtains the convergence rate

    sup_{i=1,...,l} ||e_i^(k)|| ≤ (∆t L/(1 + ∆t K))^k (l−1 choose k) (1 + ∆t K)^(l−1) sup_{i=1,...,l−k} ||e_i^(0)||.    (2.5)

Now let Z_1, ..., Z_l be the solution that one obtains by using serial time-stepping with the backward Euler method. One also has to define e_i^(k) as e_i^(k) = Z_i − y_i^(k) this time.
Assume that F fulfills the requirements in theorem 1 and that the step size in time ∆t also fulfills

    ∆t < 1/(K + L).

In that case the iterates y_1^(k), ..., y_l^(k) of the backward Euler WR algorithm will converge linearly to Z_1, ..., Z_l:

    sup_{i=1,...,l} ||e_i^(k)|| ≤ (∆t L/(1 − ∆t K))^k (l+k−1 choose k) (1 − ∆t K)^(1−l) sup_{i=1,...,l} ||e_i^(0)||.    (2.6)

Example 1. Recall the toy problem (2.1) with the function splitting

    F(x, y, t) = (A − diag(A))y + diag(A)x + b(t).    (2.7)

For this simple example it is easy to show that the necessary conditions for convergence according to theorem 1 are fulfilled:

    ||F(y*, y*, t) − F(x̃, ỹ, t)|| = ||diag(A)(y* − x̃) + (A − diag(A))(y* − ỹ)||
                                  ≤ ||diag(A)|| ||y* − x̃|| + ||A − diag(A)|| ||y* − ỹ||.

Therefore we get the Lipschitz constants K = ||diag(A)|| and L = ||A − diag(A)||. If we are using the maximum norm here we get K = 3600 and L = 3600. Using theorem 1, we can easily get an upper bound for the error:

    sup_{0≤t≤T} ||e^(k)(t)||_∞ ≤ (3600T)^k/k! · e^(3600T) sup_{0≤t≤T} ||e^(0)(t)||_∞.

In figures 2.1 and 2.2 the terms (3600T)^k/k! · e^(3600T) are plotted with respect to k for two different values of T. It is clearly visible that the convergence rate is highly dependent on the value of T in this example. Large values of T lead to a very slow convergence, as e^(3600T) will become very large when 3600T ≫ 1.

2.1.4 Comments

One result of theorem 1 and the observations in example 1 is that the WR algorithm may converge very slowly on large time intervals. Therefore one usually divides the time domain into several so-called windows, and the differential equation is solved on one window at a time. This does not only improve performance but also limits the amount of memory that is needed, as all values y_i^(k−1) have to be saved in order to calculate y_i^(k). It is also important to know how one should define a stopping criterion.
The stopping criterion that is used in my WR implementation is

    ||y_i^(k)(t) − y_i^(k−1)(t)|| ≤ ε,    ∀t ∈ [0, T], ∀i.    (2.8)

In the continuous case this choice can be motivated by the following theorem:

Theorem 3. Assume that the assumptions in theorem 1 are fulfilled and that all y^(k) are continuous. If the inequality in (2.8) holds true, then we can bound the error e^(k) = y* − y^(k) by

    ||e^(k)(t)|| ≤ Ltε e^(t(K+L)).

Proof.

    ||e^(k)(t)|| = || y_0 + ∫_0^t F(y*, y*, s) ds − y_0 − ∫_0^t F(y^(k), y^(k−1), s) ds ||
                ≤ ∫_0^t || F(y*, y*, s) − F(y^(k), y^(k−1), s) || ds
                ≤ ∫_0^t K||e^(k)(s)|| + L||e^(k−1)(s)|| ds
                = ∫_0^t K||e^(k)(s)|| + L||e^(k−1)(s) + e^(k)(s) − e^(k)(s)|| ds
                ≤ ∫_0^t K||e^(k)(s)|| + L(||e^(k)(s)|| + ||y^(k)(s) − y^(k−1)(s)||) ds
                ≤ Ltε + ∫_0^t (K + L)||e^(k)(s)|| ds.

Applying Grönwall's lemma to this finally gives us

    ||e^(k)(t)|| ≤ Ltε e^(t(K+L)).

Figure 2.1. Evolution of (3600T)^k/k! · e^(3600T) for T = 10^−2

Figure 2.2. Evolution of (3600T)^k/k! · e^(3600T) for T = 10^−3

2.2 Parareal algorithm

The parareal algorithm is a method that can be used to solve more or less arbitrary differential equations in a time-parallel fashion. Originally presented by [26] in 2001, it "has received a lot of attention over the past few years"[14, p. 556] and "has become quite popular among people involved in domain decomposition methods"[4, p. 425].

2.2.1 Algorithm

The parareal algorithm can be motivated in several different ways, for example as a multiple shooting method or as a multigrid method[14]. The basic idea is always to divide the time domain [0, T] into several blocks

    T_0 = [t_0, t_1], T_1 = [t_1, t_2], ..., T_n = [t_n, t_(n+1)]

with

    0 = t_0 < t_1 < · · · < t_(n+1) = T.

Here we also assume that t_(i+1) − t_i = ∆T is constant ∀i.
Then the parareal algorithm tries, in each iteration k, to approximate the initial values U_j^k ≈ y(t_j) of all time blocks T_j. To do this, a coarse and a fine time-stepping scheme are needed. The fine time-stepping F should yield more accurate results than the coarse time-stepping G, either by using smaller time steps or higher order methods. F(U_j^k) and G(U_j^k) denote the numerical solution of

    ẏ = f(y, t),    t ∈ [t_j, t_(j+1)]
    y(t_j) = U_j^k

at time t_(j+1), using the corresponding time-stepping scheme.

Figure 2.3. Parareal as a multiple-shooting method: artificial initial values U_0, ..., U_3 on the blocks T_0, ..., T_3 lead to jumps in the solution at the block borders.

One way to motivate the parareal algorithm is to see it as a multiple shooting method, as shown in [14]: one can divide the time domain as explained above and guess initial values U_i for each time block. By using these artificial initial values, one can solve the differential equations inside each time block using the F propagator in parallel. The problem here are the jumps in the solution at the block borders. Figure 2.3 shows this behaviour in the case that y in equation (1.4) is one-dimensional. One can eliminate these jumps by enforcing that

    N(U) := [ U_0 − y_0
              U_1 − F(U_0)
              U_2 − F(U_1)
              ...
              U_n − F(U_(n−1)) ] = 0    (2.9)

which would require the solution of a nonlinear system of equations. This could be done by using standard methods for nonlinear equations like Newton methods. Applying the Newton method to this system yields

    U^(k+1) = U^k − J_N(U^k)^(−1) N(U^k)

where J_N(U^k) denotes the Jacobian matrix of N at U^k. Due to the special form of equation (2.9) the term J_N(U^k)^(−1) N(U^k) can be calculated explicitly and we get

    U_(j+1)^(k+1) = F(U_j^k) + (∂F/∂U_j)(U_j^k)(U_j^(k+1) − U_j^k).    (2.10)

Using Taylor expansion we obtain

    (∂F/∂U_j)(U_j^k)(U_j^(k+1) − U_j^k) ≈ F(U_j^(k+1)) − F(U_j^k)

which would result in the normal serial time-stepping when we plug this into equation (2.10).
So one uses instead

    (∂F/∂U_j)(U_j^k)(U_j^(k+1) − U_j^k) ≈ F(U_j^(k+1)) − F(U_j^k) ≈ G(U_j^(k+1)) − G(U_j^k),

which gives us the update for the initial values in each parareal iteration:

    U_j^(k+1) = F(U_(j−1)^k) − G(U_(j−1)^k) + G(U_(j−1)^(k+1)).    (2.11)

The initial values U_i^0 are usually chosen as U_i^0 = G(U_(i−1)^0), i ≠ 0. Even though this initialization is purely sequential, it will not introduce additional costs to the algorithm compared to a simple initialization like U_i^0 ≡ y_0, ∀i (see [5]). By using this we can finalize the update formula as

    U_0^0 = y_0,    U_i^0 = G(U_(i−1)^0)
    U_0^(k+1) = y_0,    U_j^(k+1) = F(U_(j−1)^k) − G(U_(j−1)^k) + G(U_(j−1)^(k+1)).    (2.12)

Compared to normal sequential time-stepping we have the advantage that all F(U_(j−1)^k) can be calculated independently from each other, which makes the parallelization of this operation very simple. This update formula leads to the parareal method in algorithm 2.

Algorithm 2 Parareal framework for solving equation (1.4)
  Set U_0^0 = y_0
  for all processors p_i in parallel do
    p = i
    if i ≠ 0 then
      Receive U_p^0 from processor p_(i−1)
    end if
    g = G(U_p^0)
    U_(p+1)^0 = g
    if p ≠ P − 1 then
      Send U_(p+1)^0 to processor p_(i+1)
    end if
    while not converged do
      U_(p+1)^(k+1) = F(U_p^k)
      if p ≠ 0 and p − 1 has not converged then
        U_(p+1)^(k+1) = U_(p+1)^(k+1) − g
        Receive U_p^(k+1) from processor p_(i−1)
        g = G(U_p^(k+1))
        U_(p+1)^(k+1) = U_(p+1)^(k+1) + g
      end if
      if p ≠ P − 1 then
        Send U_(p+1)^(k+1) to processor p_(i+1)
      end if
    end while
  end for

Remark 2. As for the WR algorithm, we also need a stopping criterion for the parareal algorithm. According to [27], stopping criteria for the parareal algorithm have not been subject to a lot of research. A very simple and common stopping criterion that will also be used here is to stop the calculation as soon as

    ||U_j^k − U_j^(k−1)|| < ε,    ∀j.    (2.13)

2.2.2 Numerical properties

Convergence

It is trivial to show that the initial values U_j^k of time block T_j in iteration k will equal the solution that is obtained by using F in a serial fashion if k ≥ j.
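This finite-convergence property, together with the update formula (2.12), can be checked with a small sketch. The scalar test ODE ẏ = −y and the Euler propagators below are assumptions for illustration, not part of the thesis:

```python
def make_euler(lam, h, steps):
    """Propagator over one block [t_j, t_{j+1}]: 'steps' forward Euler
    steps of size h for the scalar test ODE y' = lam * y."""
    def prop(y):
        for _ in range(steps):
            y = y + h * lam * y
        return y
    return prop

def parareal(fine, coarse, y0, n_blocks, n_iter):
    """Parareal update (2.12): coarse serial initialization, then
    U_j^{k+1} = F(U_{j-1}^k) - G(U_{j-1}^k) + G(U_{j-1}^{k+1})."""
    U = [y0]
    for j in range(n_blocks):                         # initialization with G
        U.append(coarse(U[j]))
    for _ in range(n_iter):
        F_old = [fine(U[j]) for j in range(n_blocks)]    # parallel over j
        G_old = [coarse(U[j]) for j in range(n_blocks)]
        U_new = [y0]
        for j in range(n_blocks):                     # sequential coarse sweep
            U_new.append(F_old[j] - G_old[j] + coarse(U_new[j]))
        U = U_new
    return U

DT = 0.1
fine = make_euler(-1.0, DT / 50, 50)   # accurate propagator F
coarse = make_euler(-1.0, DT, 1)       # one cheap Euler step as G
```

After n_blocks iterations the parareal values reproduce the serial application of F, confirming that the method always converges; a useful speedup, however, requires convergence in far fewer iterations.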
Therefore we know that the parareal algorithm will always converge. In order to obtain a speedup bigger than 1, we have to show that the parareal algorithm can converge in fewer than P iterations. One theorem that also covers non-linear right-hand sides f can be found in [16] and [15] and is briefly repeated here:

Theorem 4. Assume that F solves the underlying differential equation (1.4) exactly. Assume also that the local truncation error of G is bounded by C_1 ∆T^{p+1} when ∆T is small enough. If G also satisfies the Lipschitz condition

||G(x) − G(y)|| ≤ (1 + C_2 ∆T)||x − y||,

then we get the error bound

sup_n ||y*(t_n) − U_n^k|| ≤ ((C_1 T)^k / k!) e^{C_2 (T − (k+1)∆T)} ∆T^{pk} sup_n ||y*(t_n) − U_n^0||. (2.14)

Not much is known about the convergence of the parareal algorithm in the discrete case, i.e. when F is not the exact solution but the output of a numerical method. For linear one-dimensional problems the convergence can be studied as shown in [14]. For higher-dimensional linear systems or non-linear differential equations no general convergence results exist, as far as I know. Therefore the convergence in the discrete case must be evaluated numerically here.

Parallel efficiency

Now we will investigate which parallel efficiency we can achieve when we try to solve an ODE on the time interval [0, P ∆T]: Looking at equation (2.11), we see that each iteration of the parareal framework requires one evaluation of F and one of G^3. We also have to note that each U_j^k requires the calculation of G(U_{j−1}^k). This serial dependence introduces an additional cost of P C_G to the parareal algorithm. Assume that the cost C_G of the evaluation of G can be written as C_G = a C_F, where C_F is the cost of F and a ≪ 1. Ignoring communication costs and defining K as the maximum number of iterations, we get the theoretical maximal speedup of [5]

S_P ≤ P C_F / (K(1 + a)C_F + P a C_F) = P / (K(1 + a) + P a) = 1 / ((K/P)(1 + a) + a) ≤ 1/a. (2.15)

This means that the efficiency is always lower than

E_P ≤ 1 / (K + (K + P)a). (2.16)

This result is discouraging when one wants to use the parareal framework on a lot of cores. The speedup is bounded by 1/a and is probably even worse, as we usually also have K ≥ 2 unless we use a very accurate coarse solver G. Additionally, K will probably increase when the interval size P ∆T increases. This means that the parareal algorithm in this generic form is likely to be unsuitable for large-scale applications.

^3 In the given equation G is evaluated two times, but one can save one evaluation as G(U_{j−1}^k) has already been calculated in the previous iteration.

2.3 Solving the toy problem in a time-parallel way

Now these two algorithms will be used to solve the toy problem in equation (2.1). One usually applies implicit solvers to this problem, as stability is an issue here. The reference solution is calculated using the implicit midpoint rule with step size ∆t = 10^{−3} in a sequential fashion. As the problem size is very small (900 unknowns), parallelization of one single time step is probably not very efficient in this case. To verify this, the linear system of equations that has to be solved in each time step was solved in parallel using a parallel solver from the PETSc toolkit [7]. Figure 2.4 shows that the program becomes only marginally faster when using more cores^4. To check the accuracy of the time-parallel methods, the approximate solution s_end at the final time was compared to the reference solution r_end. Then the relative error

||r_end − s_end||_∞ / ||r_end||_∞ (2.17)

was investigated.

Figure 2.4.
Achieved speedup using a parallel linear equation solver on the time interval [0, 100]

^4 The bad performance on very few cores compared to the solution on one single core can be explained by the way PETSc solves linear systems. PETSc uses Krylov methods by default. If more than one core is available, PETSc uses a different preconditioner, which leads to slightly slower convergence in this case.

2.3.1 Waveform relaxation

In example 1 we already predicted that the convergence of the WR algorithm will be very slow if the length of one window is too large. Here we solve the equation on the time interval [0, 10] using several different window sizes. The time-stepping scheme used for solving the modified problems in equation (2.2) is the same as for the reference solution, i.e. the implicit midpoint rule with step size ∆t = 10^{−3}. The modified right-hand side F(x, y, t) is the same as in equation (2.7). As a stopping criterion,

||y_i^{(k)} − y_i^{(k−1)}||_∞ ≤ 10^{−5} ∀i

was used. With this stopping criterion, the relative error

||w − s||_∞ / ||s||_∞

was always lower than 5 · 10^{−3}. Here s stands for the reference solution at time T and w for the approximate solution obtained with the WR method. Table 2.1 shows the average number of iterations per window that is necessary to fulfill this stopping criterion. We can see that the necessary number of iterations grows very fast for window lengths > 10^{−3}. The motivation for using the WR algorithm was to lower the number of necessary communications, which would be advantageous if the communication latency limits the speedup. In this case it unfortunately does not work. In order to perform fewer communications, the number of iterations should be much lower than the number of discrete time points in the window.
Otherwise one needs only one communication per iteration, but due to the high number of iterations one may require more communications in total than in the case where only one single time step is parallelized.

window size   average number of iterations per window
1 · 10^{−3}   11.4
4 · 10^{−3}   25.0
10 · 10^{−3}  48.7

Table 2.1. Average number of iterations for different window sizes

Now we can have a look at the speedup in the case where the window size is 10 · 10^{−3}. Figure 2.5 shows the speedup of the WR algorithm with respect to the sequential reference solver, which is extremely low in this case. The reason is of course the high number of iterations shown in table 2.1. On the other hand, we can show the positive effect of sending data only once per iteration: Figure 2.6 shows the speedup of the WR algorithm when the WR algorithm on one processor is used as the reference solver. The achieved speedup is much better than in the sequential case (figure 2.4), which probably means that the influence of the communication latency became smaller.

Figure 2.5. Achieved speedup with respect to sequential time-stepping using the WR algorithm and window size 10 · 10^{−3}

Finally, one can say that the basic WR algorithm is unsuitable for this kind of problem. This was already shown in example 1 and has also been noted in several papers [20][22]. Some ideas have been proposed to improve the convergence of WR methods for parabolic problems, for example by using a multigrid approach [20] or by employing successive overrelaxation [22].

2.3.2 Parareal algorithm

For the parareal algorithm we choose the same solver for F as in the sequential case, i.e. the implicit midpoint rule with step size ∆t = 10^{−3}. As coarse propagator G the implicit Euler method was chosen, once with ∆t = 10^{−2} and once with ∆t = 10^{−1}.
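To make the interplay of F, G and the update formula (2.12) concrete, here is a minimal serial sketch of one parareal run. The names (parareal, f_solver, g_solver) are illustrative, not from any library; in the real algorithm the fine propagations would run in parallel, one per processor.

```python
import numpy as np

def parareal(f_solver, g_solver, y0, n_blocks, n_iter):
    """Serial sketch of the parareal iteration (2.12).

    f_solver(u) and g_solver(u) propagate an initial value u across one
    time block T_j with the fine and coarse scheme, respectively.
    """
    # Coarse sequential initialization: U_0^0 = y0, U_i^0 = G(U_{i-1}^0)
    U = [np.asarray(y0, dtype=float)]
    for j in range(n_blocks):
        U.append(g_solver(U[-1]))

    for k in range(n_iter):
        # The fine propagations F(U_{j-1}^k) are mutually independent and
        # would be computed in parallel, one per processor.
        F_vals = [f_solver(U[j]) for j in range(n_blocks)]
        U_new = [U[0]]                       # U_0^{k+1} = y0
        for j in range(1, n_blocks + 1):
            # U_j^{k+1} = F(U_{j-1}^k) - G(U_{j-1}^k) + G(U_{j-1}^{k+1})
            U_new.append(F_vals[j - 1] - g_solver(U[j - 1])
                         + g_solver(U_new[j - 1]))
        U = U_new
    return U
```

A property visible in this sketch is the one used above for convergence: after k iterations, the values U_j with j ≤ k coincide with the serial fine solution.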
In this case we get C_G ≈ (1/8) C_F when the step size of G is 10^{−2} and C_G ≈ (1/50) C_F for a step size of 10^{−1}.^5 It is also easy to show that the requirements of theorem 4 are fulfilled, due to the linearity of the used solvers. This time we choose the large time interval [0, 100] and look at the speedup in figure 2.7. In the case where G uses the step size ∆t = 10^{−2} we obtain a quite limited speedup, which is easily explained by the upper bound for the speedup that was obtained in equation (2.15). This equation limits the speedup S_P by

S_P ≤ 1/a ≈ 1/(1/8) = 8.

If we use G with step size 10^{−1} we may get a bigger speedup according to equation (2.15) (S_P ≤ 50). We can see that this holds true by looking at figure 2.8. It is worth noting that the parareal algorithm always converges after two iterations for this simple toy problem, seemingly independently of the step size of the coarse solver. Furthermore, the accuracy of the results is usually very good, with a relative error around 10^{−4}. For general problems the number of iterations will probably increase when a less accurate coarse solver G is used, but this does not seem to be the case for this toy problem.

^5 To solve the linear systems in each time step, linear equation solvers from the PETSc toolkit were used. These solvers are not direct solvers but Krylov methods. It seems that the speed of these solvers implicitly depends on the step size of the time-stepping scheme. Therefore using a 10 times bigger step size does not make the solver 10 times faster. Forcing PETSc to use direct solvers makes the computation much slower.

Figure 2.6. Achieved speedup with respect to the WR algorithm on one processor using the WR algorithm and window size 10 · 10^{−3}

Figure 2.7.
Achieved speedup for the parareal algorithm with coarse time step 10^{−2}

Figure 2.8. Achieved speedup for the parareal algorithm with coarse time step 10^{−1}

Chapter 3

Introduction to molecular dynamics

Molecular dynamics is the term for a class of numerical simulations that try to predict the motion of “particles” and corresponding physical properties like the temperature or the structure of molecules. The particles in these calculations can be

• basic particles like protons, neutrons and electrons

• whole atoms or atom groups

• bigger molecules or groups of molecules, for example proteins

• a combination of the three groups above, where certain movements are simulated on a very fine scale while others are calculated on a coarse level.

Here I will briefly present the so-called classical MD. This basically means that the movement of the particles is derived from Newton's laws of motion and not from quantum mechanics. The basic idea behind MD is usually the same, independently of the used scale: One tries to simulate the movement of some particles ρ_1, …, ρ_n, where each particle ρ_i has the coordinates x_i(t), y_i(t) (in the 2D case). The basic equation of molecular dynamics is then

M d̈ = f(d, t), t ∈ [0, T],
d(0) = d_0,
ḋ(0) = v_0, (3.1)

where d : R → R^{2n} is defined as d(t) = (x_1(t), y_1(t), x_2(t), y_2(t), …, x_n(t), y_n(t))^T. M is in this case a mass matrix which determines the mass of each particle ρ_i. The function f(d, t) describes the interaction of the particles with each other as well as additional external forces. In section 3.1 it will be explained how this function f can be chosen.

Equation (3.1) can be rewritten in the form of the basic equation (1.4) by introducing the velocities v(t) = ḋ(t). Then one obtains

ẏ = (ḋ, v̇)^T = (v, M^{−1} f(d, t))^T =: g(y), t ∈ [0, T],
d(0) = d_0,
v(0) = ḋ_0, (3.2)

where v describes the velocities of the particles. This allows us to apply the theory from chapter 2 to these equations.

3.1 Force fields and potentials

The big question is now how one should choose the right-hand side function f(d, t). In theory one could solve MD problems with arbitrary precision by using Schrödinger's equation [25]. This is impossible unless one has very special (and small) problems, as solving Schrödinger's equation is very expensive. Instead, a lot of approximations have been developed which are able to model some properties of the molecular system. These approximations are usually sufficient to accurately approximate the solution for a certain class of problems. In classical MD, the right-hand side f(d, t) can usually be written as [23]

f(d, t) = −∇V(d) (3.3)

where V is called the potential function and ∇V the force field. V describes the potential energy of the system. By differentiating with respect to the particle positions, the actual force can be calculated. A lot of different force fields exist; some popular examples are the AMBER and CHARMM force fields, which are part of several important molecular dynamics software packages. One very simple potential that is often used for demonstration purposes is the Lennard-Jones potential, which can actually be used to simulate the behaviour of noble gases. In that case the potential function is computed by calculating and summing up all pairwise potentials

U(r_ij) = αε ((σ/r_ij)^{12} − (σ/r_ij)^6) (3.4)

where r_ij denotes the distance between the particles ρ_i and ρ_j, and α, ε and σ are constants that have to be adapted to the actual problem. The whole potential function V(d) is in this case

V(d) = Σ_{i=1}^n Σ_{j=i+1}^n U(||r_ij||), r_ij = (d_{2i−1}, d_{2i})^T − (d_{2j−1}, d_{2j})^T. (3.5)

Figure 3.1.
The function U(r_ij) with respect to r_ij, α = ε = σ = 1

By calculating −∇V one obtains the forces that act on each particle:

(f_{x_i}, f_{y_i})^T = Σ_{j=1, j≠i}^n 24ε (1/r_ij²) (σ^6/r_ij^6 − 2σ^{12}/r_ij^{12}) (x_j − x_i, y_j − y_i)^T. (3.6)

By looking at figure 3.1 or equation (3.6), one can see that particles that are far away from each other have almost no influence on each other. This also holds for a number of other potentials, which are therefore called short-range potentials. In this case it is common to ignore the interaction of two particles ρ_i and ρ_j if their distance is larger than a certain cutoff radius r_cut. Equation (3.6) then becomes

(f_{x_i}, f_{y_i})^T = Σ_{j=1, j≠i, r_ij<r_cut}^n 24ε (1/r_ij²) (σ^6/r_ij^6 − 2σ^{12}/r_ij^{12}) (x_j − x_i, y_j − y_i)^T. (3.7)

This cutoff radius can also be used to parallelize such MD problems in space. To calculate the force acting on a certain particle ρ_i, we no longer need all particles in the domain but only the particles located near ρ_i. Therefore one can divide the domain into several parts Ω_i as shown in figure 3.2. All particles inside one subdomain are assigned to one single processor. To calculate the forces for the particles, only particles close to the subdomain boundaries (in the gray area) have to be exchanged.

Figure 3.2. Spatial decomposition for short-range potentials

3.2 Time stepping

The standard methods for solving the molecular dynamics problem (3.1) are Verlet integrators [31]. The classical velocity Verlet method calculates the positions d_{i+1} of the particles and the corresponding velocities v_{i+1} at time t_{i+1} in the following way [30]:

d_{i+1} = d_i + ∆t v_i + (∆t²/2) M^{−1} f(d_i, t_i),
v_{i+1} = v_i + (∆t/2) M^{−1} (f(d_i, t_i) + f(d_{i+1}, t_{i+1})). (3.8)

The Verlet integrators are explicit second-order methods that also have the nice property of being symplectic.
A symplectic integrator can be defined as follows: “An integration method can be interpreted as a mapping in phase space [...]. If the integration method is applied to a measurable set of points in the phase space, this set is mapped to another measurable set in the phase space. The integration method is called symplectic if the measure of both of those sets is equal.” [18, p. 46] Symplectic methods are usually used in MD because they preserve the energy of a system very well compared to non-symplectic methods. Other examples of symplectic methods are the symplectic Euler method and the implicit midpoint rule [31].

A big problem that occurs in MD are instabilities. The time step of explicit methods is usually “limited by the smallest oscillation period that can be found in the simulated system” [33, p. 6543] – which can easily be about 10^{−15} s (femtoseconds) for some calculations. The time step can be increased slightly by using additional techniques like force splitting and freezing certain bonds (RATTLE/SHAKE), see also [31]. Implicit time-stepping could – in theory – allow almost arbitrarily large step sizes, which are only limited by accuracy considerations, not by stability. A single implicit time step is of course much more expensive than an explicit step, as it requires the solution of a non-linear system. The paper [31] investigated the usefulness of implicit methods in molecular dynamics. The authors basically came to the conclusion that implicit methods do not allow much bigger time steps because the necessary non-linear solvers do not converge for bigger time steps, as the non-linear system becomes ill-conditioned. On the other hand, the paper [33] managed to solve certain MD problems in competitive time using an implicit waveform relaxation method.
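Returning to the explicit case, the velocity Verlet step (3.8) can be sketched in a few lines. This is an illustrative sketch, assuming a force routine without explicit time dependence and a diagonal mass matrix (the names velocity_verlet, force and masses are not from the thesis or any library).

```python
import numpy as np

def velocity_verlet(d0, v0, force, masses, dt, n_steps):
    """Sketch of the velocity Verlet scheme (3.8).

    force(d) returns the force vector f(d); masses holds one mass per
    degree of freedom (the diagonal of M).
    """
    d, v = np.asarray(d0, float), np.asarray(v0, float)
    f = force(d)
    for _ in range(n_steps):
        # d_{i+1} = d_i + dt*v_i + dt^2/2 * M^{-1} f(d_i)
        d = d + dt * v + 0.5 * dt**2 * f / masses
        f_new = force(d)
        # v_{i+1} = v_i + dt/2 * M^{-1} (f(d_i) + f(d_{i+1}))
        v = v + 0.5 * dt * (f + f_new) / masses
        f = f_new                # reuse f(d_{i+1}) in the next step
    return d, v
```

Note that only one new force evaluation is needed per step, since f(d_{i+1}) can be reused in the following step.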
3.3 The molecular dynamics problem

To have a more meaningful problem that is also more difficult to solve from a numerical point of view, a problem from molecular dynamics has been chosen. This example is taken from [18, p. 157ff] and simulates the evolution of a crack in silver. The initial configuration is a mesh consisting of 4870 silver atoms with a small crack in the domain, as shown in figure 3.3.

Figure 3.3. Initial positions of the silver atoms; an additional force is applied to the lower and upper border (white atoms)

Now a force is exerted on the particles on the lower and upper boundary which pulls the two sides away from each other. Due to these forces, the crack will open wider. The internal forces between the particles are modeled by a new potential, the Finnis-Sinclair potential. Unlike the Lennard-Jones potential presented earlier, it is a multibody potential, which means that the potential between the particles ρ_i and ρ_j does not only depend on ρ_i and ρ_j but potentially also on some other particles. The potential function of the Finnis-Sinclair potential is given by

V(d) = ε Σ_{i=1}^n ( Σ_{j=i+1}^n (σ/r_ij)^{12} − c √(S_i) )

where

S_i = Σ_{j=1, j≠i}^n (σ/r_ij)^6.

By calculating −∇V we obtain the forces for each particle:

(f_{x_i}, f_{y_i})^T = Σ_{j=1, j≠i}^n L_ij (x_j − x_i, y_j − y_i)^T (3.9)

where

L_ij = −ε (1/r_ij²) ( 12 (σ/r_ij)^{12} − 3c (1/√(S_j) + 1/√(S_i)) (σ/r_ij)^6 ).

The external force exerted on the particles on the lower and upper boundary is given by

f_{y_i}^{ext} = g(y_max − y_i) if ρ_i lies on the upper boundary,
f_{y_i}^{ext} = g(y_min − y_i) if ρ_i lies on the lower boundary,
f_{y_i}^{ext} = 0 otherwise.

This causes the particles on the upper/lower boundary to be pulled up/down until the y coordinate of the particle becomes y_max (upper boundary) or y_min (lower boundary). We also use the cutoff-radius technique here to reduce the computational cost of each function evaluation.
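For illustration, the force evaluation (3.9) with cutoff could be sketched as follows. This is a naive O(n²) version without the spatial decomposition of figure 3.2, with illustrative names, and assuming positions stored as an (n, 2) array.

```python
import numpy as np

def finnis_sinclair_forces(pos, eps, sigma, c, r_cut):
    """Hedged sketch of the force evaluation (3.9) with a cutoff radius.

    pos is an (n, 2) array of particle positions; pairs beyond r_cut are
    ignored as in the cutoff variant (3.7).
    """
    diff = pos[None, :, :] - pos[:, None, :]           # diff[i, j] = x_j - x_i
    r = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(r, np.inf)                        # exclude i == j
    mask = r < r_cut
    sr6 = np.where(mask, (sigma / r) ** 6, 0.0)        # (sigma/r_ij)^6
    S = sr6.sum(axis=1)                                # S_i
    # Guard against S_i = 0 (isolated particles); those rows contribute 0.
    inv_sqrt_S = np.where(S > 0.0, S, 1.0) ** -0.5
    # L_ij = -eps/r^2 (12 (sigma/r)^12 - 3c (1/sqrt(S_j) + 1/sqrt(S_i)) (sigma/r)^6)
    L = -eps / r**2 * (12.0 * sr6**2
                       - 3.0 * c * (inv_sqrt_S[None, :] + inv_sqrt_S[:, None]) * sr6)
    L = np.where(mask, L, 0.0)
    return (L[:, :, None] * diff).sum(axis=1)          # f_i = sum_j L_ij (x_j - x_i)
```

Since L_ij is symmetric and the difference vectors are antisymmetric, the internal forces computed this way sum to zero, as they should for a pair/multibody potential without external forces.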
The constants used in this example are also taken from [18, p. 157ff]: ε = 1, σ = 1, c = 10.7, M = I, f_ext = 3, y_max = 47.5, y_min = −0.5, g = 0.6 and r_cut = 5.046875.

3.4 Using time-parallel methods

After some tests with time-parallel methods in the previous chapter, we will now apply these methods to the presented MD problem. There are not many papers available in which it was tried to parallelize MD problems in time. In the few cases where this was done [33][6], very simple and small problems were solved. Therefore it is difficult to predict how our more realistic problem will behave.

Before we start to solve the problem, one thing has to be pointed out: MD problems usually show chaotic behaviour, i.e. even a very small error will grow exponentially in time. Figure 3.4 shows the solution of the crack propagation problem at time t = 50. Both solutions were calculated with the Verlet algorithm, using a step size of ∆t = 10^{−2}. The only difference was that in the initial condition the velocities of two particles were exchanged. As one can see, the pictures are slightly different. Due to this chaotic behaviour, comparing the positions or velocities of the particles with some reference solution is usually meaningless.

Figure 3.4. Solution at t = 100 with slightly modified initial conditions

To check the accuracy of the time-parallel methods, the following approach is used: “In molecular dynamics simulations, the evaluation of numerical methods (and determination of the quality of a numerical trajectory) must be based on the magnitude of the observed energy drift. From one time step to the next, the energy can fluctuate quite considerably in an MD simulation, regardless of the method, and these local fluctuations are generally larger in a low-order method than in a higher-order one, at a given stepsize.”[9, p.
10] This means that we have to calculate the total energy E(ρ_1, …, ρ_n) of the system. To calculate E, we need the kinetic energy E_kin(ρ_1, …, ρ_n) and the potential energy E_pot(ρ_1, …, ρ_n) of the system. These two values are given by [18]

E_pot(ρ_1, …, ρ_n) = V(ρ_1, …, ρ_n),
E_kin(ρ_1, …, ρ_n) = (1/2) Σ_{i=1}^n m_i v_i²

where m_i is the mass of particle ρ_i and v_i is its velocity. The total energy is then

E(ρ_1, …, ρ_n) = E_pot(ρ_1, …, ρ_n) + E_kin(ρ_1, …, ρ_n). (3.10)

Again a reference solution was calculated to determine the accuracy of the time-parallel methods. In this case, this solution was obtained by using the Verlet algorithm with step size 5 · 10^{−3}. As shown in figure 3.5, the energy remains more or less constant after an initial phase. Therefore, we calculate the average energy E_mean on the time interval [40, 100]. To check the accuracy of other solvers, we compare the total energy E_i at time t_i with the reference energy E_mean. Then we define the maximal energy deviation E_dev of this method by

E_dev = max_{i: t_i ∈ [40,100]} δE_i = max_{i: t_i ∈ [40,100]} |(E_mean − E_i)/E_mean|. (3.11)

Figure 3.5. Total energy when using a smaller step size (5 · 10^{−3})

If we look back at figure 3.4, we see that the final positions of the particles are different. Figure 3.6 on the other hand shows that the deviation from E_mean is quite similar in both cases. In both cases we have E_dev ≈ 5 · 10^{−6}.

3.4.1 Waveform relaxation

We will start with testing the WR algorithm. First, we have to choose a splitting function F(x, y, t). Here the function

F(x, y, t) = f(y, t) + J̃_f(y, t)(x − y)

Figure 3.6.
Deviation from the average energy for the unmodified (top) and the modified (bottom) initial condition

will be used. J̃_f(y, t) is the block-diagonal Jacobian matrix of f defined by

J̃_f(y, t) = diag(T_1(y), …, T_n(y))

where T_i(y) ∈ R^{2×2} is given by

T_i(y) = ( ∂f_{2i}/∂y_{2i}      ∂f_{2i}/∂y_{2i+1}
           ∂f_{2i+1}/∂y_{2i}    ∂f_{2i+1}/∂y_{2i+1} ).

Each block of the Jacobian matrix corresponds to one single atom whose position is defined by its x and y coordinates. Therefore this force splitting will be called atom-wise Waveform Newton. As explained in remark 1, using the WR algorithm with explicit time-stepping is usually not very useful. Therefore the implicit midpoint rule was chosen as the time-stepping scheme. The idea here is to use bigger step sizes in order to compensate the higher cost of the implicit time-stepping scheme, as was done in [33].

We start by analyzing the convergence rates of the WR algorithm. Here we cannot get a theoretical convergence estimate as easily as in the previous chapter. It is also easy to see that no force splitting will satisfy the Lipschitz condition of theorem 1. This can be shown by choosing x̃ = ỹ in this theorem in such a way that these vectors correspond to a system where two particles have the same position. In that case we have

F(x̃, ỹ, t) = F(x̃, x̃, t) = f(x̃, t) = ±∞

as we have to divide by r_ij = 0 in the force calculation. To check whether convergence is still possible, several combinations of step sizes and window sizes were tested on the time interval [0, 1]. Only one core was used in the beginning. The corresponding numbers of iterations per window are shown in table 3.1. As a stopping criterion, the criterion in equation (2.8) was used with ε = 10^{−4}.
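For illustration, the atom-wise splitting above could be evaluated as follows; jac_blocks is a hypothetical helper returning the n per-atom 2×2 blocks T_i(y), and all names are illustrative.

```python
import numpy as np

def atomwise_waveform_newton_rhs(f, jac_blocks, x, y, t):
    """Sketch of the splitting F(x, y, t) = f(y, t) + J~_f(y, t)(x - y).

    J~_f keeps only the 2x2 diagonal blocks of the Jacobian, one block per
    atom; jac_blocks(y, t) returns them as an (n, 2, 2) array.
    """
    diff = (x - y).reshape(-1, 2)                  # one row per atom
    T = jac_blocks(y, t)                           # shape (n, 2, 2)
    correction = np.einsum('nij,nj->ni', T, diff)  # T_i (x - y)_i per atom
    return f(y, t) + correction.reshape(-1)
```

For a single atom with a linear right-hand side, the block-diagonal Jacobian equals the full Jacobian, so the splitting then reproduces f evaluated at x, which is a convenient sanity check.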
By comparing these results with the sequential Verlet algorithm, which needed approximately 80 seconds to solve the problem on this time interval, we can see that the WR algorithm again performs quite badly.

step size    window size   average # of iterations per window    wall-clock time in s
1 · 10^{−2}   1 · 10^{−2}   4.0                                   1159
1 · 10^{−2}   5 · 10^{−2}   6.0                                   1041
1 · 10^{−2}   10 · 10^{−2}  8.0                                   1283
5 · 10^{−2}   5 · 10^{−2}   10.6                                  617
5 · 10^{−2}   10 · 10^{−2}  15.5                                  664
5 · 10^{−2}   20 · 10^{−2}  23.6                                  860
5 · 10^{−2}   50 · 10^{−2}  no convergence after 100 iterations   -
10 · 10^{−2}  10 · 10^{−2}  no convergence after 100 iterations   -

Table 3.1. Average number of iterations for different window sizes and step sizes

If we use small step and window sizes, we get quite fast convergence. However, this does not help us, as each iteration is now much more costly since we are using implicit time-stepping and a more expensive right-hand side function F(x, y, t). When we try to use bigger step sizes to compensate the higher costs, the number of iterations also grows quite fast, or the solution does not converge at all. As the maximal possible window sizes are also quite small (≤ 10 · step size), we still have to exchange data after a few time steps. Tests also showed that even when using 100 cores, the WR algorithm is only marginally faster than the Verlet algorithm on one single processor. Furthermore, the WR algorithm does not seem to conserve the energy very well, with E_dev being around 2 · 10^{−3} for a step size of 5 · 10^{−2} and a window size of 5 · 10^{−2}.

3.4.2 Parareal algorithm

To test the parareal algorithm, the Verlet algorithm with step size 10^{−2} was chosen as the fine solver F. As a bigger step size leads to instabilities, the coarse solver G was chosen to be equal to F, but with the cutoff radius halved in order to make G faster. This gives us approximately C_F ≈ 4C_G.
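The construction of such an F/G pair can be sketched as a small propagator factory: both propagators share the integrator and step size and differ only in the force routine. All names here are illustrative; force_full and force_cheap in the usage comment are assumed force routines with the full and the halved cutoff radius.

```python
import numpy as np

def make_block_propagator(force, masses, dt, n_steps):
    """Return a propagator u = (d, v) -> (d', v') that advances one time
    block with velocity Verlet using the given force routine."""
    def propagate(u):
        d, v = np.asarray(u[0], float), np.asarray(u[1], float)
        f = force(d)
        for _ in range(n_steps):
            d = d + dt * v + 0.5 * dt**2 * f / masses
            f_new = force(d)
            v = v + 0.5 * dt * (f + f_new) / masses
            f = f_new
        return d, v
    return propagate

# F uses the full cutoff radius, G a halved one (roughly 4x cheaper):
# F = make_block_propagator(force_full,  masses, dt=1e-2, n_steps=block_steps)
# G = make_block_propagator(force_cheap, masses, dt=1e-2, n_steps=block_steps)
```

Keeping the integrator identical in F and G isolates the effect of the coarsened physics, which is the design choice described above.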
When testing the parareal algorithm, the outcome was basically that one always has to perform almost P iterations in order to fulfill the stopping criterion if P ≲ 150. This leads to runtimes that are even worse than for serial time-stepping on a single processor. Further investigations showed that the reason for this behaviour seems to be the update step

U_j^{k+1} = F(U_{j−1}^k) − G(U_{j−1}^k) + G(U_{j−1}^{k+1}).

If the size of one time block is quite large – which is the case if P is small – the values of F(U_{j−1}^k), G(U_{j−1}^k) and G(U_{j−1}^{k+1}) may differ substantially. As U_j^{k+1} is calculated by simply summing up these terms, it regularly happens that some particles in U_j^{k+1} come too close to each other. By looking at equation (3.9), one can see that small distances between particles lead to very strong repulsive forces, which causes subsequent applications of the time-stepping schemes to become unstable. Therefore the parareal algorithm does not converge in this case, and only the observation that the result of the parareal algorithm after P iterations equals the solution of sequential time-stepping makes it possible to obtain the right solution. For higher numbers of processors the algorithm converges and we are at least able to get a small speedup, as shown in figure 3.7. Table 3.2 shows the number of parareal iterations that were required for convergence. It also shows the theoretical upper bound for the speedup that was derived in chapter 2 (equation (2.15)) and compares it with the actually achieved speedup. We can see that the algorithm came very close to the theoretical speedup. The energy conservation of the parareal algorithm was quite good in this case, with values E_dev ≤ 6 · 10^{−6}. This is quite surprising, given that the parareal algorithm is not symplectic, even if F and G are symplectic [11].

Figure 3.7.
Achieved speedup for the parareal algorithm

P     number of iterations   maximal theoretical speedup   achieved speedup
192   43                     1.9                           1.8
384   44                     2.5                           2.4
768   37                     3.2                           2.6

Table 3.2. Number of parareal iterations and speedups for different numbers of processors

Chapter 4

Improved methods for the molecular dynamics problem

In chapter 2 the parareal algorithm and the WR method were presented. For the MD problem of the previous chapter, both algorithms performed very badly and do not seem to work in their basic forms. This chapter presents some ideas for improving the algorithms. In the next chapter these improvements are tested and evaluated.

The waveform relaxation method suffers from the fact that the number of necessary iterations grows very fast when the window sizes become bigger. However, big window sizes are required in order to use a high number of cores efficiently. An even bigger problem is perhaps that the solution does not even converge when one uses big time steps or window sizes. This makes it very unlikely that large-scale time parallelization can be achieved using the WR method. Furthermore, using implicit methods may not be useful for molecular dynamics, as they do not allow much bigger step sizes than explicit methods [31]. Because of this, the main focus of this chapter will be on the parareal method, which is more promising here.

The parareal algorithm in its basic form suffers from the fact that it becomes unstable and does not converge. But we saw that convergence is possible when the number of processors is high enough and the size of each time block is small enough. However, the obtained speedup is still relatively small. Therefore some components of the algorithm shall be investigated, and a new multilevel parareal algorithm is proposed here.

4.1 Choice of the force splitting for the waveform relaxation

The choice of the function splitting for the WR method is likely to have an influence on the speed of the method.
Therefore three different force splittings will be compared with respect to their rate of convergence and the overall speed of the resulting method. We will use the Picard splitting, which was probably the first splitting used for WR-like methods. We will also test a diagonal Newton approach for the modified right-hand side. These two splittings will be compared to the splitting that was used in the previous chapter. The mentioned splittings are defined as follows:

• Picard iteration: F(x, y, t) = f(y, t)

• Diagonal Newton WR iteration: F(x, y, t) = f(y, t) + diag(J_f(y, t))(x − y)

• The force splitting from chapter 3: F(x, y, t) = f(y, t) + J̃_f(y, t)(x − y)

4.2 Improving the parareal algorithm

4.2.1 Using windowing

When applying the parareal algorithm, we saw that the algorithm becomes unstable when the size of each time block is too large. Furthermore, theorem 4 suggests that using smaller time blocks may be beneficial for convergence. Therefore it is often not appropriate to apply the parareal method to the whole time domain. Instead, one can solve the differential equation on a part of the time domain (on a window) as shown in figure 4.1. As soon as the problem has been solved completely on one window, the solution on the next window is started, using the final solution of the previous window as the initial condition.

Figure 4.1. Top: Using parareal on the whole domain; Bottom: Using parareal with windowing

4.2.2 Choice of the coarse operators

The choice of the coarse operator in each parareal-like algorithm is very important for obtaining fast convergence. If the coarse operator is too inaccurate, a lot of iterations are needed for convergence. If it is too expensive on the other hand, this also degrades the parallel efficiency.
To construct coarse solvers we have basically four different possibilities:

1. Use a bigger step size than for the fine operator.
2. Use another time-stepping method that has a lower order and which is therefore faster.
3. Simplify the underlying physics.
4. Use an iterative solver and do only a limited number of iterations. This can be done if the WR algorithm or a comparable iterative method is used for the time-stepping.

Ideas 1 and 2 are unfortunately not very practical when one tries to solve MD problems. Idea 1 will not work as one is usually already using the biggest possible time step for which the system is stable. Using a bigger time step will therefore result in an unstable method unless one uses implicit methods. But in [31] it was shown that the non-linear equation solvers that are needed for implicit time-stepping do not converge when the time steps become only slightly larger. Idea 2 will not work either, as we are using the Verlet algorithm, which is a cheap, explicit time-stepping scheme. Furthermore, the Verlet algorithm has much better stability properties than, for example, the explicit Euler scheme. Constructing methods that are faster than Verlet and similarly stable may therefore be very difficult.

The easiest way to simplify the underlying physics is to simply make the cutoff radius smaller. This has of course some influence on the solution in the long term, but on shorter time scales the solutions are sufficiently similar. Therefore we will test coarse solvers which use 0.5r_cut and 0.25r_cut as the cutoff radius. Another idea is to use a completely different potential. Crack propagation has already been simulated in [28] using the Lennard-Jones potential. For sufficiently small time intervals the solution that one obtains using the Lennard-Jones potential may be close enough to the original solution. Therefore the Lennard-Jones potential is used to construct a coarse solver by setting ε = 1 and σ ≈ 0.89 in this potential.¹
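As a rough illustration of such a Lennard-Jones coarse solver, the pair forces could be evaluated as sketched below. This is only an illustration, not the solver used in the thesis: it uses plain O(N²) loops with no cutoff or neighbour lists, and the fitted values ε = 1, σ ≈ 0.89 from the text are hard-coded.

```python
import numpy as np

# Assumed fitted parameters from the text: eps = 1, sigma ~ 0.89.
EPS, SIGMA = 1.0, 0.89

def lj_forces(pos):
    """Pairwise Lennard-Jones forces, V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6),
    for an (N, 3) array of positions. Plain O(N^2) double loop, no cutoff,
    so this is an illustration rather than a production force kernel."""
    forces = np.zeros_like(pos)
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r_vec = pos[i] - pos[j]
            r2 = float(np.dot(r_vec, r_vec))
            s6 = (SIGMA * SIGMA / r2) ** 3          # (sigma/r)^6
            # -(dV/dr)/r: positive values push the pair apart
            fmag = 24.0 * EPS * (2.0 * s6 * s6 - s6) / r2
            forces[i] += fmag * r_vec
            forces[j] -= fmag * r_vec
    return forces
```

At the potential minimum r = 2^(1/6)·σ the pair force vanishes, which gives a quick consistency check of the formula.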
There are also several different approaches available to construct coarse-grained approximations. Deriving these approximations is usually quite complex and would go beyond the scope of this thesis, therefore this approach is not tested here. Finally, we use only a few iterations of the WR algorithm as the coarse solver. In that case we hope that the solution obtained by the WR method is sufficiently close to the actual solution.

¹ These values were obtained by fitting the Lennard-Jones forces to the Finnis-Sinclair forces.

4.2.3 A multilevel parareal algorithm

As shown at the end of chapter 2 (equation (2.15)), the speedup is limited by the speed of the coarse operator. Furthermore, it is quite probable that the number of parareal iterations will increase when we use a G which is less accurate. Thus we have a big problem now: We cannot use an arbitrarily coarse G to obtain a better bound for the speedup in equation (2.15) because this will probably require many more parareal iterations, which negates the positive effects of a faster G.

Therefore a multilevel parareal algorithm will be derived here. Figure 4.2 shows the two steps of the classical parareal algorithm: First one calculates F(U_{j−1}^k) in parallel for all j. The second step is to sequentially apply the coarse solver G in order to calculate U_j^{k+1} for all j. In the multilevel algorithm we do not apply G sequentially on the whole time domain. Instead, G is only applied sequentially on some parts of the domain. In order to propagate the solution from earlier time steps to the later ones, even coarser solvers should be used. Figure 4.3 illustrates the idea in the case that one uses three different solvers F, G and C. By doing this we hope to get convergence rates which are not much worse than for the classical parareal algorithm.
If this is the case we can get much better performance on a large number of cores as the amount of sequential calculation is much lower.

Figure 4.2. Sketch of the classical parareal algorithm, dashed lines represent dependencies

The foundation for this algorithm is the work in [14] where parareal was motivated as a multigrid algorithm with only two levels. Here, this approach is used to extend parareal to more levels. This makes it possible to introduce additional parallelism into the algorithm.

Figure 4.3. Sketch of the 3-grid parareal algorithm, dashed lines represent dependencies

The possibility to construct multilevel parareal algorithms has already been pointed out, for example in [14], but as far as I know there are no papers available that deal with such an algorithm for non-linear ODEs. For linear ODEs such an algorithm has been presented in [13], which showed quite similar convergence rates for the classical and the multigrid parareal algorithm. A somewhat similar approach has also been used in [17] to create a multilevel time-parallel predictor-corrector method for computational fluid dynamics.

Parareal as a 2-grid method

First, we will derive the classical parareal algorithm as a multigrid method: To do this we have to define a coarseness constant c ∈ N, c > 1, equidistant time steps t_0, t_1, ..., t_{cm} and vectors u_0, u_1, ..., u_{cm}. Furthermore, we need a fine solver F and a coarse solver G. F(u_i) denotes the approximate solution at time t_{i+1} of

    ẏ = f(y, t),    t ∈ [t_i, t_{i+1}],    y(t_i) = u_i,

using the F propagator. G(u_i) is defined in a similar way, namely as the approximate solution at time t_{i+c} of

    ẏ = f(y, t),    t ∈ [t_i, t_{i+c}],    y(t_i) = u_i,

using the G propagator.
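For illustration, both propagators can be sketched as velocity Verlet integrators over one block, where a coarse G would simply be handed a cheaper force routine or a larger step. This is a generic sketch, not the implementation used in the thesis; the function name `make_propagator` and its parameters are invented for the example.

```python
def make_propagator(force, dt, n_steps):
    """Return a propagator advancing a state (x, v) over one time block
    of length n_steps*dt using velocity Verlet. A coarse propagator G
    would be built the same way with a cheaper `force` (e.g. smaller
    cutoff) while a fine F uses the full force field."""
    def prop(state):
        x, v = state
        for _ in range(n_steps):
            a = force(x)
            x = x + dt * v + 0.5 * dt * dt * a      # position update
            v = v + 0.5 * dt * (a + force(x))       # velocity update
        return (x, v)
    return prop
```

For a harmonic oscillator (force −x) this reproduces the well-known near-conservation of energy of the Verlet scheme over a block.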
Similar to equation (2.9) we can now represent the sequential application of F to the basic equation (1.4) as a non-linear system of equations:

    A(u) := ( u_0, u_1 − F(u_0), ..., u_{cm} − F(u_{cm−1}) )^T = ( y_0, 0, ..., 0 )^T =: b    (4.1)

This nonlinear system A(u) = b can easily be solved by using sequential time-stepping. In order to introduce parallelism, however, the system should now be solved by a non-linear multigrid approach called FAS (full approximation scheme). This means that one tries to approximate the exact solution u* by an iterative scheme. Here, u_i^k will denote the approximation of u_i* in the k-th iteration. The general form of a 2-grid FAS iteration is:

1. Smoothing step:
       ũ = S(u^k, b)    (4.2)

2. Solve a coarse problem:
       Ã(U) = B    (4.3)
   where
       B = R(b − A(ũ)) + Ã(R(ũ))    (4.4)

3. Correction step:
       u^{k+1} = ũ + P(U − R(ũ))    (4.5)

In this scheme, S is the so-called smoothing operator, R the restriction operator, P the prolongation operator and Ã(U) is the coarse version of A(u). The paper [14] now uses the following definitions to obtain the parareal scheme: The smoothing operation ũ = S(u^k, b) is defined by

    ũ_i = u_i^k,             if ∃j ∈ N_0 : i = cj,
    ũ_i = F(ũ_{i−1}) + b_i,  otherwise,

which means that sequential time-stepping is used inside a block of c time points while the initial values of each time block remain unchanged. The restriction and prolongation operations are given by

    R(u) = ( u_0, u_c, ..., u_{cm} )^T,
    P(U) = ( U_0, 0_{c−1}, U_1, 0_{c−1}, ..., 0_{c−1}, U_m )^T,

where 0_{c−1} is the zero vector of size c − 1. Finally we have to specify the coarse problem Ã(U). Using G instead of F, we define Ã(U) as

    Ã(U) = ( U_0, U_1 − G(U_0), ..., U_m − G(U_{m−1}) )^T.    (4.6)

The normal parareal algorithm solves the nonlinear equation Ã(U) = B on the coarse grid exactly by applying G sequentially.
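Because the system is block lower triangular, this sequential application amounts to forward substitution; a minimal sketch with a generic one-step propagator (an illustration of the structure, not thesis code):

```python
def solve_sequential(prop, b):
    """Forward substitution for a block lower-triangular system of the
    form (4.1): u_0 = b_0 and u_i - prop(u_{i-1}) = b_i,
    i.e. u_i = prop(u_{i-1}) + b_i. `prop` plays the role of F or G."""
    u = [b[0]]
    for bi in b[1:]:
        u.append(prop(u[-1]) + bi)
    return u
```

For the original problem, b = (y_0, 0, ..., 0)^T, so this is exactly sequential time-stepping from the initial value.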
Then one can show that each of these FAS iterations is equal to one parareal iteration: First we have to calculate the vector B in the coarse equation Ã(U) = B:

    B = R(b − A(ũ)) + Ã(R(ũ))
      = ( b_0 − ũ_0 + ũ_0,
          b_c − ũ_c + F(ũ_{c−1}) + ũ_c − G(ũ_0),
          ...,
          b_{cm} − ũ_{cm} + F(ũ_{cm−1}) + ũ_{cm} − G(ũ_{c(m−1)}) )^T
      = ( b_0,
          b_c + F(ũ_{c−1}) − G(ũ_0),
          ...,
          b_{cm} + F(ũ_{cm−1}) − G(ũ_{c(m−1)}) )^T.

Solving Ã(U) = B exactly by using the coarse operator G we get

    U = ( B_0, B_1 + G(U_0), ..., B_m + G(U_{m−1}) )^T
      = ( b_0,
          b_c + F(ũ_{c−1}) − G(ũ_0) + G(U_0),
          ...,
          b_{cm} + F(ũ_{cm−1}) − G(ũ_{c(m−1)}) + G(U_{m−1}) )^T.

Doing the final step u^{k+1} = ũ + P(U − R(ũ)) yields:

    u_0^{k+1} = b_0,
    u_i^{k+1} = ũ_i + P(U − R(ũ))_i = ũ_i = F(ũ_{i−1}) + b_i,   if ∄j ∈ N : i = jc,
    u_i^{k+1} = ũ_i + P(U − R(ũ))_i = U_{i/c} = F(ũ_{i−1}) − G(ũ_{i−c}) + G(U_{i/c−1}) + b_i,   otherwise.

To obtain the classical parareal algorithm we now have to set c = 1. In that case we also have that b_0 = y_0 and b_i = 0 for i ≠ 0. By using this information we get that ũ_i = u_i^k and therefore

    u_0^{k+1} = b_0,
    u_i^{k+1} = ũ_i + P(U − R(ũ))_i = P(U)_i = U_i
              = F(ũ_{i−1}) − G(ũ_{i−1}) + G(U_{i−1})
              = F(u_{i−1}^k) − G(u_{i−1}^k) + G(u_{i−1}^{k+1}).

Extension to a multigrid method

Now versions with more than two levels shall be derived. This means that the coarse problem Ã(U) = B in equation (4.3) will not be solved exactly. Instead, we provide an initial guess for this coarser level and apply recursively one multigrid-parareal iteration to this coarse problem. First, we define the number of different levels by L, where level 1 is the finest and L the coarsest level. Then we have to define fine and coarse time-stepping schemes F_l and G_l for each level 1, ..., L − 1. We must also enforce that F_l = G_{l−1} for l > 1, i.e. the coarse propagator on level l must equal the fine propagator on level l + 1. Furthermore, coarseness constants c_1, ..., c_{L−1} for the levels are needed and we set c_1 = 1. Finally we denote the function A and the vector b on level l by A_l and b_l.
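Before extending this to more levels, the two-level update just derived can be summarized in a short serial sketch. The F evaluations inside the loop are the part that runs in parallel in practice; F and G here are arbitrary one-block propagators, and the function is an illustration rather than the thesis's implementation.

```python
def parareal(F, G, y0, n_blocks, n_iters):
    """Classical parareal: U_i^{k+1} = F(U_{i-1}^k) - G(U_{i-1}^k)
    + G(U_{i-1}^{k+1}). Returns the block boundary values after
    n_iters iterations."""
    # initial guess from one sequential coarse sweep
    U = [y0]
    for _ in range(n_blocks):
        U.append(G(U[-1]))
    for _ in range(n_iters):
        F_prev = [F(u) for u in U[:-1]]  # embarrassingly parallel in practice
        G_prev = [G(u) for u in U[:-1]]
        U_new = [y0]
        for i in range(n_blocks):
            # sequential coarse correction sweep
            U_new.append(F_prev[i] - G_prev[i] + G(U_new[i]))
        U = U_new
    return U
```

A well-known property visible in this sketch is finite termination: after k iterations the first k block values coincide with the purely sequential fine solution, so n_blocks iterations always reproduce it exactly.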
Now we are going to derive the update formulas for each level l with l < L − 1. On every level l we need an initial guess U_i^{k,l}, i = 1, ..., c_l m_l, in each iteration k. Then we have to apply the smoothing operator S(U^{k,l}) to these initial values:

    Ũ_i^l = U_i^{k,l},                 if ∃j ∈ N_0 : i = c_l j,
    Ũ_i^l = F_l(Ũ_{i−1}^l) + b_i^l,    otherwise.

By using this we can calculate the vector b^{l+1} which is needed for the evaluation on the coarser grid:

    b^{l+1} = R(b^l − A_l(Ũ^l)) + A_{l+1}(R(Ũ^l))
            = ( b_0^l,
                b_{c_l}^l + F_l(Ũ_{c_l−1}^l) − G_l(Ũ_0^l),
                ...,
                b_{c_l m_l}^l + F_l(Ũ_{c_l m_l−1}^l) − G_l(Ũ_{c_l(m_l−1)}^l) )^T
            = ( b_0^l,
                b_{c_l}^l + F_l(Ũ_{c_l−1}^l) − G_l(U_0^{k,l}),
                ...,
                b_{c_l m_l}^l + F_l(Ũ_{c_l m_l−1}^l) − G_l(U_{c_l(m_l−1)}^{k,l}) )^T.

The value b^{l+1} is now used on the next level as the coarse right-hand side, and Ũ_0^l, Ũ_{c_l}^l, ..., Ũ_{c_l m_l}^l = U_0^{k,l}, U_{c_l}^{k,l}, ..., U_{c_l m_l}^{k,l} are used as the initial condition on the coarse grid U_0^{k,l+1}, U_1^{k,l+1}, ..., U_{m_l}^{k,l+1}. The evaluation on the coarse grid will return updated values U_0^{k+1,l+1}, U_1^{k+1,l+1}, ..., U_{m_l}^{k+1,l+1} which are now used to calculate the next iterate:

    U_i^{k+1,l} = U_{i/c_l}^{k+1,l+1},                                                if ∃j ∈ N : i = c_l j,
    U_i^{k+1,l} = Ũ_i^l = F_l(Ũ_{i−1}^l) + b_i^l = F_l(U_{i−1}^{k,l}) + b_i^l,        if ∃j ∈ N : i = c_l j + 1,    (4.7)
    U_i^{k+1,l} = Ũ_i^l = F_l(Ũ_{i−1}^l) + b_i^l = F_l(U_{i−1}^{k+1,l}) + b_i^l,      otherwise.

If we are on level L − 1 then the coarse problem A_{l+1}(U) = b^{l+1} is solved exactly.
If we use U^{*,l+1} to describe the exact solution of the coarse problem, we obtain

    U_i^{k+1,l} = U_{i/c_l}^{*,l+1} = G_l(U_{i−c_l}^{k+1,l}) + b_{i/c_l}^{l+1}
                = G_l(U_{i−c_l}^{k+1,l}) + b_i^l + F_l(Ũ_{i−1}^l) − G_l(U_{i−c_l}^{k,l}),   if ∃j ∈ N : i = c_l j,
    U_i^{k+1,l} = Ũ_i^l = F_l(Ũ_{i−1}^l) + b_i^l = F_l(U_{i−1}^{k,l}) + b_i^l,              if ∃j ∈ N : i = c_l j + 1,    (4.8)
    U_i^{k+1,l} = Ũ_i^l = F_l(Ũ_{i−1}^l) + b_i^l = F_l(U_{i−1}^{k+1,l}) + b_i^l,            otherwise.

Finally, these values are passed to the finer level where they will serve as the coarse grid solution.

Parallel implementation of the multigrid method

The question now is in which way the processors should cooperate. The idea used here is to assign a number of processors P_l to each level l. After choosing the number of processors P_1 for the finest level, the values P_l for the levels 2, ..., L − 1 are defined by P_l = P_{l−1}/c_{l−1}, l = 2, ..., L − 1. We will also denote the P_l processors on level l by p_0^l, ..., p_{P_l−1}^l. At the beginning of each iteration k, all processors p_i^l are supposed to have the vector U_i^{k,l} locally available.

The next thing that is done is to split the calculation of b^{l+1} into two parts. On level l the parts

    b̃_i^{l+1} = b_{ic_l}^l + F_l(Ũ_{ic_l−1}^l),    i = 1, ..., m_l,

are calculated. The calculation of the missing G_l(U_{(i−1)c_l}^{k,l}) = F_{l+1}(U_{(i−1)c_l}^{k,l}) is moved to level l + 1 where this value is already known from the previous iteration. Due to this we can avoid the explicit calculation of G_l(U_{(i−1)c_l}^{k,l}) on level l. By using this and the equations (4.7) and (4.8) we get the operations that each processor p_i^l has to do.
All different cases are listed below in algorithm 3:

Algorithm 3 Updates for processor p_i^l
  if l = 1 then
      b̃_{i+1}^{l+1} = F_l(U_i^{k,l})
  else if l = L − 1 and ∃j ∈ N : i + 1 = jc_l then
      U_{i+1}^{k+1,l} = G_l(U_{i−c_l+1}^{k+1,l}) + F_l(U_i^{k+1,l}) − F_l(U_i^{k,l}) − G_l(U_{i−c_l+1}^{k,l}) + b̃_{i+1}^l
  else if c_l = 1 then
      b̃_{(i+1)/c_l}^{l+1} = F_l(U_i^{k,l}) − F_l(U_i^{k,l}) + b̃_{i+1}^l = b̃_{i+1}^l
  else if ∃j ∈ N : i + 1 = jc_l then
      b̃_{(i+1)/c_l}^{l+1} = F_l(U_i^{k+1,l}) − F_l(U_i^{k,l}) + b̃_{i+1}^l
  else if ∃j ∈ N : i = jc_l then
      U_{i+1}^{k+1,l} = F_l(U_i^{k,l}) − F_l(U_i^{k,l}) + b̃_{i+1}^l = b̃_{i+1}^l
  else
      U_{i+1}^{k+1,l} = F_l(U_i^{k+1,l}) − F_l(U_i^{k,l}) + b̃_{i+1}^l
  end if

If we use these updates and the fact that each processor p_i^l holds U_i^{k,l}, we get the following parallel scheme:

Algorithm 4 Parallel multilevel parareal algorithm for processor p_i^l
  Require: Initial value U_i^{0,l}
  while not converged do
      Receive U_i^{k+1,l} from p_{i−1}^l if this value is needed by algorithm 3
      Receive U_{i−c_l+1}^{k+1,l} from p_{i−c_l}^l if this value is needed by algorithm 3
      Receive b̃_{i+1}^l from p_{c_{l−1}i}^{l−1} if l > 1
      Calculate the new values according to algorithm 3
      if ∄j : i + 1 = jc_l then
          Send U_{i+1}^{k+1,l} to p_{i+1}^l
      else if l < L − 1 then
          Send b̃_{(i+1)/c_l}^{l+1} to p_{(i+1)/c_l−1}^{l+1}
      end if
      if l = L − 1 and ∃j : i + 1 = jc_l then
          Send U_{i+1}^{k+1,l} to p_{i+c_l}^l
      end if
      if l < L − 1 and ∃j : i = jc_l then
          Receive U_{i/c_l}^{k+1,l+1} from p_{i/c_l}^{l+1}
          U_i^{k+1,l} = U_{i/c_l}^{k+1,l+1}
      end if
  end while

The stopping criterion was implemented in a similar fashion as for the classical parareal algorithm. This time we simply apply the stopping criterion only to the coarsest level. This means that we can stop the calculation as soon as ||U_j^{k,L−1} − U_j^{k−1,L−1}|| < ε for all j. Tests showed that ε has to be chosen as ε = 10^{−4} to achieve similar accuracy as sequential time-stepping, i.e. E_dev ≤ 6 · 10^{−6}.

Chapter 5

Evaluation

5.1 Choice of the force splitting for the waveform relaxation

Here we have a look at the three different force splittings that were presented in the previous chapter.
Again, we will test them by solving the MD problem on the interval [0, 1] using the implicit midpoint rule as the time-stepping scheme. Table 5.1 shows the average number of iterations per window and the corresponding wall-clock time on one processor. As we can see, none of these force splittings is satisfying. Using the Picard splitting, each iteration is quite cheap as we do not have to calculate parts of the Jacobian matrix, in contrast to the other two splittings. While being cheap, this version of the WR algorithm does not converge for bigger time steps. Instead, the solution becomes unstable for step sizes as small as 5 · 10^{−2}. The diagonal Newton approach on the other hand behaves very similarly to the atom-wise Newton splitting in chapter 3. As the window sizes are still very small, we do not gain much from using more cores. Even if 100 cores are used, none of these force splittings is more than 20% faster than the Verlet algorithm on one single processor. Furthermore we still have the problem that the WR algorithm does not conserve the energy very well, which leads to very inaccurate results in the long run.

5.2 Improving the parareal algorithm

5.2.1 Using windowing

Now the influence of windowing is investigated using the same solvers as in the evaluation in chapter 3. Here, the solution was not calculated on the whole time interval at once. Instead, windows of size P∆T for varying ∆T were used. By looking at figure 5.1 we can see the positive influence of windowing on the convergence rate. This figure shows us that the number of iterations required to fulfill the stopping criterion becomes much smaller, which should give us better performance according to equation (2.15).
Table 5.1. Average number of iterations for different window sizes and step sizes

algorithm          step size    window size   average number of iterations   wall-clock time in s
Atom-wise Newton   1 · 10^{−2}   1 · 10^{−2}    4.0                            1159
Atom-wise Newton   1 · 10^{−2}   5 · 10^{−2}    6.0                            1041
Atom-wise Newton   1 · 10^{−2}   10 · 10^{−2}   8.0                            1283
Atom-wise Newton   5 · 10^{−2}   5 · 10^{−2}    10.6                           617
Atom-wise Newton   5 · 10^{−2}   10 · 10^{−2}   15.5                           664
Atom-wise Newton   5 · 10^{−2}   20 · 10^{−2}   23.6                           860
Atom-wise Newton   5 · 10^{−2}   50 · 10^{−2}   no convergence                 –
Atom-wise Newton   10 · 10^{−2}  10 · 10^{−2}   no convergence                 –
Picard             1 · 10^{−2}   1 · 10^{−2}    5.8                            479
Picard             1 · 10^{−2}   5 · 10^{−2}    11.6                           570
Picard             1 · 10^{−2}   10 · 10^{−2}   18                             808
Picard             5 · 10^{−2}   5 · 10^{−2}    unstable                       –
Picard             10 · 10^{−2}  10 · 10^{−2}   unstable                       –
Diagonal Newton    1 · 10^{−2}   1 · 10^{−2}    4.0                            1114
Diagonal Newton    1 · 10^{−2}   5 · 10^{−2}    6.0                            1000
Diagonal Newton    1 · 10^{−2}   10 · 10^{−2}   8                              1233
Diagonal Newton    5 · 10^{−2}   5 · 10^{−2}    11.0                           618
Diagonal Newton    5 · 10^{−2}   10 · 10^{−2}   16.2                           680
Diagonal Newton    5 · 10^{−2}   20 · 10^{−2}   25.2                           882
Diagonal Newton    5 · 10^{−2}   50 · 10^{−2}   unstable                       –
Diagonal Newton    10 · 10^{−2}  10 · 10^{−2}   no convergence                 –

As shown in figure 5.2 this is indeed the case, giving us up to 50% better performance for P = 192. It also makes it possible to get convergence if P is smaller.

5.2.2 Choice of the coarse operators

Here, several different coarse solvers G for the molecular dynamics problem will be compared. These solvers have already been motivated in the previous chapter and are:

• Using a smaller cutoff radius (0.5r_cut and 0.25r_cut)
• Using the Lennard-Jones potential
• Using only a few WR iterations.

After looking at table 5.1, the Picard iteration with step size 1 · 10^{−2} and window size 1 · 10^{−2} was chosen as it is the fastest solver. The number of WR iterations will be limited to only two iterations here. As this coarse solver would still be slower than F according to table 5.1, we will use a smaller cutoff radius 0.5r_cut , too.
Figure 5.1. Necessary iterations for parareal algorithm on each window for different ∆T and P = 192

We will also apply the windowing technique with ∆T = 10^{−2} as these coarse operators may only work on a limited time window. Figure 5.3 now shows the number of iterations that were necessary in each window for P = 48. We can see that the convergence is quite slow when we are using the Lennard-Jones potential, at least compared to the other cases. One can demonstrate that the solutions obtained using the Lennard-Jones potential differ quite strongly from the solution using the Finnis-Sinclair potential. The difference becomes larger as the window size increases. If the Lennard-Jones potential is used, the parareal algorithm also becomes unstable for P > 50 as the window size P · 10^{−2} is larger in this case. If we use one of the other coarse solvers, the convergence is quite fast.

But the performance of the parareal algorithm does not only depend on the speed of convergence but also on the cost of the coarse solver. As we can see in table 5.2, the speedup is very small when we use the two WR iterations as a coarse solver, even though convergence is really fast. The reason for this is that the cost of this coarse solver is very high and we almost have CF = CG in this case.

Table 5.2. Achieved speedups for different coarse time-stepping schemes and P = 48

coarse solver               speedup
smaller cutoff 0.5r_cut     2.6
smaller cutoff 0.25r_cut    5.5
Lennard-Jones               0.8
WR                          1.2

Figure 5.2. Achieved speedup for the parareal algorithm for different ∆T

Figure 5.3.
Necessary iterations for different coarse time-stepping schemes

This means that we basically only have the possibility to use smaller cutoff radii to obtain useful coarse solvers. We also cannot use cutoff radii smaller than 0.25r_cut as the resulting coarse solver becomes unstable in this case. For the cutoff radii 0.5r_cut and 0.25r_cut the speedup for different P is plotted in figure 5.4. By using the less accurate coarse operator with 0.25r_cut we are able to obtain a speedup as high as 10, which is much more than the speedup of 2.6 that we achieved at the end of chapter 3.

Figure 5.4. Speedup for cutoff radii 0.5r_cut and 0.25r_cut

5.2.3 A multilevel parareal algorithm

Due to the lack of suitable coarse time-integrators, only the 3-grid method is used. All used time-stepping schemes are based on the Verlet integrator with step size 10^{−2}. We will also use windowing with ∆T = 10^{−2} as in the previous section. The first coarse operator G_1 = F_2 is the same as G in chapter 3, i.e. we use 0.5r_cut as the cutoff radius. The even coarser operator G_2 has been constructed by halving the cutoff radius again, as in the previous section about coarse operators. From now on we will always compare the classical parareal algorithm on P cores with the 3-grid algorithm using P_1 = P cores on the fine level, i.e. 2P cores in total. First we will have a look at the number of iterations when P = 192. Figure 5.5 shows us the number of iterations in the following four cases:

• Classical parareal algorithm using G = G_1, i.e. the coarse solver uses 0.5r_cut as the cutoff radius.
• Classical parareal algorithm using G = G_2, i.e. the coarse solver uses 0.25r_cut as the cutoff radius.
• Multilevel parareal algorithm using the solvers as specified above and the constants c_1 = 1 and c_2 = 2.
• Multilevel parareal algorithm using the solvers as specified above and the constants c_1 = 1 and c_2 = 8.

We can see that the convergence speed of the multilevel algorithm is in both cases not faster than for the classical parareal algorithm using G_2 as the coarse solver. This was unfortunately the requirement for faster calculations and therefore it is very likely that our algorithm will not be faster than the classical parareal algorithm. This is shown in figure 5.6 where we compare the speedup of the multilevel algorithm with the speedup of the classical parareal algorithm. In that case the multilevel algorithm is slightly slower than the classical parareal algorithm with G = G_2, even though the number of iterations was very similar in figure 5.5. A possible explanation is the higher number of communications that are necessary in the multilevel algorithm, as the processors on different levels have to exchange data. We should also not forget that the multilevel algorithm uses twice as many processors here compared to the classical parareal algorithm.

Figure 5.5. Necessary iterations for the classical parareal algorithm for P = 192

Figure 5.6. Speedup for the multilevel and the classical parareal algorithm

Chapter 6

Conclusions

This thesis showed that it is possible to parallelize at least the previously presented MD problem in time. In contrast to a lot of other papers, my evaluation was not based on a very simple toyproblem but on a more realistic example. The more promising way to achieve time-parallelization for this problem seems to be the parareal algorithm, which gave us a maximal speedup of 10 here.
Even higher speedups may be possible if one can create even faster coarse solvers G that do not lead to instabilities. As equation (2.15) limits the speedup for a high number of cores, a multilevel parareal algorithm was derived to avoid this limit. This approach was not very successful as the convergence was not fast enough to obtain a better speedup.

Short tests were also done with the Lorenz attractor using the setup in [16]. During these tests only the convergence speed of the algorithms was analyzed and not the overall performance. These experiments showed that the multilevel algorithm with the coarsest solver G_{L−1} can converge much faster than the classical parareal algorithm with coarse time-stepping G = G_{L−1}, but only if G is a very inaccurate approximation of F. By investing more time into the development of even coarser solvers for this MD problem, one could test whether this also holds true here. Future work may also include a convergence analysis for the multilevel parareal algorithm as well as advanced scheduling strategies. It is also necessary to investigate how the parareal algorithm behaves when additional spatial parallelization is used to increase the speed of the fine and coarse solvers. In this thesis F and G were calculated using only one processor, which made their calculation slow. Using several cores will increase the speed of these time-stepping schemes, but the communication costs in the parareal algorithm will have a bigger influence as the cost of the time-stepping declines. This is especially an issue in the multilevel parareal algorithm, which requires more communication between the processors.

The WR algorithm seems quite useless for this kind of problem. Slow convergence on bigger time intervals, or no convergence at all, makes this algorithm non-competitive when compared to normal time-stepping. This may be different for other kinds of MD problems, as shown in [33]. In this paper the waveform relaxation was shown to be almost as fast as sequential application of the Verlet algorithm in cases where special harmonic potentials force the Verlet algorithm to take very small time steps in order to retain stability.

We conclude that the parareal algorithm allows us to parallelize the solution of the MD problem in time. Unfortunately, this algorithm is not really suited for large-scale parallelization as the speedup is limited even if the time domain is very large. The derived multilevel algorithm was not able to overcome this limit, but it might still be useful for other problems.

Bibliography

[1] CRESTA homepage. http://cresta-project.eu/.
[2] Lindgren specification. http://www.pdc.kth.se/resources/computers/lindgren/hardware.
[3] Titan homepage. http://www.olcf.ornl.gov/titan/.
[4] Pierluigi Amodio and Luigi Brugnano. Recent advances in the parallel solution in time of ODEs. AIP Conference Proceedings, 1048(1):867–870, 2008.
[5] Eric Aubanel. Scheduling of tasks in the parareal algorithm. Parallel Computing, 37(3):172–182, 2011.
[6] L. Baffico, S. Bernard, Y. Maday, G. Turinici, and G. Zérah. Parallel-in-time molecular-dynamics simulations. Phys. Rev. E, 66:057701, Nov 2002.
[7] Satish Balay, Jed Brown, Kris Buschelman, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith, and Hong Zhang. PETSc web page, 2013. http://www.mcs.anl.gov/petsc.
[8] Morten Bjørhus. A note on the convergence of discretized dynamic iteration. BIT Numerical Mathematics, 35(2):291–296, 1995.
[9] Stephen D. Bond and Benedict J. Leimkuhler. Molecular dynamics and the accuracy of numerically computed averages, 2007.
[10] Kevin Burrage. Parallel methods for systems of ordinary differential equations, 1995.
[11] X. Dai, C. Le Bris, F. Legoll, and Y. Maday. Symmetric parareal algorithms for Hamiltonian systems. ArXiv e-prints, November 2010.
[12] Luca Formaggia, Marzio Sala, and Fausto Saleri. Domain decomposition techniques. In Are Magnus Bruaset and Aslak Tveito, editors, Numerical Solution of Partial Differential Equations on Parallel Computers, volume 51 of Lecture Notes in Computational Science and Engineering, pages 135–163. Springer Berlin Heidelberg, 2006.
[13] S. Friedhoff, R. D. Falgout, T. V. Kolev, S. MacLachlan, and J. B. Schroder. A multigrid-in-time algorithm for solving evolution equations in parallel. https://e-reports-ext.llnl.gov/pdf/705292.pdf, 2012.
[14] M. Gander and S. Vandewalle. Analysis of the parareal time-parallel time-integration method. SIAM Journal on Scientific Computing, 29(2):556–578, 2007.
[15] Martin Gander. New convergence results for the parareal algorithm applied to ODEs and PDEs. Presentation slides for the DD16 meeting, www.ddm.org/DD16/Talks/gander.pdf.
[16] Martin J. Gander and Ernst Hairer. Nonlinear convergence analysis for the parareal algorithm. In Ulrich Langer, Marco Discacciati, David E. Keyes, Olof B. Widlund, and Walter Zulehner, editors, Domain Decomposition Methods in Science and Engineering XVII, volume 60 of Lecture Notes in Computational Science and Engineering, pages 45–56. Springer Berlin Heidelberg, 2008.
[17] Izaskun Garrido, Barry Lee, Gunnar E. Fladmark, and Magne S. Espedal. Convergent iterative schemes for time parallelization. Mathematics of Computation, 75(255):1403–1428, 2006.
[18] M. Griebel, S. Knapek, and G. Zumbusch. Numerical Simulation in Molecular Dynamics: Numerics, Algorithms, Parallelization, Applications. Texts in Computational Science and Engineering. Springer, 2010.
[19] G. Horton and S. Vandewalle. A space-time multigrid method for parabolic partial differential equations. SIAM Journal on Scientific Computing, 16(4):848–864, 1995.
[20] G. Horton, S. Vandewalle, and P. Worley. An algorithm with polylog parallel complexity for solving parabolic partial differential equations. SIAM Journal on Scientific Computing, 16(3):531–541, 1995.
[21] Frank Hülsemann, Markus Kowarschik, Marcus Mohr, and Ulrich Rüde. Parallel geometric multigrid. In Are Magnus Bruaset and Aslak Tveito, editors, Numerical Solution of Partial Differential Equations on Parallel Computers, volume 51 of Lecture Notes in Computational Science and Engineering, pages 165–208. Springer Berlin Heidelberg, 2006.
[22] Jan Janssen and Stefan Vandewalle. On SOR waveform relaxation methods. SIAM Journal on Numerical Analysis, 34:2456–2481, 1997.
[23] Claude Le Bris. Computational chemistry from the perspective of numerical analysis. Acta Numerica, 14:363–444, 2005.
[24] E. Lelarasmee, A. E. Ruehli, and A. L. Sangiovanni-Vincentelli. The waveform relaxation method for time-domain analysis of large scale integrated circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1(3):131–145, July 1982.
[25] Erik R. Lindahl. Molecular dynamics simulations. In Andreas Kukol, editor, Molecular Modeling of Proteins, volume 443 of Methods in Molecular Biology, pages 3–23. Humana Press, 2008.
[26] Jacques-Louis Lions, Yvon Maday, and Gabriel Turinici. Résolution d'EDP par un schéma en temps «pararéel». Comptes Rendus de l'Académie des Sciences - Series I - Mathematics, 332(7):661–668, 2001.
[27] A. S. Nielsen. Feasibility study of the parareal algorithm. Master's thesis, Technical University of Denmark, DTU Informatics, 2012.
[28] Arthur Paskin, A. Gohar, and G. J. Dienes. Computer simulation of crack propagation. Phys. Rev. Lett., 44:940–943, Apr 1980.
[29] Prasenjit Saha, Joachim Stadel, and Scott Tremaine. A parallel integration method for solar system dynamics, 1997.
[30] A. Satoh. Introduction to Practice of Molecular Simulation: Molecular Dynamics, Monte Carlo, Brownian Dynamics, Lattice Boltzmann and Dissipative Particle Dynamics. Elsevier Insights. Elsevier Science, 2010.
[31] Nick Schafer and Dan Negrut. A quantitative assessment of the potential of implicit integration methods for molecular dynamics simulation. Journal of Computational and Nonlinear Dynamics, 5(3), 2010.
[32] P. J. van der Houwen, B. P. Sommeijer, and J. J. B. de Swart. Parallel predictor-corrector methods. Journal of Computational and Applied Mathematics, 66(1–2):53–71, 1996. Proceedings of the Sixth International Congress on Computational and Applied Mathematics.
[33] Haim Waisman and Jacob Fish. A space–time multilevel method for molecular dynamics simulations. Computer Methods in Applied Mechanics and Engineering, 195(44–47):6542–6559, 2006.
[34] David E. Womble. A time-stepping algorithm for parallel computers. SIAM J. Sci. Stat. Comput., 11(5):824–837, September 1990.
[35] Yanan Yu, Ashok Srinivasan, and Namas Chandra. Scalable time-parallelization of molecular dynamics simulations in nano mechanics. International Conference on Parallel Processing, pages 119–126, 2006.
