UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO

Hybrid Space and Time Domain Decomposition for Parallel Simulation of Unsteady Incompressible Fluid Flows

Jorge Manuel Fernandes Trindade (Mestre)

Dissertation for the degree of Doctor in Mechanical Engineering

Supervisor: Doutor José Carlos Fernandes Pereira

Jury
President: the Rector of Universidade Técnica de Lisboa
Members: Doutor Paulo Jorge dos Santos Pimentel de Oliveira
         Doutor José Carlos Fernandes Pereira
         Doutor Fernando Manuel Coutinho Tavares Pinho
         Doutor Pedro Jorge Martins Coelho
         Doutor José Manuel da Silva Chaves Ribeiro Pereira

April 2007

Abstract

The thesis addresses the parallel calculation of unsteady, incompressible fluid flows on a PC-cluster. Spatial domain decomposition is nowadays a standard technique for the parallel solution of the Navier-Stokes equations. The search for an accurate, efficient and robust method for parallel-in-time calculations extends the parallelization options available in the context of CFD simulations. The solution is based on the iterative use of a coarse and a finer time-grid calculation in a predictor-corrector fashion. The extension of the parallel-in-time algorithm to hybrid time and space parallel calculations makes it possible to optimize the speed-up through the choice of the domain to parallelize, given the dimensions of the problem and the number of processors available. The discretization of the incompressible, unsteady form of the Navier-Stokes equations is an issue common to both the spatial and the temporal parallel strategies. For second-order accurate finite-volume methods, the time derivative of the volume-averaged velocity can be congruently replaced by the time derivative of the cell-center velocity. However, in a high-order formulation based on a projection method, it is essential to include a high-order reconstruction step in the algorithm.
The fourth-order finite-volume numerical scheme uses the projection method for decoupling velocity and pressure. The inclusion of a high-order, step-by-step de-averaging process applied to the velocity field is a simple and effective method to enhance the accuracy of a finite-volume code for CFD analysis.

Keywords: Parallel; Navier-Stokes; High-Order; Unsteady; Incompressible; Finite-Volume.

Resumo

The parallel calculation of unsteady, incompressible fluid flows on a PC-cluster is the central subject of this thesis. Spatial domain decomposition is nowadays common practice in the parallel solution of the Navier-Stokes equations. The investigation of a method based on the decomposition of the temporal domain widens the range of parallelization options in the context of numerical simulations of fluid flow. The thesis presents the application of a temporal decomposition algorithm to the solution of the Navier-Stokes equations in the incompressible, unsteady case, based on the iterative solution of the equations on two time grids. The choice of the discretization order of the Navier-Stokes equations is an issue common to both parallelization strategies. In second-order accurate finite-volume schemes, the time derivative of the mean velocity in the control volume may be replaced by the derivative of the point velocity at the control-volume center. However, for a high-order formulation based on the projection method, it is necessary to include in the algorithm a reconstruction step with high-order accuracy. The fourth-order numerical scheme uses the projection method to decouple velocity and pressure. The inclusion of a velocity-field reconstruction process at each calculation step is a way to increase the accuracy of a finite-volume code for the investigation of problems through computational fluid dynamics.
Keywords: Parallel Computing; Navier-Stokes; High-Order; Unsteady; Incompressible; Finite-Volume.

Acknowledgments

This work was carried out at the Department of Mechanical Engineering at Instituto Superior Técnico, Universidade Técnica de Lisboa. I would like to express my gratitude to my supervisor, Prof. José Carlos Pereira, for all the good advice and for encouraging me during my work. I would also like to thank Prof. José Manuel C. Pereira for all the fruitful discussions, and all my colleagues at the LASEF for creating a stimulating working atmosphere. Finally, I would like to thank my family for their support and patience.

Nomenclature

Latin characters
b       body force per unit mass
cv, cp  specific heat at constant volume, pressure
CD      drag coefficient
CL      lift coefficient
g       gravitational acceleration vector
h       cavity width
k       iteration counter
n       unit vector normal to CS, directed outwards
p       pressure
P       number of processors
S       parallel speed-up
t       time variable
T       time domain length
t*      non-dimensionalized time, t* = tκ/h²
u       horizontal velocity
u       velocity vector
U       non-dimensionalized horizontal velocity, U = ρuh/µ
x       horizontal space coordinate
X       non-dimensionalized horizontal space coordinate, X = x/h
y       vertical space coordinate
Y       non-dimensionalized vertical space coordinate, Y = y/h

Greek characters
α       diffusivity
β       thermal coefficient of expansion
χ       primitive variable for the velocity and temperature field
δt      fine time-grid step size
∆t      coarse time-grid step size
ε       parallel efficiency
γ       kinematic viscosity
Γ       computing time
θ, θ0   temperature, reference temperature
Θ       non-dimensionalized temperature, Θ = (θ − θ0)/(θX=0 − θX=1)
κ       heat conductivity
µ       viscosity coefficient
ρ, ρ0   density, reference density
φ       scalar
ω       vorticity

Similarity parameters
Nu      Nusselt number
Pr      Prandtl number
Ra      Rayleigh number
Re      Reynolds number
St      Strouhal number

Abbreviations
CFD     Computational Fluid Dynamics
CFL     Courant-Friedrichs-Lewy
CS      Control surface
CV      Control volume
MPI     Message Passing Interface
PC      Personal computer
PVM     Parallel Virtual Machine
rms     root mean square

Table of Contents

Abstract
Resumo
Acknowledgments
Nomenclature

1 Introduction
  1.1 Parallel numerical simulation
  1.2 Spatial domain decomposition
  1.3 Temporal domain decomposition
  1.4 High-order projection methods
  1.5 Objectives and contributions
  1.6 Contents

2 Numerical methods
  2.1 Governing equations
  2.2 Solution of Navier-Stokes equations
  2.3 Spatial discretization
  2.4 Time advancement of momentum equations
  2.5 Pressure correction
  2.6 High-order finite-volume method
    2.6.1 Spatial discretization
    2.6.2 Time advancement of momentum equations
    2.6.3 Velocity de-averaging
  2.7 Summary

3 Space domain decomposition
  3.1 Parallel solution of systems of equations
  3.2 High-order finite-volume numerical simulations
    3.2.1 Two-dimensional Taylor vortex decay problem
    3.2.2 Counter-rotating vortices interaction
    3.2.3 Co-rotating vortices merging
  3.3 Summary
4 Time domain decomposition
  4.1 Numerical method
  4.2 Numerical stability
  4.3 Parallel-in-time numerical scheme accuracy
  4.4 Performance model
  4.5 Numerical experiments
    4.5.1 Taylor vortex
      4.5.1.1 Parallel-in-time results
      4.5.1.2 Comparison between the spatial and the temporal domain decomposition
    4.5.2 Shedding flow past a two-dimensional square cylinder in a channel
      4.5.2.1 Parallel-in-time simulation
      4.5.2.2 Evaluation of the proposed performance model
    4.5.3 Hybrid spatial and temporal domain decomposition
  4.6 Summary

5 Conclusions
  5.1 Summary
  5.2 Suggestions for future work

Bibliography

A De-averaging coefficients

List of Figures

1.1 PC-cluster current implementation.
1.2 1D (a), 2D (b) and 3D (c) strategies for the spatial domain decomposition.
1.3 Internal boundaries message exchange scheme.
2.1 Co-located grid (left) and staggered grid (right).
2.2 Labelling scheme for a 2D grid.
2.3 Labelling scheme for 3D face integral evaluation.
2.4 Nine-point stencil for pressure increment gradient operator discretization for the pressure Poisson equation for the two-dimensional case.
3.1 Parallel code flow chart for an incremental-pressure projection method (cycle A exists only for the explicit outer iteration coupling).
3.2 Dependence of the computing time on the number of processors (128 × 128 nodes mesh).
3.3 Dependence of the achieved speed-up on the number of processors (128 × 128 nodes mesh).
3.4 Dependence of the parallel efficiency on the number of processors (128 × 128 nodes mesh).
3.5 Dependence of the achieved speed-up on the number of processors for a 512 × 512 nodes mesh.
3.6 Dependence of the parallel efficiency on the number of processors for a 512 × 512 nodes mesh.
3.7 The computational domain considered for the two-dimensional Taylor vortex-decay problem.
3.8 Maximum error, L∞ norm, of u-velocity and pressure, at final computed time, dependence on the mesh refinement.
3.9 Initial conditions and computational domain for the two-dimensional viscous counter-rotating vortices test case.
3.10 Temporal evolution of the non-dimensionalized maximum vertical velocity component along a horizontal line going through the vortex center.
3.11 Comparison of the predicted vertical velocity component profile after 100 s for the 128 × 128 nodes grid with the finer grid solution.
3.12 Comparison of the predicted vertical velocity component profile after 100 s for the 256 × 256 nodes grid with the finer grid solution.
3.13 Initial vorticity contours.
3.14 Vorticity contours during the merging process at t = 2 s, t = 5 s, t = 6 s, t = 7 s and t = 8 s (second-order de-averaging on the left and fourth-order on the right side).
3.15 Comparison of the vorticity contours during the merging process simulation on a 150 × 150 nodes grid at t = 2 s, t = 5 s, t = 6 s, t = 7 s and t = 8 s (second-order de-averaging on the left and fourth-order on the center) with the reference solution on the 600 × 600 nodes grid (right side).
3.16 Comparison of the vorticity contours during the merging process simulation on a 300 × 300 nodes grid at t = 2 s, t = 5 s, t = 6 s, t = 7 s and t = 8 s (second-order de-averaging on the left and fourth-order on the center) with the reference solution on the 600 × 600 nodes grid (right side).
3.17 Vorticity contours during the merging process at t = 2 s, t = 5 s, t = 6 s, t = 7 s and t = 8 s for 300 × 300 (left) and 300 × 300 × 20 (right) nodes grids.
3.18 Vorticity contours during the merging process at t = 60 s, t = 220 s, t = 240 s, t = 260 s and t = 300 s (second-order de-averaging on the left and fourth-order on the right side).
3.19 Predicted vertical velocity component profile after the merging (t = 360 s).
4.1 Space-time decomposition parallel solver schematic diagram.
4.2 Propagating scalar front problem considered for numerical evaluation of the stability domain (solution for t = 1, u = 0.125 and α = 10⁻³, after two iterations).
4.3 Sequential coarse time-grid and parallel fine time-grid solutions.
4.4 Error dependence on the iteration number near the stability boundary (4th-order Runge-Kutta scheme and diffusive criterion equal to 0.2).
4.5 Stability domain for the explicit Euler (a), Adams-Bashforth (b), Crank-Nicolson (c) and 4th-order Runge-Kutta (d) schemes.
4.5 Stability domain for the explicit Euler (a), Adams-Bashforth (b), Crank-Nicolson (c) and 4th-order Runge-Kutta (d) schemes (cont'd).
4.6 Initial condition (t = 0) and the iterative approximation to the exact solution at t = 2.
4.7 Dependence of the sequential fine time-grid solution L2 error norm on the mesh resolution.
4.8 Error dependence on the spatial discretization and number of iterations using the implicit Euler and fourth-order Runge-Kutta schemes on the coarse and fine time-grids, respectively.
4.9 Error dependence on the spatial discretization and number of iterations using the Crank-Nicolson and fourth-order Runge-Kutta schemes on the coarse and fine time-grids, respectively.
4.10 Error dependence on the spatial discretization and number of iterations using the implicit Euler and Adams-Bashforth schemes on the coarse and fine time-grids, respectively.
4.11 Parallel-in-time solver schematic diagram.
4.12 Parallel-in-time speed-up prediction for two iterations.
4.13 Parallel-in-time computing time dependence on the number of iterations performed (32 × 32 nodes; δt = 4 × 10⁻³ s).
4.14 Parallel-in-time solution L1 norm error dependence on the number of iterations performed (32 × 32 nodes; δt = 4 × 10⁻³ s).
4.15 Dependence of the maximum deviation between parallel-in-time and serial solutions on the number of iterations performed (32 × 32 nodes; δt = 4 × 10⁻³ s).
4.16 Computing time dependence on space resolution (2 iterations; δt = 4 × 10⁻³ s).
4.17 Speed-up and parallel efficiency dependence on space resolution (2 iterations; δt = 4 × 10⁻³ s).
4.18 Computing time dependence on the size of the finer time-grid increment (2 iterations on a 32 × 32 nodes mesh).
4.19 Spatial domain decomposition parallel efficiency and speed-up on a 64 × 64 nodes mesh and δt = 4 × 10⁻³ s.
4.20 Temporal domain decomposition parallel efficiency and speed-up on a 64 × 64 nodes mesh and δt = 4 × 10⁻³ s.
4.21 Parallel-in-time and domain decomposition methods parallel efficiency ratio on a 64 × 64 nodes mesh.
4.22 Flow configuration and grid.
4.23 Force coefficients for Re = 500.
4.24 Predicted vorticity contours and streamlines for Re = 500.
4.25 Lift coefficient and Strouhal number dependence on the number of iterations prescribed.
4.26 Vorticity contours after first (a) and second (b) iteration.
4.27 Comparison between predicted (lines) and verified (symbols, filled symbols are related to 3 iterations) efficiency of the parallel-in-time method for 2 and 3 iterations prescribed (Φ = M/P).
4.28 The dependence of the Nusselt number, at the heated wall, X = 0, and at the vertical center-line, X = 1/2, on the non-dimensionalized time for Rayleigh number equal to 10³, 1.4 × 10⁴ and 1.4 × 10⁵.
4.29 The Nusselt number temporal evolution, at the vertical center line for Ra = 1.4 × 10⁵, dependence on the number of iterations performed.
4.30 The dependence of the number of iterations required for convergence on the Rayleigh number and on the number of time sub-domains.
4.31 Computing time required for simulations: a) 128 × 128 nodes; b) 32 × 32 nodes.

A.1 High-order 2-D de-averaging stencil for interior control volumes.
A.2 Cases considered for the use of the 2-D de-averaging coefficients.
A.3 3-D stencil and de-averaging coefficients for an interior cell.
A.4 3-D stencil and de-averaging coefficients for a boundary face cell.
A.5 3-D stencil and de-averaging coefficients for a boundary face cell near an edge.
A.6 3-D stencil and de-averaging coefficients for a boundary face cell near a vertex.
A.7 3-D stencil and de-averaging coefficients for a cell near a boundary face.
A.8 3-D stencil and de-averaging coefficients for a cell near a boundary face and edge.
A.9 3-D stencil and de-averaging coefficients for a cell near a boundary face and vertex.
A.10 3-D stencil and de-averaging coefficients for a boundary edge cell.
A.11 3-D stencil and de-averaging coefficients for a boundary vertex cell.
List of Tables

1.1 Configuration of individual machines.
4.1 Predicted values for St and CL rms.
A.1 De-averaging coefficients for a uniform 2-D cartesian spatial grid.

Chapter 1

Introduction

1.1 Parallel numerical simulation

Parallel architectures are increasingly attractive when examining current trends from a variety of perspectives: economic, technological and application demand [1]. For complex engineering problems, the demand for large memory and reasonable turn-around time exceeds what a regular sequential machine can provide. To this end, one can employ powerful supercomputers, which are very expensive for universities and small research groups. Fortunately, thanks to the support of high-speed networking, high-performance systems can be constructed from personal computer (PC) components. The use of multiple processors is an effective way to significantly speed up the solution. A popular and cost-effective approach to parallel computing is cluster computing based, for example, on PCs running the Linux operating system.

The present thesis addresses the parallel calculation of unsteady, incompressible flows on a PC-cluster, nowadays a common practice to achieve the computing power required by theoretical or applied fluid engineering problems. LASEF (Laboratory and Simulation of Energy and Fluids) has experience in parallel processing for Computational Fluid Dynamics (CFD) and for almost two decades has used different configurations, starting with transputer-based processors [2]. Other research teams in Portugal have used parallel processing in CFD-related problems, see e.g. P. Novo [3], P. Coelho [4, 5], L. Palma [6, 7] and a few others, but their number is very limited compared with central European countries, the USA or Japan.
Therefore, a description of the hardware and software of the 24-node PC-cluster used to conduct the numerical experiments is included here. A PC-cluster can be considered a message-passing parallel architecture that provides communication between processors as explicit I/O operations, with the limitation that these communications are much slower than the processor. The effectiveness of this approach depends on the communication network connecting the PCs.

The goal of the project was to construct a PC-cluster providing the relatively inexpensive but high-performing computational capability required for the execution of fluid flow simulations. A 24-node cluster was successfully assembled, each node with one Pentium IV 2.4 GHz processor. The system provides very stable operation with almost 100% up-time and the expected speed, and it is very heavily used by a large number of users (about 15). Consequently, it was very important to integrate the system properly so that it complies with the demands of a multi-user remote-access environment. All the major aspects of the PC-cluster implementation are schematically represented in Fig. 1.1.

Figure 1.1: PC-cluster current implementation.

The Pentium CPU was selected over other options because there are stable compilers that run under the Linux system; otherwise, an expensive alternative operating system would have had to be purchased. No monitor is attached to any slave PC, since these are connected through the network and used without direct user access. The operation of all the individual machines is monitored from a master console. Table 1.1 summarizes the configuration details of the individual machines.

Table 1.1: Configuration of individual machines.

             Master        Slaves
CPU          2.4 GHz       2.4 GHz
Memory       1 GB          1 GB
Hard drive   65 + 200 GB   33 GB

Finally, it is worth mentioning a vital hardware-related issue of the PC-cluster: the Fast Ethernet network setup.
In order to coordinate the computational tasks among all computer nodes, a large amount of information must be exchanged during the computing process, so high-speed communication among them is imperative. One 24-port 100 Mbps Ethernet switch is currently used for the node connection. The master PC is connected to the IST 10 Mbps Ethernet, allowing communication with the outside world.

The Linux operating system (Debian distribution, version 3.1) is currently used on all computers of the cluster. The well-known stability, performance, excellent software, troubleshooting support and documentation of this easy-to-use free operating system made it a natural choice for the PC-cluster. All the codes were developed in FORTRAN using the Absoft Fortran 95 compiler, version 9.0 EP. The MPICH version 2.1 libraries are currently used for inter-processor communication.

In parallel computation, the workload is partitioned into smaller pieces, which are taken care of by a group of processors. Effective use of a PC-cluster requires a proper distribution of the solution tasks among the available processing nodes. A common approach in CFD is to decompose the spatial domain into a number of partitions and assign the partitions to different nodes, see e.g. [8, 9]. The processing nodes execute the same CFD solver but on different space sub-domains. The work should be partitioned evenly among the available computing nodes, such that no node is overloaded while other nodes are waiting for work. At the end of each numerical iteration, processors exchange intermediate results at sub-domain boundaries. Minimizing communication between nodes is the key to optimal speed-up. For CFD problems, an evenly balanced workload and the minimization of communication are dictated by the mesh partitioning. For structured meshes, the domain decomposition is straightforward.
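For a structured mesh, the even distribution described above amounts to splitting the global cell-index range into near-equal contiguous blocks, one per processor. A minimal sketch of this idea (the helper name partition_1d is illustrative, not part of the thesis solver, which is written in FORTRAN):

```python
def partition_1d(n_cells: int, n_procs: int) -> list[tuple[int, int]]:
    """Split a 1-D index range of n_cells among n_procs sub-domains.

    Returns (start, end) half-open ranges; the first (n_cells % n_procs)
    sub-domains receive one extra cell, so block sizes differ by at most
    one cell and no processor is significantly overloaded.
    """
    base, extra = divmod(n_cells, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# 128 cells distributed over 24 processors:
sizes = [end - start for start, end in partition_1d(128, 24)]
```

With 128 cells on 24 processors, eight sub-domains receive 6 cells and the remaining sixteen receive 5, so no processor idles while waiting for a neighbour with a markedly larger block.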
For unstructured meshes, a robust and effective algorithm to divide the domain automatically is a demanding necessity.

The comparison between different methods of decomposition requires parameters for measuring the performance of codes on parallel machines. The most common performance metrics of parallel processing are the speed-up and the efficiency. The speed-up achieved by a parallel algorithm running on P processors is defined as the ratio of the execution time on a single processor to the execution time of the parallel algorithm on P processors. The parallel speed-up S(P) can be expressed as

    S(P) = Γ1 / ΓP ,    (1.1)

where Γ1 and ΓP denote the execution time on one and P processors, respectively. The parallel efficiency, ε(P), is given by

    ε(P) = S(P) / P .    (1.2)

The definition of sequential time can be based either on the execution time of the parallel code on one processor, the relative speed-up, or on the execution time of the best sequential algorithm available, the absolute speed-up. The difference between these two definitions is that the relative speed-up deals with the inherent parallelism of the parallel algorithm under consideration, while the absolute speed-up emphasizes how much faster a problem can be solved with parallel processing when the chosen algorithm is the best available. Practical considerations limit the usefulness of the absolute speed-up and efficiency definitions: the choice of the best algorithm may depend on the problem size, the hardware used, etc. A constraint on the parallel performance is provided by Amdahl's law [10], which states that no matter how many processors are used in a parallel calculation, the parallel speed-up of an application is limited by the fraction of serial code present.

The parallel efficiency can be expressed as a product of three factors: communication, numerical and load balancing,

    ε(P) = εcom × εnum × εlb .    (1.3)

The numerical efficiency represents the increase in the number of operations required to fulfil the convergence criterion due to the changes in the algorithm required for its parallelization. The load-balancing efficiency represents the time some processors stay idle due to differences in problem size on each processor. The communication efficiency represents the time lost in a parallel computation due to communication between processors, during which computation cannot take place. The communication time can be split into local and global contributions,

    Γcom = Γloc + Γglob ,    (1.4)

where Γloc and Γglob are the time spent on local and global communications, respectively. The difference between the two is that local communications run in parallel, i.e. all processors are involved simultaneously in communication: during local communication, some processors send while others receive data. Global communication is a limiting factor for massive parallelization because only a certain number of processors are involved in communication at any time between the beginning and the end of data gathering or scattering.

Scalability is an important issue in parallel computing and becomes significant when solving large-scale problems with many processors. Scalability refers to the ability of a parallel system, including the hardware, software and application, to demonstrate a proportional increase in parallel speed-up with the number of processors. Mesh partitioning has a dominant effect on the parallel scalability for problems characterized by almost constant work per point. Poor load balancing induces idleness at synchronization points. However, balancing work alone is not sufficient; communications must be balanced as well, and these two objectives are not entirely compatible. When the problem size is fixed, an increase in the number of processors can begin to have a negative impact on parallel speed-up.
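The performance metrics defined by Eqs. (1.1) and (1.2), together with the Amdahl bound, can be evaluated directly from measured timings. A small illustrative sketch (function names and the numbers in the usage example are hypothetical, not measurements from the thesis cluster):

```python
def speedup(t1: float, tp: float) -> float:
    """Relative speed-up S(P) = Gamma_1 / Gamma_P (Eq. 1.1)."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """Parallel efficiency eps(P) = S(P) / P (Eq. 1.2)."""
    return speedup(t1, tp) / p

def amdahl_bound(serial_fraction: float, p: int) -> float:
    """Amdahl's law: speed-up limit on p processors when a fraction
    serial_fraction of the code cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# A hypothetical run taking 100 s on one processor and 8 s on 16:
s = speedup(100.0, 8.0)          # 12.5
e = efficiency(100.0, 8.0, 16)   # 0.78125
# Even 1% of serial code caps the 16-processor speed-up near 13.9:
cap = amdahl_bound(0.01, 16)
```

Note that amdahl_bound tends to 1/serial_fraction as p grows, which is why reducing the serial fraction (e.g. via larger problem sizes) matters more than adding processors once the efficiency starts to drop.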
The ratio between the time of communication and the time of computation decreases when the size of the problem increases, leading to increased efficiency. Speed-up usually increases with increasing problem size because large data arrays can reduce the fraction of serial code, providing increased parallel speed-up.

In recent years, high-performance computing has become increasingly important to scientific advancement. Nowadays, it is recognized that computational power greater than presently available will be needed in order to address large-scale problems in industry as well as to improve our knowledge of complex scientific problems. Massively parallel technology such as GRID computing is expected to fulfil this need [11, 12]. GRID computing is a new distributed computing paradigm, similar in spirit to the electric power grid. It is a form of distributed computing that involves coordinating and sharing computing, application, data, storage, or network resources across dynamic and geographically dispersed organizations. It provides scalable, high-performance mechanisms for discovering and negotiating access to geographically remote resources, promising to change the way organizations tackle complex computational problems. However, the vision of large-scale resource sharing is not yet a reality in many areas: GRID computing is an evolving field, where standards and technology are still being developed to enable this new paradigm in the near future, see e.g. [13, 14, 15].

1.2 Parallel simulation by spatial domain decomposition

CFD plays an important role in the research of physical processes associated with important engineering applications. Current challenges in computing incompressible flows have been addressed recently by Kwak et al. in a review article [16]. Flow solver codes and software tools have been developed to the point that many daily fluid engineering problems can now be computed routinely.
However, the predictive capability is still considered very limited, and prediction with accurate physics is yet to be accomplished. This will require the inclusion not only of fluid dynamics modelling but also of the modelling of other related quantities. These computations will require not only large computing resources but also large data storage and management technologies.

The solution of three-dimensional, unsteady, incompressible fluid flows, governed by the parabolic-elliptic nature of the Navier-Stokes and continuity system of equations, usually requires a large amount of computer time due to long simulation times coupled with high-resolution meshes. Workstations, networks or multiprocessor systems with distributed memory have become a customary facility for the solution of this sort of problem.

Most unsteady, incompressible calculations are performed with time-stepping algorithms and space domain decomposition. Time-stepping algorithms require the solution at one time instance before starting the calculation for the next time-step. This sequential-in-time procedure is complemented by the discretization in space, which uses the data parallelism or space domain decomposition technique. It is well known that when the spatial dimension of the problem is low, the parallel speed-up of the spatial domain decomposition method is limited, and an increase in the number of processors further decreases the efficiency.

To perform parallel programming, a parallel library is needed to provide the communication among the computer nodes in the network. There are two standards for distributed computing: the Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI). The features of both specifications were compared by Geist et al. [17]. PVM is built around the concept of a virtual machine, a dynamic collection of potentially heterogeneous computational resources.
Since the beginning of PVM development, portability and heterogeneity have been considered much more important than performance. Some performance is sacrificed in favour of the flexibility to communicate across architectural boundaries. In contrast, MPI is focused on the message-passing task and explicitly states that resource management is out of its scope. In the framework of parallel calculation on a homogeneous PC-cluster, the MPI standard (MPI 1.2 specification [18]) was chosen to develop the parallel Navier-Stokes solvers. MPI provides the following main features:

- the ability to specify communication topologies;

- the ability to create derived data-types describing non-contiguous data;

- a large set of point-to-point communication routines;

- a large set of collective communication routines for communication among groups of processes.

The parallelization procedure is based on the grid partitioning technique or data parallelism. The solution domain is subdivided into P non-overlapping sub-domains and each sub-domain is assigned to one processor. The objective of domain decomposition is to balance the computational workload and memory occupancy of the processing nodes while keeping the inter-node communication as low as possible. The communication between nodes, which is the most important source of computational overhead, must be minimized. In general, the partitioning method follows one of the following strategies (see Fig. 1.2):

- one-directional partitioning of a three-dimensional computational space;

- multi-directional partitioning to decrease the amount of data exchanged in communications between processors.

Figure 1.2: 1D(a), 2D(b) and 3D(c) strategies for the spatial domain decomposition.

One-directional partitioning is easier to program but has a long data communication time, because it requires a large amount of communication in the data exchange part.
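The trade-off between the two partitioning strategies can be illustrated with a simple count of the halo cells each processor must exchange per step. This is a back-of-the-envelope sketch under assumed conditions (a cubic N^3 grid, one ghost layer, and a processor count with an integer cube root), not a measurement:

```python
# Hypothetical halo-size comparison: 1D slab versus 3D block partitioning
# of an N^3 grid among P processors (interior processors, one ghost layer).

def halo_cells_1d(n, p):
    """1D slabs: each interior slab exchanges two full n*n faces."""
    return 2 * n * n

def halo_cells_3d(n, p):
    """3D blocks: each interior block exchanges six faces of (n/p^(1/3))^2 cells."""
    q = round(p ** (1.0 / 3.0))   # processors per direction (assumed integer)
    face = (n // q) ** 2
    return 6 * face

n, p = 120, 8
print(halo_cells_1d(n, p))   # 28800 cells exchanged per slab
print(halo_cells_3d(n, p))   # 21600 cells exchanged per block
```

For the same grid and processor count, the multi-directional partition already moves fewer cells, and its advantage grows with the number of processors, consistent with the smaller-message argument above.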
This trend of data communication appears in parallel machines with high data communication latency. Multi-directional partitioning, with smaller message sizes, is commonly in use and will be applied when possible. Since the sub-domains do not overlap, each processor calculates variable values that are not calculated by other processors. The calculation on control volumes (CVs) along the sub-domain boundaries needs values from CVs allocated to neighbouring processors. This requires an overlap of storage in the case of a distributed memory parallel computer, as indicated in Fig. 1.3. Each time a processor updates a variable needed by a neighbouring processor, it is copied to that processor's memory. The number of CVs stored in the neighbouring processor depends on the order of accuracy of the discretization scheme.

Figure 1.3: Internal boundaries message exchange scheme.

In contrast to a sequential program, a parallel program needs to be well organized in task load and communication to maximize the performance. Message passing routines, included in the MPI library, provide an organized structure for parallel computing. If not done carefully, the executing processors can spend much more time waiting for communication than computing, and thus the overall performance of the cluster will be unacceptable. A basic solution is to exploit the so-called "non-blocking communication", which allows message passing to proceed simultaneously with computing. The performance of a parallel CFD code depends on several factors, including the characteristics of the numerical algorithm and the hardware. The parallel performance depends to a great extent on inter-processor communication, which is a hardware-dependent parameter.
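The storage overlap of Fig. 1.3 can be mimicked without any MPI calls. The sketch below, an illustration rather than the thesis code, copies ghost cells between two simulated 1D sub-domains; the ghost width g encodes the dependence on the discretization order (g = 1 for a second-order central stencil, g = 2 for a fourth-order one):

```python
import numpy as np

# Sketch (no actual MPI): two sub-domains of a 1D field exchange ghost cells.
# A (2g+1)-point central stencil needs g neighbour cells on each side.

def exchange_ghosts(left, right, g):
    """Copy g boundary cells of each sub-domain into the other's ghost region."""
    left[-g:] = right[g:2 * g]       # right neighbour's first interior cells
    right[:g] = left[-2 * g:-g]      # left neighbour's last interior cells
    return left, right

g = 2                                                  # fourth-order stencil
left = np.zeros(8);  left[g:-g] = np.arange(1, 5)      # interior cells: 1..4
right = np.zeros(8); right[g:-g] = np.arange(5, 9)     # interior cells: 5..8
left, right = exchange_ghosts(left, right, g)
print(left[-g:])    # [5. 6.] : ghost cells now hold the neighbour's data
print(right[:g])    # [3. 4.]
```

In a real distributed-memory run the two assignments would be non-blocking sends and receives, so the copy can overlap with interior computation as discussed above.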
The numerical procedures used to solve the equations also make a significant contribution to the parallel performance. The discretization methods and the time-marching procedure are important parts of a CFD code. A higher-order numerical scheme usually requires more communication time in a parallel environment than a lower-order one. Implicit schemes allow larger time-steps than the easily parallelized explicit ones, but they require more computational work at each time-step and their parallelization is sometimes cumbersome.

1.3 Parallel simulation by temporal domain decomposition

Parallelism in the time direction is not common in CFD. The usual approach for setting up parallel CFD calculations is to divide the domain between processors using spatial domain decomposition. Algorithms that are sequential in time are usually considered to solve parabolic and hyperbolic differential equations numerically. Space domain splitting and the allocation of each sub-domain to a processor is the methodology usually used to perform the parallel computation of the governing fluid flow equations. However, it is well known that domain decomposition techniques are not efficient when the spatial dimension of the problem is small and a large number of processors is used. At each time-step, the processors need to exchange boundary variable values with processors holding adjacent sub-domains. For a fixed-size problem in a distributed memory parallel computer, the communication/computation ratio increases with the number of processors. Although some overlapping between computation and communication is possible, parallel efficiency and speed-up are drastically reduced. Future massively parallel computer systems such as GRID computing will increase the number of processors available, allowing new boundaries to the problem dimensions to be solved.
Consequently, parallel-in-time methods will have a high potential to reduce the computing time that nowadays is only achieved with the standard spatial domain decomposition technique. Several parallel algorithms have been developed in the past to decompose the temporal domain of the problem under consideration [19, 20, 21, 22]. These methods range from space-time multigrid methods to parallel time-stepping methods, but their application to the solution of real unsteady fluid flow problems did not become popular, see e.g. [23]. Recently, Lions et al. presented a new approach to parallelize, across the time domain of the problem under consideration, the solution for the temporal evolution of a parabolic system of equations [24]. The new parallel method was called parareal because the main goal of the initial development was the real-time solution of a problem using a parallel structure. Some modifications of the original parareal algorithm have been introduced by Bal [25] to obtain better stability and performance. The method is based on the alternated use of coarse global sequential solvers with fine local parallel ones, and the calculation proceeds in an iterative prediction-correction fashion over the entire time domain of the problem. Calculation starts with a sequential solution along the time domain of the problem on a coarse time-grid and is followed by an iterative procedure using the coarse time-grid and a finer one. The predictor step is calculated sequentially on the coarse time-grid and the correction step is based on a solution calculated in parallel using the fine time-grid. This iterative procedure provides successive corrections for the problem solution. Some applications of the parareal algorithm have already been successfully performed. The parareal algorithm was applied to molecular-dynamics simulation [26] and quantum control [27], identifying some bottlenecks that contribute to the low parallel efficiency of the method.
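The predictor-corrector structure described above can be sketched on a scalar model ODE. This is a minimal illustration, not the thesis implementation: the coarse propagator is a single explicit Euler step per time slice, the fine propagator is m Euler sub-steps, and the fine solves, which would run one per processor, are here executed in a loop:

```python
import math

# Parareal sketch for du/dt = lam*u on [0, T] with N time slices.
# Correction: U[n+1] = G(Unew[n]) + F(U[n]) - G(U[n]),
# with G the coarse and F the fine propagator over one slice.

lam, T, N, m, u0 = -1.0, 1.0, 10, 20, 1.0
dT = T / N

def coarse(u):                       # one Euler step over a slice
    return u * (1.0 + lam * dT)

def fine(u):                         # m Euler sub-steps over a slice
    dt = dT / m
    for _ in range(m):
        u = u * (1.0 + lam * dt)
    return u

u_ref = u0                           # reference: sequential fine solution
for _ in range(N):
    u_ref = fine(u_ref)

U = [u0] * (N + 1)
for n in range(N):                   # coarse sequential prediction
    U[n + 1] = coarse(U[n])
for _ in range(N):                   # N sweeps guarantee convergence;
    F = [fine(U[n]) for n in range(N)]    # in practice a few suffice
    Unew = [u0] * (N + 1)
    for n in range(N):
        Unew[n + 1] = coarse(Unew[n]) + F[n] - coarse(U[n])
    U = Unew
print(abs(U[N] - u_ref) < 1e-12)     # True: matches the fine solution
```

After at most N correction sweeps the parareal iterate reproduces the sequential fine-grid solution; useful speed-up requires convergence in far fewer sweeps, which is the efficiency issue discussed below.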
The application of the method to solve structure, fluid and coupled fluid-structure model problems has also been shown feasible by Farhat and Chandesris [28]. A previous application of the parallel-in-time method to the solution of the unsteady, incompressible Navier-Stokes equations, reported in Reference [29], indicates that the method can be a promising alternative technique for long time simulations on small spatial domains. The stability of the parallel-in-time method has been addressed by Lions et al. [24] for linear partial differential equations under simplified assumptions, such as the knowledge of the exact form of the solution on the fine time-grid evolution. The stability issue is also theoretically addressed by Farhat and Chandesris [28] and by Staff and Rønquist [30] but, unfortunately, no general CFL-like criterion for the conditional stability exists, even for the one-dimensional convection-diffusion equation. The theoretical investigation of the stability of the parallel-in-time algorithm allowed important conclusions, such as the observed reduction of the conditional stability region of standard explicit schemes. The fulfilment of the stability criteria for isolated fine- and coarse-time-grid sets of equations requires, in the absence of a general criterion, the numerical computation of the stability region. Therefore, numerical experiments were performed to investigate the stability domain when the method is applied to the solution of the one-dimensional transport equation [31]. The conditional stability domain was predicted for several temporal discretization schemes on the coarser time-grid, simulating up to one hundred processors. For the test case investigated, no reduction of the stability domain of the finite difference equations of the parallel-in-time method was detected for the "unconditionally stable" implicit three-level and Euler schemes.
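For reference, the kind of numerical stability scan mentioned above can be illustrated on the sequential scheme alone. The sketch below is a von Neumann check for the explicit Euler / central-difference (FTCS) discretization of the 1D convection-diffusion equation, not the parallel-in-time analysis of Reference [31]; with Courant number c and diffusion number d, the amplification factor for phase angle t is G = 1 - 2d(1 - cos t) - i c sin t, and stability requires |G| <= 1 for all t:

```python
import math

# Numerical scan of the FTCS amplification factor for u_t + a u_x = g u_xx,
# with c = a*dt/dx and d = g*dt/dx^2. The known stability region is
# c^2 <= 2d <= 1 (illustrative check only).

def stable(c, d, n=360):
    for j in range(n + 1):
        t = math.pi * j / n
        re = 1.0 - 2.0 * d * (1.0 - math.cos(t))
        im = -c * math.sin(t)
        if re * re + im * im > 1.0 + 1e-12:
            return False
    return True

print(stable(0.5, 0.4))   # True:  c^2 <= 2d and d <= 1/2 both hold
print(stable(0.5, 0.6))   # False: diffusion limit d <= 1/2 violated
print(stable(1.0, 0.4))   # False: convection limit c^2 <= 2d violated
```

Replacing the sequential update by the parareal update operator and repeating such a scan is, in essence, how the conditional stability domains reported in [31] were mapped out.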
The other schemes considered under the parallel-in-time method (explicit Euler, Adams-Bashforth, Crank-Nicolson and fourth-order Runge-Kutta) displayed important reductions of the standard conditional stability domain for sequential calculations. The extension of the parallel-in-time algorithm to hybrid time and space parallel calculations makes it possible to optimize the speed-up by choosing the domain to parallelize according to the problem dimension under consideration and the number of processors available. A simplified performance model for the parallel-in-time method, taking into account several parameters that contribute to the computing time required to perform a parallel-in-time calculation, is essential to provide a parallel efficiency or speed-up prediction. The prediction of the theoretical parallel efficiency or speed-up is relevant to decide when to use the parallel-in-time method for the solution of a specific problem. Another important issue related to the performance of the parallel-in-time method is the time spent in the communication tasks required by the algorithm. The performance model is validated with solutions of the unsteady Navier-Stokes equations. The comparison of the observed parallel-in-time performance for the solution of an unsteady, incompressible flow problem on a PC-cluster with the performance model prediction will allow verification of the effect of the communication time on the parallel-in-time simulation efficiency. It is believed that in the near future massively parallel computer systems will increase the number of processors available, allowing new boundaries to the problem dimensions. Consequently, time and hybrid (space and time) domain decomposition methods will have a high potential to reduce the computing time that is nowadays achieved with the standard spatial domain decomposition technique.
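One generic form such a performance model might take is sketched below. The formula and cost parameters are assumptions for illustration only (the thesis derives its own model in Chapter 4): N time slices on N processors, K parareal iterations, coarse and fine costs Tc and Tf per slice, and a per-iteration communication cost Tcomm:

```python
# Hypothetical parareal speed-up model: sequential fine solve versus
# (K+1) serial coarse sweeps plus K parallel fine sweeps and communication.

def parareal_speedup(N, K, Tc, Tf, Tcomm=0.0):
    t_seq = N * Tf                                 # sequential fine solve
    t_par = (K + 1) * N * Tc + K * (Tf + Tcomm)    # parareal wall-clock time
    return t_seq / t_par

# Example: 50 slices, 3 iterations, coarse step 100x cheaper than a fine slice.
print(parareal_speedup(50, 3, 0.01, 1.0))   # 10.0
```

The model makes the qualitative trends visible: speed-up is bounded by N/K, degrades as the coarse solver cost approaches the fine one, and degrades further as Tcomm grows, which is exactly the communication effect the PC-cluster experiments are meant to quantify.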
This will have a positive impact not only on the solution of the partial differential equations that CFD deals with, but also in other areas of computer modelling for engineering and science.

1.4 High-order projection methods

Previous sections introduced the methodologies required for parallel fluid flow simulation by spatial or temporal domain decomposition. The discretization of the incompressible, unsteady form of the Navier-Stokes equations, which is introduced in this section, is a common issue for both parallel strategies. The numerical solution of the Navier-Stokes equations, written in primitive variables, for unsteady, incompressible flows faces numerical difficulties due to the lack of a dedicated equation for the pressure temporal evolution. This problem is commonly overcome, apart from pseudo-compressibility and vorticity based methods, by a pressure-velocity coupling method such as the family of projection methods (also called fractional-step or operator-splitting methods) or dedicated algorithms such as PISO [32]. Pressure projection methods are usually preferred to artificial compressibility methods, except for pseudo-transient solutions used to reach a steady-state solution of interest, see [33]. The projection methods were introduced by the pioneering work of Chorin [34] and Temam [35], and several variants of the original method have been presented. All these variants are based on the decomposition of the equation operators. The momentum equations are first updated, calculating an intermediate velocity that, in general, will not satisfy mass conservation. After the solution of the pressure Poisson equation, the intermediate velocity field is corrected in the second step to enforce mass conservation. Many of the schemes use some kind of explicit approximation, making the overall algorithm semi-implicit or explicit, see e.g. [34, 35]. Others use a semi-implicit approximation for the convective terms, making the algorithm unconditionally stable, see e.g. [36, 37, 38].
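The two-step structure described above can be made concrete with a minimal sketch of one explicit projection step on a 2D periodic grid. The choices here (explicit Euler, second-order central differences, an FFT Poisson solve using the central-difference modified wavenumber) are illustrative simplifications, not the schemes used in the thesis:

```python
import numpy as np

# One Chorin-type projection step: intermediate velocity u* from the momentum
# equation, pressure Poisson solve, then a correction enforcing div(u) = 0.

n, dx, dt, nu = 32, 1.0 / 32, 1e-3, 0.01

def ddx(f, axis):                      # 2nd-order central difference
    return (np.roll(f, -1, axis) - np.roll(f, 1, axis)) / (2 * dx)

def lap(f):                            # 5-point Laplacian
    return (np.roll(f, -1, 0) + np.roll(f, 1, 0) + np.roll(f, -1, 1)
            + np.roll(f, 1, 1) - 4 * f) / dx**2

def project(u, v):
    # Step 1: intermediate velocity (explicit convection + diffusion)
    us = u + dt * (-u * ddx(u, 0) - v * ddx(u, 1) + nu * lap(u))
    vs = v + dt * (-u * ddx(v, 0) - v * ddx(v, 1) + nu * lap(v))
    # Step 2: pressure Poisson equation, lap(p) = div(u*)/dt, solved by FFT
    # with the modified wavenumber of the central-difference operator.
    rhs = (ddx(us, 0) + ddx(vs, 1)) / dt
    k = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    s = np.sin(k * dx) / dx
    sx, sy = np.meshgrid(s, s, indexing="ij")
    denom = -(sx**2 + sy**2)
    denom[np.abs(denom) < 1e-12] = 1.0        # mean and Nyquist null modes
    p = np.real(np.fft.ifft2(np.fft.fft2(rhs) / denom))
    # Step 3: correction enforcing discrete mass conservation
    return us - dt * ddx(p, 0), vs - dt * ddx(p, 1)

x = np.arange(n) * dx
u = np.sin(2 * np.pi * x)[:, None] * np.ones(n)   # not divergence-free
v = np.zeros((n, n))
u, v = project(u, v)
div = np.max(np.abs(ddx(u, 0) + ddx(v, 1)))
print(div < 1e-10)   # True: projected field is discretely divergence-free
```

The periodic setting sidesteps the boundary-condition subtleties that, as discussed next, are precisely where the accuracy of projection methods becomes delicate.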
The use of an explicit treatment of convection fluxes and diffusion terms, even though it imposes severe restrictions on the allowed time-step size, requires less storage and computing time per time-step. Consequently, when a detailed flow history is required, explicit methods are often used instead of more stable implicit methods. Another advantage resulting from the use of an explicit method is its straightforward and efficient parallelization. For problems with large meshes, massive parallel computing offers an attractive means to reduce the computing time. Numerous papers have appeared in the literature over the past thirty years discussing projection-type methods for solving the incompressible Navier-Stokes equations. Many peculiarities related to the accuracy of the projection methods have been the focus of extensive research and discussion. Recurring difficulties encountered are the proper choice of boundary conditions for the auxiliary variables, in order to obtain at least second-order accuracy in the computed solution, and the formula for the pressure correction at each time-step. E and Liu [39, 40] made a review analysing several projection methods. They performed a normal mode analysis of the semi-discrete in time Stokes equations employing the first-order Chorin's method and the second-order incremental and non-incremental methods of Bell et al. [37] and Kim and Moin [36], respectively. It revealed that, since it is impossible to satisfy the exact boundary condition for the pressure that follows from the semi-discrete equations, the pressure is polluted either by a spurious boundary layer around boundaries where Dirichlet boundary conditions are prescribed or by high frequency oscillations. Strikwerda and Lee [41] have also analysed the fractional-step methods of Kim and Moin [36], van Kan [38] and Bell et al. [37] for the incompressible Navier-Stokes equations.
Their study shows that the pressure in any projection method can be at best first-order accurate because boundary conditions cannot be exactly satisfied in the projection step. Although both analyses are restricted to implicit schemes for the time advancement step, the conclusions extend to explicit schemes. Brown et al. have also analysed the accuracy of several incremental pressure projection methods [42], identifying the inconsistencies that contribute to reducing the order of accuracy of the pressure. The first-order error appears as a boundary layer in the numerical results. Simple modifications of existing methods were also presented to eliminate first-order errors in the computed pressure near solid boundaries. Despite the controversy related to the order of accuracy of the projection methods, high-order finite difference numerical schemes have been presented, see e.g. [43, 44, 45], and their formal order of accuracy demonstrated through numerical experiments. The advantages of using higher-order accurate methods for the solution of partial differential equations are well known. Spectral methods are widely used for problems with periodic boundary conditions. Higher-order finite difference methods also present significant advantages over lower-order methods. To meet the high accuracy demands of some engineering simulations, high-order spatial discretizations have gained interest. Higher-order accurate methods have better resolution characteristics of the difference approximations. The resolution characteristics, as reviewed by Lele [46], are related to the accuracy with which the difference approximation represents the exact result over the full range of length scales that can be realized on a given mesh. Consequently, higher-order difference methods should require fewer nodes per wavelength than a second-order scheme for a given accuracy.
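The resolution argument can be quantified with the modified wavenumber, the standard tool in Lele's analysis. For u = exp(i k x) a central difference returns i k' u, with k'h = sin(kh) for the second-order scheme and k'h = (8 sin(kh) - sin(2kh))/6 for the fourth-order one; the closer k' is to k, the better the scheme resolves that scale. A small illustrative check:

```python
import math

# Modified wavenumbers of 2nd- and 4th-order central differences.

def kmod2(kh):
    return math.sin(kh)

def kmod4(kh):
    return (8.0 * math.sin(kh) - math.sin(2.0 * kh)) / 6.0

kh = 1.0   # a moderately resolved scale (about 6 points per wavelength)
err2 = abs(kmod2(kh) - kh)
err4 = abs(kmod4(kh) - kh)
print(err2 > err4)   # True: the 4th-order scheme is much closer to exact
```

At kh = 1 the second-order error in the represented wavenumber is roughly five times that of the fourth-order scheme, which is the quantitative content of the "fewer nodes per wavelength" statement above.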
In finite-volume formulations, the resulting high-order finite difference equations are constructed by increasing the spatial accuracy of the flux approximations [47]. For second-order accurate methods, the time derivative of the volume-averaged velocity can be congruently replaced by the time derivative of the cell-center velocity. This procedure is also second-order accurate. In a high-order formulation, it is essential to proceed with a high-order reconstruction of the point-wise velocity field. The reconstruction should be performed, at least, with the same order of accuracy as the other operators to keep the desired formal accuracy of the numerical scheme. A local average-based procedure was introduced by Denaro et al. [48] for the solution of the incompressible Navier-Stokes equations in the framework of Large Eddy Simulation (LES). This approach was developed in References [49] and [50], where a fourth-order finite-volume method, based on the approximate deconvolution of the integral Navier-Stokes equations, is presented. The deconvolved integral Navier-Stokes equations are solved, after discretization in a co-located mesh arrangement, by means of a second-order accurate semi-implicit scheme for the time integration and a pressure-free velocity-pressure decoupling. A deconvolution procedure was also proposed by Pereira et al. [51] in the context of a compact fourth-order fully coupled finite-volume method, where the solution proceeds based on variable cell averages and point-wise values are recovered at the end of the computation. A projection-type fourth-order accurate numerical scheme to solve the integral form of the incompressible Navier-Stokes equations can provide an efficient tool for fluid flow simulations. An explicit formulation for time advancement, like the fourth-order Runge-Kutta scheme, will contribute to an efficient parallel performance of the code.
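The difference between using a cell average as a point value and performing a proper reconstruction can be shown in 1D. For smooth periodic data on cells of width h, the classical de-averaging relation u_i = ubar_i - (ubar_{i+1} - 2 ubar_i + ubar_{i-1})/24 + O(h^4) recovers fourth-order point values, whereas ubar_i itself is only a second-order approximation. The sketch below (an illustration, not the thesis scheme) verifies this on exact cell averages of sin(x):

```python
import numpy as np

# Fourth-order de-averaging (approximate deconvolution) of cell averages.

n = 64
h = 2 * np.pi / n
xl = np.arange(n) * h                       # left cell faces
xc = xl + h / 2                             # cell centers
ubar = (np.cos(xl) - np.cos(xl + h)) / h    # exact cell averages of sin(x)

u2 = ubar                                   # 2nd order: average as point value
u4 = ubar - (np.roll(ubar, -1) - 2 * ubar + np.roll(ubar, 1)) / 24.0

e2 = np.max(np.abs(u2 - np.sin(xc)))
e4 = np.max(np.abs(u4 - np.sin(xc)))
print(e4 < e2)   # True: the reconstructed values are far more accurate
```

This single correction term is exactly the kind of step-by-step de-averaging that the fourth-order scheme of Section 2.6 must apply to keep the formal accuracy of the velocity field.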
1.5 Objectives and contributions

Unsteady fluid flow phenomena are important for a wide range of engineering problems, and tools such as numerical simulation play a vital role in providing solutions. For some applications, such as feedback control processes, it would be beneficial to obtain multidimensional fluid flow solutions faster than real time. However, even without real-time applications in mind, reducing the computing time to solve unsteady flow problems is always beneficial, as it makes it possible to study increasingly larger and more complex problems. The advent of an accurate, efficient and robust method for parallel-in-time calculations will extend the parallel calculation options in the context of CFD. Not only could the parallel-in-time or time-domain decomposition option be selected, but also the hybridization of the time and space parallel strategies. Another important issue related to the numerical simulation of unsteady incompressible flows is the use of high-order accurate methods. For a given accuracy, high-order formulations require less storage and computing time per time-step than a second-order scheme. However, for finite-volume formulations it is essential to include a high-order reconstruction procedure to keep the desired accuracy of the numerical scheme. The main objectives of the work presented in the Thesis can be summarized as follows:

i) to perform the parallel solution of the unsteady, incompressible Navier-Stokes equations by spatial, temporal or hybrid (simultaneous space and time) domain decomposition;

ii) to analyse the consistency, stability, convergence, efficiency and robustness of the temporal domain decomposition method;

iii) to develop a fourth-order accurate, in space and time, finite-volume numerical scheme for the solution of the unsteady, incompressible Navier-Stokes equations.
Some of the work presented in the Thesis has also been published in References [29, 31, 52, 53], where further details and applications can be found.

1.6 Contents

The present Thesis is divided into five Chapters. Chapter 1 introduces the work performed and reported in the Thesis. Chapter 2 describes the numerical methods used for the solution of the unsteady, incompressible Navier-Stokes equations in the framework of a spatial, temporal or hybrid domain decomposition. The governing equations for an unsteady, incompressible fluid flow are presented in Section 2.1. Section 2.2 presents the numerical methods adopted for the parallel solution of the governing equations. The spatial and temporal discretization schemes considered for the parallel-in-time fluid flow simulations are summarized in Sections 2.3 and 2.4, respectively. Section 2.5 is devoted to the pressure correction step of the projection method. A fourth-order accurate finite-volume numerical scheme for the solution of the unsteady, incompressible Navier-Stokes equations is described in detail in Section 2.6. Section 2.7 summarizes the Chapter. Chapter 3 describes the main topics related to the space domain decomposition method for parallel fluid flow simulation. The parallel solution of systems of equations is discussed in Section 3.1. Fluid flow simulations performed with the fourth-order accurate numerical scheme in the framework of the spatial domain decomposition technique are included in Section 3.2 to analyse the accuracy of the method. The two-dimensional Taylor vortex decay problem allows verification of the accuracy of the numerical scheme. Strongly non-linear test cases, the interaction between co- and counter-rotating vortices, were also selected to verify the increase of the numerical scheme accuracy promoted by the inclusion of a high-order de-averaging procedure. Detailed conclusions close the Chapter.
Chapter 4 is devoted to the presentation of the parallel-in-time method for the solution of the unsteady, incompressible Navier-Stokes equations. Section 4.1 includes the detailed presentation of the numerical method applied to the solution of the unsteady, incompressible Navier-Stokes equations. The stability of the method is discussed in Section 4.2. The following Sections are devoted to the analysis of the accuracy of the method and to the proposal of a performance model. The Chapter closes with the presentation of the results of numerical experiments in Section 4.5 and conclusions in Section 4.6. Chapter 5 closes the Thesis with summarizing conclusions in Section 5.1 and suggestions for future work in Section 5.2.

Chapter 2 Numerical methods

This Chapter describes the numerical methods used for the solution of the unsteady, incompressible Navier-Stokes equations in the framework of the spatial, temporal or hybrid domain decomposition techniques. The governing equations for unsteady, incompressible fluid flows are presented in Section 2.1. The main topics related to the numerical methods used for the solution of the unsteady, incompressible fluid flow governing equations are introduced in Section 2.2. Sections 2.3 and 2.4 include the spatial and temporal schemes considered for the discretization of the governing equations. Section 2.5 is devoted to the pressure correction step of the projection method. A fourth-order accurate finite-volume numerical scheme for the solution of the unsteady, incompressible Navier-Stokes equations based on the projection method is derived in Section 2.6. The Chapter closes with a summary in Section 2.7.

2.1 Governing equations

The equations of conservation of mass and momentum describe the viscous flow of a pure isothermal fluid.
For an arbitrary control-volume, conservation of mass requires that the rate of change of mass within the control-volume equals the mass flux crossing the control-surface,

\frac{\partial}{\partial t}\int_{CV} \rho \, dV = -\int_{CS} \rho\,\mathbf{u}\cdot\mathbf{n} \, dS , \quad (2.1)

where \rho, \mathbf{u}, CV, CS and \mathbf{n} denote the density, velocity, control-volume, control-surface and a unit vector normal to the control-surface and directed outwards, respectively. For incompressible flows (constant density), Eq. (2.1) reduces to:

\int_{CS} \mathbf{u}\cdot\mathbf{n}\, dS = 0 . \quad (2.2)

The momentum conservation equation, derived from Newton's second law of motion, for an arbitrary control-volume CV is:

\frac{\partial}{\partial t}\int_{CV} \rho\,\mathbf{u}\, dV + \int_{CV} \nabla\cdot(\rho\,\mathbf{u}\mathbf{u})\, dV = \sum \mathbf{F} , \quad (2.3)

where the contributions to the summation \mathbf{F} come from forces acting at the surface of the control-volume and throughout the volume. For incompressible Newtonian fluid flows, the momentum conservation equation, Eq. (2.3), is:

\frac{\partial}{\partial t}\int_{CV} \rho\,\mathbf{u}\, dV + \int_{CS} \rho\,\mathbf{u}\,(\mathbf{u}\cdot\mathbf{n})\, dS = -\int_{CS} p\,\mathbf{n}\, dS + \int_{CS} \mu\,\nabla\mathbf{u}\cdot\mathbf{n}\, dS + \int_{CV} \rho\,\mathbf{b}\, dV , \quad (2.4)

denoting by \mu the viscosity, \mathbf{b} the body forces per unit mass and p the pressure. For flows accompanied by heat transfer, an equation for energy conservation, usually with the temperature \theta as dependent variable, must be added to complete the set of governing equations. Neglecting the work done by pressure and viscous forces, the equation for the temperature reduces to a scalar conservation equation. The integral form of the equation describing conservation of a scalar quantity \phi reads:

\frac{\partial}{\partial t}\int_{CV} \rho\phi\, dV + \int_{CS} \rho\phi\,\mathbf{u}\cdot\mathbf{n}\, dS = \sum f_\phi , \quad (2.5)

where \sum f_\phi represents the transport of \phi by mechanisms other than convection and any sources or sinks of the scalar. Diffusive transport is described by a gradient approximation. Fourier's law for heat diffusion reads:

f_\theta^{diff} = \int_{CS} \kappa\,\nabla\theta\cdot\mathbf{n}\, dS , \quad (2.6)

where \kappa is the thermal conductivity. Therefore, neglecting the existence of sources or sinks, the heat conservation equation is:

\frac{\partial}{\partial t}\int_{CV} \rho c_p \theta\, dV + \int_{CS} \rho c_p \theta\,\mathbf{u}\cdot\mathbf{n}\, dS = \int_{CS} \kappa\,\nabla\theta\cdot\mathbf{n}\, dS . \quad (2.7)

Considering constant specific heat and thermal conductivity, this equation can be rewritten as:

\frac{\partial}{\partial t}\int_{CV} \theta\, dV + \int_{CS} \theta\,\mathbf{u}\cdot\mathbf{n}\, dS = \int_{CS} \frac{\kappa}{\rho c_p}\,\nabla\theta\cdot\mathbf{n}\, dS . \quad (2.8)

The finite-difference methods consider the differential form of the governing equations, which is obtained as follows. Applying the Gauss theorem to Eq. (2.2), the surface integral may be replaced by a volume integral,

\int_{CV} \nabla\cdot\mathbf{u}\, dV = 0 . \quad (2.9)

Since Eq. (2.9) is valid for any size of the control-volume, it implies that:

\nabla\cdot\mathbf{u} = 0 . \quad (2.10)

The vector form of the momentum equation, Eq. (2.4), is obtained by applying the Gauss divergence theorem to the convective and diffusive terms:

\frac{\partial \rho\mathbf{u}}{\partial t} + \nabla\cdot(\rho\,\mathbf{u}\mathbf{u}) = -\nabla p + \mu\,\nabla\cdot(\nabla\mathbf{u}) + \rho\,\mathbf{b} . \quad (2.11)

The vector form of the temperature convection-diffusion equation is:

\frac{\partial\theta}{\partial t} + \nabla\cdot(\theta\,\mathbf{u}) = \nabla\cdot\left(\frac{\kappa}{\rho c_p}\,\nabla\theta\right) . \quad (2.12)

Under the considered assumptions, in Cartesian coordinates and tensor notation, the differential form of the governing equations for an unsteady, incompressible viscous flow is:

\frac{\partial u_i}{\partial x_i} = 0 , \quad (2.13)

\frac{\partial \rho u_i}{\partial t} + \frac{\partial (\rho u_i u_j)}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \mu\,\frac{\partial^2 u_i}{\partial x_j \partial x_j} + \rho b_i , \quad (2.14)

\frac{\partial \theta}{\partial t} + \frac{\partial (\theta u_j)}{\partial x_j} = \frac{\kappa}{\rho c_p}\,\frac{\partial^2\theta}{\partial x_j \partial x_j} , \quad (2.15)

for mass, momentum and thermal energy conservation, respectively. Considering the Boussinesq approximation, see e.g. Lesieur et al. [54], the density is treated as a constant in the unsteady and convection terms and as a variable in the body force term. Assuming that the density varies linearly with the temperature, the contribution to the body forces will be given by:

(\rho - \rho_0)\, g_i = -\rho_0 g_i \beta (\theta - \theta_0) , \quad (2.16)

where g_i is the ith component of the gravity acceleration, \rho_0 stands for the density at the reference temperature \theta_0 and \beta is the thermal expansion coefficient. When the only body force to be considered is the buoyancy force, this term of the momentum equation is given by:

\rho b_i = -\rho_0 g_i \beta (\theta - \theta_0) . \quad (2.17)
2.2 Numerical solution of the Navier-Stokes equations

The lack of a dedicated pressure evolution equation in the governing set of equations is responsible for the major difficulties encountered in obtaining a time-accurate prediction for an incompressible, unsteady flow problem. When solving the unsteady, incompressible form of the Navier-Stokes equations, the pressure provides the coupling between the momentum and mass conservation equations. This coupled system can be solved iteratively using methods such as SIMPLE [55] or PISO [32], or by the "divide and conquer" approach, which goes under different names depending on the modification: fractional-step, operator splitting, projection method, etc. Both approaches will be considered for the solution of the unsteady fluid flow problems included in Chapter 4 to analyse the properties of the parallel-in-time decomposition method. The SIMPLE formulation was introduced by Patankar and Spalding [56] and described in detail by Patankar [55]. The acronym SIMPLE stands for Semi-Implicit Method for Pressure-Linked Equations. The iterative procedure can be interpreted as a pseudo-transient treatment of the governing equations to obtain a steady-state solution. The SIMPLE procedure, although best suited to steady problems, may be easily extended to unsteady problems. The SIMPLE method will be considered for parallel-in-time fluid flow simulations when larger time-steps are required by the coarse time-grid predictor step. The projection methods are based on the decomposition of the equation operators. The momentum equations are first updated, calculating an intermediate velocity that, in general, will not satisfy mass conservation. This intermediate velocity field is then corrected to enforce mass conservation and the pressure field is adjusted. The incremental pressure-correction scheme is used in the projection methods considered for the unsteady fluid flow simulations.
The "old" pressure gradient is considered in the first step and then corrected in the second step. This procedure became popular after van Kan [38], who proposed a second-order incremental pressure-correction scheme. Another distinctive feature of the numerical schemes devoted to the solution of the Navier-Stokes equations is the way they treat the convective and diffusive terms. The projection methods considered for the numerical simulations reported in Chapters 3 and 4 range from explicit to implicit formulations. The numerical scheme for each flow simulation included in Chapters 3 and 4 will be stated with the definition of the test case.

2.3 Spatial discretization

Two types of grid layout, co-located and staggered grids, may be applied to discretize the appropriate set of equations: equations (2.13), (2.14) and (2.15) for finite-difference formulations, or equations (2.2), (2.4) and (2.8) for finite-volume formulations. Both grid layouts, represented in Fig. 2.1 for the two-dimensional case, will be used for parallel-in-time flow simulation.

Figure 2.1: Co-located grid (left) and staggered grid (right).

In the co-located grid arrangement, all dependent variables are located at the same physical location. The co-located arrangement appears to be more natural and simple, requiring a small amount of interpolation when compared with the discretization on a staggered grid. The computer programming for staggered grids, where velocities are centered between the pressure locations, is more complex than for a co-located arrangement due to the different indexing required by each velocity component. The main reason to use this arrangement is that it prevents the "odd-even coupling" (sometimes referred to as "checkerboarding") between the pressure and the velocity fields, which arises on co-located arrangements.
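The odd-even coupling problem can be demonstrated in a few lines: a checkerboard pressure field is invisible to the second-order central-difference gradient, so on a co-located grid it can contaminate the solution without being felt by the momentum equations. A small illustrative check on a periodic grid:

```python
import numpy as np

# A checkerboard pressure field has an identically zero central-difference
# gradient: p(i+1,j) = p(i-1,j) = -p(i,j) for p = (-1)^(i+j).

n, dx = 8, 1.0
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
p = (-1.0) ** (i + j)                                # checkerboard pressure

dpdx = (np.roll(p, -1, 0) - np.roll(p, 1, 0)) / (2 * dx)
dpdy = (np.roll(p, -1, 1) - np.roll(p, 1, 1)) / (2 * dx)
print(np.max(np.abs(dpdx)), np.max(np.abs(dpdy)))    # 0.0 0.0
```

The staggered arrangement removes this null mode by construction, which is precisely the motivation given above for its use despite the more complex indexing.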
The "odd-even decoupling problem" needs to be addressed when computing the incompressible Navier-Stokes equations on a co-located grid. Various approaches may be used to overcome this decoupling between the pressure and velocity fields. The technique of interpolating the cell-face velocities via "momentum interpolation", first proposed by Rhie and Chow [57], is a popular scheme to achieve this. Another approach, among others, consists of filtering out the oscillations. When discretizing the governing equations on a co-located grid, the solution adopted for this problem follows the method proposed by Ye et al. [58]. For a finite-volume second-order discretization scheme, the method can be outlined as follows:

- The face-center velocity is used to compute the convective flux from each cell in the finite-volume discretization scheme;

- Following the time advance step, the intermediate face-center velocity is computed by interpolating the intermediate cell-center velocity;

- Once the pressure is obtained by solving the pressure Poisson equation, both the cell-center and face-center velocities are updated separately;

- The updated face-center velocity is used to compute the convective flux at the next time step.

In addition to providing a stable form of the discrete equations, the temporal and spatial discretization schemes were selected to preserve a congruent order of accuracy. As will be described in Section 2.4, first-, second- and fourth-order accurate schemes were considered for the temporal advancement of the momentum equations. Therefore, first-, second- and fourth-order accurate schemes were also considered for the spatial discretization. The first-order "upwind" spatial discretization is employed together with the first-order temporal implicit and explicit Euler schemes.
The second-order central difference scheme is considered for the Adams-Bashforth, Crank-Nicolson and three-level implicit time discretization schemes. An advantage of the central difference scheme over non-centered schemes is that it is relatively free of numerical dissipation. The standard second-order accurate staggered-grid finite-difference scheme conserves mass, momentum and kinetic energy [59]. While this improves the accuracy of the scheme, it can also lead to non-physical oscillations if an insufficient grid refinement is used [55]. For the steady one-dimensional convection-diffusion equation, the central difference scheme will give realistic solutions as long as the cell Peclet number,

$$Pe = \frac{u \Delta h}{\gamma} , \qquad (2.18)$$

is kept below two, where $\Delta h$ is the grid cell dimension and $\gamma$ is the kinematic diffusion coefficient. However, it is possible to obtain good results with $Pe > 2$, as long as the oscillations are significantly smaller than the other structures in the flow. When necessary, the deferred correction method proposed by Khosla and Rubin [60] was applied to the convection discretization to avoid non-physical oscillations.

The fourth-order accurate central difference discretization scheme is considered with the fourth-order explicit Runge-Kutta time discretization in the framework of a fourth-order accurate finite-volume numerical scheme that will be derived in Section 2.6.

2.4 Time advancement of momentum equations

A fully explicit scheme for both convection and diffusion terms has the advantage that no matrix inversion is required. However, all explicit methods require consideration of the general stability constraints from linear analysis. The von Neumann diffusive criterion links the time-step and the square of the grid size: the maximum usable time-step is proportional to the characteristic diffusion time, $(\Delta h)^2 / \gamma$, where $\Delta h$ is the minimum grid cell dimension and $\gamma$ is the kinematic diffusion coefficient.
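These two criteria, the cell Peclet number of Eq. (2.18) and the diffusive time-step limit, can be evaluated with a few lines; the parameter values and the 0.5 safety factor below are hypothetical, not thesis data.

```python
# Illustrative check of the cell Peclet number, Eq. (2.18), and the diffusive
# time-step limit; all parameter values here are hypothetical.
u = 1.0        # characteristic velocity
gamma = 0.01   # kinematic diffusion coefficient
dh = 0.05      # minimum grid cell dimension

peclet = u * dh / gamma      # Eq. (2.18) -> 5.0: above the Pe < 2 bound,
print(peclet)                # so a deferred-correction fix may be needed

# Diffusive stability: dt proportional to (dh)**2 / gamma; the 0.5 safety
# factor is an assumption, not a thesis value.
dt_diffusive = 0.5 * dh**2 / gamma
print(dt_diffusive)          # -> 0.125
```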
For convection terms, the maximum time-step is proportional to the characteristic convection time, $\Delta h / u$. This condition is usually described in terms of the Courant-Friedrichs-Lewy number, $CFL = u \Delta t / \Delta h$.

Denoting by $H$ the discrete convection operator, $G$ the discrete gradient operator for pressure and $L$ the discrete diffusion operator, the following explicit and implicit schemes are used in this work for the time advancement of the momentum equations:

(i) Explicit methods:

- Explicit Euler scheme

$$\frac{u^{n+1} - u^n}{\Delta t} + H(u^n) = L(u^n) - G\left(p^{n-1/2}\right) \qquad (2.19)$$

- Adams-Bashforth scheme

$$\frac{u^{n+1} - u^n}{\Delta t} + \left[\frac{3}{2} H(u^n) - \frac{1}{2} H\left(u^{n-1}\right)\right] = \left[\frac{3}{2} L(u^n) - \frac{1}{2} L\left(u^{n-1}\right)\right] - G\left(p^{n-1/2}\right) \qquad (2.20)$$

- Fourth-order Runge-Kutta scheme

The fourth-order Runge-Kutta scheme will be described later, in subsection 2.6.2.

(ii) Implicit methods:

- Implicit Euler scheme

$$\frac{u^{n+1} - u^n}{\Delta t} + H\left(u^{n+1}\right) = L\left(u^{n+1}\right) - G\left(p^{n-1/2}\right) \qquad (2.21)$$

- Crank-Nicolson scheme

$$\frac{u^{n+1} - u^n}{\Delta t} + \left[\frac{1}{2} H\left(u^{n+1}\right) + \frac{1}{2} H(u^n)\right] = \left[\frac{1}{2} L\left(u^{n+1}\right) + \frac{1}{2} L(u^n)\right] - G\left(p^{n-1/2}\right) \qquad (2.22)$$

- Three-level implicit scheme

$$\frac{3u^{n+1} - 4u^n + u^{n-1}}{2\Delta t} + H\left(u^{n+1}\right) = L\left(u^{n+1}\right) - G\left(p^{n-1/2}\right) \qquad (2.23)$$

2.5 Pressure correction

In a projection scheme, after the approximation of the velocity field, $u^*$, obtained by integration of the momentum equations using one of the schemes above, mass conservation is enforced through a pressure correction step given by

$$\int_{CV} \frac{u^{n+1} - u^*}{\Delta t} \, dV = - \int_{CV} \nabla p' \, dV , \qquad (2.24)$$

for a finite-volume discretization scheme, denoting by $p'$ and $u^*$ the pressure increment and the intermediate velocity, respectively. The approximated velocity field is projected onto a subspace of divergence-free velocity fields by requiring that the final velocity field satisfies the integral mass conservation equation:

$$\int_{CS} u^{n+1} \cdot n \, dS = 0 . \qquad (2.25)$$

The integral version of the pressure Poisson equation results:

$$\int_{CS} \nabla p' \cdot n \, dS = \frac{1}{\Delta t} \int_{CS} u^* \cdot n \, dS . \qquad (2.26)$$
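The projection step of Eqs. (2.24) to (2.26) can be sketched serially on a periodic box; this is a minimal illustration only, which uses a spectral (FFT) Poisson solve in place of the iterative solvers used in the thesis and folds $\Delta t$ and $p'$ into a single potential $\varphi$.

```python
import numpy as np

def project(u, v, h):
    """Project (u, v) onto the divergence-free subspace on a periodic box.

    Sketch only: an FFT Poisson solve replaces the iterative solvers used
    in the thesis; the grid is assumed square, uniform and periodic.
    """
    n = u.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=h)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                       # leave the mean (k = 0) mode untouched
    uh, vh = np.fft.fft2(u), np.fft.fft2(v)
    div_h = 1j * kx * uh + 1j * ky * vh  # spectral divergence of the field
    phi_h = -div_h / k2                  # Poisson solve: Laplacian(phi) = div
    phi_h[0, 0] = 0.0
    return (np.fft.ifft2(uh - 1j * kx * phi_h).real,
            np.fft.ifft2(vh - 1j * ky * phi_h).real)

# A pure-gradient field (u, v) = grad(sin x sin y) is removed entirely.
n = 64
h = 2.0 * np.pi / n
x, y = np.meshgrid(np.arange(n) * h, np.arange(n) * h, indexing="ij")
u_star = np.cos(x) * np.sin(y)
v_star = np.sin(x) * np.cos(y)
u_new, v_new = project(u_star, v_star, h)
print(np.abs(u_new).max() < 1e-10, np.abs(v_new).max() < 1e-10)  # -> True True
```

Because the test field is a pure gradient, the projection removes it completely, which is the discrete analogue of enforcing Eq. (2.25).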
The calculated pressure correction $p'$ is then used to correct the velocity field,

$$u^{n+1} = u^* - \Delta t \, \nabla p' , \qquad (2.27)$$

and the pressure field,

$$p^{n+1/2} = p^{n-1/2} + p' . \qquad (2.28)$$

2.6 High-order finite-volume method

The integral form of the governing equations for an isothermal flow, Eq. (2.2) and (2.4), is used to derive a fourth-order accurate finite-volume method. The method can be outlined as follows. The governing equations are discretized on a staggered cartesian uniform mesh. The time advancement of the momentum equations is performed with the classical four-stage Runge-Kutta explicit scheme, and the convective and diffusive fluxes through the control-volume faces are approximated by fourth-order accurate polynomial interpolation and Simpson's rule of integration for high-order accuracy agreement. A high-order de-averaging procedure is required to calculate the time derivative of the volume-averaged velocity congruently with the spatial and temporal discretization schemes. The calculation of the de-averaging coefficients is based on the Taylor series expansion of the integrated velocity values at the cell and its neighbourhood cells.

2.6.1 Spatial discretization

For one control-volume, the integral form of the momentum equation, Eq. (2.4), reads

$$\frac{\partial}{\partial t} \int_{CV} u = \sum_{i \in S} L_i - \sum_{i \in S} H_i - G , \qquad (2.29)$$

where $H_i$ and $L_i$ stand for the convective and diffusive fluxes through the control-surfaces and $G$ represents the pressure source term. For a high-order approximation of the convective and viscous fluxes through the control-surfaces in Eq. (2.29), the surface integrals are calculated by the fourth-order accurate Simpson's rule.

Figure 2.2: Labelling scheme for a 2D grid.

For the two-dimensional case, the integral over $S_e$, see Fig. 2.2, is evaluated as

$$\int_{S_e} f \, ds = \frac{\Delta y}{6} \left(f_{ne} + 4 f_e + f_{se}\right) , \qquad (2.30)$$

where $f$ stands for the convective or diffusive approximations at the required locations.
Therefore, the values of $f$ are required at three locations, the cell face-center and two vertices, for each cell face. The variable values required for the convective or diffusive approximations should be calculated with fourth-order accurate interpolation and derivatives in order to keep the fourth-order approximation of the integral. The following expressions are used for the fourth-order accurate evaluation of the integrand in Eq. (2.30):

$$f_e = \frac{27 f_P + 27 f_E - 3 f_W - 3 f_{EE}}{48} , \qquad (2.31)$$

$$\left(\frac{\partial f}{\partial x}\right)_e = \frac{27 f_E - 27 f_P + f_W - f_{EE}}{24 \Delta x} . \qquad (2.32)$$

For the three-dimensional case, more values of $f$ are required for a fourth-order approximation of the face integral. The control-volume face integral is calculated by

$$\int_{S_e} f \, ds = \frac{\Delta y \Delta z}{36} \left[f_{end} + f_{esd} + f_{enu} + f_{esu} + 4 \left(f_{en} + f_{ed} + f_{es} + f_{eu}\right) + 16 f_e\right] , \qquad (2.33)$$

considering the notation indicated in Fig. 2.3. The required values on the edges and vertices are obtained by fourth-order accurate interpolation, as in the two-dimensional case.

Figure 2.3: Labelling scheme for 3D face integral evaluation.

2.6.2 Time advancement of momentum equations

The time advancement of the momentum equations is performed with the classical explicit four-stage Runge-Kutta scheme. It is generally considered that the fourth-order Runge-Kutta method provides a good balance of computational effort, precision and storage requirements. As an explicit one-step multistage method, it does not need any special requirements to start the calculation. The basic idea of multistage methods is to create a weighted sum of corrections $\Delta\chi$ to the solution at several stages within the time-step, $\chi^{n+1} = \chi^n + C_1 \Delta\chi' + C_2 \Delta\chi'' + C_3 \Delta\chi''' + \ldots$ The coefficients $C_k$ are calculated by matching this expansion with the corresponding Taylor series expansion for the desired order of accuracy.
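Before moving to the time discretization, the face formulas (2.31) and (2.32) can be checked numerically: both are exact for cubic polynomials, which is consistent with their fourth-order accuracy. This is a standalone verification sketch, not thesis code.

```python
dx = 0.1
xW, xP, xE, xEE = -dx, 0.0, dx, 2 * dx   # cell centers; face e sits at dx / 2

f = lambda x: 2 * x**3 - x**2 + 3 * x - 1    # arbitrary cubic test polynomial
df = lambda x: 6 * x**2 - 2 * x + 3          # its exact derivative

fW, fP, fE, fEE = f(xW), f(xP), f(xE), f(xEE)

fe = (27 * fP + 27 * fE - 3 * fW - 3 * fEE) / 48         # Eq. (2.31)
dfe = (27 * fE - 27 * fP + fW - fEE) / (24 * dx)         # Eq. (2.32)

# Both formulas reproduce the cubic exactly at the face location x = dx / 2.
print(abs(fe - f(dx / 2)) < 1e-12, abs(dfe - df(dx / 2)) < 1e-12)  # -> True True
```

Exactness for cubics means the leading truncation error involves the fourth derivative, so the error scales as $O(\Delta x^4)$.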
The Runge-Kutta methods are non-unique because a large number of free parameters is available during the derivation. Considering the initial value problem

$$\frac{d\chi}{dt} = F(t, \chi) , \qquad (2.34)$$

with prescribed initial condition $\chi(t = 0) = \chi_0$, the classical fourth-order Runge-Kutta algorithm is:

$$\Delta\chi' = \Delta t \, F(t^n, \chi^n) , \qquad (2.35)$$

$$\Delta\chi'' = \Delta t \, F\left(t^n + \tfrac{1}{2}\Delta t, \; \chi^n + \tfrac{1}{2}\Delta\chi'\right) , \qquad (2.36)$$

$$\Delta\chi''' = \Delta t \, F\left(t^n + \tfrac{1}{2}\Delta t, \; \chi^n + \tfrac{1}{2}\Delta\chi''\right) , \qquad (2.37)$$

$$\Delta\chi'''' = \Delta t \, F\left(t^n + \Delta t, \; \chi^n + \Delta\chi'''\right) , \qquad (2.38)$$

$$\chi^{n+1} = \chi^n + \frac{1}{6}\left(\Delta\chi' + 2\Delta\chi'' + 2\Delta\chi''' + \Delta\chi''''\right) . \qquad (2.39)$$

For a finite-volume formulation, it is required to perform the integration of the point-wise initial velocity field before starting the time-stepping calculation. After interpolation to the required positions on the control-surfaces, the following expression is used to approximate the integral for the two-dimensional case:

$$\int_{CV} u_i \, dV = \frac{\Delta x \Delta y}{36} \left[(u_i)_{en} + (u_i)_{es} + (u_i)_{wn} + (u_i)_{ws} + 4\left((u_i)_e + (u_i)_w + (u_i)_n + (u_i)_s\right) + 16 (u_i)_P\right] . \qquad (2.40)$$

For the three-dimensional case, the integral is calculated by:

$$\int_{CV} u_i \, dV = \frac{\Delta x \Delta y \Delta z}{6} \left[(u_i)_n + (u_i)_s + (u_i)_w + (u_i)_e + (u_i)_u + (u_i)_d\right] . \qquad (2.41)$$

The application of the Runge-Kutta method for the time-stepping solution of the Navier-Stokes equations requires the evaluation of the convective and diffusive contributions at four stages during each time-step. Denoting by $\hat{u}$ the term $\int_\Omega u$ on the LHS of Eq. (2.29) and by $F(t, u)$ the convective and diffusive terms on the RHS
of the same equation, and by $u'$, $u''$, $u'''$ and $p'$, $p''$, $p'''$ the velocity and pressure variable values after the first, second and third stages, the solution proceeds, according to the projection method with the pressure increment option, as follows:

Stage 1

$$\hat{u}^* = \hat{u}^n + \frac{\Delta t}{2}\left[F(t^n, u^n) - G\left(p^{n-1/2}\right)\right] \qquad (2.42)$$

$$\nabla \cdot \hat{u}' = 0 \qquad (2.43)$$

$$\hat{u}' = \hat{u}^* - \frac{\Delta t}{2}\, G\left(p' - p^{n-1/2}\right) \qquad (2.44)$$

Stage 2

$$\hat{u}^{**} = \hat{u}^n + \frac{\Delta t}{2}\left[F\left(t^{n+1/2}, u'\right) - G\left(p'\right)\right] \qquad (2.45)$$

$$\nabla \cdot \hat{u}'' = 0 \qquad (2.46)$$

$$\hat{u}'' = \hat{u}^{**} - \frac{\Delta t}{2}\, G\left(p'' - p'\right) \qquad (2.47)$$

Stage 3

$$\hat{u}^{***} = \hat{u}^n + \Delta t\left[F\left(t^{n+1/2}, u''\right) - G\left(p''\right)\right] \qquad (2.48)$$

$$\nabla \cdot \hat{u}''' = 0 \qquad (2.49)$$

$$\hat{u}''' = \hat{u}^{***} - \Delta t \, G\left(p''' - p''\right) \qquad (2.50)$$

Stage 4

$$\hat{u}^{****} = \hat{u}^n + \frac{\Delta t}{6} F(t^n, u^n) + \frac{\Delta t}{3} F\left(t^{n+1/2}, u'\right) + \frac{\Delta t}{3} F\left(t^{n+1/2}, u''\right) + \frac{\Delta t}{6} F\left(t^{n+1}, u'''\right) - \Delta t \, G\left(p'''\right) \qquad (2.51)$$

$$\nabla \cdot \hat{u}^{n+1} = 0 \qquad (2.52)$$

$$\hat{u}^{n+1} = \hat{u}^{****} - \Delta t \, G\left(p^{n+1/2} - p'''\right) \qquad (2.53)$$

After each time-advance stage, a pressure correction step is performed to impose mass conservation. The integrated intermediate velocity is used to calculate the mass fluxes for the source terms of the pressure Poisson equation, Eq. (2.26). For a two-dimensional problem, a 25-point stencil is required to obtain consistent discretizations of the gradient and divergence operators in Eq. (2.26). This is a very expensive calculation for an unsteady time-stepping computation, even applying a deferred correction approach. The gradient operator discretization was consequently performed with 9- and 13-point stencils for two-dimensional and three-dimensional domains, see Fig. 2.4 for the two-dimensional case. As will be verified in Chapter 3, the inconsistency introduced here did not deteriorate the formal order of accuracy of the numerical scheme. The solution of the linear system of equations resulting from the finite-volume/difference analogue of Eq. (2.26) is performed by the Bi-CGStab [61] algorithm included in the AZTEC library [62].
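For readers unfamiliar with the solver, a serial, unpreconditioned sketch of the Bi-CGStab algorithm (van der Vorst) is given below; the thesis itself relies on the parallel, preconditioned implementation of the AZTEC library, so this is an illustration only.

```python
import numpy as np

def bicgstab(A, b, tol=1e-10, maxiter=500):
    """Unpreconditioned Bi-CGStab (van der Vorst) for A x = b (sketch only)."""
    x = np.zeros_like(b)
    r = b - A @ x
    r_hat = r.copy()                 # fixed shadow residual
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    for _ in range(maxiter):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v            # intermediate residual
        t = A @ s
        omega = (t @ s) / (t @ t)    # stabilization step
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) < tol:
            break
    return x

# 1D Poisson-like tridiagonal test system (illustrative only).
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = bicgstab(A, b)
print(np.linalg.norm(A @ x - b) < 1e-6)  # -> True
```

In practice the AZTEC implementation combines this iteration with the domain-decomposition preconditioners described above, which is what makes it effective on partitioned grids.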
After the solution of the pressure Poisson equation, the new averaged divergence-free intermediate velocity field is calculated by correcting the intermediate velocity field prediction with the gradient of the computed pressure correction. Finally, a new point-wise velocity field is reconstructed by de-averaging the corrected velocity field. The velocity de-averaging operation is described in the following section.

2.6.3 Velocity de-averaging

The new control-volume cell center velocity is calculated as a weighted sum of the volume-integrated velocity at the cell and its neighbourhood cells,

$$u = \frac{1}{V} \sum_i C_i \hat{u}_i , \qquad (2.54)$$

where $V$ denotes the volume of each cell. The Taylor series expansion of the integrated velocity values at the cell and its neighbourhood cells allows the calculation of the de-averaging coefficients, $C_i$, in Eq. (2.54). The derivation of the de-averaging coefficients was performed with symbolic computing software (MATHEMATICA). Appendix A includes the de-averaging coefficients calculated for two- and three-dimensional cartesian uniform grids and the details of the calculation. The de-averaging coefficients were calculated to obtain sixth-order accuracy for cells far from the boundaries of the domain. Near the boundaries, only fourth-order accuracy is achieved due to the non-symmetric stencil.

Figure 2.4: Nine-point stencil for the pressure increment gradient operator discretization for the pressure Poisson equation in the two-dimensional case.

The benefits to the accuracy of the numerical scheme resulting from the inclusion of the high-order de-averaging procedure will be verified through numerical experiments in Chapter 3.

2.7 Summary

The governing equations and the numerical methods used for the solution of the unsteady, incompressible form of the Navier-Stokes equations in the framework of a spatial, temporal or hybrid domain decomposition technique were presented in this Chapter.
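The idea behind the de-averaging of Eq. (2.54) can be illustrated with a one-dimensional analogue (not the thesis coefficients, which are two- and three-dimensional and derived in Appendix A): Taylor expansion of the cell averages gives the fourth-order formula $u_i \approx (-\bar{u}_{i-1} + 26\,\bar{u}_i - \bar{u}_{i+1})/24$, whose convergence rate can be verified numerically.

```python
import numpy as np

def deaverage_1d(ubar):
    # Point value at the cell center from neighbouring cell averages:
    # u_i = (-ubar_{i-1} + 26 ubar_i - ubar_{i+1}) / 24  (periodic grid here).
    return (-np.roll(ubar, 1) + 26.0 * ubar - np.roll(ubar, -1)) / 24.0

errors = []
for n in (32, 64):
    h = 2.0 * np.pi / n
    xc = (np.arange(n) + 0.5) * h
    # Exact cell averages of sin(x) over each cell [xc - h/2, xc + h/2].
    ubar = (np.cos(xc - h / 2) - np.cos(xc + h / 2)) / h
    errors.append(np.abs(deaverage_1d(ubar) - np.sin(xc)).max())

print(errors[0] / errors[1])  # ~16: halving h reduces the error 2**4 times
```

Taking the cell average as the point value instead (the "second-order de-averaging" discussed in Chapter 3) would only halve the error by a factor of four under the same refinement.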
The first- and second-order schemes considered for the spatial and temporal discretization are widely used for CFD calculations. Consequently, these discretization schemes were only briefly summarized.

A fourth-order accurate finite-volume numerical scheme was developed for the solution of the unsteady, incompressible Navier-Stokes equations. The scheme is based on the projection method. The time advancement of the momentum equations is performed with the classical explicit four-stage Runge-Kutta scheme, and the convective and viscous fluxes through the control-volume faces are approximated by Simpson's rule and fourth-order accurate polynomial interpolation and derivatives. The inclusion of a high-order integration/de-averaging procedure in a high-order finite-volume unsteady, incompressible flow solver is required to approximate the time derivative of the volume-averaged velocity congruently with the accuracy of the other operators. The calculation of the de-averaging coefficients is based on the Taylor series expansion of the volume-integrated velocity at the cell and its neighbourhood cells.

Chapter 3
Space domain decomposition

This Chapter describes the main topics related to the space domain decomposition method for parallel fluid flow simulation. Section 3.1 is devoted to the analysis of the methods for the parallel solution of systems of equations. Section 3.2 includes numerical fluid flow simulations performed with a finite-volume fourth-order accurate numerical scheme in the framework of the spatial domain decomposition technique. The main purpose of the numerical experiments is to evaluate the accuracy increase resulting from the inclusion of a high-order de-averaging step in the numerical scheme. The Chapter closes with detailed conclusions.
3.1 Parallel solution of systems of equations

When the parallelization procedure is based on a spatial grid partitioning strategy, the computational domain is partitioned into several sub-domains, each one assigned to one processor. Each processor holds a sub-grid surrounded by some "ghost" grid points that are duplicates of grid points held by other processors. At the start of each time-step, the calculation on all processors is synchronized and message exchange is performed for the grid points on partition boundaries. For an explicit time advancement method, the parallelization of the calculation procedure for the intermediate velocity fields is straightforward. However, the pressure correction step is a crucial issue for the efficient use of a parallel machine. Figure 3.1 shows schematically the flow chart for a single-step explicit time advancement method.

Figure 3.1: Parallel code flow chart for an incremental-pressure projection method (cycle A exists only for the explicit outer iteration coupling).

The solution of the Poisson equation for pressure has the potential to degrade the performance, i.e. the achieved speed-up, of a parallel algorithm because this equation requires global communication among the processors. The most CPU-time consuming part of explicit projection methods is the solution of a discrete pressure Poisson equation with Neumann-type boundary conditions formulated to satisfy the continuity equation.
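The ghost-point exchange described above can be illustrated serially for two one-dimensional sub-domains (a minimal stand-in for the MPI messages used in practice; all array names are hypothetical): after the exchange, each sub-domain applies its stencil to interior points only and reproduces the serial result.

```python
import numpy as np

def exchange_ghosts(left, right):
    # Each sub-domain stores one ghost point at its inner end; copy the
    # neighbour's boundary interior value into it (serial stand-in for MPI).
    left[-1] = right[1]    # left's right ghost <- right's first interior point
    right[0] = left[-2]    # right's left ghost <- left's last interior point

# Global 1D field split into two sub-domains with one ghost point each.
n = 10
u = np.linspace(0.0, 1.0, n) ** 2
left = np.empty(n // 2 + 1);  left[:-1] = u[: n // 2]    # interior + 1 ghost
right = np.empty(n // 2 + 1); right[1:] = u[n // 2 :]    # 1 ghost + interior

exchange_ghosts(left, right)

# The interior Laplacian computed per sub-domain matches the serial result.
lap_serial = u[:-2] - 2 * u[1:-1] + u[2:]
lap_left = left[:-2] - 2 * left[1:-1] + left[2:]
lap_right = right[:-2] - 2 * right[1:-1] + right[2:]
print(np.allclose(np.concatenate([lap_left, lap_right]), lap_serial))  # -> True
```

The same pattern extends to 2D/3D partitions and wider stencils, at the cost of wider ghost layers and larger messages.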
More than 80% of the total computational time is usually consumed by the iterative solution of the Poisson equation. There are basically three strategies for solving band-diagonal linear systems in parallel: direct methods, whereby data are exchanged at processor boundaries, local solutions are obtained and then joined back together; transpose methods, whereby the data are rearranged to give each processor all of the data it needs to compute a complete solution in serial; and iterative methods, whereby boundary data are exchanged and an initial guess is then iterated to convergence.

The availability of reliable packages, like the AZTEC library [62], for the parallel numerical solution of large linear systems makes the parallelization quickly achievable. The AZTEC library includes a number of Krylov iterative methods, such as conjugate gradient (CG), generalized minimum residual (GMRES) and stabilized bi-conjugate gradient (Bi-CGStab), to solve systems of equations. These Krylov methods can be used in conjunction with various pre-conditioners, such as polynomial pre-conditioners or domain decomposition methods using LU or incomplete LU factorizations within sub-domains.

In general, three different methods can be considered for the sub-domain coupling. With the first one, explicit outer iteration coupling, each sub-system of equations is solved independently and data at the partition boundaries are exchanged after each outer iteration (cycle A in Fig. 3.1). With the second method, explicit inner iteration coupling, data transfer is performed after each inner iteration, achieving a better coupling of the sub-domains at the cost of more data exchange. If the global system of equations is solved, sometimes designated as implicit inner iteration coupling, a strong coupling is achieved with a large amount of data exchange.
Since the coupling method also affects the convergence behaviour, the comparative parallel performance of each one of the methods described above is problem dependent [63]. Although problem dependent, some preliminary tests were performed to evaluate the performance of the solution with the Bi-CGStab algorithm [61] included in the AZTEC library and to compare it with the performance of a solver routine, written in FORTRAN for this purpose, also based on the Bi-CGStab algorithm. A parallel two-dimensional second-order accurate explicit code for the solution of the unsteady, incompressible Navier-Stokes equations was initially developed to conduct the tests. The flow configuration considered for the tests is the Taylor vortex-decay problem [64]. The analytical solution for this flow is:

$$u(x, y, t) = -\cos(x)\sin(y)\,e^{-\sigma t} , \qquad (3.1)$$

$$v(x, y, t) = \sin(x)\cos(y)\,e^{-\sigma t} , \qquad (3.2)$$

$$p(x, y, t) = -\frac{1}{4}\left(\cos(2x) + \cos(2y)\right)e^{-2\sigma t} , \qquad (3.3)$$

for $\sigma = 2/Re$. Computations were performed with $Re = 1$ considering the domain $0 \le x, y \le \pi$. Initial and boundary conditions are prescribed in accordance with Eq. (3.1) to (3.3).

Figure 3.2: Dependence of the computing time on the number of processors (128 × 128 nodes mesh).

Figure 3.2 shows the dependence of the computing time required for the solution of one time-step on the number of processors, for a 128 × 128 nodes mesh. In the Figure, the local solution indicates an explicit outer iteration coupling using the FORTRAN routine, and the local/AZTEC solution indicates an explicit outer iteration coupling using the AZTEC library. The Figure also shows the computing time required by the global solution method performed with the AZTEC library. Figures 3.3 and 3.4 show the parallel speed-up and efficiency obtained with the considered methodologies.
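The two metrics plotted in these figures follow the usual definitions, speed-up $S_p = T_1/T_p$ and parallel efficiency $E_p = S_p/p$; a quick computation with hypothetical wall times (not the thesis measurements) shows how they are obtained.

```python
# Parallel speed-up and efficiency from measured wall times.
# All timing values below are hypothetical, for illustration only.
t1 = 12.0                            # single-processor computing time (s)
timings = {2: 6.4, 4: 3.5, 8: 2.1}   # processors -> wall time (s)

for p, tp in sorted(timings.items()):
    speedup = t1 / tp                # S_p = T_1 / T_p (ideal value: p)
    efficiency = speedup / p         # E_p = S_p / p  (ideal value: 1)
    print(p, round(speedup, 2), round(efficiency, 2))
```

For the hypothetical values above this prints efficiencies decreasing from 0.94 to 0.71, the typical degradation with processor count that Figures 3.4 and 3.6 quantify for the real runs.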
Results showed that the calculation performed with the AZTEC library could be up to about three times faster than the one performed with the FORTRAN routine, even for a small number of mesh nodes. Consequently, the option of using the purpose-written FORTRAN routine to solve the systems of equations was discarded. Another set of tests was conducted to compare the performance of the AZTEC solver for the explicit outer iteration coupling scheme and for the global solution on a more refined mesh. Figures 3.5 and 3.6 show the parallel speed-up and efficiency for a 512 × 512 nodes mesh. The tests allowed the conclusion that it is probably advantageous to solve the systems of equations with the global solution method and the AZTEC library for most cases.

Figure 3.3: Dependence of the achieved speed-up on the number of processors (128 × 128 nodes mesh).

Figure 3.4: Dependence of the parallel efficiency on the number of processors (128 × 128 nodes mesh).

Figure 3.5: Dependence of the achieved speed-up on the number of processors for a 512 × 512 nodes mesh.

Figure 3.6: Dependence of the parallel efficiency on the number of processors for a 512 × 512 nodes mesh.

3.2 High-order finite-volume numerical simulations

A fourth-order accurate finite-volume numerical scheme for the unsteady, incompressible form of the Navier-Stokes equations was derived in Section 2.6.
The spatial domain decomposition technique was applied to parallelize the solver, reducing the computing time. The present Section is devoted to the evaluation and analysis of the proposed numerical scheme. Three test cases are considered for that purpose:

(i) The Taylor vortex-decay problem, which has an analytical solution and allows the evaluation of the order of accuracy of the numerical scheme.

(ii) The second case is the simulation of the interaction between a pair of counter-rotating vortices.

(iii) Finally, the merging process resulting from the interaction between a pair of co-rotating vortices constitutes the last test case.

The main purpose of the simulations is to verify the increase in the accuracy of the numerical scheme promoted by the inclusion of the high-order de-averaging procedure, rather than to investigate the physics of wake vortices interaction.

3.2.1 Two-dimensional Taylor vortex decay problem

The two-dimensional Taylor vortex-decay problem [64] was used to evaluate the accuracy of the presented numerical scheme. The analytical solution for this flow is given by Eq. (3.1) to (3.3). Computations were performed with $Re = 1$ until $t = 0.35$, considering the domain $-\frac{3}{2}\pi \le x, y \le \frac{3}{2}\pi$ indicated in Fig. 3.7, for grid sizes from 16 × 16 up to 256 × 256 nodes. The time-step was set to $10^{-3}$ on the coarser mesh and the initial Courant number is kept constant for the refined meshes. Dirichlet boundary conditions are prescribed in accordance with Eq. (3.1) and (3.2).

Figure 3.7: The computational domain considered for the two-dimensional Taylor vortex-decay problem.

Simulations were also performed with the same parameters but considering the cell center point-wise velocity values as the averaged values for each control-volume. This is hereafter referred to as the second-order de-averaging procedure. The comparison between the results achieved with each formulation allows evaluating the influence on the
accuracy of the method resulting from the inclusion of the high-order de-averaging procedure described in Section 2.6. Figure 3.8 plots the maximum error, $L_\infty$ norm, of the u-velocity and pressure at the final computed time as a function of the mesh refinement. The Figure shows the important error reduction introduced by the inclusion of the high-order de-averaging procedure into the numerical scheme, allowing fourth-order accuracy in both velocity and pressure. The numerical scheme becomes only second-order accurate when the second-order de-averaging procedure is applied.

Figure 3.8: Maximum error, $L_\infty$ norm, of u-velocity and pressure, at the final computed time, as a function of the mesh refinement.

3.2.2 Counter-rotating vortices interaction

The second test case considered is a pair of symmetric two-dimensional viscous counter-rotating vortices. The initial flow field is created by the superposition of two Lamb-Oseen vortices, i.e. axi-symmetric vortices with a Gaussian vorticity distribution:

$$\omega(r) = \frac{\Gamma}{\pi a^2}\,e^{-r^2/a^2} , \qquad (3.4)$$

where $r$ is the radial distance from the center of each vortex, $a$ the core radius and $\Gamma$ the circulation. Symmetry boundary conditions are applied on the left boundary, $x = 0$, and therefore only the right semi-plane is calculated. The velocity on the remaining boundaries is prescribed in accordance with the superposition of the two vortices and kept constant during the simulation. The spatial domain [0, 100] m × [0, 100] m is discretized by a staggered cartesian uniform mesh. The core radius and the
circulation of the vortices are set equal to 1 m and ±10 m² s⁻¹, respectively. The kinematic viscosity is set to 2 × 10⁻⁵ m² s⁻¹ and consequently the Reynolds number based on the vortex circulation is 5 × 10⁵. The initial position of the vortex in the right semi-plane is x = 20 m, y = 50 m, and the simulations were performed up to time t = 100 s. The initial conditions and the computational domain are represented in Fig. 3.9.

Figure 3.9: Initial conditions and computational domain for the two-dimensional viscous counter-rotating vortices test case.

Figure 3.10 shows the dependence of the temporal evolution of the maximum non-dimensionalized vertical velocity component, along a horizontal line going through the vortex center, on the spatial discretization, 128 × 128, 256 × 256 and 512 × 512 grid nodes, and on the order of the de-averaging procedure. As expected, the velocity decay decreases with mesh refinement, and for the 128 × 128 and 256 × 256 spatial discretizations the influence of the de-averaging scheme is important. The temporal velocity decay calculated with the second-order de-averaging procedure on the 512 × 512 spatial mesh, not represented in the Figure, is very similar to the one obtained with the fourth-order procedure.

Figure 3.10: Temporal evolution of the non-dimensionalized maximum vertical velocity component along a horizontal line going through the vortex center.

Figures 3.11 and 3.12 show the predicted vertical velocity component profiles after 100 s for the 128 × 128 and 256 × 256 nodes meshes.
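For reference, the vorticity distribution of Eq. (3.4) corresponds to the azimuthal velocity $v_\theta(r) = \frac{\Gamma}{2\pi r}\left(1 - e^{-r^2/a^2}\right)$, obtained by integrating the vorticity over a disc of radius $r$. A short sketch (illustration only, using the parameters of this test case) recovers two standard Lamb-Oseen properties.

```python
import numpy as np

GAMMA, A = 10.0, 1.0   # circulation (m^2/s) and core radius (m) of the test case

def v_theta(r):
    # Azimuthal velocity induced by the Lamb-Oseen vorticity of Eq. (3.4):
    # the circulation enclosed within radius r, divided by 2 pi r.
    return GAMMA / (2.0 * np.pi * r) * (1.0 - np.exp(-(r / A) ** 2))

# Far from the core the vortex behaves as a potential vortex Gamma/(2 pi r),
# so 2 pi r v_theta recovers the prescribed circulation.
r_far = 10.0 * A
print(v_theta(r_far) * 2.0 * np.pi * r_far)  # -> ~10.0

# The velocity peaks near r ~ 1.12 a, the standard Lamb-Oseen core scale.
rr = np.linspace(0.05, 3.0, 2000) * A
print(rr[np.argmax(v_theta(rr))] / A)        # -> ~1.12
```

This peak radius is what the non-dimensionalized velocity profiles of Figures 3.11 and 3.12 resolve with increasing fidelity as the mesh is refined.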
The comparison with the finer grid solution, 512 × 512 nodes, reveals the better accuracy of the velocity profile predicted with the fourth-order method. Figures 3.10, 3.11 and 3.12 also show that the difference between the predictions obtained with the second- and fourth-order de-averaging procedures decreases with mesh refinement.

Figure 3.11: Comparison of the predicted vertical velocity component profile after 100 s for the 128 × 128 nodes grid with the finer grid solution.

Figure 3.12: Comparison of the predicted vertical velocity component profile after 100 s for the 256 × 256 nodes grid with the finer grid solution.

3.2.3 Co-rotating vortices merging

A second flow example involving the interaction between vortices was simulated to verify the influence of the increased accuracy of the fourth-order numerical scheme on the prediction of a merging process. A pair of two-dimensional Lamb-Oseen vortices with circulation Γ = 100 m² s⁻¹, core size a = 1.2 m and separated by b = 6 m is superposed on the spatial domain [−30, 30] m × [−30, 30] m, large enough to avoid the influence of the boundaries on the vortices interaction. The velocity on the boundaries is prescribed in accordance with the superposition of the two vortices and kept constant during the simulation. The kinematic viscosity is set to 10⁻¹ m² s⁻¹. Figure 3.13 shows the vorticity contours of the initial flow configuration.

Figure 3.13: Initial vorticity contours.

Firstly, calculations were performed considering a 600 × 600 nodes grid.
It was verified that, for such a refined mesh, the results obtained with the second- and fourth-order schemes are the same, see Fig. 3.14. This reference solution was then compared with results obtained with coarser spatial discretizations, 150 × 150 and 300 × 300 nodes, using both the second- and fourth-order schemes. The comparisons of the predicted vorticity contours at five time stages are plotted in Figs. 3.15 and 3.16. The inclusion of the high-order de-averaging scheme promoted an important accuracy increase on the coarser grid solution.

Figure 3.14: Vorticity contours during the merging process at t = 2 s, t = 5 s, t = 6 s, t = 7 s and t = 8 s (second-order de-averaging on the left and fourth-order on the right side).

Figure 3.15 shows that the predictions at t = 5 s and t = 6 s denote important velocity differences in the vortex core, as well as on the vortex filaments at t = 7 s. The predictions obtained with the fourth-order de-averaging are much closer to the reference solution obtained with the 600 × 600 nodes grid. As expected, the differences between the reference solution and the intermediate mesh solution, 300 × 300 nodes, are less detectable for both numerical schemes. Temporal three-dimensional simulations were also performed, applying periodic boundary conditions in the stream-wise direction, for meshes comprising 150 × 150 × 12 and 300 × 300 × 24 nodes.
No differences were detected between the two-dimensional and three-dimensional results for the same number of nodes on the plane normal to the stream-wise direction. Figure 3.17 plots the vorticity contours during the merging process for the 300 × 300 and 300 × 300 × 20 nodes grids. One should note that the prescribed viscosity is high and no perturbation was applied. Consequently, the flow configuration considered is not appropriate to observe the elliptical instability, which is characterized by a three-dimensional deformation of the vortex cores [65].

The differences between the simulations performed with the fourth-order and the second-order numerical schemes are more significant in the following similar two-dimensional merging flow example with different parameters. The circulation of the Lamb-Oseen vortices is reduced to Γ = 10 m² s⁻¹ and the core size to a = 1 m. The spatial domain, [−50, 50] m × [−50, 50] m, is discretized on a 512 × 512 nodes uniform grid. The kinematic viscosity is also reduced, to 1 × 10⁻³ m² s⁻¹. The reductions considered for the core size and the kinematic viscosity delay the merging process. Simulations with the second- and fourth-order numerical schemes were performed, and the predicted vorticity contours at five time stages, t = 60 s, t = 220 s, t = 240 s, t = 260 s and t = 300 s, are plotted in Fig. 3.18. This flow configuration shows that the better velocity conservation of the fourth-order accurate numerical scheme improves the simulations significantly. Figure 3.18 shows important differences in the vortex merging dynamics between the two simulations. After the conclusion of the merging, the vertical velocity component profiles along a horizontal line through the vortex center, presented in Fig. 3.19, are very similar, because the small-scale interaction processes have finished and, consequently, the mesh refinement is now sufficient to achieve similar solutions with both numerical schemes.
Figure 3.15: Comparison of the vorticity contours during the merging process simulation on a 150 × 150 nodes grid at t = 2 s, t = 5 s, t = 6 s, t = 7 s and t = 8 s (second-order de-averaging on the left and fourth-order in the center) with the reference solution on the 600 × 600 nodes grid (right side).

Figure 3.16: Comparison of the vorticity contours during the merging process simulation on a 300 × 300 nodes grid at t = 2 s, t = 5 s, t = 6 s, t = 7 s and t = 8 s (second-order de-averaging on the left and fourth-order in the center) with the reference solution on the 600 × 600 nodes grid (right side).
Figure 3.17: Vorticity contours during the merging process at t = 2 s, t = 5 s, t = 6 s, t = 7 s and t = 8 s for the 300 × 300 (left) and 300 × 300 × 20 (right) nodes grids.

Figure 3.18: Vorticity contours during the merging process at t = 60 s, t = 220 s, t = 240 s, t = 260 s and t = 300 s (second-order de-averaging on the left and fourth-order on the right side).

Figure 3.19: Predicted vertical velocity component profile after the merging (t = 360 s).

3.3 Summary

The main topics related to the spatial domain decomposition method for the parallel simulation of incompressible, unsteady fluid flows were addressed in this Chapter. For explicit projection schemes, the solution of the pressure Poisson equation is the most time-consuming procedure and, consequently, determinant for achieving good parallel performance. The methodology adopted for the parallel solution of the system of equations was selected after numerical tests evaluating the performance of several options for the coupling between sub-domains. The tests allowed the conclusion that it is advantageous to solve the systems of equations with the AZTEC library and the global solution method.
The spatial domain decomposition technique was applied to parallelize the fourth-order accurate finite-volume solver for the unsteady, incompressible form of the Navier-Stokes equations presented in Chapter 2. The proposed fourth-order accurate finite-volume numerical scheme advances the momentum equations in time with the classical four-stage explicit Runge-Kutta scheme and approximates the convective and viscous fluxes on the control-volume faces by Simpson's rule and fourth-order accurate polynomial interpolation. Three test cases were considered to evaluate and analyse the properties of the proposed numerical scheme. The fourth-order accuracy of the numerical scheme was verified with the simulation of the two-dimensional Taylor vortex decay problem. When the time derivative of the cell center velocity replaces the time derivative of the volume-averaged velocity, the global numerical scheme becomes second-order accurate. The second and third test cases comprise the simulation of the interaction between a pair of counter- and co-rotating vortices, respectively. The main purpose of these simulations was to verify the increase in accuracy promoted by the inclusion of the high-order de-averaging procedure, rather than to investigate the physics of wake vortex interaction. Flow simulations comprising the interaction between Lamb-Oseen vortices provided the following conclusions:

(i) The inclusion of the high-order de-averaging scheme promotes an important accuracy increase in the predictions based on coarser grids.

(ii) The inclusion of the high-order accurate de-averaging scheme improves the velocity conservation significantly.

(iii) The differences between the solutions obtained with and without the high-order de-averaging procedure vanish with mesh refinement, and for very refined meshes the predictions are identical.
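The de-averaging relation underlying these results can be illustrated in one dimension by the standard fourth-order deconvolution formula relating a cell average to the cell-center point value, φ_j ≈ φ̄_j − (φ̄_{j−1} − 2φ̄_j + φ̄_{j+1})/24 + O(h⁴). The following is a minimal 1-D sketch of that stencil only, not the multidimensional procedure of the thesis; the function and its domain are illustrative.

```python
import math

def deaverage_4th(avg):
    """Fourth-order point values from cell averages (interior cells only):
    phi_j ≈ avg_j - (avg_{j+1} - 2*avg_j + avg_{j-1}) / 24."""
    return [avg[j] - (avg[j + 1] - 2 * avg[j] + avg[j - 1]) / 24.0
            for j in range(1, len(avg) - 1)]

def max_error(n):
    """De-average exact cell averages of sin(x) on [0, pi] and compare with
    the exact point values at the cell centers."""
    h = math.pi / n
    centers = [(j + 0.5) * h for j in range(n)]
    # exact cell average of sin over [x-h/2, x+h/2]: (cos(x-h/2)-cos(x+h/2))/h
    avg = [(math.cos(x - h / 2) - math.cos(x + h / 2)) / h for x in centers]
    pts = deaverage_4th(avg)
    return max(abs(p - math.sin(x)) for p, x in zip(pts, centers[1:-1]))

e1, e2 = max_error(40), max_error(80)
print(e1 / e2)   # close to 16, confirming fourth-order accuracy
```

Halving the mesh size reduces the de-averaging error by roughly a factor of 16, the fourth-order behaviour the Taylor vortex test verifies for the full scheme.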
A high-order finite-volume solver for unsteady, incompressible flow provides a better simulation of small-scale vortex dynamics after the inclusion of a high-order de-averaging procedure that approximates the time derivative of the volume-averaged velocity congruently with the accuracy of the other operators.

Chapter 4

Time domain decomposition

This Chapter presents the parallel-in-time method for unsteady, incompressible fluid flow simulation. Section 4.1 includes the detailed presentation of the numerical method applied to the solution of the unsteady, incompressible form of the Navier-Stokes equations. When solving the Navier-Stokes equations in this predictor-corrector fashion, some problems may emerge from the use of two temporal grids. The stability of the method is evaluated and discussed in Section 4.2. Another issue that requires attention is the choice of the time integration method used on each time-grid. Section 4.3 is devoted to the analysis of the accuracy and convergence of the method. A performance model for the present method is developed in Section 4.4. Section 4.5 includes the results of numerical experiments. The laminar Taylor vortex-decay problem, the shedding flow behind a square cylinder and the natural convection in a square cavity problem, for which the hybrid space-time parallelization was achieved, were selected to prove the ability of the time-domain decomposition method to provide parallel fluid flow simulations. The Chapter closes with detailed conclusions in Section 4.6.

4.1 Numerical method

Consider an unsteady, incompressible Newtonian fluid flow governed by Eqs. (2.10) to (2.12), and assume that the spatial domain of the problem is divided into control volumes provided, in the present case, by a Cartesian orthogonal mesh. Standard numerical schemes, such as those described in Chapter 2, are used for spatial
and temporal discretization. The parallel-in-time method requires two temporal grids. The present hybrid formulation of the parallel-in-time algorithm is represented schematically in Fig. 4.1. The Figure shows the time domain decomposed into time-slices, with the entire space domain at each time-slice decomposed into spatial sub-domains.

The time interval [0, T ] of the problem under consideration is decomposed into a sequence of L sub-domains of size ∆t = T /L, which will be called the coarse time-grid. For the present purpose, it is sufficient to consider that an operator G∆t performs the solution of the momentum and energy equations, together with the pressure correction Poisson equation, for a single time-step ∆t. Another operator, Fδt, is required for the parallel-in-time solution. This operator also corresponds to the solution of the momentum and energy equations together with the pressure correction Poisson equation for the time-span ∆t, but uses a finer time-grid: Fδt corresponds to the sequential solution of M time-steps of size δt = ∆t/M, for some integer M. When more than one processor is allocated to each time-slice, the communication of the variable values on the spatial sub-domain boundaries is required, as in the standard spatial domain decomposition technique.

Denoting by χ0 the initial velocity and temperature fields and by (χ1, ..., χL) the successive fields at ti = i ∆t, where i is the temporal sub-domain number, the parallel solution proceeds as follows:

(i) Initialization. A coarse time-grid solution is obtained sequentially for the entire time domain of the problem. The processors assigned to each time sub-domain solve the spatial field for a single time-step, ∆t, and communicate the solution to the processors assigned to the next time sub-domain,

\chi_i^0 = G_{\Delta t}\left(\chi_{i-1}^0\right), \qquad \chi_0^0 = \chi_0,   (4.1)

for all time instances i = 1, 2, ..., L, the superscript being the iteration counter.
This sequential solution on the coarse time-grid requires correction, provided by the iterative scheme that follows.

Figure 4.1: Space-time decomposition parallel solver schematic diagram (sequential coarse time-grid solution and parallel fine time-grid solution).

(ii) Iterative Procedure. The initial solution obtained previously is used to start an iterative procedure. Each iteration includes a parallel calculation using the fine time-grid and a sequential one on the coarse time-grid. Firstly, the parallel calculation is performed on the finer time-grid,

\psi_i^k = F_{\delta t}\left(\chi_{i-1}^{k-1}\right), \qquad \chi_0^k = \chi_0,   (4.2)

for 1 ≤ i ≤ L and k ≥ 1. The operator Fδt denotes the parallel solution of the momentum and energy equations, together with the pressure correction Poisson equation, on the finer time-grid for M time-steps of size δt from ti−1 to ti. Once the parallel solution on the finer time-grid is completed, the solution jumps at ti are calculated by each processor as the difference between the new solution calculated on the finer time-grid and the solution on the coarse time-grid at the previous iteration,

S_i^k = \psi_i^k - \chi_i^{k-1}.   (4.3)

The solution jumps are also evaluated in parallel, because no information is required from other time instances. Finally, a new sequential solution is calculated. For 1 ≤ i ≤ L, a solution is predicted using the coarse time-grid solver,

\tilde{\chi}_i^k = G_{\Delta t}\left(\chi_{i-1}^k\right),   (4.4)

corrected by the solution jumps,

\chi_i^k = \tilde{\chi}_i^k + \sum_{l=1}^{k} S_i^l,   (4.5)

and communicated to the processors assigned to the next time-slice. Time-marching sequential and parallel tasks are indicated in Fig. 4.1, showing that some overlap between them is possible, thus contributing to reduce the sequential tasks overhead.
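The complete cycle of Eqs. (4.1)–(4.5) can be sketched on a scalar model problem. The sketch below is an illustrative assumption, not the thesis solver: it takes dφ/dt = λφ, an implicit Euler step as G and explicit Euler sub-steps as F, and runs the "parallel" fine sweeps sequentially.

```python
# Model problem: d(phi)/dt = lam * phi. G is one implicit Euler step of size
# dT (coarse grid); F is M explicit Euler sub-steps of size dt (fine grid).
lam = -1.0
T, n_slices, M = 2.0, 4, 20
dT = T / n_slices
dt = dT / M

def G(phi):
    # coarse propagator over one time-slice, used in Eqs. (4.1) and (4.4)
    return phi / (1.0 - lam * dT)

def F(phi):
    # fine propagator over one time-slice, Eq. (4.2)
    for _ in range(M):
        phi *= 1.0 + lam * dt
    return phi

phi0 = 1.0
# (i) Initialization: sequential coarse sweep, Eq. (4.1)
chi = [phi0] + [0.0] * n_slices
for i in range(1, n_slices + 1):
    chi[i] = G(chi[i - 1])

# (ii) Iterative procedure, Eqs. (4.2)-(4.5); the fine sweep and the jump
# evaluation are the parallel tasks (done sequentially in this sketch)
jumps = [[] for _ in range(n_slices + 1)]
for k in range(1, n_slices + 1):
    psi = [F(chi[i - 1]) for i in range(1, n_slices + 1)]   # Eq. (4.2)
    new = [phi0] + [0.0] * n_slices
    for i in range(1, n_slices + 1):
        jumps[i].append(psi[i - 1] - chi[i])                # Eq. (4.3)
        new[i] = G(new[i - 1]) + sum(jumps[i])              # Eqs. (4.4)-(4.5)
    chi = new

# Reference: sequential fine-grid integration over the whole interval
ref = phi0
for _ in range(n_slices * M):
    ref *= 1.0 + lam * dt
print(abs(chi[n_slices] - ref))   # converges to the sequential fine solution
```

After k iterations the first k time-slices carry the exact fine-grid solution, which is the property exploited below: at iteration k the solution at t = t_k needs no further correction.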
For each time-span, the iterative procedure starts when the first initializing time-step calculation is completed. The parallel speed-up achieved with the algorithm described above is strongly dependent on the number of iterations performed and on the computational time spent on sequential tasks. One should note that at iteration k the solution at time t = tk needs no further correction, because the final solution has been found.

4.2 Numerical stability

The application of the classical linear stability analysis to this algorithm faces difficulties, because two time-marching solutions on different time-grids are used in a predictor-corrector iterative fashion. Therefore, the stability domain for several time integration schemes was evaluated by numerical experiments using the one-dimensional transport equation model,

\frac{\partial \phi}{\partial t} + u \frac{\partial \phi}{\partial x} - \alpha \frac{\partial^2 \phi}{\partial x^2} = 0,   (4.6)

applied to solve the propagating scalar front problem indicated in Fig. 4.2. At t = 0 a sharp front is located at x = 0. For subsequent times the front convects to the right with speed u and its profile loses sharpness under the influence of the diffusivity α. Different numerical schemes were tested, comprising the three-level implicit scheme, the implicit and explicit Euler schemes, the Crank-Nicolson scheme, the Adams-Bashforth scheme and the fourth-order Runge-Kutta scheme. Spatial discretizations of first-, second- and fourth-order accuracy were used together with these temporal discretization schemes to provide a stable formulation for the finite-difference analogue of Eq. (4.6). The first-order upwind spatial discretization is used together with the first-order temporal schemes. For the second-order temporal schemes, the spatial discretization is performed by second-order central differences. The fourth-order Runge-Kutta scheme is used together with the fourth-order central differences scheme.
The one-dimensional unsteady convection-diffusion problem will also be used to illustrate the application of the parallel-in-time algorithm. For the sake of simplicity, only the temporal domain decomposition is considered here. Firstly, the time domain of the problem under consideration is divided by P, the number of processors devoted to the simulation, which defines the coarse time-grid step size, ∆t. The fine time-grid step size, δt, should be chosen to satisfy the stability constraints of the numerical scheme to be applied for the temporal evolution on this grid. Another issue related to the choice of this numerical scheme is that it should allow the improvement of the solution accuracy. The implicit Euler scheme together with the upwind spatial discretization is considered here for the coarse time-grid. A ratio between the coarse and fine time-grid step sizes equal to 20 allows the use of the second-order accurate explicit Adams-Bashforth scheme for the fine time-grid integration, along with the second-order central differences scheme for the spatial discretization. Under these assumptions, the application of the parallel-in-time method consists of the following calculations.

Figure 4.2: Propagating scalar front problem considered for the numerical evaluation of the stability domain (solution for t = 1, u = 0.125 and α = 10⁻³, after two iterations).

After discretization of Eq. (4.6), the resulting algebraic equation at node j for positive u reads:

\frac{\phi_j^{i+1} - \phi_j^i}{\Delta t} + \frac{u\left(\phi_j^{i+1} - \phi_{j-1}^{i+1}\right)}{\Delta x} - \frac{\alpha\left(\phi_{j-1}^{i+1} - 2\phi_j^{i+1} + \phi_{j+1}^{i+1}\right)}{\Delta x^2} = 0,   (4.7)

where the superscript i refers to the final time of the temporal sub-domain assigned to each processor. The processor allocated to the first time-span solves Eq.
(4.7) to the required stopping criterion for the residuals, which yields the coarse time-grid prediction of the variable field at the final time instance of the first time-span. Then, the processor transmits this variable field to the processor assigned to the following time-span. That processor receives the prediction, initializes the variable field with it and solves Eq. (4.7) for the next coarse grid time-step. This procedure continues throughout the entire time domain of the problem.

After conclusion of the tasks comprised in the initializing step, each processor is able to start iterating. The iteration begins with a fine time-grid solution. For each processor, this corresponds to solving sequentially, for 20 time-steps of size δt = ∆t/20, the algebraic equation resulting from the discretization of Eq. (4.6). The calculation starts from the same local initial variable field already considered for the coarse time-grid prediction, see Fig. 4.3. The algebraic equation for the fine time-grid evolution reads:

\frac{\phi_j^{m+1} - \phi_j^m}{\delta t} + \frac{3}{2}\left(\frac{u\left(\phi_{j+1}^m - \phi_{j-1}^m\right)}{2\Delta x} - \frac{\alpha\left(\phi_{j-1}^m - 2\phi_j^m + \phi_{j+1}^m\right)}{\Delta x^2}\right) - \frac{1}{2}\left(\frac{u\left(\phi_{j+1}^{m-1} - \phi_{j-1}^{m-1}\right)}{2\Delta x} - \frac{\alpha\left(\phi_{j-1}^{m-1} - 2\phi_j^{m-1} + \phi_{j+1}^{m-1}\right)}{\Delta x^2}\right) = 0,   (4.8)

where the superscript m refers to the time-step counter on the fine time-grid. This fine time-grid calculation allows the evaluation of the solution jumps at the local final time. This procedure, described by Eq. (4.3), only requires local values of the variable and, consequently, is also performed in parallel. The iteration concludes with a sequential coarse time-grid calculation similar to the initializing procedure, with one important difference: the variable field prediction at the final time instance, obtained previously by each processor on the coarse time-grid, is now corrected by the solution jumps, applying Eq. (4.5), prior to its transmission to the processor assigned to the next time-span.
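The two single-step updates, Eqs. (4.7) and (4.8), can be sketched as follows. The Thomas tridiagonal solver and the fixed Dirichlet end values are illustrative assumptions; the function names are hypothetical.

```python
def thomas(a, b, c, d):
    """Solve the tridiagonal system a[j]x[j-1] + b[j]x[j] + c[j]x[j+1] = d[j]."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for j in range(1, n):
        m = b[j] - a[j] * cp[j - 1]
        cp[j] = c[j] / m
        dp[j] = (d[j] - a[j] * dp[j - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for j in range(n - 2, -1, -1):
        x[j] = dp[j] - cp[j] * x[j + 1]
    return x

def coarse_step(phi, u, alpha, dT, dx):
    """One implicit Euler step of Eq. (4.7), first-order upwind for u > 0."""
    n = len(phi)
    cfl, d = u * dT / dx, alpha * dT / dx ** 2
    a = [0.0] + [-(cfl + d)] * (n - 2) + [0.0]      # lower diagonal
    b = [1.0] + [1.0 + cfl + 2.0 * d] * (n - 2) + [1.0]
    c = [0.0] * n                                   # upper diagonal
    for j in range(1, n - 1):
        c[j] = -d
    return thomas(a, b, c, phi[:])                  # boundary rows keep phi fixed

def fine_step(phi_m, phi_m1, u, alpha, dt, dx):
    """One Adams-Bashforth step of Eq. (4.8), central differences in space."""
    def rate(p, j):   # -u*phi_x + alpha*phi_xx at node j
        return (-u * (p[j + 1] - p[j - 1]) / (2 * dx)
                + alpha * (p[j - 1] - 2 * p[j] + p[j + 1]) / dx ** 2)
    out = phi_m[:]
    for j in range(1, len(phi_m) - 1):
        out[j] = phi_m[j] + dt * (1.5 * rate(phi_m, j) - 0.5 * rate(phi_m1, j))
    return out
```

A quick consistency check: both updates leave a uniform field unchanged, since the convective and diffusive differences then vanish.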
Repeating the same procedure for the first time-span obviously leads to the same solution at the local final time instance. Therefore, the second iteration begins with the solution of the second coarse grid time-span. Figure 4.2 shows the solution of the propagating scalar front problem for t = 1, u = 0.125 and α = 10⁻³, after two iterations.

Preliminary tests showed that the stability domain of the schemes decreases with an increasing number of processors and number of iterations. The numerical experiments were performed on a single processor but, in accordance with the described parallel-in-time algorithm, simulating the solution on one hundred processors. The stability domain was evaluated considering up to 5, 20 and 100 iterations, using the same numerical scheme for both time-grids. A persistent increase of the solution error during the iterative procedure, leading to unrealistic solutions, was used as the criterion for admissible pairs of the CFL number, CFL = u∆t/∆x, and the diffusive parameter, d = α∆t/∆x².

Figure 4.3: Sequential coarse time-grid and parallel fine time-grid solutions.

Figure 4.4 shows an example in which the error increases dramatically, by several orders of magnitude, but the solution does not blow up and recovers its accuracy after a large number of iterations. These situations are case dependent and not completely understood. Figure 4.5 shows the conditional stability domain for the Crank-Nicolson, explicit Euler, Adams-Bashforth and fourth-order Runge-Kutta time integration schemes. The results of the numerical experiments show a stability domain reduction for the explicit schemes when compared with their standard conditional stability domain obtained by Fourier analysis [66].
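For reference, the standard single-grid Fourier (von Neumann) bound against which the parallel-in-time results are compared can be reproduced by scanning the amplification factor. The sketch below assumes the explicit Euler plus first-order upwind pair, whose amplification factor is G(θ) = 1 − c(1 − e^{−iθ}) + 2d(cos θ − 1).

```python
import cmath
import math

def max_amplification(c, d, samples=720):
    """Largest |G(theta)| over phase angles for explicit Euler + first-order
    upwind: G = 1 - c*(1 - e^{-i theta}) + 2*d*(cos(theta) - 1)."""
    gmax = 0.0
    for s in range(samples + 1):
        theta = math.pi * s / samples
        g = (1.0 - c * (1.0 - cmath.exp(-1j * theta))
             + 2.0 * d * (math.cos(theta) - 1.0))
        gmax = max(gmax, abs(g))
    return gmax

# Standard single-grid condition for this scheme pair: CFL + 2d <= 1
print(max_amplification(0.5, 0.2))   # inside the stability domain: <= 1
print(max_amplification(0.8, 0.2))   # outside: > 1
```

This reproduces only the serial stability boundary; the numerical experiments of this Section show that the iterative two-grid procedure shrinks that boundary further for the explicit schemes.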
No reduction of the stability domain was detected for the implicit, "unconditionally stable" three-level and Euler schemes. In addition, no influence of the finer time-grid step size on the stability domain was observed.

Figure 4.4: Error dependence on the iteration number near the stability boundary (fourth-order Runge-Kutta scheme and diffusive criterion equal to 0.2).

Figure 4.5: Stability domain for the explicit Euler (a), Adams-Bashforth (b), Crank-Nicolson (c) and fourth-order Runge-Kutta (d) schemes.

4.3 Accuracy of the parallel-in-time numerical scheme

The accuracy of the iterative algorithm for linear parabolic differential equations was addressed by Lions et al. [24] and Bal and Maday [25], considering the exact form of the solution on the fine time-grid, which corresponds to a simplification of the model. The iterative scheme is ideally of order m × (k + 1), where m is the accuracy order of the numerical scheme considered for the coarse time-grid solution and k is the iteration counter.
It is important to analyse how the iterative numerical scheme behaves with some spatial and temporal discretization schemes commonly used for the solution of the Navier-Stokes equations. For this purpose, the one-dimensional scalar transport equation, Eq. (4.6), is applied to solve the propagating scalar pulse problem and analyse the accuracy of the parallel-in-time method. The problem consists of a one-dimensional domain from x = 0 to x = 2, through which the fluid velocity is u = 0.25. The diffusivity was set to 10⁻³. The initial condition corresponds to a Gaussian pulse with unit peak amplitude,

\phi(x, 0) = e^{-(x-u)^2 / 4\alpha},   (4.9)

centered at x = 0.25, see Fig. 4.6. The time-dependent solution of this problem [67],

\phi(x, t) = \frac{1}{\sqrt{1+t}}\, e^{-(x-u(1+t))^2 / 4\alpha(1+t)},   (4.10)

allows the evaluation of the error of the numerical solution. The time domain considered for this purpose is T = 2. The temporal discretization schemes range from first- to fourth-order accurate. Spatial discretizations of first-, second- and fourth-order accuracy were used together with the temporal discretization schemes to provide a stable formulation for the finite-difference analogue of Eq. (4.6). The first-order upwind spatial discretization is used together with the first-order temporal schemes (implicit and explicit Euler). For the second-order temporal schemes (Adams-Bashforth, Crank-Nicolson and three-level implicit), the second-order central differences scheme is applied for the spatial discretization, and the fourth-order Runge-Kutta explicit scheme is used together with the fourth-order central differences scheme.

Figure 4.6: Initial condition (t = 0) and the iterative approximation to the exact solution at t = 2.
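Equation (4.10) can be checked directly by evaluating the finite-difference residual of Eq. (4.6) on the closed-form solution. A small sketch with the stated parameters follows; the step size h and the sample points are illustrative choices.

```python
import math

U, ALPHA = 0.25, 1e-3   # velocity and diffusivity from the text

def phi(x, t):
    """Exact solution, Eq. (4.10); at t = 0 it reduces to Eq. (4.9)."""
    s = 1.0 + t
    return math.exp(-(x - U * s) ** 2 / (4.0 * ALPHA * s)) / math.sqrt(s)

def residual(x, t, h=1e-4):
    """Central-difference residual of Eq. (4.6) evaluated on Eq. (4.10)."""
    dphi_dt = (phi(x, t + h) - phi(x, t - h)) / (2 * h)
    dphi_dx = (phi(x + h, t) - phi(x - h, t)) / (2 * h)
    d2phi_dx2 = (phi(x + h, t) - 2 * phi(x, t) + phi(x - h, t)) / h ** 2
    return dphi_dt + U * dphi_dx - ALPHA * d2phi_dx2

# The pulse center at t = 1 sits at x = u(1 + t) = 0.5; the residual there
# and nearby is at the finite-difference truncation level
print(max(abs(residual(0.5 + dx, 1.0)) for dx in (-0.05, 0.0, 0.05)))
```

The residual vanishes to truncation accuracy, confirming that Eq. (4.10) is an exact solution of the model equation and a valid error reference.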
The numerical tests were conducted considering spatial meshes ranging from nx = 50 to nx = 1600 nodes, keeping the CFL number constant. The number of processors used for the calculations therefore depends on the spatial mesh, and P = nx/2 yields a CFL number equal to 5 × 10⁻¹ and 5 × 10⁻² on the coarse and fine time-grids, respectively. The number of time-steps on the finer time-grid was set to 10 × P, to obtain the accurate solution on the finer time-grid required by the iterative procedure. The parallel-in-time calculations were performed on a single processor, in accordance with the described algorithm, simulating the required number of processors (up to 800).

Firstly, sequential calculations were performed to evaluate the accuracy of each numerical scheme on the finer time-grid. Figure 4.7 shows the dependence of the L2 norm of the error on the mesh size for several temporal schemes, including also the first-, second- and fourth-order slopes. The Figure shows that only the second- and fourth-order numerical schemes are within the asymptotic convergence region of the discretization.

Figure 4.7: Dependence of the L2 error norm of the sequential solution on the fine time-grid on the mesh resolution.

The dependence of the L2 norm of the error on the spatial discretization and on the number of iterations performed is indicated in Figs. 4.8 to 4.10 for different pairs of numerical schemes considered for the coarse and fine time-grid temporal evolution. Among the schemes tested, the fourth-order Runge-Kutta scheme is the best candidate for the fine time-grid integrator.
Figure 4.6 displays the initial condition (t = 0) and the evolution of the solution during the iterative procedure when the first-order implicit Euler scheme is used for the temporal evolution on the coarse time-grid. The evolution of the error with the spatial discretization, plotted in Fig. 4.8, shows that the order of accuracy increases with the number of iterations but does not reach the formal convergence order (k + 1), because the first-order scheme is not in the asymptotic convergence region.

Figure 4.8: Error dependence on the spatial discretization and number of iterations using the implicit Euler and fourth-order Runge-Kutta schemes on the coarse and fine time-grids, respectively.

The same behaviour can be detected when using the second-order accurate Crank-Nicolson scheme for the coarse time-grid solution. However, after the second or third iteration, depending on the spatial discretization, no further improvement of the solution accuracy is obtained, because the accuracy of the fine time-grid solution has already been acquired, see Fig. 4.9. The second-order accurate Adams-Bashforth explicit scheme is also a good candidate for the fine time-grid solution because, although only second-order accurate, it is computationally less expensive than the Runge-Kutta scheme. For the problem under consideration, convergence is achieved after four iterations when the Adams-Bashforth scheme is used together with the
implicit Euler scheme on the coarse time-grid, see Fig. 4.10.

Figure 4.9: Error dependence on the spatial discretization and number of iterations using the Crank-Nicolson and fourth-order Runge-Kutta schemes on the coarse and fine time-grids, respectively.

Figure 4.10: Error dependence on the spatial discretization and number of iterations using the implicit Euler and Adams-Bashforth schemes on the coarse and fine time-grids, respectively.

The selection of the numerical schemes for the temporal evolution on each time-grid, besides the stability constraints, should consider that the finer time-grid numerical scheme must provide a higher order of convergence than the one used on the coarse time-grid. When the fourth-order accurate numerical scheme is used on the fine time-grid, the convergence of the iterative parallel-in-time method requires fewer iterations with a second-order accurate scheme on the coarse time-grid than with an unconditionally stable first-order scheme.

The speed-up obtained with the parallel-in-time method is strongly dependent on the number of iterations required to meet the convergence criterion. The number of iterations needed to acquire the accuracy of the fine-grid solution is obviously dependent on the nature of the numerical schemes applied. Considering the general theory related to the convergence of numerical schemes, one can approximate the number of iterations required for convergence by the ratio between the orders of accuracy of the numerical schemes used for the coarse and fine time-grid solutions.
This prediction is clearly verified when using the second-order accurate Crank-Nicolson and the fourth-order Runge-Kutta schemes for the coarse and fine time-grid solutions, respectively. When using the formally first-order implicit Euler scheme for the coarse time-grid solution, one must consider that the order of accuracy evaluated in the interval between ∆x = 0.00125 and ∆x = 0.005 on the initial approximation is only 0.61. The predicted numbers of iterations required to achieve the accuracy of the Runge-Kutta and the Adams-Bashforth fine time-grid solutions are therefore 7 and 4, respectively.

4.4 Performance model

Along with the number of iterations required for convergence, the parallel efficiency of the method is also strongly dependent on the time spent on the communication tasks required by the algorithm. The following is an attempt to derive theoretically the parallel speed-up of the present method, albeit with some simplifying assumptions, in order to evaluate the potential of the technique for the unsteady, incompressible Navier-Stokes equations. The comparison between the performance prediction and the speed-up effectively achieved on a PC-cluster will allow one to verify how relevant the time spent on communication tasks is for the efficiency of the parallel-in-time method.

Figure 4.11: Parallel-in-time solver schematic diagram (processors 1 to P along the time variable; sequential solution on the coarse time-grid, parallel solution on the fine time-grid).

Figure 4.11 displays two iterations of the cycle. The computing time required to perform the first iteration is denoted by T1. One should note that after the first iteration the solution at time t1 corresponds to the final solution at this time instance. More generally, at iteration k the solution at time tk does not need
further iteration because the final solution is achieved. Considering the overlap between sequential and parallel tasks, indicated in Fig. 4.11, and neglecting the communication time, the total computing time of a parallel-in-time solution using P processors, Γ_P, can be predicted by:

Γ_P = Γ_seq + k (Γ_seq + Γ_1) / P ,   (4.11)

where Γ_seq denotes the computing time of the sequential solution on the coarse time-grid, Γ_1 the computing time on a single processor and k the number of iterations prescribed. The parallel speed-up, Eq. (1.1), of a computation based on the time domain decomposition can then be predicted by:

S(P) = Γ_1 / [Γ_seq + k (Γ_seq + Γ_1) / P] .   (4.12)

Assuming the same computing time for one time-step on the fine and coarse time-grids, the maximum expected speed-up is approximated by:

S(P) = M P / [P + k (1 + M)] ,   (4.13)

where M is the ratio between the coarse and the fine time-grid step sizes.

Figure 4.12 shows the predicted speed-up of a parallel-in-time calculation, Eq. (4.13), as a function of the number of processors and of the ratio between the coarse and fine time-grid step sizes, when two iterations of the parallel-in-time algorithm are required to meet the convergence criterion. The figure shows that the speed-up achieved for small time-step ratios is negligible. However, an important parallel speed-up can be predicted for high time-step ratios. Similar conclusions can be derived for other numbers of iterations, k. For high time-step ratios, the speed-up always increases with the number of processors and is limited by P/k. The behaviour of parallel-in-time processing is rather different from the space domain decomposition where, for a fixed spatial dimension of the problem, the speed-up has a limit imposed by the computation/communication time ratio and no further speed-up improvement can be achieved by increasing the number of processors involved.
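Equation (4.13) is simple enough to evaluate directly; the sketch below (function name ours) reproduces the two trends noted above, the negligible gain at small time-step ratios and the approach to the P/k limit at large ones:

```python
def predicted_speedup(P, M, k):
    """Predicted parallel-in-time speed-up, Eq. (4.13): P processors,
    M = coarse-to-fine time-step ratio, k iterations."""
    return M * P / (P + k * (1.0 + M))

# Small time-step ratio: negligible gain.
print(predicted_speedup(P=16, M=4, k=2))
# Large time-step ratio: the speed-up approaches the P/k limit (16/2 = 8).
print(predicted_speedup(P=16, M=10000, k=2))
```

For a fixed large M, increasing P keeps improving the predicted speed-up, unlike the communication-bound space decomposition.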
The penalty on the parallel-in-time method is inherent to the algorithm and, consequently, the efficiency comparison of the present method with the classical spatial domain decomposition method is adverse in most cases. However, despite the low efficiency obtained, important computing time reductions can still be accomplished if a large number of processors is available and few iterations are required to converge the solution.

Figure 4.12: Parallel-in-time speed-up prediction for two iterations (speed-up versus the number of fine time-grid steps per processor, for 16, 32, 64 and 128 processors).

4.5 Numerical experiments

4.5.1 Taylor vortex

4.5.1.1 Parallel-in-time results

The two-dimensional Taylor vortex-decay problem [64] was used to evaluate the time-parallel technique applied to the integral form of the Navier-Stokes and continuity equations. The analytical solution of this problem was presented in Eqs. (3.1) to (3.3). Computations were performed with Re = 100 and the time domain considered extends up to T = 40 s. The staggered spatial meshes comprise 32 × 32 and 64 × 64 nodes to discretize the spatial domain 0 ≤ x, y ≤ π. The time domain was decomposed into P sub-domains, where P is the number of processors (up to 16, because only 16 processors were available on the PC-cluster at the time), yielding ∆t = 40/P s; the finer time-step was set equal to 4 × 10−3 s. First-order implicit and explicit schemes are used for the temporal evolution on the coarse and on the finer time-grids, respectively. The Bi-CGSTAB method [61] is used to solve the Poisson and the implicit momentum equations. The SIMPLE method [55] is employed to provide the velocity and pressure field corrections during each time iteration on the coarse time-grid. The standard projection method is applied for the finer time-grid solution.
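The predictor-corrector structure described above can be illustrated on a scalar model problem, du/dt = -u, with implicit Euler as the coarse propagator and explicit Euler sub-steps as the fine propagator, mirroring the scheme choices just listed. This is only a structural sketch under those stand-in assumptions: in the actual solver the propagators are the coarse SIMPLE and fine projection-method steps, and the fine stage runs in parallel.

```python
import math

def G(u, dT):
    """Coarse propagator: one implicit Euler step for du/dt = -u."""
    return u / (1.0 + dT)

def F(u, dT, m):
    """Fine propagator: m explicit Euler sub-steps over the same interval."""
    dt = dT / m
    for _ in range(m):
        u = u + dt * (-u)
    return u

def parareal(u0, T, P, m, iterations):
    """Iterative coarse/fine predictor-corrector over P time sub-domains
    (the fine solves are independent and would run in parallel)."""
    dT = T / P
    U = [u0]                                 # sub-domain interface values
    for n in range(P):                       # initial sequential coarse sweep
        U.append(G(U[n], dT))
    for _ in range(iterations):
        F_old = [F(U[n], dT, m) for n in range(P)]   # parallel fine stage
        G_old = [G(U[n], dT) for n in range(P)]
        for n in range(P):                   # sequential correction sweep
            U[n + 1] = G(U[n], dT) + F_old[n] - G_old[n]
    return U[-1]

exact = math.exp(-2.0)
print(abs(parareal(1.0, 2.0, 8, 50, 0) - exact))  # coarse prediction only
print(abs(parareal(1.0, 2.0, 8, 50, 4) - exact))  # after 4 corrections
```

The corrected endpoint approaches the serial fine-grid solution, which is the convergence behaviour exploited throughout this chapter.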
The numerical experiments performed suggest that the solution accuracy and the computing time required by the parallel-in-time method depend on several parameters: (i) the number of iterations performed in the algorithm; (ii) the spatial resolution; (iii) the ratio of coarse to fine time-step sizes, ∆t/δt. Each of the above items depends, obviously, on the number of processors. The dependence of the computing time and of the solution accuracy on items (i), (ii) and (iii) is presented in the following paragraphs.

Influence of the number of iterations. The parallel-in-time method involves two integrators, for the coarse and fine time-grids, that are used in an iterative fashion. Consequently, the computing time of the parallel-in-time solution depends on the number of iterations. Figure 4.13 shows the computing time versus the number of processors for different numbers of iterations. The spatial mesh comprises 32 × 32 nodes. For this test case, the number of iterations is prescribed (2, 3 or 4) and Figure 4.13 shows that the computing time decreases with the increase in the number of processors and increases with the number of iterations. Figure 4.13 also shows the computing time for a non-parallel calculation performed with a fully explicit method using a time-step equal to the finer time-step of the parallel-in-time calculation.

Figure 4.13: Parallel-in-time computing time dependence on the number of iterations performed (32 × 32 nodes; δt = 4 × 10−3 s).

The computing time required for the serial calculation is 106 s, while the computing time of the parallel-in-time calculation is 41 s on 16 processors with four iterations. The speed-up is rather low, equal to 4.1, using 16 processors and two iterations.
Concerning the accuracy of the method, Figure 4.14 shows the L1 error norm of the calculated solution at t = 40 s as a function of the number of iterations prescribed. A very small error in the velocity field, approximately equal to 1 × 10−5, occurs. This error may be considered very small because the maximum value of the velocity components is approximately equal to 0.45. Figure 4.15 shows that the maximum deviation between the parallel-in-time and serial solutions decreases with the increase in the number of iterations. For the present case, the use of four iterations induces a maximum deviation smaller than 1 × 10−6 and, consequently, the parallel-in-time and serial solutions are virtually identical.

Figure 4.14: Parallel-in-time solution L1 norm error dependence on the number of iterations performed (32 × 32 nodes; δt = 4 × 10−3 s).

Figure 4.15: Dependence of the maximum deviation between parallel-in-time and serial solutions on the number of iterations performed (32 × 32 nodes; δt = 4 × 10−3 s).

Influence of the spatial resolution. The temporal solution of any flow problem requires, for each time-step, the solution of the dependent variables in the discrete spatial domain. The use of an implicit method on the coarse time-grid penalizes the computing time when the number of nodes in the spatial mesh increases. Figure 4.16 shows the computing time required by the present parallel-in-time method for two spatial meshes, with 32 × 32 and 64 × 64 nodes. The computing time of a sequential calculation is also shown in the figure. For both spatial discretizations considered, a substantial computing time saving is verified when using the parallel-in-time method. This saving increases with the increase in the spatial dimension of the problem.
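The two error measures used in Figs. 4.14 and 4.15 can be written compactly; the sketch below assumes flattened velocity arrays and uses purely illustrative data:

```python
def l1_error(u_num, u_exact):
    """L1 error norm per node: mean absolute difference."""
    return sum(abs(a - b) for a, b in zip(u_num, u_exact)) / len(u_num)

def max_deviation(u_parallel, u_serial):
    """Maximum point-wise deviation between two solutions."""
    return max(abs(a - b) for a, b in zip(u_parallel, u_serial))

# Toy flattened 32 x 32 field standing in for a velocity component
# (values bounded by ~0.45, as in the Taylor vortex case):
u_exact = [0.45 * (i % 7) / 7.0 for i in range(32 * 32)]
u_num = [v + 1e-5 for v in u_exact]   # uniform 1e-5 offset
print(l1_error(u_num, u_exact))       # ~1e-5, the level seen in Fig. 4.14
print(max_deviation(u_num, u_exact))
```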
Speed-up and parallel efficiency for both spatial meshes are represented in Figure 4.17, showing that higher speed-up is achieved on coarser spatial meshes.

Figure 4.16: Computing time dependence on space resolution (2 iterations; δt = 4 × 10−3 s).

Figure 4.17: Speed-up and parallel efficiency dependence on space resolution (2 iterations; δt = 4 × 10−3 s).

Influence of the ratio of coarse to fine time-step sizes. The computing time of the parallel-in-time calculations can be split into two parts: the time required by the sequential procedures of the algorithm and the time used in parallel calculations. The computing time saving depends on the ratio between the parallel and sequential computational efforts carried out during the iteration process. The selection of δt should correspond to the desired time resolution in a serial computation. Three time-step sizes were considered on the fine time-grid, δt1 = 4 × 10−2 s, δt2 = 4 × 10−3 s and δt3 = 4 × 10−4 s, to which correspond M × P = 10^3, 10^4 and 10^5, respectively. Figure 4.18 shows the computing time as a function of the number of processors for the three finer time-grids considered. The computing time saving increases with the increase of the time-scales ratio M.

Figure 4.18: Computing time dependence on the size of the finer time-grid increment (2 iterations on a 32 × 32 nodes mesh).
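For the parameters above (T = 40 s, one coarse step per time sub-domain), the quoted values of M × P follow directly; a small arithmetic check, with variable names of our choosing:

```python
T = 40.0                          # simulated time interval, s
P = 16                            # time sub-domains, one coarse step each
dT = T / P                        # coarse time-step size: 2.5 s
for dt_fine in (4e-2, 4e-3, 4e-4):
    M = dT / dt_fine              # coarse-to-fine step ratio
    # M * P equals the total number of fine steps, T / dt_fine:
    print(dt_fine, M, round(M * P))
```

The product M × P is independent of P: it is simply the total number of fine time-steps, 10^3, 10^4 and 10^5 for the three values of δt.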
4.5.1.2 Comparison between the spatial and the temporal domain decomposition

The parallel-in-time method was compared with the standard spatial domain decomposition method. Calculations of the Taylor problem, on a 64 × 64 nodes mesh, were performed with the spatial domain decomposition method using up to sixteen processors. The temporal evolution of the momentum equations is performed with the same fully explicit procedure that was applied on the fine time-grid of the parallel-in-time calculation. An explicit outer iteration coupling was used during the solution of the Poisson equation: after each outer iteration, pressure and velocities on partition boundaries are exchanged between processors until convergence is reached. The parallel efficiency rapidly decreases with the increase in the number of processors due to the high ratio between the communication and computation efforts, as shown in Fig. 4.19.

The definitions of parallel efficiency and speed-up for the parallel-in-time results correspond to the classical ones developed for space domain decomposition calculations. In the present parallel-in-time method, the iterative use of two integrators, for the coarser and finer time-grids, causes a speed-up and efficiency decrease. However, a large computing time reduction can still be achieved when comparing with a single-processor calculation. Figure 4.20 shows the parallel efficiency of the parallel-in-time calculations. The low efficiency of the spatial domain decomposition method is due to the small dimension of the problem, while the low efficiency of the parallel-in-time calculations is inherent to the present method, for the reasons explained above. Nevertheless, the efficiency ratio between the parallel-in-time and spatial domain decomposition methods is important for the user because it will help to select the domain, spatial or temporal, that should be parallelized.
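The classical definitions in question are S = Γ_1/Γ_P and E = S/P; applied to the Taylor-vortex timings quoted in Section 4.5.1 (106 s serial versus 41 s on 16 processors with four iterations), they give the following (helper names are ours):

```python
def speedup(t_serial, t_parallel):
    """Classical speed-up S = Gamma_1 / Gamma_P."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, P):
    """Classical parallel efficiency E = S / P."""
    return speedup(t_serial, t_parallel) / P

S = speedup(106.0, 41.0)   # four-iteration timings quoted earlier
print(round(S, 2), round(efficiency(106.0, 41.0, 16), 2))
```

The low efficiency value obtained here is exactly the behaviour discussed in the text: useful wall-clock savings despite modest efficiency.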
Figure 4.21 shows the parallel efficiency ratio between the parallel-in-time and spatial domain decomposition methods. When the number of processors increases, the parallel-in-time method becomes more efficient than the spatial domain decomposition. This result was expected because the parallel efficiency of the spatial domain decomposition method decreases with an increasing number of processors.

Figure 4.19: Spatial domain decomposition parallel efficiency and speed-up on a 64 × 64 nodes mesh and δt = 4 × 10−3 s.

Figure 4.20: Temporal domain decomposition parallel efficiency and speed-up on a 64 × 64 nodes mesh and δt = 4 × 10−3 s.

Figure 4.21: Parallel-in-time and domain decomposition methods parallel efficiency ratio on a 64 × 64 nodes mesh.

4.5.2 Shedding flow past a two-dimensional square cylinder in a channel

The numerical simulation of the flow past a two-dimensional square cylinder in a channel, for Reynolds numbers equal to 500 and 1000, was selected to illustrate the application of the parallel-in-time method to a demanding, self-sustained unsteady flow problem. Flow around bluff bodies is characterized by the onset of periodic oscillations above a critical Reynolds number, the von Karman vortex street.

4.5.2.1 Parallel-in-time simulation

The flow configuration comprises the square cylinder, of unit width, symmetrically confined in a channel with height and length equal to 4 and 24, respectively. The blockage ratio is therefore 1/4 and the upstream face of the obstacle is at a distance
equal to 8.5 from the inlet.

Figure 4.22: Flow configuration and grid.

The flow is impulsively started at t = 0, with a uniform flow prescribed at the inlet. At the outlet, the convective wave open boundary condition was used for the velocity components. In addition, no-slip conditions were prescribed on the walls. The numerical grid comprises 90 × 38 nodes to discretize the spatial domain of 24 × 4. The local mesh Reynolds number, or Peclet number, is higher than 2 for the flow considered and, consequently, a deferred correction method was used for the convection discretization [60]. The deferred correction employed about 80% central differences and the remaining 20% upwind contribution. At each time-step, the SIMPLE method [55] is used to correct the velocity and pressure fields, enforcing a divergence-free velocity field. The three-level implicit and the Crank-Nicolson schemes were used on the coarse and finer time-grids, respectively. The time-step sizes are equal to δt = 1 × 10−2 and ∆t = 0.25 on the fine and coarse time-grids, respectively. As the number of processors available is insufficient to perform a parallel-in-time calculation covering several shedding periods, which would require 2000 processors for a time interval T equal to 500, time-blocks were considered: time-blocks of size P × ∆t are solved sequentially.

Figure 4.23 shows the temporal evolution of the drag and lift coefficients for Reynolds number equal to 500, where the onset of the breaking flow can be observed at t ≈ 60 after the impulsive start. The bifurcation of the solution was triggered by numerical noise, without any perturbation prescribed to initiate the vortex shedding, in a similar fashion to the purely sequential simulations. The figure also shows
the temporal evolution of the drag and lift coefficients when the periodic flow is established for the same Reynolds number.

Table 4.1: Predicted values for St and CL rms.

  Re     St     CL rms
  500    0.24   0.52
  1000   0.22   1.24

Table 4.1 shows the predicted values for the Strouhal number, St = f D/U0, where f is the shedding frequency and D is the square cylinder width, and for the rms value of the lift coefficient, CL rms. The predictions are in reasonable agreement with the reference values reported by Davies et al. [68]. Figure 4.24 shows the predicted vorticity contours and streamlines for Reynolds number equal to 500. Flow predictions were also obtained on a single processor with the same parameters used on the finer time-grid of the parallel-in-time procedure. The results were virtually identical, showing that the time decomposition procedure did not deteriorate the solution. The parallel speed-up for the simulations performed with 16 processors and the parallel-in-time procedure was equal to 5.2 and 4.8 for Reynolds numbers equal to 500 and 1000, respectively.

Figure 4.23: Force coefficients for Re = 500.

Figure 4.24: Predicted vorticity contours and streamlines for Re = 500.

4.5.2.2 Evaluation of the proposed performance model

The flow configuration described above, for Re = 500, was also used to evaluate the parallel efficiency of the method and to compare it with the predictions of the performance model derived in Section 4.4. The spatial domain is now discretized on a staggered, uniform 151 × 26 nodes grid. The deferred correction scheme, blending second-order central and upwind differences, removed the oscillations produced by the central differencing scheme at the mesh Reynolds numbers considered. The temporal discretization was performed with the implicit Crank-Nicolson scheme on both time-grids.
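Given a sampled lift-coefficient history, St and CL rms can be extracted as sketched below. The zero-crossing frequency estimate and the synthetic test signal are our own illustration, since the thesis does not specify its post-processing:

```python
import math

def cl_rms(cl):
    """Root-mean-square of the lift-coefficient samples."""
    return math.sqrt(sum(c * c for c in cl) / len(cl))

def shedding_frequency(cl, dt):
    """Estimate f from the mean spacing of upward zero crossings."""
    mean = sum(cl) / len(cl)
    s = [c - mean for c in cl]
    up = [i for i in range(1, len(s)) if s[i - 1] < 0.0 <= s[i]]
    periods = [(b - a) * dt for a, b in zip(up, up[1:])]
    return 1.0 / (sum(periods) / len(periods))

# Synthetic signal at f = 0.24 with rms 0.52 (so St = f D/U0 = 0.24
# for D = U0 = 1), sampled at the fine time-step dt = 0.01:
f, dt = 0.24, 1e-2
cl = [0.52 * math.sqrt(2.0) * math.sin(2.0 * math.pi * f * k * dt)
      for k in range(20000)]
print(round(shedding_frequency(cl, dt), 3), round(cl_rms(cl), 3))
```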
The selection of the same numerical scheme for both temporal grids approximates the assumption of equal computing time per time-step on the fine and coarse time-grids that was made when deriving the performance model. A preliminary simulation was performed on a single processor with a time-step size equal to 0.01. The bifurcation of the equations does not arise at the same time in the single-processor and in the parallel-in-time solutions. Consequently, the comparison between the single-processor and parallel solutions is based on the fully established periodic vortex shedding mechanism, through the predictions for St and CL rms. The single-processor predictions for St and CL rms are equal to 0.255 and 0.362, respectively.

The same spatial grid, with 151 × 26 nodes, is used in the parallel calculations to evaluate the dependence of the solution on the number of iterations. Stability constraints related to the numerical method imposed a coarse-grid time-step size equal to ∆t = 0.1. Time-blocks, each one corresponding to P processors handling P × ∆t, were solved sequentially. The fine time-grid step size is equal to δt = 0.01 and was kept equal to that of the single-processor calculation for comparison purposes. Figure 4.25 plots the calculated St and the rms value of the lift coefficient as functions of the number of iterations performed in the parallel-in-time algorithm. Few iterations are required to obtain values of St and CL rms that are virtually equal to those obtained with a single-processor calculation. The small number of iterations required for convergence is a consequence of the small time-step size on the coarse time-grid imposed by the stability restriction and, consequently, of the small dimension of each calculated time-block. Figure 4.26 shows the vorticity contours after the first and the second iteration of the parallel-in-time algorithm after the establishment of the periodic flow.
Very small differences are detectable in those vorticity contours. For longer time-blocks, an increase in the number of iterations required for convergence is to be expected; however, the number of processors available for these experiments did not allow proving this assertion.

Other numerical experiments were performed to compare the achieved speed-up with the one predicted by the performance model. The sampling frequency, ∆t = 0.1, was kept unchanged in all simulations. Ten time-blocks, each one consisting of P time-steps, were calculated using different numbers of processors (1, 8, 10, 12, and 16) and different coarse to fine time-grid step size ratios (10, 100, 200 and 1000). A converged flow field solution after the establishment of the periodic flow was used as the initial condition, avoiding the influence of the initial time-steps of an impulsive start from rest. The use of a prescribed number of time-blocks in the comparison, instead of a prescribed time interval, is necessary to allow the use of the above-mentioned cluster dimensions while maintaining the sampling frequency.

Figure 4.25: Lift coefficient and Strouhal number dependence on the number of iterations prescribed.

Two and three iterations of the parallel-in-time algorithm were prescribed. Figure 4.27 shows the comparison between the verified parallel efficiency of the method and the predictions based on the performance model, Eq. (4.13). The predicted efficiency, which depends on M, P and k, was plotted as a function of Φ = M/P for each number of iterations prescribed in the parallel-in-time algorithm. Figure 4.27 shows good agreement between predicted and verified parallel efficiency.
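Dividing Eq. (4.13) by P gives the predicted efficiency curves of Fig. 4.27; written in terms of Φ = M/P, the prediction saturates at 1/k for large Φ (function name ours):

```python
def predicted_efficiency(P, M, k):
    """Eq. (4.13) divided by P: E = M / (P + k (1 + M))."""
    return M / (P + k * (1.0 + M))

P = 16
for k in (2, 3):
    for phi in (0.1, 1.0, 10.0, 100.0):
        print(k, phi, round(predicted_efficiency(P, phi * P, k), 3))
# For large phi the efficiency tends to 1/k: 0.5 and 0.333 here.
```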
More significant deviations are verified for small values of Φ, when the computing time of the sequential coarse time-grid procedures gives a more important contribution to the total computing time.

Figure 4.26: Vorticity contours after the first (a) and second (b) iteration.

Figure 4.27: Comparison between predicted (lines) and verified (symbols, filled symbols related to 3 iterations) efficiency of the parallel-in-time method for 2 and 3 iterations prescribed (Φ = M/P).

4.5.3 Hybrid spatial and temporal domain decomposition

The test case considered here is the two-dimensional natural convection flow in a square cavity. The two-dimensional, unsteady, incompressible flow in a square cavity due to natural convection, governed by the Navier-Stokes and energy equations, is a demanding impulsively started fluid flow problem, suitable to evaluate the main properties of the parallel-in-time method. The test case can be described as follows. The lower and upper walls of the square cavity are insulated and the fluid is initially at rest and at temperature θ0. At time t = 0, the temperatures θ0 + ∆θ and θ0 − ∆θ are prescribed and maintained at the left and right-side walls, respectively. Simulations for Prandtl number equal to 7 and Rayleigh numbers equal to 10^3, 1.4 × 10^4 and 1.4 × 10^5 were performed on the PC-cluster with up to 24 nodes. The non-dimensionalized temporal domain of the simulations is [0, 0.05]. The time-step size on the coarser time-grid dictates the use of an implicit method, while the prescribed non-dimensionalized time-step size of 2 × 10−6 on the finer grid allows the use of an explicit method. The spatial domain is discretized on a uniform 32 × 32 nodes collocated mesh.
The convective and diffusive fluxes on the control-volume faces are evaluated with second-order central differences together with the Adams-Bashforth scheme on the finer time-grid. The projection method is used to correct the pressure and velocity fields. When a small number of processors is available, up to 24 in the present case, the integration method on the coarse time-grid should be implicit to satisfy the stability constraints. For the present problem, the implicit Euler scheme is used together with the first-order upwind spatial discretization scheme. At each time-step, the SIMPLE method [55] was used to correct the pressure and velocity fields in the coarse time-grid solution. The stabilized bi-conjugate gradient method (Bi-CGSTAB) [61], included in the AZTEC library [62], is used to solve the systems of equations arising from the Poisson and the implicit momentum and energy equations.

The convergence criterion used to stop the iterative procedure of the parallel-in-time algorithm is based on the maximum value of the calculated solution jump, non-dimensionalized by the maximum value of the variable at the present time instance. This convergence criterion is set to 5 × 10−3.

Figure 4.28 shows the time-dependent Nusselt number,

Nu = (1/2) ∫₀¹ (Pr U Θ − ∂Θ/∂X)|_X dY ,   (4.14)

evaluated at the heated wall, X = 0, and at the vertical center-line of the cavity, X = 1/2, for Rayleigh numbers equal to 10^3, 1.4 × 10^4 and 1.4 × 10^5. Simulations were performed on 24 processors, assigning each time-slice to a single processor. The Nusselt number temporal evolutions are in good agreement with reported reference solutions [69]. Figure 4.29 shows the evolution of the Nusselt number with time, at X = 1/2, for different numbers of iterations performed in the parallel-in-time algorithm, for Ra = 1.4 × 10^5. From Fig. 4.29 one can deduce that the convergence of the parallel-in-time iterative solution is smooth.
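Equation (4.14) can be evaluated at a fixed X by a simple quadrature over the discrete profiles; the sketch below uses the trapezoidal rule on hypothetical arrays, and the wall example uses an invented uniform temperature gradient chosen only so the result is easy to verify by hand:

```python
def nusselt(U, Theta, dTheta_dX, Pr, dY):
    """Eq. (4.14) at fixed X: Nu = 1/2 * integral over Y of
    (Pr*U*Theta - dTheta/dX), via the trapezoidal rule."""
    f = [Pr * u * th - g for u, th, g in zip(U, Theta, dTheta_dX)]
    return 0.5 * sum(0.5 * (f[j] + f[j + 1]) * dY
                     for j in range(len(f) - 1))

# At the heated wall X = 0 the no-slip condition gives U = 0, so Nu
# reduces to the integrated conductive flux -dTheta/dX (here a
# hypothetical uniform gradient of -2):
n = 33
dY = 1.0 / (n - 1)
print(nusselt([0.0] * n, [1.0] * n, [-2.0] * n, Pr=7.0, dY=dY))  # -> 1.0
```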
The number of iterations required for convergence depends on the prescribed Rayleigh number and on the number of time sub-domains, see Fig. 4.30. More iterations are required for higher Rayleigh number flows. The lowest Rayleigh number flow considered, Ra = 10^3, has such a smooth time dependence that the number of iterations required for convergence remains constant, independent of the number of processors used. For higher Rayleigh number flows, the number of iterations decreases with the number of sub-domains. One should note that the number of iterations is limited by the number of time sub-domains and, consequently, for Ra = 1.4 × 10^5 there is an increase in the number of iterations required to achieve the convergence criterion when the number of processors increases from 8 to 12.

The extension of the parallel-in-time algorithm to hybrid time and space parallel calculations makes it possible to optimize the speed-up by choosing the domain to parallelize according to the dimensions of the problem under consideration and the number of processors available. Calculations were performed for Ra = 1.4 × 10^4 on two spatial grids, with 32 × 32 and 128 × 128 nodes, keeping the Courant number constant on the fine time-grid. Both spatial meshes are too coarse for accurate, representative calculations; however, the main purpose of the present work is to apply the hybrid space and time domain decomposition technique to a problem solution. To consider finer spatial meshes, more processors would be necessary to perform a reasonable comparison between the time-domain and the space-domain decomposition methods. Several configurations of the space and
time domain decomposition were tested using up to 24 processors.

Figure 4.28: The dependence of the Nusselt number, at the heated wall, X = 0, and at the vertical center-line, X = 1/2, on the non-dimensionalized time for Rayleigh numbers equal to 10^3, 1.4 × 10^4 and 1.4 × 10^5.

Figure 4.29: The Nusselt number temporal evolution at the vertical center-line for Ra = 1.4 × 10^5: dependence on the number of iterations performed.

Figure 4.30: The dependence of the number of iterations required for convergence on the Rayleigh number and on the number of time sub-domains.

The limited number of nodes of the PC-cluster only allows testing a few topological configurations of processors allocated to the time and space domains. The wall-clock computing time as a function of the number of space and time sub-domains is presented in Fig. 4.31 a) for the 128 × 128 nodes grid. When only standard domain decomposition is used, the results reflect the well-known trend that, when the spatial dimension of the problem is large, the space domain decomposition is efficient and a meaningful computing time reduction is achieved. For the present problem and the spatial mesh comprising 128 × 128 nodes, the time decomposition is not efficient because the computing effort ratio between the parallel fine time-grid and the sequential coarse time-grid solutions constrains the performance of the method.
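The topological configurations in question are, in the simplest view, the divisor pairs of the processor count; a minimal helper (ours, not the thesis code) enumerates the admissible space/time splits, among which the best choice depends on the problem dimensions, as Fig. 4.31 shows:

```python
def splits(P):
    """All (space, time) sub-domain counts whose product uses P processors."""
    return [(s, P // s) for s in range(1, P + 1) if P % s == 0]

print(splits(24))
# Eight admissible configurations for 24 processors, from pure time
# decomposition (1, 24) to pure space decomposition (24, 1).
```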
The computing time is dictated by these sequential calculations of the iterative procedure, because the number of outer iterations required by the coarse time-grid calculation is large. A smaller coarse-grid time-step size, and consequently more processors, would be necessary to increase the computing effort ratio appropriately. When using the coarser spatial grid of 32 × 32 nodes for the same problem, see Fig. 4.31 b), the space domain decomposition is not efficient and a computing time reduction can only be achieved with the time domain decomposition method. The number of time sub-domains considered, 6 to 24, is sufficient to achieve a suitable computing effort ratio, and the computing time saving increases with the number of time sub-domains. The small number of processors available limited the present hybrid space-time parallelization. Nevertheless, the expected computing time decrease with an increasing number of time sub-domains, for an optimal number of space sub-domains, is indicated in Fig. 4.31 a). From Fig. 4.31 b) it is possible to verify that the small spatial dimension of the problem (32 × 32 nodes) prevents any benefit from the hybrid parallelization.

Figure 4.31: Computing time required for simulations: a) 128 × 128 nodes; b) 32 × 32 nodes.

4.6 Summary

A parallel-in-time method, based on the temporal domain decomposition, was applied to the solution of the unsteady, incompressible Navier-Stokes equations and extended to a hybrid spatial and temporal formulation.
The two-dimensional Taylor vortex-decay problem with Re = 100 was selected to conduct a sensitivity analysis on some of the parameters influencing the parallel-in-time method. The following conclusions were derived:

(i) The parallel-in-time solution can require less computational time than the single-processor solution. The speed-up depends on several parameters that were investigated.

(ii) The parallel-in-time computing time decreases with the number of processors and increases with the number of iterations required for convergence of the iterative process between the sequential coarse time-grid and the finer parallel time-grid solutions.

(iii) The parallel efficiency of the parallel-in-time method increases as the spatial dimension of the problem decreases.

(iv) The parallel efficiency of the present method increases when the computational effort ratio between the fine and coarse time-grid integrators increases.

The conditional stability domain was evaluated by numerical experiments for several temporal discretization schemes applied on the coarser time-grid. For this purpose, calculations of the one-dimensional transport equation were performed simulating up to one hundred processors. No reduction of the stability domain was detected for the "unconditionally stable" implicit three-level and Euler schemes. The other schemes, explicit Euler, Adams-Bashforth, Crank-Nicolson and fourth-order Runge-Kutta, displayed important reductions of their conditional stability domains for the test case investigated.

To evaluate the potential efficiency of the present method for fluid flow simulations, the convergence of the iterative method was analysed considering some spatial and temporal discretization schemes commonly used for that purpose. The parallel-in-time solution of the one-dimensional scalar transport equation allowed some conclusions concerning the accuracy and convergence of the iterative method:
(i) The accuracy of the parallel-in-time solution increases with the number of iterations, in accordance with the orders of accuracy of the numerical schemes considered for the coarse and fine time-grids.

(ii) The number of iterations required for convergence can be estimated by the ratio between the orders of accuracy of the numerical schemes considered for the fine and coarse time-grids.

Another important issue related to the performance of the parallel-in-time method is the time spent in the communication tasks required by the algorithm. A simplified performance model for the parallel-in-time method was proposed. The flow past a two-dimensional square cylinder between parallel walls at Reynolds number equal to 500 was selected to analyse, through numerical experiments, the influence of the communication time on the parallel efficiency of the method. The following conclusions could be derived:

(i) The agreement between the parallel efficiency verified in the numerical experiments and that predicted by the theoretical model indicates that the communication overhead does not impose a critical limitation on the application of the present methodology to the unsteady, incompressible Navier-Stokes equations.

(ii) For large ratios between the coarse and fine time-step sizes, a substantial computing time reduction can be expected.

The application of the parallel-in-time algorithm to the solution of the incompressible, unsteady Navier-Stokes equations was extended to a hybrid (space and time) decomposition, and the simulations agree with reference solutions for the considered test cases. The numerical simulation of the two-dimensional, unsteady, incompressible flow in a square cavity due to natural convection was selected to illustrate the application of four computing strategies: sequential, space domain decomposition, time domain decomposition and hybrid space-time decomposition.
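The kind of simplified performance model referred to above can be illustrated by a rough cost estimate. The function below is a hypothetical sketch; its parameterisation is an assumption for illustration, not the model derived in the thesis:

```python
def speedup_estimate(n_sub, n_iter, effort_ratio, comm_time=0.0):
    """Hypothetical simplified speed-up model for the parallel-in-time method.

    n_sub        : number of time sub-domains (one processor each)
    n_iter       : iterations required for convergence
    effort_ratio : cost of the fine solution of one sub-domain divided by
                   the cost of one coarse time-grid step
    comm_time    : total communication overhead, in coarse-step units

    The serial reference cost is the fine integration over every
    sub-domain; each iteration pays a sequential coarse sweep plus one
    fine sub-domain solve performed concurrently on all processors.
    """
    t_serial = n_sub * effort_ratio
    t_parallel = n_iter * (n_sub + effort_ratio) + comm_time
    return t_serial / t_parallel
```

Even this crude model reproduces the observed trends: the speed-up grows with the computing effort ratio and degrades with the number of iterations required for convergence.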
When the space domain of the problem is small and the standard space domain decomposition method prevents the speed-up of the calculation, the parallel-in-time method makes it possible to reduce the computing time. The speed-up achieved with the present method is strongly dependent on the number of iterations required for convergence. An increasing number of time sub-domains decreases the number of iterations required for convergence and, consequently, contributes to magnify the computing time saving. The hybrid space-time calculations were successfully performed, but 24 processors were not enough to derive final conclusions. The results suggest that the parallel-in-time methodology is promising when the temporal scale of the problem under consideration is large and a large number of processors is available.

Chapter 5

Conclusions

5.1 Summary

Computational fluid dynamics is currently one of the great challenges in supercomputing. The use of massively parallel processing systems is necessary to solve high-resolution problems cost-effectively. The number of nodes of parallel computers will naturally increase in the future, having as a limit the largest number of linked computers, which is ultimately the World Wide Web with several hundred million processors. In such a scenario, a problem considered large nowadays may become a small one if a very large number of computer nodes is available. For unsteady fluid flow problems, one possible way to fully exploit the large number of nodes available in the future is the temporal domain decomposition method.

Another relevant trend in CFD is the use of high-order methods. For a given solution accuracy, high-order methods require less memory (a smaller number of points) and in most cases can save computing time. To take advantage of the more accurate volume-averaged solution in higher-order formulations, it is essential to proceed with the reconstruction of the point-wise velocity field.
This reconstruction should also be performed on a higher-order basis, so that the time derivative of the volume-averaged velocity is approximated congruently with the accuracy of the other operators. The spatial domain decomposition technique was applied to parallelize a fourth-order accurate finite-volume solver for the unsteady, incompressible form of the Navier-Stokes equations. The developed algorithm considers the time advancement of the momentum equations by the classical explicit four-stage Runge-Kutta scheme, and the approximation of the convective and viscous fluxes at the control-volume faces by fourth-order accurate Simpson's rule and polynomial interpolation. The calculation of the high-order de-averaging coefficients is based on the Taylor series expansion of the integrated velocity values at the cell and its neighbouring cells. The fourth-order global accuracy of the numerical scheme was verified by the simulation of the two-dimensional Taylor vortex-decay problem. The global numerical scheme becomes second-order accurate when the time derivative of the cell-center velocity substitutes the time derivative of the volume-averaged velocity.

The flow simulations comprising the interaction between vortices provided some further conclusions. The accuracy enhancement obtained by the inclusion of the fourth-order de-averaging procedure was most significant for coarser spatial discretizations. For the same level of accuracy, the proposed fourth-order de-averaging procedure requires fewer mesh nodes than the second-order one. The differences between the solutions obtained with the proposed fourth-order and the second-order de-averaging procedures vanish with mesh refinement.

A parallel-in-time method, based on temporal domain decomposition, was applied to the solution of the unsteady, incompressible Navier-Stokes equations. Numerical experiments were selected to analyse the numerical properties of the method.
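As a one-dimensional illustration of the de-averaging step described above, a fourth-order accurate point value can be recovered from cell averages with a three-cell stencil. This 1-D formula is a standard textbook result, not the thesis's 2-D/3-D stencils of Appendix A:

```python
import numpy as np

def deaverage_1d(ubar):
    """Fourth-order de-averaging of interior cell averages (1-D sketch).

    For smooth u, ubar_i = u(x_i) + (h**2/24) u''(x_i) + O(h**4), so
    the cell-center point value can be recovered as
        u(x_i) ~ (-ubar_{i-1} + 26*ubar_i - ubar_{i+1}) / 24.
    Interior cells only; boundary cells would require one-sided stencils.
    """
    u = ubar.astype(float)
    u[1:-1] = (-ubar[:-2] + 26.0 * ubar[1:-1] - ubar[2:]) / 24.0
    return u
```

Using the raw cell averages as point values is only second-order accurate; the correction term above restores fourth-order accuracy at negligible cost, which mirrors the accuracy enhancement reported for the fourth-order de-averaging procedure.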
The solution of the non-linear fluid flow equations with the parallel-in-time method is a promising technique, but still in the early stages of development and application. Among others, the stability, the theoretical parallel efficiency and the robustness to deal with non-linear, complex unsteady flows were investigated.

The conditional stability domain was evaluated by numerical experiments solving the one-dimensional scalar transport equation. For this purpose, various numerical schemes, involving first- to fourth-order discretizations, were applied for the coarser time-grid solution, considering up to one hundred processors. No reduction of the stability domain was detected for the "unconditionally stable" implicit schemes. The other schemes, explicit Euler, Adams-Bashforth, Crank-Nicolson and fourth-order Runge-Kutta, displayed important reductions of their conditional stability domains.

The prediction of the parallel efficiency, or speed-up, is relevant to decide when to use the parallel-in-time method for a specific problem. A theoretical performance model for the parallel-in-time method was derived, with some simplifying assumptions, in order to evaluate the potential application of the technique. This model takes into account several parameters that contribute to the computing time required to perform a parallel-in-time calculation.

The number of iterations required for convergence plays an important role in the efficiency of the parallel-in-time method. The one-dimensional scalar transport equation allowed the potential efficiency of the method for fluid flow simulations to be evaluated. The convergence of the iterative method was analysed considering some spatial and temporal discretization schemes commonly used for CFD calculations.
The selection of the numerical schemes for the temporal evolution on each time-grid, besides the stability constraints, should be made considering that the scheme for the finer time-grid evolution should provide a higher order of convergence than the one used in the coarse time-grid. Considering the general theory related to the convergence of numerical schemes, one can approximate the number of iterations required for convergence by the ratio between the orders of accuracy of the numerical schemes used for the fine and coarse time-grid solutions.

The performance of the parallel-in-time method also depends on the time spent in the communication tasks required by the algorithm. The performance model was validated with solutions of the Navier-Stokes equations for a self-sustained unsteady flow problem. The comparison of the observed parallel-in-time performance for the solution of a complex unsteady, incompressible flow problem with the performance model prediction verifies that the communication time overhead of the parallel-in-time simulation does not prevent the application of the method to fluid flow simulations. The simulations of demanding unsteady flow problems showed that a few iterations are sufficient to obtain parallel-in-time solutions virtually equal to those obtained on a single processor. Significant computing time savings can be achieved for long-time simulations of problems with a small spatial domain.

It is believed that in the near future massively parallel computer systems will increase the number of processors available, extending the boundaries of problem size. Consequently, temporal or hybrid domain decomposition and high-order finite-volume methods will have a high potential to reduce the computing time of fluid flow simulations.
This will have a positive impact on the solution of the partial differential equations that CFD deals with, as well as in other areas of computer modelling in engineering and science.

5.2 Suggestions for future work

The parallel-in-time solution of non-linear fluid flow equations is still in its infancy, and many theoretical, numerical and practical topics need further investigation. The main tasks in this area that should be addressed are:

(i) Investigate the stability of the algorithm, including non-linear stability analysis.

(ii) Incorporate in the algorithm automatic options allowing coarser spatial meshes for the coarse time-grid prediction step.

(iii) Investigate the extension of the algorithm to a multi-level formulation, in a similar fashion to the spatial multi-grid methods, to increase the performance of the method.

Finally, general suggestions for future work are:

(i) Port the codes developed for parallel processing on PC-clusters to massive GRID technology.

(ii) Develop error estimation techniques to incorporate into h-p refinement methodologies for high-order massively parallel calculations.
Appendix A

De-averaging coefficients

The derivation of the de-averaging coefficients was performed with symbolic computing software (MATHEMATICA). Firstly, a generic function is considered in the de-averaging finite-volume stencil. The truncated Taylor series expansion (up to the required order) is then integrated in each control volume. The resulting system of equations gives the de-averaging coefficients C_i of Eq. (2.54),

u = (1/V) Σ_i C_i û_i

where V denotes the volume of each cell. The de-averaging coefficients were calculated to obtain sixth-order accuracy, requiring 13 finite-volume cells for the two-dimensional case, as indicated in Fig. A.1 for cells far from the boundaries of the domain. Near the boundaries, only fourth-order accuracy is achieved due to the non-symmetric stencil, which includes only interior cells. The coefficients for a uniform mesh are indicated in Tab. A.1, according to the control-volume locations indicated in Fig. A.2. The coefficients of cell type 1 are considered for all interior control volumes. The consideration of control volumes near other boundaries is straightforward. Figure A.3 includes the stencil and the de-averaging coefficients for an interior cell on a 3-D uniform mesh. The non-symmetric stencils and coefficients required to perform the de-averaging near boundaries, where the application of the symmetric stencil is not possible, are indicated in Figs. A.4 to A.11.
Figure A.1: High-order 2-D de-averaging stencil for interior control volumes.

Figure A.2: Cases considered for the use of the 2-D de-averaging coefficients.

Table A.1: De-averaging coefficients for a uniform 2-D cartesian spatial grid.

Figure A.3: 3-D stencil and de-averaging coefficients for an interior cell.

Figure A.4: 3-D stencil and de-averaging coefficients for a boundary face cell.

Figure A.5: 3-D stencil and de-averaging coefficients for a boundary face cell near an edge.

Figure A.6: 3-D stencil and de-averaging coefficients for a boundary face cell near a vertex.

Figure A.7: 3-D stencil and de-averaging coefficients for a cell near a boundary face.

Figure A.8: 3-D stencil and de-averaging coefficients for a cell near a boundary face and edge.

Figure A.9: 3-D stencil and de-averaging coefficients for a cell near a boundary face and vertex.

Figure A.10: 3-D stencil and de-averaging coefficients for a boundary edge cell.

Figure A.11: 3-D stencil and de-averaging coefficients for a boundary vertex cell.
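The derivation procedure of this appendix — integrate a truncated Taylor expansion over each stencil cell and solve the resulting linear system — can be sketched numerically in one dimension. The solver below is an illustrative sketch using numpy in place of MATHEMATICA, on a uniform mesh with unit cell width:

```python
import numpy as np

def deaveraging_coeffs_1d(offsets):
    """Solve for 1-D de-averaging coefficients on a uniform mesh (h = 1).

    The average of x**k over a unit cell centered at xj is
        ((xj + 1/2)**(k+1) - (xj - 1/2)**(k+1)) / (k + 1).
    Requiring sum_j C_j * ubar_j to reproduce the point value u(0) for all
    monomials up to the stencil size yields a linear system for the C_j.
    """
    n = len(offsets)
    M = np.zeros((n, n))
    for j, xj in enumerate(offsets):
        for k in range(n):
            M[j, k] = ((xj + 0.5) ** (k + 1) - (xj - 0.5) ** (k + 1)) / (k + 1)
    rhs = np.zeros(n)
    rhs[0] = 1.0            # the point value is the zeroth Taylor coefficient
    return np.linalg.solve(M.T, rhs)
```

For the symmetric three-cell stencil, offsets = [-1, 0, 1], this reproduces the classical fourth-order coefficients (-1/24, 26/24, -1/24); non-symmetric offsets give the one-sided boundary stencils in the same way.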
