DELFT UNIVERSITY OF TECHNOLOGY
REPORT 05-05

Efficient Methods for Solving Advection Diffusion Reaction Equations

S. van Veldhuizen

ISSN 1389-6520
Reports of the Department of Applied Mathematical Analysis
Delft 2005

Copyright 2005 by Delft Institute of Applied Mathematics, Delft, The Netherlands.
No part of the Journal may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission from Delft Institute of Applied Mathematics, Delft University of Technology, The Netherlands.

Efficient Methods for Solving Advection Diffusion Reaction Equations
- Literature Study -

S. van Veldhuizen
August 23, 2005

Contents

Preface

Part I  Mathematical Model of CVD
1 Introduction
2 The Mathematical Model
  2.1 The Transport Model
    2.1.1 Transport Equations for Gas Species
    2.1.2 Complete Mathematical Model
  2.2 Boundary Conditions
3 Stiffness
  3.1 Example of Stiff Equation
  3.2 Stability Analysis for Euler Forward Method
4 Test Problem
  4.1 The Computational Grid
  4.2 Finite Volume Discretization of the General Transport Equation
  4.3 Boundary Conditions
  4.4 Chemistry Properties
    4.4.1 Gas-Phase Reaction Model

Part II  Time Integration Methods
1 Introduction
2 Runge-Kutta Methods
  2.1 The Stability Function
  2.2 Rosenbrock Methods
  2.3 Runge Kutta Chebyshev Methods
    2.3.1 First Order (Damped) Stability Polynomials and Schemes
    2.3.2 Second Order Schemes
  2.4 Some Remarks on Runge-Kutta Methods
    2.4.1 Properties of Implicit Runge-Kutta Methods
    2.4.2 Diagonally Implicit Runge-Kutta Methods
    2.4.3 The Order Reduction Phenomenon for RK-methods
    2.4.4 Order Reduction for Rosenbrock Methods
3 Time Splitting
  3.1 Operator Splitting
    3.1.1 First Order Splitting of Linear ODE Problems
    3.1.2 Nonlinear ODE Problems
    3.1.3 Second and Higher Order Splitting
  3.2 Boundary Values, Stiff Terms and Steady State
    3.2.1 Boundary Values
    3.2.2 Stiff Terms
    3.2.3 Steady State
4 IMEX Methods
  4.1 IMEX-θ Method
  4.2 Concluding Remarks
5 IMEX Runge-Kutta-Chebyshev Methods
  5.1 Construction of the IMEX Scheme
  5.2 Stability
  5.3 Consistency
  5.4 Final Remarks
6 Linear Multi-Step Methods
  6.1 The Order Conditions
  6.2 Stability Properties
    6.2.1 Zero-Stability
    6.2.2 The Stability Region
7 Multi-Rate Runge-Kutta Methods
  7.1 Multi-Rate Runge-Kutta Methods
  7.2 Order Conditions
  7.3 Stability

Part III  Nonlinear Solvers
1 Introduction
2 Newton's Method
  2.1 Newton's Method in One Variable
  2.2 General Remarks on Newton's Method
  2.3 Convergence Properties
  2.4 Criteria for Termination of the Iteration
  2.5 Inexact Newton Methods
  2.6 Global Convergence
  2.7 Extension of Secant Method to n Dimensions
  2.8 Failures
    2.8.1 Non-smooth Functions
    2.8.2 Slow Convergence
    2.8.3 No Convergence
    2.8.4 Failure of the Line Search
3 Picard Iteration
  3.1 One Dimension
  3.2 Higher Dimensions
  3.3 Some Last Remarks

Part IV  Linear Solvers
1 Introduction
2 Krylov Subspace Methods
  2.1 Krylov Subspace Methods for Symmetric Matrices
    2.1.1 Arnoldi's Method and the Lanczos Algorithm
    2.1.2 The Conjugate Gradient Algorithm
  2.2 Krylov Subspace Methods for General Matrices
    2.2.1 BiCG Type Methods
    2.2.2 GMRES Methods
    2.2.3 The GMRES Algorithm
  2.3 Stopcriterium
3 Precondition Techniques
  3.1 Preconditioned Iterations
    3.1.1 Preconditioned Conjugate Gradient
    3.1.2 Preconditioned GMRES
    3.1.3 Preconditioned Bi-CGSTAB
  3.2 Precondition Techniques
    3.2.1 Diagonal Scaling
    3.2.2 Incomplete LU Factorization
  3.3 Incomplete Choleski Factorization
  3.4 Multigrid

Part V  Concluding Remarks
1 Summary and Conclusions
  1.1 Time Integration Methods
  1.2 Nonlinear and Linear Solvers
2 Future Research
  2.1 Test-problem and Time Integration Methods
  2.2 Stability
  2.3 (Non)-Linear Solvers
  2.4 Extension to Three Dimensions
Part VI  Appendices
A Collocation Methods
B Padé Approximations

Bibliography

Preface

Aim of the Research Project

Chemical Vapor Deposition creates thin films of material on a substrate via the use of chemical reactions. Reactive gases are fed into a reaction chamber, where they react in the gas phase or on a substrate and form a thin film or a powder. In order to model this phenomenon the flow of gases is described by the incompressible Navier-Stokes equations, whereby the density variations are taken into account by the perfect gas law. The temperature is governed by the heat equation, whereas the chemical species satisfy the advection-diffusion-reaction equations. Since the reaction of chemical species generates or consumes energy, there is a coupling between the heat equation and the advection-diffusion-reaction equations. In classical CVD the generated heat is rather small, so it is possible to decouple the heat equation. Our aim is to develop numerical methods which are also applicable to laminar combustion simulations, where there is a strong coupling between the heat equation and the advection-diffusion-reaction equations.

Typically, in the advection-diffusion-reaction equations the time scales of the slow and fast reaction terms differ by orders of magnitude from each other and from the time scales of diffusion and advection, leading to extremely stiff systems. The purpose of this project is to develop robust and efficient solvers for the heat/reaction system, where we first assume that the velocity is given. Thereafter this solver should be integrated in a CFD package. For a typical 3D problem a 50 × 50 × 50 grid with 50 chemical species is used.

Structure of the Report

This report, a summary of recent, relevant literature, is divided into five parts. The first part consists of a general introduction to the C(hemical) V(apor) D(eposition) process and the corresponding mathematical model which describes that process. It appears that the system of PDEs describing CVD is a so-called stiff system. In this first part the notion of stiffness is explained. We conclude with the finite volume discretization of the general transport equation.

The second part of this report presents time integration methods that can 'handle' stiff problems. Some 'old' methods, which work fine for stiff problems, as well as recently developed integration methods will be treated.

Since the PDEs describing CVD are of the nonlinear type, time integration methods will in general result in solving nonlinear equations. Therefore, the topic of the third part is 'nonlinear solvers'. We give a treatment of the well-known nonlinear solvers.

Newton-type nonlinear solvers need solutions of linear systems; the latter is (evidently) the subject of Part IV. Since our aim is to solve three-dimensional problems, this will result in huge nonlinear systems, and thus also huge linear systems have to be solved. The most attractive class of linear solvers for three-dimensional problems are iterative methods. In Part IV we treat generally known iterative linear solvers for symmetric and general matrices. In Part V some conclusions are drawn and the outline for future research is presented.

Part I

Mathematical Model of CVD

Chapter 1

Introduction

Thin solid films are used as insulating and (semi)conducting layers in many technological areas such as micro-electronics, optical devices and so on. Therefore, deposition processes are required which produce solid films on a wide variety of materials.
Of course, these processes need to fulfill specific requirements with regard to safety, economics, etc. Chemical Vapor Deposition (CVD) is a process that synthesizes a thin solid film from the gaseous phase by a chemical reaction on a solid material. The chemical reactions involved distinguish CVD from other physical deposition processes, such as sputtering and evaporation. A CVD system is a chemical reactor, wherein the material to be deposited is injected as a gas and which contains the substrates on which deposition takes place. The energy to drive the chemical reaction is (usually) thermal energy. On the substrates surface reactions will take place resulting in deposition of a thin film. Basically, the following nine steps, taken from [16], occur in every CVD reaction process: 1. Convective and diffusive transport of reactants from the reactor inlet to the reaction zone within the reactor chamber, 2. Chemical reactions in the gas phase, leading to a multitude of new reactive species and byproducts, 3. Convective and diffusive transport of the initial reactants and the reaction products from the homogeneous reactions to the susceptor surface, 4. Adsorption or chemisorption of these species on the susceptor surface, 5. Surface diffusion of adsorbed species over the surface, 6. Heterogeneous surface reactions catalyzed by the surface, leading to the formation of a solid film, 12 Introduction 7. Desorption of gaseous reaction products, 8. Diffusive transport of reaction products away from the surface, 9. Convective and/or diffusive transport of reaction products away from the reaction zone to the outlet of the reactor. In the case that a CVD process is considered to be fully heterogeneous, step (2) in the above enumeration is does not take place. The above enumeration is illustrated in Figure 1.1, taken from [14]. )* main gas flow "! gas phase reactions #$ '( %& desorption & diffusion & diffusion of adsorption of reactive volatile surface species -. ++, reaction products surface diffusion & reactions −film growth Figure 1.1: Schematic representation of basic steps in CVD (after Jensen, see [14]) Chapter 2 The Mathematical Model The mathematical model describing the CVD process consists of a set of partial differential equations, with appropriate boundary conditions, describing the gas flow, transport of energy, transport of species and the chemical reactions in the reactor. The gas mixture is assumed to behave as a continuum. This assumption is valid when the mean free path length of the molecules is much smaller than a characteristic dimension of the reactor. The Knudsen Number Kn is defined as ξ (2.1) Kn = , L where ξ is the mean free path length of the molecules and L a typical characteristic dimension of the reactor. Thus, the gas-mixture behaves as a continuum when Kn < 0.01. For pressures larger than 100 P a and typical reactor dimensions larger than 0.01 m the continuum approach can be used safely. See also [16] and [21, Chapter 4]. 2.1 The Transport Model The gas mixture in the reactor is assumed to behave as a continuum, ideal and transparent gas1 behaving in accordance with Newton’s law of viscosity. Furthermore, the gas flow is assumed to be laminar (low Reynolds number flow). Since no large velocity gradients appear in CVD gas flows, viscous heating due to dissipation will be neglected. Furthermore, we neglect the effects of pressure variations in the energy equation. 
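Whether the continuum assumption underlying this transport model is justified can be checked directly from the criterion Kn = ξ/L < 0.01 of (2.1). The short sketch below is only an illustration and is not part of the model in [16]; it assumes the standard hard-sphere kinetic-theory estimate ξ = k_B T/(√2 π d² P) for the mean free path, with an assumed molecular diameter d that is not a quantity used elsewhere in this report.

```python
from math import pi, sqrt

K_B = 1.380649e-23  # Boltzmann constant [J/K]

def mean_free_path(T, P, d):
    """Hard-sphere estimate of the mean free path xi [m] at temperature T [K],
    pressure P [Pa] and (assumed) molecular diameter d [m]."""
    return K_B * T / (sqrt(2.0) * pi * d**2 * P)

def knudsen(T, P, d, L):
    """Knudsen number Kn = xi / L for a reactor with characteristic dimension L [m]."""
    return mean_free_path(T, P, d) / L

# Example: atmospheric-pressure reactor, L = 0.1 m, d of about 4 Angstrom
Kn = knudsen(T=300.0, P=1.013e5, d=4.0e-10, L=0.1)
print(f"Kn = {Kn:.2e}  -> continuum model valid: {Kn < 0.01}")
```

For atmospheric pressure and a reactor dimension of 0.1 m this gives Kn of the order 10⁻⁶, far below the 0.01 threshold, consistent with the rule of thumb quoted above for pressures above 100 Pa and dimensions above 0.01 m.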
The composition of the N component gas mixture is described in terms 1 By transparent we mean that the adsorption of heat radiation by the gas(es) will be small. 14 The Mathematical Model of the dimensionless mass fractions ω i = N X ρi ρ, i = 1, . . . , N , with the property ωi = 1. i=1 P The density ρ = N i=1 ωi ρi of the gas mixture depends on the spatial variable x, temperature T , the pressure P , time t, etc. Usually, in chemistry reaction equilibria are expressed in terms of (dimensionless) molar fractions. The molar fraction of species i is denoted by f i and is related to the mass fraction ωi as fi mi ωi = , m where mi is the molar mass of specie i and m= N X fi mi . i=1 The mass averaged velocity v in an N component gas mixture is defined as v= N X ωi vi . i=1 As a consequence, the diffusive mass fluxes j i (see Section 2.1.1) sum up to zero, e.g., N X i=1 ji = N X i=1 ρωi (vi − v) = ρ N X i=1 ωi vi − ρv N X i=1 ωi = ρv − ρv = 0. The transport of mass, momentum and heat are described respectively by the continuity equation, the Navier-Stokes equations and the transport equation for thermal energy expressed in terms of temperature T . The continuity equation is given as ∂ρ = −∇ · (ρv), ∂t (2.2) kg where ρ is the gas mixture density ( m 3 ) and v the mass averaged velocity ). The Navier-Stokes equations are vector ( m s 2 ∂(ρv) T = −(∇ρv)·v+∇· µ ∇v + (∇v) − µ(∇ · v)I −∇P+ρg, (2.3) ∂t 3 2.1 The Transport Model 15 kg ), I the unit tensor and g the gravitational where µ is the viscosity ( m·s acceleration. The transport equation for thermal energy 2 is cp ∂(ρT ) ∂t = −cp ∇ · (ρvT ) + ∇ · (λ∇T ) + ! N N X X DTi ∇fi Hi +∇ · RT + ∇ · ji Mi fi mi i=1 − K N X X i=1 Hi νik Rkg , (2.4) i=1 k=1 J W with cp specific heat ( mol·K ), λ the thermal conductivity ( m·K ) and R the gas constant. Gas species i has a mole fraction f i , a molar mass mi , a thermal diffusion coefficient DTi , a molar enthalpy Hi and a diffusive mass flux ji . For a definition of diffusive flux we refer to Section 2.1.1. The stoichiometric coefficient of the ith species in the k th gas-phase reaction with net molar reaction rate Rkg is νik . The third term on the right-hand side of (2.4) is due to the Dufour effect (diffusion-thermo effect). The Dufour effect is the reversed process of the Soret effect, which is the process of thermal diffusion. The fourth term on the right represents the transport of heat associated with the inter-diffusion of the chemical species. Both terms, the third and the fourth, are probably not too important in CVD. The last term at the right-hand side of (2.4) represents the heat production or consumption due to chemical reactions in the gas mixture. In processes where the gas-phase reactions are neglected this term will not be important. Also in processes where the reactants are highly diluted in an inert carrier gas this term is negligible. 2.1.1 Transport Equations for Gas Species We formulate the transport equations for the gas species in terms of mass fractions and diffusive mass fluxes. The convective mass flux for the i th gas species is ρωi v. The diffusive mass flux ji of the ith species is defined as ji = ρωi (vi − v), with vi the velocity vector of species i and v the mass averaged velocity. See for instance [2]. Mass diffusion can be decomposed in concentration T diffusion jC i and thermal diffusion ji : T ji = j C i + ji . 2 Professor Kleijn remarked that the well-posedness of this equation is doubtful. 
Due to the fact that the zero of the enthalpy can be chosen in a free way, the ‘meaning’ of this equation comes into play. Choosing a not-suitable zero of the enthalpy causes undesired large or small values of the enthalpy. 16 The Mathematical Model See [16]. The first type of diffusion, j C i , occurs as a result of a concentration gradient in the system. We will refer to it as ordinary diffusion. Thermal diffusion is the kind of diffusion resulting from a temperature gradient. Ordinary Diffusion Depending on the properties of the gas mixture, different approaches are possible to describe ordinary diffusion. First, in a binary gas mixture, e.g., a gas mixture consisting of two species (N = 2), the ordinary diffusion mass flux is given by Fick’s Law : j1 = −ρD12 ∇ω1 , j2 = −ρD21 ∇ω2 , with Dij the diffusion coefficient for ordinary diffusion in the pair of gases 1 and 2. In for instance [2] is derived that for a binary mixture there is just one ordinary diffusity, D12 = D21 . For a multi-component gas mixture there are two approaches, the StefanMaxwell equations and an alternative approximation derived by Wilke. The Stefan-Maxwell equations are a general expression for the ordinary diffusion fluxes jC i in a multi-component gas mixture. In terms of mass fraction and fluxes they are given as N mX 1 C ωi jC ∇ωi + ωi ∇(ln m) = j − ω j ji , ρ mj Dij j=1 i = 1, . . . , N − 1, (2.5) with m the average mole mass of the mixture m= N X fi mi . i=1 The coefficients Dij are the binary diffusion coefficients for gas species i and j. In an N -component gas mixture there are N − 1 Stefan-Maxwell equations (2.5) and the additional equation N X jC i = 0. i=1 The following explicit expression for j C i is taken from [16], jC i = ρDi − ρωi Di ∇(ln m) + mωi Di N X j=1,j6=i jC i , mj Dij (2.6) 2.1 The Transport Model 17 with Di an effective diffusion coefficient for species i Di = PN 1 fj j=1,j6=i Dij . An other approximate expression for the ordinary diffusion fluxes has been derived by Wilke in the 1950’s. In Wilkes’ approach the diffusion of species i is 0 jC (2.7) i = ρDi ∇ωi , with effective diffusion coefficient D0i = (1 − fi ) PN 1 fj j=1,j6=i Dij . In Wilke’s approach the diffusion is written in the form of Fick’s Law of diffusion with an effective diffusion coefficient instead of a binary diffusion coefficient. Finally we remark that in the case of a binary gas mixture, the StefanMaxwell equations and the Wilke approach lead to Fick’s Law of binary diffusion. In the case of a multi-component gas mixture both approaches are identical if the gas species are highly diluted (f 1 , ω1 1). The following remark is taken from [16]. When Wilke’s approach to the Stefan-Maxwell equations is used to compute the diffusive fluxes, the set of transport equations (2.9) form an independent set, which is not consistent with N X ωi = 1. i=1 To fulfill this constraint one of the transport equations has to be replaced by the constraint itself. Thermal Diffusion A homogeneous gas mixture will be separated due to the effect of thermal diffusion (Soret effect) under the influence of a temperature gradient. This so-called Soret effect is usually small in comparison with ordinary diffusion. Only in systems, actually we mean CVD reactors, with large temperature gradients thermal diffusion may be an important effect. For instance in coldwall CVD reactors we have large temperature gradients. 
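Before continuing with thermal diffusion, the ordinary-diffusion relations above can be collected in a small sketch: the conversion between mole and mass fractions from Section 2.1 and the Wilke effective diffusion coefficients D′_i. This is only an illustration; the function names and the numbers in the example are ours (loosely inspired by a dilute silane/helium mixture) and are not data from [16].

```python
import numpy as np

def mass_fractions(f, m_species):
    """Mass fractions w_i = f_i m_i / m from mole fractions f_i,
    with m = sum_i f_i m_i the average molar mass of the mixture."""
    m_mix = np.dot(f, m_species)
    return f * m_species / m_mix

def wilke_coefficients(f, D_bin):
    """Effective diffusion coefficients D'_i = (1 - f_i) / sum_{j != i} f_j / D_ij,
    given mole fractions f and the matrix of binary coefficients D_bin[i, j]."""
    n = len(f)
    D_eff = np.empty(n)
    for i in range(n):
        s = sum(f[j] / D_bin[i, j] for j in range(n) if j != i)
        D_eff[i] = (1.0 - f[i]) / s
    return D_eff

# Illustrative three-component mixture (mole fractions sum to 1)
f = np.array([0.001, 0.009, 0.990])
m = np.array([32.12e-3, 62.22e-3, 4.00e-3])   # molar masses [kg/mol], illustrative
D = np.array([[0.0,    1.0e-5, 4.8e-6],
              [1.0e-5, 0.0,    3.7e-6],
              [4.8e-6, 3.7e-6, 0.0]])          # binary D_ij [m^2/s], illustrative
print(mass_fractions(f, m), wilke_coefficients(f, D))
```

With D′_i available, the Wilke flux is simply j^C_i = −ρ D′_i ∇ω_i, i.e. Fick's law with an effective coefficient, as stated in (2.7).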
Thermal diffusion causes in general a movement of relatively heavy molecules to ‘colder’ regions of the reactor chamber and a movement of light molecules to hotter parts. The thermal diffusive mass flux is given by jTi = −DTi ∇(ln T ). (2.8) 18 The Mathematical Model In (2.8) DTi is the multi-component thermal diffusion coefficient for species i. In general it will be a function of the temperature T and the composition of the gas mixture. As remarked in [16] the coefficient D Ti is not a function of the pressure. An exact computation of D Ti can be found in [16, Appendix F], where can be observed that indeed no pressure is used. Finally, we remark that adiabatic systems do not have thermal diffusion. The Species Concentration Equations We assume that in the gas-phase there are K reversible reactions. For the mole is given. The balance k th reaction a net molar reaction rate R kg m 3 ·s th equation for the i gas species, i = 1, . . . , N , in terms of mass fractions and diffusive mass fluxes is then given as K X ∂(ρωi ) νik Rkg . = −∇ · (ρvωi ) − ∇ · ji + mi ∂t (2.9) k=1 From the above general PDE for species transport and chemical gas phase reactions, the approximate Wilke and exact Stefan-Maxwell expressions for ordinary diffusion and expression for thermal diffusion we have the following PDEs for solving the species concentrations : Using the Wilke approximation : ∂(ρωi ) ∂t = −∇ · (ρvωi ) + ∇ · (ρD0i ∇ωi ) + ∇ · (DTi ∇(ln T )) + +mi K X νik Rkg , (2.10) k=1 Using the exact Stefan-Maxwell equations : ∂(ρωi ) ∂t = −∇ · (ρvωi ) + ∇ · (ρDi ∇ωi ) + +∇ · (ρωi Di ∇(ln m)) − ∇ · mωi Di +∇ · (DTi ∇(ln T )) + mi K X νik Rkg . N X j=1,j6=i jC i mj Dij + (2.11) k=1 Net Molar Reaction Rates The last term on the right hand side of (2.9), (2.10) and (2.11) describes the consumption and production of the i th species due to homogeneous reactions 2.1 The Transport Model 19 in the gas-phase. In this section we give expressions for the net molar reaction rates Rkg . As before, we assume that there are K reversible gasphase reactions. Since different species can act as reactant and as product, we use the general notation N X i=1 g kk,forward k − νik kAi N X g kk,backward i=1 kνik kAi k = 1, . . . , K. (2.12) In (2.12) Ai , i = 1, . . . , N , represent the different gas species in the reactor g g chamber, kk,forward the forward reaction rate constant and k k,backward the backward reaction rate constant. By taking νik > 0 for the products of the forward reaction νik < 0 for the reactants of the forward reaction and kνik k = νik and k − νik k = 0 for νik ≥ 0 kνik k = 0 and k − νik k = |νik | for νik ≤ 0 equation (2.12) represents a general equilibrium reaction, with reactants appearing at the left-hand side and products on the right-hand side. The net reaction rate Rkg for the k th reaction is given as Rkg = Rkg,forward − Rkg,backward = g = kk,forward N Y i=1 k−νik k ci g − kk,backward N Y kνik k ci , (2.13) i=1 P fi 3 RT . See for instance [16] and [21, Chapter 4]. with ci = g The forward reaction rate constants k k,forward are fitted as a function of the temperature T as follows : g kk,forward (T ) = Ak T βk e −Ek RT , (2.14) where Ak is the pre-exponential factor homogeneous reaction rate for the k th reaction, βk the temperature coefficient and Ek the activation energy for g the k th reaction. Expressing the rate constant k k,forward as done in (2.14) is the (modified) Law of Arrhenius. See also [17]. The backward reaction rate constants are self-consistent with the forward reaction rate constants. 
Using thermo-chemistry and doing some calculations the following relation appears P N g i=1 νik k (T ) RT k,forward g kk,backward (T ) = , Kkg P0 3 Recall that the mole fraction of the ith gas species fi is related with the mass fraction ωi in the following way mi ωi , fi = m with mi the molar mass of species i and m the average molar mass of the gas-mixture. 20 The Mathematical Model with Kkg the reaction equilibrium constant given by Kkg (T ) = e− 0 (T )−T ∆S 0 (T ) ∆Hk k RT N X ∆Hk0 (T ) = i=1 N X ∆Sk0 (T ) = , with νik Hi0 (T ) and νik Si0 (T ). i=1 Remark 2.1. Typically, the pre-exponential factor homogeneous reaction rate for the k th reaction Ak are of order 105 − 1029 . Due to these large factors Ak the equations (2.10) and (2.11) become very stiff. 2.1.2 Complete Mathematical Model To complete the set of equations we add the ideal gas law, i.e., we assumed the gas mixture to behave as an ideal gas. In terms of of the mass density ρ and molar mass m of the gas it is given as P m = ρRT, (2.15) with P the pressure, R the gas constant and T the temperature. As final result, in the CVD reactor the following coupled set of an algebraic equation and partial differential equations has to be solved. This set consists of the following N +3+d, with d the dimension of space, equations : 1. Continuity equation (2.2), 2. Navier-Stokes equations (d equations, d = 1, 2, 3) (2.3), 3. Transport equation for thermal energy (2.4), 4. Transport equations for the ith gas species, i = 1, . . . , N (2.9) and 5. Ideal gas law (2.15). The variables to be solved from this set of equations are the mass averaged velocity vector v, the pressure P, the temperature T , the density ρ and the mass fractions ωi for i = 1, . . . , N . 2.2 Boundary Conditions In order to solve the system of PDE’s describing the CVD process appropriate boundary conditions have to be chosen. If we start with a simple reactor chamber, consisting of an inflow and an outflow boundary, a number of solid walls and a reacting surface, then on each of these boundary we have to prescribe a condition. 2.2 Boundary Conditions 21 To avoid misunderstanding, we will first give a definition of the normal n. By the normal n we mean the unit normal vector pointing outward the plane or boundary. Solid Wall On the solid, non reacting walls we apply the no slip and impermeability conditions for the velocity vector, v = 0. On this wall we can have either a prescribed temperature T , T = Twall , or a zero temperature gradient for adiabatic walls, n · ∇T = 0. Furthermore, on the solid wall there is no mass diffusion, T n · ji = n · (jC i + ji ) = 0. For adiabatic walls and adiabatic systems the above boundary condition reduces to n · ∇ωi = 0. Reacting Surface On the reacting surface we have to prescribe the surface reactions. We assume that on the surface S different surface reactions can take place. There will be a net mass production P i at these surfaces. For ωi this net mass production is given by Pi = m i S X σis RsS , s=1 where mi is the mole mass, σis the stoichiometric coefficient for the gaseous and solid species for surface reaction s (s = 1, . . . , S) and R sS the reaction rate of reaction s on the surface. We assume the no-slip condition on the wafer surface. Since there is mass production in normal direction the normal component of the velocity will not be equal to zero. Now, we find for the velocity the conditions n·v = S N X 1X σis RsS , mi ρ s=1 n × v = 0. 
i=1 (2.16) 22 The Mathematical Model on the reacting boundary.4 For the wafer temperature we have T = Twafer . On the wafer surface the net mass production of species i must be equal to the normal direction of the total mass flux of species i. The total mass flux of species i is given by ρωi v + ji . Therefore we have on the wafer surface the following boundary condition n · (ρωi v + ji ) = mi S X σis RsS . s=1 Condition (2.16), n × v = 0, gives always two conditions if we consider the three dimensional case. The normal on the reacting surface is known, n = [n1 , n2 , n3 ]T and the velocity vector is the vector v = [v 1 , v2 , v3 ]T , with v1 , v2 and v3 unknowns. We compute n × v to derive the number of conditions for v. Hence, n2 v3 − n 3 v2 n × v = n3 v1 − n 1 v3 . n1 v2 − n 2 v1 Since ni is known and vi not, we are able to compute the three velocity components from the equation n × v = 0. The first equation gives v 3 = nn3 v2 2 and the second v1 = nn1 n2 n3 v3 2 . Hence, the third equation is also satisfied. Therefore, from an outer product follows a condition for only two components of the velocity vector. Inflow We take as boundary conditions in the inflow n · v = vin n × v = 0, where the inflow velocity vin with prescribed inflow of each species, denoted by Qi for species i, is given as vin = N N X Tin P 0 10−3 1 X −3 Tin Q = 5.66 · 10 Qi .5 i T 0 Pin 60 Ain Ain Pin i=1 4 (2.17) i=1 The outer product of two vectors u and v is defined as u × v = (u2 v3 − u3 v2 )e(1) + (u3 v1 − u1 v3 )e(2) + (u1 v2 − u2 v1 )e(3) , with e(α) the unit vector in the xα direction. The outer product is not symmetric, hence, u × v = −v × u. 5 We used in (2.17) that the standard pressure P 0 = 1.013 · 105 Pa and the standard temperature T 0 = 298.15K. 2.2 Boundary Conditions 23 The temperature in the inflow is prescribed as T = Tin . In [16] the author also prescribes n · (λ∇T ) = 0, to prohibit a conductive heat flow through the inflow opening. After further analysis it was clear that λ = 0 is taken on the boundary 6 . Total mass flow of species i flowing into the reactor chamber must correspond to Qi in the following way, see [16] : −3 Qi mi T n · ρωi v + jC , i + ji = 5.66 · 10 RAin with R the universal gas constant. To satisfy (2.18) we put T n · jC i + ji = 0, and n · (ρωi v) = 5.66 · 10−3 Qi mi . RAin (2.18) (2.19) (2.20) By taking the concentration- and thermal- diffusion coefficients on the boundary equal to zero, we satisfy (2.19). Elementary analysis of (2.20) gives that the inlet concentrations should be taken fixed as Outflow mi Qi . ωi = PN M Q j j j=1 (2.21) In the outflow opening we assume zero gradients in the normal direction of the outflow for the total mass flux vector, the heat flux and the diffusion fluxes. Furthermore, we assume that the velocity at the outflow boundary is in the normal direction. For the velocity we have the boundary conditions n · (∇ρv) = 0, n × v = 0. For the heat and diffusion fluxes we get n · (λ∇T ) = 0, and T n · (jC i + ji ) = 0. 6 Is it possible to have a discontinuous thermal conductivity coefficient λ ? 24 The Mathematical Model Chapter 3 Stiffness The note of stiffness has already been mentioned in Chapter 2. In this (short) chapter we pay some attention to this notion. While the intuitive meaning of stiff is clear to all specialists, much controversy is going on about it’s correct ‘mathematical definition’. 
The most pragmatical opinion is also historically the first one, Curtiss & Hirschfelder [4] : “Stiff equations are equations where certain implicit methods, in particular BDF, perform better, usually tremendously better, than explicit ones.” A more recent opinion of Hundsdorfer & Verwer [13] : “Stiffness has no mathematical definition. Instead it is an operational concept, indicating the class of problems for which implicit methods perform (much) better than explicit methods.” The eigenvalues of the Jacobian δf δy play certainly a role in this decision, but quantities such as the dimension of the system, the smoothness of the solution or the integration interval are also important. In this chapter we give first an appropriate example. Subsequently we give a stability analysis for the Euler forward method, where the role of the eigenvalues of the Jacobian become clear. 3.1 Example of Stiff Equation As mentioned before, stiff equations belong to the class of ODEs for which explicit methods do not work. There are many ‘standard’ examples of stiff equations. Next, we treat an example coming from chemical reacting systems, taken from [26]. We selected this ‘standard’ example because of the connectivity with the topic of this research. 26 Stiffness Example 3.1. Consider the chemical species A, B and C. The Robertson reaction system is then given as : A B+B B+C 0.04 −→ B (slow ) 3·107 −→ C + B (very fast) 104 −→ A+C (fast) which leads to the equations A : y10 = −0.04y1 +104 y2 y3 , y1 (0) = 1 B : y20 = 0.04y1 −104 y2 y3 −3 · 107 y22 , y2 (0) = 0 C : y30 = 3 · 107 y22 , y3 (0) = 0. Introduce y1 y = y2 , y3 so that (3.1) can be written as y 0 = f (y), y(0) = [1, 0, 0]T , (3.1) (3.2) (3.3) where the definition of f (y) is self-evident. The ODE system will be solved using two methods : 1. Euler Forward, which is an explicit method, yn+1 − yn = f (yn ), τ (3.4) 2. Euler Backward, which is implicit, yn+1 − yn = f (yn+1 ). τ (3.5) The numerical solutions obtained for y 2 (t) with Euler Forward and Backward are displayed in Figure 3.1. We observe that the solution y 2 rapidly reaches a quasi-stationary position in the vicinity of y 20 = 0, which in the beginning is at 0.04 ≈ 3 · 107 y22 , hence y2 ≈ 3.65·−5 , and then very slowly goes back to zero again. It can be seen in Figure 3.1 that the implicit method, i.e. Euler Backward, integrates (3.1) without problem. The explicit Euler Forward has violent oscillations when the step-size τ is too large. We remark that the solution of Euler Backward is of good quality. Apparently, in the sense of Curtiss and Hirschfelder (and Hundsdorfer and Verwer) the system of ODEs (3.1) is stif f. 3.1 Example of Stiff Equation 165 −5 Concentration of species B x 10 1 27 Concentration of species B x 10 6 5 0 4 −1 3 −2 2 1 −3 0 −4 −1 −5 −2 −6 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 −3 0 0.05 0.1 0.15 time t −5 4 4 3.5 0.25 0.3 0.35 0.25 0.3 0.35 Concentration of species B x 10 3.5 3 3 2.5 2.5 2 2 1.5 1.5 1 1 0.5 0 0.2 time t −5 Concentration of species B x 10 0.5 0 0.05 0.1 0.15 0.2 time t 0.25 0.3 0.35 0 0 0.05 0.1 0.15 0.2 time t Figure 3.1: Numerical solution for y 2 (t) with Euler Forward with step-size 1 1 1 (upper-left), with τ = 1000 (upper-right), with τ = 1250 (lower-left) τ = 50 1 and Euler Backward with steps-size τ = 50 (lower-right). Remark that the axis in the upper-left figure has another scale, i.e., it is scaled by a factor 10165 . 28 3.2 Stiffness Stability Analysis for Euler Forward Method Let ϕ(t) be a smooth solution of y 0 = f (t, y). 
We linearize f in its neighborhood as follows y 0 (t) = f (t, ϕ(t)) + δf (t, ϕ(t))(y(t) − ϕ(t)) + · · · δy (3.6) and introduce ȳ(t) = y(t) − ϕ(t) to obtain ȳ 0 (t) = f (t, ϕ(t)) + δf (t, ϕ(t))ȳ(t) + · · · = J(t)ȳ(t) + · · · δy (3.7) As a first approximation, we consider the Jacobian J(t) as constant and neglect the error terms. Omitting the bars we arrive at y 0 = Jy. (3.8) If we now apply the Euler forward method to (3.8), we obtain yn+1 = R(hJ)yn , (3.9) R(z) = 1 + z. (3.10) with To study the behavior of (3.9), we assume that J is diagonalizable with eigenvectors v1 , . . . , vk . As a consequence, y0 can be written in this basis as y0 = k X αi vi . (3.11) i=1 If the λi are the eigenvalues of J, then using (3.11), (3.9) can be written as ym = k X (R(hλi ))m αi vi . (3.12) i=1 It is clear that ym remains bounded for m → ∞, if for all eigenvalues the complex number z = hλi lies in the set S = {z ∈ C : |R(z)| ≤ 1}. (3.13) This leads to the explanation of the results found in the example of the previous section. Example 3.1. (Continued) The Jacobian for the Robertson reaction system (3.1) is given by −0.04 104 y3 104 y2 0.04 −104 y3 − 6 · 107 y2 −104 y2 . (3.14) 0 6 · 107 y2 0 3.2 Stability Analysis for Euler Forward Method 29 In the neighborhood of the equilibrium y 1 = 1, y2 = 0.0000365, y3 = 0 the Jacobian is equal to −0.04 0 0.365 0.04 −2190 0.365 , (3.15) 0 2190 0 with eigenvalues λ1 = 0, λ2 = −0.405 and λ3 = −2189.6. (3.16) The third eigenvalue, λ3 , produces stiffness. For stability for Euler Forward 5 2 = 5474 ≈ 0.0009. we need λ3 τ ≥ −2. Hence, τ ≤ 2189.6 The implicit Euler Backward method is unconditionally stable, thus there are no restrictions on the step-size τ . This confirms the numerical observations. 30 Stiffness Chapter 4 Test Problem The mathematical model of the CVD process described in Chapter 2 consists of a number of coupled non-linear partial differential equations with boundary conditions. In general, it is impossible to solve this set of PDEs analytically. As mentioned in [16], for simple reactor configurations and a drastically simplified transport model useful analytic solutions can be obtained. See for example [7, 18]. In this study we have a general model, which can be applied to various reactor configurations and processes. Therefore, we use numerical methods to find approximate solutions of the full set of PDEs. The aim is to find the most efficient solution method. The finite volume approach is the most widely used class of numerical methods for computing heat and mass transfer in variable property fluid flows with chemical reactions. Therefore, the finite volume approach will be used in this study. A detailed description of the finite volume approach can be found in [24] and numerous other publications. The main goal of this project is to find an efficient solution method for solving the advection-diffusion-reaction equations, which are also known in this report as species equations. In the test problem we restrict ourselves to solve a coupled system of species equations on a 2D computational grid. As a first approximation we also neglect the surface reactions. This results in having inflow, outflow and solid wall boundary conditions. Stiffness in this system comes from the gas phase reactions. In this chapter a 2D axisymmetric test problem will be described. The chapter is organized as follows. We start with defining the computational grid. Subsequently, we give the finite volume discretization of the species equations. 
The last sections of this chapter are reserved for more details on the chemical properties of the test model. 32 Test Problem 4.1 The Computational Grid We restrict ourselves to the two-dimensional case. We use a cell-centered non-uniform grid. There are two ways to arrange the unknowns on the grid : 1. colocated arrangement, 2. staggered arrangement. Visualization of both arrangements is done on a cell-centered uniform grid in Figure 4.1. When all discrete unknowns are located in the center of a cell, the arrangement is called colocated. In the case that the pressure, temperature and species mass fraction are located in the cell-centers and the velocity components are located at the cell face centers, the arrangement is called staggered. In this test problem we use a colocated grid, because the CFD package X-stream also uses a colocated grid. Recall that the solvers developed in this project will be implemented in X-stream. Figure 4.1: Two-dimensional cell-centered uniform grid with a colocated arrangement of the unknowns (left) and a staggered arrangement of the unknowns (right). 4.2 Finite Volume Discretization of the General Transport Equation The test problem we have in this chapter consists of N coupled species concentration equations. Recall that N represents the number of gas species in the reactor. With respect to the diffusion flux we assume the following : • we neglect thermo-diffusion, thus we consider diffusion to exist of ordinary diffusion only, • ordinary diffusion is described by the Wilke approach (Fick’s law). 4.2 Finite Volume Discretization of the General Transport Equation 33 The species concentration equations then take the form K X ∂(ρωi ) νik Rkg , = −∇ · (ρvωi ) + ∇ · (ρD0i ∇ωi ) + mi ∂t (4.1) k=1 where K is the number of reactions in the gas phase and D 0i the effective diffusion coefficient. Equation (4.1) is valid for all species in the gas mixture. Since the relation N X ωi = 1, (4.2) i=1 holds, only (N −1) coupled species concentration equations have to be solved. The general form of equation (4.1) is ∂(ρφ) = −∇ · (ρvφ) + ∇ · (Γφ ∇φ) + Sφ , ∂t (4.3) whereby φ = ωi , Γφ = ρD0i and S φ = mi K X νik Rkg . (4.4) k=1 The discretization of advection-diffusion-reaction equation (4.3) will be done for its 2D axisymmetric (r, z) form, that is ∂(rρφ) = −∇r,z · (rρvφ) + ∇r,z · (rΓφ ∇r,z φ) + Sφ , ∂t (4.5) ∂· ∂· T , ∂z ) . Remark that the 2D Cartewhere ∇r,z is the operator ∇r,z (·) = ( ∂r sian form (4.3) of (4.5) can be obtained by taking r = 1 in (4.5). Next, we will compute the volume integral over the control volume surrounding the grid point P , see Figure 4.2. Grid point P has four neighbors, indicated by N (orth), S(outh), E(ast) and W (est), and the corresponding walls are indicated by n, s, e and w. For the sake of clarity, we write in the remainder of this chapter u , (4.6) v= v where u is the velocity component in r-direction and v the velocity component in z-direction. Integrating Equation (4.5) over the control volume ∆r∆z surrounding P gives ZZ ∂(rρφ) dr dz = (4.7) ∂t ∆r∆z ZZ − ∇r,z · (rρvφ) + ∇r,z · (rΓφ ∇r,z φ) + Sφ dr dz. (4.8) ∆r∆z 34 Test Problem N ∆z vn n n w ue W uw E P ∆z e s vs S ∆r Figure 4.2: Grid cells By using the Gauß theorem1 we may write Z Z Z Z ∂(rρφ) dr dz = − rρvφ dr + rρvφ dr − rρuφ dz + rρuφ dz + ∂t n s e w ∆r∆z Z Z Z Z ∂φ ∂φ ∂φ ∂φ rΓφ dr − rΓφ dr + rΓφ dz − rΓφ dz + ∂z ∂z ∂r ∂r n s e w ZZ rSφ dr dz.(4.9) ZZ ∆r∆z 1 Theorem (Divergence Theorem of Gauß). 
For any volume V ⊂ Rd with piecewise smooth closed surface S and any differentiable vector field F we have Z ∇ · F dV = V where n is the outward unit normal on S. Z S F · n dS, 4.2 Finite Volume Discretization of the General Transport Equation 35 The remaining integrals are approximated as d(rP ρP φP ) ∆r∆z = −rn ρn vn φn ∆r + rs ρs vs φs ∆r − re ρe ve φe ∆z + rw ρw vw φw ∆z + dt ∂φ ∂φ ∂φ ∂φ ∆r − rs Γφ,s ∆r + re Γφ,e ∆z − rw Γφ,w ∆z + rn Γφ,n ∂z n ∂z s ∂r e ∂r w rp Sφ,P ∆r∆z. (4.10) In the above formulation the time derivative will not be approximated yet. The next part of this report is dedicated to the subject of time integration methods. The densities ρ and the diffusion coefficients Γ φ at the cell walls are approximated by the harmonic mean as 2ρP ρN , ρP + ρ N 2ΓP ΓN , Γn = ΓP + Γ N ρn = and similar for the other cell walls. Because the density function is smooth in the computational domain, its (harmonic mean) approximation can be replaced by 1 (4.11) ρn = (ρN + ρP ). 2 We remark that the latter is done in [16]. For the approximation of the value of φ and its first derivative on the cell walls, we consider the following three methods : φn = 12 (φN + φP ) ∂φ φN −φP ∂z n = ∆zn φP for vn ≥ 0 1st order upwind scheme : φn = φ N for vn < 0 φN −φP ∂φ ∂z n = ∆zn φN for P e∆n < −2 1 hybrid scheme : φn = (φ + φP ) for |P e∆n | ≤ 2 2 N φP for P e∆n > 2 φN −φ P for |P e∆n | ≤ 2 ∂φ ∆zn ∂z n = 0 for |P e∆n | > 2 central scheme : where the cell Peclet number P e∆ is on the n-wall is defined as P e∆n = and similar for the other walls. ρn vn ∆zn , Γφ,n (4.12) 36 Test Problem The central scheme has second order accuracy, but becomes unstable for large |P e∆n |. It is generally known that for large |P e ∆n | the central scheme leads to wiggles in the solution. The upwind scheme damps these wiggles. The disadvantage is that in turn for damping these wiggles one gets loss of accuracy, i.e. the upwind scheme has only first order accuracy. Furthermore, the upwind scheme produces ‘large’ numerical diffusion. The hybrid scheme combines the advantages of the central and the upwind scheme. For |P e∆ | > 2, it locally switches to the upwind scheme and sets the diffusion contribution to zero. Finally we remark that it is common in CVD processes that the Reynolds number is low. Based on previous research, low Reynolds number imply low Peclet numbers. This means that the using the hybrid scheme results usually in a central differencing scheme. 4.3 Boundary Conditions For every ADR-equation of the form (4.5) a set of discretized boundary conditions is needed. Subsequently we will discretize the inflow, solid wall and outflow boundary conditions. Recall that in the test-problem we do not consider reacting surfaces. Inflow In Figure 4.3 an inflow opening normal to the z-direction has been illustrated. For the species concentration equations (4.5) we wish to impose Figure 4.3: Inflow boundary condition boundary conditions (2.19) and (2.21). The first is simply done by setting the diffusion constant Γφ,n equal to zero. The second is satisfied by replacing φn in (4.10) by φin , where φin is the prescribed inlet concentration. 4.3 Boundary Conditions 37 Solid Wall In Figure 4.4 a solid wall is illustrated. Since there is assumed that there is no thermal diffusion, we only have n · ∇φ = 0, (4.13) on wall w. 
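The face values and gradients that enter the discrete balance (4.10), computed with the central, upwind or hybrid scheme of Section 4.2 and with the inflow modification above, can be packaged per cell wall. The sketch below is ours and only illustrates the selection logic for a single wall (here the n-wall, with N the neighbouring cell in the positive z-direction); it is not code from X-stream or [16].

```python
def face_state(phi_P, phi_N, rho_n, v_n, dz_n, gamma_n, scheme="hybrid"):
    """Value and gradient of phi on the wall between cell P and its neighbour N.

    central: phi_n = (phi_N + phi_P)/2,          dphi/dz = (phi_N - phi_P)/dz_n
    upwind : phi_n taken from the upstream cell, dphi/dz = (phi_N - phi_P)/dz_n
    hybrid : central for |Pe| <= 2, otherwise upwind with the diffusive term switched off
    """
    central_phi = 0.5 * (phi_N + phi_P)
    grad = (phi_N - phi_P) / dz_n
    upwind_phi = phi_P if v_n >= 0.0 else phi_N

    if scheme == "central":
        return central_phi, grad
    if scheme == "upwind":
        return upwind_phi, grad

    # hybrid: switch on the cell Peclet number (4.12); a zero diffusion
    # coefficient (e.g. at an inflow wall) forces the upwind branch
    pe = float("inf") if gamma_n == 0.0 else rho_n * v_n * dz_n / gamma_n
    if abs(pe) <= 2.0:
        return central_phi, grad
    return upwind_phi, 0.0
```

At an inflow wall, Γφ,n is set to zero and the prescribed value φin is substituted for the convective face value, exactly as described above.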
Substituting this condition into (4.9) gives a modification of the integrals R • rρuφ dz changes into w Z rρuφ dz = rρw uw φP ∆z (4.14) w where we used that φW = φP (comes from the fact that ∂φ/∂r = 0) and φw = 21 (φP + φW ), R w rΓφ ∂φ ∂r dz becomes zero, because ∂φ = 0. ∂r (4.15) Solid Wall • N n w E P W Virtual Point e s S Figure 4.4: Solid wall boundary condition Outflow The outflow boundary condition with respect to the concentrations, without thermo diffusion, is n · ∇φ = 0. Remark that this b.c. is equal to the solid wall boundary condition and therefore we refer to the previous subsection for the discretization. 38 4.4 Test Problem Chemistry Properties In this section we present the details of the chemical model used for this test problem. The gas mixture in the CVD reactor of this test problem consists of 7 species, see Section 4.4.1. This model is a simplified model of a CVD process that deposits silicon Si from silane SiH 4 . The process of depositing silicon from silane is one of the most applied and studied CVD processes in the IC industry. In this problem we use the reactor configuration as given in Figure 4.5. This reactor has one inflow boundary, a number of solid walls, one reacting surface and two outflow boundaries. The reactor is axisymmetric, and hence, the computational domain results into the domain given in Figure 4.6. Furthermore, we assume that the gas flow in the reactor is also axisymmetric. Inflow 10 cm. 30 cm. substrate Outflow 35 cm. Figure 4.5: Reactor geometry We conclude with some last remarks about this CVD process. From the top of the reactor, the inflow boundary, a gas mixture enters the reactor. This gas mixture consists of silane SiH 4 and the carrier gas helium He. When the mixture enters the reactor it has a uniform temperature T in = 300 K and a uniform velocity Uin . The inlet silane mole fraction is f in,SiH4 = 0.001, the rest is helium. At a distance of 10 cm. below the inlet, a susceptor with temperature Ts = 1000 K is placed. The susceptor has a diameter of 30 cm. On the susceptor surface reactions take place leading to the deposition of silicon. In the first numerical experiments, these surface reactions will be omitted. 4.4.1 Gas-Phase Reaction Model Due to gas phase reactions in the reactor the gas mixture consists, besides helium and silane, of the following species : 4.4 Chemistry Properties 39 INFLOW θ z r OUTFLOW SUSCEPTOR Figure 4.6: Computational domain SiH4 SiH2 Si2 H6 H2 He H2 SiSiH2 Si3 H8 We use the following reaction mechanism : G1 : G2 : G3 : G4 : G5 : SiH4 Si2 H6 Si2 H6 SiH2 +Si2 H6 2Si2 SiH2 + H4 SiH4 + SiH2 H2 SiSiH2 + H2 Si3 H8 H2 SiSiH2 Recall from Chapter 2 that the forward reaction rate k forward is fitted according to the modified Law of Arrhenius as g kforward (T ) = Ak T βk e −Ek RT . (4.16) The backward reaction rates are self consistent with PN g i=1 νik kforward (T ) RT g kbackward (T ) = , Kg P0 where the reaction equilibrium K g is approximated by K g (T ) = Aeq T βeq e −Eeq RT . (4.17) In Table 4.1 the forward rate constants A k , βk and Ek are given. The fit parameters for the gas phase equilibrium are given in Table 4.2. 
40 Test Problem Reaction Ak βk Ek G1 1.09 · 1025 −3.37 256000 −4.24 243000 0.00 236000 0.00 0 0.00 0 G2 G3 G4 G5 3.24 · 1029 7.94 · 1015 1.81 · 108 1.81 · 108 Table 4.1: Fit parameters for the forward reaction rates (4.16) Reaction Aeq βeq Eeq G1 6.85 · 105 0.48 235000 −1.68 229000 0.00 187000 1.64 233000 0.00 272000 G2 G3 G4 G5 1.96 · 1012 3.70 · 107 1.36 · 10−12 2.00 · 10−7 Table 4.2: Fit parameters for the equilibrium constants (4.17) 4.4 Chemistry Properties 41 Other Data Besides the chemical model of the reacting gases, also some other properties of the gas mixture are needed. As mentioned before, the inlet temperature of the mixture is 300 K, and the temperature at the wafer surface is 1000 K. The pressure in the reactor is 1 atm., which is equal to 1.013 · 10 5 Pa. Other properties are given in Table 4.3. The diffusion coefficients, according to Fick’s Law, are given in Table 4.4. Density ρ(T ) Specific heat cp Viscosity µ Thermal conductivity λ 1.637 · 10−1 · 300 T 5.163 · 103 1.990 · 10−5 1.547 · 10−1 T 300 T 300 [kg/m3 ] [J/kg/K] 0.7 0.8 [kg/m/s] [W/m/K] Table 4.3: Gas mixture properties SiH4 SiH2 H2 SiSiH2 Si2 H6 Si3 H8 H2 4.77 · 10−6 5.38 · 10−6 3.94 · 10−6 3.72 · 10−6 3.05 · 10−6 8.02 · 10−6 T 1.7 300 T 1.7 300 T 1.7 300 T 1.7 300 T 1.7 300 T 1.7 300 Table 4.4: Diffusion coefficients D0i , according to Fick’s Law 42 Test Problem Part II Time Integration Methods Chapter 1 Introduction In this part of the thesis we present time integration methods which are of practical relevance for solving advection diffusion reaction problems. These methods have been selected from recent literature concerning this problem, e.g. [13, 20, 36, 37] As seen in Part I the advection diffusion reaction equation is a PDE with appropriate boundary conditions and initial values. To approximate the solution u(x, t) of an ADR equation, or system of ADR equations, we use the Method of Lines approach. The PDE is then first discretized in space on a certain grid Ωh with mesh width h > 0 to yield a semi discrete system w0 (t) = F (t, w(t)), 0 < t ≤ T, w(0) given, (1.1) m where w(t) = (wj (t))m j=1 ∈ R , with m proportional to the number of grid points in spatial direction. The next step is to integrate the ODE system (1.1) with an appropriate time integration method. A number of methods is treated in the following sections. Among others we will discuss 1. The class of Runge Kutta Methods including the Rosenbrock Methods, 2. Time Splitting Methods, 3. Runge Kutta Chebyshev Methods, 4. Multi-rate (Runge-Kutta) Schemes. We consider the non autonomous initial value problem of the system of ODEs w0 (t) = F (t, w(t)), t > 0, w(0) = w0 , (1.2) with given F : R × Rm → Rm and w0 ∈ Rm . The exact solution will be approximated in the grid points tn = nτ , n = 0, 1, 2, . . . , N , where τ > 0 is the time step. We denote the numerical approximations by w n ≈ w(tn ). 46 Introduction To keep the presentation of the methods short, we assume the time step τ to be constant. However, we emphasize that in many applications variable step sizes should be used to obtain efficient codes and solvers. Chapter 2 Runge-Kutta Methods Runge-Kutta Methods belong to the class of one-step methods, e.g. this is the class of methods that step forward from computed approximations w n at time tn to new approximations wn+1 at the forward time tn+1 using as input only wn . To compute the new approximation w n+1 the Runge-Kutta methods use a number of auxiliary intermediate approximations w ni ≈ w(tn + ci τ ), i = 1, 2, . . . 
, s, where s is the number of stages. These intermediate approximations serve to obtain a sufficiently level of accuracy for the approximations wn at the grid points tn . The general form of a Runge-Kutta method is wni = wn + τ wn+1 = wn + τ s X j=1 s X αij F (tn + cj τ, wnj ), bi F (tn + ci τ, wni ). i = 1, 2, . . . , s, (2.1) i=1 P The coefficients αij and bi define the particular method and ci = sj=1 αij . Observe that a particular Runge-Kutta method is explicit if α ij = 0 for all j ≥ i. All other particular methods are implicit. If α ij = 0 for all j > i, then we will call this method diagonally implicit. Particular Runge-Kutta methods can be represented in a so-called Butcher-array, e.g., c1 .. . α11 .. . ··· α1s .. . cs αs1 b1 ··· ··· αss bs The order p of a Runge-Kutta method is determined by its coefficients αij , bi and ci . By making use of Taylor series, one can derive conditions in 48 Runge-Kutta Methods order to make the method consistent of order p. These conditions are given in Table 2.1 for p = 1, 2, 3 and 4. Table 2.1 is taken from [13]. order p order conditions 1 bT e = 1 2 bT ce = 1 2 3 b T c2 = 1 3 4 b T c3 = 1 4 bT Ac2 = bT Ac = bT CAc = 1 12 bT A2 c = 1 6 1 8 1 24 Table 2.1: Order conditions of Runge-Kutta methods for p = 1, 2, 3, 4. The vector b = (b1 , . . . , bs )T , c = (c1 , . . . , cs )T , e = (1, . . . , 1)T , the matrix A = (αij )ij , C = diag(ci ) and ck = C k e. Example 2.1. We give some Butcher arrays of simple explicit Runge-Kutta methods. 0 0 0 1 1 1 1 1 2 2 At the left we have the Butcher array of the familiar Euler Forward method of order one. At the right we have a method known as the second-order explicit trapezoidal rule or modified Euler method, e.g., 1 1 wn+1 = wn + τ F (tn , wn ) + τ F (tn + τ, w − n + τ F (tn , wn )). 2 2 An example of explicit method of order four is given by the Butcher array 0 1 2 1 2 1 1 2 1 2 0 0 1 1 6 1 3 1 3 1 6 This method is better known as the method of Runge-Kutta. Completely 2.1 The Stability Function 49 written out it is wn1 = wn , 1 wn2 = wn + τ F (tn , wn1 ), 2 1 1 wn3 = wn + τ F tn + τ, wn2 , 2 2 1 wn4 = wn + τ F tn + τ, wn3 , 2 1 1 F (tn , wn1 ) + F (tn + 1/2τ, wn2 ) wn+1 = wn + τ 6 3 1 1 + F (tn + 1/2τ, wn3 ) + F (tn + τ, wn4 ) . 3 6 Taking ki = τ F (tn +ci τ, wni ), the above method is in the more familiar form k1 = τ F (tn , wn ), 1 1 k2 = τ F tn + τ, wn + k1 , 2 2 1 1 k3 = τ F tn + τ, wn + k2 , 2 2 k4 = τ F (tn + τ, wn + k3 ), 1 wn+1 = wn + (k1 + 2k2 + 2k3 + k4 ) . 6 We will refer to it as the classical fourth order method. 2.1 The Stability Function Consider the Dahlquist test equation w 0 (t) = λw(t), λ ∈ C. Applying a Runge-Kutta method (2.1) to the test equation gives the recursion wn+1 = R(z)wn , with R(z) the stability function and z = τ λ. This function can be found to be R(z) = 1 + zbT (I − zA)−1 e, (2.2) where e = (1, 1, . . . , 1)T . It follows from (2.2) that for explicit R-K methods the stability function R(z) becomes a polynomial of degree ≤ s. For implicit R-K methods it becomes a rational function with both denominator and numerator polynomials of degree ≤ s. Example 2.2. For every explicit R-K method of order p = s and s ≤ 4 the stability function is given by the polynomial 1 1 R(z) = 1 + z + z 2 + · · · + z s . 2 s! 50 Runge-Kutta Methods For s = 3 and 4 the stability regions S = {z ∈ C : |R(z)| ≤ 1} have been plotted in Figure 2.1. 
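The boundary of such a stability region is easy to trace numerically: evaluate |R(z)| on a grid in the complex plane and draw the level line |R(z)| = 1. The following minimal sketch does this for the polynomials R(z) = 1 + z + z²/2! + ... + zˢ/s! of Example 2.2; it only illustrates how plots like Figure 2.1 can be reproduced, and the plotting details are ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from math import factorial

def stability_poly(z, s):
    """R(z) = 1 + z + z^2/2! + ... + z^s/s! for an explicit RK method with p = s <= 4."""
    return sum(z**k / factorial(k) for k in range(s + 1))

x = np.linspace(-4.0, 1.0, 400)
y = np.linspace(-3.5, 3.5, 400)
X, Y = np.meshgrid(x, y)
Z = X + 1j * Y

for s in (3, 4):
    R = abs(stability_poly(Z, s))
    # the stability region S is where |R(z)| <= 1; draw its boundary |R(z)| = 1
    plt.contour(X, Y, R, levels=[1.0])
plt.axhline(0.0, lw=0.5)
plt.axvline(0.0, lw=0.5)
plt.xlabel("Re z")
plt.ylabel("Im z")
plt.show()
```

The real stability intervals read off from these regions are approximately [−2.51, 0] for s = 3 and [−2.79, 0] for s = 4.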
Third Order Runge−Kutta Method −4 −3 −2 Fourth Order Runge−Kutta Method −1 3 3 2 2 1 1 0 1 −4 −3 −2 −1 0 1 −1 −1 −2 −2 −3 −3 Figure 2.1: Stability regions S = {z ∈ C : |R(z)| ≤ 1 for s = 3 (left) and s = 4 (right). 2.2 Rosenbrock Methods Rosenbrock methods belong to the family of Runge-Kutta methods. They are named after Rosenbrock, see [27], who was the first one that proposed methods of this kind. In literature different forms have been used. Nowadays Rosenbrock methods are understood to solve an autonomous, stiff ODE system w0 (t) = F (w(t)). We derive the class of Rosenbrock methods from the diagonally implicit Runge-Kutta methods. Recall that an s-stage diagonally implicit RK-method, with F = F (w(t)), is given as wni = wn + τ s X αij F (wnj ), i = 1, 2, . . . , s, j=1 wn+1 = wn + τ s X bi F (wni ) (2.3) i=1 with Butcher array as in Figure 2.2. Rewriting (2.3) gives i−1 X ki = τ F wn + αij kj + αii ki , j=1 wn+1 = wn + s X i=1 bi ki . i = 1, . . . , s, (2.4) 2.2 Rosenbrock Methods 51 c1 .. . cs α11 α21 .. . α22 αs1 b1 ··· ··· .. . αss bs Figure 2.2: Butcher array of an s-stage diagonally implicit RK-method P Now, the idea is not to solve ki from (2.4), but to linearize F wn + i−1 α k + α k ij j ii i j=1 Pi−1 around x = wn + j=1 αij kj : ki = τ F (x) + τ F 0 (x)αii ki . (2.5) This can be interpreted as applying one Newton iteration to (2.4) with starting value ki = 0. Instead of continuing the Newton iteration we consider (2.5) as a new class of methods. Before defining the actual class of Rosenbrock methods, we give some additional remarks to obtain more efficient methods. A lot of computational advantage can be obtained by replacing the Jacobians F 0 (x) by the Jacobian A = F 0 (wn ), in each time step only one Jacobian has to be computed. Furthermore, we gain more freedom by introducing linear combinations of the terms Akj into (2.5). The following class of methods can be defined : Definition 2.3. An s-stage Rosenbrock method is given as ki = τ F (wn + i−1 X αij kj ) + τ A j=1 wn+1 = wn + s X bi ki i X γij kj , j=1 (2.6) i=1 where A = An is the Jacobian F 0 (w(t)). Definition 2.3 is taken from [13]. Again, the coefficients b ij , αij and γij define a particular method and are selected to obtain a desired level of consistency and stability. Remark that computing an approximation w n+1 from wn , in each stage i a linear system of algebraic equations with the matrix I − γ ii τ A has to be solved. To save computing time for large dimension systems w 0 (t) = F (w(t)) the coefficients γii are taken constant, e.g., γii = γ. Then in every timestep the matrix (I − γii A) is the same. To solve these large systems LU decomposition or (preconditioned) iterative methods could be used. Finally we remark that taking A to the zero matrix in (2.6), this leads to a standard explicit Runge-Kutta method. The final remark, taken from 52 Runge-Kutta Methods [13], Rosenbrock methods have proven to be successful in many stiff ODE and PDE problems, especially in the low to moderate accuracy range. As can be found in [13] the definition of order of consistency is the same as for Runge-Kutta. Define βij = αij + γij , ci = i−1 X αij and di = i−1 X βij . j=1 j=1 In Table 2.2, taken from [13], the order conditions for s ≤ 4 and p ≤ 3 and γii = γ = constant can be found. 
For Rosenbrock methods with constant order p order conditions 1 b1 + b2 + b3 + b4 = 1 2 b 1 d2 + b 3 d3 + b 4 d4 = 1 2 −γ b2 c22 + b3 c23 + b4 d24 = 1 3 b3 β32 d2 + b4 (β42 d2 + β43 d3 ) = 1 6 3 − γ + γ2 Table 2.2: Order conditions of Rosenbrock methods with γ ii = γ for s ≤ 4 and p ≤ 3. coefficients γii = γ the stability function has the following form, R(z) = P (z) , (1 − γz)s where P (z) is a polynomial of degree ≤ s. Example 2.4. By taking s = 1, b1 = 1, α11 = 1 and γ11 = γ, γ a free parameter, in (2.6) we defined the one-stage method wn+1 = wn + k1 k1 = τ F (wn ) + τ Ak1 . Since β11 = 1 + γ and βij = ci = di = 0 for all i = 1, 2, 3, 4 and j = 2, 3, 4, we only have a second order method if γ = 2. The stability function is R(z) = Hence, this method is A-stable 1 1 + (1 − γ)z . 1 − γz for all γ ≥ 1 2 and L-stable (2.7) 2 for γ = 1. 1 A method is called A-stable if the stability region S = {z ∈ C : |R()z| ≤ 1} contains the left half plane C− . 2 A method is called L-stable if the method is A-stable and R|∞| = 0. 2.3 Runge Kutta Chebyshev Methods 53 Example 2.5. We consider the 2-stage method wn+1 = wn + b1 k1 + b2 k2 k1 = τ F (wn ) + γτ Ak1 k2 = τ F (wn + α21 k1 ) + γ21 τ Ak1 + γτ Ak2 , (2.8) with coefficients b1 = 1 − b 2 , α21 = 1 2b2 and γ21 = − γ . b2 Hence, this method is of order 2 for all γ and as long as b 2 6= 0. The stability function is 1 + (1 − 2γ)z + (γ 2 − 2γ + 21 )z 2 . (2.9) R(z) = (1 − γz)2 √ The method is A-stable for γ ≥ 41 and L-stable if γ = 1 ± 21 2. 2.3 Runge Kutta Chebyshev Methods The family of Runge Kutta methods discussed in this section are explicit. They will avoid solving algebraic systems, but posses extended real stability interval with a length proportional to s 2 , with s the number of stages. Therefore the Runge Kutta Chebyshev (RKC) methods could be attractive to solve moderate stiff systems. The principle goal of constructing Runge Kutta methods is to achieve the highest order consistency with a given number of stages s. The methods discussed in this section use a few stages to achieve a low order consistency and the additional stages are used to increase the stability boundary β(s). Definition 2.6. The stability boundary β(s) is the number β(s) such that [−β(s), 0] is the largest segment of the negative real axis contained in the stability region S = {z ∈ C : |R(z)| ≤ 1} . To construct the family of RKC methods we start with the explicit Runge-Kutta methods which have the stability polynomial R(z) = γ0 + γ1 z + · · · + γs z s . (2.10) In order to have first order consistency we take γ 0 = γ1 = 1.3 The following theorem states that every explicit Runge-Kutta method has as optimal 3 This can be verified by considering the test equation y 0 = λy. The local error of the test equation satisfies ez − R(z) = O(τ p ). (2.11) τ th To achieve p order consistency the coefficients γi has to be chosen in such a way that (2.11) satisfies for p. 54 Runge-Kutta Methods stability boundary β(s) = 2s2 , thus the maximum value of β(s) is 2s 2 . The upper boundary of β(s) is achieved if we take the shifted Chebyshev polynomials of the first kind as stability polynomial. The following theorem is taken from [13]. Theorem 2.7. For any explicit, consistent Runge-Kutta method we have β(s) ≤ 2s2 . The optimal stability polynomial is the shifted Chebyshev polynomial of the first kind z Ps (z) = Ts 1 + 2 , s (2.12) where the polynomials Ts (z)4 for z ∈ C are recursively defined as T0 (z) = 1, T1 (z) = z, Tj (z) = 2zTj−1 (z) − Tj−2 (z) 2 ≤ j ≤ s. 
(2.13) If we take (2.12) as stability polynomial, then we achieve the optimum value for β(s) equal to 2s2 . Proof. Since Ps (z) = 1 + z + O(z 2 ) these polynomials give first order consistency and belong to the class of stability polynomials (2.10) that can be generated by explicit Runge Kutta methods. By definition, it follows that |Ts (x)| ≤ 1 for −1 ≤ x ≤ 1. Therefore, |Ps (x)| ≤ 1 for −2s2 ≤ x ≤ 0. As known, on the interval [−1, 1], Ts (x) has s − 1 points of tangency with the lines y = ±1. As a consequence, Ps (x) has also s − 1 tangential points with these lines. This property of the shifted Chebyshev polynomials of the first kind determines it as the unique polynomial with largest stability boundary. Suppose there exists a second stability polynomial of the class (2.10) with β(s) ≥ 2ss . Since Ps (x) has s − 1 points of tangency with the lines y = ±1, this second stability polynomial then intersects P s (x) at least s − 1 times for x < 0, where intersection points with common tangent are counted double. The difference polynomial has at least s − 1 negative roots, where roots of multiplicity 2 are counted double. Since this second polynomial also belongs to the class (2.10), the difference polynomial is of the form x2 (γ̃2 + · · · + γ̃s xs−2 ). This difference polynomial can have at most s − 2 roots on the negative real axis, thus we have a contradiction. 4 On the real interval [−1, 1] the Chebyshev polynomials can also be defined as Ts (x) = cos(s arccos x). 2.3 Runge Kutta Chebyshev Methods 2.3.1 55 First Order (Damped) Stability Polynomials and Schemes In the previous section we proved that a stability polynomial that is generated by a first order explicit R-K method has as optimal stability boundary 2s2 . The class of shifted Chebyshev polynomials of the first kind seemed to be the class of stability polynomials that achieve the stability boundary 2s2 . If one considers the stability regions of P s (z), then one observes that there are interior points z ∈ (−β(s), 0) where |P s (z)| = 1. This means that a small perturbation in imaginary direction near these points might cause instability. To avoid that kind of situations, the polynomials are slightly modified, so called damping. Following [13], adopting the choice made by Guillo and Lago [9] (already in 1961), the damped form of (2.12) reads Ps (z) = where ω0 = 1 + ε s2 Ts (ω0 + ω1 z) , Ts (ω0 ) (2.14) and ω1 = Ts (ω0 ) . Ts0 (ω0 ) (2.15) The polynomials (2.14) satisfy Rs (z) = 1 + z + O(z 2 ) and thus generate a first order method.5 The stability interval is then determined by the relation Ts (ω0 + ω1 z) ≤ 1. |Ps (z)| = Ts (ω0 ) Using Taylor series of Ts (z) near z = 0 we obtain that the stability interval is determined by the relation −ω0 ≤ ω0 + ω1 z ≤ ω0 , and thus it follows that β(s) = 2ω0 4 ≈ (2 − ε)s2 . ω1 3 (2.16) In Figure 2.3 the stability region of the first order shifted Chebyshev polynomial (with damping), for s = 5, is given. The next problem is to find R-K methods that have stability polynomials like (2.14). The Approach of Van Der Houwen & Sommeijer To find explicit R-K methods with stability polynomials (2.14) we use the idea of van der Houwen and Sommeijer, as we will now explain. Their elegant idea was to use the scaled and damped Chebyshev polynomials of 5 The conditions for first order are Rs (0) = 1 and Rs0 (0) = 1. 
56 Runge-Kutta Methods First Order Chebyshev Polynomial 10 5 0 −5 −10 −60 −50 −40 −30 −20 −10 0 −10 0 First Order Damped Chebyshev Polynomial 10 5 0 −5 −10 −60 −50 −40 −30 −20 Figure 2.3: Stability region of P5 (z) without damping (upper) and with damping (lower). order p as stability polynomials to generate an R-K method of order p. To derive the internal stages they use the three term recursion (2.13) and the damped Chebyshev polynomials to define the stability functions of the stability functions of the internal stages. Furthermore they made the ansatz that Rs (z) = as + bs Ts (ω0 + ω1 z). We construct the first order explicit R-K method with stability polynomial (2.14). Then, ω0 and ω1 are already defined, see (2.15). To have a first order method Rs (z) has to satisfy Rs (0) = 1 and Rs0 (0) = 1. It follows that bs = 1 Ts (ω0 ) and as = 1 − bs Ts (ω0 ). For the internal stages to be of order one, we also put Rj (z) = aj + bj Tj (ω0 + ω1 z) with bj = aj = 1 − bj Tj (ω0 ), (2.17) 1 . Tj (ω0 ) Define R0 (z) = a0 + b0 = 1 and imposing the recursion (2.13) where we use that Rj (0) = 1 for all 0 ≤ j ≤ s then shows after some calculations R0 (z) = 1, R1 (z) = 1 + µ̃1 z, Rj (z) = (1 − µj − νj ) + µj Rj−1 (z) + νj Rj−2 (z) + µ̃j Rj−1 (z)z + γ̃j z, 2.3 Runge Kutta Chebyshev Methods 57 for j = 2, . . . , s and where µ̃1 = b1 ω1 , bj 2bj ω0 , νj = − , bj−1 bj−2 2bj ω1 µ̃j = , γ̃j = −aj−1 µ̃j . bj−1 µj = (2.18) From the above relation the RKC integration formula for the nonlinear problem w0 (t) = F (t, w(t)) can de derived by associating R j (z) with the intermediate approximation wnj and the occurrence of z with a function evaluation : wn0 = wn , wn1 = wn + µ̃1 τ F (tn + c0 τ, wn0 ), wnj = (1 − µj − νj )wn + µj wn,j−1 + νj wn,j−2 + +µ̃1 τ F (tn + cj−1 τ, wn,j−1 ) + γ̃j τ F (tn + c0 τ, wn0 ), wn+1 = wns . (2.19) The above scheme obviously belongs to the class of explicit R-K methods. The only coefficient that remains to be defined are c j for 1 ≤ j < s. Observe that Rj (z) = ecj z + O(z 2 ) with j2 Ts (ω0 ) Tj0 (ω0 ) ≈ 2 cj = 0 Ts (ω0 ) Ts (ω0 ) s cs = 1. Remark 2.8. Comparing the s-stage RKC method with a straightforward explicit method like Euler Forward we can conclude the following. The stability boundary of the s-stage RKC method equals β RKC = 2s2 , while for the Euler forward method the stability boundary is β EF = 2. See Figures 2.3 and 2.4. To have stability for both methods, the time-step τ RKC for the RKC method can be s2 as large as the time step of the Euler forward method, i.e., τRKC = s2 τEF . The number of function evaluations for Euler forward is one per time step. For the s-stage RKC method is the number of function evaluations per time step equal to s. We conclude that to integrate one time step with the s-stage RKC method we need s function evaluations. To integrate the same time step with Euler forward we need to integrate s2 smaller time steps τEF , because of the relation τRKC = s2 τEF . Per time step Euler forward needs one function evaluation per time step τEF . Integrating with RKC pays of with a factor s less function evaluations. 2.3.2 Second Order Schemes If we are looking for second order explicit R-K methods with s internal stages, then we first have to find analytical expressions for second order 58 Runge-Kutta Methods Euler Forward Stability Region 1 −3 −2 −1 00 1 −1 Figure 2.4: Stability region of Euler Forward stability polynomials 1 R(z) = 1 + z + z 2 + γ3 z 3 + · · · + γs z s . 2 The free coefficients γi , i = 3, . . . 
, s have to be chosen such that β(s) is as large as possible. A suitable approximate polynomial in analytical form was given by Bakker [1] : 1 1 3z 1 2 Ts 1 + 2 , (2.20) − Bs (z) = + 2 + 3 3s 3 3s2 s −1 with stability boundary β(s) ≈ 32 (s2 − 1). This polynomial generates about 80% of the optimal interval. The damped version of (2.20) reads ([12]) Bs (z) = 1 + Ts00 (ω0 ) (Ts (ω0 + ω1 z) − Ts (ω0 )) (Ts0 (ω0 ))2 with ω0 = 1 + ε s2 ω1 = (2.21) Ts0 (ω0 ) . Ts00 (ω0 ) The stability boundary is in the damped case equal to 2 2 β(s) = (s2 − 1)(1 − ε). 3 15 In Figure 2.5 the stability regions S = {|B 5 (z)| ≤ 1 : z ∈ C} are given in the undamped as well as the damped case. Using the method of van der Houwen and Sommeijer to find a second order explicit R-K method which has stability polynomial (2.21) gives the following method. Again we choose all the internal stages to have second order consistency, thus 1 Rj (z) = 1 + bj ω1 Tj0 (ω0 )z + bj ω12 Tj0 (ω0 )z 2 + O(z 3 ), 2 2.4 Some Remarks on Runge-Kutta Methods 59 Second Order Chebyshev Polynomial 5 0 −5 −20 −18 −16 −14 −12 −10 −8 −6 −4 −2 0 2 −2 0 2 Second Order Damped Chebyshev Polynomial 5 0 −5 −20 −18 −16 −14 −12 −10 −8 −6 −4 Figure 2.5: Stability region of B5 (z) without damping (upper) and with damping (lower). has to match with 1 Rj (z) = 1 + cj z + (cj z)2 + O(z 3 ). 2 Elementary calculations yields the following method : wn0 = wn , wn1 = wn + µ̃1 τ F (tn + c0 τ, wn0 ), wnj = (1 − µj − νj )wn + µj wn,j−1 + νj wn,j−2 + (2.22) +µ̃1 τ F (tn + cj−1 τ, wn,j−1 ) + γ̃j τ F (tn + c0 τ, wn0 ), wn+1 = wns , with coefficients ω0 = 1 + Tj00 (ω0 ) , bj = 0 (Tj (ω0 ))2 2.4 ε , s2 ω1 = Ts0 (ω0 ) , Ts00 (ω0 ) (2.23) Ts0 (ω0 ) Tj00 (ω0 ) j2 − 1 cj = 00 ≈ 2 , 0 Ts (ω0 ) Tj (ω0 ) s −1 µ̃1 = b1 ω1 , µj = µ̃j = 2bj ω1 , bj−1 2bj ω0 , bj−1 νj = − bj bj−2 , γ̃j = −aj−1 µ̃j . (2.24) (2.25) (2.26) Some Remarks on Runge-Kutta Methods This part of Chapter 2 is reserved to mention some extra properties of Runge-Kutta methods. The properties mainly apply to Implicit RK-methods. 60 Runge-Kutta Methods 2.4.1 Properties of Implicit Runge-Kutta methods A-stability As observed, some Runge-Kutta methods have a stability region equal to the left half plane C− . For stiff problems such stability regions are favorable. Definition 2.9 (Dahlquist 1963). A method with stability region S such that, S ⊃ C− = {z | Re z ≤ 0}, is called A-stable. Implicit Runge-Kutta methods have stability polynomials of the form R(z) = P (z) , Q(z) with deg(P (z)), deg(Q(z)) ≤ s Recall that s is the number of stages. The following observation is easily observed by using the Minimum Modulus Theorem. An implicit Runge-Kutta method is A-stable if and only if |R(iy)| ≤ 1 for all real y, and R(z) analitic for all Re z ≤ 0. Construction of Implicit Runge-Kutta methods In this part we give basic conditions to derive classes of fully implicit RungeKutta methods having good stability properties. The construction of these methods relies on the following assumptions, taken from [11] : B(p) : C(η) : D(ζ) : q−1 = 1q i=1 bi ci Ps cq q−1 = qi j=1 αij cj Ps b q−1 αij = qj (1 − cqj ) i=1 bi ci Ps q = 1, . . . , p, i = 1, . . . , s, q = 1, . . . , η, j = 1, . . . , s, q = 1, . . . , ζ. Condition B(p) means that the quadrature formula (b i , ci ) is of order p. The following fundamental theorem derived by Butcher gives the relation between the conditions stated above and the order of the Runge-Kutta method. Theorem 2.10 (Butcher 1964). 
If the coefficients b i , ci , αij of a RungeKutta method satisfy B(p), C(η) and D(ζ) with p ≤ η + ζ + 1 and p ≤ 2η + 2, then the method is of order p. 2.4 Some Remarks on Runge-Kutta Methods 61 Gauss Methods Gauss methods are collocation methods 6 based on the Gauss-Legendre quadrature formulas, i.e., c1 , . . . , cs are the zeros of the shifted Legendre polynomial of degree s ds (xs (x − 1)s ) . dxs The weights b1 , . . . , bs are chosen such that B(s) is satisfied. The following theorem is taken from [11]. Theorem 2.11. The s-stage Gauss method is of order 2s. Its stability function is the (s, s)- Padé approximation 7 and the method is A-stable. Examples of Gauss methods are given in Figure 2.6. 1 2 1 2 1 1 2 1 2 − + √ 3 √6 3 6 1 4√ 3 1 + 4 6 1 2 1 4 − 1 4 1 2 √ 3 6 Figure 2.6: Butcher tableaus of Gauss methods of order 2 (left) and 4 (right) Radau Methods Based on the Radau and Lobatto quadrature formulas other Runge-Kutta methods can be constructed. Taking c 1 , . . . , cs as zeros of respectively ds−1 xs (x − 1)s−1 , s−1 dx ds−1 xs−1 (x − 1)s , and s−1 dx ds−2 xs−1 (x − 1)s−1 , s−2 dx (2.27) (2.28) (2.29) we call the method Radau left (2.27), Radau right (2.28) and Lobatto (2.29). For all three methods the weights b1 , . . . , bs are chosen such that B(s) is satisfied. The s-stage Radau IA method is defined by the Radau left method. The coefficients αij , i, j = 1, . . . , s are defined by condition D(s). Since all c j are distinct and the bi 6= 0 this is uniquely possible. Figure 2.7 represents two examples of this method. The s-stage Radau IIA method is defined by the Radau right method and the coefficients αij , i, j = 1, . . . , s are obtained by condition C(s). Theorem II.7.7 of [10] implies that this results in the collocation method based on the zeros of (2.28). Examples are given in Figure 2.8. The following theorem is 6 7 See Appendix A See Appendix B 62 Runge-Kutta Methods 0 1 1 0 2 3 1 4 1 4 1 4 − 41 5 12 3 4 Figure 2.7: Butcher tableaus of Radau IA methods of order 1 (left) and 3 (right) 1 1 1 1 3 1 5 12 3 4 3 4 1 − 12 1 4 1 4 Figure 2.8: Butcher tableaus of Radau IIA methods of order 1 (left) and 3 (right) taken from [11] and stated without proof. Theorem 2.12. The s-stage Radau IA method and the s stage Radau IIA method are of order 2s − 1. Their stability function is the (s − 1, s) (subdiagonal) Padé approximation. Also, both methods are A-stable. For the Lobatto IIIA methods the coefficients α ij are defined by C(s) and therefore it is a collocation method. Lobatto IIIB D(s) is used to define the αij . Finally, for the Lobatto IIIC methods we put αi1 = b1 for i = 1, . . . , s, and determine the remaining αij by C(s − 1). The following theorem, taken without proof from [11], gives the order and stability properties of Lobatto methods. Theorem 2.13. The s-stage Lobatto IIIA, IIIB and IIIC methods are of order 2s − 2. The stability function for the Lobatto IIIA and IIIB methods is the diagonal (s − 1, s − 1)-Padé approximation. The Lobatto IIIC method has as stability function the (s − 2, s)-Padé approximation. Finally, all the Lobatto methods are A-stable. In Table 2.3 we give a summary of these statements. The implicit methods Radau IA and IIA are widely used to solve stiff problems due to their favorable stability properties and simplicity. 2.4.2 Diagonally Implicit Runge-Kutta methods Fully implicit Runge-Kutta methods like Radau IA and IIA are often used in stiff chemistry problems for their superior stability properties. 
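The conditions B(p) and C(η) are easy to verify numerically. The following sketch (Python/NumPy) checks them for the two-stage Gauss method of Figure 2.6, confirming B(4) and C(2) as used for Theorem 2.11:

```python
import numpy as np

# two-stage Gauss method (order 2s = 4), cf. Figure 2.6
r = np.sqrt(3) / 6
c = np.array([0.5 - r, 0.5 + r])
b = np.array([0.5, 0.5])
A = np.array([[0.25, 0.25 - r],
              [0.25 + r, 0.25]])

# B(p): sum_i b_i c_i^{q-1} = 1/q             for q = 1, ..., p
for q in range(1, 5):
    print("B, q =", q, np.isclose(b @ c**(q - 1), 1 / q))

# C(eta): sum_j alpha_ij c_j^{q-1} = c_i^q / q for q = 1, ..., eta
for q in range(1, 3):
    print("C, q =", q, np.allclose(A @ c**(q - 1), c**q / q))
```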
A clearly disadvantage of these methods is that the Runge-Kutta matrix A is full. This means that the system of algebraic equations for w ni must be solved simultaneously. In the situation that the number of stages s increases, this 2.4 Some Remarks on Runge-Kutta Methods method Gauss 63 assumptions order stability function B(2s) C(s) D(s) 2s (s, s)-Padé Radau IA B(2s − 1) C(s − 1) D(s) 2s-1 (s − 1, s)-Padé Radau IIA B(2s − 1) C(s) D(s − 1) 2s-1 (s − 1, s)-Padé Lobatto IIIA B(2s − 2) C(s) D(s − 2) 2s-2 (s − 1, s − 1)-Padé Lobatto IIIB B(2s − 2) C(s − 2) D(s) 2s-2 (s − 1, s − 1)-Padé Lobatto IIIA B(2s − 2) C(s − 1) D(s − 1) 2s-2 (s − 2, s)-Padé Table 2.3: Implicit Runge-Kutta methods can become very costly. To avoid these situations, one can take for A a lower triangular matrix, see Figure 2.9. Then, the s approximations w ni can be solved sub-sequently for i = 1, . . . , s. Such methods we call Diagonally implicit Runge-Kutta methods, or shorter DIRK. c1 c2 .. . α11 α21 .. . α22 cs αs1 b1 αs2 b2 .. . ··· ··· αss bs Figure 2.9: Lower Triangular matrix A To solve each approximation wni , in general a Newton-type iteration with a coefficient matrix I − τ aii F 0 will be needed. By taking the diagonal entries of A equal, one may hope to use repeatedly the stored LU factorization of I − τ aii F 0 . DIRK methods with this additional property are called Singly Diagonally Implicit RK methods (SDIRK). The general structure of the Runge-Kutta matrix belonging to the SDIRK is given in Figure 2.10. The c1 c2 .. . γ α21 .. . γ cs αs1 b1 αs2 b2 .. . ··· ··· γ bs Figure 2.10: Lower Triangular matrix A special structure of the SDIRK schemes makes it possible to simplify the 64 Runge-Kutta Methods order conditions, see Table 2.4, taken from [11]. order 1 2 3 4 previous conditions P bj = 1 P bj αjk = 12 P bj αjk αjl = 13 P bj αjk αjl = 16 P bj αjk αjl αjm = 41 P bj αjk αjl αjm = 81 P 1 bj αjk αjl αjm = 12 P 1 bj αjk αjl αjm = 24 simplified conditions P bj = 1 P bj αjk = 12 − γ P bj αjk αjl = 13 − γ + γ 2 P bj αjk αjl = 16 − γ + γ 2 P bj αjk αjl αjm = 14 − γ + 23 γ 2 − γ 3 P bj αjk αjl αjm = 18 − 56 γ + 23 γ 2 − γ 3 P 1 bj αjk αjl αjm = 12 − 32 γ + 23 γ 2 − γ 3 P 1 bj αjk αjl αjm = 24 − 21 γ + 23 γ 2 − γ 3 Table 2.4: Order conditions for SDIRK methods (simplified conditions) and the ‘original’ conditions of RK methods (previous conditions) The stability function of DIRK methods have the form R(z) = P (z) , (1 − α11 z)(1 − α22 z) · · · (1 − αss z) (2.30) where the numerator P (z) is of degree ≤ s. In the case of an SDIRK scheme (2.30) changes into P (z) . R(z) = (1 − γz)s For more information with respect to stability properties and order conditions of specific (S)DIRK methods we refer to [11]. The following theorem, also taken from [11], gives order barriers for (S)DIRK methods with respect to the vector b. Theorem 2.14. 6, 1. A DIRK method with all b i positive has order at most 2. A SDIRK method with all bi positive has order at most 4. 2.4.3 The Order Reduction Phenomenon for RK-methods To study the accuracy of RK-methods applied to stiff equations Prothero and Robinson [25] proposed to consider the following problem y0 = λ(y − ϕ(x)) + ϕ0 (x) , (2.31) y(x0 ) = ϕ(x0 ) 2.4 Some Remarks on Runge-Kutta Methods 65 with λ ∈ C and Re λ ≤ 0. Note that the exact solution of (2.31) is given by y(t) = ϕ(t). Applying a general RK-method to (2.31) yields wni = wn + h s X j=1 wn+1 = wn + h s X j=1 αij (λ((wni − ϕ(x0 + cj h))) + ϕ0 (x0 + cj h)), bj (λ((wni − ϕ(x0 + cj h))) + ϕ0 (x0 + cj h)). 
(2.32) We assume that for this RK-method the conditions B(p) and C(q) hold. Then, by replacing wni , wn and wn+1 by the exact solutions wni ϕ(x0 + ci h), wn ϕ(x0 ), wn+1 ϕ(x0 + h), and using Taylor expansions of ϕ(x0 + cj h), ϕ(x0 + h) and ϕ0 (x0 + cj h) near x0 gives ϕ(x0 + ci h) = ϕ(x0 ) + h ϕ(x0 + h) = ϕ(x0 ) + h s X j=1 s X αij ϕ0 (x0 + cj h) + ∆i,h (x0 ), bj ϕ0 (x0 + cj h) + ∆0,h (x0 ), (2.33) j=1 where ∆0,h (x0 ) = O(hp+1 ) and ∆i,h (x0 ) = O(hq+1 ). (2.34) Eliminating internal stages and subtracting (2.33) from (2.32) yields for n=0 w1 − ϕ(x0 + h) = R(z)(w0 − ϕ(x0 )) + δh (x0 ), (2.35) with R(z) the stability function (R(z) = 1 + zb T (I − zA)−1 e). Remark that δh (x0 ) is the local error when w0 = ϕ(x0 ) in (2.35). It is given by δh (x) = −zbT (I − zA)−1 ∆h (x) − ∆0,h (x), (2.36) where ∆h (x) = (∆1,h (x), . . . , ∆s,h (x))T . Replacing x0 by xn and so on, one obtains instead of (2.35) the general recursion wn+1 − ϕ(xn+1 ) = R(z)(wn − ϕ(xn )) + δh (xn ). (2.37) Applying the above recursion n times gives the following general expression for the global error wn+1 − ϕ(xn+1 ) = R(z)n+1 (w0 − ϕ(x0 )) + n X j=0 R(z)n−j δh (xj ). (2.38) 66 Runge-Kutta Methods Remark 2.15. In the non-stiff theory we have z = O(h). Observe that in the non-stiff case the global error (2.38) behaves like O(h p ). In the stiff case we would like to have a time-step that is larger than with |λ| 1. Therefore, the global error (2.38) is studied under the assumption that simultaneously h → 0 and z = hλ → ∞. In Table 2.5 the results for this analysis are given for the s-stage Gauss methods, s-stage Radau methods and s-stage Labotto methods of Section 2.4.1. Comparing Table 2.5 with Table 2.3 we observe that for several methods we have order reduction. 1 |λ| , Method s odd Gauss s even Radau IA Radau IIA Labotto IIIA Labotto IIIB Labotto IIIC s odd s even s odd s even local error O(hs+1 ) O(hs ) z −1 O(hs+1 ) z −1 O(hs+1 ) zO(hs+1 ) z −1 O(hs ) global error O(hs+1 ) O(hs ) O(hs ) z −1 O(hs+1 ) z −1 O(hs ) z −1 O(hs+1 ) zO(hs ) zO(hs+1 ) z −1 O(hs ) Table 2.5: Error for (2.31) in the stiff case under the assumption that simultaneously h → 0 and z = hλ → ∞. For the Gauss methods we verify the results obtained in Table 2.5. Since the Runge-Kutta matrix A is invertible, the vector-matrix product −zb T (I− zA)−1 is equal to T −zb (I − zA) −1 T =b A −1 1 +O . z Observing that for Gauss methods the condition C(η) holds for η = s yields that q = s. Also recall that the construction of Gauss methods implies that B(p) is satisfied for p = s. It follows that the local error δ h (x) (2.36) is of order O(hs+1 ). Denote en = yn − ϕ(xn ). We obtain from the recursion (2.37) and the 2.4 Some Remarks on Runge-Kutta Methods 67 fact that |R(z)| ≤ 1 for all z with Re z ≤ 0 that |en+1 | ≤ |R(z)||en | + |δh (xn )| = |R(z)||en | + Chs+1 ≤ |en | + Chs+1 .. . ≤ |e0 | + nChs+1 = |e0 | + Chs = O(hs ). In the last step we used that nh = O(1). Since the stability function of a Gauss method is the (s, s) Padé approximation, we have for odd s R(∞) = −1. For odd s the global error estimate can be improved using the partial summation n X n κn−j δ(xj ) = X 1 − κn+1−j 1 − κn+1 δ(x0 ) + (δ(xj ) − δ(xj−1 )). 1−κ 1−κ j=1 j=0 The fact that (δ(xj ) − δ(xj−1 )) = O(hq+2 ) and the above partial summation gives, when substituted in (2.38), the desired result. For the verification of the results of the other methods we refer to [11]. 
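To see the order reduction of Table 2.5 in practice, one can integrate the Prothero and Robinson problem (2.31) with the two-stage Gauss method and estimate the observed order from two step sizes. A sketch (Python/NumPy; the values of λ and ϕ are illustrative choices):

```python
import numpy as np

lam, T = -1.0e6, 1.0
phi  = lambda t: np.cos(t)                 # exact solution of (2.31)
dphi = lambda t: -np.sin(t)

r = np.sqrt(3) / 6                         # two-stage Gauss method, classical order 4
c = np.array([0.5 - r, 0.5 + r]); b = np.array([0.5, 0.5])
A = np.array([[0.25, 0.25 - r], [0.25 + r, 0.25]])

def endpoint_error(N):
    h, w = T / N, phi(0.0)
    for n in range(N):
        t = n * h
        # stage equations: (I - h lam A) W = w e + h A (phi'(t+ch) - lam phi(t+ch))
        g = dphi(t + c * h) - lam * phi(t + c * h)
        W = np.linalg.solve(np.eye(2) - h * lam * A, w + h * (A @ g))
        w = w + h * (b @ (lam * (W - phi(t + c * h)) + dphi(t + c * h)))
    return abs(w - phi(T))

e1, e2 = endpoint_error(100), endpoint_error(200)
print("observed order:", np.log2(e1 / e2))
# for |lam| h >> 1 this should be close to s = 2 rather than 2s = 4, cf. Table 2.5
```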
2.4.4 Order Reduction for Rosenbrock Methods Applying Rosenbrock methods (2.6) to the Prothero & Robinson test equation (2.31) gives by straightforward calculation that the global error ε n = yn − y(xn ) satisfies εn+1 = R(z)εn + δh (xn ), (2.39) where R(z) is the stability function. The local error δ h (xn ) in (2.39) is given by δh (x) = ϕ(x) − ϕ(x + h) + bT (I − zB)−1 ∆, (2.40) with B an s × s matrix with entries Bi,j = (αij + γij )ij . In (2.40) vector b has as ith entry bi , i = 1, . . . , s. Finally, the vector ∆ is the s × 1 vector with entries ∆i = z ϕ(x) − ϕ(x + αi h) − γi hϕ0 (x) + hϕ0 (x + αi h) + γi h2 ϕ00 (x), for i = 1, . . . , s. The following result on order reduction for Rosenbrock methods can be proven. Theorem 2.16. The local error δh (x) of a Rosenbrock method applied to the test equation of Prothero & Robinson (2.31) satisfies for h → 0 and z = hλ → ∞ 2 2 X h h 2 00 3 bi ωij αj − 1 ϕ (x) + O(h ) + O δh (x) = , (2.41) 2 z i,j 68 Runge-Kutta Methods where ωij are the entries of B−1 . Proof. Using the Neumann series expansion (I − E)−1 = I + E + E2 + · · · we obtain for (I − zB) −1 −1 B−1 = − I (zB) z B−1 B−1 −1 − (I − ) = z z ! −1 2 B B−1 B−1 + ··· = + I+ − z z z −1 2 B−1 1 B − +O − . z z z3 = = = = The product bT (I − zB)−1 ∆ can then be written as T b (I − zB) −1 1 1 = − bT B−1 − 2 bT B−2 + O z z 1 z3 . (2.42) The first term on the right-hand side of (2.42) is with (B −1 )ij = ωij and using Taylor expansions for ϕ(x + h) equal to 1X (αj h)2 00 0 0 3 − bi ωij z −αj hϕ (x) − ϕ (x) − γj hϕ (x) + O(h ) + z 2 i,j +hϕ0 (x) + αj h2 ϕ00 (x) + γj h2 ϕ00 (x) + O(h3 ) . Rearranging gives X 0 bi ωij (αj + γj )hϕ (x) + i,j X i,j α2j h2 00 bi ωij ϕ (x) + O(h3 ) + 2 1X bi ωij hϕ0 (x) + αj h2 ϕ00 (x) + γj h2 ϕ00 (x) + O(h3 ) . − z i,j Pj−1 P Recall that (B)ij = αij + γij , αj = k=1 αjk and γj = jk=1 γjk . Using these properties one can derive X X bi ωij (αj + γj ) = 1.8 i j 8 Since Bij = αij + γij it follows with the definition of αj and γj that BI = [α1 + T γ 1 , . . . , αn + γn ] , where I is the vector consisting of ones only. Thus, the summation P P T −1 BI = bT I = 1. j ωij (αj + γj ) is equal to b B i bi 2.4 Some Remarks on Runge-Kutta Methods 69 It follows that X i − bi X ωij (αj + γj )hϕ0 (x) = hϕ0 (x), and, j h2 ϕ00 (x) 1X X ωij (αj + γj )h2 ϕ00 (x) = − bi . z z i j Thus, bT (I − zB)−1 ∆ is equal to 0 hϕ (x) + X i,j α2j h2 00 1X bi ωij hϕ0 (x) + O bi ωij ϕ (x) + O(h3 ) − 2 z i,j h2 z . (2.43) The second term on the right-hand side of (2.42), equal to 1 T B−2 b , z2 z2 can be rewritten, almost in the same way as for the first term on the righthand side of (2.42), as 2 h 1X 0 bi ωij hϕ (x) + O . (2.44) z z2 i,j The third term on the right-hand side of (2.42) equals O z12 . Adding up (2.43), (2.44) and the last term on the right-hand side of (2.42) equal to O z12 , and letting z → ∞ gives − bT (I − zB)−1 ∆ = 2 X α2j h2 00 h 0 3 hϕ (x) + bi ωij ϕ (x) + O(h ) + O . 2 z i,j Applying Taylor expansion of ϕ(x + h) near x in (2.40) gives the desired result. Remark 2.17. In the stiff case a Rosenbrock is only third order when X bi ωij α2j = 1, i,j is satisfied. Since none of the Rosenbrock methods defined in Section 2.2 satisfies this extra constraint, their order of convergence is only two for the Prothero & Robinson test equation. To satisfy this extra constraint, a convenient way is to require αsi + γsi = bi , i = 1, . . . , s, and αs = 1. (2.45) This extra requirement implies even δh (x) = O h2 z . In that case the method yields asymptotically exact results for z → ∞. 
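The extra condition of Remark 2.17 can be checked numerically. The sketch below (Python/NumPy) does so for the two-stage method of Example 2.5 with the illustrative choices γ = 1 − ½√2 (the L-stable value) and b2 = ½:

```python
import numpy as np

gamma, b2 = 1 - np.sqrt(2) / 2, 0.5
b1, a21, g21 = 1 - b2, 1 / (2 * b2), -gamma / b2

b     = np.array([b1, b2])
alpha = np.array([0.0, a21])               # alpha_i = sum_j alpha_ij
B     = np.array([[gamma, 0.0],
                  [a21 + g21, gamma]])     # B_ij = alpha_ij + gamma_ij
omega = np.linalg.inv(B)

lhs = b @ omega @ alpha**2                 # sum_{i,j} b_i omega_ij alpha_j^2
print(lhs)                                 # != 1, so the stiff order stays at two
```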
70 Runge-Kutta Methods Chapter 3 Time Splitting In general it is inefficient (or even infeasible) to apply the same integration formula to different parts of the advection diffusion reaction system in higher dimensions, ut + ∇ · (au) = ∇ · (D∇u) + f (u). If, for example, the discretization of the advection (and diffusion) results in a stiff system, then it calls for an implicit method to solve that part of the equation. If on the other hand the reaction terms are not stiff, then explicit methods are often more suitable for that part of the equation. Or, it could also be the other way around. If we have a stiff reaction term, then we would like to use implicit methods. If we used for spatial discretization limiters, then explicit methods are often more suitable. Solving the system using a simple implicit integration rule results into a large system of nonlinear algebraic equations. Due to simultaneous coupling over species and space the nonlinear system becomes too large to handle. In such cases we want to have an appropriate form of splitting. The general idea behind splitting is to split a complicated system into smaller parts, which can be solved efficiently with suitable integration formulas, for the sake of time stepping. This chapter is organized as follows. We start with the explanation of the concept of (first and second order) time splitting. The chapter is concluded with remarks on the treatment of boundary conditions and stiffness. 3.1 Operator Splitting The technique in this section is called Operator Splitting or Time Splitting. In this section we threat on first order splitting. We focus more on concepts rather than on actual methods. 72 3.1.1 Time Splitting First Order Splitting of Linear ODE Problems Consider the linear homogeneous ODE system w0 (t) = Aw(t), t>0 and w(0) = w0 . (3.1) This system can be seen as a semi-discretization of a linear PDE. Assume that we have for A a two term splitting, e.g., A = A 1 + A2 . The exact solution of (3.1) is given by w(tn+1 ) = eτ A w(tn ), where τ = tn+1 − tn . Since A has a splitting one can use it to approximate a solution of (3.1). The exact solution of (3.1) is then approximated by wn+1 = eτ A2 eτ A1 wn , (3.2) where wn is an approximation of w(tn ) and τ = tn+1 − tn . We have found the simplest splitting, in which the two subproblems d ∗ dt w (t) d ∗∗ dt w (t) = A1 w∗ (t) for tn < t ≤ tn+1 with w∗ (tn ) = wn , ∗∗ ∗∗ ∗ = A2 w (t) for tn < t ≤ tn+1 with w (tn ) = w (tn+1 ), have to be solved one after each other starting from w n . To complete the integration after one step take wn+1 = w∗∗ (tn+1 ). The error introduced by splitting, called the splitting error, can be found by inserting (3.2) into the exact solution. This gives w(tn+1 ) = eτ A2 eτ A1 w(tn ) + τ ρn , with ρn the local truncation error. The fact that τ ρ n is the error introduced per step starting from the exact solution it is the local splitting error. Since 1 eτ A = I + τ (A1 + A2 ) + τ 2 (A1 + A2 )2 + O(τ 3 ), 2 1 τ A2 τ A1 e e = I + τ (A1 + A2 ) + τ 2 (A21 + 2A1 A2 + A22 ) + O(τ 3 ), 2 ρn satisfies 1 1 τA ρn = e − eτ A2 eτ A1 w(tn ) = τ [A1 , A2 ]w(tn ) + O(τ 2 ). τ 2 The commutator of A1 and A2 is defined as [A1 , A2 ] = A1 A2 − A2 A1 . It is obvious that the splitting defined in (3.2) has order one, unless A 1 and A2 commute. As can be found in [13], splitting appears to be stable provided that the sub steps themselves are stable. 3.1 Operator Splitting 3.1.2 73 Nonlinear ODE problems For general nonlinear ODE problems w0 (t) = F (t, w(t)), t>0 and w(0) = w0 . 
(3.3) with a two term splitting F (t, w) = F1 (t, w) + F2 (t, w), the first order linear splitting (3.2) is d ∗ dt w (t) d ∗∗ dt w (t) = F1 (t, w∗ (t)) for tn < t ≤ tn+1 with w∗ (tn ) = wn , = F2 (t, w∗∗ (t)) for tn < t ≤ tn+1 with w∗∗ (tn ) = w∗ (tn+1 ), giving wn+1 = w∗∗ (tn+1 ) as the next approximation. The nonlinear counterpart of the splitting error is again of order one. As seen in [13], this error can be derived by Taylor expansions of w ∗ (tn+1 ) and w∗∗ (tn+1 ) around t = tn . One then obtains that the local splitting error ρ n is equal to ∂F2 1 ∂F1 F2 − F1 (tn , w(tn )) + O(τ 2 ). ρn = τ 2 ∂w ∂w If the bracketed term equals zero the splitting error is of order two. 3.1.3 Second and Higher Order Splitting. In this section we treat second and higher order splitting methods for linear and nonlinear problems. To get a second order splitting error the idea of symmetry in splitting is proposed by Strang. Linear ODE Problems Consider the linear ODE system (3.1) and assume for A the two-term splitting A = A1 + A2 . The exact solution of (3.1) can be approximated by (3.2). Interchanging the order of A 1 and A2 after each half time step will lead to symmetry and to better accuracy. The solution of (3.1) is then approximated by 1 1 1 1 1 1 wn+1 = e 2 τ A1 e 2 τ A2 e 2 τ A2 e 2 τ A1 wn = e 2 τ A1 eτ A2 e 2 τ A1 wn . (3.4) The above splitting is proposed by Strang [33] and is therefore called Strang Splitting. The splitting error of Strang splitting has a formal consistency of order two. By series expansion the local truncation error, or splitting error, is found as ρn = 1 2 τ ([A1 , [A1 , A2 ]] + 2[A2 , [A1 , A2 ]]) w(tn+ 1 ) + O(τ 4 ). 2 24 (3.5) 74 Time Splitting Another second order symmetrical splitting of Strang is 1 τ A1 τ A2 e e + e τ A2 eτ A1 wn , wn+1 = 2 with splitting error ρn = − (3.6) 1 2 τ ([A1 , [A1 , A2 ]] + [A2 , [A2 , A1 ]]) w( tn ) + O(τ 3 ). 12 The splitting (3.6) is more expensive than (3.4) 1 , but has as advantage that the factors eτ A1 eτ A2 and eτ A2 eτ A1 can be computed parallel. In the case of a multi-component splitting of A, e.g., A = A 1 + A2 + A3 , the symmetrical Strang splitting method (3.4) is just a repeated application of itself. Hence, for the three-term splitting we obtain 1 1 1 1 wn+1 = e 2 τ A1 e 2 τ A2 eτ A3 e 2 τ A2 e 2 τ A1 wn , (3.7) which has also a second order splitting error. In the same way the alternative Strang splitting (3.6) can be generalized. Nonlinear ODE Problems The extension of the Strang splitting (3.4) is straightforward, d ∗ w (t) = F1 (t, w∗ (t)) for tn < t ≤ tn+ 1 2 dt with w∗ (tn ) = wn , d ∗∗ w (t) = F2 (t, w∗∗ (t)) for tn < t ≤ tn+1 dt with w∗∗ (tn ) = w∗ (tn+ 1 ), 2 d ∗∗∗ w (t) = F1 (t, w∗∗∗ (t)) for tn+ 1 < t ≤ tn+1 2 dt ∗∗ with w∗∗∗ (tn+ 1 ) = wn+1 , 2 w∗∗∗ (t giving wn+1 = n+1 ) as the next approximation. Again we will have formal consistency of order two. This result can be obtained using Taylor expansion under the condition that the ODE is autonomous. Referring to [13] we state that for autonomous problems we also have a second order splitting error. 1 This follows by writing (3.6) out as „ «n “ ” “ ” 1 e τ A 1 e τ A 2 + e τ A 2 e τ A 1 · · · e τ A 1 e τ A 2 + e τ A 2 e τ A 1 w0 . wn = 2 3.2 Boundary Values, Stiff Terms and Steady State 75 Higher-Order Splittings Examples of higher-order splittings are the fourth order splitting wn+1 = 1 4 1 (S 1 τ )2 − Sτ 3 2 3 wn , (3.8) 1 where Sτ = e 2 τ A1 eτ A2 e 2 τ A1 is the second order Strang splitting operator. 
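Before continuing with the splitting (3.8), the orders of the local splitting errors of (3.2) and (3.4) can be checked numerically. A sketch assuming Python with NumPy and SciPy; the two matrices are illustrative and chosen such that they do not commute:

```python
import numpy as np
from scipy.linalg import expm

A1 = np.array([[0.0, 1.0], [0.0, 0.0]])
A2 = np.array([[0.0, 0.0], [1.0, 0.0]])   # [A1, A2] != 0
A  = A1 + A2

def lie(tau):     # first order splitting (3.2)
    return expm(tau * A2) @ expm(tau * A1)

def strang(tau):  # symmetric Strang splitting (3.4)
    return expm(0.5 * tau * A1) @ expm(tau * A2) @ expm(0.5 * tau * A1)

for tau in (0.1, 0.05):
    err_lie    = np.linalg.norm(expm(tau * A) - lie(tau))
    err_strang = np.linalg.norm(expm(tau * A) - strang(tau))
    print(tau, err_lie, err_strang)
# halving tau divides err_lie by roughly 4 (local order 2)
# and err_strang by roughly 8 (local order 3)
```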
Since the splitting (3.8) contains a negative weight, it is not known for which classes of problems this scheme is stable. Another fourth order splitting scheme, derived by Yoshida [39] and Suzuki [34], reads

wn+1 = Sθτ S(1−2θ)τ Sθτ wn , (3.9)

with θ = (2 − 2^{1/3})^{−1} ≈ 1.35. Observe that 1 − 2θ < 0, which means that a sub-step with negative time has to be taken. As mentioned in [13], reversing time for diffusion or (stiff) reaction terms leads to ill-posedness. Higher order splittings are therefore mainly of use for conservative problems where boundary conditions are not relevant, such as the Schrödinger equation, see [13].

3.2 Boundary Values, Stiff Terms and Steady State

3.2.1 Boundary Values

For PDE problems where the boundary conditions are important, difficulties with splitting may occur. The boundary conditions of the PDE problem have a physical interpretation, while boundary conditions for the sub-steps are missing; the sub-steps themselves only sometimes have a physical interpretation. Therefore one may have to reconstruct boundary conditions for the specific splitting under consideration. A general analysis of boundary conditions for splitting methods is, at present, still lacking. The rule of thumb we will use is that the treatment of the boundaries should coincide as much as possible with the scheme used in the interior of the domain.

3.2.2 Stiff Terms

Assume we have the ODE problem w′(t) = Aw(t), where A has a two-term splitting A = A1 + A2. Assume also that ‖A1‖ is bounded and that A2 has an eigenvalue equal to −1/ε with 0 < ε ≪ 1. Verwer et al. [35, 32] showed that the first order splitting with A as above, see Section 3.1, retains its first order accuracy. In [35, 32] it is also shown that the second order Strang splitting will in general give only first order accuracy.

3.2.3 Steady State

When an advection-diffusion-reaction problem with stiff reaction terms has to be solved, the following behavior usually occurs. First there is a short time interval in which the concentrations change rapidly, the so-called transient phase. After the transient phase a steady state is reached, i.e., the reactions between the species are in balance.

If one solves an advection-diffusion-reaction system with time or operator splitting, then the steady states are not returned exactly. This is easily seen in the following linear ODE case. Suppose the advection-diffusion-reaction system has been discretized in space. We then have the semi-discrete system w′(t) = Aw(t), which we solve with time splitting, thus A = A1 + A2. In the steady state the concentrations are in balance, so w′(t) = 0, and hence A1w + A2w = 0. Since the two-term splitting introduces an error of at least order one, this error is also present in the computed steady state. The same holds, in general, for all splitting methods. Other time integration methods, such as Runge-Kutta methods and Runge-Kutta-Chebyshev methods, return steady states exactly.

Chapter 4 IMEX Methods

Many advection-diffusion-reaction equations have a natural splitting into a non-stiff or moderately stiff part and a stiff part. For the non-stiff or moderately stiff part explicit solvers are suitable; for the stiff part one would use an implicit method. To separate the non-stiff and stiff parts, time splitting is an option, but it has disadvantages such as splitting errors and the treatment of intermediate boundary conditions. Another disadvantage is that a subproblem cannot readily be solved by a multi-step method, since
it needs information from previous time steps. In this part we consider IMEX methods, which are methods that are a suitable mix of implicit and explicit methods. For instance, there exist IMEX multi-step and IMEX Runge-Kutta methods. The main idea of an IMEX method will be sketched in the next section, where we treat the IMEX- θ method. 4.1 IMEX- θ Method Suppose we have the nonlinear system or semi-discretization w0 (t) = F (t, w(t)), where F (t, w(t)) has the natural splitting F (t, w(t)) = F0 (t, w(t)) + F1 (t, w(t)), with F0 is a non-stiff and F1 stiff. In advection-diffusion-reaction systems the non-stiff term is for instance advection and the stiff terms the discretized diffusion and reactions. The non-stiff term is suitable for explicit time integration while the stiff terms are more suitable for implicit integration methods. An example of a method that combines explicit as well as implicit treatment of respectively the non-stiff term F 0 (t, w(t)) and stiff term F1 (t, w(t)) 78 IMEX Methods is the following one : wn+1 = wn + τ (F0 (tn , wn ) + (1 − θ)F1 (tn , wn ) + θF1 (tn+1 , wn+1 )) , (4.1) where the parameter θ ≥ 21 . This method is a combination of the Euler Forward method, which is explicit, and the implicit θ -method. Methods that are mixtures of IM plicit and EX plicit methods are called IMEX methods. The method given in (4.1) is called the IMEX- θ Method. Truncation error The truncation error of the IMEX- θ method can be derived by inserting the exact solution of w 0 (t) = F (t, w(t)) into (4.1). We then obtain w(tn+1 ) = w(tn ) + τ (F0 (tn , w(tn )) + (1 − θ)F1 (tn , wn )+ θF1 (tn+1 , w(tn+1 ))) + τ pn , (4.2) where pn is the truncation error. Using Taylor series of w(t n+1 ), F1 (tn+1 , w(tn+1 )) and F0 (tn+1 , w(tn+1 )) near tn , we obtain pn = 1 − θ τ w00 (tn ) + θτ F00 (tn , w(tn )) + O(τ 2 ). 2 Stability Consider the test equation w0 (t) = λ0 w(t) + λ1 w(t), and let zj = τ λj for j = 0, 1. Applying the IMEX- θ method to this test equation gives 1 + z0 + (1 − θ)z1 R(z0 , z1 ) = . (4.3) 1 − θz1 Stability of the IMEX- θ method thus requires |R(z0 , z1 )| ≤ 1. (4.4) To analyze the stability region (4.4) we have two starting points : 1. Assume the implicit part of the method to be stable, in fact A-stable, and investigate the stability region of the explicit part, 2. Assume the explicit part of the method to be stable and investigate the stability region of the implicit part. 4.1 IMEX- θ Method 79 Starting with the first point, we assume the implicit part of the IMEXθ method to be A-stable. Define the set D0 = {z0 ∈ C : the IMEX scheme is stable ∀z1 ∈ C− }, where C− is the set C− = {z ∈ C : Re z ≤ 0}. The question is whether the set D0 is smaller, larger or equally shaped in comparison with the stability region of Euler Forward. Using the maximum modulus theorem and the assumption that the implicit part of the IMEX- θ method gives that z1 can be replaced by z1 = i · r, r ∈ R. Substituting (4.3) into (4.4) gives after some elementary calculations that z 0 = x0 + iy0 ∈ D0 if and only if for all r ∈ R yields (2θ − 1)r 2 + 2(θ − 1)y0 r − (2x0 + x20 + y02 ) ≥ 0. In the above inequality the left-hand side is a cubic function of r. For θ > 21 this function is larger or equal than zero when the discriminant is less or equal than zero. We then obtain that z 0 = x0 + iy0 ∈ D0 iff θ 2 y02 + (2θ − 1)(1 + x0 )2 ≤ 2θ − 1. For θ = 21 the stability region of the IMEX method reduces to the line segment [−2, 0], while for θ = 1 the stability region of Euler Forward is obtained. 
See Figure 4.1. 1.5 1.5 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −3 −1.5 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 −3 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 Figure 4.1: Boundary of regions D0 (left) and D1 for θ = 0.5 (circles), 0.51 (-·-), 0.6 (- -) and 1 (solid) The alternative is to assume that the explicit part of the IMEX method is stable, which means that z0 is an element of the set S0 = {z0 : |1 + z0 | ≤ 1}. The question now is, what is the set D1 = {z1 ∈ C : the IMEX scheme is stable ∀z0 ∈ S0 }. Some elementary calculations yield that z 1 ∈ D1 iff 1 + |(1 − θ)z1 | ≤ |1 − θz1 |, 80 IMEX Methods is smaller, larger or equally shaped in comparison with the stability region of Euler Backward. The boundary of the stability regions are for several values of θ plotted in Figure 4.1. Hence, for θ = 12 we obtain as stability region for the IMEX method the negative real axis. For θ = 1 we obtain as stability region the stability region of Euler Backward. We remark that for θ = 1 the IMEX method has favorable stability properties. It could be seen as a form of time splitting where we first solve the explicit part with Euler Forward and the implicit part with Euler Backward. However, using this IMEX method we do not have errors as consequence of • intermediate results that are inconsistent with the full equation, • intermediate boundary conditions to solve these intermediate results. If one uses this IMEX- θ method with θ = 12 , then one has to pay a little more attention to stability. If, for instance, one has a system with complex eigenvalues, then the IMEX method will not be stable for θ = 21 . Steady State If we are in steady state, which means that F (w) = 0, F is independent of t, then the splitting of F (t, w(t)) becomes in steady state F (w) = F 0 (w) + F1 (w), where F , F0 and F1 are independent of t. If this steady state w is a stationary point of the IMEX-θ scheme, then this means that the IMEX-θ scheme returns steady states exactly. Inserting the steady state w into (4.1) gives w = w + τ (F0 (w) + (1 − θ)F1 (w) + θF1 (w)) = w. The conclusion is that steady states are returned exactly by the IMEX-θ scheme. 4.2 Concluding Remarks The general idea of IMEX methods is explained in Section 4.1 using the IMEX- θ method. This idea can also be used in Multi-step methods, RungeKutta methods and so on. The IMEX- θ method is also the simplest example of an IMEX R-K method. In the next example we give a more sophisticated example of an IMEX R-K method. Example 4.1. We combine the explicit and implicit trapezoidal rule in the following way : 1 1 ∗ ∗ ), = wn + τ F0 (tn , wn ) + τ F1 (tn , wn ) + τ F1 (tn+1 , wn+1 wn+1 2 2 1 1 ∗ wn+1 = wn + τ F (tn , wn ) + τ F (tn+1 , wn+1 ). 2 2 4.2 Concluding Remarks 81 This method has second order consistency, which can be verified by simple calculations. Applying this IMEX trapezoidal rule to the test equation y 0 = (λ0 + λ1 )y we get R(z0 , z1 ) = 1 + 1 + 12 z0 (z0 + z1 ), 1 − 12 z1 as stability function. If we take the limit for z 1 → −∞1 , then we have as stability criterion |R(z)| = |1 + z0 | ≤ 1. Hence, this IMEX trapezoidal rule is not suitable for advection-diffusionreaction problems if advection is treated explicitly and discretized with higher order differences. For more detailed information we refer to [13]. Another example of an IMEX R-K method is the subject of the next chapter, titled IMEX Runge-Kutta Chebyshev methods. Since this class of IMEX R-K methods recently is developed for solving advection-diffusionreaction equations, see [36], we devote the next chapter to it. 
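A minimal sketch of the IMEX-θ method (4.1) for a linear split problem (Python/NumPy; the matrices, the forcing term g and the choice θ = 1 are illustrative). It also confirms that a steady state, once reached, is reproduced exactly:

```python
import numpy as np

# split linear problem w' = F0(w) + F1(w), F0(w) = A0 w (non-stiff),
# F1(w) = A1 w + g (stiff, with a forcing so the steady state is non-trivial)
A0 = np.array([[-1.0]])
A1 = np.array([[-1.0e4]])
g  = np.array([1.0e4])
w_star = np.linalg.solve(A0 + A1, -g)      # steady state: F0(w) + F1(w) = 0

def imex_theta_step(w, tau, theta=1.0):
    """One step of (4.1): explicit in F0, theta-implicit in F1."""
    rhs = w + tau * (A0 @ w + (1 - theta) * (A1 @ w + g) + theta * g)
    return np.linalg.solve(np.eye(1) - tau * theta * A1, rhs)

print(np.allclose(imex_theta_step(w_star, tau=0.1), w_star))   # steady state preserved
```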
1 The implicit part is unconditionally stable. 82 IMEX Methods Chapter 5 IMEX Runge-Kutta-Chebyshev Methods In this chapter the IMEX extension of Runge-Kutta-Chebyshev methods are introduced. The IMEX extension proposed in [36, 37] is an RKC method that also deals with highly stiff (reaction) terms. The highly stiff terms are treated implicitly and the moderate stiff terms, for instance coming from diffusion, are treated in an explicit way. As starting point we have the Runge Kutta Chebyshev methods defined in Section 2.3. Recall that the general form of a RKC method is W0 = w n , W1 = wn + µ̃1 τ F (tn + c0 τ, W0 ), Wj = (1 − µj − νj )wn + µj Wj−1 + νj Wj−2 + +µ̃1 τ F (tn + cj−1 τ, Wj−1 ) + γ̃j τ F (tn + c0 τ, W0 ), wn+1 = Wns . (5.1) Suppose we have a linear system w0 (t) = F (t, w(t)), (5.2) where F (t, w(t)) can be split as F (t, w(t)) = FE (t, w(t)) + FI (t, w(t)). The term FI (t, w(t)) is supposed to be too stiff to be treated efficiently by a Runge-Kutta-Chebyshev method (5.1). The term F E (t, w(t)) is the moderate stiff term which can still be treated explicitly by (5.1). 84 IMEX Runge-Kutta-Chebyshev Methods 5.1 Construction of the IMEX scheme The first stage of (5.1) becomes in the IMEX-RKC scheme W1 = wn + µ̃1 τ FE (tn + c0 τ, W0 ) + µ̃1 τ FI (tn + c1 τ, W1 ), (5.3) with µ̃1 = b1 ω1 . Note that the highly stiff term is treated implicitly. The other s − 1 subsequent stages of (5.1) will be modified in a similar way. The stability function of the first stage of (5.1) is equal to R1 (z0 , z1 ) = 1 + b 1 ω1 z0 , 1 − b 1 ω1 z1 (5.4) when (5.3) is applied to the test equation w 0 (t) = λ0 w(t) + λ1 w(t) with zj = λj τ . Following [36], we impose b1 = so that R1 (z0 , z1 ) = 1 , ω0 1 + b1 ωω01 z0 1− ω1 ω0 z 1 . (5.5) Observe that the choice for b1 is different than in Section 2.3. This choice for b1 enables the following ansatz Ansatz 5.1. All stability functions R j (z0 , z1 ) from the internal stages j, j = 1, . . . , s, of the IMEX-RKC scheme are taken of the form Rj (z0 , z1 ) = aj + bj Tj The coefficients b1 = 1 ω0 , ω0 + ω 1 z0 1 − ωω10 z1 ! , aj = 1 − bj Tj (ω0 ). (5.6) b0 = b2 and bj for j ≥ 2 is taken as bj = Tj00 (ω0 ) . (Tj0 (ω0 ))2 These stability functions of the IMEX-RKC scheme are an extension of the stability functions of the RKC scheme. Taking z 1 = 0 in (5.6) we obtain the internal stability functions (2.17). The construction of the IMEX scheme is based on the following. Rewrite (5.6) into −aj Rj (z0 , z1 ) ω0 + ω 1 z0 Tj (x) = + , x= , bj bj 1 − ωω10 z1 5.2 Stability 85 and apply the recursion (2.13). Inserting x gives (1 − ω1 )Rj ω0 bj ω1 )+2 Rj−1 · (ω0 + ω1 z0 ) ω0 bj−1 bj bj ω1 −2 aj−1 (ω0 + ω1 z0 ) + aj−2 (1 − z1 ) bj−1 bj−1 ω0 ω1 bj Rj−2 · (1 − z1 ). − bj−2 ω0 = aj (1 − Identifying Rj with Wj and Rj z1 with τ FI (tn +cj τ, Wj ) and so on we obtain Wj − µ̃1 τ FI,j = (aj − µj aj−1 − νj aj−2 )W0 + µj Wj−1 + νj Wj−2 + µ̃j τ FE,j−1 + γ̃j τ FE,0 − νj µ̃1 τ FI,j−2 − µ̃1 (aj − νj aj−2 )τ FI,0 , where FI,j = FI (tn + cj τ, Wj ) and so on. The coefficients µ̃j , γ̃j , . . . are defined as in Section 2.3. Using aj − µj aj−1 − νj aj−2 = 1 − νj − µj we get W0 = w n , W1 = wn + µ̃1 τ FE,0 + µ̃1 τ FI,1 , Wj = (1 − µj − νj )wn + µj Wj−1 + νj Wj−2 + (5.7) +µ̃j τ FE,j−1 + γ̃j τ FE,0 , [γ̃j − (1 − µj − νj )µ̃1 ]τ FI,0 − νj µ̃1 τ FI,j−2 + µ̃1 FI,j wn+1 = Ws . Remark 5.2. As already remarked before, the scheme (5.7) is an IMEX extension of (5.1) in a natural way. Using easy calculations there can be obtained that also this IMEX extension of RKC returns the steady state exactly. 
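The stability function of the full scheme is the j = s member of the ansatz (5.6). The sketch below (Python/NumPy) evaluates it with the second order coefficients ω0, ω1 and bs of (2.23), taking β(s) = (1 + ω0)/ω1, which reduces to 2/3 (s² − 1) in the undamped case; the values of s and ε are illustrative:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

s, eps = 5, 2 / 13
Ts   = Chebyshev([0] * s + [1])      # shifted argument of T_s is built below
dTs  = Ts.deriv()
ddTs = Ts.deriv(2)

w0 = 1 + eps / s**2                  # cf. (2.23)
w1 = dTs(w0) / ddTs(w0)
bs = ddTs(w0) / dTs(w0)**2
a_s = 1 - bs * Ts(w0)

def R(z0, z1):
    """Stability function a_s + b_s T_s((w0 + w1 z0)/(1 - (w1/w0) z1)), cf. (5.6)."""
    return a_s + bs * Ts((w0 + w1 * z0) / (1 - (w1 / w0) * z1))

beta = (w0 + 1) / w1                 # real stability boundary of the explicit part
z0 = np.linspace(-beta, 0, 400)
for z1 in (0.0, -1.0e3, -1.0e8):     # implicit part: arbitrarily stiff z1 <= 0
    print(z1, np.max(np.abs(R(z0, z1))) <= 1 + 1e-12)
```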
Remark 5.3. In each stage the following system of nonlinear algebraic equations has to be solved:

Wj − µ̃1 τ FI (tn + cj τ, Wj ) = Vj ,

with Vj known and Wj unknown. If the implicit part of F (t, w(t)) consists of the 'stiff' reaction terms only, then FI has no underlying spatial grid connectivity and these systems decouple over the grid points.

5.2 Stability

Consider the test equation w′(t) = λE w(t) + λI w(t), and assume that both eigenvalues are real and non-positive. For many practical cases this imposes no restriction. Applying the IMEX-RKC method to this test equation gives the stability function

Rs (z0 , z1 ) = as + bs Ts ( (ω0 + ω1 z0 ) / (1 − (ω1/ω0) z1 ) ),

with z0 = λE τ and z1 = λI τ. For stability we require |Rs (z0 , z1 )| ≤ 1. Since we assumed λI ≤ 0, it follows that z1 ∈ (−∞, 0]. Therefore

| (ω0 + ω1 z0 ) / (1 − (ω1/ω0) z1 ) | ≤ |ω0 + ω1 z0 | . (5.8)

From (5.8) it follows easily that |Rs (z0 , z1 )| ≤ 1 as long as z0 ∈ [−β(s), 0]. The IMEX extension of the RKC scheme is thus unconditionally stable in the implicit part, while the stability condition for the explicit part remains unchanged.

5.3 Consistency

The local truncation error of the RKC scheme is of order two. The IMEX extension of the second order RKC scheme has the local error

ρn = τ/(s² − 1) · Sn + O(τ ²),

with s the number of stages and Sn = F′I (w(tn )) F (w(tn )) (F′I denotes the Jacobian of FI ). If the number of stages s is large, the influence of this extra term does no harm. If the number of stages is relatively small, there will be order reduction; in that case we only have first order consistency.

5.4 Final Remarks

In actual computations this IMEX extension of the RKC scheme is used in the same manner as the RKC scheme itself. The only difference is that in each stage a nonlinear algebraic system has to be solved. When solving this nonlinear system is cheap, as is the case for stiff chemical reactions giving rise to small systems decoupled over the grid, the IMEX-RKC scheme (5.7) is more efficient than the fully explicit form (5.1).

Advection-diffusion-reaction problems that are advection dominated, or problems where advection is discretized with (higher order) upwind schemes, give rise to (more) complex eigenvalues of the Jacobian of the right-hand side of (5.2). In that case one would like to enlarge the stability region in the imaginary direction. By increasing the damping parameter ε the strip along the real axis becomes wider, see Figure 5.1. A disadvantage is that the stability boundary β(s) then becomes smaller. The result is that for the moderately stiff terms, such as discretized diffusion, a smaller time step must be taken to maintain stability.

Figure 5.1: Stability regions for the second order polynomial P5 with damping parameter ε small (ε = 2/13, left) and large (ε = 10, right).

Chapter 6 Linear Multi-Step Methods

One-step methods like Runge-Kutta methods compute wn+1 using only the previous approximation wn. Linear multi-step methods, on the other hand, also use additional previous approximations wn−i, i = 0, . . . , k − 1, with a fixed integer k. This integer k is the total number of previous approximations used to compute the next approximation wn+1. Next, we give the definition of a linear k-step method.

Definition 6.1.
The linear k-step method is defined by the formula k X j=0 αj wn+j = τ k X βj F (tn+j , wn+j ), n = 0, 1, . . . , (6.1) j=0 which uses the k past values wn , . . . , wn+k−1 to compute wn+k . Remark that the most advanced level is tn+k instead of tn+1 . The method is explicit when βk = 0 and implicit otherwise. Furthermore, we will assume that αk > 0. The relation with the class of Runge-Kutta methods is described in the following remark. Remark 6.2. Taking k = 1 in Definition 6.1 gives a one-step method. In this case of linear one-step methods there is some overlap with Runge-Kutta methods. For instance, the θ method, wn+1 = wn + τ ((1 − θ)F (tn , wn ) + θF (tn+1 , wn+1 )) , belongs to the class of Runge-Kutta methods. Its Butcher array is given in Table 6.1. Obviously, the θ method also belongs to the class of linear multi-step methods. An advantage of linear multi-step methods with respect to Runge-Kutta methods is the following. In every new approximation w n+k of explicit linear multi-step methods only one function evaluation is needed against s 90 Linear Multi-Step Methods 0 1 0 0 1−θ 0 1 θ Table 6.1: Butcher array of the θ method for explicit s stage Runge-Kutta methods. We remark that explicit linear multi-step methods use more memory to store the previous (k − 2) function evaluations. A disadvantage of linear multi-step methods is that the first k−1 approximations can not be computed with the linear k-step scheme. To compute the first (k − 1) approximations, one could use a 1. Runge-Kutta scheme, 2. use for the first step a linear 1-step method, for the second approximation a linear 2-step method, . . . and for the (k − 1) st approximation a linear (k − 1)-step scheme. 6.1 The Order Conditions A linear multi-step method is called consistent of order p if the local error satisfies y(tn+k ) − w n+k = O(τ p+1 ), where y(tn+k ) is the exact solution and w n+k the approximation obtained by inserting the past k exact values in (6.1). To find order conditions for order p consistency we will do the following. First we introduce the linear difference operator L, associated with (6.1), as L(y, t, τ ) = k X i=0 αi y(t + iτ ) − τ βi y 0 (t + iτ ) . Here y(t) is some differentiable function defined on an interval that contains tn , . . . , tn+k . Lemma 6.3. Consider y 0 (t) = F (t, y) with F (t, y) continuously differentiable. Let y(t) be its exact solution. For the local error one has −1 ∂F y(tn+k ) − wn+k = αk I − τ βk (tn+k , η) L(y, tn , τ ). (6.2) ∂y In the case that F (t, y) is a scalar function η is some value between y(t n+1 ) and wn+1 . If F (t, y) is a vector valued function, ∂F ∂y (tn+k , η) is the Jacobian of F (t, y), whose rows are evaluated at possibly different values η j lying on the segment between (y(tn+k ))j and (wn+k )j . 6.1 The Order Conditions 91 Proof. By definition, the ‘numerical’ solution w n+k is obtained by the equation k−1 X j=0 αj y(tn+j ) − τ βj F (tn+j , y(tn+j )) + αk wn+k − τ βk F (tn+k , wn+k ) = 0. Inserting the operator L(y, tn , τ ) gives L(y, tn , τ ) = (6.3) αk (y(tn+k ) − wn+k ) − τ βk (F (tn+k , y(tn+k )) − F (tn+k , wn+k )). Statement (6.2) in Lemma 6.3 follows from the mean value theorem applied to the right-hand side of (6.3). Following [10], a multi-step method (6.1) is said to be of order p, if one of the following conditions is satisfied 1. for all sufficiently regular functions y(t) we have L(y, t, τ ) = O(τ p+1 ), 2. the local error of (6.1) is O(τ p+1 ) for all sufficiently regular differential equations y 0 (t) = F (t, y(t)). 
Next, observe that Lemma 6.3 implies that the above conditions 1. and 2. are equivalent. The following theorem gives the conditions for coefficients αi and βi , i = 0, . . . , k, to get a multi-step scheme of order p. Theorem 6.4. The multi-step method (6.1) is of order p if and only if k X αi = 0 and k X αi ij = j βi ij−1 for j = 1, 2, . . . , p. (6.4) i=0 i=0 i=0 k X Proof. The method is of order p when L(y, t, τ ) = O(τ p+1 ). Since L(y, t, τ ) = k X i=0 αi y(t + iτ ) − τ βi y 0 (t + iτ ) , Taylor expansions of y(t + iτ ) and y 0 (t + iτ ) near t gives the conditions (6.4) by setting L(y, t, τ ) = O(τ p+1 ). Example 6.5. Backward Differentiation Formulas, or shorter BDF methods, are implicit and defined by βk = 1 and βj = 0 for j = 0, . . . , k − 1, with αj chosen such that the order is optimal, namely k. The 1-step BDF method is Backward Euler. The 2-step method is 1 3 wn+2 − 2wn+1 + wn = τ F (tn+2 , wn+2 ), 2 2 (6.5) 92 Linear Multi-Step Methods and the three step BDF is given by 3 1 11 wn+3 − 3wn+2 + wn+1 − wn = τ F (tn+3 , wn+3 ). 6 2 3 (6.6) In chemistry applications the BDF methods belong to the most widely used methods to solve stiff chemical reaction equations, due to their favorable stability properties. However, for k > 6 the BDF-methods are unstable, see [10, Theorem 3.4, page 329]. See [13]. For more background on stability properties for linear multistep methods, we refer to the next section. 6.2 Stability Properties We study in this part the stability properties of multi-step methods. As usual we will use the familiar test equation y 0 (t) = λy(t), where λ ∈ C. First, some properties on linear recursions are needed. Consider the linear recursion formula αk yn+k + αk−1 yn+k−1 + · · · + α0 yn = 0, (6.7) with αi ∈ C. We define its characteristic polynomial as the polynomial π(X) of degree k k X π(X) = αi X i . i=0 The following lemma, formulated without proof, gives the general solution of (6.7) in terms of roots of it characteristic polynomial. Lemma 6.6. Let χ1 , . . . , χl be the roots of π(X) of respective multiplicity m1 , . . . , ml . If each root of π(X) has multiplicity 1, then l = k. The general solution of (6.7) is given by yn = p1 (n)χn1 + · · · + pl (n)χnl , (6.8) where pi (n) are polynomials of degree (mi − 1). Then, the following can be deduced. From (6.8) it follows that boundedness of the sequence (yn )n≥0 , is guaranteed when the roots χ1 , . . . , χl lie in the unit disc and that roots on the unit disc are simple. The latter two conditions are called the root condition. To derive stability properties we will use the above results. The coefficients of the linear multi-step method will be contained in the polynomials p(X) = k X i=0 αi X i and σ(X) = k X i=0 βi X i . (6.9) 6.2 Stability Properties 6.2.1 93 Zero-Stability A typical multi-step requirement is the notion of zero-stability. Consider the trivial equation w 0 (t) = 0. Applying a linear k-step method to this equation gives, by substituting F (t, w(t)) = 0 in (6.1), the linear recursion k X αi wn+i = 0. i=0 The above linear recursion has characteristic polynomial p(X). Definition 6.7. A linear multi-step method (6.1) is said to be zero-stable, when its characteristic polynomial p(X) satisfies the root condition, i.e., the roots χi , i = 0, . . . , k, with multiple roots repeated, satisfy |χi | ≤ 1 and |χi | < 1 if χi is not simple. 
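The order conditions (6.4) and the root condition of Definition 6.7 are easily checked numerically. The sketch below (Python/NumPy) does so for the BDF-2 method (6.5):

```python
import numpy as np

# BDF-2, cf. (6.5): 3/2 w_{n+2} - 2 w_{n+1} + 1/2 w_n = tau F(t_{n+2}, w_{n+2})
alpha = np.array([0.5, -2.0, 1.5])          # alpha_0, alpha_1, alpha_2
beta  = np.array([0.0,  0.0, 1.0])          # beta_0,  beta_1,  beta_2
i = np.arange(3)

# order conditions (6.4): sum alpha_i = 0 and sum alpha_i i^j = j sum beta_i i^{j-1}
print(np.isclose(alpha.sum(), 0.0))
for j in (1, 2, 3):
    print(j, np.isclose(alpha @ i**j, j * (beta @ i**(j - 1))))   # holds for j = 1, 2 only

# zero-stability: the roots of p(X) = sum alpha_i X^i must satisfy the root condition
roots = np.roots(alpha[::-1])
print(roots)   # 1 (simple) and 1/3, so the root condition holds
```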
In other words, Definition 6.7 states that if a linear multi-step method is not zero-stable, then the method fails to solve even the trivial equation w'(t) = 0 in a stable manner. We remark that this requirement holds trivially for consistent one-step methods.

The First Dahlquist Barrier

Next, some remarks on the maximal attainable order of linear multi-step methods. Firstly, by counting the number of coefficients α_i and β_i in the order conditions (6.4), it follows that the maximal order of an implicit linear k-step method (i.e. β_k ≠ 0) is 2k. For an explicit linear k-step method (i.e. β_k = 0) it is 2k − 1. As mentioned above, a linear multi-step method only makes sense if it is also zero-stable. Zero-stability reduces the maximal attainable order for explicit methods to p = k, and for implicit methods to p = 2⌊(k+2)/2⌋. This is better known as the first Dahlquist barrier. For a derivation of this result, we refer to [5, 10].

6.2.2 The Stability Region

Applying the linear k-step method (6.1) to the test equation w'(t) = λ w(t), with λ ∈ C, gives

    Σ_{j=0}^{k} (α_j − z β_j) w_{n+j} = 0,    (6.10)

for n = 0, 1, ..., where z = τλ. The linear recursion (6.10) has the characteristic polynomial

    ρ_z(X) = p(X) − z σ(X) = Σ_{j=0}^{k} (α_j − z β_j) X^j.

By definition, the multi-step method is stable if the sequence (w_n)_{n≥0} is bounded. Boundedness of (w_n)_{n≥0} is guaranteed when ρ_z(X) satisfies the root condition¹. Therefore, the stability region is the set S ⊂ C such that

    z ∈ S  ⇔  ρ_z(X) satisfies the root condition.

Furthermore, zero-stability is guaranteed if and only if 0 ∈ S.

The characteristic polynomial ρ_z(X) has k roots. As is generally known², only one of these roots approximates e^z, the growth factor of the exact solution w(t) = e^{λt} over one step, up to O(z^{p+1}). This particular root is called the principal or superior root. The other k − 1 roots are called the spurious roots. Spurious roots play no role in the accuracy of the method, but they can cause instability.

Determination of the Stability Region

To determine the stability region of a linear multi-step method, observe that its boundary consists of those z for which ρ_z(X) has a root χ̃ of modulus one, |χ̃| = 1. Recall that z ∈ S ⇔ ρ_z(X) satisfies the root condition. In the χ-plane the relevant boundary is therefore the unit circle, i.e. χ = e^{iθ} with 0 ≤ θ ≤ 2π. In the z-plane the boundary ∂S can be found as follows. The parameters z and χ are related to each other by

    ρ_z(χ) ≡ p(χ) − z σ(χ) = 0.

From this relation it follows that the map χ ↦ z is given by

    z = p(χ) / σ(χ).    (6.11)

The boundary ∂S is then obtained by inserting χ = e^{iθ}, with 0 ≤ θ ≤ 2π. Thus, the boundary ∂S of the stability region S is described by

    z = p(e^{iθ}) / σ(e^{iθ}),    0 ≤ θ ≤ 2π.    (6.12)

The curve (6.12) must be considered as a left oriented curve. The stability region S is the part to the left of the curve, as long as it is not empty.

¹ The root condition is given in Definition 6.7.
² This can be shown by inserting w(t) = e^{λt} and F(t, w) = λw into Σ_{j=0}^{k} α_j w(t_{n+j}) = τ Σ_{j=0}^{k} β_j F(t_{n+j}, w(t_{n+j})) + τ ρ_{n+k−1}, with ρ_{n+k−1} the defect of order p.
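The boundary locus (6.12) is straightforward to evaluate numerically. The following sketch (an illustration only, assuming numpy) traces ∂S for an arbitrary linear multi-step method given its coefficient vectors; here it is applied to BDF-2, and plotting the resulting points reproduces the left panel of Figure 6.1 below.

import numpy as np

def boundary_locus(alpha, beta, npoints=400):
    """Evaluate z(theta) = p(e^{i theta}) / sigma(e^{i theta}), cf. (6.11)-(6.12)."""
    theta = np.linspace(0.0, 2.0 * np.pi, npoints)
    chi = np.exp(1j * theta)
    p = np.polyval(alpha[::-1], chi)        # p(X)     = sum_i alpha_i X^i
    sigma = np.polyval(beta[::-1], chi)     # sigma(X) = sum_i beta_i X^i
    return p / sigma

# BDF-2: alpha = (1/2, -2, 3/2), beta = (0, 0, 1)
z = boundary_locus(np.array([0.5, -2.0, 1.5]), np.array([0.0, 0.0, 1.0]))
print(z.real.min(), z.real.max())           # extent of the boundary curve along the real axis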
A-Stability

To use linear multi-step methods in, for instance, chemical applications, A-stability of the particular method is desirable. Define C̄ = C ∪ {∞}. We say that z = ∞ belongs to the stability region if the polynomial σ(χ) satisfies the root condition. This convention makes sense, because the roots of ρ_z(χ) tend to the roots of σ(χ) for z → ∞. A linear multi-step method is then said to be A-stable if

    S ⊃ { z ∈ C̄ : Re z ≤ 0 or z = ∞ }.

Unlike Runge-Kutta methods, there are not many A-stable linear multi-step methods. It can be derived that an A-stable linear multi-step method can be at most second order consistent. This fundamental result is called the second Dahlquist barrier. At first sight, the second Dahlquist barrier seems a disappointing result. However, by introducing the notion of A(α)-stability the problem is (partially) resolved. A method is said to be A(α)-stable if

    S ⊃ { z ∈ C̄ : z = 0, z = ∞ or |arg(−z)| ≤ α }.

For many stiff ODE problems full A-stability is not needed, since for these problems the eigenvalues with large modulus stay away from the imaginary axis.

Example 6.8. The BDF methods are A-stable for k = 1, 2. For 3 ≤ k ≤ 6 the BDF methods are A(α)-stable, with α depending on k, see Table 6.2.

    k   1    2    3    4    5    6
    α   90°  90°  86°  73°  51°  17°

    Table 6.2: Values of α for given k.

In Figure 6.1 the stability regions of the BDF-2 method (6.5) and the BDF-3 method (6.6) are given. Notice that the BDF-3 method is indeed not A-stable, as can be seen in Figure 6.1.

[Figure 6.1: Stability regions of the BDF-2 method (left) and the BDF-3 method (right).]

For the BDF-3 method we show how the boundary ∂S and the stability region S can be found. Recall that the three step BDF method is

    (11/6) w_{n+3} − 3 w_{n+2} + (3/2) w_{n+1} − (1/3) w_n = τ F(t_{n+3}, w_{n+3}).

Applying this method to the test equation gives the recursion

    (11/6) w_{n+3} − 3 w_{n+2} + (3/2) w_{n+1} − (1/3) w_n − z w_{n+3} = 0.    (6.13)

The characteristic polynomial corresponding to the recursion (6.13) is ρ_z(X) = p(X) − z σ(X), with

    p(X) = (11/6) X^3 − 3 X^2 + (3/2) X − 1/3   and   σ(X) = X^3.

The boundary ∂S of the stability region S of BDF-3 is therefore described by

    z = p(e^{iθ}) / σ(e^{iθ}) = ( (11/6) e^{3iθ} − 3 e^{2iθ} + (3/2) e^{iθ} − 1/3 ) / e^{3iθ},    0 ≤ θ ≤ 2π.

One has to consider this as an oriented curve; the stability region is the part to the left of it.

Chapter 7

Multi Rate Runge Kutta Methods

In this chapter we consider ODEs

    y' = f(t, y),    (7.1)

where y ∈ R^n and f : R × R^n → R^n. We assume that the solution y and the right-hand side of (7.1) can be split into rapidly (Active) and slowly (Latent) varying subsystems. By splitting the solution as

    y(t) = ( y_A(t) ; y_L(t) ),   y_A(t) ∈ R^{n_A},  y_L(t) ∈ R^{n_L}  and  n_A + n_L = n,    (7.2)

the ODE system (7.1) can then be written as a split system

    y_A'(t) = f_A(y_A(t), y_L(t)),   y_A(t_0) = y_{A,0},
    y_L'(t) = f_L(y_A(t), y_L(t)),   y_L(t_0) = y_{L,0},    (7.3)

which for simplicity, but without loss of generality, is assumed to be autonomous. Problems like (7.1) with a distinction between active and latent terms / subsystems arise frequently from discretization of PDEs of the advection-diffusion-reaction type, see e.g. [13]. Similar problems arise from applications in electronic circuits, multi-body dynamics and pneumatics.

With respect to splitting the solution y(t) we have some remarks. The first remark is that the splitting is time dependent, that is, at time t + ∆t the splitting can be different from the splitting at time t. Secondly, it can also vary in space. For example, in a chemical reactor with a heated susceptor, only in the area above the susceptor will there be active components in the gas mixture.
In all other parts of the reactor all components will be slow, in general. For more information with respect to the first remark we refer to [6]. With respect to the second remark, no literature could be found. However, 98 Multi Rate Runge Kutta Methods recently (February 2004), a research project started at the CWI, Amsterdam, which has as topic the contents of the above remarks. For sake of complicity, this research project is within the cluster MAS3, theme Nonlinear Dynamics and Complex Systems. In this chapter we introduce some basics on multi-rate strategies and define a multi-rate Runge-Kutta approach. 7.1 Multi-Rate Runge-Kutta Methods In the following we assume that the solution y(t) of the initial value problem (7.1) is split into active components y A (t), hopefully a small subset of y(t), and latent components yL (t). This results in a split system (7.3), which is assumed to be autonomous. The active components yA are integrated with a small step-size h, whereas the latent components are integrated with the large stpsize H. We remark that the methods to integrate the subsystems can, but do not have to, be the same. At the end of each macro-step H a synchronization of the micro-steps h and macro-steps H is performed. The following definition of a multi-rate Runge-Kutta method (MRK) for the numerical solution of (7.3) is taken from [19]. Definition 7.1. Assume we have split system (7.3) and that the active respectively latent part is integrated with a micro-step h respectively macrostep H. For each macro-step H and micro-step h we have the relation h= 1 H. m (7.4) The active components yA are integrated by for each λ = 0, 1, 2, . . . , m − 1 : λ kA,i = fA yA,λ + h yA,λ+1 = yA,λ + h s X i=1 s X j=1 λ λ aij kA,j , ỸL,i , i = 1, 2, . . . , s, λ bi kA,i ≈ y(t0 + (λ + 1)h), λ ≈ y (t + (λ + c )h) and c = Here, ỸL,i L 0 i i follow in the next section. Ps j=1 aij . (7.5) (7.6) λ will More details on ỸL,i 7.1 Multi-Rate Runge-Kutta Methods 99 The latent components yL are integrated by s X λ kL,i = fL ỸA,i , yL,0 + H āij kL,j , j = 1, 2, . . . , s̄, (7.7) j=1 yL,1 = yL,0 + H s̄ X i=1 b̄i kL,i ≈ y(t0 + H), where ỸA,i ≈ yA (t0 + c̄i H) and c̄i = more detailed information on ỸA,i . Ps̄ j=1 āij . (7.8) In the next section we give As can be seen in the above definition, the coupling between the active and latent subsystem, and vice versa, is performed by the intermediate λ . These intermediate values can be computed by stage values ỸA,i and ỸL,i interpolation or extrapolation according to the following strategies : • Fastest first strategy : – Integrate the active components y A with m steps of step-size h; yL -values are obtained by extrapolation from the previous step, – Integrate the latent components y L with one macro-step H; yA values are obtained by interpolation, • Slowest first strategy : – Integrate the latent components y L with one macro-step H; yA values are obtained by extrapolation from the previous step, – Integrate the active components y A with with m steps of step-size h; yL -values are obtained by interpolation, The extrapolation / interpolation strategies described above are natural choices, but these approaches inevitably turn the Runge-Kutta method into λ is a two-step method. In [19] an approach for the stage values ỸA,i and ỸL,i given such that the multi-rate Runge-Kutta method is a one (macro) step method. This can be achieved, with no extra cost,by using the following formulas for the stage values : λ ỸL,i = yL,0 + h s̄ X (γij + ηj (λ))kL,j i = 1, 2, . . . 
s, j=1 λ ỸA,i = yA,0 + H s X λ = 0, 1, . . . , m − 1, 0 (γ̄ij )kA,j i = 1, 2, . . . s̄. (7.9) (7.10) j=1 This strategy resembles the slowest first strategy in the sense that y A -values are obtained by extrapolation from the first micro-step. 100 Multi Rate Runge Kutta Methods Another idea to compute the ỸA,i -values is to use a lower order method. We refer to [19] for more information behind this idea. We remark that the method using this idea is called MRKII. The method given by (7.9) is called MRKI. 7.2 Order Conditions We remark that we have not mentioned any conditions for γ ij , γ̄ij and ηj (λ) to maintain the order of the given MRK methods. We will follow reference [19] to give some conditions to obtain a method of respectively order two and three. Given two third order RK-methods for the integration of the active and latent components of (7.3). These methods are joined in an MRK by the coupling coefficients γij , γ̄ij and ηj (λ), which must be chosen carefully to retain the order of the method. To ensure the MRKI to be of order two the following assumptions have to be satisfied : s̄ X ηj (λ) = λ, s̄ X γij = ci i = 1, 2, . . . , s, (7.11) j=1 j=1 and s X γ̄ij = c̄i i = 1, 2, . . . , s. (7.12) j=1 Remark that no extra conditions are needed. To be of order three the following conditions, in addition to the classical ones, have to be fulfilled : s̄ s X X i=1 j=1 bi γij c̄j = 1 , 6m s̄ s X X bi ηj (λ)c̄j = i=1 j=1 λ(λ + 1) , 2m (7.13) and s s̄ X X i=1 j=1 b̄i γ̄ij cj = m . 6 (7.14) Note that what has been said about MRKI applies to explicit as well as implicit methods, and even a mix of both, like implicit for the latent part and explicit for the active. From this point onwards one can construct particular MRKI methods. In [19] for example, one constructs a third order explicit MRK-method. 7.3 Stability 7.3 101 Stability Following [20], we adopt the test-problem, ẏA yA α11 α12 =A , A= ∈ R2×2 , ẏL yL α21 α22 (7.15) because multi-rate formulas require at least two equations. Assuming that α11 , α22 < 0 and γ = α12 α21 < 1, α11 α22 (7.16) ensures that both eigenvalues of A have negative real parts. The parameter γ can be viewed as a measure for the coupling between the equations. Also a measure κ for the stiffness of the system is introduced, i.e., κ= α22 . α11 (7.17) A compound step of the multi-rate Runge-Kutta methods suggested in the previous section applied to (7.15) can be expressed as yA,m yA,0 =K , (7.18) yL,1 yL,0 where the matrix K depends on the step-sizes H and h, and the coefficients of the method. Observe that the method is asymptotically stable if and only if the eigenvalues of K are both within the unit disk. The following scenario will be studied. For both parts, the active and the latent part of y(t), a method has been chosen to integrate the particular subsystem. Step-sizes H and h are chosen such that the uncoupled system (α12 = α21 = 0) is stable. This means that the stability functions R A (hα11 ) and RL (Hα22 are both of absolute value less than one. Remark that for the case α12 = α21 = 0 the eigenvalues of K are equal to m . Therefore, it seems convenient to express the matrix K in terms RL and RA of RA and RL , rather than in terms of H and h. Now it is interesting to study the stability region, that is the values γ, R A and RL for which eigenvalues of K are in the unit disk, and how this region varies with increasing m and stiffness ratio κ. To demonstrate this idea, we use the following example taken from [20]. 
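Such a stability check is easy to automate: build the propagation matrix K column by column by applying one compound macro-step to the unit vectors, and test whether both eigenvalues of K lie in the unit disk. The sketch below (an illustration only, assuming numpy) does this for one simple, hypothetical coupling choice — forward-Euler micro-steps for the active part with the latent value frozen at its start-of-step value, followed by one backward-Euler macro-step for the latent part; the concrete Euler-based example from [20] is worked out next.

import numpy as np

def compound_step(A, H, m):
    """One multirate macro-step for the 2x2 test problem y' = A y.
    Returns the propagation matrix K with (yA_m, yL_1) = K (yA_0, yL_0)."""
    h = H / m
    K = np.zeros((2, 2))
    for j in range(2):                       # build K column by column
        yA, yL0 = (1.0, 0.0) if j == 0 else (0.0, 1.0)
        for _ in range(m):                   # m explicit micro-steps, latent value frozen
            yA = yA + h * (A[0, 0] * yA + A[0, 1] * yL0)
        # one implicit (backward Euler) macro-step for the latent part
        yL = (yL0 + H * A[1, 0] * yA) / (1.0 - H * A[1, 1])
        K[:, j] = (yA, yL)
    return K

A = np.array([[-1.0, 0.5], [0.5, -100.0]])   # a11, a22 < 0, gamma < 1, kappa = 100
K = compound_step(A, H=0.5, m=10)
print(max(abs(np.linalg.eigvals(K))) < 1.0)  # asymptotically stable?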
Suppose we have for the active part the Euler Forward scheme and for the latent part the Euler Backward scheme, i.e., yA,λ+1 = yA,λ + hα11 yA,λ , yL,1 = yL,0 + Hα22 yL,1 . λ = 0, 1, 2, . . . , m − 1, (7.19) (7.20) Remark that for every step-size H the stability function |R L | < 1 and that |RA | < 1 if and only if −2 < α11 h < 0. The stability functions RA and RL 102 Multi Rate Runge Kutta Methods are related through H = mh as RL = 1 . 1 + m(1 − RA )κ (7.21) Using this and a lot of calculations, which are done in [20], one obtains a set of inequalities that determine a set of values for γ and R A such that the eigenvalues of K are in the unit disk. The following stability regions are obtained for κ = 1 and κ = 100, see Figure 7.1. We can conclude that the stability region becomes smaller with increasing m and increasing κ. In this case the coupling has not much influence, but there are multi-rate methods where ‘strong’ coupling gives very small stability regions. A modification of the synchronization of the active and latent part, gives a stability region as in 7.2, which is taken from [20]. We observe that in the second case coupling and stiffnes influence the range of step-sizes h that integrate (7.19) in a stable way. 7.3 Stability 103 κ=1 1 stable 0.5 γ 0 −0.5 unstable −1 −1.5 −2 −1 −0.8 −0.6 −0.4 −0.2 0 Ra 0.2 0.4 0.6 0.8 1 0.8 1 κ = 100 1 stable 0.5 γ 0 −0.5 unstable −1 −1.5 −2 −1 −0.8 −0.6 −0.4 −0.2 0 Ra 0.2 0.4 0.6 Figure 7.1: Boundaries of the stability regions for m = 1 (–)) and m = 10 (solid). The methods are stable to the right of the boundaries and unstable to the left. 104 Multi Rate Runge Kutta Methods κ = 100 1 0.5 γ 0 −0.5 unstable −1 −1.5 −2 −1 −0.8 −0.6 −0.4 −0.2 0 Ra 0.2 0.4 0.6 0.8 1 Figure 7.2: Boundaries of the stability regions for m = 1 (–)) and m = 10 (solid). The methods are stable to the right of the boundaries and unstable to the left. Part III Nonlinear Solvers Chapter 1 Introduction Solving a partial differential equation numerically gives after spatial and time discretization in general a (huge) system of nonlinear equations. There are several methods to solve the general nonlinear equation F (x) = 0, where x ∈ Rn and F : Rn → Rm , m, n ∈ N. Well known are the fixed point iteration, the Newton iteration, Broyden method and so on. In this part we will treat the most important ones. First we present the Newton iteration in one and more variables and also Broyden’s method. In Chapter 3 we discuss the Picard iteration. 108 Introduction Chapter 2 Newton’s Method The contents of this chapter is based on references [15, 23]. We present general properties and facts of Newton’s method, which we will also call Newton iteration. This chapter has the following construction. First we present Newton’s method in one and more variables and present some general properties. Then, we continue with convergence properties and criteria to terminate the iteration. Furthermore, we pay some attention to modifications of the Newton iteration. Finally we conclude this chapter with possible failures of these nonlinear solvers. 2.1 Newton’s Method in One Variable We shortly discuss the Newton iteration that solves the nonlinear equation f (x) = 0, (2.1) where f (x) is a real-valued function in the variable x ∈ R. The iteration xk+1 = xk − f (xk ) , f 0 (xk ) k = 0, 1, 2, . . . (2.2) is known as the Newton Iteration. This method can also be represented graphically as done in Figure 2.1. Properties and a derivation of this method will be done for the more general case in the next section. 
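As a minimal illustration of the scalar iteration (2.2), the sketch below applies Newton's method to f(x) = x² − 2 with an analytically supplied derivative and a simple residual-based stopping test; the example problem and tolerances are chosen here purely for illustration.

def newton_1d(f, fprime, x0, tol=1e-12, maxit=50):
    """Scalar Newton iteration (2.2): x_{k+1} = x_k - f(x_k)/f'(x_k)."""
    x = x0
    for k in range(maxit):
        fx = f(x)
        if abs(fx) <= tol:
            return x, k
        x = x - fx / fprime(x)
    return x, maxit

root, its = newton_1d(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root, its)   # converges to sqrt(2) in a handful of iterations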
2.2 General Remarks on Newton’s Method In order to solve n dimensional systems F (x) = 0, with F : R n → Rn and x ∈ Rn , the Newton iteration (2.2) can be generalized into xk+1 = xk − F 0 (xk )−1 F (xk ). (2.3) 110 Newton’s Method 1 0.8 0.6 0.4 0.2 0 −0.2 −0.4 −0.6 −0.8 −1 −1.5 −1 −0.5 0 0.5 1 1.5 Figure 2.1: Illustration of the Newton method Importance of the above method rests on the fact that, under certain conditions on F , the error kxk+1 − xk (remark that x is a solution of F (x) = 0) can be estimated by the inequality kxk+1 − xk ≤ ckxk − xk2 . (2.4) Inequality (2.4) yields for a c ∈ R and a certain norm defined on R n . The above inequality states that the (k + 1) th error is proportional to the k th error squared. We will not give an exact proof of this estimate, but only a sketch. For the exact proof we refer to [23, Chapter 10]. The sketch of the proof of inequality (2.4) starts with a sketch of the derivation of the method of Newton. Assume F : Rn → Rm and that F is Fréchet differentiable 1 in xk . Then 0 = F (x) = F (xk ) + F 0 (x)(x − xk ) + R(x − xk ), where lim h→0 (2.6) R(h) = 0. khk For a proof we refer to [23]. If xk is in a neighborhood of x, then it is natural to neglect the term R(x − xk ). Furthermore, we assume that F 0 (xk ) is non-singular in a neighborhood of x, which implies that m = n. The difference x − xk can then be approximated by the solution of the system F 0 (xk )h = −F (xk ), with h = x − xk . (2.7) As a new approximation of x we take xk+1 = xk + h = xk − F 0 (xk )−1 F (xk ). (2.8) 1 The mapping F : D ⊂ Rn → Rm is Fréchet differentiable at x ∈ intD if there is an A ∈ L(Rn , Rm ) (space of linear mappings from Rn → Rm ) such that lim h→0 kF (x + h) − F (x) − Ahk = 0. khk (2.5) The linear operator A is denoted by F 0 (x) and is called the Fréchet derivative of F at x. 2.2 General Remarks on Newton’s Method 111 If the second Fréchet derivative is bounded in a neighborhood of x and F 0 (x) is non-singular, then it can be shown that there exists an α such that kR(x − xk )k ≤ αkx − xk k2 . (2.9) For a proof we refer to [23, Theorem 3.3.6]. Subtract (2.8) of (2.6) gives F 0 (xk )(x − xk ) = R(x − xk ). (2.10) From the above equality follows kF 0 (xk )kk(x − xk )k = kR(x − xk )k. (2.11) Using (2.9) we obtain inequality (2.4). Newton’s method is theoretically attractive, but in practice there can be some difficulties. In each step of the iteration a linear system needs to be solved. Especially in practical applications the dimensions of the systems to solve can be up to one million or even larger. Another difficulty is that in each step not only the n components of F (x k ) have to be evaluated, but also the n2 entries of its Jacobian. In the case that the partial derivatives of each component have a simple functional form the Jacobian can be computed exactly. In all other cases it is more desirable to avoid these computations. Almost all modifications of Newton iterations are constructed in such a way to avoid the explicit computation of (partial) derivatives. An example of such modification is to use instead of the exact Jacobian an approximate of the Jacobian. One can approximate the Jacobian for example by finite differences. The price on has to pay for this modification is that the convergence to a solution will become more slowly, i.e., more iterations are needed to solve the problem. However, the overall cost of solving F (x) = 0 is usually significant less, because the computation of the Newton step is less expensive. 
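The remarks above translate directly into code. The following sketch implements the n-dimensional iteration (2.3) with the exact Jacobian replaced by a forward-difference approximation, as one of the modifications just mentioned; it is an illustration only (numpy assumed, test problem and step size arbitrary), not an optimized solver.

import numpy as np

def fd_jacobian(F, x, eps=1e-7):
    """Forward-difference approximation of the Jacobian F'(x)."""
    n = x.size
    Fx = F(x)
    J = np.empty((n, n))
    for j in range(n):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (F(xp) - Fx) / eps
    return J

def newton_system(F, x0, tol=1e-10, maxit=50):
    """Newton iteration (2.3): each step solves F'(x_k) h = -F(x_k), x_{k+1} = x_k + h."""
    x = x0.copy()
    for k in range(maxit):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            break
        h = np.linalg.solve(fd_jacobian(F, x), -Fx)
        x = x + h
    return x

# small 2x2 example: x^2 + y^2 = 4, xy = 1
F = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[0] * v[1] - 1.0])
print(newton_system(F, np.array([2.0, 0.5])))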
Some last general remarks on Newton’s method and iteration methods in general. For each iterative method that solves F (x) = 0, it is required that the method is norm reducing in the sense that kF (xk+1 )k ≤ kF (xk )k k = 0, 1, 2, . . . , (2.12) holds for some norm defined on Rn . In general the ‘standard Newton iteration’ (2.3) does not necessarily satisfy this requirement. This problem is solved by the simple modification xk+1 = xk − ωk F 0 (xk )−1 F (xk ), (2.13) where the relaxation-parameter ωk is chosen such that (2.12) holds. More on this in Section 2.6. 112 Newton’s Method Another general modification of Newton’s method is to reevaluate F 0 (x) occasionally. The iteration scheme then becomes xk+1 = xk − F 0 (xp(k) )−1 F (xk ), k = 0, 1, 2, . . . (2.14) where p(k) is some integer less or equal to k. For p(k) ≡ k we have the ‘standard Newton iteration’ (2.3) and for p(k) ≡ 0 the simplified Newton method or chord method. A disadvantage of the chord method is that we have linear convergence, i.e., kxk+1 − xk ≤ ζkxk − xk, for a ζ ∈ (0, 1). 2.3 Convergence Properties In this section we give a measure of the rate of convergence of an iterative process. An example of an iterative process is for instance the method of Newton. n Definition 2.1. Let {xn }∞ n=0 be a sequence in R that converges to a fixed x ∈ R. Further, let k · k be a norm defined on R n . If there exist positive constants λ ∈ R and p ∈ [1, ∞) such that kxn+1 − xk = λ, n→∞ kxn − xkp lim (2.15) then we say that xn converges to x with order p and asymptotic constant λ. Whenever a process converges with order one, p = 1, then we call the convergence q-linear. If a process converges with p = 1 and λ = 0, then we call such a process q-super-linearly convergent. Processes that converge with p = 2 are called q-quadraticly convergent. We remark that q-quadratic convergence is a special case of q-super-linear convergence. 2 In general, every process that converges with order p greater than one is super-linearly convergent. As seen in the previous section Newton’s method converges quadratically. The result is summarized in the next theorem. Theorem 2.2. Assume that F (x) = 0 has a solution, which we denote by x∗ . Further, assume that the Jacobian F 0 (x∗ ) is nonsingular and Lipschitz continuous near x∗ . If x0 is sufficiently near x∗ , then the Newton sequence 2 Assume that our process converges quadratically. Recall that this also means that limn→∞ kxn − xk = 0. Then, the fact that the iterative process converges quadratically implies that for n sufficiently large there exist a K > 0 such that kxn+1 −xk ≤ Kkxn −xk2 . It follows that kxn+1 − xk Kkxn − xk2 lim ≤ lim = 0. n→∞ kxn − xk n→∞ kxn − xk 2.4 Criteria for Termination of the Iteration 113 exists, i.e. F 0 (xn ) is nonsingular for all n ≥ 0, and the sequence converges to x∗ . Furthermore, there exists a K > 0 such that kxn+1 − xk ≤ Kkxn − xk2 , (2.16) for n sufficiently large. Usually in practice it is more efficient to approximate the Newton step in some way. One (obvious) way to do this is to approximate the Jacobian F 0 (xn ) in such a way that computation of the derivative is avoided. The secant method for scalar equations approximates the derivative using finite differences. It uses the two most recent iterations to form the difference quotient, i.e., F (xn )(xn − xn−1 ) , (2.17) xn+1 = xn − F (xn ) − F (xn−1 ) where xn is the current iteration and xn−1 the previous iteration. 
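A compact sketch of the scalar secant iteration (2.17) follows; the test equation and the two starting points are illustrative only.

def secant(f, x0, x1, tol=1e-12, maxit=50):
    """Scalar secant iteration (2.17): the derivative in Newton's method is replaced
    by the difference quotient of the two most recent iterates."""
    f0, f1 = f(x0), f(x1)
    for k in range(maxit):
        if abs(f1) <= tol:
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, f(x1)
    return x1

print(secant(lambda x: x**3 - 2.0, 1.0, 1.5))   # cube root of 2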
As general known, if the secant method converges, then it converges with order ϕ, where ϕ is the golden ratio.3 We remark that the secant method must be initialized with two points. One way to do that is to let x −1 = 0.99x0 . We took this from [15]. Finally we remark that (2.17) cannot be extended to systems of nonlinear equations, because the denominator in the fraction would be a difference of vectors. There are many generalizations to n dimensions of the secant method. We will discuss one of them in Section 2.7. 2.4 Criteria for Termination of the Iteration In practice we do not like to iterate forever. Therefore, an iterative method should be stopped if the approximate solution is accurate enough. A termination criterion is an essential part of an iterative method. A weak criterion is useless, while a too severe criterion makes the iterative method expensive or, even worser, it might never stop. While one cannot know the error without knowing the solution, in most cases the norm of F (xk ) can be used as a reliable indicator of the rate of decay of norm of the error ken k = kx − xn k as the iteration progresses. Termination of the iteration takes place, based on this heuristic, when kF (xk )k ≤ τr kF (x0 )k + τa , (2.18) where τr is the relative error tolerance and τ a the absolute error tolerance. Both tolerances are important. For instance, it is a poor idea to put τ a = 0. In the case that the initial iterate is near the solution x, then (2.18) is impossible to satisfy. 3 The√golden ratio ϕ is the positive root of the equation ϕ2 − ϕ − 1 = 0, meaning that ϕ = 1+2 5 . The golden ratio also satisfies the recurrence relation ϕn = ϕn−1 + ϕn−2 . Taking n = 1 gives the special case ϕ = 1 + ϕ−1 . 114 2.5 Newton’s Method Inexact Newton Methods In the Newton iteration there are two difficulties. The first one is to compute the Jacobian at the current iterate x n . Secondly, computation of the solution of the equation F 0 (xn )s = −F (xn ), (2.19) can be very hard, or even impossible. Instead of solving the Newton step exactly one could also approximate it using some iterative method. More on iterative methods can be found in Part ??. Assume that the nth iterate is known and that the Jacobian F 0 (xn ) as well as F (xn ) are known. Solving (2.19) by using an iterative method gives with a initial start-vector s0 a sequence of vectors {sn }n≥0 that converges to the solution s of (2.19). After k iterations the residual r k is equal to rk = F 0 (xn )sk + F (xn ). (2.20) A good criterion to stop the iterative process that approximates the solution of (2.19) is kF 0 (xn )sk + F (xn )k ≤ η η ∈ R. (2.21) kF (xn )k For more on stopping criteria for iterative methods we refer to Section 2.4 and Part ??. Criterion (2.21) rewritten gives the so-called inexact Newton condition kF 0 (xn )sk + F (xn )k ≤ ηkF (xn )k, (2.22) where the parameter η is called the forcing term. Remark that by choosing a smaller value for η the inexact Newton method makes it more like Newton’s method. Secondly we remark that during the Newton iteration the forcing term η can be varied, thus for the nth iterate we have ηn . In the following theorem, taken without proof from [15], we give some convergence results. Theorem 2.3. Assume that F (x) = 0 has a solution, F 0 is Lipschitz continuous near this solution and that F 0 (x) is nonsingular. 
Then there are δ and η̄ such that, if x0 ∈ B(δ), {ηn } ⊂ [0, η̄], then the inexact Newton iteration xn+1 = xn + sn , where kF 0 (xn )sn + F (xn )k ≤ ηn kF (xn )k, (2.23) converges linearly to x. Moreover, if η n → 0, then the convergence is qsuper-linear. 2.6 Global Convergence 2.6 115 Global Convergence In this section we use an example taken from [15] to illustrate the theory. To illustrate that the Newton iteration only converges locally we apply it to the function f (x) = arctan(x), (2.24) with initial iterate x0 = 10. This initial value is too far from the root, so that the local convergence theory is not valid. The Newton step can be computed easily and is equal to s= f (x0 ) 1.5 ≈ ≈ −150. 0 0 f (x ) −0.01 (2.25) Hence, this computed Newton step is in the correct direction, but is far too large in magnitude. By continuing the iteration we obtain the following sequence of iterates x0 = 10 x1 = −138 x2 = 2.9 · 104 (2.26) x3 = −1.5 · 109 .. . Since the computed Newton step s points in the correct direction we can apply the following simple modification. We reduce the Newton step by half until kf (x)k has been reduced. We then get a convergence behavior as in Figure 2.2 5 10 0 Absolute Nonlinear Residual 10 −5 10 −10 10 −15 10 −20 10 −25 10 0 2 4 6 8 Nonlinear iterations 10 12 Figure 2.2: The absolute residual on log scale versus the number of iterations. The problem arctan(x) = 0 is solved by a Newton iteration with initial iterate x0 = 10. The Newton step is computed in such a way that (2.28) is satisfied. We now return to the general case of F (x) = 0. In order to describe this artifice in a clear way we define the Newton direction as d = −F 0 (xn )−1 F (xn ), (2.27) 116 Newton’s Method and the Newton step s as a positive scalar multiple of the Newton direction d. In the local convergence theory there is no difference between the Newton direction and the Newton step, i.e. s = d. The Newton step will be determined as follows. Find the smallest integer m such that kF (xn + 2−m d)k < (1 − α2−m )kF (xn )k, (2.28) and let the Newton step be s = 2−m d. Condition (2.28) is called the sufficient decrease of kF k. The parameter α in (2.28) is a small number, chosen in such a way to satisfy the sufficient decrease condition as easy as possible. In [15] α = 10−4 . Methods like the one described above are called line search methods. This kind of methods search for a decrease in kF k along the segment [x n , xn + d]. Some problems can be more efficient solved if the step-length reduction 1 instead of 21 . To make is more aggressive, for instance by factors of 10 this possible, we can do the following after two reductions by half. Build a quadratic polynomial model of ψ(λ) = kF (xn + λd)k2 , (2.29) which is based on interpolation of ψ at the three most recent values of λ. The minimizer of the quadratic model then becomes the next value for λ. As safeguard we claim that the reduction in λ is at least a factor of 2 and at most a factor of 10. This proposed algorithm, taken from [15], generates a sequence of λm ’s with λ0 = 1 1 λ1 = 2 (2.30) (2.31) and λm such that 1 λm 1 ≤ ≤ 10 λm−1 2 for all m ∈ {m ∈ N|m ≥ 2}. (2.32) The line search will be terminated for the smallest m ≤ 0 such that kF (xn + λm d)k < (1 − αλm )kF (xn )k, (2.33) holds in a certain norm. For more details or other ways to implement a line search we refer to [15]. For completeness the above is sketched in Algorithm 2.1. 
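A stripped-down version of this globalized Newton iteration in code is given below: it uses plain step halving for the sufficient decrease condition (2.28) and omits the quadratic-model refinement. The Jacobian is assumed to be available as a function, and all names and tolerances are illustrative; the arctan example above is used as the test case.

import numpy as np

def newton_linesearch(F, J, x0, alpha=1e-4, tol=1e-10, maxit=50):
    """Damped Newton iteration: the Newton direction d = -J(x)^{-1} F(x) is shortened
    by halving until the sufficient decrease condition (2.28) holds."""
    x = x0.copy()
    normF = np.linalg.norm(F(x))
    for k in range(maxit):
        if normF <= tol:
            break
        d = np.linalg.solve(J(x), -F(x))
        lam = 1.0
        while np.linalg.norm(F(x + lam * d)) >= (1.0 - alpha * lam) * normF:
            lam *= 0.5
            if lam < 1e-12:
                raise RuntimeError("line search failed")
        x = x + lam * d
        normF = np.linalg.norm(F(x))
    return x

# the arctan example: plain Newton diverges from x0 = 10, the damped iteration converges
print(newton_linesearch(lambda x: np.arctan(x),
                        lambda x: np.atleast_2d(1.0 / (1.0 + x**2)),
                        np.array([10.0])))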
Finally we remark that for smooth (vector)functions F , there are only three possible scenarios for the iteration process of Algorithm 2.1 to occur : 1. the sequence {xn }n≥0 converges to a solution of F (x) = 0, 2. the sequence {xn }n≥0 will become unbounded, 2.7 Extension of Secant Method to n Dimensions 117 Algorithm 2.1 Line Search Evaluate F (x); τ ← τr kF (x)k + τa . while kF (x)k > τ do Find d = −F 0 (x)−1 F (x). If no such d can be found, then terminate with failure. λ = 1. while kF (x + λd)k > (1 − αλ)kF (x)k do 1 1 , 2 ] is computed by minimizing the polynomial λ ← σλ, where σ ∈ [ 10 2 model of kF (x + λd)k . end while x ← x + λd. end while 3. the Jacobian F 0 (xn ) will become singular for a certain x n . The line search algorithm as presented here is the most simplest way to find a root of F (x), when the initial iterate is far from a root. There are many alternatives to this line search algorithm that can sometimes overcome stagnation or, in the case of many solutions, find the solution that is appropriate to the physical problem. Three of these other methods are trust region globalization, pseudo-transient continuation and homotopy methods. These alternatives will not be discussed in this report. For more information we refer to [15]. 2.7 Extension of Secant Method to n Dimensions For solving the scalar equation f (x) = 0 the secant method is given as xk+1 = xk − f (xk )(xk − xk−1 ) , f (xk ) − f (xk−1 ) (2.34) where xk is the current iteration and xk−1 the iteration before. As seen before, the secant method converges with order equal to the golden ratio. One of the most famous extensions of the secant method to n dimensions is Broyden’s method. We will shortly show how this method can be derived from the (k + 1)-point sequential secant method, which is derived in [23, Chapter 7]. We remark that this method is not used in practice. Methods that are used in practice can be derived from this method. The (n+1)-point sequential secant method is the n dimensional extension of the scalar secant method and given as k xk+1 = xk − A−1 k F (x ), k = 0, 1, . . . , (2.35) Γk = [q k−1 , . . . , q k−n ], (2.36) where Ak = Γk Hk−1 , Hk = [pk−1 , . . . , pk−n ], 118 Newton’s Method with pi = xi+1 − xi and q i = F (xi+1 ) − F (xi ). (2.37) x 0 , . . . , x−n . Remark that this method needs n initial values, e.g., The matrix Ak is an approximation of the Jacobian in the k th iteration. For the next iteration the approximation of the Jacobian will be given by Ak+1 = Ak − F (xk+1 )(v k )T , (2.38) −1 with v k the first row of Hk+1 . For a derivation of this result, see [23], Theorem 7.3.1. This result suggests that A k+1 is obtained from Ak by addition of a matrix of rank one. Therefore, the more general form of (2.38) is Ak+1 = Ak + uk (v k )T , for some uk , v k ∈ Rn . (2.39) Using the Sherman - Morrison formula 4 gives −1 A−1 k+1 = Ak − k k T −1 A−1 k u (v ) Ak , k 1 + (v k )T A−1 k u (2.41) where we assumed the denominator to be unequal to zero and A 0 nonsingular. This is the first constraint on v k and uk , e.g., k 1 + (v k )T A−1 k u 6= 0. (2.42) From now on it is possible to construct a wide range of methods. Next, we assume that (v k )T pk 6= 0 and uk = F (xk+1 ) (v K )T pk k ∈ N. (2.43) k+1 ) = A−1 q k , The first constraint becomes then, using p k + A−1 k F (x k (v k )T A−1 k q 6= 0. (2.44) Then, (2.41) becomes −1 A−1 k+1 = Ak − k+1 )(v k )T A−1 A−1 k F (x k . 
k (v k )T A−1 q k (2.45) Broyden’s method [3] is a special case of the method defined by (2.43), in which v k is taken equal to pk . We remark that (2.44) may not be equal to zero. The corresponding algorithm, see [3], is given as Algorithm 2.2. In Algorithm 2.2 the matrix A−1 k is represented by the matrix B k . Furthermore, we remark that the update Bk+1 in State 8 of the Broyden algorithm is in practice not computed in that way. For more information on the Broyden algorithm we refer to [3, 15]. 4 Sherman - Morrison formula : Let the n × n matrix A be invertible and let u, v ∈ Rn . Then A + uv T is invertible if and only if 1 + v T A−1 u 6= 0, and then (A + uv T )−1 = A−1 − A−1 uv T A−1 . 1 + v T A−1 u (2.40) 2.8 Failures 119 Algorithm 2.2 Broyden 1: Obtain an initial estimate x0 of the solution. 2: Obtain an initial value of the iteration matrix B 0 . For instance, this can be done by approximating the Jacobian at x 0 and inverting the result. 3: Compute Fi = F (xi ) i = 0, 1, 2, . . . 4: Compute pi = −Bi Fi . 5: Choose λi such that xi+1 = xi + λi pi implies kFi+1 k < kFi k. 6: Test kFi+1 k for convergence. 7: Compute yi = Fi+1 − Fi . B F (xk+1 )(p )T Bi 8: Compute Bi+1 = Bi − i (p )T B yi . i i i 9: Repeat from Step 4. 2.8 Failures In this section we discuss possible failures of the nonlinear solvers presented in this chapter. Among others we treat possible causes for failure to converge. In the case that the problem has no solution, then any solver will have trouble finding a solution. For example, if f (x) = e −x , then the Newton iteration will converge to −∞ for any starting point. 2.8.1 Non-smooth Functions All variations of the Newton iteration which are discussed in this chapter are intended to find roots of F (x) = 0 for which F 0 (x) is Lipschitz continuous. Solving problems with a discontinuous Jacobian causes unpredictable behavior of the codes. An example of a function with a non-smooth Jacobian is the vector norm. 2.8.2 Slow Convergence In the case one has a nonlinear solver that converges, but not as fast as it should, then probably one of the following three operations is inaccurate : • Computation of the Jacobian, • Computation of the Jacobian-vector product, • Linear solver. Super-linear convergence which follows from the local convergence theory only holds if the correct linear system is solved accurately. In order to get the expected super-linear convergence, one could check the following things. 1. Check the computation of the Jacobian, 120 Newton’s Method 2. If you use a iterative linear solver, then make sure that the termination criteria for the linear solver are set tight enough. 2.8.3 No Convergence If the Newton iteration does not converge, then either the iteration will become unbounded or the Jacobian will become singular. We give possible causes of failure of convergence. The first one is to check whether the problem has a solution or not. In the case your problem has a solution one of the following remarks might cause the failure. • Inaccurate function evaluation. If the error in the function evaluation is higher than the machine roundoff, then the computed Newton direction, which is computed by a difference Jacobian, can be poor enough for the iteration to fail. • Singular Jacobian. If the Jacobian becomes (almost) singular, then the step lengths will become zero. In the case that one terminates the iteration when the step length approaches zero and does not check whether F approaches zero, then one concludes incorrectly that the problem has been solved. 
• If the line search algorithm as presented in Section 2.6 fails for the problem, one should use one of the alternatives presented in the same section. 2.8.4 Failure of the Line Search In the case that there is no singular Jacobian and the line search algorithm reduces the step size to an unacceptable small value, then this means that the computed Newton direction is poor. In order to let the line search algorithm perform better, the computation of the Newton direction must become more accurate. Furthermore, we repeat that convergence of the line search algorithm is based on an exact Jacobian. A difference approximation to a Jacobian or Jacobian-vector product is usually, but not always, sufficient. For more background on failures of Newton’s method and the line search algorithm we refer to [15]. Chapter 3 Picard Iteration Picard iteration is also known as fixed point iteration. We first give the definition of a fixed point. Definition 3.1. A fixed point is a point that does not change upon application of a map or a system of differential equations. For example, a point p such that f (p) = p holds for a given map f : R → R is called a fixed point. Points of an autonomous system of ordinary differential equations at which dx1 = f1 (x1 , . . . , xn ) = 0 dt dx2 = f2 (x1 , . . . , xn ) = 0 dt (3.1) .. . dxn = fn (x1 , . . . , xn ) = 0 dt are also known as fixed points. In this chapter we restrict ourselves to finding a fixed point of a given map F : Rn → Rn . First, we consider the scalar case, e.g., n = 1. 3.1 One Dimension To find a solution of the scalar equation f (x) = x, or equivently, to find a fixed point of the given map f : R → R the following approximation can be used. Choose a start-value x0 and determine xn by xn = f (xn−1 ) for n ≥ 1. If the sequence converges to x and f is continuous, then x = lim xn = lim f (xn−1 ) = f lim xn−1 = f (x). (3.2) n→∞ n→∞ n→∞ From Equation (3.2) follows that x is a fixed point of f (x). This approximation of the fixed point is called the fixed point or Picard iteration. The following theorems give sufficient conditions for existence and uniqueness of a fixed point, and convergence of the sequence {x n }n≥0 to a fixed point. Both theorems are taken from [38]. 122 Picard Iteration 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Figure 3.1: Graphical illustration of the fixed point method Theorem 3.2. If f ∈ C[a, b] and f (x) ∈ [a, b] for all x ∈ [a, b], then f has a fixed point in [a, b]. Furthermore, if f 0 (x) exists for x ∈ [a, b] and there exists a positive constant k < 1 such that |f 0 (x)| ≤ k for all x ∈ [a, b], (3.3) then the fixed point in [a, b] is unique. Proof. See [38], page 86. Theorem 3.3. Assume f ∈ C[a, b], f (x) ∈ [a, b], x ∈ [a, b] and |f 0 (x)| ≤ k < 1 for x ∈ [a, b]. The fixed point iteration converges to x for every start-value x0 ∈ [a, b]. Proof. The assumptions imply that f has a unique fixed point in [a, b]. Using the mean-value theorem we obtain |xn − x| = |f (xn−1 ) − f (x)| = |f 0 (ξ)||xn−1 − x| ≤ k|xn−1 − x|. (3.4) By induction follows lim |xn − x| ≤ lim k n |x0 − x| = 0. n→∞ n→∞ (3.5) We conclude that xn converges to x. The fixed point iteration is illustrated in Figure 3.1 for the map f (x) = with start-value x0 = 0. x2 4 , +3 (3.6) 3.2 Higher Dimensions 3.2 123 Higher Dimensions We start with the definition of a contraction, or contraction mapping. Definition 3.4. 
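The scalar Picard iteration transcribes directly into code. In the sketch below the map f(x) = cos x on [0, 1] is chosen only as an illustration (it is not the example of (3.6)); it satisfies the hypotheses of Theorems 3.2 and 3.3, so the iteration converges for any start-value in [0, 1].

import math

def picard(f, x0, tol=1e-12, maxit=200):
    """Scalar fixed point (Picard) iteration x_n = f(x_{n-1})."""
    x = x0
    for n in range(maxit):
        x_new = f(x)
        if abs(x_new - x) <= tol:
            return x_new, n + 1
        x = x_new
    return x, maxit

x, n = picard(math.cos, 0.5)
print(x, n)    # fixed point of cos(x), approximately 0.739085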
A mapping F : D ⊂ Rn → Rn is contractive on a set D0 ⊂ D if there is an α < 1 such that kF (x) − F (y)k ≤ αkx − yk for all x, y ∈ D0 . We call such a mapping F a contraction mapping, or shorter a contraction on D0 . Note that the contractive property as given in the above definition is norm dependent. Clearly, a contraction is Lipschitz-continuous. The following theorem gives a basic result on the existence and uniqueness of a fixed point of a contraction. This result is known as the Banach Fixed Point Theorem. Theorem 3.5. Let F be a contraction mapping from a closed subset D of a Banach space S onto D. Then there exists a unique z ∈ D such that F (z) = z. Then, the following theorem presents a criterion for convergence of the Picard iteration in more dimensions. Theorem 3.6. Assume that F is a contraction on a closed subset D of R n . The fixed point iteration xk+1 = F (xk ), (3.7) converges to a fixed point x for every start-value x 0 , if the spectral radius of the Jacobian of F (x) is less than one. Proof. We obtain from the mean-value theorem that kxk+1 − xk = kF (xk ) − F (x)k = kJF (ξk+1 )(xk − x)k, (3.8) where ξk+1 ∈ (xk , x). For any multiplicative norm k · k, the norm of JF (ξk+1 )(xk − x) is less or equal to kJF (ξk+1 )kk(xk − x)k. (3.9) By induction we obtain kxk+1 − xk ≤ kJF (ξk+1 )kkJF (ξk )k · · · kJF (ξ1 )kk(x0 − x)k. (3.10) In order to have convergence of the fixed point iteration, kJ F (ξi )k should be less than one for all i ∈ [1, k + 1]. Since for every multiplicative norm holds ρ(A) ≤ kAk, (3.11) a neccesary condition for convergence is ρ(J F (x)) < 1 in a sphere with radius kx0 − xk around x. 124 3.3 Picard Iteration Some Last Remarks In this section we give some last remarks considering the fixed point iteration. To terminate the iteration it is sufficient to fulfill the condition kxk+1 − xk k ≤ tol. (3.12) In the case that the Picard iteration does not converge, i.e., the spectral radius of the Jacobian becomes larger than one, a relaxation parameter ω can be used. The so-called Picard iteration scheme with relaxation parameter is xk+1 = xk + ω(F (xk ) − xk ), with ω > 0. (3.13) Part IV Linear Solvers Chapter 1 Introduction In Part III we saw that solving nonlinear equations iteratively needs solutions of linear systems. For example, the standard Newton iteration xk+ = xk − F 0 (xk )−1 F (xk ), (1.1) needs in each iteration the solution of the system F 0 (xk )sk = F (xk ). (1.2) In this part we discuss methods which solve linear algebraic systems. Generally, we distinguish direct and iterative solution methods. In this chapter we first give some examples of direct solution methods. Then it also will become clear that we focus only on iterative methods. The further contents of this chapter consists of the presentation of the basic ideas behind iterative methods. Examples of direct solution methods are the LU factorization and the Choleski factorization. The LU factorization is based on the following observation. Suppose we have the system Ax = b, (1.3) with A square (n × n) and nonsingular. It can be shown that there exist Gauß transformations M1 , . . . , Mn−1 such that Mn−1 · · · M1 A = U, (1.4) with U an upper triangular matrix. Then A = (Mn−1 · · · M1 )−1 U = LU, (1.5) with L = (Mn−1 · · · M1 )−1 . It can be shown that L is a lower triangular matrix. 128 Introduction If the LU factorization of A is obtained, then the solution of Ax = b is easy to compute. 
The solution of LU x = b can be split into two parts, i.e., first the solution of the lower triangular system Ly = b and then the solution of the upper triangular system U x = y. To compute an LU factorization of A takes 2n3 /3 flops. Solving Ly = b and U x = y requires 2n 2 flops. In the case that A is symmetric positive definite (SPD), then the factorization A = GGT , (1.6) with G a lower triangular matrix, exists. This factorization is known as the Choleski factorization and costs n 3 /3 flops. Problems arising from discretized PDEs lead in general to large sparse systems of equations. Direct solutions methods can be impractical if A is large and sparse, because the L and U can be dense. This is especially the case for 3D problems. We then need a considerable amount of memory and the solving the triangular system can many floating point operations. Therefore, we restrict ourselves to iterative methods to solve linear systems. The amount of work to solve Ax = b iteratively depends on the number of iterations and the amount of work per iteration. The basic idea behind solving Ax = b, with A invertible, iteratively is the following. Construct a splitting A = B + (A − B), so that B is “easy invertible”. The original linear system can then be rewritten as Bx + (A − B)x = b. (1.7) By putting Bxi+1 + (A − B)xi = b, we introduce a sequence of approximate solutions x can be rewritten as (1.8) i ∞ i=0 . Equation (1.8) xi+1 = (I − B −1 A)xi + B −1 b, (1.9) xi+1 = Qxi + s, (1.10) or equivalently as with Q = I − B −1 A and s = B −1 b. We call an iterative process consistent if the matrix I − Q is non-singular and if (I − Q) −1 s = A−1 b. If the iteration process is consistent, then the equation (I − Q)x = s has the unique solution xi→∞ ≡ x. We call an iterative process convergent if the sequence x0 , x1 , x2 , . . . converges to a limit independent of the start-value x 0 . For example, the Jacobi method has the splitting A = D − C, with D a diagonal matrix and C a matrix that contains zeros on the main diagonal. By taking B equal to D the Jacobi method reads Dxi+1 = Cxi + b. (1.11) 129 In the Gauß-Seidel method we also split up C = C 1 + C2 , with a lower triangular matrix C1 and an upper triangular matrix C2 . Then, by taking B = D − C1 the Gauß-Seidel method reads (D − C1 )xi+1 = C2 xi + b. (1.12) In our application of CVD we get after discretization and applying a time integration method a huge nonlinear system to solve. The size of this nonlinear system depends on 1. the spatial dimension d (d = 1, 2, 3), 2. the number of chemical species, 3. solving the nonlinear equations coupled or decoupled. As mentioned above, to solve this nonlinear system also linear systems have to be solved. These are huge and sparse linear systems. Therefore, iterative solution methods are more efficient than direct methods, especially in the three-dimensional case. Using these basic iterative methods as described above a class of efficient solution methods can be derived. This class, the class of the so-called Krylov subspace methods will be discussed in the next chapter. We conclude with presenting preconditioner techniques in Chapter 3, which are used to accelerate Krylov subspace methods. 130 Introduction Chapter 2 Krylov Subspace Methods In Chapter 1 the basic iterative method is defined to approximate the solution of a linear system Ax = b iteratively by xi+1 = (I − B −1 A)xi + B −1 b, (2.1) where B is an “easy invertible” matrix coming from the splitting A = B + (A − B). 
We also remark that the iteration process (2.1) needs a start approximation x0 . Rewriting (2.1) gives xi+1 = xi + B −1 (b − Axi ) = xi + B −1 r i , (2.2) where r i = b − Axi is called the ith residual. Using this notation, the first steps of the iteration process are x0 = x 0 , x1 = x0 + B −1 r 0 , x2 = x1 + B −1 r 1 = = x0 + B −1 r 0 + B −1 (b − Ax1 ) = = x0 + B −1 r 0 + B −1 (b − A(x0 + B −1 r 0 )) = = x0 + (2B −1 − B −1 AB −1 )r 0 , .. . which implies that xi ∈ x0 + span B −1 r 0 , B −1 A(B −1 )r 0 , . . . , (B −1 A)i−1 (B −1 )r 0 . (2.3) The subspace K i (A, r 0 ) := span{r 0 , Ar 0 , . . . , Ai−1 r 0 } is called the Krylovspace of dimension i corresponding to the matrix A and initial residual r 0 . Each approximation xi , i = 1, 2, . . ., that is computed by a basic iterative method is an element of x0 + K i (M −1 A, M −1 r 0 ). 132 Krylov Subspace Methods This chapter is divided into two parts. In the first part the Conjugate Gradients method will be treated, which is a Krylov Subspace method for symmetric matrices. In the other part of this chapter we discuss Krylov subspace methods for general matrices. 2.1 Krylov Subspace Methods for Symmetric Matrices In this section we present Krylov subspace methods for solving Ax = b, (2.4) where A is a symmetric matrix. The CG method is the most prominent iterative method for solving linear systems (2.4) in the case that A is SPD. CG will be derived from the Lanczos Algorithm, which will be presented in the next section. 2.1.1 Arnoldi’s Method and the Lanczos Algorithm The Arnoldi Algorithm is a procedure for building an orthonormal basis of the Krylov subspace K i (A, r 0 ). It is given in Algorithm 2.1, which is taken from [28]. Algorithm 2.1 Arnoldi Choose a vector v1 of norm 1, for j = 1, 2, . . . , m do Compute hij = (Avj , vi ) for i = 1, 2, . . . , j, P Compute wj = Avj − ji=1 hij vi , hj+1,j = kwj k2 , if hj+1,j = 0 then Stop, end if vj+1 = wj /hj+1,j end for A few simple properties of the algorithm are given in Proposition 2.1 and 2.2. Proposition 2.1. Assume that Algorithm 2.1 does not stop before the m th step. Then the vectors v1 , v2 , . . . , vm from an orthonormal basis of the Krylov subspace K m (A, r 0 ) = span{v1 , Av1 , . . . , Am−1 v1 }. (2.5) Proposition 2.2. Denote by Vm the n × n matrix with column vectors v1 , . . . , vm , by H̄m the (m + 1) × n Hessenberg matrix whose nonzero entries 2.1 Krylov Subspace Methods for Symmetric Matrices 133 hij are defined by Algorithm 2.1, and by H m the matrix obtained from H̄m by deleting its last row. Then the following relations hold : AVm = Vm Hm + wm eTm (2.6) = Vm+1 H̄m , (2.7) = Hm . (2.8) VmT AVm For the proofs of Proposition 2.1 and 2.2 we refer to [28]. In the case that the matrix A is symmetric the Hessenberg H m will become symmetric tridiagonal, see [28, Theorem 6.2]. The standard notation used to describe the Lanczos Algorithm is obtained by setting αj ≡ hjj , βj ≡ hj−1,j , (2.9) and if Tm denotes the resulting Hm matrix, it is of the form α1 β2 β2 α1 β2 . .. βm−1 αm−1 βm βm αm . (2.10) This leads to the following variant of Arnoldi’s Algorithm (see Algorithm 2.1), which is called the Lanczos Algorithm. It is given as Algorithm 2.2. Algorithm 2.2 Lanczos Algorithm Choose an initial vector v1 of norm 1. Set β1 ≡ 0 and v0 ≡ 0, for j = 1, 2, . . . , m do wj = AVj − βj vj−1 , αj = (wj , vj ), wj = w j − α j vj , βj+1 = kwj k2 , if βj+1 = 0 then Stop end if vj+1 = wj /βj+1 end for Algorithm 2.2 garantuees that in exact algorithmic the vectors v i , i = 1, 2, . . . 
, m, are orthogonal. In practice, when we do not compute in exact algorithmic, exact orthogonality of these vectors is only observed at the beginning of the process. At some point the v i ’s start to loose their global orthogonality rapidly. For more information on this subject we refer to [28]. 134 2.1.2 Krylov Subspace Methods The Conjugate Gradient Algorithm As mentioned before, the Conjugate Gradient Method, shortly CG, is the best known iterative techniques for solving sparse symmetric positive definite linear systems. Shortly described, CG is a realization of an orthogonal projection technique onto the Krylov subspace K m (A, r 0 ), where r 0 is the initial residual. We will derive the CG method form the Lanczos Method for Linear Systems. This Lanczos Method for Linear Systems approximates the solution of Ax = b, with AT = A, by an orthogonal projection onto K m (A < r 0 ). Given an initial guess x0 to the linear system Ax = b and the Lanczos vectors v1 , . . . , vm together with the tridiagonal matrix T m , the orthogonal projection is given as xm = x 0 + V m ym , −1 ym = Tm (βe1 ). (2.11) The actual algorithm is given as Algorithm 2.3. Algorithm 2.3 Lanczos Method for Linear Systems 1: Compute r 0 = b − Ax0 , β = kr 0 k and v1 = r 0 /β, 2: for j = 1, 2, . . . , m do 3: wj = AVj − βj vj−1 , (If j = 1 set β1 v0 ≡ 0) 4: αj = (wj , vj ), 5: w j = w j − α j vj , 6: βj+1 = kwj k2 , 7: if βj+1 = 0 then 8: Set m = j and go to 10 9: end if 10: Set Tm = tridiag(βi , αi , βi+1 ) and Vm = [v1 , . . . , vm ], −1 (βe ) and xm = x0 + V y . 11: Compute ym = Tm 1 m m 12: end for Remark from Algorithm 2.3 that the approximate solution is given as −1 −1 xm = x 0 + V m ym = x m = x 0 + V m Um Lm (βe1 ), (2.12) where we used the LU factorization of T m = Lm Um . Letting −1 Pm = V m Um and zm = L−1 m (βe1 ), (2.13) gives xm = x0 +Pm z m . As remarked in [28], the last column of P m , denoted by pm , can be updated from the previous pi ’s and vm by pm = with λm = βm ηm−1 1 (vm − βm pm−1 ), ηm and ηm = αm − λm βm . (2.14) (2.15) 2.1 Krylov Subspace Methods for Symmetric Matrices In addition, as shown in [28], we have zm−1 with ζm = −λm ζm−1 zm = ζm and ζ1 = β. 135 (2.16) As a result, xm can be updated at each step as xm = xm−1 + ζm pm . In the next proposition we give some properties of the Lanczos algorithm. Proposition 2.3. Let r m = b − Axm , m = 0, 1, . . ., be the residual vectors produced by Algorithm 2.3 and pm be the auxiliary vectors as given above. Then, 1. Each residual vector r m is such that r m = σm vm+1 , where σm is a certain scalar. As a result, the residual vectors are orthogonal to each other. 2. The auxiliary vectors pi form an A-conjugate set, i.e., (Api , pj ) = 0, for i 6= j. 1 Proof. For a proof we refer to [28]. From the above proposition we can derive a modified version of Algorithm 2.3 by imposing orthogonality and conjugate conditions. This modified version of Algorithm 2.3 is the CG method. First we remark that xj+1 = xj + αj xj . (2.17) Therefore, the residual vectors must satisfy r j+1 = r j − αj Apj . (2.18) To have the residual vectors orthogonal, it is necessary to impose (r j − αj Apj , pj ) = 0. It follows that αj must be equal to αj = (r j , r j ) . Apj , r j ) (2.19) Using Proposition 2.3 and Equation (2.14) we observe that the next search direction pj+1 is a linear combination of r j+1 and pj . After rescaling the p vectors appropriately, it follows that pj+1 = r j+1 + βj pj . 
(2.20) A first consequence of the above relation is (Apj , r j ) = (Apj , pj − βj−1 pj−1 ) = (Apj , pj ), 1 To ensure that (Ax, x) is an inner product, A must be SPD (2.21) 136 Krylov Subspace Methods because Apj is orthogonal to pj−1 . Remark that (2.19) becomes αj = (r j , r j )/(Apj , pj ). A second consequence of (2.20) is the following. Since pj+1 is orthogonal to Apj , we can derive that βj in (2.20) must be equal to βj = − 1 (r j+1 , (r j+1 − r j )) (r j+1 , r j+1 ) (r j+1 , Apj ) = = . (pj , Apj ) αj (Apj , pj ) (r j , r j ) (2.22) Putting these relations together gives Algorithm 2.4, i.e., the CG algorithm. Algorithm 2.4 Conjugate Gradient 1: Compute r 0 = b − Ax0 and p0 = r 0 , 2: for j = 1, 2, . . . , until convergence do 3: αj = (r j , r j )/(Apj , pj ), 4: xj+1 = xj + αj pj , 5: r j+1 = r j − αj Apj , 6: βj = (r j+1 , r j+1 )/(r j , r j ), 7: pj+1 = r j+1 + βj pj . 8: end for 2.2 Krylov Subspace Methods for General Matrices In the previous section we have discussed the CG method. This Krylov subspace method can only be used to solve (sparse) symmetric positive definite linear systems. In this section we discuss the best known Krylov subspace methods for general (real) sparse linear systems, i.e., we consider Krylov subspace methods to solve Ax = b, with A an n × n nonsingular real matrix. Considering the case of symmetric positive definite linear systems, we have seen that CG has the properties : • Krylov subspace based on A, • optimality : the residual is minimized in a certain norm, • short recurrences; only the results of one foregoing step are necessary, work and memory do not increase for an increasing number of iterations. In [8] it has been shown that it is impossible to obtain a Krylov subspace method for general matrices, which satisfies the above properties. From this we conclude that a Krylov method for general matrices has either an optimality property but long recurrences, or no optimality and short recurrences. A method that has an optimality property and short recurrences is not a Krylov subspace method. 2.2 Krylov Subspace Methods for General Matrices 137 It appears that there are essentially three different ways to solve nonsymmetric real linear systems : 1. Solve the normal equations AT Ax = AT b with CG, 2. Construct a basis for the Krylov subspace by a 3-term bi-orthogonality relation, 3. Make all the residuals explicitly orthogonal in order to have an orthogonal basis for the Krylov subspace. In this report only the last two classes will be treated. For more information on the first class of methods we refer to [22, 28]. 2.2.1 BiCG Type Methods In the previous chapter we saw that the CG method is derived from the Lanczos Algorithm, i.e. Algorithm 2.2. It appears that there are various generalizations of the Lanczos Algorithm for general matrices. One of these generalizations, the Arnoldi procedure (Algorithm 2.1), has already been treated. Another extension of Lanczos to general matrices is the Lanczos Bi-orthogonalization Algorithm. The concept of the Lanczos Biorthogonalization Algorithm is different in concept from Arnoldi in the sense that it relies on bi-orthogonal sequences instead of orthogonal sequences. In this section we shortly present the Lanczos Bi-orthogonalization Algorithm and mention how Bi-Conjugate Gradient (BiCG) can be derived from it. Furthermore, we present the basic ideas behind Conjugate Gradient Squared (CGS) and Bi-Conjugate Gradient STABilized (Bi-CGSTAB) The latter algorithm, i.e. 
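Before turning to the nonsymmetric case, a direct transcription of Algorithm 2.4 (plain CG, without preconditioning) is given below as a reference point; it is a sketch only, assuming numpy, and the small SPD test matrix is illustrative.

import numpy as np

def cg(A, b, x0=None, tol=1e-10, maxit=1000):
    """Conjugate Gradient iteration, cf. Algorithm 2.4 (A assumed SPD)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for j in range(maxit):
        if np.sqrt(rs) <= tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# small SPD example: 1D Poisson matrix
n = 10
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
print(np.linalg.norm(A @ cg(A, b) - b))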
Lanczos Bi-Orthogonalization

The algorithm proposed by Lanczos for non-symmetric matrices builds a pair of bi-orthogonal bases for the two subspaces

Km(A, v1) = span{v1, Av1, . . . , A^{m−1} v1},   (2.23)

and

Km(A^T, w1) = span{w1, A^T w1, . . . , (A^T)^{m−1} w1}.   (2.24)

The algorithm that achieves this is Algorithm 2.5. It can be proved that if Algorithm 2.5 does not break down before step m, then the vectors vi and wj, i, j = 1, 2, . . . , m, form a bi-orthogonal system.

Algorithm 2.5 Lanczos Bi-Orthogonalization Procedure
1: Choose two vectors v1, w1 such that (v1, w1) ≠ 0,
2: Set β1 = δ1 = 0 and w0 = v0 = 0,
3: for j = 1, 2, . . . , m do
4: αj = (Avj, wj),
5: v̂j+1 = Avj − αj vj − βj vj−1,
6: ŵj+1 = A^T wj − αj wj − δj wj−1,
7: δj+1 = |(v̂j+1, ŵj+1)|^{1/2},
8: if δj+1 = 0 then
9: Stop.
10: end if
11: βj+1 = (v̂j+1, ŵj+1)/δj+1,
12: wj+1 = ŵj+1/βj+1,
13: vj+1 = v̂j+1/δj+1,
14: end for

Comparing this algorithm with Arnoldi's method we remark the following. From a practical point of view, the Lanczos Algorithm has an advantage over Arnoldi's method, because it requires only a few vectors of storage if no re-orthogonalization is performed. In particular, six vectors of length n and a tridiagonal matrix are needed. This amount of storage is independent of m.

Bi-Conjugate Gradient

The Bi-Conjugate Gradient (BiCG) procedure can be derived from Algorithm 2.5 in exactly the same way as the CG Algorithm was derived from Algorithm 2.2. Implicitly, the BiCG algorithm solves not only the original system Ax = b, but also a dual linear system A^T x* = b*. The dual system is often ignored in the algorithm. The BiCG procedure is a projection process onto

Km(A, v1) = span{v1, Av1, . . . , A^{m−1} v1},   (2.25)

orthogonally to

Km(A^T, w1) = span{w1, A^T w1, . . . , (A^T)^{m−1} w1}.   (2.26)

The vector v1 is taken as usual, v1 = r0/‖r0‖2. The vector w1 is arbitrary, provided that (v1, w1) ≠ 0. The algorithm is given as Algorithm 2.6.

Algorithm 2.6 Bi-Conjugate Gradient
1: Compute r0 = b − Ax0. Choose r0* such that (r0, r0*) ≠ 0,
2: Set p0 = r0 and p0* = r0*,
3: for j = 0, 1, . . . , until convergence do
4: αj = (rj, rj*)/(Apj, pj*),
5: xj+1 = xj + αj pj,
6: rj+1 = rj − αj Apj,
7: rj+1* = rj* − αj A^T pj*,
8: βj = (rj+1, rj+1*)/(rj, rj*),
9: pj+1 = rj+1 + βj pj,
10: pj+1* = rj+1* + βj pj*,
11: end for

Conjugate Gradient Squared

Each step of BiCG requires a matrix-vector product with both A and A^T. Secondly, observe that the vectors pj* and rj*, which are generated with A^T, do not contribute directly to the solution. Due to these observations, one might ask: is it possible to bypass the use of A^T and still generate iterates related to BiCG? The answer is given by the Conjugate Gradient Squared (CGS) method.

The CGS method was developed in 1984 by Sonneveld [31]. In the BiCG method, the residual rj can be regarded as the product of r0 and a j-th degree polynomial Rj in A, i.e.,

rj = Rj(A) r0,   (2.27)

satisfying the constraint Rj(0) = 1. This same polynomial also satisfies

rj* = Rj(A^T) r0*,   (2.28)

by construction of the BiCG method. Similarly, the conjugate-direction polynomial Pj(t) satisfies the expressions

pj = Pj(A) r0   and   pj* = Pj(A^T) r0*.   (2.29)
Also, note that the scalar αj in BiCG is given as

αj = (rj, rj*)/(Apj, pj*) = (Rj(A)r0, Rj(A^T)r0*)/(A Pj(A)r0, Pj(A^T)r0*) = (Rj²(A)r0, r0*)/(A Pj²(A)r0, r0*),   (2.30)

which indicates that if it is possible to obtain a recursion for the vectors Rj²(A)r0 and Pj²(A)r0, then computing αj and βj causes no problem. Expression (2.30) suggests that if Rj(A) reduces r0 to a smaller vector rj, then it might be advantageous to apply this 'contraction' operator twice and compute Rj²(A)r0. The iteration coefficients can still be recovered from these vectors, and it turns out to be easy to find the corresponding approximations for x. This approach is called the Conjugate Gradient Squared method, see Algorithm 2.7. For a complete derivation we refer to [28].

Algorithm 2.7 Conjugate Gradient Squared
1: Compute r0 = b − Ax0. Choose r0* arbitrary,
2: Set p0 = u0 = r0,
3: for j = 0, 1, . . . , until convergence do
4: αj = (rj, r0*)/(Apj, r0*),
5: qj = uj − αj Apj,
6: xj+1 = xj + αj (uj + qj),
7: rj+1 = rj − αj A(uj + qj),
8: βj = (rj+1, r0*)/(rj, r0*),
9: uj+1 = rj+1 + βj qj,
10: pj+1 = uj+1 + βj (qj + βj pj),
11: end for

Observe that Algorithm 2.7 contains no matrix-vector products with A^T. Instead, two matrix-vector products with A are performed in each step. In general, one should expect the resulting algorithm to converge twice as fast as BiCG. Therefore, what has been done is to replace the matrix-vector products with A^T by more useful work.

A drawback of the CGS method is the following. Since the polynomials are squared, rounding errors tend to have a more damaging effect than in the standard BiCG algorithm.

Bi-CGSTAB

The CGS algorithm is based on squaring the residual polynomial and, in cases of irregular convergence, this may lead to a substantial build-up of rounding errors, or possibly even overflow. The Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) Algorithm is a variation of CGS that has a remedy for this difficulty. In the development of Bi-CGSTAB one uses a residual of the form

rj = Qj(A) Rj(A) r0,   (2.31)

instead of rj = Rj²(A) r0. In (2.31) the polynomial Qj(A) is a new polynomial which is defined recursively at each step, with the goal of 'stabilizing' or 'smoothing' the convergence behavior of the original algorithm. In particular, Qj is defined recursively by

Qj+1(t) = (1 − ωj t) Qj(t).   (2.32)

The above recursion is equivalent with

Qj+1(t) = (1 − ω0 t)(1 − ω1 t) · · · (1 − ωj t),   (2.33)

where the scalars ωi, i = 0, 1, . . . , j, have to be determined. For the determination of these scalars and the complete derivation of the Bi-CGSTAB algorithm we refer to [28], pp. 216-219. The Bi-CGSTAB algorithm is given as Algorithm 2.8; a sketch in code follows below.

Algorithm 2.8 Bi-CGSTAB
1: Compute r0 = b − Ax0. Choose r0* arbitrary,
2: Set p0 = r0,
3: for j = 0, 1, . . . , until convergence do
4: αj = (rj, r0*)/(Apj, r0*),
5: sj = rj − αj Apj,
6: ωj = (Asj, sj)/(Asj, Asj),
7: xj+1 = xj + αj pj + ωj sj,
8: rj+1 = sj − ωj Asj,
9: βj = (rj+1, r0*)/(rj, r0*) · αj/ωj,
10: pj+1 = rj+1 + βj (pj − ωj Apj),
11: end for
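As an illustration of Algorithm 2.8, a minimal NumPy sketch of the unpreconditioned Bi-CGSTAB iteration follows. The shadow residual r0* is simply taken equal to r0, and the function name, test matrix and tolerance are assumptions made for this example, not data from the report.

import numpy as np

def bicgstab(A, b, x0, tol=1e-8, maxit=1000):
    # Unpreconditioned Bi-CGSTAB following the structure of Algorithm 2.8.
    x = x0.copy()
    r = b - A @ x
    r0s = r.copy()                      # shadow residual r0*, here simply r0
    p = r.copy()
    bnorm = np.linalg.norm(b)
    for _ in range(maxit):
        Ap = A @ p
        alpha = (r @ r0s) / (Ap @ r0s)  # line 4
        s = r - alpha * Ap              # line 5
        As = A @ s
        omega = (As @ s) / (As @ As)    # line 6
        x = x + alpha * p + omega * s   # line 7
        r_new = s - omega * As          # line 8
        if np.linalg.norm(r_new) <= tol * bnorm:
            return x
        beta = (r_new @ r0s) / (r @ r0s) * alpha / omega   # line 9
        p = r_new + beta * (p - omega * Ap)                # line 10
        r = r_new
    return x

# nonsymmetric tridiagonal test matrix (an illustrative assumption)
n = 200
A = 3.0 * np.eye(n) - np.eye(n, k=1) - 2.0 * np.eye(n, k=-1)
b = np.ones(n)
print(np.linalg.norm(b - A @ bicgstab(A, b, np.zeros(n))))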
2.2.2 GMRES Methods

The Generalized Minimum Residual Method (GMRES) is a projection method onto the Krylov subspace Km(A, v1), with v1 = r0/‖r0‖2. This technique minimizes the residual norm over all vectors in x0 + Km(A, v1). In order to do this, GMRES generates a sequence of orthonormal vectors with long recurrences, due to the non-symmetry of A. GMRES uses a variant of Arnoldi's Algorithm (Algorithm 2.1) to find this orthonormal basis. We first give the idea behind the GMRES algorithm.

GMRES Derivation

Any vector x in x0 + Km(A, v1) can be written as

x = x0 + Vm y,   (2.34)

where y is an m-vector. Define

J(y) = ‖b − Ax‖2 = ‖b − A(x0 + Vm y)‖2,   (2.35)

and remark that

b − Ax = b − A(x0 + Vm y)   (2.36)
       = r0 − A Vm y   (2.37)
       = β v1 − Vm+1 H̄m y   (2.38)
       = Vm+1 (β e1 − H̄m y),   (2.39)

where we used Proposition 2.2, β = ‖r0‖2 and e1 = (1, 0, . . . , 0)^T. Since the column vectors of Vm+1 are orthonormal, we obtain

J(y) = ‖b − A(x0 + Vm y)‖2 = ‖β e1 − H̄m y‖2.   (2.40)

The GMRES approximation is the vector of x0 + Km(A, v1) which minimizes J(y). Using (2.40) this approximation can easily be obtained as

xm = x0 + Vm ym,   where   ym = argmin_y ‖β e1 − H̄m y‖2.   (2.41)

Observe that ym is the solution of an (m + 1) × m least-squares problem, which is inexpensive to solve (for m typically small).

2.2.3 The GMRES Algorithm

Summarizing the above remarks and results gives the following algorithm; a sketch in code is given at the end of this subsection.

Algorithm 2.9 GMRES
1: Compute r0 = b − Ax0, β = ‖r0‖2 and v1 = r0/β,
2: Define the (m + 1) × m matrix H̄m = {hij}, 1 ≤ i ≤ m + 1, 1 ≤ j ≤ m. Set H̄m = 0,
3: for j = 1, 2, . . . , m do
4: Compute wj = Avj,
5: for i = 1, . . . , j do
6: hij = (wj, vi),
7: wj = wj − hij vi,
8: end for
9: hj+1,j = ‖wj‖2,
10: if hj+1,j = 0 then
11: Set m = j and go to 15,
12: end if
13: vj+1 = wj/hj+1,j,
14: end for
15: Compute ym = argmin_y ‖β e1 − H̄m y‖2 and xm = x0 + Vm ym.

We remark that in Algorithm 2.9 lines 3 to 14 represent Arnoldi's Algorithm for finding an orthonormal basis of Km(A, v1). If Algorithm 2.9 is examined carefully, we observe that GMRES breaks down when hj+1,j = 0 at a given step j. In this situation the next Arnoldi vector cannot be generated. However, in this situation the residual vector is equal to zero and GMRES delivers the exact solution at this step. For a proof of this we refer to [28, Proposition 6.10].

The last remark concerning Algorithm 2.9 is the following. When the iteration number m becomes large, the GMRES algorithm becomes impractical through lack of memory and increasing computational requirements. This follows directly from the fact that during the Arnoldi steps (lines 3-14 of Algorithm 2.9) the number of vectors to be stored increases. To remedy this problem one can restart or truncate the algorithm. More information on restarting and truncating can be found in [28].
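The following is a minimal NumPy sketch of Algorithm 2.9 (full GMRES without restarting). The small (m + 1) × m least-squares problem for ym is solved with a dense least-squares routine; the function name, the basis size m and the test matrix are illustrative assumptions, not part of the report.

import numpy as np

def gmres(A, b, x0, m=50):
    # Full GMRES without restarting, following Algorithm 2.9.
    n = b.size
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, m + 1))            # Arnoldi basis vectors
    H = np.zeros((m + 1, m))            # Hessenberg matrix H_m-bar
    V[:, 0] = r0 / beta
    for j in range(m):
        w = A @ V[:, j]                 # line 4: new Krylov direction
        for i in range(j + 1):          # lines 5-8: modified Gram-Schmidt
            H[i, j] = w @ V[:, i]
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w) # line 9
        if H[j + 1, j] == 0.0:          # lines 10-12: 'happy breakdown'
            m = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]   # line 13
    # line 15: (m+1) x m least-squares problem for y_m
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:m + 1, :m], e1, rcond=None)
    return x0 + V[:, :m] @ y

# illustrative nonsymmetric test problem
n = 80
A = 4.0 * np.eye(n) + np.eye(n, k=1) - 2.0 * np.eye(n, k=-1)
b = np.ones(n)
print(np.linalg.norm(b - A @ gmres(A, b, np.zeros(n), m=60)))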
2.3 Stopping Criterion

In most of the iterative methods given in this chapter the following statement is included: 'for j = 1, 2, . . . , until convergence do'. To make the statement 'until convergence' precise, various stopping criteria can be chosen. It depends on the practical situation which of them should be chosen for an accurate solution. We give two standard stopping criteria; other criteria can be found in, for example, [22, 29].

• Criterion 1: ‖rj‖2 ≤ ε. The main disadvantage of this criterion is that it is not scaling invariant. By this we mean that if ‖rj‖2 ≤ ε, then this does not hold for ‖100 rj‖2, although the accuracy of xj remains the same.

• Criterion 2: ‖rj‖2/‖b‖2 ≤ ε. In this criterion the norm of the residual is small in comparison with the norm of the right-hand side. Replacing ε by ε/K2(A) guarantees that the relative error in x is less than ε, i.e.,

‖x − xj‖2 / ‖x‖2 ≤ K2(A) ‖rj‖2 / ‖b‖2 ≤ ε.   (2.42)

Chapter 3 Preconditioning Techniques

Krylov subspace methods for solving (sparse) linear systems are well founded theoretically, but appear to suffer from slow convergence in typical applications such as CFD. Efficiency and robustness of iterative methods can be improved by using preconditioning techniques. In one sentence, preconditioning can be described as: transforming the original linear system into one which has the same solution, but which is likely to be easier to solve with an iterative solver. In practical applications, the reliability of iterative solution methods depends more on the quality of the preconditioner than on the particular Krylov subspace accelerator used. In this chapter we first describe preconditioning without being specific about the particular preconditioners used. In Section 3.2 we will give some widely used preconditioners, such as ILU.

3.1 Preconditioned Iterations

In this section we discuss the preconditioned versions of the Krylov subspace methods of the previous chapter, using a generic preconditioner. We will mainly follow Reference [28]. We start with the Preconditioned Conjugate Gradient method.

3.1.1 Preconditioned Conjugate Gradient

Consider a linear system Ax = b, whereby A is symmetric positive definite. Furthermore, we assume that a preconditioner M is available, which approximates A in some yet-undefined sense. This preconditioner M is also assumed to be symmetric positive definite. From a practical point of view, the only requirement on M is that linear systems Mx = b̃ are inexpensive to solve. This is a necessary requirement, because preconditioned algorithms require the solution of such a system at each step. Then, we solve the left preconditioned system

M^{-1} A x = M^{-1} b.   (3.1)

Note that in general this system is not symmetric. To preserve symmetry we introduce the M-inner product, defined as

(x, y)_M = (Mx, y) = (x, My).   (3.2)

By observing that

(M^{-1} A x, y)_M = (Ax, y) = (x, Ay) = (x, M(M^{-1} A) y) = (x, M^{-1} A y)_M,   (3.3)

we see that M^{-1} A is self-adjoint with respect to the M-inner product. Therefore, we can replace the Euclidean inner product in the CG algorithm by the M-inner product and apply CG to (3.1). We now rewrite the CG algorithm for the M-inner product. To this end we distinguish the original residual rj = b − Axj and the preconditioned residual zj = M^{-1} rj. For the preconditioned system we then obtain the following sequence of parameters (compare with Algorithm 2.4):
1. αj = (zj, zj)_M / (M^{-1} A pj, pj)_M,
2. xj+1 = xj + αj pj,
3. rj+1 = rj − αj A pj and zj+1 = M^{-1} rj+1,
4. βj = (zj+1, zj+1)_M / (zj, zj)_M,
5. pj+1 = zj+1 + βj pj.

The M-inner products (zj, zj)_M and (M^{-1} A pj, pj)_M do not have to be computed explicitly, i.e.,

(zj, zj)_M = (rj, zj)   and   (M^{-1} A pj, pj)_M = (A pj, pj).   (3.4)

Using the latter observation, we obtain the Preconditioned Conjugate Gradient Algorithm, given as Algorithm 3.1; a code sketch with a diagonal preconditioner follows below.

Algorithm 3.1 Preconditioned Conjugate Gradient
1: Compute r0 = b − Ax0, z0 = M^{-1} r0 and p0 = z0,
2: for j = 0, 1, . . . , until convergence do
3: αj = (rj, zj)/(A pj, pj),
4: xj+1 = xj + αj pj,
5: rj+1 = rj − αj A pj,
6: zj+1 = M^{-1} rj+1,
7: βj = (rj+1, zj+1)/(rj, zj),
8: pj+1 = zj+1 + βj pj,
9: end for
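A minimal NumPy sketch of Algorithm 3.1 follows, with the preconditioner applied through a user-supplied routine that solves Mz = r. As preconditioner the sketch uses the simplest choice, M = diag(A) (Jacobi), which anticipates the diagonal scaling of Section 3.2.1. The function names, the test matrix and the tolerance are assumptions made for this example.

import numpy as np

def preconditioned_cg(A, b, x0, Minv_solve, tol=1e-8, maxit=1000):
    # Preconditioned CG (Algorithm 3.1); Minv_solve(r) applies M^{-1} to a vector.
    x = x0.copy()
    r = b - A @ x
    z = Minv_solve(r)                   # preconditioned residual z0 = M^{-1} r0
    p = z.copy()
    rz = r @ z
    bnorm = np.linalg.norm(b)
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (Ap @ p)           # line 3, using (3.4)
        x += alpha * p                  # line 4
        r -= alpha * Ap                 # line 5
        if np.linalg.norm(r) <= tol * bnorm:
            break
        z = Minv_solve(r)               # line 6
        rz_new = r @ z
        beta = rz_new / rz              # line 7
        p = z + beta * p                # line 8
        rz = rz_new
    return x

# SPD test matrix with a strongly varying diagonal (illustrative assumption)
n = 100
A = np.diag(np.linspace(1.0, 100.0, n)) - 0.3 * np.eye(n, k=1) - 0.3 * np.eye(n, k=-1)
d = np.diag(A)                          # Jacobi preconditioner M = diag(A)
b = np.ones(n)
x = preconditioned_cg(A, b, np.zeros(n), lambda r: r / d)
print(np.linalg.norm(b - A @ x))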
Left, Right and Split Preconditioning

To transform the linear system Ax = b with a preconditioner M, such that the transformed system is 'easier' to solve, we have three options:
1. Left preconditioning: solve M^{-1} A x = M^{-1} b,
2. Right preconditioning: solve A M^{-1} u = b with u = M x,
3. Split preconditioning: if M is SPD, then there exists a matrix L such that M = L L^T. Solve L^{-1} A L^{-T} u = L^{-1} b with x = L^{-T} u.

Left preconditioning has already been given in (3.1) and was used to derive the Preconditioned CG Algorithm. Right preconditioning has the advantage that the original residual ri = b − Axi is obtained implicitly, since b − A M^{-1} ui = b − Axi. An advantage of split preconditioning is that the transformed system remains symmetric. Finally, we remark that the three options given above can be used to construct preconditioned iterative solution methods. In this report we give preconditioned versions of
• (left) preconditioned CG,
• left preconditioned GMRES,
• right preconditioned GMRES, and
• (left) preconditioned Bi-CGSTAB.
For more information on preconditioned iterative methods we refer to [28, 29].

3.1.2 Preconditioned GMRES

As for preconditioning the CG method, we again have three options for applying the preconditioning operation in GMRES, i.e., left, right and split preconditioning. In this section we treat the left and right preconditioning of GMRES; as we will see, there is a fundamental difference between the two.

Left Preconditioned GMRES

Left preconditioned GMRES solves the left preconditioned linear system

M^{-1} A x = M^{-1} b,   (3.5)

with A and M general matrices. The straightforward application of GMRES to (3.5) gives Algorithm 3.2, i.e., the Left Preconditioned GMRES Algorithm.

Algorithm 3.2 Left Preconditioned GMRES
1: Compute r0 = M^{-1}(b − Ax0), β = ‖r0‖2 and v1 = r0/β,
2: for j = 1, 2, . . . , m do
3: Compute w = M^{-1} A vj,
4: for i = 1, . . . , j do
5: hij = (w, vi),
6: w = w − hij vi,
7: end for
8: hj+1,j = ‖w‖2 and vj+1 = w/hj+1,j,
9: end for
10: Define Vm = [v1, . . . , vm] and H̄m = (hi,j), 1 ≤ i ≤ m + 1, 1 ≤ j ≤ m,
11: Compute ym = argmin_y ‖β e1 − H̄m y‖2 and xm = x0 + Vm ym,
12: If satisfied stop, else set x0 = xm and go to 1.

In Algorithm 3.2 an orthonormal basis is constructed for the left preconditioned Krylov subspace

span{r0, M^{-1} A r0, . . . , (M^{-1} A)^{m−1} r0}.   (3.6)

Furthermore, we remark that in Algorithm 3.2 all residuals and their norms correspond to the preconditioned residual rm = M^{-1}(b − Axm). The 'original' residuals rm,original = b − Axm are not computed. In the case that the stopping criterion should be based on the original residual, this can cause some difficulties; a possible solution is to compute the original residuals explicitly.

Right Preconditioned GMRES

The Right Preconditioned GMRES Algorithm is based on solving

A M^{-1} u = b   with   u = M x.   (3.7)

In this situation the new variable u never needs to be invoked explicitly. By a simple analysis of (3.7), as done in [28], one obtains that one preconditioning operation is needed at the end of the loop, instead of at the beginning as in the left preconditioned version. We then obtain Algorithm 3.3.

Algorithm 3.3 Right Preconditioned GMRES
1: Compute r0 = b − Ax0, β = ‖r0‖2 and v1 = r0/β,
2: for j = 1, 2, . . . , m do
3: Compute w = A M^{-1} vj,
4: for i = 1, . . . , j do
5: hi,j = (w, vi),
6: w = w − hi,j vi,
7: end for
8: hj+1,j = ‖w‖2 and vj+1 = w/hj+1,j,
9: end for
10: Define Vm = [v1, . . . , vm] and H̄m = (hi,j), 1 ≤ i ≤ m + 1, 1 ≤ j ≤ m,
11: Compute ym = argmin_y ‖β e1 − H̄m y‖2 and xm = x0 + M^{-1} Vm ym,
12: If satisfied stop, else set x0 = xm and go to 1.
Algorithm 3.3 builds an orthonormal basis of the right preconditioned Krylov subspace

span{r0, A M^{-1} r0, . . . , (A M^{-1})^{m−1} r0}.   (3.8)

Note that the residual norm is now relative to the initial system, since the algorithm obtains the residual b − Axm = b − A M^{-1} um implicitly. This is an essential difference with the left preconditioned GMRES algorithm. By comparing left and right preconditioned GMRES the following proposition, taken from [28], can be proven.

Proposition 3.1. The approximate solution obtained by left or right preconditioned GMRES is of the form

xm = x0 + s_{m−1}(M^{-1} A) z0 = x0 + M^{-1} s_{m−1}(A M^{-1}) r0,   (3.9)

where z0 = M^{-1} r0 and s_{m−1} is a polynomial of degree m − 1. The polynomial s_{m−1} minimizes the residual norm ‖b − Axm‖2 in the right preconditioning case, and the preconditioned residual norm ‖M^{-1}(b − Axm)‖2 in the left preconditioning case.

In most practical situations the difference in convergence behavior between the two approaches is not significant. The only exception is when M is ill-conditioned, which could lead to substantial differences.

3.1.3 Preconditioned Bi-CGSTAB

We restrict ourselves to giving the preconditioned Bi-CGSTAB algorithm. It is given as Algorithm 3.4; every application of the preconditioner amounts to solving a system with M.

Algorithm 3.4 Preconditioned Bi-CGSTAB
1: Compute r0 = b − Ax0 and choose r0* arbitrary,
2: Set p0 = r0,
3: for j = 0, 1, 2, . . . , until convergence do
4: p̃j = M^{-1} pj,
5: αj = (rj, r0*)/(A p̃j, r0*),
6: sj = rj − αj A p̃j,
7: s̃j = M^{-1} sj,
8: vj = A s̃j,
9: ωj = (vj, sj)/(vj, vj),
10: xj+1 = xj + αj p̃j + ωj s̃j,
11: rj+1 = sj − ωj vj,
12: βj = (rj+1, r0*)/(rj, r0*) · αj/ωj,
13: pj+1 = rj+1 + βj (pj − ωj A p̃j),
14: end for

3.2 Preconditioning Techniques

In this section we give some well known preconditioning techniques; among others we treat the diagonal preconditioner, incomplete LU factorizations and the incomplete Choleski factorization. Finding a good preconditioner for a sparse linear system is often viewed as a combination of art and science. Roughly speaking, a preconditioner is an explicit or implicit modification of an original linear system which makes it "easier" to solve by a given iterative method. An example of an explicit form of preconditioning is to scale all rows of the system to make the diagonal elements equal to one. The resulting system can be solved by a Krylov subspace method and may require fewer iterations to converge than the original system; we remark that this is not guaranteed. As another example, solving the linear system

M^{-1} A x = M^{-1} b,   (3.10)

where M^{-1} is some complicated mapping that may involve FFTs, integral calculations, etc., is another form of preconditioning. Here it is unlikely that the matrices M^{-1} and M^{-1} A can be computed explicitly. Instead, the iterative process operates with A and M^{-1} whenever needed. In practice, the (explicit or implicit) preconditioning operation M^{-1} should be inexpensive to apply to an arbitrary vector. In this report we only treat matrix based preconditioners. Analytic preconditioners, like AILU, SoV, etc., are not suited for the linear systems arising from our application.

3.2.1 Diagonal Scaling

We transform the original system Ax = b, with A SPD, to the transformed system

Ã x̃ = b̃,   (3.11)

where Ã = P^{-1} A P^{-T}, x = P^{-T} x̃ and b̃ = P^{-1} b. The matrix M = P P^T is called the preconditioner. A simple choice for P is a diagonal matrix with the elements

P_{i,i} = √(A_{i,i}),   for i = 1, 2, . . . , n.   (3.12)

Then it is easy to derive that

Ã_{i,i} = P_{i,i}^{-1} A_{i,i} P_{i,i}^{-T} = 1,   for i = 1, 2, . . . , n.   (3.13)

In [30] it has been shown that for this choice of P the condition number of Ã is minimized. For this preconditioner it is advantageous to apply CG to Ã x̃ = b̃, since Ã is easy to calculate; the small sketch below illustrates the scaling.
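A few lines of NumPy illustrate the scaling (3.11)-(3.13): the diagonal of the scaled matrix is indeed equal to one. The small test matrix is an assumption made for this example.

import numpy as np

# Symmetric diagonal scaling of an SPD matrix A (Section 3.2.1).
# P holds the square roots of the diagonal of A; the scaled matrix
# Atilde = P^{-1} A P^{-T} then has a unit diagonal.
n = 6
A = np.diag(np.arange(1.0, n + 1.0)) + 0.1 * np.ones((n, n))   # illustrative SPD matrix
p = np.sqrt(np.diag(A))                 # P_{i,i} = sqrt(A_{i,i}), see (3.12)
Atilde = A / np.outer(p, p)             # elementwise: Atilde_{i,j} = A_{i,j}/(p_i p_j)
print(np.diag(Atilde))                  # all ones, consistent with (3.13)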
3.2.2 Incomplete LU Factorization

One of the simplest ways to obtain a preconditioner is to perform an incomplete factorization of the original matrix A. This gives a factorization of the form A = LU − R, where L and U have the same non-zero structure as the lower and upper parts of the original matrix A, respectively, and R is the residual of the factorization. In a practical implementation the ILU factorization depends on the implementation of the Gaussian elimination. The general ILU factorization is given in Algorithm 3.5. Here P is a zero pattern set such that

P ⊂ {(i, j) : i ≠ j; 1 ≤ i, j ≤ n},   (3.14)

and the aij are the elements of A.

Algorithm 3.5 General ILU Factorization (IKJ version)
1: for i = 2, . . . , n do
2: for k = 1, . . . , i − 1 and if (i, k) ∉ P do
3: aik = aik/akk,
4: for j = k + 1, . . . , n and for (i, j) ∉ P do
5: aij = aij − aik akj,
6: end for
7: end for
8: end for

If we choose for P the empty set, i.e., P = ∅, then Algorithm 3.5 gives the complete LU decomposition of A.

Zero Fill-In ILU (ILU(0))

Denote by NZ(A) the set of pairs (i, j), 1 ≤ i, j ≤ n, such that aij ≠ 0, and by Z(A) the set of pairs (i, j), 1 ≤ i, j ≤ n, such that aij = 0. The ILU technique with no fill-in, denoted by ILU(0), consists of taking the zero pattern P to be precisely the zero pattern of A. This defines the ILU(0) factorization in general terms: choose any pair L and U such that the elements of A − LU are zero in the locations NZ(A). These constraints do not define the ILU(0) factors in a unique way, since there are, in general, infinitely many pairs L and U which satisfy these requirements. However, the standard ILU(0) technique is defined constructively using Algorithm 3.5 with P = Z(A) and adding the requirement Lii = 1. ILU(0) thus reduces Algorithm 3.5 to Algorithm 3.6. A small sketch in code is given at the end of this subsection.

Algorithm 3.6 ILU(0)
1: for i = 2, . . . , n do
2: for k = 1, . . . , i − 1 and if (i, k) ∉ Z(A) do
3: aik = aik/akk,
4: for j = k + 1, k + 2, . . . , n and for (i, j) ∉ Z(A) do
5: aij = aij − aik akj,
6: end for
7: end for
8: end for

The accuracy of the ILU(0) factorization may be insufficient to yield an adequate rate of convergence, see for example [28, Section 10.3.2].
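The sketch below is a dense-matrix NumPy version of Algorithm 3.6, written for clarity rather than efficiency; a practical implementation would of course work on a sparse storage format. The test matrix is an illustrative assumption. The printout shows that A − LU vanishes on the non-zero pattern NZ(A), as required.

import numpy as np

def ilu0(A):
    # ILU(0) in the IKJ form of Algorithm 3.6, on a dense array for clarity.
    # Positions where A is zero are never filled in, so the combined factor
    # keeps the sparsity pattern of A. Returns one array holding the strict
    # lower part of L (unit diagonal implied) and the upper part of U.
    LU = A.astype(float).copy()
    Z = (A == 0)                        # zero pattern Z(A): positions that stay zero
    n = A.shape[0]
    for i in range(1, n):
        for k in range(i):
            if Z[i, k]:
                continue
            LU[i, k] /= LU[k, k]        # multiplier a_ik / a_kk (line 3)
            for j in range(k + 1, n):
                if not Z[i, j]:
                    LU[i, j] -= LU[i, k] * LU[k, j]   # restricted update (line 5)
    return LU

# small illustrative test matrix with an explicit zero pattern
A = np.array([[ 4., -1.,  0., -1.],
              [-1.,  4., -1.,  0.],
              [ 0., -1.,  4., -1.],
              [-1.,  0., -1.,  4.]])
LU = ilu0(A)
L = np.tril(LU, -1) + np.eye(4)
U = np.triu(LU)
print(np.round(A - L @ U, 3))           # nonzero only at positions where A itself is zero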
ILU(p)

In order to improve the convergence rate as well as the efficiency, more accurate ILU factorizations can be computed by allowing some fill-in. This is known as the ILU(p) method, where p is the level of fill-in. In this section we explain ILU(p) in particular for p = 1. The ILU(1) factorization results from taking P to be the zero pattern of the product L0 U0, where L0 and U0 are obtained via the ILU(0) factorization. By doing this, we allow additional off-diagonal positions which are zero in the original matrix A. The factors L1 and U1 of ILU(1) are then obtained by applying the pattern-restricted elimination of Algorithm 3.5 to this larger pattern, i.e., the pattern of L0 U0. By induction, we can define ILU(p) for p = 2, 3, . . . Remark that for increasing values of p we get a larger amount of fill-in in both L and U, which can become inefficient. Therefore, one should look for the optimal mix between the amount of work and the convergence rate.

Other variants of ILU

There are several variants of ILU(p) which differ from the variant described above. For instance, the Modified ILU (MILU) factorization attempts to reduce the effect of dropping by compensating for the discarded entries. A popular strategy is to add up all the elements that have been dropped at the completion of the k-loop in Algorithm 3.5; this sum is then subtracted from the diagonal entry in U. This compensation strategy is the MILU method. Another variant of ILU with a more accurate factorization is ILUT, or ILU(tol). Roughly speaking, the ILUT method drops elements which are smaller than a specified threshold τ; for example, in [28], τ = 10^{-4}. More details about MILU and ILUT can be found in [28, Chapter 10].

3.3 Incomplete Choleski Factorization

In the case that the matrix A is SPD we are able to construct an Incomplete Choleski factorization, i.e., A = L L^T − R. The matrix L has the same non-zero structure as the lower part of A; this is the IC(0) preconditioner. The IC(p) factorization for p ≥ 1 can be constructed in an analogous way as ILU(p) was constructed in the previous section. For more information on the Incomplete Choleski factorization we refer to [29].

3.4 Multigrid

A last remark on preconditioners is that multigrid can also be used as a preconditioning technique. This can be done by choosing M^{-1} = (I − Q) A^{-1}, where Q is the multigrid iteration operator; see for example [22]. With a varying preconditioner, such as multigrid with a different cycle in each iteration, a Krylov subspace method is needed in which the preconditioner can change from iteration to iteration. The Flexible GMRES (FGMRES) method, or the GMRESR method, allows such a varying preconditioner. More information on multigrid and on multigrid as a preconditioner can be found in e.g. [22].

Part V Concluding Remarks

Chapter 1 Summary and Conclusions

The process of CVD is mathematically described by:
1. the continuity equation,
2. the incompressible Navier-Stokes equations,
3. the transport equation for thermal energy,
4. the transport equations for the gas species, i.e., the advection-diffusion-reaction equations,
5. the ideal gas law.

The main purpose of this research project is to develop robust and efficient solvers for the heat reaction system, i.e., the coupled system of the transport equation for thermal energy and the advection-diffusion-reaction equations. The first approximation is to solve the test problem, which consists of a small number (say 10) of coupled advection-diffusion-reaction equations. Using the Method of Lines approach, we first discretize in space using the finite volume approach, resulting in an ODE system

y' = f(t, y),   y ∈ R^N   and   f(t, y) : R^{N+1} → R^N.   (1.1)

In (1.1) the integer N is the number of grid cells in the finite volume discretization. The ODE system (1.1) is a so-called stiff system. The next step is to find a suitable time integration method. In general this method will contain implicit parts, or may even be fully implicit, which results in nonlinear systems. To solve these nonlinear systems, solutions of linear systems are also needed. These remarks are summarized in Figure 1.1.

Figure 1.1: Schematic representation of the steps needed to solve the mathematical model of CVD: mathematical model → spatial discretization → time integration method → nonlinear solver → linear solver.

1.1 Time Integration Methods

In the literature one can find many time integration methods that have been developed to integrate stiff systems.
Nevertheless, the most successful ones in practical applications are usually also the most famous ones, such as
• Rosenbrock methods,
• BDF and other implicit linear multi-step methods,
• implicit Runge-Kutta methods, like the RADAU methods.

We remark that these methods are fully implicit, except the Rosenbrock methods, which are derived from DIRK methods. The idea behind Rosenbrock methods is not to solve the nonlinear equations resulting from the implicit scheme for the stage approximations, but to compute the intermediate approximations from the linearized nonlinear system; see Part II, Section 2.2.

More recent research in the area of solving advection-diffusion-reaction equations focuses on 'splitting up' the ODE system (1.1) into slow and fast varying components. Splitting up an ODE system can be done in two ways: splitting it into two subsystems that are solved separately, or splitting it into a part for explicit and a part for implicit treatment. Examples of the first approach are Operator Splitting and Multi-rate (Runge-Kutta) methods. IMEX extensions of Runge-Kutta (Chebyshev) methods are examples of the second class.

1.2 Nonlinear and Linear Solvers

The iterative methods studied for solving linear and nonlinear systems are all standard techniques, see Parts III and IV. The most important question is how to find, for a given time integration method, the most suitable (with respect to efficiency and robustness) nonlinear solver and, where needed, linear solver.

Chapter 2 Future Research

We formulate the research subjects for future work:
1. 2D test problem,
2. selection of time integration methods,
3. investigate how coupling and stiffness influence stability,
4. relation between the time integration method and the (non)linear solvers,
5. 3D test problem: the aim is to solve the CVD system with 50 species on a 50 × 50 × 50 grid.

2.1 Test-problem and Time Integration Methods

The first two points of the above enumeration are connected to each other. For the given test problem we will make a selection of time integration methods and compare their efficiency and robustness for the CVD application. Of course, the selection will consist of both well known methods, like Rosenbrock methods, and recently developed methods, like IMEX RKC.

2.2 Stability

By the third point in the above enumeration we mean that for some methods, for instance Multi-rate Runge-Kutta methods, both the coupling between slow and fast components and the enormous variations in reaction rates (the latter produce stiffness) influence the stability. As shown in [20], a simple multi-rate extension of Backward Euler can have a very small stability region for one of the subsystems. For other multi-rate extensions we could not find any references on stability; to our knowledge a general theory on stability for multi-rate methods is not known. In case we apply such time integration methods, we also have to investigate the influence of the coupling between subsystems and of stiffness on stability.

2.3 (Non)-Linear Solvers

In order to solve the stiff ODE system resulting from the spatial discretization, nonlinear and linear systems have to be solved. For this typical CVD application we have to search for the optimal (non)linear solvers. By optimal we mean efficient and robust.

2.4 Extension to Three Dimensions

As formulated in the Preface, the aim of this research project is to solve the coupled heat / reaction system on a three dimensional spatial grid.
To have a realistic model for CVD and/or combustion one has to model approximately 50 different chemical species. This results in a nonlinear coupled PDE system of 49 advection-diffusion-reaction equations and one heat equation. The last step is then to integrate the developed solver(s) into a CFD package.

Part VI Appendices

Appendix A Collocation Methods

In this part we pay a little attention to collocation methods. Collocation is an old and universal concept in numerical analysis. Applied to an ordinary differential equation

w'(t) = f(t, w(t)),   (A.1)

the idea is as follows. Search for a polynomial u(t) of degree s, with s a positive integer, whose derivative coincides at s given points with the vector field of the differential equation (A.1).

Definition A.1. For s a positive integer and c1, . . . , cs real constants (typically between 0 and 1), the corresponding collocation polynomial u(x) of degree s is defined by

u(xn) = wn   (initial value),   (A.2)
u'(xn + ci h) = f(xn + ci h, u(xn + ci h)),   i = 1, . . . , s.   (A.3)

The numerical solution is then given by

wn+1 = u(xn + h).   (A.5)

If some ci's coincide, the collocation condition (A.3) will contain higher derivatives and lead to multi-derivative methods, see [10]. We suppose all ci's to be distinct.

Theorem A.2. The collocation method as defined above is equivalent to the s-stage Implicit Runge-Kutta method with

αij = ∫_0^{ci} Lj(t) dt,   bj = ∫_0^1 Lj(t) dt   and   Lj(t) = ∏_{k≠j} (t − ck)/(cj − ck).   (A.6)

Proof. Put ki = u'(xn + ci h), so that by Lagrange interpolation

u'(xn + th) = Σ_{j=1}^{s} kj Lj(t).   (A.7)

Integration then gives u(xn + ci h) = wn + h ∫_0^{ci} u'(xn + th) dt. Inserting this into (A.3), together with (A.6), we obtain an implicit Runge-Kutta scheme.

Appendix B Padé Approximations

Padé approximations are rational functions which, for a given degree of the numerator and the denominator, have the highest order of approximation. Their origin lies in the theory of continued fractions and they played a fundamental role in Hermite's proof of the transcendency of e. These optimal approximations can be obtained for the exponential function e^z, as stated in the following theorem, which is taken from [11].

Theorem B.1. The (k, j)-Padé approximation to e^z is given by

Rkj(z) = Pkj(z)/Qkj(z),   (B.1)

where

Pkj(z) = 1 + k/(j+k) · z + k(k−1)/((j+k)(j+k−1)) · z²/2! + · · · + k(k−1)···1/((j+k)(j+k−1)···(j+1)) · z^k/k!,   (B.2)

and

Qkj(z) = 1 − j/(k+j) · z + j(j−1)/((k+j)(k+j−1)) · z²/2! − · · · + (−1)^j · j(j−1)···1/((k+j)(k+j−1)···(k+1)) · z^j/j! = Pjk(−z).   (B.3)

The error e^z − Rkj(z) is then given by

e^z − Rkj(z) = (−1)^j · (j! k!)/((j+k)! (j+k+1)!) · z^{j+k+1} + O(z^{j+k+2}).   (B.4)

It is the unique rational approximation to e^z of order j + k such that the degrees of numerator and denominator are k and j, respectively.

In Table B.1 the first Padé approximations to e^z are given.

            k = 0                  k = 1                                  k = 2
j = 0       1                      1 + z                                  1 + z + z²/2!
j = 1       1/(1 − z)              (1 + z/2)/(1 − z/2)                    (1 + (2/3)z + (1/3)·z²/2!)/(1 − (1/3)z)
j = 2       1/(1 − z + z²/2!)      (1 + (1/3)z)/(1 − (2/3)z + (1/3)·z²/2!)    (1 + (1/2)z + (1/6)·z²/2!)/(1 − (1/2)z + (1/6)·z²/2!)

Table B.1: (k, j)-Padé approximations to e^z for k, j = 0, 1, 2.

Bibliography

[1] M. Bakker, Analytical Aspects of a Minimax Problem (in Dutch), Technical Note TN62, Mathematical Center, Amsterdam
[2] R.B. Bird, W.E. Stewart and E.N. Lightfoot, Transport Phenomena, John Wiley & Sons, (2002)
[3] C.G. Broyden, A New Method of Solving Nonlinear Simultaneous Equations, Computer Journal, 12, pp. 94-99, (1969)
[4] C.F.
Curtiss and J.O. Hirschfelder, Integration of Stiff Equations, Proc. Nat. Acad. Sci. 38, pp. 235-243, (1952) [5] G. Dahlquist, Convergence and Stability in the Numerical Integration of Ordinary Differential Equations, Math. Scand. 4, pp. 33-53, (1956) [6] Ch. Engstler and Ch. Lubich, Multi-rate Extrapolation Methods for Differential Equations with different time scales, Computing 58, pp. 173 - 185, (1997) [7] F.C. Eversteijn, P.J.W. Severin, C.H.J. van den Brekel and H.L. Peek A Stagnant Layer Model for the Epitaxial Growth of Silicon from Silane in a Horizontal Reactor, J. Electrochem. Soc. 117, pp.925-931, (1970) [8] V. Faber and T. Manteuffel, Necessary and sufficient conditions for the existence of a conjugate gradient method, SIAM J. Num. Anal., 21, pp. 356-362, (1984) [9] A. Guillo and B. Lago, Domaine de Stabilité Associé aux Formules d’Integration Numérique d’Equations Différentielles, a pas Sépar]’es et pas Liés. Recherche de Formules a Grand Rayon de Stabilité., Ier. Congr. assoc. Fran. Calcul, AFCAL, Grenoble, pp.43-56, (1961) [10] E. Hairer, S.P. Nørsett and G. Wanner, Solving Ordinary Differential Equations I : Nonstiff Problems, Springer Series in Computational Mathematics, 8, Springer, Berlin, (1987) 170 BIBLIOGRAPHY [11] E. Hairer and G. Wanner, Solving Ordinary Differential Equations II : Stiff and Differential-Algebraic Problems, Second Edition, Springer Series in Computational Mathematics, 14, Springer, Berlin, (1996) [12] P.J. van der Houwen and B.P. Sommeijer, On the Internal Stability of Explicit, m-stage Runge-Kutta Methods for Large m Values, Z. Angew. Math. Mech., vol 60, pp. 479-485 [13] W. Hundsdorfer and J.G. Verwer, Numerical Solution of TimeDependent Advection-Diffusion-Reaction Equations, Springer Series in Computational Mathematics, 33, Springer, Berlin, (2003) [14] K.F. Jensen, Modeling of Chemical Vapor Deposition Reactors, in Modeling of Chemical Vapor Deposition Reactors for Semiconductor Fabrication, Course notes, University of California Extension, Berkeley, USA, (1988) [15] C.T. Kelley, Solving Nonlinear Equations with Newton’s Method, Fundamentals of Algorithms, SIAM, (2003) [16] C.R. Kleijn, Transport Phenomena in Chemical Vapor Deposition Reactors, PhD thesis, Delft University of Technology, (1991) [17] C.R. Kleijn, Computational Modeling of Transport Phenomena and Detailed Chemistry in Chemical Vapor Deposition- A Benchmark Solution, Thin Solid Films, 365, pp. 294-306, (2000) [18] A.E.T. Kuiper, C.H.K. van den Brekel, J. de Groot and F.W. Veltkamp, Modeling of Low Pressure CVD Processes, J. Electrochem. Soc. 129, pp. 2288-2291, (1982) [19] A. Kværnø and P. Rentrop, Low Order Multi-rate Runge-Kutta Methods in Electric Circuit Simulation, Preprint No. 99/1, IWRMM, Universitt Karlsruhe (TH), 1999 [20] A. Kværnø, Stability of Multi-rate Runge-Kutta Schemes, Int. J. Differ. Equ. Appl., pp.97-105, (2000) [21] M. Meyyapan, Computational Modeling in Semiconductor Processing, Artech House, Boston, (1995) [22] C.W. Oosterlee, H. Bijl, H.X. Lin, S.W. de Leeuw, J.B. Perot, C. Vuik, P. Wesseling, Lecture Notes for Course “Computational Science and Engineering”, Lecture Notes, Delft University of Technology, (2005) [23] J.M. Ortega and W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Reprint of the 1970 original, Classics in Applied Mathematics 30, SIAM, Philadelphia, (2000) BIBLIOGRAPHY 171 [24] S.V. Patankar, Numerical Heat Transfer and Fluid Flow, Hemisphere Publ. Corp., Washington, (1980) [25] A. Prothero and A. 
Robinson, On the Stability and Accuracy of OneStep Methods for Solving Stiff Systems of Ordinary Differential Equations, Math. of Comput., 28, pp.145-162, (1974) [26] H.H. Robertson, The Solution of a Set of Reaction Rate Equations, In : J. Walsh ed. : Numer. Anal., an Introduction, Academ. Press, pp. 178-182, (1966) [27] H.H. Rosenbrock, Some General Implicit Processes for the Numerical Solution of Differential Equations, Comp. J. 5, pp. 329-330, (1963) [28] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing, Boston, (1996) [29] A. Segal and C. Vuik, Computational Fluid Dynamics II, J.M. Burgerscentrum Course Notes, (2004) [30] A. van der Sluis, Conditioning, Equilibration and Pivoting in Linear Algebraic Systems, Numer. Math., 15, pp. 74-86, (1970) [31] P. Sonneveld, CGS : A Fast Lanczos Type Solver for Non-symmetric Linear Systems, SIAM J. Sci. Stat. Comp., 10, pp. 36-52, (1989) [32] B. Sportisse, An Analysis of Operator Splitting in the Stiff Case, J. Comput. Phys. 161, pp. 140-168, (2000) [33] G. Strang, On the Construction and Comparison of Difference Schemes, SIAM J. Numer. Anal. 5, pp.506-517, (1968) [34] M. Suzuki, Fractal Decomposition of Exponential Operators with Applications to Many-Body Theories and Monte Carlo Simulations, Phys. Lett. A 146, pp. 319-323, (1990) [35] J.G. Verwer and B. Sportisse, A Note on Operator Splitting in a Stiff Linear Case, Report MAS-R9830, CWI, Amsterdam, (1998) [36] J.G. Verwer and B.P. Sommeijer, An Implicit-Explicit Runge-KuttaChebyshev Scheme for Diffusion-Reaction Equations, SIAM Journal on Sci. Comp., 25, pp.1824-1835, (2004) [37] J.G. Verwer, B.P. Sommeijer and W. Hundsdorfer, RKC Time-Stepping for Advection-Diffusion-Reaction Problems, Journal of Comp. Physics, 201, pp. 61-79, (2004) [38] C. Vuik, Numerieke Methoden voor Differentiaalvergelijkingen, Lecture Notes, Delft University of Technology, (2000) 172 BIBLIOGRAPHY [39] H. Yoshida, Construction of Higher Order Symplectic Integrators, Phys. Lett. A 150, pp. 262-268, (1990)
